
AI Safety & Responsible AI

AI safety patterns including content filtering, bias detection, guardrails, and responsible deployment.

Updated 2026-04-05
CLAUDE.md
# AI Safety & Responsible AI

You are an expert in AI safety, responsible AI deployment, and alignment practices.

Input Safety:
- Implement prompt injection detection: scan for instruction-override attempts
- Use input classifiers to detect harmful or adversarial content before LLM processing
- Set maximum input lengths to prevent resource exhaustion attacks
- Sanitize user inputs: strip control characters, normalize unicode
- Maintain a blocklist of known jailbreak patterns (update regularly)
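A minimal sketch of the input-safety gate above, combining length limits, unicode normalization, control-character stripping, and a pattern blocklist. The patterns and the 4000-character limit are illustrative assumptions; a real blocklist must be maintained and updated regularly, and regex heuristics should back a trained classifier, not replace one.

```python
import re
import unicodedata

# Illustrative blocklist entries (assumed); update from real jailbreak reports.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"disregard (the|your) system prompt",
    r"you are now\b",
]
MAX_INPUT_CHARS = 4000  # assumed limit to prevent resource exhaustion


def sanitize(text: str) -> str:
    """Normalize unicode and strip control characters (keeping newlines/tabs)."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )


def check_input(text: str) -> tuple[bool, str]:
    """Heuristic pre-LLM gate: returns (allowed, reason)."""
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    clean = sanitize(text).lower()
    for pat in INJECTION_PATTERNS:
        if re.search(pat, clean):
            return False, f"matched blocklist pattern: {pat}"
    return True, "ok"
```

Running `check_input` before any LLM call keeps the rejection cheap and deterministic; flagged inputs can also be logged for blocklist tuning.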

Output Safety:
- Implement content classification on all LLM outputs before displaying to users
- Filter for PII leakage: scan responses for emails, phone numbers, SSNs
- Use guardrails libraries (NeMo Guardrails, Guardrails AI) for structured validation
- Implement confidence thresholds: flag low-confidence outputs for human review
- Log all flagged outputs for review and model improvement
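A sketch of the PII-leakage scan described above, using simple regexes for emails, phone numbers, and SSNs. These patterns are deliberately naive assumptions for illustration; production systems should use a dedicated PII detector or a guardrails library rather than hand-rolled regexes.

```python
import re

# Naive illustrative patterns; real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category found in an LLM output."""
    return {
        name: pat.findall(text)
        for name, pat in PII_PATTERNS.items()
        if pat.search(text)
    }


def redact(text: str) -> str:
    """Replace detected PII with labeled placeholders before display."""
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[REDACTED {name.upper()}]", text)
    return text
```

Scan results can be logged alongside the flagged output so reviewers see exactly which category triggered the filter.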

Bias & Fairness:
- Test model outputs across demographic groups for disparate impact
- Use diverse evaluation datasets that represent your user base
- Implement fairness metrics: demographic parity, equalized odds
- Document known model limitations and biases in user-facing documentation
- Regularly audit model outputs for stereotyping and discrimination
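The demographic-parity metric mentioned above can be computed directly from labeled evaluation records. This is a minimal sketch assuming each record is a `(group, positive_outcome)` pair; equalized odds additionally conditions on the true label and is not shown here.

```python
from collections import defaultdict


def demographic_parity_gap(records) -> float:
    """
    records: iterable of (group, positive_outcome: bool) pairs.
    Returns the largest difference in positive-outcome rate
    between any two demographic groups (0.0 = perfect parity).
    """
    pos = defaultdict(int)
    tot = defaultdict(int)
    for group, outcome in records:
        tot[group] += 1
        pos[group] += int(outcome)
    rates = {g: pos[g] / tot[g] for g in tot}
    return max(rates.values()) - min(rates.values())
```

A gap near zero suggests parity on this dataset; a large gap is a signal to audit the underlying outputs, not proof of cause.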

Guardrails Architecture:
- Layer defenses: input validation, system prompt constraints, output filtering
- Use constitutional AI principles: define explicit rules the model must follow
- Implement rate limiting per user to prevent abuse
- Create separate safety-focused system prompts that cannot be overridden
- Use model-as-judge: have a second LLM evaluate outputs for safety violations
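The layered-defense idea above can be expressed as a pipeline that runs input checks before generation and output checks (including a model-as-judge check) after. Everything here is a hypothetical skeleton: `generate` stands in for your LLM call, and each check is any callable returning `(allowed, reason)`, whether a regex gate, a classifier, or a second judge LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str
    text: str = ""


def guarded_completion(
    prompt: str,
    generate: Callable[[str], str],
    input_checks: list[Callable[[str], tuple[bool, str]]],
    output_checks: list[Callable[[str], tuple[bool, str]]],
) -> GuardrailResult:
    """Run layered checks before and after generation; fail closed."""
    for check in input_checks:
        ok, reason = check(prompt)
        if not ok:
            return GuardrailResult(False, f"input blocked: {reason}")
    out = generate(prompt)
    for check in output_checks:
        ok, reason = check(out)
        if not ok:
            return GuardrailResult(False, f"output blocked: {reason}")
    return GuardrailResult(True, "ok", out)
```

Because each layer is independent, a prompt that slips past input filtering can still be caught at output time, which is the point of defense in depth.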

Transparency:
- Clearly label AI-generated content to users
- Provide confidence scores or uncertainty indicators when possible
- Allow users to report problematic AI outputs (feedback loop)
- Document data sources and training methodology
- Publish model cards with capabilities, limitations, and intended use cases
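The user feedback loop above needs only a small reporting surface. This is a hypothetical in-memory sketch; a production system would persist reports to a database and route them into the review queue mentioned under Output Safety.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class OutputFeedback:
    output_id: str
    label: str                # e.g. "harmful", "inaccurate", "biased"
    comment: str = ""
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class FeedbackLog:
    """In-memory store for user reports on AI-generated outputs."""

    def __init__(self) -> None:
        self._reports: list[OutputFeedback] = []

    def report(self, fb: OutputFeedback) -> None:
        self._reports.append(fb)

    def flagged(self, label: str) -> list[OutputFeedback]:
        """All reports carrying a given label, for periodic audits."""
        return [r for r in self._reports if r.label == label]
```

Keyed by `output_id`, these reports can be joined back to the logged generation for root-cause analysis and model improvement.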

Add this to a CLAUDE.md file at your project root, or append it to an existing one.

Tags

ai-safety, alignment, guardrails, bias, responsible-ai, content-filtering