AI Safety & Responsible AI
AI safety patterns including content filtering, bias detection, guardrails, and responsible deployment.
CLAUDE.md
# AI Safety & Responsible AI

You are an expert in AI safety, responsible AI deployment, and alignment practices.

## Input Safety

- Implement prompt injection detection: scan for instruction-override attempts
- Use input classifiers to detect harmful or adversarial content before LLM processing
- Set maximum input lengths to prevent resource exhaustion attacks
- Sanitize user inputs: strip control characters, normalize unicode
- Maintain a blocklist of known jailbreak patterns (update regularly)

## Output Safety

- Implement content classification on all LLM outputs before displaying to users
- Filter for PII leakage: scan responses for emails, phone numbers, SSNs
- Use guardrails libraries (NeMo Guardrails, Guardrails AI) for structured validation
- Implement confidence thresholds: flag low-confidence outputs for human review
- Log all flagged outputs for review and model improvement

## Bias & Fairness

- Test model outputs across demographic groups for disparate impact
- Use diverse evaluation datasets that represent your user base
- Implement fairness metrics: demographic parity, equalized odds
- Document known model limitations and biases in user-facing documentation
- Regularly audit model outputs for stereotyping and discrimination

## Guardrails Architecture

- Layer defenses: input validation, system prompt constraints, output filtering
- Use constitutional AI principles: define explicit rules the model must follow
- Implement rate limiting per user to prevent abuse
- Create separate safety-focused system prompts that cannot be overridden
- Use model-as-judge: have a second LLM evaluate outputs for safety violations

## Transparency

- Clearly label AI-generated content to users
- Provide confidence scores or uncertainty indicators when possible
- Allow users to report problematic AI outputs (feedback loop)
- Document data sources and training methodology
- Publish model cards with capabilities, limitations, and intended use cases
Add this to a CLAUDE.md file at your project root, or append it to an existing one.
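The Input Safety checklist above can be sketched in a few stdlib-only Python functions. This is a minimal illustration, not a complete defense: the injection patterns and the 4000-character cap are placeholder assumptions, and a real blocklist needs regular updates.

```python
import re
import unicodedata

MAX_INPUT_CHARS = 4000  # assumed cap; tune to your model's context budget

# Illustrative (not exhaustive) instruction-override patterns.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"disregard (the|your) system prompt", re.IGNORECASE),
    re.compile(r"you are now (in )?developer mode", re.IGNORECASE),
]

def sanitize_input(text: str) -> str:
    """Normalize unicode, strip control characters, and enforce a length cap."""
    text = unicodedata.normalize("NFKC", text)
    # Drop control/format characters (Unicode category C*) except common whitespace.
    text = "".join(
        ch for ch in text
        if ch in "\n\t " or not unicodedata.category(ch).startswith("C")
    )
    return text[:MAX_INPUT_CHARS]

def looks_like_injection(text: str) -> bool:
    """Return True if any known override pattern matches the (sanitized) input."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```

In practice these checks run before the request ever reaches the model, alongside an ML-based input classifier for content the regexes cannot catch.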
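The PII scan from the Output Safety section reduces to pattern matching over model responses before they are shown or logged. The regexes below are simple heuristics for illustration; production systems typically pair them with a dedicated detector rather than relying on regexes alone.

```python
import re

# Heuristic patterns for common PII types (US-centric; illustrative only).
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
}

def scan_pii(text: str) -> dict[str, list[str]]:
    """Return matched PII snippets keyed by type; an empty dict means clean."""
    found: dict[str, list[str]] = {}
    for kind, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[kind] = matches
    return found

def redact_pii(text: str) -> str:
    """Replace detected PII with typed placeholders before display or logging."""
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{kind.upper()} REDACTED]", text)
    return text
```

A typical pipeline redacts the response and logs the `scan_pii` result so flagged outputs feed the review loop described above.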
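The demographic parity metric named under Bias & Fairness comes down to a small calculation: the gap between the highest and lowest positive-outcome rates across groups. A sketch, assuming binary model decisions grouped by a demographic attribute:

```python
def demographic_parity_gap(outcomes: dict[str, list[int]]) -> float:
    """Max difference in positive-outcome rate across groups (0.0 = parity).

    `outcomes` maps a group label to a list of binary model decisions
    (1 = positive outcome, 0 = negative).
    """
    rates = [sum(decisions) / len(decisions)
             for decisions in outcomes.values() if decisions]
    return max(rates) - min(rates)
```

Equalized odds is checked the same way but with rates computed separately within the true-positive and true-negative subsets, so it also requires ground-truth labels.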
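The layered-defense and model-as-judge ideas from Guardrails Architecture compose naturally into one pipeline. In this sketch, `generate` and `judge` are hypothetical stand-ins for your real LLM calls (the judge being a second model that returns whether an output is safe):

```python
from typing import Callable

def guarded_generate(
    user_input: str,
    generate: Callable[[str], str],  # hypothetical: your primary LLM call
    judge: Callable[[str], bool],    # hypothetical: second LLM, True = safe
) -> str:
    """Layered defense: validate input, generate, then judge the output."""
    # Layer 1: input validation (length cap and a crude injection check).
    if not user_input.strip() or len(user_input) > 4000:
        return "[BLOCKED: invalid input]"
    if "ignore previous instructions" in user_input.lower():
        return "[BLOCKED: possible prompt injection]"
    # Layer 2: generation (system prompt constraints live inside `generate`).
    output = generate(user_input)
    # Layer 3: model-as-judge review of the output.
    if not judge(output):
        return "[BLOCKED: output failed safety review]"
    return output
```

Blocked results would also be logged for the human-review loop described under Output Safety.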
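Per-user rate limiting, also listed under Guardrails Architecture, is commonly done with a token bucket. The capacity and refill rate below are placeholder values:

```python
import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-user bucket: `capacity` requests, refilled at `rate` tokens/second."""
    capacity: float = 10.0
    rate: float = 0.5
    tokens: float = 10.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per user id, created on first use.
_buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def check_rate_limit(user_id: str) -> bool:
    """Return True if this user's request is allowed, False if throttled."""
    return _buckets[user_id].allow()
```

An in-process dict only works for a single server; a shared store (e.g. Redis) is needed when requests fan out across instances.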