
AI Safety & Responsible AI

AI safety patterns including content filtering, bias detection, guardrails, and responsible deployment.

Updated 2026-04-05
CLAUDE.md
# AI Safety & Responsible AI

You are an expert in AI safety, responsible AI deployment, and alignment practices.

Input Safety:
- Implement prompt injection detection: scan for instruction-override attempts
- Use input classifiers to detect harmful or adversarial content before LLM processing
- Set maximum input lengths to prevent resource exhaustion attacks
- Sanitize user inputs: strip control characters, normalize unicode
- Maintain a blocklist of known jailbreak patterns (update regularly)
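A minimal sketch of the input-safety gate above, combining length limits, unicode normalization, control-character stripping, and a pattern blocklist. The patterns and the 4000-character limit are illustrative assumptions; a real blocklist must be maintained and updated regularly, and regex heuristics should back a trained classifier, not replace one.

```python
import re
import unicodedata

# Illustrative blocklist entries (assumed); update from real jailbreak reports.
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"disregard (the|your) system prompt",
    r"you are now\b",
]
MAX_INPUT_CHARS = 4000  # assumed limit to prevent resource exhaustion


def sanitize(text: str) -> str:
    """Normalize unicode and strip control characters (keeping newlines/tabs)."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] != "C" or ch in "\n\t"
    )


def check_input(text: str) -> tuple[bool, str]:
    """Heuristic pre-LLM gate: returns (allowed, reason)."""
    if len(text) > MAX_INPUT_CHARS:
        return False, "input too long"
    clean = sanitize(text).lower()
    for pat in INJECTION_PATTERNS:
        if re.search(pat, clean):
            return False, f"matched blocklist pattern: {pat}"
    return True, "ok"
```

Running `check_input` before any LLM call keeps the rejection cheap and deterministic; flagged inputs can also be logged for blocklist tuning.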

Output Safety:
- Implement content classification on all LLM outputs before displaying to users
- Filter for PII leakage: scan responses for emails, phone numbers, SSNs
- Use guardrails libraries (NeMo Guardrails, Guardrails AI) for structured validation
- Implement confidence thresholds: flag low-confidence outputs for human review
- Log all flagged outputs for review and model improvement
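A sketch of the PII-leakage scan described above, using simple regexes for emails, phone numbers, and SSNs. These patterns are deliberately naive assumptions for illustration; production systems should use a dedicated PII detector or a guardrails library rather than hand-rolled regexes.

```python
import re

# Naive illustrative patterns; real PII detection needs broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b(?:\+?1[-. ]?)?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_pii(text: str) -> dict[str, list[str]]:
    """Return matches per PII category found in an LLM output."""
    return {
        name: pat.findall(text)
        for name, pat in PII_PATTERNS.items()
        if pat.search(text)
    }


def redact(text: str) -> str:
    """Replace detected PII with labeled placeholders before display."""
    for name, pat in PII_PATTERNS.items():
        text = pat.sub(f"[REDACTED {name.upper()}]", text)
    return text
```

Scan results can be logged alongside the flagged output so reviewers see exactly which category triggered the filter.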

Bias & Fairness:
- Test model outputs across demographic groups for disparate impact
- Use diverse evaluation datasets that represent your user base
- Implement fairness metrics: demographic parity, equalized odds
- Document known model limitations and biases in user-facing documentation
- Regularly audit model outputs for stereotyping and discrimination
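The demographic-parity metric mentioned above can be computed directly from labeled evaluation records. This is a minimal sketch assuming each record is a `(group, positive_outcome)` pair; equalized odds additionally conditions on the true label and is not shown here.

```python
from collections import defaultdict


def demographic_parity_gap(records) -> float:
    """
    records: iterable of (group, positive_outcome: bool) pairs.
    Returns the largest difference in positive-outcome rate
    between any two demographic groups (0.0 = perfect parity).
    """
    pos = defaultdict(int)
    tot = defaultdict(int)
    for group, outcome in records:
        tot[group] += 1
        pos[group] += int(outcome)
    rates = {g: pos[g] / tot[g] for g in tot}
    return max(rates.values()) - min(rates.values())
```

A gap near zero suggests parity on this dataset; a large gap is a signal to audit the underlying outputs, not proof of cause.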

Guardrails Architecture:
- Layer defenses: input validation, system prompt constraints, output filtering
- Use constitutional AI principles: define explicit rules the model must follow
- Implement rate limiting per user to prevent abuse
- Create separate safety-focused system prompts that cannot be overridden
- Use model-as-judge: have a second LLM evaluate outputs for safety violations
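The layered-defense idea above can be expressed as a pipeline that runs input checks before generation and output checks (including a model-as-judge check) after. Everything here is a hypothetical skeleton: `generate` stands in for your LLM call, and each check is any callable returning `(allowed, reason)`, whether a regex gate, a classifier, or a second judge LLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardrailResult:
    allowed: bool
    reason: str
    text: str = ""


def guarded_completion(
    prompt: str,
    generate: Callable[[str], str],
    input_checks: list[Callable[[str], tuple[bool, str]]],
    output_checks: list[Callable[[str], tuple[bool, str]]],
) -> GuardrailResult:
    """Run layered checks before and after generation; fail closed."""
    for check in input_checks:
        ok, reason = check(prompt)
        if not ok:
            return GuardrailResult(False, f"input blocked: {reason}")
    out = generate(prompt)
    for check in output_checks:
        ok, reason = check(out)
        if not ok:
            return GuardrailResult(False, f"output blocked: {reason}")
    return GuardrailResult(True, "ok", out)
```

Because each layer is independent, a prompt that slips past input filtering can still be caught at output time, which is the point of defense in depth.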

Transparency:
- Clearly label AI-generated content to users
- Provide confidence scores or uncertainty indicators when possible
- Allow users to report problematic AI outputs (feedback loop)
- Document data sources and training methodology
- Publish model cards with capabilities, limitations, and intended use cases
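The user feedback loop above needs only a small reporting surface. This is a hypothetical in-memory sketch; a production system would persist reports to a database and route them into the review queue mentioned under Output Safety.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class OutputFeedback:
    output_id: str
    label: str                # e.g. "harmful", "inaccurate", "biased"
    comment: str = ""
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class FeedbackLog:
    """In-memory store for user reports on AI-generated outputs."""

    def __init__(self) -> None:
        self._reports: list[OutputFeedback] = []

    def report(self, fb: OutputFeedback) -> None:
        self._reports.append(fb)

    def flagged(self, label: str) -> list[OutputFeedback]:
        """All reports carrying a given label, for periodic audits."""
        return [r for r in self._reports if r.label == label]
```

Keyed by `output_id`, these reports can be joined back to the logged generation for root-cause analysis and model improvement.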

Add this to a CLAUDE.md file at your project root, or append it to an existing one.

Tags

ai-safety, alignment, guardrails, bias, responsible-ai, content-filtering