Monitoring & Observability Stack

Production monitoring with Prometheus, Grafana, OpenTelemetry, and alerting best practices.


Updated 2026-04-05

CLAUDE.md

# Monitoring & Observability Stack

You are an expert in observability, monitoring, and production reliability engineering.

Three Pillars:
1. Metrics: numeric measurements over time (Prometheus, Datadog, CloudWatch)
2. Logs: discrete events with context (Loki, Elasticsearch, CloudWatch Logs)
3. Traces: request flow across services (Jaeger, Tempo, X-Ray)

Prometheus:
- Use the pull model: Prometheus scrapes /metrics endpoints
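- Four metric types: counter (only goes up, resets on restart), gauge (goes up and down), histogram (bucketed distribution, aggregatable across instances), summary (client-side quantiles, not aggregatable)
- Naming convention: namespace_subsystem_name_unit, with base units and a _total suffix on counters (e.g., http_requests_total, http_request_duration_seconds)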
- Use labels for dimensions but keep cardinality low (never use user IDs as labels)
- Set scrape_interval to 15s for most targets; 5s for critical services
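
To make the naming, counter, and label guidance above concrete, here is a minimal stdlib-only sketch of a counter rendered in the Prometheus text exposition format, which is what a scrape of /metrics returns. The `Counter` class, its methods, and the `_escape` helper are hypothetical names written for illustration; a real service would normally use an official Prometheus client library rather than hand-rolling this.

```python
from collections import defaultdict


def _escape(value):
    # Label values must escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class Counter:
    """Illustrative counter: a value per label combination that only goes up."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        # A fixed, small set of label names keeps cardinality predictable.
        self.label_names = tuple(label_names)
        self._values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only go up")
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected exactly these labels: {self.label_names}")
        key = tuple(str(labels[name]) for name in self.label_names)
        self._values[key] += amount

    def render(self):
        """Return the metric in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for key, value in sorted(self._values.items()):
            pairs = ",".join(
                f'{name}="{_escape(val)}"' for name, val in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value:g}")
        return "\n".join(lines) + "\n"


# Labels such as method and status have a handful of values; a user ID would not.
requests_total = Counter(
    "http_requests_total", "Total HTTP requests handled.", ["method", "status"]
)
requests_total.inc(method="GET", status="200")
requests_total.inc(method="GET", status="200")
requests_total.inc(method="POST", status="500")
print(requests_total.render())
```

Each distinct label combination becomes its own time series, which is why unbounded values such as user IDs are kept out of labels.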

Grafana:
- Create resource dashboards with the USE method (Utilization, Saturation, Errors); for request-driven services, use the RED method (Rate, Errors, Duration)
- Use variables for reusable dashboards across environments and services
- Set meaningful Y-axis ranges; avoid auto-scaling that hides context
- Use annotations to mark deployments and incidents on graphs
- Alert from Grafana or from Prometheus alerting rules routed through Alertmanager; prefer Alertmanager for deduplication, grouping, and silencing

OpenTelemetry:
- Use the OpenTelemetry SDK as the single instrumentation layer
- Enable auto-instrumentation for HTTP, database, and messaging frameworks
- Propagate trace context across service boundaries (W3C Trace Context)
- Export via OTLP to any compatible backend (Jaeger, Tempo, Datadog, New Relic)
- Add custom spans for business-critical operations
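
As a rough illustration of what trace context propagation carries, the sketch below builds and parses the W3C Trace Context `traceparent` header using only the standard library. The function names are hypothetical, and the parser is deliberately strict (it accepts only the four-field layout). In a real service the OpenTelemetry SDK's propagators would normally handle this for you.

```python
import re
import secrets

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, new span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        return new_traceparent()
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(outgoing)["trace_id"])
```

The trace ID stays constant across every hop, while each service substitutes its own span ID as the parent for the next call. That shared trace ID is what lets a backend stitch spans from different services into one trace.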

Alerting:
- Alert on symptoms (high error rate, slow response time), not causes
- Use SLOs: define an error budget for each objective and alert when the burn rate shows the budget being spent too fast
- Set severity levels: critical (page), warning (ticket), info (dashboard)
- Include runbook links in every alert
- Avoid alert fatigue: tune thresholds, group related alerts, suppress flapping
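
The error-budget idea above can be sketched as a burn-rate calculation. This is a minimal illustration, not a prescribed policy: the function names are hypothetical, and the default threshold of 14.4 is an assumption taken from common multi-window practice (a burn rate of 14.4 spends about 2% of a 30-day budget in one hour). Requiring both a long and a short window to exceed the threshold is one way to stop paging soon after the problem has ended.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring.

    error_ratio: failed requests / total requests over some window (0.0 to 1.0)
    slo_target:  the objective, e.g. 0.999 for 99.9% success
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors, slo_target, threshold=14.4):
    """Page only if both the long and the short window burn faster than threshold."""
    return (
        burn_rate(long_window_errors, slo_target) >= threshold
        and burn_rate(short_window_errors, slo_target) >= threshold
    )


# 2% errors against a 99.9% SLO burns the budget roughly 20 times too fast.
print(burn_rate(0.02, 0.999))
# Sustained over the long window and still happening in the short window: page.
print(should_page(0.02, 0.02, slo_target=0.999))
# The short window has recovered, so the alert stays quiet.
print(should_page(0.02, 0.001, slo_target=0.999))
```

A burn rate of 1.0 means the budget would run out exactly at the end of the SLO period; values well above 1.0 are the "burning too fast" condition worth paging on.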

Structured Logging:
- Always emit JSON with a consistent set of fields: timestamp, level, message, trace_id, service
- Include correlation IDs to trace requests across services
- Log at appropriate levels: ERROR (action needed), WARN (degraded), INFO (business events), DEBUG (development)
- Never log sensitive data: passwords, tokens, PII
- Use log aggregation: ship to Loki, Elasticsearch, or CloudWatch Logs
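
One way to apply the logging rules above with Python's standard `logging` module is a custom formatter that emits one JSON object per line. This is a sketch under stated assumptions: `JsonFormatter`, `SENSITIVE_KEYS`, the `fields` extra key, and the "checkout" service name are all hypothetical. Key-based redaction is only a first line of defence, since it does not catch sensitive values embedded in the message text itself.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "authorization", "secret"}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            # Set via logger.info(..., extra={"trace_id": ...}); None if absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        for key, value in getattr(record, "fields", {}).items():
            if key in payload:
                continue  # never let extra fields overwrite the core ones
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "order placed",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "fields": {"order_id": 42, "token": "do-not-log-me"},
    },
)
```

Because every line is valid JSON with the same core keys, an aggregator such as Loki or Elasticsearch can index on `trace_id` and `service` without custom parsing rules.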

Add this to the CLAUDE.md file in your project root, or append it to an existing one.
