Monitoring & Observability Stack
Production monitoring with Prometheus, Grafana, OpenTelemetry, and alerting best practices.
CLAUDE.md
# Monitoring & Observability Stack

You are an expert in observability, monitoring, and production reliability engineering.

Three Pillars:
1. Metrics: numeric measurements over time (Prometheus, Datadog, CloudWatch)
2. Logs: discrete events with context (Loki, Elasticsearch, CloudWatch Logs)
3. Traces: request flow across services (Jaeger, Tempo, X-Ray)

Prometheus:
- Use the pull model: Prometheus scrapes /metrics endpoints
- Four metric types: counter (only up), gauge (up/down), histogram (distribution), summary
- Naming convention: namespace_subsystem_name_unit (e.g., http_requests_total)
- Use labels for dimensions but keep cardinality low (never use user IDs as labels)
- Set scrape_interval to 15s for most targets; 5s for critical services

Grafana:
- Create dashboards with the USE method: Utilization, Saturation, Errors
- Use variables for reusable dashboards across environments and services
- Set meaningful Y-axis ranges; avoid auto-scaling that hides context
- Use annotations to mark deployments and incidents on graphs
- Alert from Grafana or Alertmanager; prefer Alertmanager for deduplication

OpenTelemetry:
- Use OpenTelemetry SDK as the single instrumentation layer
- Auto-instrumentation for HTTP, database, and messaging frameworks
- Propagate trace context across service boundaries (W3C Trace Context)
- Export to any backend: Jaeger, Tempo, Datadog, New Relic via OTLP
- Add custom spans for business-critical operations

Alerting:
- Alert on symptoms (high error rate, slow response time), not causes
- Use SLOs: define error budgets and alert when budget is burning too fast
- Set severity levels: critical (page), warning (ticket), info (dashboard)
- Include runbook links in every alert
- Avoid alert fatigue: tune thresholds, group related alerts, suppress flapping

Structured Logging:
- Always use JSON format: timestamp, level, message, trace_id, service
- Include correlation IDs to trace requests across services
- Log at appropriate levels: ERROR (action needed), WARN (degraded), INFO (business events), DEBUG (development)
- Never log sensitive data: passwords, tokens, PII
- Use log aggregation: ship to Loki, Elasticsearch, or CloudWatch Logs
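
Example (Structured Logging): a minimal sketch of the JSON log shape listed above, using only the Python standard library. The `JsonFormatter` and `build_logger` names and the `checkout` service name are illustrative assumptions, not part of any particular logging stack.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name supplied by the caller

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is attached per call via `extra=`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
            "service": self.service,
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order placed", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})
```

Passing `trace_id` through `extra=` keeps the correlation ID out of the message text, so the aggregation backend can index it as its own field.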
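
Example (Alerting): one way to express "budget is burning too fast" is a burn rate, the observed error ratio divided by the ratio the SLO allows. The sketch below assumes a request-based SLO; the function names, the two-window check, and the sample numbers are illustrative, and the threshold is something each team tunes for itself.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    values above 1.0 mean the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    error_budget = 1.0 - slo_target  # allowed fraction of failed requests
    return (errors / total) / error_budget


def should_page(short_window: float, long_window: float, threshold: float) -> bool:
    """Page only when both windows exceed the threshold.

    Requiring the long window filters short spikes; requiring the short
    window lets the alert clear soon after the problem is fixed.
    """
    return short_window > threshold and long_window > threshold


if __name__ == "__main__":
    # Hypothetical counts: 5-minute and 1-hour windows against a 99.9% SLO.
    short = burn_rate(errors=90, total=5_000, slo_target=0.999)
    long_ = burn_rate(errors=800, total=50_000, slo_target=0.999)
    print(should_page(short, long_, threshold=10.0))
```

Alerting on burn rate rather than a raw error count keeps the alert tied to the symptom users see, whatever the underlying cause.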
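
Example (OpenTelemetry): W3C Trace Context carries the trace across service boundaries in a `traceparent` header of the form `version-trace_id-parent_id-flags`. The OpenTelemetry SDK's propagators normally handle this for you; the standard-library sketch below only illustrates the header format, accepts version `00` only, and uses hypothetical helper names.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"(00)-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    _version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        print(child_traceparent(ctx))
```

The trace ID stays constant across every hop while each service contributes a fresh span ID, which is what lets a tracing backend stitch the spans into one request flow.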
Add this to the CLAUDE.md file in your project root, or append it to an existing one.