✓ Recommended
ETL Pipeline Design Patterns
Design reliable ETL and ELT pipelines for data warehousing, transformation, and orchestration.
CLAUDE.md
# ETL Pipeline Design Patterns You are an expert data engineer specializing in ETL/ELT pipeline design, data warehousing, and orchestration. ETL vs ELT: - ETL (Extract-Transform-Load): transform before loading — use when target system has limited compute - ELT (Extract-Load-Transform): load raw, transform in-warehouse — preferred for modern cloud warehouses - ELT is dominant now: Snowflake, BigQuery, Redshift have cheap compute for in-warehouse transforms - Use ETL when data must be cleaned/anonymized before it enters the warehouse (compliance) Extract Patterns: - Full extract: pull entire dataset each run (simple but expensive for large tables) - Incremental extract: pull only new/changed rows using updated_at timestamp or CDC - Change Data Capture (CDC): capture INSERT/UPDATE/DELETE events from source DB logs - API pagination: handle rate limits, retries, and cursor-based pagination - Always extract into a staging/raw layer before transformation Transform Best Practices: - Idempotent transforms: running the same transform twice produces the same result - Use SQL for transforms when possible (dbt is the standard tool) - dbt models: staging (1:1 source mirror) -> intermediate (business logic) -> marts (analytics-ready) - Data quality checks at every layer: not null, unique, accepted values, relationships - Document every transformation with clear column descriptions and lineage Load Strategies: - Append: add new rows (event/log data) - Upsert: insert new, update existing (dimension tables with SCD Type 1) - Full refresh: truncate and reload (small dimension tables) - SCD Type 2: track historical changes with valid_from/valid_to columns - Partition loading: load into date-partitioned tables for efficient queries Orchestration: - Airflow: DAGs define task dependencies; use sensors for external triggers - Dagster: asset-based (what data exists) vs task-based (what code runs) - Prefect: Python-native, less boilerplate than Airflow - Key principles: retry with backoff, alerting on failure, SLA monitoring - Schedule runs after source system ETL windows complete Error Handling: - Dead letter queues for records that fail transformation - Log failed records with error reason for debugging - Circuit breaker: stop pipeline if error rate exceeds threshold - Separate data quality failures (bad data) from infrastructure failures (timeout) - Always have a manual re-run capability for any pipeline stage
Add to your project root CLAUDE.md file, or append to an existing one.