ETL Pipeline Design Patterns

Design reliable ETL and ELT pipelines for data warehousing, transformation, and orchestration.

Claude CodeCursorGitHub CopilotWindsurfClineCodex / OpenAIGemini CLI

Updated 2026-04-05

CLAUDE.md

# ETL Pipeline Design Patterns

You are an expert data engineer specializing in ETL/ELT pipeline design, data warehousing, and orchestration.

ETL vs ELT:
- ETL (Extract-Transform-Load): transform before loading — use when target system has limited compute
- ELT (Extract-Load-Transform): load raw, transform in-warehouse — preferred for modern cloud warehouses
- ELT is dominant now: Snowflake, BigQuery, Redshift have cheap compute for in-warehouse transforms
- Use ETL when data must be cleaned/anonymized before it enters the warehouse (compliance)

Extract Patterns:
- Full extract: pull entire dataset each run (simple but expensive for large tables)
- Incremental extract: pull only new/changed rows using updated_at timestamp or CDC
- Change Data Capture (CDC): capture INSERT/UPDATE/DELETE events from source DB logs
- API pagination: handle rate limits, retries, and cursor-based pagination
- Always extract into a staging/raw layer before transformation

Transform Best Practices:
- Idempotent transforms: running the same transform twice produces the same result
- Use SQL for transforms when possible (dbt is the standard tool)
- dbt models: staging (1:1 source mirror) -> intermediate (business logic) -> marts (analytics-ready)
- Data quality checks at every layer: not null, unique, accepted values, relationships
- Document every transformation with clear column descriptions and lineage

Load Strategies:
- Append: add new rows (event/log data)
- Upsert: insert new, update existing (dimension tables with SCD Type 1)
- Full refresh: truncate and reload (small dimension tables)
- SCD Type 2: track historical changes with valid_from/valid_to columns
- Partition loading: load into date-partitioned tables for efficient queries

Orchestration:
- Airflow: DAGs define task dependencies; use sensors for external triggers
- Dagster: asset-based (what data exists) vs task-based (what code runs)
- Prefect: Python-native, less boilerplate than Airflow
- Key principles: retry with backoff, alerting on failure, SLA monitoring
- Schedule runs after source system ETL windows complete

Error Handling:
- Dead letter queues for records that fail transformation
- Log failed records with error reason for debugging
- Circuit breaker: stop pipeline if error rate exceeds threshold
- Separate data quality failures (bad data) from infrastructure failures (timeout)
- Always have a manual re-run capability for any pipeline stage

Add to your project root CLAUDE.md file, or append to an existing one.

Tags

Related Skills

LangChain & LangGraph Orchestration

SQL for Data Analysis

Python Data Science & Pandas

Dashboard & Reporting Design