Building Data Pipelines That Don't Break

April 24, 2017

Data pipelines have a reputation for brittleness. They break when source schemas change. They fail silently, corrupting downstream data. They’re difficult to debug and tedious to maintain.

This reputation is earned, but not inevitable. With careful design, data pipelines can be as reliable as any other production system.

Why Pipelines Break

Understanding failure modes helps prevent them:

Schema Evolution

Upstream systems change schemas without warning. New fields appear. Fields are renamed. Types change. Pipelines expecting specific schemas fail.

Data Quality Issues

Bad data enters the pipeline: null values in required fields, malformed timestamps, duplicate records, values outside expected ranges.

Pipelines assuming clean data corrupt downstream systems.

Volume Spikes

Traffic increases beyond expected levels. Pipelines that worked at 1,000 records/minute fail at 10,000.

Dependency Failures

External systems become unavailable: databases go down, APIs time out or rate-limit, networks partition.

Pipelines without failure handling stop completely.

Silent Failures

The worst failure: pipelines that appear to work but produce wrong results. No errors, but data is incorrect or incomplete.

Design Principles

Schema Flexibility

Don’t assume rigid schemas:

Schema-on-read: Store raw data as received. Apply schema when reading. Tolerates upstream changes.

Explicit schema versioning: Track schema versions. Handle multiple versions in pipeline logic.

Graceful degradation: Log unknown fields rather than failing on them; fill missing optional fields with defaults.

import logging

log = logging.getLogger(__name__)

def process_record(record):
    # Required fields - fail loudly if missing
    user_id = record['user_id']

    # Optional fields - default if missing
    email = record.get('email', None)

    # Unknown fields - log and continue rather than failing
    known_fields = {'user_id', 'email', 'name', ...}
    unknown = set(record.keys()) - known_fields
    if unknown:
        log.warning(f"Unknown fields: {unknown}")

    return process(user_id, email, ...)

Idempotency

Operations should be safe to retry:

Same result on re-run: Processing the same data twice produces the same outcome.

Deduplication: Detect and handle duplicate records.

Upsert over insert: Update if exists, insert if not. Handles replays gracefully.

def write_record(record):
    # Upsert - safe to retry; a replay updates the row instead of duplicating it
    db.execute("""
        INSERT INTO users (id, data)
        VALUES (%s, %s)
        ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data
    """, (record.id, record.data))

Data Validation

Validate data at pipeline boundaries:

Input validation: Check incoming data against expectations. Reject or quarantine invalid records.

Output validation: Verify pipeline output meets quality standards before committing.

Assertions: Explicit checks for invariants.

class ValidationError(ValueError):
    """Raised when a record fails input validation."""

def validate_record(record):
    if not record.get('user_id'):
        raise ValidationError("Missing user_id")

    age = record.get('age')
    if age is not None and not 0 <= age <= 150:
        raise ValidationError(f"Invalid age: {age}")

    return record

Dead Letter Queues

Don’t lose failed records:

Input → Processing → Success → Output
              ↓
            Failure
              ↓
         Dead Letter Queue

Failed records go to a dead letter queue for investigation and reprocessing.
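A minimal sketch of the pattern in Python. The `dlq` object (anything with a send() method: a queue producer, an object-store writer, a database table) and the `process` callable are illustrative stand-ins, not a specific library API:

import json
import logging
import traceback

log = logging.getLogger(__name__)

def process_with_dlq(records, process, dlq):
    """Process records, routing failures to a dead letter queue."""
    for record in records:
        try:
            process(record)
        except Exception as exc:
            # Keep the original record plus enough context to debug
            # and replay it later.
            dlq.send(json.dumps({
                "record": record,
                "error": str(exc),
                "traceback": traceback.format_exc(),
            }))
            log.warning("Record routed to DLQ: %s", exc)

Storing the error message and traceback alongside the original record makes investigation and replay far easier than a bare failure count.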

Checkpointing

For long-running pipelines, save progress:

Checkpoint state: Periodically save current position.

Restart from checkpoint: On failure, resume from last checkpoint rather than restarting.

Consistent checkpoints: Ensure checkpoint represents consistent state.
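A rough sketch of file-based checkpointing. The `source.read_batches(after=...)` interface, which yields (batch, new_position) pairs, is hypothetical; production pipelines usually store checkpoints in a database or lean on the framework's own offset tracking:

import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # illustrative; often a database row or object-store key

def load_checkpoint():
    """Return the last committed position, or None on a fresh start."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["position"]
    return None

def save_checkpoint(position):
    """Write progress atomically so a crash never leaves a half-written checkpoint."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"position": position}, f)
    os.replace(tmp, CHECKPOINT_FILE)

def run(source, process_batch):
    position = load_checkpoint()
    for batch, new_position in source.read_batches(after=position):
        process_batch(batch)
        save_checkpoint(new_position)  # commit progress only after the batch succeeds

Saving the checkpoint only after a batch fully succeeds, combined with idempotent writes, means a crash at worst reprocesses one batch.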

Monitoring and Alerting

Key Metrics

Throughput: Records processed per time period. Sudden drops indicate problems.

Latency: Time from source to destination. Increasing latency indicates bottlenecks.

Error rate: Failed records / total records. Any sustained non-zero error rate needs investigation.

Lag: For streaming, how far behind real-time. Growing lag indicates processing can’t keep up.

Data freshness: Age of newest data in destination. Stale data indicates pipeline issues.
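As an illustration, a minimal in-process tracker for a few of these metrics might look like the following; in practice you would export the counters to a metrics backend (Prometheus, StatsD, CloudWatch) rather than keep them in memory:

import time

class PipelineMetrics:
    """Minimal in-process counters for error rate and lag."""

    def __init__(self):
        self.processed = 0
        self.failed = 0
        self.latest_event_time = None  # epoch seconds of the newest event seen

    def record_success(self, event_time):
        self.processed += 1
        self.latest_event_time = event_time

    def record_failure(self):
        self.failed += 1

    @property
    def error_rate(self):
        total = self.processed + self.failed
        return self.failed / total if total else 0.0

    @property
    def lag_seconds(self):
        """How far behind real time the pipeline is (streaming)."""
        if self.latest_event_time is None:
            return None
        return time.time() - self.latest_event_time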

Alerting

Alert on: pipeline run failures, sustained error rates, growing lag, stale destination data, and throughput dropping well below normal.
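A toy sketch of evaluating those conditions against thresholds; in real deployments these rules normally live in the monitoring system itself (alerting rules, alarms) rather than in pipeline code, and the metric and threshold names here are illustrative:

def evaluate_alerts(metrics, thresholds):
    """Return the alert messages that should fire given current metric values."""
    alerts = []
    if metrics.get("consecutive_failures", 0) > 0:
        alerts.append("Pipeline run failed")
    if metrics.get("error_rate", 0.0) > thresholds["error_rate"]:
        alerts.append("Error rate above threshold")
    if metrics.get("lag_seconds", 0) > thresholds["lag_seconds"]:
        alerts.append("Lag growing beyond threshold")
    if metrics.get("freshness_seconds", 0) > thresholds["freshness_seconds"]:
        alerts.append("Destination data is stale")
    return alerts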

Data Quality Monitoring

Check output data quality:

-- Example quality checks
SELECT
  COUNT(*) as row_count,
  COUNT(*) FILTER (WHERE user_id IS NULL) as null_user_ids,
  COUNT(DISTINCT user_id) as unique_users,
  MIN(created_at) as earliest_record,
  MAX(created_at) as latest_record
FROM output_table
WHERE created_at > NOW() - INTERVAL '1 day'

Testing

Unit Tests

Test transformation logic in isolation: pure functions with known inputs and expected outputs, as in the sketch below.
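For example, with pytest (the module and function names, mypipeline.transforms and normalize_email, are illustrative):

# test_transforms.py -- run with pytest
from mypipeline.transforms import normalize_email

def test_normalize_email_lowercases_and_strips():
    assert normalize_email("  Alice@Example.COM ") == "alice@example.com"

def test_normalize_email_rejects_missing_at_sign():
    assert normalize_email("not-an-email") is None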

Integration Tests

Test against realistic data: run the pipeline on a representative sample of production-like data in a staging environment and check the output.

Data Tests

Run tests on actual output: assert row counts, null rates, uniqueness, and freshness on the tables the pipeline produces, as sketched below.
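A sketch of running such checks from Python, assuming a DB-API connection (psycopg2, for example) and the output_table from the quality query above:

def check_output_quality(conn):
    """Fail loudly if yesterday's output violates basic expectations."""
    cur = conn.cursor()
    cur.execute("""
        SELECT
          COUNT(*) AS row_count,
          COUNT(*) FILTER (WHERE user_id IS NULL) AS null_user_ids
        FROM output_table
        WHERE created_at > NOW() - INTERVAL '1 day'
    """)
    row_count, null_user_ids = cur.fetchone()

    assert row_count > 0, "No rows written in the last day"
    assert null_user_ids == 0, f"{null_user_ids} rows missing user_id"

Run this as the last step of the pipeline so bad output fails the run instead of quietly reaching consumers.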

Pipeline Tests

Test full pipeline behavior: run a small, known input through the entire pipeline and compare against expected output, including failure paths.
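A sketch of an end-to-end test with pytest; run_pipeline and the JSONL input format are stand-ins for your pipeline's real entry point:

# test_pipeline.py -- run with pytest
import json
from mypipeline import run_pipeline

def test_pipeline_end_to_end(tmp_path):
    input_file = tmp_path / "input.jsonl"
    input_file.write_text(json.dumps({"user_id": 1, "email": " A@Example.COM "}) + "\n")

    output = run_pipeline(str(input_file))

    assert output == [{"user_id": 1, "email": "a@example.com"}]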

Operational Practices

Backfill Capability

Pipelines need to reprocess historical data: bugs get fixed, transformation logic changes, new fields are added.

Design for easy backfill: parameterize runs by date range, keep writes idempotent, and make sure backfill runs don't interfere with live processing. A sketch follows.
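One way to make backfills routine is a date-parameterized entry point. This sketch assumes day-partitioned, idempotent writes; the body of run_for_day is a placeholder for your pipeline's real processing:

import argparse
from datetime import datetime, timedelta

def run_for_day(day):
    """Process one day's partition. Idempotent: re-running a day replaces
    that day's output instead of appending to it."""
    print(f"processing partition {day.isoformat()}")
    # pipeline.process_partition(day)  # hypothetical entry point

def backfill(start, end):
    day = start
    while day <= end:
        run_for_day(day)
        day += timedelta(days=1)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Reprocess a historical date range")
    parser.add_argument("--start", required=True, help="YYYY-MM-DD")
    parser.add_argument("--end", required=True, help="YYYY-MM-DD")
    args = parser.parse_args()
    start = datetime.strptime(args.start, "%Y-%m-%d").date()
    end = datetime.strptime(args.end, "%Y-%m-%d").date()
    backfill(start, end)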

Version Control

Pipeline code is code. Treat it accordingly: keep it in version control, review changes, run it through CI, and deploy it like any other service.

Runbooks

Document operational procedures: how to restart the pipeline, how to run a backfill, how to inspect and replay the dead letter queue, and who to escalate to when data looks wrong.

Tool Selection

Batch Processing

Apache Spark: Distributed processing for large datasets.

Apache Airflow: Workflow orchestration. DAGs define pipeline dependencies.

dbt: SQL-based transformations. Great for analytics pipelines.

Stream Processing

Apache Kafka: Distributed streaming platform. Often the backbone of real-time pipelines.

Apache Flink: Stream processing with exactly-once semantics.

AWS Kinesis / GCP Dataflow: Managed streaming services.

Choose Based On

Data volume, latency requirements (batch versus streaming), your team's existing expertise, and the infrastructure you already operate. The tool your team can run reliably beats the theoretically better one.

Key Takeaways

Expect change: schemas evolve, volumes grow, and dependencies fail. Design for it instead of assuming stability.

Make operations idempotent and validate data at every boundary.

Never lose failed records: route them to a dead letter queue and checkpoint progress so restarts are cheap.

Monitor throughput, latency, error rate, lag, and freshness, and alert on sustained anomalies.

Treat pipelines like any other production system: version control, tests, runbooks, and routine backfills.