Data Mesh: Decentralizing Data Ownership

July 29, 2019

Centralized data teams become bottlenecks. Every request flows through a small team that can’t keep up. Data quality suffers because producers are disconnected from consumers. Analytics lag months behind business needs.

Data mesh is an alternative: decentralize data ownership to domain teams while maintaining interoperability.

The Problem with Centralized Data

Bottleneck Teams

Business Teams → Central Data Team → Analytics Output
   (many)            (few)              (delayed)

Central teams can’t keep up with requests from many domains.

Disconnected Ownership

Data producers don’t see consumer needs:

Monolithic Data Platforms

Giant data warehouses become their own monoliths:

Data Mesh Principles

Domain Ownership

Domains own their data as a product:

Orders Domain:
  - Owns order data
  - Publishes order facts
  - Maintains quality
  - Serves consumers

Users Domain:
  - Owns user data
  - Publishes user facts
  - Maintains quality
  - Serves consumers

Data as a Product

Treat data consumers as customers:

data_product:
  name: orders-completed
  owner: orders-team
  description: "Completed order facts for analytics"
  sla:
    freshness: 15_minutes
    availability: 99.9%
  schema:
    - order_id: string
    - customer_id: string
    - total_amount: decimal
    - completed_at: timestamp
  documentation: https://data.company.com/products/orders-completed
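
An SLA in a descriptor like this becomes more useful once something checks it. The sketch below is illustrative only: the DataProduct class, its field names, and the is_fresh helper are assumptions made for this example, not part of any data mesh tool or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DataProduct:
    """Hypothetical in-code mirror of the descriptor above."""
    name: str
    owner: str
    freshness: timedelta  # maximum allowed age of the newest record
    schema: dict          # column name -> logical type

    def is_fresh(self, last_updated: datetime, now: datetime) -> bool:
        """True if the product was updated within its freshness SLA."""
        return now - last_updated <= self.freshness


orders_completed = DataProduct(
    name="orders-completed",
    owner="orders-team",
    freshness=timedelta(minutes=15),
    schema={
        "order_id": "string",
        "customer_id": "string",
        "total_amount": "decimal",
        "completed_at": "timestamp",
    },
)

now = datetime(2019, 7, 29, 12, 0, tzinfo=timezone.utc)
print(orders_completed.is_fresh(now - timedelta(minutes=10), now))  # within SLA
print(orders_completed.is_fresh(now - timedelta(minutes=20), now))  # SLA breached
```

A platform could run a check of this kind on a schedule and alert the owning team, which keeps responsibility for the SLA with the domain rather than with a central team.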

Self-Serve Data Platform

A central platform enables domain teams:

Platform provides:
├── Data storage (warehouse/lake)
├── Processing infrastructure
├── Schema registry
├── Data quality tools
├── Discovery catalog
└── Access control

Domains provide:
├── Data products
├── Transformations
├── Quality rules
└── Documentation

Federated Governance

Standards without central control:

Global standards:
- Naming conventions
- Data formats
- Security requirements
- Quality minimums

Domain autonomy:
- Implementation details
- Tooling choices
- Publishing schedule
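
A global standard is easiest to apply consistently when it is executable. As a sketch, suppose the governance group agrees that data product names are lowercase, hyphen-separated, and prefixed with the owning domain (names such as orders-completed in this article happen to follow that pattern). Both the rule and the check_product_name function are hypothetical.

```python
import re

# Assumed convention: lowercase words joined by hyphens, e.g. "orders-completed".
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)+")


def check_product_name(name: str, domain: str) -> list:
    """Return a list of violations of the (hypothetical) naming standard."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append(f"{name!r} is not lowercase and hyphen-separated")
    if not name.startswith(domain + "-"):
        problems.append(f"{name!r} does not start with its domain {domain!r}")
    return problems


print(check_product_name("orders-completed", "orders"))  # no violations
print(check_product_name("Completed_Orders", "orders"))  # two violations
```

The standard itself is global, but each domain can run the check wherever it fits in its own tooling, which is the split between shared rules and local implementation described above.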

Implementation

Data Product Structure

orders-data-product/
├── src/
│   ├── transformations/
│   │   └── completed_orders.sql
│   └── quality/
│       └── completeness_checks.py
├── schema/
│   └── orders_completed.avsc
├── tests/
│   └── test_transformations.py
├── docs/
│   └── README.md
└── data_product.yaml

Schema Management

Central registry, domain ownership:

# Schema registry entry
apiVersion: schema/v1
kind: Schema
metadata:
  name: orders-completed
  domain: orders
  owner: orders-team@company.com
spec:
  type: avro
  compatibility: BACKWARD
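  # Avro's decimal logical type requires a precision. The precision and scale
  # below are illustrative values, not taken from a real schema.
  schema: |
    {
      "type": "record",
      "name": "OrderCompleted",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "customer_id", "type": "string"},
        {"name": "total_amount", "type": {"type": "bytes", "logicalType": "decimal", "precision": 12, "scale": 2}},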
        {"name": "completed_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
      ]
    }
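
BACKWARD compatibility means a consumer using the new schema can still read data written with the previous one. The sketch below shows a deliberately simplified version of that check: an added field needs a default, and a field present in both versions must keep its type. Real Avro schema resolution also permits type promotions and aliases, which are ignored here. The function name and the currency field are made up for the example.

```python
def backward_compatible(old_schema: dict, new_schema: dict) -> list:
    """Return reasons the new schema cannot read data written with the old one.

    Simplified rules: a field added in the new schema needs a default, and a
    field present in both schemas must keep the same type.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        previous = old_fields.get(field["name"])
        if previous is None:
            if "default" not in field:
                problems.append(f"new field {field['name']!r} has no default")
        elif previous["type"] != field["type"]:
            problems.append(f"field {field['name']!r} changed type")
    return problems


v1 = {"type": "record", "name": "OrderCompleted", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
]}
# Adding an optional field with a default keeps old data readable.
v2_ok = {**v1, "fields": v1["fields"] + [
    {"name": "currency", "type": "string", "default": "USD"},
]}
# Adding a required field does not: old records have no value for it.
v2_bad = {**v1, "fields": v1["fields"] + [
    {"name": "currency", "type": "string"},
]}

print(backward_compatible(v1, v2_ok))   # empty list: compatible
print(backward_compatible(v1, v2_bad))  # one problem reported
```

Removing a field passes this check as well, since a reader simply ignores values it no longer declares. In practice a schema registry would enforce the rule at publish time, so the owning domain finds out before consumers break.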

Data Quality

Quality as code, owned by domains:

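# Uses the legacy pandas dataset API (ge.from_pandas) from pre-1.0
# Great Expectations releases; newer releases use a different API.
import great_expectations as ge

def validate_orders_completed(df):
    """Return True if the DataFrame passes every orders-completed check."""
    dataset = ge.from_pandas(df)
    results = [
        dataset.expect_column_values_to_not_be_null("order_id"),
        dataset.expect_column_values_to_not_be_null("completed_at"),
        dataset.expect_column_values_to_be_between("total_amount", 0, 1000000),
        dataset.expect_column_values_to_match_regex("order_id", r"ord_[a-z0-9]+"),
    ]
    return all(result["success"] for result in results)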
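
A domain may also want to summarize its checks as a single number it can publish alongside the product. The following is a minimal, library-free sketch of one way to do that: the score is the fraction of rows that pass every rule. The RULES table and quality_score function are assumptions for illustration, and unlike the checks above this version treats a missing total_amount as a failure.

```python
import re

# Each rule maps a row (a dict) to True when the row passes.
RULES = {
    "order_id present": lambda row: row.get("order_id") is not None,
    "completed_at present": lambda row: row.get("completed_at") is not None,
    "total_amount in range": lambda row: (
        row.get("total_amount") is not None
        and 0 <= row["total_amount"] <= 1000000
    ),
    "order_id format": lambda row: bool(
        re.search(r"ord_[a-z0-9]+", row.get("order_id") or "")
    ),
}


def quality_score(rows):
    """Fraction of rows that pass every rule (1.0 for an empty input)."""
    if not rows:
        return 1.0
    passing = sum(all(rule(row) for rule in RULES.values()) for row in rows)
    return passing / len(rows)


rows = [
    {"order_id": "ord_a1", "completed_at": 1564358400000, "total_amount": 25},
    {"order_id": "ord_b2", "completed_at": 1564358460000, "total_amount": 40},
    {"order_id": None, "completed_at": 1564358520000, "total_amount": 10},
    {"order_id": "ord_c3", "completed_at": 1564358580000, "total_amount": -5},
]
print(quality_score(rows))  # 2 of 4 rows pass every rule
```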

Discovery

Data catalog for discoverability:

Data Catalog:
├── orders-completed (Orders Team)
│   ├── Description: Completed order facts
│   ├── Schema: order_id, customer_id, total_amount, completed_at
│   ├── Freshness: 15 minutes
│   ├── Quality Score: 98%
│   └── Lineage: orders_raw → orders_cleaned → orders_completed
├── users-active (Users Team)
│   └── ...
└── products-inventory (Products Team)
    └── ...
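
To make the catalog structure concrete, here is a small in-memory sketch of registration, search, and lineage lookup. The CatalogEntry and DataCatalog classes are hypothetical. A real catalog would persist entries, index them, and enforce access control. Only the orders-completed entry is registered, because the other entries are elided above.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    columns: list
    lineage: list = field(default_factory=list)  # upstream datasets, oldest first


class DataCatalog:
    """Hypothetical in-memory catalog keyed by data product name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> list:
        """Names of products whose name, description, or columns mention term."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in c.lower() for c in e.columns)
        )

    def upstream(self, name: str) -> list:
        """Datasets the named product is derived from."""
        return list(self._entries[name].lineage)


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="orders-completed",
    owner="Orders Team",
    description="Completed order facts",
    columns=["order_id", "customer_id", "total_amount", "completed_at"],
    lineage=["orders_raw", "orders_cleaned"],
))

print(catalog.search("customer_id"))          # finds orders-completed by column
print(catalog.upstream("orders-completed"))   # its upstream datasets
```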

Team Structure

Domain Data Teams

Each domain needs data capability:

Orders Domain Team:
├── Backend Engineers
├── Data Engineer(s)  ← embedded
├── Analyst(s)        ← embedded
└── Product Manager

Platform Team

Enables domain teams:

Data Platform Team:
├── Build infrastructure
├── Provide tooling
├── Set standards
├── Support adoption
└── Don't own domain data

Federated Governance

Cross-domain coordination:

Data Guild:
├── Representatives from each domain
├── Platform team
└── Central analytics (if one exists)

Responsibilities:
├── Agree on standards
├── Resolve cross-domain issues
├── Evolve governance
└── Share best practices

Challenges

Organizational Change

Data mesh requires:

Duplication Concerns

Some duplication is acceptable:

Interoperability

Cross-domain analytics need:

Platform Investment

Self-serve platform isn’t free:

When Data Mesh Fits

Good fit:

Poor fit:

Key Takeaways

Data mesh is organizational design, not just architecture. The technology follows the people structure.