Observability for Distributed Teams

September 14, 2020

When teams work remotely, the informal knowledge sharing that happens in an office disappears. You can’t walk over to someone’s desk to debug an issue together. Tribal knowledge about how systems behave becomes harder to maintain.

Observability fills this gap. Well-instrumented systems tell their own story. Here’s how to build observability that supports distributed teams.

The Challenge

What Changes Remotely

Before (co-located):

- Walk over to a desk and debug together
- Pick up context from conversations happening around you
- Tribal knowledge spreads informally

After (distributed):

- Investigation happens asynchronously, often across time zones
- Context has to be written down or captured by tooling
- The system itself has to explain what it's doing

Why Observability Matters More

Good observability enables:

- Asynchronous investigation: anyone can pick up an issue without a live handoff
- Self-service debugging instead of waiting for the one expert who knows the system
- Faster onboarding, because newcomers learn how the system behaves from the system itself
- A shared, objective view of health that doesn't depend on who happens to be online

The system becomes the source of truth, not tribal knowledge.

Building Observable Systems

The Three Pillars Plus

Classic three pillars:
├── Metrics (aggregated, time-series)
├── Logs (events, context)
└── Traces (request flow)

Additional:
├── Profiling (code-level performance)
├── Events (business-level occurrences)
└── Dashboards (visual understanding)

Metrics That Matter

Focus on actionable metrics:

# RED metrics (request-focused)
rate:     requests per second
errors:   error rate
duration: request latency (p50, p95, p99)

# USE metrics (resource-focused)
utilization: percentage busy
saturation:  queue depth
errors:      error count

# Business metrics
orders_per_minute: value delivery
signup_conversion: business health
active_users:      engagement
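
If you're on Prometheus, emitting RED metrics from Python takes only a few lines with the prometheus_client library. Here's a minimal sketch; the service name, labels, and port are illustrative, not a convention your stack necessarily uses:

import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names and labels
REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["service", "endpoint", "status"]
)
LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency in seconds",
    ["service", "endpoint"]
)

def instrumented(endpoint, handler):
    """Wrap a request handler with rate, error, and duration metrics."""
    start = time.monotonic()
    status = "200"
    try:
        return handler()
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels("orders", endpoint, status).inc()
        LATENCY.labels("orders", endpoint).observe(time.monotonic() - start)

start_http_server(8000)  # expose /metrics for scraping

Request rate and error rate then come from rate() queries over the counter; p50/p95/p99 come from histogram_quantile() over the histogram buckets.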

Avoid vanity metrics:

- Totals that only ever go up (all-time request counts, lifetime signups)
- Averages that hide tail latency
- Anything nobody would page on or act on

Structured Logging

Logs that machines and humans can read:

# Bad: Unstructured
logger.info(f"Processing order {order_id} for user {user_id}")

# Good: Structured
logger.info("Processing order", extra={
    "order_id": order_id,
    "user_id": user_id,
    "amount": order.total,
    "items": len(order.items),
    "correlation_id": context.correlation_id
})

Output (JSON):

{
  "timestamp": "2020-09-14T10:30:00Z",
  "level": "INFO",
  "message": "Processing order",
  "order_id": "12345",
  "user_id": "67890",
  "amount": 99.99,
  "items": 3,
  "correlation_id": "abc-123",
  "service": "orders",
  "environment": "production"
}
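
How the extra fields become that JSON depends on your logging stack. If you're on the standard library, one minimal sketch is a formatter that merges everything passed via extra into a JSON document; the hard-coded service and environment values here are stand-ins for whatever your deploy injects:

import json
import logging
import time

class JsonFormatter(logging.Formatter):
    converter = time.gmtime  # timestamps in UTC
    # Attributes present on every LogRecord; everything else came from extra=
    RESERVED = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "orders",          # stand-in: inject per service
            "environment": "production",  # stand-in: inject per environment
        }
        payload.update(
            {k: v for k, v in vars(record).items() if k not in self.RESERVED}
        )
        return json.dumps(payload, default=str)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)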

Distributed Tracing

Track requests across services:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def process_order(order_id):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Child span for database
        with tracer.start_as_current_span("fetch_order"):
            order = db.get_order(order_id)

        # Child span for payment
        with tracer.start_as_current_span("process_payment"):
            payment_result = payment_service.charge(order)

        if payment_result.failed:
            span.set_status(Status(StatusCode.ERROR))
            span.record_exception(payment_result.error)

        return order

Trace visualization shows:

process_order (150ms)
├── fetch_order (20ms)
├── process_payment (100ms)
│   ├── validate_card (10ms)
│   └── charge_card (90ms)
└── update_inventory (30ms)
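
Spans are only useful once they leave the process and land somewhere the whole team can query them. A sketch of wiring the tracer to a collector over OTLP using the OpenTelemetry SDK packages; the endpoint, service name, and environment are assumptions for this example:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so traces from different teams are distinguishable
resource = Resource.create({
    "service.name": "orders",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # do this once, at service startup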

Self-Service Debugging

Runbooks in Dashboards

Embed troubleshooting guidance:

# Grafana dashboard annotation
panels:
  - title: "Error Rate"
    description: |
      ## Troubleshooting High Error Rate

      1. Check recent deployments: [Deployment History](link)
      2. Look for correlated latency increase
      3. Check downstream dependencies: [Dependencies Dashboard](link)
      4. Review error logs: [Error Logs Query](link)

      **On-call escalation**: #platform-oncall

Correlation IDs

Enable request tracing across async channels:

# Include correlation ID everywhere
def api_handler(request):
    correlation_id = request.headers.get('X-Correlation-ID', generate_id())

    # Pass to all downstream calls
    response = downstream_service.call(
        data=request.data,
        headers={'X-Correlation-ID': correlation_id}
    )

    # Include in async jobs
    queue.enqueue('process_task', {
        'data': request.data,
        'correlation_id': correlation_id
    })

    # Include in logs
    logger.info("Request processed", extra={'correlation_id': correlation_id})

    return response
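
On the other side of the queue, the worker restores the correlation ID from the payload so its logs join the same thread. A sketch using a contextvar and a logging filter, assuming your queue library calls process_task with the dict enqueued above:

import logging
import contextvars

correlation_id_var = contextvars.ContextVar("correlation_id", default="unknown")

class CorrelationIdFilter(logging.Filter):
    def filter(self, record):
        # Stamp every log record with the current correlation ID
        record.correlation_id = correlation_id_var.get()
        return True

logger = logging.getLogger("worker")
logger.addFilter(CorrelationIdFilter())

def process_task(payload):
    # Restore the ID that api_handler put on the job
    correlation_id_var.set(payload.get("correlation_id", "unknown"))
    logger.info("Task started")   # now searchable by correlation_id
    # ... do the actual work with payload["data"] ...
    logger.info("Task finished")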

Search and Explore

Enable ad-hoc queries:

# Example queries engineers should be able to run

## "What errors are users seeing?"
level:ERROR user_id:* | stats count by error_message

## "How long did this request take?"
correlation_id:abc-123 | select timestamp, service, duration

## "What changed before this outage?"
deploy_event OR config_change | timeline
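
That last query only works if deployments are recorded as events somewhere searchable. One option is to have the deploy pipeline post an annotation; here's a sketch against Grafana's annotations HTTP API, where the URL, token, and tag names are assumptions for this example:

import requests

def record_deploy_event(service, version):
    # Hypothetical Grafana instance and API token
    response = requests.post(
        "https://grafana.example.com/api/annotations",
        headers={"Authorization": "Bearer <grafana-api-token>"},
        json={
            "text": f"Deployed {service} {version}",
            "tags": ["deploy_event", service],  # matches the deploy_event query above
        },
        timeout=5,
    )
    response.raise_for_status()

record_deploy_event("orders", "v2.3.1")  # call from the deploy pipeline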

Dashboards for Teams

Hierarchy of Dashboards

Organization Level:
├── SLA/SLO Overview
├── Business Metrics
└── Cost Overview

Service Level:
├── Service Health (RED metrics)
├── Dependencies
└── Resource Utilization

Detailed Level:
├── Individual Endpoints
├── Database Performance
└── Cache Hit Rates

Home Dashboard

What every engineer should see first:

sections:
  - name: "Are we healthy?"
    panels:
      - SLA status (green/yellow/red)
      - Error rate vs baseline
      - Latency vs baseline

  - name: "What's happening?"
    panels:
      - Recent deployments
      - Active incidents
      - Upcoming maintenance

  - name: "Quick links"
    panels:
      - Current on-call
      - Runbooks
      - Escalation contacts

SLO Dashboard

Track service level objectives:

panels:
  - name: "Monthly Error Budget"
    query: |
      100 - (
        sum(rate(http_requests_total{status=~"5.."}[30d])) /
        sum(rate(http_requests_total[30d])) * 100
      )
    thresholds:
      - value: 99.9
        color: green
      - value: 99.5
        color: yellow
      - value: 99
        color: red

  - name: "Error Budget Burn Rate"
    description: "Hours until budget exhausted at current rate"
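
The burn-rate panel is just arithmetic on the SLO. A worked sketch of "hours until budget exhausted", assuming steady traffic; the numbers in the example are illustrative:

def hours_until_exhausted(slo_target, window_days, error_rate_now, budget_consumed):
    """
    slo_target: e.g. 0.999 for a 99.9% SLO
    window_days: length of the SLO window, e.g. 30
    error_rate_now: current fraction of failing requests, e.g. 0.01
    budget_consumed: fraction of the error budget already spent, 0..1
    """
    error_budget = 1 - slo_target                     # allowed failure fraction
    remaining = error_budget * (1 - budget_consumed)  # budget left to spend
    if error_rate_now <= 0:
        return float("inf")
    # With steady traffic, the remaining budget lasts this long at the current rate
    return (remaining / error_rate_now) * window_days * 24

# 1% errors against a 99.9% / 30-day SLO with half the budget already spent:
print(hours_until_exhausted(0.999, 30, 0.01, budget_consumed=0.5))  # ~36 hours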

Alerting for Remote Teams

Alert Design

Good alerts:

- Fire on symptoms users actually feel (error rate, latency), not on internal causes
- Are actionable: the person paged can do something about it right now
- Link to the dashboards, logs, and runbook needed to start investigating

Bad alerts:

- Page for conditions that resolve themselves
- Fire so often they get muted or ignored
- Arrive with no context, forcing the responder to reconstruct the situation from scratch

Alert Content

Include everything needed to start investigating:

alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
labels:
  severity: critical
  team: platform
annotations:
  summary: "Error rate above 1% for {{ $labels.service }}"
  description: |
    Current error rate: {{ $value | humanizePercentage }}

    **Quick Links:**
    - [Error Logs](https://logs.example.com?query=service:{{ $labels.service }}+level:ERROR)
    - [Dashboard](https://grafana.example.com/d/service?var-service={{ $labels.service }})
    - [Recent Deployments](https://deploy.example.com/{{ $labels.service }}/history)

    **Runbook:** [High Error Rate](https://wiki.example.com/runbooks/high-error-rate)

Routing and Escalation

# PagerDuty/Opsgenie routing
routes:
  - match:
      severity: critical
      team: platform
    receiver: platform-oncall-pagerduty

  - match:
      severity: warning
    receiver: slack-alerts

receivers:
  - name: platform-oncall-pagerduty
    pagerduty_configs:
      - routing_key: xxx
    slack_configs:
      - channel: '#platform-incidents'

Documentation

Living Documentation

Keep runbooks next to code:

repo/
├── src/
├── docs/
│   ├── architecture.md
│   └── runbooks/
│       ├── high-error-rate.md
│       ├── high-latency.md
│       └── database-connection-issues.md
└── dashboards/
    └── service-dashboard.json
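
Keeping dashboards in the repo also means you can sanity-check them in CI. A small sketch, assuming the dashboards/ layout above and Grafana-style JSON with a title field:

import json
import pathlib
import sys

# Minimal CI check: every dashboard JSON in the repo must parse and have a title
errors = []
for path in pathlib.Path("dashboards").glob("*.json"):
    try:
        dashboard = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        errors.append(f"{path}: invalid JSON ({exc})")
        continue
    if not dashboard.get("title"):
        errors.append(f"{path}: missing title")

if errors:
    print("\n".join(errors))
    sys.exit(1)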

Runbook Template

# High Error Rate

## Symptoms
- Error rate above 1% for > 5 minutes
- PagerDuty alert: HighErrorRate

## Impact
- Users may see error pages
- API clients receiving 5xx responses

## Investigation Steps

### 1. Check Recent Changes
- [ ] Recent deployments? [Deploy History](link)
- [ ] Config changes? [Config History](link)
- [ ] New feature flags enabled?

### 2. Identify Error Type
- [ ] Check error logs: `service:api level:ERROR | top error_message`
- [ ] Check error dashboard: [Link](link)

### 3. Check Dependencies
- [ ] Database healthy? [DB Dashboard](link)
- [ ] Downstream services healthy? [Dependencies](link)

## Mitigation

### Rollback deployment
```bash
kubectl rollout undo deployment/api
```

### Disable feature flag

```bash
./scripts/disable-flag.sh new-feature
```
## Escalation

- If rollback or flag changes don't bring the error rate back under 1%, escalate to #platform-oncall


## Key Takeaways

- Remote teams can't rely on tribal knowledge; systems must be self-explanatory
- Structured logging enables search and correlation across services
- Distributed tracing shows request flow; essential for microservices
- Correlation IDs tie requests together across sync and async boundaries
- Dashboards should answer "is it healthy?" before diving into details
- Alerts must be actionable and include investigation starting points
- Runbooks live with the code and dashboards, not in a separate wiki
- SLOs provide objective measures of service health
- Self-service debugging reduces dependency on experts
- Observability is a team practice, not just tooling

Good observability makes remote work sustainable. When systems tell their own story, teams can investigate asynchronously and onboard newcomers faster.