Why Observability Matters More Than Monitoring

March 20, 2017

For years, monitoring meant dashboards and alerts: watch metrics, set thresholds, get paged when thresholds are crossed. This worked when systems were simple—a handful of servers running predictable workloads.

Modern distributed systems break this model. Microservices, containers, and dynamic scheduling create environments where traditional monitoring struggles. You see symptoms (high latency, errors) but can’t trace causes through layers of services, networks, and dependencies.

Observability offers a different approach: instrumenting systems so you can ask arbitrary questions about behavior, not just check predefined metrics.

The Monitoring Problem

Traditional monitoring works like this:

  1. Decide what metrics matter
  2. Create dashboards displaying those metrics
  3. Set alert thresholds
  4. When alerts fire, look at dashboards
  5. Hope you captured the right metrics

This approach has fundamental limitations:

Unknown Unknowns

Monitoring captures what you anticipated. You create metrics for scenarios you’ve imagined. But production failures often involve scenarios you didn’t imagine—combinations of circumstances you never predicted.

When something new fails, your dashboards show nothing unusual because you’re not measuring the right things.

Cardinality Limits

Traditional monitoring struggles with high-cardinality data. You can track error rates per endpoint, but tracking error rates per user, per endpoint, per HTTP status code, per datacenter quickly exceeds storage and query capabilities: 10,000 users across 50 endpoints, 10 status codes, and 4 datacenters already implies two million distinct series for a single metric.

Yet debugging often requires exactly these high-cardinality breakdowns.

Distributed Context

A request touching ten services might fail anywhere in the chain. Traditional per-service monitoring shows each service’s metrics in isolation. Correlating them across services requires manual effort and shared context (timestamps, request IDs) that often doesn’t exist.

What Observability Adds

Observability extends monitoring with the ability to understand system behavior without having to predict, in advance, exactly what you will need to know.

The Three Pillars

Metrics: Numeric measurements over time. What traditional monitoring provides. Efficient for dashboards and alerts but limited for exploration.

Logs: Discrete events with context. Rich detail but challenging to aggregate and correlate.

Traces: Request flows across services. Shows the path a request takes and where time is spent.

Together, these provide complementary views: metrics tell you that something is wrong, traces show where in the request path it happens, and logs explain why.

High-Cardinality Queries

Modern observability tools support high-cardinality queries:

error_rate{
  endpoint="/api/users",
  status_code="500",
  user_tier="enterprise",
  datacenter="us-east",
  version="2.1.3"
}

You can slice data along any dimension after collection, not just the dimensions you predefined.
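
For intuition, here is a minimal sketch of what post-hoc slicing means, independent of any particular tool: if every request is recorded as a wide, structured event, any field can become a grouping dimension at query time (the sample events and the errors_by helper are illustrative).

from collections import Counter

# Hypothetical wide events: one dict per request, with many fields recorded.
events = [
    {"endpoint": "/api/users", "status_code": 500, "user_tier": "enterprise",
     "datacenter": "us-east", "version": "2.1.3"},
    {"endpoint": "/api/users", "status_code": 200, "user_tier": "free",
     "datacenter": "us-west", "version": "2.1.3"},
]

def errors_by(*dimensions):
    """Count 5xx responses grouped by any combination of fields, chosen at query time."""
    counts = Counter()
    for event in events:
        if event["status_code"] >= 500:
            counts[tuple(event[d] for d in dimensions)] += 1
    return counts

# The grouping dimensions were never predefined anywhere:
print(errors_by("endpoint", "datacenter"))
print(errors_by("user_tier", "version"))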

Distributed Tracing

Tracing shows request flow across services:

Request → API Gateway → Auth Service → User Service → Database
                                    → Cache
                     → Payment Service → Stripe API

Each segment shows duration, errors, and context. When latency spikes, you see exactly which service and which operation is slow.

Exploratory Analysis

Instead of looking at predefined dashboards, observability enables questions like: which enterprise users are seeing 500s on this endpoint? Did latency in one datacenter change after version 2.1.3 rolled out? You explore data to form hypotheses, not just verify predefined metrics.

Implementing Observability

Structured Logging

Move from unstructured text logs:

2017-03-20 10:23:45 ERROR Failed to process payment for order 12345

To structured, context-rich logs:

{
  "timestamp": "2017-03-20T10:23:45Z",
  "level": "error",
  "message": "Payment processing failed",
  "trace_id": "abc123",
  "order_id": "12345",
  "user_id": "user_789",
  "payment_method": "credit_card",
  "error_code": "INSUFFICIENT_FUNDS",
  "service": "payment-service",
  "version": "2.1.3"
}

Structured logs enable filtering on any field, aggregation across services, and correlation with traces through the shared trace_id.
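
Producing logs in that shape needs little machinery; here is a minimal sketch using only Python's standard library (the static service fields and the list of context keys are illustrative):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",  # illustrative static fields
            "version": "2.1.3",
        }
        # Pick up context passed via `extra=...`, e.g. trace_id, order_id.
        for key in ("trace_id", "order_id", "user_id", "error_code"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"trace_id": "abc123", "order_id": "12345",
                    "error_code": "INSUFFICIENT_FUNDS"})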

Request Tracing

Implement distributed tracing across services:

  1. Generate trace ID at request entry point
  2. Propagate trace ID to all downstream services
  3. Each service records spans (operations with timing)
  4. Trace collector assembles spans into complete traces

Standards like OpenTracing provide vendor-neutral instrumentation. Tools like Jaeger, Zipkin, and commercial offerings collect and visualize traces.
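
A deliberately simplified sketch of those four steps with no tracing library (the record_span helper is illustrative; a real service would use an OpenTracing-compatible client reporting to Jaeger or Zipkin):

import time
import uuid

def handle_request(incoming_headers):
    # 1. Generate a trace ID at the request entry point, or adopt the caller's.
    trace_id = incoming_headers.get("X-Trace-ID", uuid.uuid4().hex)

    # 3. Each service records spans: operations with timing, tied to the trace.
    start = time.time()
    result = call_user_service(trace_id)
    record_span(trace_id, operation="handle_request",
                duration=time.time() - start)
    return result

def call_user_service(trace_id):
    # 2. Propagate the trace ID to downstream services, e.g. as an
    #    X-Trace-ID header on the outgoing HTTP call.
    return {"user": "user_789"}  # placeholder for the downstream response

def record_span(trace_id, operation, duration):
    # 4. In a real system, spans go to a collector that assembles them into
    #    complete traces; here we just print them.
    print({"trace_id": trace_id, "operation": operation, "duration": duration})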

Meaningful Metrics

Instrument what matters:

RED metrics per service: Rate (requests per second), Errors (failed requests per second), and Duration (latency distribution).

USE metrics for resources: Utilization, Saturation, and Errors for CPU, memory, disk, and network.

Business metrics: signals tied to user-visible outcomes, such as orders placed, payments processed, and signups completed.

Include dimensions that enable slicing:

# Example with the Prometheus Python client; any metrics library with labels works.
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    "request_duration_seconds",
    "HTTP request duration in seconds",
    ["endpoint", "method", "status_code", "version"],
)

REQUEST_DURATION.labels(
    endpoint=endpoint,
    method=method,
    status_code=status_code,
    version=app_version,
).observe(duration)

Context Propagation

Ensure context flows through your system:

# HTTP headers for trace context
X-Trace-ID: abc123
X-Span-ID: def456
X-Parent-Span-ID: ghi789

# Include in every log
logger.info("Processing request", extra={"trace_id": trace_id})

Context enables correlation across metrics, logs, and traces.
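
As a concrete, simplified sketch of propagation in a Python web service, assuming Flask and the requests library (the downstream URL is hypothetical; the header names match those above):

import uuid

import requests
from flask import Flask, g, request

app = Flask(__name__)

@app.before_request
def attach_trace_context():
    # Adopt the caller's trace ID if present, otherwise start a new trace.
    g.trace_id = request.headers.get("X-Trace-ID", uuid.uuid4().hex)

def call_downstream(url):
    # Forward the trace context on every outgoing call.
    return requests.get(url, headers={"X-Trace-ID": g.trace_id,
                                      "X-Span-ID": uuid.uuid4().hex})

@app.route("/api/users")
def users():
    upstream = call_downstream("http://user-service/users")  # hypothetical URL
    app.logger.info("Processing request", extra={"trace_id": g.trace_id})
    return upstream.text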

Observability in Practice

Incident Investigation

With observability:

  1. Alert fires: Error rate increased
  2. Quick scope: Query traces—which endpoints? Which users?
  3. Drill down: View specific failing traces—where do they fail?
  4. Root cause: Logs for that service/trace show database timeout
  5. Resolution: Database metrics confirm connection pool exhaustion

Without observability, steps 2–4 involve guessing, checking dashboards that may not have the relevant metrics, and grepping logs across servers.

Debugging Performance

User reports slowness. With observability:

  1. Find traces for that user
  2. Compare slow traces to normal traces
  3. Identify the difference—extra database calls, slow third-party API
  4. Fix the specific issue

Without observability, you profile entire services hoping to find the problem.

Deployment Verification

After deployment:

  1. Compare error rates and latency distributions before/after (a sketch follows this list)
  2. Break down by endpoint—any specific endpoints degraded?
  3. View traces for degraded endpoints
  4. Identify what’s different
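
A small sketch of steps 1 and 2, comparing p99 latency per endpoint between the old and new versions from collected request durations (the samples shape and the version strings are assumptions):

from statistics import quantiles

def p99(durations):
    # 99th percentile of request durations, in seconds.
    return quantiles(durations, n=100)[98]

# Hypothetical shape: {(version, endpoint): [duration, duration, ...]}
def degraded_endpoints(samples, old="2.1.2", new="2.1.3", threshold=1.5):
    """Endpoints whose p99 latency grew by more than `threshold`x after the deploy."""
    flagged = []
    for endpoint in {ep for (_, ep) in samples}:
        before = samples.get((old, endpoint))
        after = samples.get((new, endpoint))
        if before and after and p99(after) > threshold * p99(before):
            flagged.append(endpoint)
    return flagged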

Tooling Ecosystem

Metrics

Prometheus, Graphite, and InfluxDB are common open-source choices; hosted services such as Datadog cover similar ground.

Logging

The Elasticsearch/Logstash/Kibana (ELK) stack, Fluentd, Graylog, and Splunk handle log collection, aggregation, and search.

Tracing

Zipkin and Jaeger collect and visualize traces, with OpenTracing providing vendor-neutral instrumentation.

Unified Platforms

Increasingly, platforms combine all three: commercial offerings like Datadog and New Relic, and newer entrants like Honeycomb, are converging on a single place to store and query telemetry. Unified platforms enable correlation across signal types.

Cultural Shift

Observability isn’t just tooling—it’s a different approach to operating systems.

From Reactive to Exploratory

Traditional: wait for alerts, then check dashboards. Observability: regularly explore system behavior, find issues before they’re incidents.

From Silos to Shared Context

Traditional: each team monitors their service. Observability: shared traces show request flow across teams.

From Dashboards to Questions

Traditional: build dashboards for anticipated scenarios. Observability: ask questions as they arise, explore data to understand behavior.

Key Takeaways

  1. Monitoring answers questions you thought to ask in advance; observability lets you ask new questions of a running system.
  2. Metrics, logs, and traces are complementary; the value comes from correlating them through shared context like trace IDs.
  3. High-cardinality dimensions (user, endpoint, version, datacenter) are what make debugging possible, so instrument for them.
  4. Propagate context through every service call and every log line.
  5. Observability is a practice as much as a toolset: explore system behavior regularly instead of waiting for alerts.