For years, monitoring meant dashboards and alerts: watch metrics, set thresholds, get paged when thresholds are crossed. This worked when systems were simple—a handful of servers running predictable workloads.
Modern distributed systems break this model. Microservices, containers, and dynamic scheduling create environments where traditional monitoring struggles. You see symptoms (high latency, errors) but can’t trace causes through layers of services, networks, and dependencies.
Observability offers a different approach: instrumenting systems so you can ask arbitrary questions about behavior, not just check predefined metrics.
The Monitoring Problem
Traditional monitoring works like this:
- Decide what metrics matter
- Create dashboards displaying those metrics
- Set alert thresholds
- When alerts fire, look at dashboards
- Hope you captured the right metrics
This approach has fundamental limitations:
Unknown Unknowns
Monitoring captures what you anticipated. You create metrics for scenarios you’ve imagined. But production failures often involve scenarios you didn’t imagine—combinations of circumstances you never predicted.
When something new fails, your dashboards show nothing unusual because you’re not measuring the right things.
Cardinality Limits
Traditional monitoring struggles with high-cardinality data. You can track error rates per endpoint, but tracking error rates per user, per endpoint, per HTTP status code, per datacenter quickly exceeds storage and query capabilities.
Yet debugging often requires exactly these high-cardinality breakdowns.
Distributed Context
A request touching ten services might fail anywhere in the chain. Traditional per-service monitoring shows each service’s metrics in isolation. Correlating them across services requires manual effort and shared context (timestamps, request IDs) that often doesn’t exist.
What Observability Adds
Observability extends monitoring with the ability to understand system behavior without having to predict in advance which questions you'll need to ask.
The Three Pillars
Metrics: Numeric measurements over time. What traditional monitoring provides. Efficient for dashboards and alerts but limited for exploration.
Logs: Discrete events with context. Rich detail but challenging to aggregate and correlate.
Traces: Request flows across services. Shows the path a request takes and where time is spent.
Together, these provide complementary views:
- Metrics tell you something is wrong
- Traces show you where in the system
- Logs provide detail about what happened
High-Cardinality Queries
Modern observability tools support high-cardinality queries:
error_rate{
  endpoint="/api/users",
  status_code="500",
  user_tier="enterprise",
  datacenter="us-east",
  version="2.1.3"
}
You can slice data along any dimension after collection, not just the dimensions you predefined.
Distributed Tracing
Tracing shows request flow across services:
Request → API Gateway → Auth Service → User Service → Database
                                                     → Cache
                      → Payment Service → Stripe API
Each segment shows duration, errors, and context. When latency spikes, you see exactly which service and which operation is slow.
Exploratory Analysis
Instead of looking at predefined dashboards, observability enables questions like:
- “Which users are experiencing this error?”
- “Which endpoints are slow for users in Europe?”
- “What changed between yesterday (working) and today (broken)?”
You explore data to form hypotheses, not just verify predefined metrics.
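Concretely, once telemetry is structured, such questions become ad-hoc queries rather than new dashboards. A toy Python sketch, assuming a file of JSON-structured log lines with fields like those shown later in this piece (file name, field names, and error code are illustrative assumptions):

import json
from collections import Counter

# "Which users are experiencing this error?"
with open("payment-service.log") as f:
    events = [json.loads(line) for line in f if line.strip()]

affected_users = Counter(
    e.get("user_id", "unknown")
    for e in events
    if e.get("level") == "error" and e.get("error_code") == "INSUFFICIENT_FUNDS"
)
print(affected_users.most_common(10))

Real log backends run the same kind of filter-and-aggregate query at scale; the point is that the question is asked after the fact, not baked into a dashboard.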
Implementing Observability
Structured Logging
Move from unstructured text logs:
2017-03-20 10:23:45 ERROR Failed to process payment for order 12345
To structured, context-rich logs:
{
  "timestamp": "2017-03-20T10:23:45Z",
  "level": "error",
  "message": "Payment processing failed",
  "trace_id": "abc123",
  "order_id": "12345",
  "user_id": "user_789",
  "payment_method": "credit_card",
  "error_code": "INSUFFICIENT_FUNDS",
  "service": "payment-service",
  "version": "2.1.3"
}
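One way to produce logs like this is a JSON formatter on top of Python's standard logging module. A minimal sketch, assuming context is passed per call; in practice a library such as structlog or python-json-logger handles this for you:

import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    converter = time.gmtime  # timestamps in UTC

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-service",  # assumed service name
        }
        # Merge per-call context passed via extra={"context": {...}}
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

logger = logging.getLogger("payment-service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error(
    "Payment processing failed",
    extra={"context": {"trace_id": "abc123", "order_id": "12345",
                       "error_code": "INSUFFICIENT_FUNDS"}},
)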
Structured logs enable:
- Filtering: “Show me all errors for user_789”
- Aggregation: “Count errors by error_code”
- Correlation: “Show all logs with trace_id abc123”
Request Tracing
Implement distributed tracing across services:
- Generate trace ID at request entry point
- Propagate trace ID to all downstream services
- Each service records spans (operations with timing)
- Trace collector assembles spans into complete traces
Standards like OpenTracing (since merged into OpenTelemetry) provide vendor-neutral instrumentation. Tools like Jaeger, Zipkin, and commercial offerings collect and visualize traces.
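The span model itself is simple. Here is an illustrative, dependency-free Python sketch; the class and helper names are made up, and a real instrumentation library would handle this for you:

import time
import uuid

def new_trace_id():
    """Hypothetical helper: a random 128-bit ID rendered as hex."""
    return uuid.uuid4().hex

class Span:
    """A minimal span: one timed operation within a trace."""
    def __init__(self, name, trace_id, parent_span_id=None):
        self.name = name
        self.trace_id = trace_id
        self.span_id = uuid.uuid4().hex[:16]
        self.parent_span_id = parent_span_id
        self.start = time.time()

    def finish(self):
        self.duration_ms = (time.time() - self.start) * 1000
        # A real system reports this record to a trace collector
        # (Jaeger, Zipkin, ...) rather than printing it.
        print(vars(self))

# The entry point generates the trace ID once; downstream services reuse it.
trace_id = new_trace_id()
root = Span("handle_request", trace_id)
child = Span("query_user_db", trace_id, parent_span_id=root.span_id)
child.finish()
root.finish()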
Meaningful Metrics
Instrument what matters:
RED metrics per service:
- Request rate
- Error rate
- Duration distribution
USE metrics for resources:
- Utilization
- Saturation
- Errors
Business metrics:
- Transactions processed
- Revenue
- User activity
Include dimensions that enable slicing; for example, with the Prometheus Python client:
from prometheus_client import Histogram

REQUEST_DURATION = Histogram(
    "request_duration_seconds",
    "Time spent handling each request",
    ["endpoint", "method", "status_code", "version"],
)

REQUEST_DURATION.labels(
    endpoint=endpoint,
    method=method,
    status_code=status_code,
    version=app_version,
).observe(duration)
Context Propagation
Ensure context flows through your system:
# HTTP headers for trace context
X-Trace-ID: abc123
X-Span-ID: def456
X-Parent-Span-ID: ghi789
# Include in every log
logger.info("Processing request", extra={"trace_id": trace_id})
Context enables correlation across metrics, logs, and traces.
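Putting the two together, a downstream HTTP call can forward the incoming trace headers and tag its log lines with the trace ID. A sketch using the requests library; the header names follow the illustrative ones above, and call_downstream is a made-up helper, not part of any framework:

import logging
import requests  # assumed HTTP client; any client works the same way

TRACE_HEADERS = ("X-Trace-ID", "X-Span-ID")
logger = logging.getLogger(__name__)

def call_downstream(url, incoming_headers, payload=None):
    """Forward trace context from the incoming request to the outgoing call."""
    outgoing = {h: incoming_headers[h] for h in TRACE_HEADERS if h in incoming_headers}
    trace_id = incoming_headers.get("X-Trace-ID", "unknown")
    logger.info("Calling downstream service", extra={"trace_id": trace_id})
    return requests.post(url, json=payload, headers=outgoing)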
Observability in Practice
Incident Investigation
With observability:
1. Alert fires: error rate increased.
2. Quick scope: query traces. Which endpoints? Which users?
3. Drill down: view specific failing traces. Where do they fail?
4. Root cause: logs for that service and trace show a database timeout.
5. Resolution: database metrics confirm connection pool exhaustion.
Without observability, steps 2-4 involve guessing, checking dashboards that may not have the relevant metrics, and grepping logs across servers.
Debugging Performance
User reports slowness. With observability:
- Find traces for that user
- Compare slow traces to normal traces
- Identify the difference—extra database calls, slow third-party API
- Fix the specific issue
Without observability, you profile entire services hoping to find the problem.
Deployment Verification
After deployment (a query sketch follows this list):
- Compare error rates, latency distributions before/after
- Break down by endpoint—any specific endpoints degraded?
- View traces for degraded endpoints
- Identify what’s different
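The metrics side of this check can be scripted against the Prometheus HTTP API. A sketch; the server address, metric name, and label names are assumptions to adapt to your instrumentation:

import requests  # assumed HTTP client

PROMETHEUS = "http://prometheus:9090/api/v1/query"  # assumed address

def error_rate_by_version(endpoint):
    """5xx request rate per deployed version for one endpoint."""
    query = (
        'sum by (version) ('
        f'rate(http_requests_total{{endpoint="{endpoint}",status_code=~"5.."}}[5m]))'
    )
    resp = requests.get(PROMETHEUS, params={"query": query})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Compare old and new versions side by side after a rollout.
for series in error_rate_by_version("/api/users"):
    print(series["metric"].get("version"), series["value"][1])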
Tooling Ecosystem
Metrics
- Prometheus: Pull-based metrics with powerful query language
- InfluxDB: Time-series database for metrics
- Datadog, New Relic: Commercial observability platforms
Logging
- ELK Stack: Elasticsearch, Logstash, Kibana for log aggregation
- Splunk: Enterprise log management
- Loki: Grafana’s log aggregation (labels-based, like Prometheus)
Tracing
- Jaeger: Open-source distributed tracing
- Zipkin: Distributed tracing system open-sourced by Twitter
- Lightstep, Honeycomb: Commercial platforms with strong tracing support
Unified Platforms
Increasingly, platforms combine all three:
- Datadog: Metrics, logs, traces in one platform
- Grafana Stack: Prometheus + Loki + Tempo
- Elastic Observability: Metrics, logs, APM
Unified platforms enable correlation across signal types.
Cultural Shift
Observability isn't just tooling; it's a different approach to running systems in production.
From Reactive to Exploratory
Traditional: wait for alerts, then check dashboards. Observability: regularly explore system behavior, find issues before they’re incidents.
From Silos to Shared Context
Traditional: each team monitors their service. Observability: shared traces show request flow across teams.
From Dashboards to Questions
Traditional: build dashboards for anticipated scenarios. Observability: ask questions as they arise, explore data to understand behavior.
Key Takeaways
- Monitoring tells you something is wrong; observability lets you understand why
- The three pillars—metrics, logs, traces—provide complementary views
- High-cardinality queries enable slicing data along any dimension
- Distributed tracing shows request flow across services
- Implement structured logging with context propagation
- Observability enables exploratory debugging, not just alert response
- Unified platforms correlate across signal types