Traditional monitoring asks predefined questions: Is CPU high? Is the service up? Is latency acceptable? These are necessary but insufficient for complex distributed systems.
Observability is the ability to ask arbitrary questions about your system’s behavior. When something unexpected happens, you can investigate without deploying new instrumentation.
The Three Pillars
Metrics
Aggregated numerical measurements over time:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 15234
http_request_duration_seconds{method="GET", endpoint="/api/users", quantile="0.99"} 0.45
Strengths:
- Efficient storage (aggregated)
- Good for alerting
- Show trends over time
- Enable dashboards
Weaknesses:
- Limited cardinality (you can’t have a metric per user)
- Don’t explain why something happened
- Aggregation loses detail
Best practices:
- USE method for resources: Utilization, Saturation, Errors
- RED method for services: Rate, Errors, Duration
- Use histograms over averages for latency
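The histogram recommendation is worth making concrete. Below is a minimal sketch (the `bucket` type, the `quantile` helper, and the bucket data are all illustrative, not a real library) of estimating a latency quantile from cumulative histogram buckets by linear interpolation, the same idea behind PromQL's `histogram_quantile()`:

```go
package main

import "fmt"

// bucket mirrors a Prometheus histogram bucket: a cumulative count of
// observations at or below the upper bound (le).
type bucket struct {
	le    float64 // upper bound in seconds
	count int     // cumulative count of observations <= le
}

// quantile estimates the q-th quantile by linearly interpolating inside
// the bucket that contains the target rank. Edge cases (empty data,
// duplicate counts) are omitted for brevity.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * float64(total)
	prevLe, prevCount := 0.0, 0
	for _, b := range buckets {
		if float64(b.count) >= rank {
			frac := (rank - float64(prevCount)) / float64(b.count-prevCount)
			return prevLe + (b.le-prevLe)*frac
		}
		prevLe, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Hypothetical request-duration buckets (seconds).
	buckets := []bucket{{0.1, 800}, {0.25, 950}, {0.5, 990}, {1.0, 1000}}
	fmt.Printf("p50=%.4fs p99=%.4fs\n", quantile(0.50, buckets), quantile(0.99, buckets))
}
```

With this data the p99 is roughly eight times the p50; a single average would flatten that gap, which is exactly why histograms are preferred for latency.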
Logs
Discrete events with context:
{
  "timestamp": "2018-07-09T10:30:45.123Z",
  "level": "error",
  "service": "user-api",
  "request_id": "abc123",
  "user_id": "user_456",
  "message": "Database connection timeout",
  "duration_ms": 5000,
  "query": "SELECT * FROM users WHERE id = ?"
}
Strengths:
- Rich context
- High cardinality (can include user IDs, request IDs)
- Human readable for debugging
- Capture unexpected events
Weaknesses:
- Expensive at scale
- Hard to aggregate
- Require structured format for useful querying
Best practices:
- Use structured logging (JSON)
- Include correlation IDs
- Log at appropriate levels
- Include relevant context
Traces
Distributed transaction paths:
Trace ID: abc123
├─ Span: HTTP GET /orders/789 (50ms)
│ ├─ Span: AuthMiddleware (2ms)
│ ├─ Span: Database: SELECT order (15ms)
│ ├─ Span: HTTP GET /users/456 (external) (25ms)
│ │ └─ Span: Cache lookup (1ms)
│ └─ Span: HTTP GET /products/123 (external) (20ms)
Strengths:
- Show request flow across services
- Identify where time is spent
- Reveal dependencies
- Debug distributed systems
Weaknesses:
- Expensive to collect everything (sampling needed)
- Complex to implement
- Require propagation across services
Best practices:
- Propagate trace context (B3, W3C Trace Context)
- Sample intelligently (100% errors, sample success)
- Add custom spans for business logic
- Include span attributes for debugging
Correlation Is Key
Individual pillars have limited value. Correlation multiplies it:
Alert: High latency on /orders endpoint
→ Metrics: P99 latency spike at 10:30
→ Traces: Sample trace shows database query slow
→ Logs: Query timeout errors with specific query
→ Root cause: Missing index on new column
Correlation IDs
Generate unique ID at request entry, propagate through:
func requestMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestID := r.Header.Get("X-Request-ID")
        if requestID == "" {
            requestID = uuid.New().String()
        }
        ctx := context.WithValue(r.Context(), requestIDKey, requestID)
        logger := logger.With("request_id", requestID)
        ctx = context.WithValue(ctx, loggerKey, logger)
        // Echo the ID back to the caller; set the same header on any
        // outbound requests to propagate it to downstream services
        w.Header().Set("X-Request-ID", requestID)
        next.ServeHTTP(w, r.WithContext(ctx))
    })
}
Linking Pillars
// Metrics include an exemplar linking to the active trace
httpRequestDuration.With(
    prometheus.Labels{"endpoint": endpoint},
).(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration, prometheus.Labels{"trace_id": traceID},
)
// Logs include trace ID
logger.WithFields(log.Fields{
    "trace_id":   span.SpanContext().TraceID().String(),
    "span_id":    span.SpanContext().SpanID().String(),
    "request_id": requestID,
}).Info("Request completed")
// Traces include relevant data
span.SetAttributes(
    attribute.String("user.id", userID),
    attribute.Int("result.count", len(results)),
)
Implementing Observability
Instrumentation
Application-level:
// HTTP handler instrumentation
func instrumentHandler(handler http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        ctx, span := tracer.Start(r.Context(), "HTTP "+r.Method+" "+r.URL.Path)
        defer span.End()

        wrapped := wrapResponseWriter(w)
        handler.ServeHTTP(wrapped, r.WithContext(ctx))
        duration := time.Since(start)

        // Metrics
        httpRequestsTotal.WithLabelValues(
            r.Method,
            r.URL.Path,
            strconv.Itoa(wrapped.status),
        ).Inc()
        httpRequestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(duration.Seconds())

        // Span attributes
        span.SetAttributes(
            attribute.Int("http.status_code", wrapped.status),
            attribute.Int64("http.response_size", wrapped.size),
        )

        // Log
        logger.WithFields(log.Fields{
            "method":   r.Method,
            "path":     r.URL.Path,
            "status":   wrapped.status,
            "duration": duration,
        }).Info("Request completed")
    })
}
Database instrumentation:
func (db *DB) Query(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
    ctx, span := tracer.Start(ctx, "db.query")
    defer span.End()

    start := time.Now()
    rows, err := db.db.QueryContext(ctx, query, args...)
    duration := time.Since(start)

    span.SetAttributes(
        attribute.String("db.statement", query),
        attribute.Bool("db.error", err != nil),
    )
    dbQueryDuration.Observe(duration.Seconds())
    if err != nil {
        span.RecordError(err)
    }
    return rows, err
}
Tooling Stack
Metrics:
- Prometheus for collection and storage
- Grafana for visualization
- Alertmanager for alerting
Logs:
- Fluentd/Fluent Bit for collection
- Elasticsearch for storage
- Kibana for visualization
Traces:
- Jaeger or Zipkin for collection and visualization
- OpenTelemetry for instrumentation
Or unified:
- Datadog, Honeycomb, Lightstep for integrated observability
- Grafana Tempo + Loki + Prometheus for open-source stack
OpenTelemetry
OpenTelemetry is emerging as the standard:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
OpenTelemetry provides vendor-neutral instrumentation.
Designing for Debuggability
Cardinality Awareness
High cardinality breaks metrics systems:
# Dangerous - cardinality explosion
http_requests_total{user_id="...", request_id="..."}
# Better - bounded cardinality
http_requests_total{method="GET", endpoint="/api/users", status="200"}
Use logs and traces for high-cardinality data.
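One practical way to keep metric labels bounded is to normalize paths before using them as label values. A minimal sketch (the `normalizeEndpoint` helper and its numeric-ID heuristic are illustrative; most HTTP routers expose the matched route template directly, which is the better source):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeEndpoint collapses unbounded path segments (IDs) into a
// placeholder so the set of label values is bounded by your route
// templates, not by your user population.
func normalizeEndpoint(path string) string {
	parts := strings.Split(path, "/")
	for i, p := range parts {
		if isID(p) {
			parts[i] = ":id"
		}
	}
	return strings.Join(parts, "/")
}

// isID is a deliberately simple heuristic: all-digit segments are IDs.
func isID(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(normalizeEndpoint("/api/users/12345"))        // /api/users/:id
	fmt.Println(normalizeEndpoint("/api/users/12345/orders")) // /api/users/:id/orders
}
```

Every request for every user now maps onto a handful of endpoint templates, keeping the time-series count flat as traffic grows.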
Meaningful Context
Include data that helps debugging:
span.SetAttributes(
    // Identity
    attribute.String("user.id", user.ID),
    attribute.String("tenant.id", tenant.ID),
    // Request
    attribute.String("request.type", requestType),
    attribute.Int("request.items", len(items)),
    // Result
    attribute.Bool("result.from_cache", fromCache),
    attribute.Int("result.count", resultCount),
)
Service Boundaries
Trace across service boundaries:
// Client side - inject context
func callService(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
    if err != nil {
        return nil, err
    }
    // Inject trace context into headers
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
    return client.Do(req)
}
// Server side - extract context
func handler(w http.ResponseWriter, r *http.Request) {
    ctx := otel.GetTextMapPropagator().Extract(
        r.Context(),
        propagation.HeaderCarrier(r.Header),
    )
    ctx, span := tracer.Start(ctx, "handler")
    defer span.End()
    // ...
}
Beyond Technical Metrics
Business Metrics
Technical metrics miss business impact:
// Technical
httpRequestDuration.Observe(duration)
// Business
ordersPlaced.WithLabelValues(region, category).Inc()
revenueTotal.Add(orderValue)
checkoutAbandoned.Inc()
Business metrics answer “does the product work?” not just “does the service work?”
User Experience Metrics
Measure what users experience:
- Core Web Vitals: LCP, FID, CLS
- User timing: Time to first interaction
- Client-side errors: JavaScript exceptions
- Real user monitoring: Actual user latency
Key Takeaways
- Observability enables asking arbitrary questions about system behavior
- Three pillars: metrics (what), logs (why), traces (where)
- Correlation IDs link data across pillars and services
- Instrument at application level, not just infrastructure
- Use OpenTelemetry for vendor-neutral instrumentation
- Be cardinality-aware: use logs/traces for high-cardinality data
- Include meaningful context in traces and logs
- Measure business metrics, not just technical metrics
- User experience metrics matter as much as backend metrics
Observability isn’t a tool—it’s a property of your system. Design for debuggability from the start.