Observability: Beyond Traditional Monitoring

July 9, 2018

Traditional monitoring asks predefined questions: Is CPU high? Is the service up? Is latency acceptable? These are necessary but insufficient for complex distributed systems.

Observability is the ability to ask arbitrary questions about your system’s behavior. When something unexpected happens, you can investigate without deploying new instrumentation.

The Three Pillars

Metrics

Aggregated numerical measurements over time:

http_requests_total{method="GET", endpoint="/api/users", status="200"} 15234
http_request_duration_seconds{method="GET", endpoint="/api/users", quantile="0.99"} 0.45
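To make "aggregated" concrete, here is a minimal sketch of a labeled counter built only on the Go standard library. The `counterVec` type and its methods are hypothetical names for this illustration, not the Prometheus client API. The point it tries to show is that any number of requests collapse into one running total per label combination.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// counterVec is a hypothetical, minimal labeled counter: it keeps one running
// total per unique combination of label values.
type counterVec struct {
	mu     sync.Mutex
	name   string
	labels []string           // label names, in a fixed order
	series map[string]float64 // rendered label set -> running total
}

func newCounterVec(name string, labels ...string) *counterVec {
	return &counterVec{name: name, labels: labels, series: map[string]float64{}}
}

// Inc adds one to the series identified by the given label values.
func (c *counterVec) Inc(values ...string) {
	if len(values) != len(c.labels) {
		panic("label value count does not match label names")
	}
	pairs := make([]string, len(values))
	for i, v := range values {
		pairs[i] = fmt.Sprintf("%s=%q", c.labels[i], v)
	}
	key := strings.Join(pairs, ", ")

	c.mu.Lock()
	defer c.mu.Unlock()
	c.series[key]++
}

// Lines renders every series in a Prometheus-like text form, sorted so the
// output is stable.
func (c *counterVec) Lines() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]string, 0, len(c.series))
	for key, total := range c.series {
		out = append(out, fmt.Sprintf("%s{%s} %g", c.name, key, total))
	}
	sort.Strings(out)
	return out
}
```

Calling `Inc("GET", "/api/users", "200")` three times produces a single line ending in `3`. The individual requests are gone, which is why metrics are cheap to store but cannot tell you about any one request.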

Strengths:

Weaknesses:

Best practices:

Logs

Discrete events with context:

{
  "timestamp": "2018-07-09T10:30:45.123Z",
  "level": "error",
  "service": "user-api",
  "request_id": "abc123",
  "user_id": "user_456",
  "message": "Database connection timeout",
  "duration_ms": 5000,
  "query": "SELECT * FROM users WHERE id = ?"
}

Strengths:

Weaknesses:

Best practices:

Traces

Distributed transaction paths:

Trace ID: abc123
├─ Span: HTTP GET /orders/789 (50ms)
│  ├─ Span: AuthMiddleware (2ms)
│  ├─ Span: Database: SELECT order (15ms)
│  ├─ Span: HTTP GET /users/456 (external) (25ms)
│  │  └─ Span: Cache lookup (1ms)
│  └─ Span: HTTP GET /products/123 (external) (20ms)
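A trace is a tree of spans, so a first debugging step is often to walk it toward the slowest work. The sketch below models the trace above with a hypothetical, simplified `span` type (real spans also carry IDs, start times, and attributes) and follows the longest-running child at each level. This is a rough heuristic, not a true critical-path analysis. Note that the child durations in the example add up to more than the 50ms parent, so some of them presumably ran concurrently.

```go
package main

import "time"

// span is a hypothetical, simplified span for illustration only.
type span struct {
	Name     string
	Duration time.Duration
	Children []*span
}

// slowestPath walks down from the root, following the longest-running child
// at each level, and returns the span names along that path.
func slowestPath(root *span) []string {
	var path []string
	for s := root; s != nil; {
		path = append(path, s.Name)
		var next *span
		for _, c := range s.Children {
			if next == nil || c.Duration > next.Duration {
				next = c
			}
		}
		s = next // nil once a leaf is reached, which ends the loop
	}
	return path
}

// exampleTrace rebuilds the trace shown in the text.
func exampleTrace() *span {
	ms := time.Millisecond
	return &span{Name: "HTTP GET /orders/789", Duration: 50 * ms, Children: []*span{
		{Name: "AuthMiddleware", Duration: 2 * ms},
		{Name: "Database: SELECT order", Duration: 15 * ms},
		{Name: "HTTP GET /users/456 (external)", Duration: 25 * ms, Children: []*span{
			{Name: "Cache lookup", Duration: 1 * ms},
		}},
		{Name: "HTTP GET /products/123 (external)", Duration: 20 * ms},
	}}
}
```

For the example trace, `slowestPath(exampleTrace())` leads through the `/users/456` call, which points at the user service as the first place to look.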

Strengths:

Weaknesses:

Best practices:

Correlation Is Key

Individual pillars have limited value. Correlation multiplies it:

Alert: High latency on /orders endpoint
  → Metrics: P99 latency spike at 10:30
    → Traces: Sample trace shows database query slow
      → Logs: Query timeout errors with specific query
        → Root cause: Missing index on new column
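The step from a trace to its logs only works if both carry the same identifier. As a minimal sketch, assuming log entries have already been parsed into a hypothetical `logLine` struct with a `TraceID` field, the lookup is a simple filter:

```go
package main

// logLine is a hypothetical parsed log entry carrying a correlation field.
type logLine struct {
	TraceID string
	Level   string
	Message string
}

// errorsForTrace returns the error-level messages recorded under one trace,
// in the order they were logged.
func errorsForTrace(lines []logLine, traceID string) []string {
	var out []string
	for _, l := range lines {
		if l.TraceID == traceID && l.Level == "error" {
			out = append(out, l.Message)
		}
	}
	return out
}
```

In practice a log backend runs this query for you; the sketch only shows why the shared ID has to be present on every entry.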

Correlation IDs

Generate a unique ID where the request enters the system, then propagate it through every log line and downstream call:

func requestMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        requestID := r.Header.Get("X-Request-ID")
        if requestID == "" {
            requestID = uuid.New().String()
        }

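        // Store the ID and a request-scoped logger in the context
        ctx := context.WithValue(r.Context(), requestIDKey, requestID)
        reqLogger := logger.With("request_id", requestID)
        ctx = context.WithValue(ctx, loggerKey, reqLogger)

        // Echo the ID back to the caller. Outbound calls to downstream
        // services must copy it from the context into their own request headers.
        w.Header().Set("X-Request-ID", requestID)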

        next.ServeHTTP(w, r.WithContext(ctx))
    })
}

Linking Pillars

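// Metrics include an exemplar linking to the trace.
// Assumes httpRequestDuration is a HistogramVec; the type assertion
// panics for observers that do not support exemplars.
httpRequestDuration.With(
    prometheus.Labels{"endpoint": endpoint},
).(prometheus.ExemplarObserver).ObserveWithExemplar(
    duration,
    prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
)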

// Logs include trace ID
logger.WithFields(log.Fields{
    "trace_id": span.SpanContext().TraceID().String(),
    "span_id": span.SpanContext().SpanID().String(),
    "request_id": requestID,
}).Info("Request completed")

// Traces include relevant data
span.SetAttributes(
    attribute.String("user.id", userID),
    attribute.Int("result.count", len(results)),
)

Implementing Observability

Instrumentation

Application-level:

// HTTP handler instrumentation
func instrumentHandler(handler http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        ctx, span := tracer.Start(r.Context(), "HTTP "+r.Method+" "+r.URL.Path)
        defer span.End()

        wrapped := wrapResponseWriter(w)
        handler.ServeHTTP(wrapped, r.WithContext(ctx))

        duration := time.Since(start)

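        // Metrics. Note: r.URL.Path is unbounded when paths embed IDs;
        // prefer the matched route template as the label value.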
        httpRequestsTotal.WithLabelValues(
            r.Method,
            r.URL.Path,
            strconv.Itoa(wrapped.status),
        ).Inc()

        httpRequestDuration.WithLabelValues(
            r.Method,
            r.URL.Path,
        ).Observe(duration.Seconds())

        // Span attributes
        span.SetAttributes(
            attribute.Int("http.status_code", wrapped.status),
            attribute.Int64("http.response_size", wrapped.size),
        )

        // Log
        logger.WithFields(log.Fields{
            "method":   r.Method,
            "path":     r.URL.Path,
            "status":   wrapped.status,
            "duration": duration,
        }).Info("Request completed")
    })
}

Database instrumentation:

func (db *DB) Query(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
    ctx, span := tracer.Start(ctx, "db.query")
    defer span.End()

    start := time.Now()
    rows, err := db.db.QueryContext(ctx, query, args...)
    duration := time.Since(start)

    span.SetAttributes(
        attribute.String("db.statement", query),
        attribute.Bool("db.error", err != nil),
    )

    dbQueryDuration.Observe(duration.Seconds())

    if err != nil {
        span.RecordError(err)
    }

    return rows, err
}

Tooling Stack

Metrics:

Logs:

Traces:

Or unified:

OpenTelemetry

OpenTelemetry is emerging as the standard:

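import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/propagation"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0" // pick the version matching your SDK
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    // OTLP over gRPC. TLS is the default; add otlptracegrpc.WithInsecure()
    // if the collector listens in plaintext.
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("collector:4317"),
    )
    if err != nil {
        return nil, err
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    // Without this, the global propagator is a no-op and no trace headers are sent
    otel.SetTextMapPropagator(propagation.TraceContext{})
    return tp, nil
}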

OpenTelemetry provides vendor-neutral instrumentation.

Designing for Debuggability

Cardinality Awareness

Every unique combination of label values creates its own time series, so high-cardinality labels break metrics systems:

# Dangerous - cardinality explosion
http_requests_total{user_id="...", request_id="..."}

# Better - bounded cardinality
http_requests_total{method="GET", endpoint="/api/users", status="200"}

Use logs and traces for high-cardinality data.
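Request paths are a common source of accidental cardinality, because `/orders/789` and `/orders/790` become separate series if the raw path is used as a label. The sketch below shows one simple mitigation under a stated assumption: identifiers in the path are purely numeric. The `routeTemplate` function is a hypothetical helper; where the router exposes the matched route pattern, using that is more reliable.

```go
package main

import "strings"

// routeTemplate replaces purely numeric path segments with ":id" so the
// result has bounded cardinality and is safe to use as a metric label.
func routeTemplate(path string) string {
	segments := strings.Split(path, "/")
	for i, s := range segments {
		if isNumeric(s) {
			segments[i] = ":id"
		}
	}
	return strings.Join(segments, "/")
}

// isNumeric reports whether s is non-empty and consists only of ASCII digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}
```

With this, `/orders/789` and `/orders/790` both map to `/orders/:id`. UUIDs, slugs, and other non-numeric identifiers are not caught by this heuristic and would need their own rules.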

Meaningful Context

Include data that helps debugging:

span.SetAttributes(
    // Identity
    attribute.String("user.id", user.ID),
    attribute.String("tenant.id", tenant.ID),

    // Request
    attribute.String("request.type", requestType),
    attribute.Int("request.items", len(items)),

    // Result
    attribute.Bool("result.from_cache", fromCache),
    attribute.Int("result.count", resultCount),
)

Service Boundaries

Trace across service boundaries:

// Client side - inject context
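func callService(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }

    // Inject trace context into headers. The global propagator is a no-op
    // until otel.SetTextMapPropagator is called at startup.
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

    // The caller is responsible for closing the response body
    return client.Do(req)
}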

// Server side - extract context
func handler(w http.ResponseWriter, r *http.Request) {
    ctx := otel.GetTextMapPropagator().Extract(
        r.Context(),
        propagation.HeaderCarrier(r.Header),
    )
    ctx, span := tracer.Start(ctx, "handler")
    defer span.End()
    // ...
}
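It can help to know what actually crosses the wire. Assuming the W3C Trace Context propagator is the one configured, `Inject` writes a `traceparent` header of the form `00-<trace-id>-<span-id>-<flags>`. The sketch below formats and parses version `00` of that header with the standard library only. The `traceParent` type and `parseTraceParent` function are hypothetical names for illustration; real services should leave this to the propagator.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// traceParent holds the fields of a version-00 "traceparent" header.
type traceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters
	Sampled bool   // lowest bit of the flags field
}

// String renders the header value. Only the sampled flag is preserved.
func (t traceParent) String() string {
	flags := "00"
	if t.Sampled {
		flags = "01"
	}
	return "00-" + t.TraceID + "-" + t.SpanID + "-" + flags
}

// parseTraceParent validates and decodes a version-00 header value.
func parseTraceParent(h string) (traceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errors.New("unsupported traceparent format")
	}
	if !isLowerHex(parts[1], 32) || !isLowerHex(parts[2], 16) || !isLowerHex(parts[3], 2) {
		return traceParent{}, errors.New("malformed traceparent field")
	}
	// All-zero IDs are defined as invalid by the specification.
	if strings.Trim(parts[1], "0") == "" || strings.Trim(parts[2], "0") == "" {
		return traceParent{}, errors.New("all-zero trace or span ID")
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return traceParent{}, err
	}
	return traceParent{TraceID: parts[1], SpanID: parts[2], Sampled: flags[0]&0x01 == 1}, nil
}

// isLowerHex reports whether s has exactly n characters, all in 0-9 or a-f.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}
```

Seeing the header in a captured request is a quick way to confirm that propagation is wired up: if it is missing on an outbound call, the downstream service will start a new, unrelated trace.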

Beyond Technical Metrics

Business Metrics

Technical metrics miss business impact:

// Technical
httpRequestDuration.Observe(duration)

// Business
ordersPlaced.WithLabelValues(region, category).Inc()
revenueTotal.Add(orderValue)
checkoutAbandoned.Inc()

Business metrics answer “does the product work?” not just “does the service work?”

User Experience Metrics

Measure what users experience:

Key Takeaways

Observability isn’t a tool—it’s a property of your system. Design for debuggability from the start.