Observability is often added after systems are built—and it shows. Logs are inconsistent, metrics are missing, and debugging production issues becomes archaeology. Observability-driven development flips this: build observability in from the start.
Here’s how to make observability a core development practice.
The Problem with Bolted-On Observability
What Goes Wrong
common_issues:
inconsistent_logging:
- Different formats per service
- Missing context (request ID, user ID)
- Inconsistent log levels
- Important events not logged
missing_metrics:
- No business metrics
- Only infrastructure metrics
- Missing latency percentiles
- No error breakdowns
poor_tracing:
- Traces don't cross service boundaries
- Missing span context
- No correlation with logs/metrics
- Critical paths not instrumented
result:
- Hours to debug production issues
- Can't answer business questions
- Reactive instead of proactive
Observability First
Design for Observability
Before writing code, define what you need to observe:
observability_design:
service: order-service
key_operations:
- name: create_order
success_metrics:
- orders_created_total
- order_value_histogram
failure_metrics:
- order_creation_failures_total
latency_targets:
p50: 100ms
p99: 500ms
required_traces:
- payment_processing
- inventory_check
- notification_send
- name: get_order
success_metrics:
- orders_retrieved_total
latency_targets:
p50: 20ms
p99: 100ms
business_questions:
- How many orders per hour?
- What's the average order value?
- Which payment methods are failing?
- Where is latency coming from?
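Each business question should map to a concrete query or trace view before any code is written. A rough sketch against the metric names above, written as hypothetical PromQL (label names such as payment_method are assumptions at this stage):

business_question_queries:
  orders_per_hour: sum(increase(orders_created_total[1h]))
  average_order_value: sum(rate(order_value_histogram_sum[1h])) / sum(rate(order_value_histogram_count[1h]))
  failing_payment_methods: sum by (payment_method) (rate(order_creation_failures_total[5m]))
  latency_breakdown: answered by traces across the payment_processing, inventory_check, and notification_send spans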
The Three Pillars
┌─────────────────────────────────────────────────────────────────┐
│ Observability │
├───────────────────┬───────────────────┬─────────────────────────┤
│ Logs │ Metrics │ Traces │
│ │ │ │
│ What happened │ How much/many │ Request flow │
│ Event details │ Aggregated data │ Across services │
│ Debugging │ Alerting │ Latency breakdown │
│ │ │ │
│ Structured JSON │ Prometheus │ OpenTelemetry │
│ Request context │ Counters/Gauges │ Spans and traces │
│ Correlation IDs │ Histograms │ Context propagation │
└───────────────────┴───────────────────┴─────────────────────────┘
Structured Logging
Log Structure
// Structured logging with context
type Logger struct {
logger *zap.Logger
}
func (l *Logger) WithContext(ctx context.Context) *zap.Logger {
	fields := []zap.Field{}
	// Comma-ok assertions avoid a panic if a key is missing or holds a non-string value.
	if requestID, ok := ctx.Value("request_id").(string); ok {
		fields = append(fields, zap.String("request_id", requestID))
	}
	if userID, ok := ctx.Value("user_id").(string); ok {
		fields = append(fields, zap.String("user_id", userID))
	}
	if traceID, ok := ctx.Value("trace_id").(string); ok {
		fields = append(fields, zap.String("trace_id", traceID))
	}
	return l.logger.With(fields...)
}
// Usage
func (s *OrderService) CreateOrder(ctx context.Context, order Order) error {
	start := time.Now()
	log := s.logger.WithContext(ctx)
	log.Info("creating order",
		zap.String("customer_id", order.CustomerID),
		zap.Float64("total", order.Total),
	)
	// validate is assumed to return an error type that exposes Details()
	if err := s.validate(order); err != nil {
		log.Warn("order validation failed",
			zap.Error(err),
			zap.Any("validation_errors", err.Details()),
		)
return err
}
// Process order...
log.Info("order created",
zap.String("order_id", order.ID),
zap.Duration("processing_time", time.Since(start)),
)
return nil
}
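For reference, with zap's production JSON encoder the "creating order" call above comes out as a single line roughly like this (all values invented):

{"level":"info","ts":1718031415.12,"caller":"order/service.go:42","msg":"creating order","request_id":"a1b2c3d4","user_id":"u-873","trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","customer_id":"cust-42","total":59.9}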
Log Levels
log_levels:
debug:
- Detailed debugging info
- Not enabled in production
- Example: "checking inventory for SKU-123"
info:
- Normal operations
- Key business events
- Example: "order created", "user logged in"
warn:
- Unusual but handled situations
- Potential problems
- Example: "retry succeeded after 2 attempts"
error:
- Failures that need attention
- Include error details and context
- Example: "payment processing failed"
fatal:
- Application cannot continue
- Triggers shutdown
- Example: "database connection failed"
Metrics
Metric Types
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
// Counter - monotonically increasing
ordersCreated = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "orders_created_total",
Help: "Total number of orders created",
},
[]string{"status", "payment_method"},
)
// Histogram - distribution of values
orderLatency = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "order_creation_duration_seconds",
Help: "Order creation latency",
Buckets: []float64{.01, .05, .1, .25, .5, 1, 2.5, 5},
},
[]string{"status"},
)
// Gauge - current value
ordersInProgress = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "orders_in_progress",
Help: "Number of orders currently being processed",
},
)
// Summary - percentiles (calculated client-side)
orderValue = promauto.NewSummaryVec(
prometheus.SummaryOpts{
Name: "order_value_dollars",
Help: "Order values in dollars",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"customer_tier"},
)
)
func (s *OrderService) CreateOrder(ctx context.Context, order Order) error {
	start := time.Now()
	status := "success"
	ordersInProgress.Inc()
	defer ordersInProgress.Dec()
	// Record latency under the final status once the outcome is known,
	// so the histogram's status label is meaningful.
	defer func() {
		orderLatency.WithLabelValues(status).Observe(time.Since(start).Seconds())
	}()
	err := s.processOrder(ctx, order)
	if err != nil {
		status = "failed"
		ordersCreated.WithLabelValues("failed", order.PaymentMethod).Inc()
		return err
	}
	ordersCreated.WithLabelValues("success", order.PaymentMethod).Inc()
	orderValue.WithLabelValues(order.CustomerTier).Observe(order.Total)
	return nil
}
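Defining metrics is only half the job; they also have to be exposed for Prometheus to scrape. A minimal sketch using promhttp (the standalone main and the :8080 port are assumptions for illustration):

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promauto registers the metrics above with the default registry,
	// which promhttp.Handler() serves at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}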
RED and USE Methods
# RED Method - for services
red_metrics:
rate: requests per second
errors: failed requests per second
duration: distribution of latency
# USE Method - for resources
use_metrics:
utilization: percentage of time busy
saturation: queue depth / backlog
errors: error events
# Combined example
service_metrics:
- http_requests_total (rate)
- http_request_duration_seconds (duration)
- http_requests_failed_total (errors)
resource_metrics:
- cpu_usage_percent (utilization)
- request_queue_length (saturation)
- connection_errors_total (errors)
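A sketch of how the service-side RED metrics could be declared in Go, assuming errors are derived from a status label rather than a separate http_requests_failed_total counter:

var (
	// Rate and errors: one counter; the error rate is a query over the status label.
	requestsTotal = promauto.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "HTTP requests by method, path, and status code",
		},
		[]string{"method", "path", "status"},
	)

	// Duration: latency distribution per endpoint.
	requestDuration = promauto.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request latency in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "path"},
	)
)

The Correlation section below shows a middleware that observes requestDuration on every request.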
Distributed Tracing
OpenTelemetry Integration
import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)
var tracer = otel.Tracer("order-service")
func (s *OrderService) CreateOrder(ctx context.Context, order Order) error {
ctx, span := tracer.Start(ctx, "CreateOrder",
trace.WithAttributes(
attribute.String("customer_id", order.CustomerID),
attribute.Float64("order_total", order.Total),
),
)
defer span.End()
// Validate order
if err := s.validate(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "validation failed")
return err
}
// Process payment (creates child span)
if err := s.paymentService.Process(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "payment failed")
return err
}
// Reserve inventory (creates child span)
if err := s.inventoryService.Reserve(ctx, order.Items); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "inventory reservation failed")
return err
}
span.SetAttributes(attribute.String("order_id", order.ID))
span.SetStatus(codes.Ok, "order created")
return nil
}
func (s *PaymentService) Process(ctx context.Context, order Order) error {
ctx, span := tracer.Start(ctx, "ProcessPayment")
defer span.End()
// Payment processing...
return nil
}
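These snippets assume that otel.Tracer("order-service") is backed by a configured tracer provider. A minimal wiring sketch using the OTLP gRPC exporter, which reads its endpoint from OTEL_EXPORTER_OTLP_ENDPOINT (module versions, in particular the semconv path, vary by release):

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

// initTracing configures the global tracer provider and propagator.
// The returned function flushes and shuts down the provider.
func initTracing(ctx context.Context) (func(context.Context) error, error) {
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("order-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp.Shutdown, nil
}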
Context Propagation
import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// HTTP client with trace context
func (c *Client) DoRequest(ctx context.Context, req *http.Request) (*http.Response, error) {
	// Inject trace context into headers so the downstream service joins the same trace
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return c.httpClient.Do(req)
}
// HTTP server extracting trace context
func TracingMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx := otel.GetTextMapPropagator().Extract(r.Context(),
propagation.HeaderCarrier(r.Header))
ctx, span := tracer.Start(ctx, r.URL.Path)
defer span.End()
next.ServeHTTP(w, r.WithContext(ctx))
})
}
Correlation
Connecting Logs, Metrics, and Traces
// Middleware that establishes correlation
func ObservabilityMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
// Get or create request ID
requestID := r.Header.Get("X-Request-ID")
if requestID == "" {
requestID = uuid.New().String()
}
// Extract trace context
ctx := otel.GetTextMapPropagator().Extract(r.Context(),
propagation.HeaderCarrier(r.Header))
// Start span
ctx, span := tracer.Start(ctx, r.URL.Path)
defer span.End()
// Add to context
ctx = context.WithValue(ctx, "request_id", requestID)
ctx = context.WithValue(ctx, "trace_id", span.SpanContext().TraceID().String())
// Wrapped response writer for status code
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
start := time.Now()
next.ServeHTTP(wrapped, r.WithContext(ctx))
duration := time.Since(start)
// Log with correlation
logger.Info("request completed",
zap.String("request_id", requestID),
zap.String("trace_id", span.SpanContext().TraceID().String()),
zap.String("method", r.Method),
zap.String("path", r.URL.Path),
zap.Int("status", wrapped.statusCode),
zap.Duration("duration", duration),
)
		// Metrics with an exemplar linking this observation to the trace
		requestDuration.WithLabelValues(r.Method, r.URL.Path).(prometheus.ExemplarObserver).
			ObserveWithExemplar(duration.Seconds(), prometheus.Labels{
				"trace_id": span.SpanContext().TraceID().String(),
			})
})
}
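The middleware references two pieces defined elsewhere: requestDuration is the http_request_duration_seconds histogram from the RED sketch earlier, and responseWriter is a small wrapper that captures the status code written by downstream handlers, for example:

// responseWriter records the status code for logging and metrics.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (w *responseWriter) WriteHeader(code int) {
	w.statusCode = code
	w.ResponseWriter.WriteHeader(code)
}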
Development Workflow
Observability in CI/CD
# GitHub Actions
name: CI
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run tests
run: go test ./...
- name: Check observability coverage
run: |
# Verify all handlers have metrics
./scripts/check-metrics-coverage.sh
# Verify structured logging
./scripts/check-logging-standards.sh
# Verify tracing instrumentation
./scripts/check-tracing-coverage.sh
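The scripts above are placeholders for whatever convention checks fit your codebase. Instrumentation can also be asserted in ordinary unit tests; a sketch using client_golang's testutil package against the orders_created_total counter defined earlier (newTestOrderService and testOrder are hypothetical helpers):

import (
	"context"
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestCreateOrderIncrementsMetrics(t *testing.T) {
	svc := newTestOrderService(t) // hypothetical constructor with fake dependencies
	counter := ordersCreated.WithLabelValues("success", "card")
	before := testutil.ToFloat64(counter)

	if err := svc.CreateOrder(context.Background(), testOrder()); err != nil {
		t.Fatalf("CreateOrder: %v", err)
	}

	if got := testutil.ToFloat64(counter) - before; got != 1 {
		t.Errorf("orders_created_total{status=\"success\"} increased by %v, want 1", got)
	}
}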
Local Development
# docker-compose for local observability stack
version: '3'
services:
app:
build: .
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
- LOG_FORMAT=json
otel-collector:
image: otel/opentelemetry-collector:latest
volumes:
      - ./otel-config.yaml:/etc/otelcol/config.yaml  # default config path for the collector image
ports:
- "4317:4317"
jaeger:
image: jaegertracing/all-in-one:latest
ports:
- "16686:16686"
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:latest
ports:
- "3000:3000"
Key Takeaways
- Design observability before writing code; define what you need to observe
- Use structured logging with consistent context (request ID, user ID, trace ID)
- Implement RED metrics for services, USE metrics for resources
- Instrument distributed tracing across all service boundaries
- Correlate logs, metrics, and traces with common identifiers
- Include observability checks in CI/CD pipelines
- Run local observability stack for development
- Business metrics are as important as technical metrics
- Make observability a team standard, not an individual choice
- Test your observability: can you answer key questions about production?
Observability isn’t a feature—it’s a capability that enables everything else. Build it in from the start.