Modern infrastructure generates endless metrics. CPU usage, memory consumption, disk I/O, network throughput, request counts, error rates, queue depths, cache hit rates—the list is overwhelming. Many teams collect everything, dashboard everything, and alert on nothing useful.
Effective monitoring focuses on metrics that matter: indicators of user experience and system health. Everything else is noise.
The Monitoring Hierarchy
Not all metrics are equal. Organize your monitoring in layers:
User-Facing Metrics (Most Important)
What users actually experience:
- Request latency (from user’s perspective)
- Error rates
- Availability
- Feature functionality
If users are happy, the system is healthy—regardless of what infrastructure metrics show.
Service-Level Metrics
How services are performing:
- Request rate, error rate, duration (RED metrics)
- Saturation (how “full” the service is)
- Dependencies’ health
Service metrics explain user-facing problems and provide early warning.
Resource Metrics
Infrastructure utilization:
- CPU, memory, disk, network
- Container/pod metrics
- Database connections, query times
Resource metrics help diagnose service problems but rarely matter on their own.
Business Metrics
Business outcomes:
- Signups, conversions, revenue
- Feature usage
- User engagement
Business metrics validate that technical health translates to business success.
The RED Method
For request-driven services, focus on three metrics:
Rate: Requests per second. How much traffic is the service handling?
Errors: Failed requests, tracked as a rate or as a percentage of total traffic. What fraction of requests fail?
Duration: Time to respond. How long do requests take?
These three metrics capture the user experience of a service:
- High rate + low errors + low duration = healthy
- High errors = something’s wrong
- High duration = performance problems
Implement RED metrics for every service. They’re your first line of alerting.
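As a concrete starting point, here is a minimal sketch of RED instrumentation using Python’s prometheus_client library. The metric names, labels, and the wrapped handler are illustrative placeholders rather than part of any particular service.

```python
# Minimal RED instrumentation sketch using prometheus_client.
# Metric names, labels, and the wrapped handler are illustrative placeholders.
import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors come from one counter, split by a status label.
REQUESTS = Counter("http_requests_total", "Requests handled", ["endpoint", "status"])
# Duration comes from a histogram, so percentiles can be derived later.
DURATION = Histogram("http_request_duration_seconds", "Request duration", ["endpoint"])

def handle(endpoint, func):
    """Wrap a request handler so every call records rate, errors, and duration."""
    start = time.monotonic()
    try:
        result = func()
        REQUESTS.labels(endpoint=endpoint, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(endpoint=endpoint, status="error").inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)            # exposes /metrics for scraping
    handle("/checkout", lambda: "ok")  # stand-in for a real handler
```

The error rate then falls out of the counter: failed requests divided by total requests over a window.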
The USE Method
For resources (CPU, memory, disk, network), focus on:
Utilization: Percentage of resource capacity used.
Saturation: Work that’s queuing because the resource is full.
Errors: Resource-related failures.
USE metrics identify bottlenecks:
- High utilization without saturation: running efficiently
- High saturation: resource is overloaded, work is queuing
- Errors: resource is failing
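As a rough illustration, the sketch below samples utilization and saturation signals for CPU, memory, and network using the third-party psutil package. The load-average call is Unix-only, and none of the values shown are recommended thresholds.

```python
# A rough USE-style snapshot for CPU, memory, and network, assuming psutil.
# getloadavg() is Unix-only; values are signals to graph, not thresholds.
import os
import psutil

def use_snapshot():
    cpu_count = psutil.cpu_count() or 1
    load1, _, _ = os.getloadavg()             # 1-minute run-queue length
    mem = psutil.virtual_memory()
    net = psutil.net_io_counters()
    return {
        "cpu_utilization_pct": psutil.cpu_percent(interval=1),
        "cpu_saturation": load1 / cpu_count,  # > 1.0 suggests work is queuing
        "mem_utilization_pct": mem.percent,
        "swap_in_use_pct": psutil.swap_memory().percent,  # memory saturation hint
        "net_errors": net.errin + net.errout,  # the E in USE
    }

if __name__ == "__main__":
    for name, value in use_snapshot().items():
        print(f"{name}: {value:.1f}")
```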
Latency: The Golden Signal
Latency deserves special attention. It directly reflects user experience and reveals problems before errors increase.
Measure Percentiles, Not Averages
Average latency hides important information. A service with a 100ms average might serve most requests in 50ms while a meaningful fraction take 500ms. The average looks acceptable even though those slow requests are what some users actually experience.
Measure percentiles:
- p50 (median): The typical request
- p90: The latency that 90% of requests beat; one request in ten is slower
- p99: The slowest 1% of requests
- p99.9: Extreme outliers
Alert on high percentiles (p99, p99.9). Problems often appear first in the tail.
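The difference is easy to see on synthetic data. The sketch below compares the average with percentiles over a batch of latency samples using numpy; in practice you would derive percentiles from histogram buckets in your metrics system rather than raw samples, but the lesson is the same.

```python
# Average vs. percentiles over synthetic latency samples (milliseconds).
# The distribution is made up: a fast majority with a slow 1% tail.
import numpy as np

rng = np.random.default_rng(0)
latencies = np.concatenate([
    rng.normal(50, 5, 9_900),   # ~99% of requests around 50 ms
    rng.normal(500, 50, 100),   # ~1% of requests around 500 ms
])

print(f"average: {latencies.mean():.0f} ms")              # looks fine
for p in (50, 90, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies, p):.0f} ms")  # the tail appears here
```

The average lands near 55 ms, while the high percentiles surface the 500 ms tail that the slowest users actually experience.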
Measure at Multiple Points
Latency at different points reveals different problems:
- Client-observed latency: Complete user experience
- Load balancer latency: Network + service time
- Application latency: Just service processing
- Database latency: Data layer performance
When user latency increases, these measurements show where time is spent.
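One lightweight way to get per-layer numbers inside the service is a timing helper labeled by layer, as sketched below with prometheus_client. The layer names and the stubbed database and rendering functions are hypothetical.

```python
# Timing one request at several internal layers with a labeled histogram.
# Layer names and the stubbed functions are hypothetical.
import time
from contextlib import contextmanager
from prometheus_client import Histogram

LATENCY = Histogram("request_latency_seconds", "Latency by layer", ["layer"])

@contextmanager
def timed(layer):
    start = time.monotonic()
    try:
        yield
    finally:
        LATENCY.labels(layer=layer).observe(time.monotonic() - start)

def query_orders():
    time.sleep(0.01)   # stand-in for a database call
    return []

def render_response(rows):
    return {"orders": rows}

def handle_request():
    with timed("application"):          # total in-service time
        with timed("database"):
            rows = query_orders()
        with timed("render"):
            return render_response(rows)

if __name__ == "__main__":
    handle_request()
```

Client and load-balancer latency come from outside the service (real-user monitoring, load-balancer logs), so a helper like this only covers the inner layers.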
Error Rates and Types
Not all errors are equal. Distinguish:
Client Errors (4xx)
Bad requests, authentication failures, not found. Often not your problem—but spikes might indicate:
- API changes breaking clients
- Authentication issues
- Missing resources that should exist
Server Errors (5xx)
Your problem. Something failed that shouldn’t have. Always investigate server error spikes.
Differentiate Error Sources
Within errors, differentiate:
- Which endpoints are failing?
- Which error codes?
- Which users or clients?
“Errors increased” is a starting point. “Auth endpoint returning 500 for mobile clients” is actionable.
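Labels on the error counter are what make that level of detail possible. The sketch below assumes prometheus_client; the label values are illustrative.

```python
# Counting errors with enough labels to answer: which endpoint, which code, which client?
# Label values are illustrative; keep cardinality bounded (client type, not user ID).
from prometheus_client import Counter

ERRORS = Counter(
    "http_errors_total",
    "Failed requests",
    ["endpoint", "status_code", "client"],
)

def record_error(endpoint, status_code, client):
    ERRORS.labels(endpoint=endpoint, status_code=str(status_code), client=client).inc()

# The actionable signal from above: auth endpoint returning 500 for mobile clients.
record_error("/auth/login", 500, "mobile")
```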
Dependency Monitoring
Services depend on other services, databases, and external APIs. Monitor dependencies separately:
- Are dependencies responding?
- How fast?
- What’s their error rate?
When your service degrades, dependency metrics show whether the problem is yours or upstream.
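A simple way to get these numbers is to wrap every outbound call in one helper. The sketch below assumes the requests library and prometheus_client; the dependency name and URL are hypothetical.

```python
# Recording latency and errors for every outbound dependency call.
# Assumes the requests library; the dependency name and URL are hypothetical.
import time
import requests
from prometheus_client import Counter, Histogram

DEP_LATENCY = Histogram("dependency_latency_seconds", "Outbound call latency", ["dependency"])
DEP_ERRORS = Counter("dependency_errors_total", "Outbound call failures", ["dependency"])

def call_dependency(name, url, timeout=2.0):
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # treat 4xx/5xx from the dependency as errors
        return response
    except requests.RequestException:
        DEP_ERRORS.labels(dependency=name).inc()
        raise
    finally:
        DEP_LATENCY.labels(dependency=name).observe(time.monotonic() - start)

# Example (hypothetical internal service):
# call_dependency("payments-api", "https://payments.internal/health")
```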
Alerting Philosophy
Metrics only help if someone notices when they go wrong. But poor alerting is worse than no alerting: alert fatigue trains teams to ignore everything.
Alert on Symptoms, Not Causes
Alert on what users experience (high error rate, high latency) rather than potential causes (high CPU, low disk space).
High CPU doesn’t necessarily mean users are affected. High error rate definitely means users are affected.
Actionable Alerts
Every alert should be actionable. When it fires, someone should be able to do something about it.
If an alert isn’t actionable, either:
- Make it actionable (add a runbook)
- Remove it (it’s noise)
Reduce Alert Noise
Alert fatigue is real. If alerts fire constantly, teams stop responding.
- Tune thresholds to reduce false positives
- Aggregate related alerts
- Distinguish urgent (page) from important (ticket)
- Regularly review and remove noisy alerts
A few high-quality alerts beat many low-quality alerts.
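To make those principles concrete, here is a sketch of alert definitions that are symptom-based, carry a runbook link, and separate paging from ticketing. The names, thresholds, and the metrics snapshot are all illustrative assumptions.

```python
# Symptom-based alert definitions with explicit severity and runbooks.
# Names, thresholds, and the metrics snapshot are illustrative assumptions.

ALERTS = [
    {
        "name": "HighErrorRate",
        "condition": lambda m: m["error_rate"] > 0.02,      # >2% of requests failing
        "severity": "page",                                  # urgent: wake someone up
        "runbook": "https://wiki.example.com/runbooks/high-error-rate",
    },
    {
        "name": "SlowP99Latency",
        "condition": lambda m: m["p99_latency_ms"] > 1_000,  # p99 above one second
        "severity": "page",
        "runbook": "https://wiki.example.com/runbooks/slow-latency",
    },
    {
        "name": "ElevatedErrorRate",
        "condition": lambda m: m["error_rate"] > 0.005,      # degrading, not urgent
        "severity": "ticket",                                 # important: next working day
        "runbook": "https://wiki.example.com/runbooks/elevated-errors",
    },
]

def evaluate(metrics):
    """Return the alerts whose symptom condition currently holds."""
    return [alert for alert in ALERTS if alert["condition"](metrics)]

# Example evaluation against a made-up metrics snapshot:
for alert in evaluate({"error_rate": 0.03, "p99_latency_ms": 420}):
    print(alert["severity"], alert["name"], alert["runbook"])
```

Note that every entry is a user-visible symptom with a runbook; nothing here pages on CPU or disk.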
Dashboard Design
Dashboards should tell a story, not just display every metric you collect.
Landing Page Dashboard
The first dashboard someone sees should answer: “Is everything okay?”
- Overall health indicators
- Key metrics for each service
- Recent changes (deployments, incidents)
Red/yellow/green indicators show status at a glance.
Service Dashboards
Each service should have a dashboard showing:
- RED metrics over time
- Current status vs. normal
- Dependency health
- Recent changes
When investigating problems, these dashboards provide context.
Investigation Dashboards
Deeper dashboards for debugging:
- Detailed metrics breakdown
- Correlation across signals
- Historical comparison
These aren’t for daily monitoring but for incident investigation.
Monitoring as Code
Define monitoring alongside code:
- Alert definitions in version control
- Dashboard configurations as code
- Monitoring deployed with services
This ensures:
- Monitoring changes are reviewed
- New services include monitoring from day one
- Historical context for monitoring decisions
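If your stack uses Prometheus, one common shape for this is keeping alert definitions in the repository and rendering the rules file in CI. The sketch below uses PyYAML; the service name, expression, and output path are assumptions about your setup, not a prescribed layout.

```python
# Rendering a version-controlled alert definition into a Prometheus rules file.
# Assumes PyYAML; the service name, expression, and output path are illustrative.
import yaml

rules = {
    "groups": [
        {
            "name": "checkout-service",
            "rules": [
                {
                    "alert": "CheckoutHighErrorRate",
                    "expr": 'sum(rate(http_requests_total{status="error"}[5m]))'
                            " / sum(rate(http_requests_total[5m])) > 0.02",
                    "for": "5m",
                    "labels": {"severity": "page"},
                    "annotations": {
                        "summary": "Checkout error rate above 2% for 5 minutes",
                        "runbook": "https://wiki.example.com/runbooks/checkout-errors",
                    },
                },
            ],
        },
    ],
}

with open("checkout-alerts.rules.yml", "w") as f:
    yaml.safe_dump(rules, f, sort_keys=False)
```

The definition lives next to the service’s code, goes through the same review as any other change, and ships with the service’s deployment.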
Anti-Patterns
Metric Hoarding
Collecting everything “just in case” creates:
- High storage costs
- Slow queries
- Difficulty finding relevant data
Collect what you’ll use. Add metrics when needed, not speculatively.
Dashboard Sprawl
Hundreds of dashboards mean no one knows which one to look at. Curate dashboards. Archive unused ones.
Alert on Everything
Alerting on every metric produces noise. Teams either ignore alerts or turn them off entirely.
Reserve alerts for actionable conditions that affect users.
Missing the User Perspective
Monitoring infrastructure but not user experience misses the point. The CPU could be at 10%, but if requests are failing, users don’t care.
Always include user-facing metrics.
Getting Started
If you’re building monitoring from scratch:
- Implement RED metrics for services
- Set up user-facing health endpoints (a minimal sketch follows this list)
- Create a landing page dashboard showing overall health
- Alert on error rates and latency percentiles
- Add USE metrics for resources as needed
- Build investigation dashboards for debugging
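For the health-endpoint item above, a minimal sketch using Flask (an assumption about your stack) and a hypothetical database check might look like this:

```python
# A minimal user-facing health endpoint. Assumes Flask; db_ping() is hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

def db_ping():
    """Hypothetical check of the service's primary dependency."""
    return True

@app.route("/healthz")
def healthz():
    checks = {"database": db_ping()}
    healthy = all(checks.values())
    status = 200 if healthy else 503
    return jsonify({"status": "ok" if healthy else "degraded", "checks": checks}), status

if __name__ == "__main__":
    app.run(port=8080)
```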
Start simple. Add complexity when you need it, not before.
Key Takeaways
- Focus on user-facing metrics first; infrastructure metrics support diagnosis
- Use RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources
- Measure latency percentiles, not averages; alert on p99 and p99.9
- Alert on symptoms (user impact), not causes (resource utilization)
- Every alert should be actionable; reduce noise aggressively
- Design dashboards to tell stories, with clear landing pages and service views
- Define monitoring as code alongside service definitions