Production Monitoring: Metrics That Actually Matter

December 12, 2016

Modern infrastructure generates endless metrics. CPU usage, memory consumption, disk I/O, network throughput, request counts, error rates, queue depths, cache hit rates—the list is overwhelming. Many teams collect everything, dashboard everything, and alert on nothing useful.

Effective monitoring focuses on metrics that matter: indicators of user experience and system health. Everything else is noise.

The Monitoring Hierarchy

Not all metrics are equal. Organize your monitoring in layers:

User-Facing Metrics (Most Important)

What users actually experience: request success rates, latency as seen from the client, and availability of the key user flows.

If users are happy, the system is healthy—regardless of what infrastructure metrics show.

Service-Level Metrics

How services are performing: request rate, error rate, latency, and saturation for each service.

Service metrics explain user-facing problems and provide early warning.

Resource Metrics

Infrastructure utilization: CPU, memory, disk I/O, and network usage on the machines running your services.

Resource metrics help diagnose service problems but rarely matter on their own.

Business Metrics

Business outcomes: signups, orders, revenue, and whatever else the system exists to produce.

Business metrics validate that technical health translates to business success.

The RED Method

For request-driven services, focus on three metrics:

Rate: Requests per second. How much traffic is the service handling?

Errors: Failed requests per second. What percentage of requests fail?

Duration: Time to respond. How long do requests take?

These three metrics capture the user experience of a service: how much work it handles, how often it fails, and how fast it responds.

Implement RED metrics for every service; they’re your first line of alerting.
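As a rough illustration, here’s what RED instrumentation can look like in Python with the prometheus_client library; the metric names and the handle_request stub are placeholders, not a prescribed layout.

```python
# Minimal RED instrumentation sketch using the prometheus_client library.
# Metric names and the handle_request stub are illustrative, not prescriptive.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests received", ["endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration", ["endpoint"])

def handle_request(endpoint):
    start = time.perf_counter()
    status = "200"
    try:
        time.sleep(random.uniform(0.01, 0.05))   # stand-in for real work
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels(endpoint=endpoint, status=status).inc()                 # Rate + Errors
        LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)  # Duration

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a scraper to collect
    while True:
        handle_request("/checkout")
```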

The USE Method

For resources (CPU, memory, disk, network), focus on:

Utilization: Percentage of resource capacity used.

Saturation: Work that’s queuing because the resource is full.

Errors: Resource-related failures.

USE metrics identify bottlenecks: high utilization means a resource is near capacity, saturation means work is already queuing behind it, and errors mean the resource itself is failing.
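A rough sketch of collecting USE-style numbers for CPU and disk, assuming the psutil library as the stats source (any system-stats tool works); which fields you keep and alert on will vary.

```python
# Rough USE snapshot for CPU and disk, assuming the psutil library as the
# stats source. Fields shown are examples, not a required set.
import os
import psutil

def cpu_use():
    load1, _, _ = os.getloadavg()       # Unix run-queue length: a saturation signal
    return {
        "utilization_pct": psutil.cpu_percent(interval=1.0),
        "saturation_load_per_core": load1 / psutil.cpu_count(),
        # CPU "errors" are rare in software; hardware counters live elsewhere.
    }

def disk_use(path="/"):
    usage = psutil.disk_usage(path)
    io = psutil.disk_io_counters()
    return {
        "utilization_pct": usage.percent,
        "saturation_busy_time_ms": getattr(io, "busy_time", None),  # Linux-only field
    }

if __name__ == "__main__":
    print(cpu_use())
    print(disk_use())
```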

Latency: The Golden Signal

Latency deserves special attention. It directly reflects user experience and reveals problems before errors increase.

Measure Percentiles, Not Averages

Average latency hides important information. A service with 100ms average might have most requests at 50ms with some at 500ms—the slow requests are invisible in the average.

Measure percentiles instead: p50 for the typical request, p95 and p99 for the slow tail that a meaningful share of users still hit.

Alert on high percentiles (p99, p99.9). Problems often appear first in the tail.
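A small standard-library example of why the distinction matters; the latency values are fabricated to make the point.

```python
# Same dataset, two summaries: the mean looks healthy while the percentiles
# expose the slow tail. Latency values are fabricated; standard library only.
import statistics

latencies_ms = [50] * 95 + [500] * 5              # 95 fast requests, 5 slow ones

mean = statistics.mean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)  # Python 3.8+
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean:.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# The mean sits near 72ms; p95 and p99 reveal the 500ms requests users hit.
```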

Measure at Multiple Points

Latency at different points reveals different problems: measured at the client, at the load balancer, inside the service, and around each dependency call, the same request tells different stories.

When user latency increases, these measurements show where time is spent.
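One way to get that breakdown is to time the handler and each downstream call separately; the sketch below uses a hypothetical query_database stub and in-process timings rather than any particular tracing system.

```python
# Sketch: time the whole handler and the downstream call separately so the
# difference shows time spent in the service itself. query_database is a
# hypothetical stub; real systems would use their metrics/tracing client here.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

def query_database():
    time.sleep(0.03)                    # stand-in for a real query
    return []

def handle_request():
    with timed("service_total"):
        with timed("database"):
            query_database()
        time.sleep(0.01)                # stand-in for business logic / serialization

handle_request()
own = timings["service_total"] - timings["database"]
print(f"total={timings['service_total'] * 1000:.0f}ms  "
      f"db={timings['database'] * 1000:.0f}ms  own={own * 1000:.0f}ms")
```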

Error Rates and Types

Not all errors are equal. Distinguish:

Client Errors (4xx)

Bad requests, authentication failures, not found. Often not your problem—but spikes might indicate a broken client release, a breaking API change, or abusive traffic.

Server Errors (5xx)

Your problem. Something failed that shouldn’t have. Always investigate server error spikes.

Differentiate Error Sources

Within errors, differentiate by endpoint, by error type, and by client, so you know exactly what is failing and for whom.

“Errors increased” is a starting point. “Auth endpoint returning 500 for mobile clients” is actionable.
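In practice that means attaching labels when you count errors. A small sketch, again assuming prometheus_client; the label set is an example, not a standard.

```python
# Sketch: count errors with enough labels to make the alert message actionable.
# Assumes prometheus_client; the label set is an example, not a standard.
from prometheus_client import Counter

ERRORS = Counter("http_errors_total", "Failed requests", ["endpoint", "status", "client"])

def record_error(endpoint, status, client):
    ERRORS.labels(endpoint=endpoint, status=str(status), client=client).inc()

# "Errors increased" becomes "auth returning 500 for mobile clients":
record_error("/auth", 500, "mobile")
```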

Dependency Monitoring

Services depend on other services, databases, and external APIs. Monitor dependencies separately: track the latency and error rate of every outbound call, per dependency.

When your service degrades, dependency metrics show whether the problem is yours or upstream.
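A sketch of wrapping an outbound call so each dependency gets its own latency histogram and error counter; the requests call, timeout, and metric names are illustrative.

```python
# Sketch: give each dependency its own latency and error metrics so a
# degradation can be traced to a specific upstream. The requests call,
# timeout, and metric names are illustrative.
import time

import requests
from prometheus_client import Counter, Histogram

DEP_LATENCY = Histogram("dependency_latency_seconds", "Outbound call duration", ["dependency"])
DEP_ERRORS = Counter("dependency_errors_total", "Outbound call failures", ["dependency"])

def call_dependency(name, url):
    start = time.perf_counter()
    try:
        resp = requests.get(url, timeout=2.0)
        resp.raise_for_status()
        return resp
    except requests.RequestException:
        DEP_ERRORS.labels(dependency=name).inc()
        raise
    finally:
        DEP_LATENCY.labels(dependency=name).observe(time.perf_counter() - start)
```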

Alerting Philosophy

Metrics are useless without alerts. But poor alerting is worse than no alerting—alert fatigue makes teams ignore everything.

Alert on Symptoms, Not Causes

Alert on what users experience (high error rate, high latency) rather than potential causes (high CPU, low disk space).

High CPU doesn’t necessarily mean users are affected. High error rate definitely means users are affected.
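As a toy illustration, a symptom-based check looks at what users experienced over a recent window of requests rather than at resource counters; the threshold and window size below are arbitrary choices.

```python
# Toy symptom-based alert check: fire on the error ratio users actually saw,
# not on CPU or disk. The 1% threshold and 1000-request window are arbitrary.
from collections import deque

recent_outcomes = deque(maxlen=1000)    # True = request succeeded

def record(success):
    recent_outcomes.append(success)

def should_page(threshold=0.01):
    if not recent_outcomes:
        return False
    error_ratio = recent_outcomes.count(False) / len(recent_outcomes)
    return error_ratio > threshold      # symptom: users are seeing failures

# Whatever runs this check periodically would page the on-call on True.
```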

Actionable Alerts

Every alert should be actionable. When it fires, someone should be able to do something about it.

If an alert isn’t actionable, either fix the underlying condition so it stops firing, or remove the alert and track the signal on a dashboard instead.

Reduce Alert Noise

Alert fatigue is real. If alerts fire constantly, teams stop responding.

A few high-quality alerts beat many low-quality alerts.

Dashboard Design

Dashboards should tell stories, not show data.

Landing Page Dashboard

The first dashboard someone sees should answer: “Is everything okay?”

Red/yellow/green indicators show status at a glance.
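A minimal sketch of that roll-up logic; the thresholds and the two example services are made up.

```python
# Sketch of rolling per-service checks up into one red/yellow/green answer.
# The thresholds and the two example services are made up.
def service_status(error_rate, p99_latency_ms):
    if error_rate > 0.05 or p99_latency_ms > 2000:
        return "red"
    if error_rate > 0.01 or p99_latency_ms > 1000:
        return "yellow"
    return "green"

def overall_status(services):
    statuses = {service_status(**metrics) for metrics in services.values()}
    if "red" in statuses:
        return "red"
    if "yellow" in statuses:
        return "yellow"
    return "green"

print(overall_status({
    "api":      {"error_rate": 0.002, "p99_latency_ms": 300},
    "checkout": {"error_rate": 0.020, "p99_latency_ms": 800},
}))  # -> "yellow": one service is degraded, nothing is down
```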

Service Dashboards

Each service should have a dashboard showing its RED metrics, the health of its dependencies, and the resource usage that matters for that service.

When investigating problems, these dashboards provide context.

Investigation Dashboards

Deeper dashboards for debugging: per-endpoint breakdowns, dependency latency, resource saturation, whatever helps answer “why” during an incident.

These aren’t for daily monitoring but for incident investigation.

Monitoring as Code

Define monitoring alongside code: keep metric definitions, alert rules, and dashboards in version control, reviewed and deployed with the services they describe.

This ensures monitoring changes are reviewed, reproducible, and stay in sync with the services they cover.
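One lightweight way to do this is to keep alert definitions as plain data in the service’s repository and have a deploy step render them into your monitoring system’s format; the AlertRule shape below is an assumption, not any particular tool’s schema.

```python
# Sketch: keep alert definitions as plain, versioned data in the service's
# repository; a deploy step renders them into whatever your monitoring system
# expects. The AlertRule fields and the rules themselves are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    name: str
    expression: str   # rendered into the monitoring system's query language
    duration: str     # how long the condition must hold before firing
    severity: str

ALERTS = [
    AlertRule("HighErrorRate", "error_ratio > 0.01", "5m", "page"),
    AlertRule("HighP99Latency", "p99_latency_seconds > 1.0", "10m", "page"),
]

# A CI step can validate these and push them with the service's deploy, so
# alerts are reviewed, versioned, and rolled back alongside the code.
```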

Anti-Patterns

Metric Hoarding

Collecting everything “just in case” creates storage costs, slow queries, and noise that buries the metrics you actually need.

Collect what you’ll use. Add metrics when needed, not speculatively.

Dashboard Sprawl

Hundreds of dashboards mean no one knows which one to look at. Curate dashboards. Archive unused ones.

Alert on Everything

Alerting on every metric produces noise. Teams either ignore alerts or turn them off entirely.

Reserve alerts for actionable conditions that affect users.

Missing the User Perspective

Monitoring infrastructure but not user experience misses the point. The CPU could be at 10%, but if requests are failing, users don’t care.

Always include user-facing metrics.

Getting Started

If you’re building monitoring from scratch:

  1. Implement RED metrics for services
  2. Set up user-facing health endpoints
  3. Create a landing page dashboard showing overall health
  4. Alert on error rates and latency percentiles
  5. Add USE metrics for resources as needed
  6. Build investigation dashboards for debugging

Start simple. Add complexity when you need it, not before.

Key Takeaways