Modern infrastructure generates endless metrics. CPU usage, memory consumption, disk I/O, network throughput, request counts, error rates, queue depths, cache hit rates—the list is overwhelming. Many teams collect everything, dashboard everything, and alert on nothing useful.
Effective monitoring focuses on metrics that matter: indicators of user experience and system health. Everything else is noise.
The Monitoring Hierarchy
Not all metrics are equal. Organize your monitoring in layers:
User-Facing Metrics (Most Important)
What users actually experience:
- Request latency (from user’s perspective)
- Error rates
- Availability
- Feature functionality
If users are happy, the system is healthy—regardless of what infrastructure metrics show.
Service-Level Metrics
How services are performing:
- Request rate, error rate, duration (RED metrics)
- Saturation (how “full” the service is)
- Dependencies’ health
Service metrics explain user-facing problems and provide early warning.
Resource Metrics
Infrastructure utilization:
- CPU, memory, disk, network
- Container/pod metrics
- Database connections, query times
Resource metrics help diagnose service problems but rarely matter on their own.
Business Metrics
Business outcomes:
- Signups, conversions, revenue
- Feature usage
- User engagement
Business metrics validate that technical health translates to business success.
The RED Method
For request-driven services, focus on three metrics:
Rate: Requests per second. How much traffic is the service handling?
Errors: Failed requests, tracked as a rate or as a percentage of total traffic. What fraction of requests fail?
Duration: Time to respond. How long do requests take?
These three metrics capture the user experience of a service:
- High rate + low errors + low duration = healthy
- High errors = something’s wrong
- High duration = performance problems
Implement RED metrics for every service. They’re your first line of alerting.
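As a concrete starting point, here is a minimal sketch of RED instrumentation using Python’s prometheus_client library. The metric names, labels, and the wrapped handler are illustrative placeholders rather than part of any particular service.

```python
# Minimal RED instrumentation sketch using prometheus_client.
# Metric names, labels, and the wrapped handler are illustrative placeholders.
import time
from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors come from one counter, split by a status label.
REQUESTS = Counter("http_requests_total", "Requests handled", ["endpoint", "status"])
# Duration comes from a histogram, so percentiles can be derived later.
DURATION = Histogram("http_request_duration_seconds", "Request duration", ["endpoint"])

def handle(endpoint, func):
    """Wrap a request handler so every call records rate, errors, and duration."""
    start = time.monotonic()
    try:
        result = func()
        REQUESTS.labels(endpoint=endpoint, status="ok").inc()
        return result
    except Exception:
        REQUESTS.labels(endpoint=endpoint, status="error").inc()
        raise
    finally:
        DURATION.labels(endpoint=endpoint).observe(time.monotonic() - start)

if __name__ == "__main__":
    start_http_server(8000)            # exposes /metrics for scraping
    handle("/checkout", lambda: "ok")  # stand-in for a real handler
```

The error rate then falls out of the counter: failed requests divided by total requests over a window.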
The USE Method
For resources (CPU, memory, disk, network), focus on:
Utilization: Percentage of resource capacity used.
Saturation: Work that’s queuing because the resource is full.
Errors: Resource-related failures.
USE metrics identify bottlenecks:
- High utilization without saturation: running efficiently
- High saturation: resource is overloaded, work is queuing
- Errors: resource is failing
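As a rough illustration, the sketch below samples utilization and saturation signals for CPU, memory, and network using the third-party psutil package. The load-average call is Unix-only, and none of the values shown are recommended thresholds.

```python
# A rough USE-style snapshot for CPU, memory, and network, assuming psutil.
# getloadavg() is Unix-only; values are signals to graph, not thresholds.
import os
import psutil

def use_snapshot():
    cpu_count = psutil.cpu_count() or 1
    load1, _, _ = os.getloadavg()             # 1-minute run-queue length
    mem = psutil.virtual_memory()
    net = psutil.net_io_counters()
    return {
        "cpu_utilization_pct": psutil.cpu_percent(interval=1),
        "cpu_saturation": load1 / cpu_count,  # > 1.0 suggests work is queuing
        "mem_utilization_pct": mem.percent,
        "swap_in_use_pct": psutil.swap_memory().percent,  # memory saturation hint
        "net_errors": net.errin + net.errout,  # the E in USE
    }

if __name__ == "__main__":
    for name, value in use_snapshot().items():
        print(f"{name}: {value:.1f}")
```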
Latency: The Golden Signal
Latency deserves special attention. It directly reflects user experience and reveals problems before errors increase.
Measure Percentiles, Not Averages
Average latency hides important information. A service with a 100ms average might serve most requests in 50ms while a meaningful fraction take 500ms. The average looks acceptable even though those slow requests are what some users actually experience.
Measure percentiles:
- p50 (median): The typical request
- p90: The latency that 90% of requests beat; one request in ten is slower
- p99: The slowest 1% of requests
- p99.9: Extreme outliers
Alert on high percentiles (p99, p99.9). Problems often appear first in the tail.
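The difference is easy to see on synthetic data. The sketch below compares the average with percentiles over a batch of latency samples using numpy; in practice you would derive percentiles from histogram buckets in your metrics system rather than raw samples, but the lesson is the same.

```python
# Average vs. percentiles over synthetic latency samples (milliseconds).
# The distribution is made up: a fast majority with a slow 1% tail.
import numpy as np

rng = np.random.default_rng(0)
latencies = np.concatenate([
    rng.normal(50, 5, 9_900),   # ~99% of requests around 50 ms
    rng.normal(500, 50, 100),   # ~1% of requests around 500 ms
])

print(f"average: {latencies.mean():.0f} ms")              # looks fine
for p in (50, 90, 99, 99.9):
    print(f"p{p}: {np.percentile(latencies, p):.0f} ms")  # the tail appears here
```

The average lands near 55 ms, while the high percentiles surface the 500 ms tail that the slowest users actually experience.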
Measure at Multiple Points
Latency at different points reveals different problems:
- Client-observed latency: Complete user experience
- Load balancer latency: Network + service time
- Application latency: Just service processing
- Database latency: Data layer performance
When user latency increases, these measurements show where time is spent.
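One lightweight way to get per-layer numbers inside the service is a timing helper labeled by layer, as sketched below with prometheus_client. The layer names and the stubbed database and rendering functions are hypothetical.

```python
# Timing one request at several internal layers with a labeled histogram.
# Layer names and the stubbed functions are hypothetical.
import time
from contextlib import contextmanager
from prometheus_client import Histogram

LATENCY = Histogram("request_latency_seconds", "Latency by layer", ["layer"])

@contextmanager
def timed(layer):
    start = time.monotonic()
    try:
        yield
    finally:
        LATENCY.labels(layer=layer).observe(time.monotonic() - start)

def query_orders():
    time.sleep(0.01)   # stand-in for a database call
    return []

def render_response(rows):
    return {"orders": rows}

def handle_request():
    with timed("application"):          # total in-service time
        with timed("database"):
            rows = query_orders()
        with timed("render"):
            return render_response(rows)

if __name__ == "__main__":
    handle_request()
```

Client and load-balancer latency come from outside the service (real-user monitoring, load-balancer logs), so a helper like this only covers the inner layers.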
Error Rates and Types
Not all errors are equal. Distinguish:
Client Errors (4xx)
Bad requests, authentication failures, not found. Often not your problem—but spikes might indicate:
- API changes breaking clients
- Authentication issues
- Missing resources that should exist
Server Errors (5xx)
Your problem. Something failed that shouldn’t have. Always investigate server error spikes.
Differentiate Error Sources
Within errors, differentiate:
- Which endpoints are failing?
- Which error codes?
- Which users or clients?
“Errors increased” is a starting point. “Auth endpoint returning 500 for mobile clients” is actionable.
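Labels on the error counter are what make that level of detail possible. The sketch below assumes prometheus_client; the label values are illustrative.

```python
# Counting errors with enough labels to answer: which endpoint, which code, which client?
# Label values are illustrative; keep cardinality bounded (client type, not user ID).
from prometheus_client import Counter

ERRORS = Counter(
    "http_errors_total",
    "Failed requests",
    ["endpoint", "status_code", "client"],
)

def record_error(endpoint, status_code, client):
    ERRORS.labels(endpoint=endpoint, status_code=str(status_code), client=client).inc()

# The actionable signal from above: auth endpoint returning 500 for mobile clients.
record_error("/auth/login", 500, "mobile")
```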
Dependency Monitoring
Services depend on other services, databases, and external APIs. Monitor dependencies separately:
- Are dependencies responding?
- How fast?
- What’s their error rate?
When your service degrades, dependency metrics show whether the problem is yours or upstream.
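A simple way to get these numbers is to wrap every outbound call in one helper. The sketch below assumes the requests library and prometheus_client; the dependency name and URL are hypothetical.

```python
# Recording latency and errors for every outbound dependency call.
# Assumes the requests library; the dependency name and URL are hypothetical.
import time
import requests
from prometheus_client import Counter, Histogram

DEP_LATENCY = Histogram("dependency_latency_seconds", "Outbound call latency", ["dependency"])
DEP_ERRORS = Counter("dependency_errors_total", "Outbound call failures", ["dependency"])

def call_dependency(name, url, timeout=2.0):
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # treat 4xx/5xx from the dependency as errors
        return response
    except requests.RequestException:
        DEP_ERRORS.labels(dependency=name).inc()
        raise
    finally:
        DEP_LATENCY.labels(dependency=name).observe(time.monotonic() - start)

# Example (hypothetical internal service):
# call_dependency("payments-api", "https://payments.internal/health")
```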
Alerting Philosophy
Metrics only help if someone notices when they go wrong. But poor alerting is worse than no alerting: alert fatigue trains teams to ignore everything.
Alert on Symptoms, Not Causes
Alert on what users experience (high error rate, high latency) rather than potential causes (high CPU, low disk space).
High CPU doesn’t necessarily mean users are affected. High error rate definitely means users are affected.
Actionable Alerts
Every alert should be actionable. When it fires, someone should be able to do something about it.
If an alert isn’t actionable, either:
- Make it actionable (add a runbook)
- Remove it (it’s noise)
Reduce Alert Noise
Alert fatigue is real. If alerts fire constantly, teams stop responding.
- Tune thresholds to reduce false positives
- Aggregate related alerts
- Distinguish urgent (page) from important (ticket)
- Regularly review and remove noisy alerts
A few high-quality alerts beat many low-quality alerts.
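To make those principles concrete, here is a sketch of alert definitions that are symptom-based, carry a runbook link, and separate paging from ticketing. The names, thresholds, and the metrics snapshot are all illustrative assumptions.

```python
# Symptom-based alert definitions with explicit severity and runbooks.
# Names, thresholds, and the metrics snapshot are illustrative assumptions.

ALERTS = [
    {
        "name": "HighErrorRate",
        "condition": lambda m: m["error_rate"] > 0.02,      # >2% of requests failing
        "severity": "page",                                  # urgent: wake someone up
        "runbook": "https://wiki.example.com/runbooks/high-error-rate",
    },
    {
        "name": "SlowP99Latency",
        "condition": lambda m: m["p99_latency_ms"] > 1_000,  # p99 above one second
        "severity": "page",
        "runbook": "https://wiki.example.com/runbooks/slow-latency",
    },
    {
        "name": "ElevatedErrorRate",
        "condition": lambda m: m["error_rate"] > 0.005,      # degrading, not urgent
        "severity": "ticket",                                 # important: next working day
        "runbook": "https://wiki.example.com/runbooks/elevated-errors",
    },
]

def evaluate(metrics):
    """Return the alerts whose symptom condition currently holds."""
    return [alert for alert in ALERTS if alert["condition"](metrics)]

# Example evaluation against a made-up metrics snapshot:
for alert in evaluate({"error_rate": 0.03, "p99_latency_ms": 420}):
    print(alert["severity"], alert["name"], alert["runbook"])
```

Note that every entry is a user-visible symptom with a runbook; nothing here pages on CPU or disk.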
Dashboard Design
Dashboards should tell a story, not just display every metric you collect.
Landing Page Dashboard
The first dashboard someone sees should answer: “Is everything okay?”
- Overall health indicators
- Key metrics for each service
- Recent changes (deployments, incidents)
Red/yellow/green indicators show status at a glance.
Service Dashboards
Each service should have a dashboard showing:
- RED metrics over time
- Current status vs. normal
- Dependency health
- Recent changes
When investigating problems, these dashboards provide context.
Investigation Dashboards
Deeper dashboards for debugging:
- Detailed metrics breakdown
- Correlation across signals
- Historical comparison
These aren’t for daily monitoring but for incident investigation.
Monitoring as Code
Define monitoring alongside code:
- Alert definitions in version control
- Dashboard configurations as code
- Monitoring deployed with services
This ensures:
- Monitoring changes are reviewed
- New services include monitoring from day one
- Historical context for monitoring decisions
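If your stack uses Prometheus, one common shape for this is keeping alert definitions in the repository and rendering the rules file in CI. The sketch below uses PyYAML; the service name, expression, and output path are assumptions about your setup, not a prescribed layout.

```python
# Rendering a version-controlled alert definition into a Prometheus rules file.
# Assumes PyYAML; the service name, expression, and output path are illustrative.
import yaml

rules = {
    "groups": [
        {
            "name": "checkout-service",
            "rules": [
                {
                    "alert": "CheckoutHighErrorRate",
                    "expr": 'sum(rate(http_requests_total{status="error"}[5m]))'
                            " / sum(rate(http_requests_total[5m])) > 0.02",
                    "for": "5m",
                    "labels": {"severity": "page"},
                    "annotations": {
                        "summary": "Checkout error rate above 2% for 5 minutes",
                        "runbook": "https://wiki.example.com/runbooks/checkout-errors",
                    },
                },
            ],
        },
    ],
}

with open("checkout-alerts.rules.yml", "w") as f:
    yaml.safe_dump(rules, f, sort_keys=False)
```

The definition lives next to the service’s code, goes through the same review as any other change, and ships with the service’s deployment.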
Anti-Patterns
Metric Hoarding
Collecting everything “just in case” creates:
- High storage costs
- Slow queries
- Difficulty finding relevant data
Collect what you’ll use. Add metrics when needed, not speculatively.
Dashboard Sprawl
Hundreds of dashboards mean no one knows which one to look at. Curate dashboards. Archive unused ones.
Alert on Everything
Alerting on every metric produces noise. Teams either ignore alerts or turn them off entirely.
Reserve alerts for actionable conditions that affect users.
Missing the User Perspective
Monitoring infrastructure but not user experience misses the point. The CPU could be at 10%, but if requests are failing, users don’t care.
Always include user-facing metrics.
Getting Started
If you’re building monitoring from scratch:
- Implement RED metrics for services
- Set up user-facing health endpoints (a minimal sketch follows this list)
- Create a landing page dashboard showing overall health
- Alert on error rates and latency percentiles
- Add USE metrics for resources as needed
- Build investigation dashboards for debugging
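For the health-endpoint item above, a minimal sketch using Flask (an assumption about your stack) and a hypothetical database check might look like this:

```python
# A minimal user-facing health endpoint. Assumes Flask; db_ping() is hypothetical.
from flask import Flask, jsonify

app = Flask(__name__)

def db_ping():
    """Hypothetical check of the service's primary dependency."""
    return True

@app.route("/healthz")
def healthz():
    checks = {"database": db_ping()}
    healthy = all(checks.values())
    status = 200 if healthy else 503
    return jsonify({"status": "ok" if healthy else "degraded", "checks": checks}), status

if __name__ == "__main__":
    app.run(port=8080)
```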
Start simple. Add complexity when you need it, not before.
Key Takeaways
- Focus on user-facing metrics first; infrastructure metrics support diagnosis
- Use RED (Rate, Errors, Duration) for services; USE (Utilization, Saturation, Errors) for resources
- Measure latency percentiles, not averages; alert on p99 and p99.9
- Alert on symptoms (user impact), not causes (resource utilization)
- Every alert should be actionable; reduce noise aggressively
- Design dashboards to tell stories, with clear landing pages and service views
- Define monitoring as code alongside service definitions