When teams work remotely, the informal knowledge sharing that happens in an office disappears. You can’t walk over to someone’s desk to debug an issue together. Tribal knowledge about how systems behave becomes harder to maintain.
Observability fills this gap. Well-instrumented systems tell their own story. Here’s how to build observability that supports distributed teams.
The Challenge
What Changes Remotely
Before (co-located):
- “Hey, is the API slow for you too?”
- Tap shoulder, share screen, debug together
- Hallway conversations about system behavior
- Whiteboard sessions when things break
After (distributed):
- Hours can pass before anyone notices a problem
- Async debugging through chat messages
- Knowledge siloed in individual heads
- Video calls for collaboration (scheduling overhead)
Why Observability Matters More
Good observability enables:
- Self-service debugging
- Async investigation
- Knowledge preservation
- Faster incident response
The system becomes the source of truth, not tribal knowledge.
Building Observable Systems
The Three Pillars Plus
Classic three pillars:
├── Metrics (aggregated, time-series)
├── Logs (events, context)
└── Traces (request flow)
Additional:
├── Profiling (code-level performance)
├── Events (business-level occurrences)
└── Dashboards (visual understanding)
Metrics That Matter
Focus on actionable metrics:
# RED metrics (request-focused)
rate: requests per second
errors: error rate
duration: request latency (p50, p95, p99)
# USE metrics (resource-focused)
utilization: percentage busy
saturation: queue depth
errors: error count
# Business metrics
orders_per_minute: value delivery
signup_conversion: business health
active_users: engagement
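As a concrete starting point, here is a minimal sketch of RED instrumentation using the Prometheus Python client (`prometheus_client`); the metric names, labels, and port are illustrative choices, not a standard:

```python
# Sketch: RED metrics (rate, errors, duration) with prometheus_client.
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Requests served", ["endpoint", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint, handler):
    start = time.perf_counter()
    status = "200"
    try:
        return handler()
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels(endpoint=endpoint, status=status).inc()  # rate + errors
        LATENCY.labels(endpoint=endpoint).observe(time.perf_counter() - start)  # duration

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```

Rate, error rate, and latency percentiles then fall out of standard PromQL queries over these two series.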
Avoid vanity metrics:
- Server count (unless capacity planning)
- Lines of code
- Number of deployments (without quality context)
Structured Logging
Logs that machines and humans can read:
# Bad: Unstructured
logger.info(f"Processing order {order_id} for user {user_id}")
# Good: Structured
logger.info("Processing order", extra={
"order_id": order_id,
"user_id": user_id,
"amount": order.total,
"items": len(order.items),
"correlation_id": context.correlation_id
})
Output (JSON):
{
"timestamp": "2020-09-14T10:30:00Z",
"level": "INFO",
"message": "Processing order",
"order_id": "12345",
"user_id": "67890",
"amount": 99.99,
"items": 3,
"correlation_id": "abc-123",
"service": "orders",
"environment": "production"
}
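The JSON output above implies a JSON formatter on the logging handler. One option (an assumption, not the only way) is the python-json-logger package; static fields such as `service` and `environment` are typically injected by formatter configuration or by the log shipper rather than at each call site:

```python
# Sketch: JSON log output with python-json-logger.
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
# Standard record fields come from the format string; anything passed via
# `extra=` is merged into the JSON document automatically.
handler.setFormatter(jsonlogger.JsonFormatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processing order", extra={"order_id": "12345", "user_id": "67890"})
# -> {"asctime": "...", "levelname": "INFO", "name": "orders",
#     "message": "Processing order", "order_id": "12345", "user_id": "67890"}
```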
Distributed Tracing
Track requests across services:
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
tracer = trace.get_tracer(__name__)
def process_order(order_id):
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
# Child span for database
with tracer.start_as_current_span("fetch_order"):
order = db.get_order(order_id)
# Child span for payment
with tracer.start_as_current_span("process_payment"):
payment_result = payment_service.charge(order)
if payment_result.failed:
span.set_status(Status(StatusCode.ERROR))
span.record_exception(payment_result.error)
return order
Trace visualization shows:
process_order (150ms)
├── fetch_order (20ms)
├── process_payment (100ms)
│ ├── validate_card (10ms)
│ └── charge_card (90ms)
└── update_inventory (30ms)
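For the trace to actually cross service boundaries, the outgoing call must carry the trace context. A minimal sketch using OpenTelemetry's propagation API (the URL and use of `requests` are illustrative assumptions):

```python
# Sketch: propagating trace context on an outbound HTTP call.
import requests
from opentelemetry.propagate import inject

def call_payment_service(order_payload):
    headers = {}
    # Injects W3C traceparent/tracestate headers for the current span context.
    inject(headers)
    # The receiving service extracts the context and continues the same trace.
    return requests.post("https://payments.internal/charge",
                         json=order_payload, headers=headers)
```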
Self-Service Debugging
Runbooks in Dashboards
Embed troubleshooting guidance:
# Grafana dashboard annotation
panels:
- title: "Error Rate"
description: |
## Troubleshooting High Error Rate
1. Check recent deployments: [Deployment History](link)
2. Look for correlated latency increase
3. Check downstream dependencies: [Dependencies Dashboard](link)
4. Review error logs: [Error Logs Query](link)
**On-call escalation**: #platform-oncall
Correlation IDs
Enable request tracing across async channels:
# Include correlation ID everywhere
def api_handler(request):
correlation_id = request.headers.get('X-Correlation-ID', generate_id())
# Pass to all downstream calls
response = downstream_service.call(
data=request.data,
headers={'X-Correlation-ID': correlation_id}
)
# Include in async jobs
queue.enqueue('process_task', {
'data': request.data,
'correlation_id': correlation_id
})
# Include in logs
logger.info("Request processed", extra={'correlation_id': correlation_id})
return response
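To avoid threading `correlation_id` through every function signature, one option (a sketch, assuming one logical request per thread or asyncio task) is to stash it in a `contextvars` variable and attach it to every log record with a logging filter:

```python
# Sketch: automatic correlation ID propagation via contextvars.
import contextvars
import logging
import uuid

correlation_id_var = contextvars.ContextVar("correlation_id", default=None)

class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""
    def filter(self, record):
        record.correlation_id = correlation_id_var.get() or "none"
        return True

def set_correlation_id(incoming=None):
    """Set (or generate) the correlation ID at the edge of a request."""
    cid = incoming or str(uuid.uuid4())
    correlation_id_var.set(cid)
    return cid
```

With the filter installed on the handler (and `%(correlation_id)s` or a JSON formatter in place), every log line from that request carries the ID without explicit `extra=` arguments.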
Search and Explore
Enable ad-hoc queries:
# Example queries engineers should be able to run
## "What errors are users seeing?"
level:ERROR user_id:* | stats count by error_message
## "How long did this request take?"
correlation_id:abc-123 | select timestamp, service, duration
## "What changed before this outage?"
deploy_event OR config_change | timeline
Dashboards for Teams
Hierarchy of Dashboards
Organization Level:
├── SLA/SLO Overview
├── Business Metrics
└── Cost Overview
Service Level:
├── Service Health (RED metrics)
├── Dependencies
└── Resource Utilization
Detailed Level:
├── Individual Endpoints
├── Database Performance
└── Cache Hit Rates
Home Dashboard
What every engineer should see first:
sections:
- name: "Are we healthy?"
panels:
- SLA status (green/yellow/red)
- Error rate vs baseline
- Latency vs baseline
- name: "What's happening?"
panels:
- Recent deployments
- Active incidents
- Upcoming maintenance
- name: "Quick links"
panels:
- Current on-call
- Runbooks
- Escalation contacts
SLO Dashboard
Track service level objectives:
panels:
- name: "Monthly Error Budget"
query: |
100 - (
sum(rate(http_requests_total{status=~"5.."}[30d])) /
sum(rate(http_requests_total[30d])) * 100
)
thresholds:
- value: 99.9
color: green
- value: 99.5
color: yellow
- value: 99
color: red
- name: "Error Budget Burn Rate"
description: "Hours until budget exhausted at current rate"
Alerting for Remote Teams
Alert Design
Good alerts:
- Actionable (something to do)
- Urgent (needs attention now)
- Specific (points to problem area)
Bad alerts:
- Informational (should be dashboard)
- Flappy (fires and resolves repeatedly)
- Vague (“something is wrong”)
Alert Content
Include everything needed to start investigating:
alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01
labels:
severity: critical
team: platform
annotations:
summary: "Error rate above 1% for {{ $labels.service }}"
description: |
Current error rate: {{ $value | humanizePercentage }}
**Quick Links:**
- [Error Logs](https://logs.example.com?query=service:{{ $labels.service }}+level:ERROR)
- [Dashboard](https://grafana.example.com/d/service?var-service={{ $labels.service }})
- [Recent Deployments](https://deploy.example.com/{{ $labels.service }}/history)
**Runbook:** [High Error Rate](https://wiki.example.com/runbooks/high-error-rate)
Routing and Escalation
# PagerDuty/Opsgenie routing
routes:
- match:
severity: critical
team: platform
receiver: platform-oncall-pagerduty
- match:
severity: warning
receiver: slack-alerts
receivers:
  - name: platform-oncall-pagerduty
    pagerduty_configs:
      - routing_key: xxx
  - name: slack-alerts
    slack_configs:
      - channel: '#platform-incidents'
Documentation
Living Documentation
Keep runbooks next to code:
repo/
├── src/
├── docs/
│ ├── architecture.md
│ └── runbooks/
│ ├── high-error-rate.md
│ ├── high-latency.md
│ └── database-connection-issues.md
└── dashboards/
└── service-dashboard.json
Runbook Template
# High Error Rate
## Symptoms
- Error rate above 1% for > 5 minutes
- PagerDuty alert: HighErrorRate
## Impact
- Users may see error pages
- API clients receiving 5xx responses
## Investigation Steps
### 1. Check Recent Changes
- [ ] Recent deployments? [Deploy History](link)
- [ ] Config changes? [Config History](link)
- [ ] New feature flags enabled?
### 2. Identify Error Type
- [ ] Check error logs: `service:api level:ERROR | top error_message`
- [ ] Check error dashboard: [Link](link)
### 3. Check Dependencies
- [ ] Database healthy? [DB Dashboard](link)
- [ ] Downstream services healthy? [Dependencies](link)
## Mitigation
### Rollback deployment
```bash
kubectl rollout undo deployment/api
```
### Disable feature flag
```bash
./scripts/disable-flag.sh new-feature
```
## Escalation
- Platform team: #platform-oncall
- Database team: #data-oncall
## Key Takeaways
- Remote teams can't rely on tribal knowledge; systems must be self-explanatory
- Structured logging enables search and correlation across services
- Distributed tracing shows request flow; essential for microservices
- Correlation IDs tie requests together across sync and async boundaries
- Dashboards should answer "is it healthy?" before diving into details
- Alerts must be actionable and include investigation starting points
- Runbooks live with the code and dashboards, not in a separate wiki
- SLOs provide objective measures of service health
- Self-service debugging reduces dependency on experts
- Observability is a team practice, not just tooling
Good observability makes remote work sustainable. When systems tell their own story, teams can investigate asynchronously and onboard newcomers faster.