Traditional application monitoring doesn’t capture what matters for LLMs. Response times and error rates are necessary but insufficient. You also need to monitor output quality, prompt effectiveness, and user satisfaction. LLM observability is an emerging discipline.
Here’s how to monitor AI systems effectively.
Why LLM Observability Is Different
Traditional vs. LLM Monitoring
traditional_monitoring:
  focus:
    - Uptime and availability
    - Response time
    - Error rates
    - Resource utilization
  assumption: Correct code → correct behavior

llm_monitoring:
  additional_focus:
    - Output quality
    - Prompt effectiveness
    - Model behavior drift
    - User satisfaction
  reality: Correct code ≠ correct/useful outputs
What Can Go Wrong
llm_failure_modes:
  obvious_failures:
    - API errors
    - Timeouts
    - Rate limits
    - Invalid responses
  subtle_failures:
    - Declining output quality
    - Hallucination increase
    - Style/tone drift
    - Irrelevant responses
  business_impact:
    - User dissatisfaction
    - Trust erosion
    - Support tickets
    - Feature abandonment
Core Metrics
Operational Metrics
operational_metrics:
  availability:
    - API success rate
    - Error types and frequency
    - Timeout rate
  performance:
    - Latency (p50, p95, p99)
    - Time to first token (streaming)
    - Tokens per second
  cost:
    - Tokens consumed (input/output)
    - Cost per request
    - Cost by feature/user
  capacity:
    - Rate limit utilization
    - Queue depth
    - Concurrent requests
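Most of the cost and throughput numbers fall out of data you already get back from the API. A minimal sketch, assuming an OpenAI-style usage object with prompt_tokens and completion_tokens; the per-token price constants are illustrative placeholders, not real rates:

# Sketch: derive cost and throughput metrics from a single response.
# PRICE_PER_1K_* are illustrative placeholders, not actual pricing.
PRICE_PER_1K_INPUT = 0.003
PRICE_PER_1K_OUTPUT = 0.015

def request_metrics(usage, latency_s: float) -> dict:
    """usage is assumed to expose prompt_tokens and completion_tokens."""
    cost = (usage.prompt_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (usage.completion_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return {
        'prompt_tokens': usage.prompt_tokens,
        'completion_tokens': usage.completion_tokens,
        'cost_usd': round(cost, 6),
        'tokens_per_second': usage.completion_tokens / latency_s if latency_s > 0 else 0.0,
    }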
Quality Metrics
quality_metrics:
  automated:
    - Response length distribution
    - Format compliance rate
    - Parsing success rate
    - Content filter trigger rate
  human_feedback:
    - Thumbs up/down ratio
    - Edit/correction rate
    - Report/flag rate
    - Feature retention
  evaluation:
    - Golden dataset scores
    - A/B test metrics
    - Regression detection
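The automated and feedback metrics can be aggregated straight from request logs. A minimal sketch, assuming each record carries output_format_valid (bool or None) and user_feedback ('up', 'down', or None) fields, as in the logging schema later in this post:

# Sketch: aggregate quality signals over a window of request logs.
def quality_summary(logs: list) -> dict:
    validated = [l for l in logs if l.output_format_valid is not None]
    rated = [l for l in logs if l.user_feedback in ('up', 'down')]
    return {
        'format_compliance_rate': (
            sum(l.output_format_valid for l in validated) / len(validated)
            if validated else None
        ),
        'thumbs_up_ratio': (
            sum(l.user_feedback == 'up' for l in rated) / len(rated)
            if rated else None
        ),
        'feedback_coverage': len(rated) / len(logs) if logs else 0.0,
    }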
Implementation
Logging Framework
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid

@dataclass
class LLMRequestLog:
    """One record per LLM call: identity, prompt/usage stats, and outcome."""
    request_id: str
    timestamp: datetime
    user_id: str
    feature: str
    model: str
    prompt_template: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    status: str
    error: Optional[str] = None
    # Quality fields, typically filled in after the fact
    output_format_valid: Optional[bool] = None
    content_filtered: Optional[bool] = None
    user_feedback: Optional[str] = None

class LLMLogger:
    def __init__(self, sink):
        self.sink = sink  # any object with write(log) and update(request_id, fields)

    def log_request(self, **kwargs):
        # Assign an ID and timestamp, then persist the record.
        log = LLMRequestLog(
            request_id=str(uuid.uuid4()),
            timestamp=datetime.utcnow(),
            **kwargs
        )
        self.sink.write(log)
        return log.request_id

    def update_feedback(self, request_id: str, feedback: str):
        # Attach user feedback to an earlier request.
        self.sink.update(request_id, {'user_feedback': feedback})
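One possible way to wire this up is a simple JSON-lines file as the sink. The JSONLSink class and all the values below are hypothetical examples, not part of any library:

import json

class JSONLSink:
    """Hypothetical sink that appends each log record as a JSON line."""
    def __init__(self, path: str):
        self.path = path

    def write(self, log: LLMRequestLog):
        with open(self.path, 'a') as f:
            record = {**log.__dict__, 'timestamp': log.timestamp.isoformat()}
            f.write(json.dumps(record) + '\n')

    def update(self, request_id: str, fields: dict):
        # Append-only update record; a real sink would index by request_id.
        with open(self.path, 'a') as f:
            f.write(json.dumps({'request_id': request_id, **fields}) + '\n')

logger = LLMLogger(sink=JSONLSink('llm_requests.jsonl'))
request_id = logger.log_request(
    user_id='u-123', feature='summarize', model='gpt-4o',
    prompt_template='summarize_v2', prompt_tokens=812, completion_tokens=143,
    latency_ms=1840.0, status='success'
)
logger.update_feedback(request_id, 'up')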
Metrics Collection
from prometheus_client import Counter, Histogram, Gauge

# Counters
llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['model', 'feature', 'status']
)

llm_tokens_total = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['model', 'type']  # type: input/output
)

# Histograms
llm_latency_seconds = Histogram(
    'llm_latency_seconds',
    'LLM request latency',
    ['model', 'feature'],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)

# Gauges
llm_quality_score = Gauge(
    'llm_quality_score',
    'Rolling quality score',
    ['feature']
)

class MetricsCollector:
    def record_request(self, model, feature, status, latency, input_tokens, output_tokens):
        llm_requests_total.labels(model=model, feature=feature, status=status).inc()
        llm_latency_seconds.labels(model=model, feature=feature).observe(latency)
        llm_tokens_total.labels(model=model, type='input').inc(input_tokens)
        llm_tokens_total.labels(model=model, type='output').inc(output_tokens)
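Exposing these metrics is then just a matter of starting the Prometheus HTTP endpoint and calling the collector after each request. A minimal sketch (the port and label values are arbitrary choices):

from prometheus_client import start_http_server

# Serve /metrics for Prometheus to scrape.
start_http_server(9100)

collector = MetricsCollector()

# After each LLM call, record what happened.
collector.record_request(
    model='gpt-4o', feature='summarize', status='success',
    latency=1.84, input_tokens=812, output_tokens=143,
)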
Quality Evaluation Pipeline
class QualityEvaluator:
    def __init__(self, golden_dataset):
        self.golden_dataset = golden_dataset
        self.baseline_scores = {}

    def evaluate_batch(self, feature: str, samples: list) -> dict:
        """Evaluate a batch of recent outputs."""
        scores = []
        for sample in samples:
            score = self.evaluate_single(sample)
            scores.append(score)
        avg_score = sum(scores) / len(scores)

        # Check for regression
        baseline = self.baseline_scores.get(feature)
        if baseline and avg_score < baseline * 0.9:
            self.alert_regression(feature, avg_score, baseline)

        return {
            'feature': feature,
            'sample_size': len(samples),
            'avg_score': avg_score,
            'min_score': min(scores),
            'max_score': max(scores),
            'baseline': baseline
        }

    def evaluate_single(self, sample) -> float:
        """Evaluate a single output."""
        scores = []

        # Format compliance
        scores.append(1.0 if sample.format_valid else 0.0)

        # Length appropriateness
        scores.append(self.score_length(sample))

        # Semantic similarity to good outputs (if available)
        if sample.reference:
            scores.append(self.semantic_similarity(sample.output, sample.reference))

        return sum(scores) / len(scores)
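The evaluator references score_length, semantic_similarity, and alert_regression without defining them. One possible sketch of those helpers, as additional QualityEvaluator methods; the length band is an arbitrary assumption and the word-overlap similarity is a crude stand-in for an embedding-based comparison:

    def score_length(self, sample, min_chars: int = 50, max_chars: int = 2000) -> float:
        """1.0 if the output length falls inside an expected band, else 0.0."""
        return 1.0 if min_chars <= len(sample.output) <= max_chars else 0.0

    def semantic_similarity(self, output: str, reference: str) -> float:
        """Crude Jaccard word overlap; swap in embedding cosine similarity in practice."""
        a, b = set(output.lower().split()), set(reference.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0

    def alert_regression(self, feature: str, score: float, baseline: float):
        # Hook this into your alerting channel (Slack, PagerDuty, ...).
        print(f"[ALERT] {feature}: quality {score:.2f} fell below baseline {baseline:.2f}")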
Dashboards
Key Dashboard Panels
llm_dashboard:
  overview:
    - Request rate by feature
    - Success rate
    - P95 latency
    - Daily cost
  quality:
    - User feedback ratio
    - Quality score trend
    - Content filter triggers
    - Format validation failures
  operational:
    - Error rate by type
    - Rate limit utilization
    - Token usage trend
    - Model distribution
  cost:
    - Cost by feature
    - Cost by user segment
    - Token efficiency
    - Forecast vs. actual
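Given the Prometheus metrics defined earlier, the overview panels could be backed by queries along these lines (shown as Python string constants; the status="success" label value and exact label names are assumptions about your setup):

# Illustrative PromQL for the overview panels, assuming the metrics above.
REQUEST_RATE_BY_FEATURE = 'sum by (feature) (rate(llm_requests_total[5m]))'
SUCCESS_RATE = (
    'sum(rate(llm_requests_total{status="success"}[5m])) '
    '/ sum(rate(llm_requests_total[5m]))'
)
P95_LATENCY = (
    'histogram_quantile(0.95, '
    'sum by (le, feature) (rate(llm_latency_seconds_bucket[5m])))'
)
# Daily token burn; multiply by your per-token price to chart cost.
DAILY_TOKENS_BY_TYPE = 'sum by (type) (increase(llm_tokens_total[24h]))'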
Alerts
alerting_rules:
  critical:
    - name: LLM Error Rate High
      condition: error_rate > 5% for 5 minutes
      action: Page on-call
    - name: LLM Latency Spike
      condition: p95_latency > 10s for 5 minutes
      action: Page on-call
  warning:
    - name: Quality Score Drop
      condition: quality_score < baseline * 0.9
      action: Slack notification
    - name: Cost Anomaly
      condition: daily_cost > 150% of average
      action: Slack notification
    - name: Negative Feedback Spike
      condition: negative_feedback_rate > 20%
      action: Slack notification
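If you are not running a full alerting stack yet, the warning-level rules can also be evaluated in application code. A minimal sketch with the same thresholds; the notify callable is a placeholder for whatever notification hook you use:

def check_warning_alerts(quality_score, baseline, daily_cost, avg_daily_cost,
                         negative_feedback_rate, notify=print):
    """Evaluate the warning-level thresholds from the rules above."""
    if baseline and quality_score < baseline * 0.9:
        notify(f"Quality Score Drop: {quality_score:.2f} < 90% of baseline {baseline:.2f}")
    if avg_daily_cost and daily_cost > 1.5 * avg_daily_cost:
        notify(f"Cost Anomaly: ${daily_cost:.2f} vs. average ${avg_daily_cost:.2f}")
    if negative_feedback_rate > 0.20:
        notify(f"Negative Feedback Spike: {negative_feedback_rate:.0%}")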
Tracing
Distributed Tracing for AI
import time

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm_service")

class TracedLLMClient:
    def __init__(self, client):
        self.client = client

    async def generate(self, prompt, **kwargs):
        with tracer.start_as_current_span("llm_generate") as span:
            span.set_attribute("model", kwargs.get('model', 'default'))
            span.set_attribute("prompt_length", len(prompt))

            try:
                start = time.time()
                response = await self.client.generate(prompt, **kwargs)

                span.set_attribute("completion_tokens", response.usage.completion_tokens)
                span.set_attribute("prompt_tokens", response.usage.prompt_tokens)
                span.set_attribute("latency_ms", (time.time() - start) * 1000)
                span.set_status(Status(StatusCode.OK))
                return response
            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                span.record_exception(e)
                raise
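Wiring this up requires a tracer provider and an exporter from the OpenTelemetry SDK. A minimal sketch using the console exporter; openai_client is a placeholder for whatever async LLM client you already have:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a provider once at startup; swap ConsoleSpanExporter for an
# OTLP exporter when sending spans to a real backend.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Wrap the existing client so every call emits a span.
llm = TracedLLMClient(openai_client)
# response = await llm.generate("Summarize this ticket...", model="gpt-4o")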
Key Takeaways
- LLM monitoring needs quality metrics, not just operational
- Log prompts, responses, and metadata for debugging
- Track user feedback as primary quality signal
- Evaluate against golden datasets regularly
- Alert on quality regression, not just errors
- Monitor costs closely—they can spike unexpectedly
- Use distributed tracing for complex AI pipelines
- Build dashboards for different audiences (ops, product, business)
- Iterate on metrics as you learn what matters
- Observability enables iteration and improvement
You can’t improve what you can’t measure. LLM observability makes AI improvement possible.