LLM Observability: Monitoring AI in Production

August 21, 2023

Traditional application monitoring doesn’t capture what matters for LLMs. Response times and error rates are necessary but insufficient: you also need to monitor output quality, prompt effectiveness, and user satisfaction. LLM observability is the emerging discipline that covers both the operational and the quality side.

Here’s how to monitor AI systems effectively.

Why LLM Observability Is Different

Traditional vs. LLM Monitoring

traditional_monitoring:
  focus:
    - Uptime and availability
    - Response time
    - Error rates
    - Resource utilization

  assumption: Correct code → correct behavior

llm_monitoring:
  additional_focus:
    - Output quality
    - Prompt effectiveness
    - Model behavior drift
    - User satisfaction

  reality: Correct code ≠ correct/useful outputs

What Can Go Wrong

llm_failure_modes:
  obvious_failures:
    - API errors
    - Timeouts
    - Rate limits
    - Invalid responses

  subtle_failures:
    - Declining output quality
    - Hallucination increase
    - Style/tone drift
    - Irrelevant responses

  business_impact:
    - User dissatisfaction
    - Trust erosion
    - Support tickets
    - Feature abandonment

Core Metrics

Operational Metrics

operational_metrics:
  availability:
    - API success rate
    - Error types and frequency
    - Timeout rate

  performance:
    - Latency (p50, p95, p99)
    - Time to first token (streaming)
    - Tokens per second

  cost:
    - Tokens consumed (input/output)
    - Cost per request
    - Cost by feature/user

  capacity:
    - Rate limit utilization
    - Queue depth
    - Concurrent requests
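
The cost metrics above fall straight out of token counts. A minimal sketch of per-request cost attribution, assuming a hypothetical price table keyed by model (substitute your provider’s actual per-1K-token rates):

# Illustrative per-1K-token prices; replace with your provider's real pricing.
PRICE_PER_1K = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0015, "output": 0.002},
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of a single request from its token usage."""
    prices = PRICE_PER_1K[model]
    return (prompt_tokens / 1000) * prices["input"] + \
           (completion_tokens / 1000) * prices["output"]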

Quality Metrics

quality_metrics:
  automated:
    - Response length distribution
    - Format compliance rate
    - Parsing success rate
    - Content filter trigger rate

  human_feedback:
    - Thumbs up/down ratio
    - Edit/correction rate
    - Report/flag rate
    - Feature retention

  evaluation:
    - Golden dataset scores
    - A/B test metrics
    - Regression detection
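
Several of the automated metrics reduce to cheap per-response checks. For a JSON-producing feature, format compliance can be computed inline; a minimal sketch (the expected_keys argument is an illustrative assumption):

import json

def check_format(output: str, expected_keys: set[str]) -> bool:
    """Return True if the output parses as JSON and contains the expected keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and expected_keys.issubset(data.keys())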

Implementation

Logging Framework

from dataclasses import dataclass
from datetime import datetime
from typing import Optional
import uuid

@dataclass
class LLMRequestLog:
    request_id: str
    timestamp: datetime
    user_id: str
    feature: str
    model: str
    prompt_template: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: float
    status: str
    error: Optional[str] = None

    # Quality fields
    output_format_valid: Optional[bool] = None
    content_filtered: Optional[bool] = None
    user_feedback: Optional[str] = None
class LLMLogger:
    """Writes one structured record per LLM request to a sink.

    The sink can be any object exposing write(log) and update(request_id, fields),
    e.g. a file writer, a message queue producer, or a warehouse client.
    """

    def __init__(self, sink):
        self.sink = sink

    def log_request(self, **kwargs):
        log = LLMRequestLog(
            request_id=str(uuid.uuid4()),
            timestamp=datetime.utcnow(),
            **kwargs
        )
        self.sink.write(log)
        return log.request_id

    def update_feedback(self, request_id: str, feedback: str):
        # Feedback arrives after the response, so it is attached by request_id.
        self.sink.update(request_id, {'user_feedback': feedback})
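
The sink is deliberately abstract. As an illustration only, here is a minimal append-only JSONL file sink; a real deployment would more likely write to a log pipeline or warehouse:

import json
from dataclasses import asdict

class JSONLSink:
    """Append-only sink that writes each request log as one JSON line."""

    def __init__(self, path: str):
        self.path = path

    def write(self, log: LLMRequestLog):
        record = asdict(log)
        record['timestamp'] = record['timestamp'].isoformat()
        with open(self.path, 'a') as f:
            f.write(json.dumps(record) + '\n')

    def update(self, request_id: str, fields: dict):
        # Feedback is recorded as a separate event keyed by request_id,
        # to be joined back onto the original record at query time.
        with open(self.path, 'a') as f:
            f.write(json.dumps({'request_id': request_id, **fields}) + '\n')

Usage is one line: logger = LLMLogger(JSONLSink('llm_requests.jsonl')).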

Metrics Collection

from prometheus_client import Counter, Histogram, Gauge

# Counters
llm_requests_total = Counter(
    'llm_requests_total',
    'Total LLM requests',
    ['model', 'feature', 'status']
)

llm_tokens_total = Counter(
    'llm_tokens_total',
    'Total tokens consumed',
    ['model', 'type']  # type: input/output
)

# Histograms
llm_latency_seconds = Histogram(
    'llm_latency_seconds',
    'LLM request latency',
    ['model', 'feature'],
    buckets=[0.1, 0.5, 1, 2, 5, 10, 30]
)

# Gauges
llm_quality_score = Gauge(
    'llm_quality_score',
    'Rolling quality score',
    ['feature']
)

class MetricsCollector:
    def record_request(self, model, feature, status, latency, input_tokens, output_tokens):
        llm_requests_total.labels(model=model, feature=feature, status=status).inc()
        llm_latency_seconds.labels(model=model, feature=feature).observe(latency)
        llm_tokens_total.labels(model=model, type='input').inc(input_tokens)
        llm_tokens_total.labels(model=model, type='output').inc(output_tokens)
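
Tying the pieces together, here is a sketch of a wrapper that records both a structured log entry and Prometheus metrics on every call. The client.generate coroutine and its response.usage fields are assumptions about your client library, and error handling is omitted for brevity:

import time

# The collector comes from above; the logger/sink pair from the logging sketch.
metrics = MetricsCollector()
logger = LLMLogger(JSONLSink('llm_requests.jsonl'))

async def generate_with_telemetry(client, prompt, *, model, feature,
                                   user_id, prompt_template, **kwargs):
    """Call the LLM, then emit a structured log entry plus Prometheus metrics."""
    start = time.time()
    response = await client.generate(prompt, model=model, **kwargs)
    latency = time.time() - start

    metrics.record_request(model, feature, 'success', latency,
                           response.usage.prompt_tokens,
                           response.usage.completion_tokens)
    logger.log_request(user_id=user_id, feature=feature, model=model,
                       prompt_template=prompt_template,
                       prompt_tokens=response.usage.prompt_tokens,
                       completion_tokens=response.usage.completion_tokens,
                       latency_ms=latency * 1000, status='success')
    return response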

Quality Evaluation Pipeline

class QualityEvaluator:
    def __init__(self, golden_dataset):
        self.golden_dataset = golden_dataset
        self.baseline_scores = {}

    def evaluate_batch(self, feature: str, samples: list) -> dict:
        """Evaluate a batch of recent outputs."""
        if not samples:
            raise ValueError("evaluate_batch requires at least one sample")

        scores = []

        for sample in samples:
            score = self.evaluate_single(sample)
            scores.append(score)

        avg_score = sum(scores) / len(scores)

        # Check for regression
        baseline = self.baseline_scores.get(feature)
        if baseline and avg_score < baseline * 0.9:
            self.alert_regression(feature, avg_score, baseline)

        return {
            'feature': feature,
            'sample_size': len(samples),
            'avg_score': avg_score,
            'min_score': min(scores),
            'max_score': max(scores),
            'baseline': baseline
        }

    def evaluate_single(self, sample) -> float:
        """Evaluate a single output."""
        scores = []

        # Format compliance
        scores.append(1.0 if sample.format_valid else 0.0)

        # Length appropriateness
        scores.append(self.score_length(sample))

        # Semantic similarity to good outputs (if available)
        if sample.reference:
            scores.append(self.semantic_similarity(sample.output, sample.reference))

        return sum(scores) / len(scores)
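
The class above deliberately leaves score_length, semantic_similarity, and alert_regression open. One possible way to fill them in, as a sketch only (the length bounds and the all-MiniLM-L6-v2 embedding model are illustrative assumptions, not recommendations):

class SimpleQualityEvaluator(QualityEvaluator):
    """Example implementations of the evaluation hooks; thresholds are illustrative."""

    def score_length(self, sample, min_chars: int = 50, max_chars: int = 2000) -> float:
        """1.0 inside the expected length band, scaled down outside it."""
        n = len(sample.output)
        if n < min_chars:
            return n / min_chars
        if n > max_chars:
            return max(0.0, 1 - (n - max_chars) / max_chars)
        return 1.0

    def semantic_similarity(self, output: str, reference: str) -> float:
        """Cosine similarity between output and reference embeddings."""
        # Assumes sentence-transformers is installed; any embedding model works here.
        from sentence_transformers import SentenceTransformer, util
        model = SentenceTransformer('all-MiniLM-L6-v2')
        emb = model.encode([output, reference])
        return float(util.cos_sim(emb[0], emb[1]))

    def alert_regression(self, feature: str, score: float, baseline: float):
        """Hook for your alerting channel; a real system would page or post to Slack."""
        print(f"[ALERT] {feature} quality dropped to {score:.2f} (baseline {baseline:.2f})")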

Dashboards

Key Dashboard Panels

llm_dashboard:
  overview:
    - Request rate by feature
    - Success rate
    - P95 latency
    - Daily cost

  quality:
    - User feedback ratio
    - Quality score trend
    - Content filter triggers
    - Format validation failures

  operational:
    - Error rate by type
    - Rate limit utilization
    - Token usage trend
    - Model distribution

  cost:
    - Cost by feature
    - Cost by user segment
    - Token efficiency
    - Forecast vs actual

Alerts

alerting_rules:
  critical:
    - name: LLM Error Rate High
      condition: error_rate > 5% for 5 minutes
      action: Page on-call

    - name: LLM Latency Spike
      condition: p95_latency > 10s for 5 minutes
      action: Page on-call

  warning:
    - name: Quality Score Drop
      condition: quality_score < baseline * 0.9
      action: Slack notification

    - name: Cost Anomaly
      condition: daily_cost > 150% of average
      action: Slack notification

    - name: Negative Feedback Spike
      condition: negative_feedback_rate > 20%
      action: Slack notification
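
The critical rules belong in Prometheus Alertmanager, but the warning-level checks can also run as a simple scheduled job. A sketch of the cost-anomaly check, assuming you already have a list of recent daily costs and a Slack incoming-webhook URL (the URL below is a placeholder):

import json
import statistics
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder

def check_cost_anomaly(daily_costs: list[float], threshold: float = 1.5):
    """Notify Slack if today's spend exceeds 150% of the trailing average."""
    *history, today = daily_costs
    if not history:
        return
    average = statistics.mean(history)
    if today > threshold * average:
        message = {"text": f"LLM cost anomaly: ${today:.2f} today vs ${average:.2f} average"}
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(message).encode(),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)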

Tracing

Distributed Tracing for AI

import time

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llm_service")

class TracedLLMClient:
    def __init__(self, client):
        self.client = client

    async def generate(self, prompt, **kwargs):
        with tracer.start_as_current_span("llm_generate") as span:
            span.set_attribute("model", kwargs.get('model', 'default'))
            span.set_attribute("prompt_length", len(prompt))

            try:
                start = time.time()
                response = await self.client.generate(prompt, **kwargs)

                span.set_attribute("completion_tokens", response.usage.completion_tokens)
                span.set_attribute("prompt_tokens", response.usage.prompt_tokens)
                span.set_attribute("latency_ms", (time.time() - start) * 1000)
                span.set_status(Status(StatusCode.OK))

                return response

            except Exception as e:
                span.set_status(Status(StatusCode.ERROR, str(e)))
                span.record_exception(e)
                raise
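
For these spans to end up anywhere useful, a tracer provider and exporter have to be configured once at startup. A minimal sketch using the OpenTelemetry SDK's console exporter; in production you would swap in an OTLP exporter pointed at your collector:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)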

Key Takeaways

You can’t improve what you can’t measure. Instrument the operational basics first, layer in quality metrics and user feedback, and improving your AI features becomes an engineering exercise rather than guesswork.