Engineering Metrics That Actually Matter

October 17, 2022

Engineering teams measure many things: lines of code, story points, tickets closed. Most of these metrics are noise at best and actively harmful at worst. The right metrics drive improvement; the wrong ones create perverse incentives.

Here are metrics that actually matter.

The Problem with Common Metrics

Metrics That Don’t Work

bad_metrics:
  lines_of_code:
    problem: Incentivizes verbosity
    reality: Best code is deleted code
    result: Bloated, unmaintainable systems

  story_points:
    problem: Velocity becomes target
    reality: Point inflation
    result: Gaming instead of delivery

  tickets_closed:
    problem: Incentivizes splitting
    reality: Quality of work ignored
    result: Shallow work, rework

  hours_worked:
    problem: Presence over impact
    reality: Burnout, inefficiency
    result: Declining productivity

Goodhart’s Law

goodharts_law:
  statement: "When a measure becomes a target, it ceases to be a good measure"

  examples:
    - Target: Deploy frequency → Result: Empty deployments
    - Target: Test coverage → Result: Meaningless tests
    - Target: PR merge time → Result: Rubber-stamp reviews

  solution: Measure outcomes, not activities

DORA Metrics

The Four Key Metrics

dora_metrics:
  deployment_frequency:
    what: How often code deploys to production
    elite: Multiple times per day
    high: Weekly to monthly
    medium: Monthly to every 6 months
    low: Less than every 6 months

  lead_time_for_changes:
    what: Time from commit to production
    elite: Less than one hour
    high: One day to one week
    medium: One week to one month
    low: More than one month

  mean_time_to_recover:
    what: Time to restore service after incident
    elite: Less than one hour
    high: Less than one day
    medium: One day to one week
    low: More than one week

  change_failure_rate:
    what: Percentage of deployments causing incidents
    elite: 0-15%
    high: 16-30%
    medium: 31-45%
    low: 46-60%

Measuring DORA

deployment_frequency:
  data_source: CI/CD pipeline
  calculation: Count of production deployments / time period

lead_time:
  data_source: Git + deployment events
  calculation: Median time from first commit to deployment

mttr:
  data_source: Incident management system
  calculation: Mean time from incident start to resolution

change_failure_rate:
  data_source: Incident management + deployments
  calculation: Incidents caused by changes / total deployments

import statistics
from datetime import datetime, timedelta

# Calculate deployment frequency: production deployments per day over a trailing window
def deployment_frequency(deployments, days=30):
    cutoff = datetime.now() - timedelta(days=days)
    recent = [d for d in deployments if d.date > cutoff]
    return len(recent) / days

# Calculate lead time: median hours from first commit to merge
# (swap in deployment timestamps if your pipeline records them)
def lead_time(pull_requests, days=30):
    cutoff = datetime.now() - timedelta(days=days)
    recent = [pr for pr in pull_requests if pr.merged_at > cutoff]
    lead_times = [(pr.merged_at - pr.first_commit_at).total_seconds() / 3600 for pr in recent]
    return statistics.median(lead_times) if lead_times else 0.0
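
The remaining two metrics follow the same pattern. Here is a minimal sketch, assuming incident records expose started_at and resolved_at timestamps and a caused_by_deploy flag; those field names are illustrative, not any specific tool's API.

import statistics
from datetime import datetime, timedelta

# MTTR: mean hours from incident start to resolution over a trailing window
def mean_time_to_recover(incidents, days=30):
    cutoff = datetime.now() - timedelta(days=days)
    recent = [i for i in incidents if i.resolved_at and i.started_at > cutoff]
    durations = [(i.resolved_at - i.started_at).total_seconds() / 3600 for i in recent]
    return statistics.mean(durations) if durations else 0.0

# Change failure rate: incidents attributed to changes / total deployments
def change_failure_rate(incidents, deployments, days=30):
    cutoff = datetime.now() - timedelta(days=days)
    recent_deploys = [d for d in deployments if d.date > cutoff]
    change_incidents = [i for i in incidents if i.caused_by_deploy and i.started_at > cutoff]
    if not recent_deploys:
        return 0.0
    return len(change_incidents) / len(recent_deploys)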

System Reliability

SLIs, SLOs, SLAs

reliability_metrics:
  availability:
    formula: Uptime / Total time
    example: "99.9% = 8.76 hours downtime/year"
    measurement: Synthetic monitoring, real user monitoring

  latency:
    formula: Request duration at percentile
    example: "p99 < 200ms"
    measurement: APM, distributed tracing

  error_rate:
    formula: Failed requests / Total requests
    example: "< 0.1% 5xx errors"
    measurement: Application logs, load balancer metrics

  throughput:
    formula: Requests handled per time unit
    example: "10,000 RPS sustained"
    measurement: Load balancer, application metrics
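
To make the formulas concrete, here is a rough sketch of computing these SLIs from raw request records. It assumes each record has a status code and a duration in milliseconds (hypothetical fields); in practice you would pull these numbers from your load balancer or APM rather than compute them by hand.

import math

# Compute basic SLIs from request records with .status (int) and .duration_ms (float)
def compute_slis(requests, latency_percentile=0.99):
    total = len(requests)
    if total == 0:
        return {"availability": 1.0, "error_rate": 0.0, "p_latency_ms": 0.0}

    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: the value at rank ceil(p * n)
    rank = max(math.ceil(latency_percentile * total), 1)

    return {
        "availability": (total - errors) / total,  # successful requests / total
        "error_rate": errors / total,              # failed requests / total
        "p_latency_ms": durations[rank - 1],       # e.g. p99 latency
    }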

Error Budget

error_budget:
  concept: "Amount of unreliability allowed before slowing feature work"

  calculation:
    slo: 99.9%
    budget: 0.1%
    monthly_minutes: 43,200
    budget_minutes: 43.2

  usage:
    - Track budget consumption
    - Slow features when budget exhausted
    - Balance reliability and velocity
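
The arithmetic is simple enough to track in a few lines. This is a minimal sketch assuming a 30-day window and a single downtime counter; the function name and the boolean "freeze features" signal are illustrative choices, not a standard.

# Error budget for a window: allowed downtime minus downtime already used
def error_budget(slo=0.999, window_days=30, downtime_minutes_used=0.0):
    window_minutes = window_days * 24 * 60          # 43,200 for 30 days
    budget_minutes = window_minutes * (1 - slo)     # 43.2 at 99.9%
    remaining = budget_minutes - downtime_minutes_used
    return {
        "budget_minutes": budget_minutes,
        "remaining_minutes": remaining,
        "consumed_pct": downtime_minutes_used / budget_minutes * 100,
        "freeze_features": remaining <= 0,          # slow feature work when exhausted
    }

# Example: 30 minutes of downtime this month against a 99.9% SLO
print(error_budget(slo=0.999, downtime_minutes_used=30.0))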

Developer Experience

SPACE Framework

space_framework:
  satisfaction_wellbeing:
    what: How developers feel about work
    measures:
      - Survey satisfaction scores
      - Retention rates
      - Burnout indicators

  performance:
    what: Outcomes of developer work
    measures:
      - Quality (defects, incidents)
      - Customer impact
      - Code review quality

  activity:
    what: What developers do (use carefully)
    measures:
      - Commits, PRs (context matters)
      - Deployments
      - Code reviews completed

  communication:
    what: How developers collaborate
    measures:
      - PR review turnaround
      - Documentation quality
      - Knowledge sharing

  efficiency:
    what: Getting work done without friction
    measures:
      - Build times
      - Test suite duration
      - Time to first productive day

Developer Friction

friction_metrics:
  time_to_first_commit:
    what: Days from start to first merged code
    target: "< 1 week"
    indicates: Onboarding effectiveness

  build_time:
    what: Time from code change to runnable build
    target: "< 5 minutes"
    indicates: Development loop speed

  test_suite_duration:
    what: Time to run full test suite
    target: "< 10 minutes"
    indicates: Feedback loop quality

  deploy_wait_time:
    what: Time from merge to production
    target: "< 1 hour"
    indicates: Pipeline efficiency

  pr_review_time:
    what: Time from PR open to first review
    target: "< 4 hours"
    indicates: Team collaboration
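
Most of these come straight from CI and version-control timestamps. As one example, here is a sketch of the PR review-time metric, assuming each pull request record carries opened_at and first_review_at fields (hypothetical names):

import statistics
from datetime import datetime, timedelta

# Median hours from PR open to first review over a trailing window
def pr_review_time(pull_requests, days=30):
    cutoff = datetime.now() - timedelta(days=days)
    recent = [pr for pr in pull_requests if pr.first_review_at and pr.opened_at > cutoff]
    waits = [(pr.first_review_at - pr.opened_at).total_seconds() / 3600 for pr in recent]
    return statistics.median(waits) if waits else 0.0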

Business Alignment

Impact Metrics

impact_metrics:
  feature_adoption:
    what: Users using new features
    why: Measures actual value delivery

  customer_incidents:
    what: Customer-reported issues
    why: Quality from user perspective

  revenue_per_engineer:
    what: Company revenue / engineering headcount
    why: Engineering leverage (use carefully)

  time_to_market:
    what: Idea to customer availability
    why: Competitive advantage

Technical Debt Indicators

tech_debt_metrics:
  rework_rate:
    what: Changes to recently changed code
    interpretation: High rework indicates quality issues

  incident_frequency:
    what: Production incidents per service
    interpretation: Reliability of codebase

  deployment_pain:
    what: Failed deployments, rollbacks
    interpretation: Deployment automation quality

  dependency_age:
    what: Average age of dependencies
    interpretation: Security and maintenance burden
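
Rework rate is one of the easier debt signals to pull from git history. The sketch below counts commits that touch files already modified within the previous few weeks; the 21-day window and the shape of the commit records are assumptions, not a standard definition.

from datetime import timedelta

# Rework rate: share of commits touching a file changed within the last N days.
# Assumes commits are sorted by date and expose .date and .files (list of paths).
def rework_rate(commits, window_days=21):
    last_touched = {}            # file path -> date of most recent change
    rework = 0
    for commit in commits:
        is_rework = any(
            path in last_touched
            and commit.date - last_touched[path] < timedelta(days=window_days)
            for path in commit.files
        )
        if is_rework:
            rework += 1
        for path in commit.files:
            last_touched[path] = commit.date
    return rework / len(commits) if commits else 0.0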

Implementing Metrics

Start Small

implementation_approach:
  phase_1:
    duration: 1-2 months
    metrics:
      - Deployment frequency
      - Lead time
    focus: Establish baseline, build tooling

  phase_2:
    duration: 2-3 months
    metrics:
      - Add MTTR
      - Add change failure rate
    focus: Incident correlation

  phase_3:
    duration: Ongoing
    metrics:
      - Developer experience
      - Business alignment
    focus: Continuous improvement

Avoid Common Pitfalls

pitfalls:
  individual_metrics:
    problem: Creates competition, gaming
    solution: Team-level metrics only

  too_many_metrics:
    problem: Attention fragmented
    solution: 3-5 key metrics maximum

  no_context:
    problem: Numbers without meaning
    solution: Add trends, comparisons, targets

  punitive_use:
    problem: Metrics become weapons
    solution: Use for improvement, not judgment

Key Takeaways

Metrics are a lens for improvement, not a scorecard. Use them to ask questions, not to judge performance.