SRE Team Structures: Models That Work

November 8, 2021

Site Reliability Engineering (SRE) has proven its value, but the organizational models around it vary widely. Should SRE be centralized or embedded? How should it interact with development teams? The right structure depends on your organization’s size, culture, and needs.

Here are the SRE team models that work.

SRE Organizational Models

Centralized SRE

centralized_model:
  structure:
    - Single SRE team
    - Supports all services
    - Centralized expertise

  works_well_when:
    - Organization is small/medium
    - Services share infrastructure
    - Need consistent practices
    - Building initial SRE capability

  challenges:
    - Can become bottleneck
    - May not understand all services deeply
    - "Us vs them" dynamics
    - Prioritization conflicts

  example_org:
    sre_team: 5-10 engineers
    coverage: All production services
    reporting: VP Engineering or CTO

Embedded SRE

embedded_model:
  structure:
    - SREs within product teams
    - Report to team leads
    - Dedicated to specific services

  works_well_when:
    - Organization is large
    - Services are complex/specialized
    - Teams have different needs
    - Strong team ownership desired

  challenges:
    - Inconsistent practices across teams
    - Career path unclear
    - Can become "ops person for the team"
    - Difficult to share learnings

  example_org:
    team_a: 1-2 embedded SREs
    team_b: 1-2 embedded SREs
    reporting: Product team lead

Platform/Consulting SRE

platform_model:
  structure:
    - Central platform team
    - SREs build tools/platforms
    - Dev teams self-serve

  works_well_when:
    - Mature infrastructure exists
    - Self-service is valued
    - Want to scale SRE practices
    - Dev teams can own reliability

  challenges:
    - Requires mature dev teams
    - Platform must be excellent
    - May not help struggling services
    - Indirect impact

  example_org:
    platform_team: Builds self-service tools
    dev_teams: Use platform, own reliability
    sre_role: Consulting and escalation

Hybrid Model

hybrid_model:
  structure:
    - Central SRE team for platform/shared
    - Embedded SREs for critical services
    - Rotation between central and embedded

  works_well_when:
    - Large organization
    - Mix of service criticality
    - Want best of both worlds
    - Can afford the structure

  components:
    central_sre:
      - Shared infrastructure
      - Tools and platform
      - Incident response coordination
      - Best practices and standards

    embedded_sre:
      - Critical/complex services
      - Deep service knowledge
      - Team-specific reliability

    rotation:
      - SREs rotate between roles
      - Cross-pollinate knowledge
      - Career development

Engagement Models

Service Ownership Spectrum

Full Dev Ownership          Shared                    Full SRE Ownership
       │                      │                              │
       ▼                      ▼                              ▼
┌─────────────┐       ┌─────────────┐              ┌─────────────┐
│   Dev owns  │       │   Shared    │              │  SRE owns   │
│ everything  │       │   on-call   │              │ operations  │
│             │       │   Shared    │              │   Dev       │
│             │       │   SLOs      │              │  builds     │
└─────────────┘       └─────────────┘              └─────────────┘
                              │
                          (Recommended)

SRE Engagement Levels

engagement_levels:
  advisory:
    involvement: Light touch
    activities:
      - Review architecture
      - Advise on reliability
      - Occasional consultation
    sre_time: 10-20%
    team_ownership: High

  collaborative:
    involvement: Regular partnership
    activities:
      - Joint SLO definition
      - Shared on-call (optional)
      - Regular reliability reviews
      - Incident response support
    sre_time: 30-50%
    team_ownership: Shared

  full_support:
    involvement: Deep engagement
    activities:
      - SRE owns production
      - Dedicated SRE resources
      - SRE on-call
      - Feature reliability review
    sre_time: 80-100%
    team_ownership: Low (reliability)
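
To see what a mix of engagement levels costs in SRE time, the ranges above can be turned into a rough capacity estimate. The sketch below is an assumption-laden illustration: it uses the midpoint of each `sre_time` range, and the service names are hypothetical.

```python
from dataclasses import dataclass

# Midpoints of the sre_time ranges listed above (10-20%, 30-50%, 80-100%).
# Using the midpoint is an assumption made for this sketch.
SRE_TIME_FRACTION = {
    "advisory": 0.15,
    "collaborative": 0.40,
    "full_support": 0.90,
}


@dataclass
class Engagement:
    service: str  # hypothetical service name
    level: str    # one of the keys in SRE_TIME_FRACTION


def sre_demand(engagements):
    """Return the SRE time the engagements imply, in full-time equivalents."""
    return sum(SRE_TIME_FRACTION[e.level] for e in engagements)


engagements = [
    Engagement("checkout", "full_support"),
    Engagement("search", "collaborative"),
    Engagement("internal-wiki", "advisory"),
]
total = sre_demand(engagements)
print(f"Estimated SRE demand: {total:.2f} FTE")
```

Comparing this estimate against the actual size of the SRE team is a quick way to spot over-commitment before it turns into a bottleneck.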

Earning SRE Support

# Modeled on Google's approach: teams must qualify for SRE support
sre_engagement_criteria:
  requirements:
    - Service has defined SLOs
    - Error budget exists
    - Service is well-documented
    - Runbooks are current
    - Sufficient test coverage
    - Deployment is automated

  ongoing:
    - Maintain reliability standards
    - Participate in postmortems
    - Address reliability improvements
    - Stay within error budget

  disengagement:
    - Standards are not maintained
    - Error budget is consistently exceeded
    - Team is not responsive to issues
    - Outcome: SRE hands the service back to the dev team
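
The entry requirements and the error-budget condition lend themselves to a simple automated check. The following is a sketch under stated assumptions: the field names, the request-based definition of the error budget, and the example numbers are all illustrative, not part of any official engagement process.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Hypothetical record of a service against the entry requirements."""
    has_slos: bool
    has_error_budget: bool
    is_documented: bool
    runbooks_current: bool
    has_test_coverage: bool
    deployment_automated: bool


def unmet_requirements(status):
    """Return the requirements the service does not yet satisfy."""
    checks = {
        "Service has defined SLOs": status.has_slos,
        "Error budget exists": status.has_error_budget,
        "Service is well-documented": status.is_documented,
        "Runbooks are current": status.runbooks_current,
        "Sufficient test coverage": status.has_test_coverage,
        "Deployment is automated": status.deployment_automated,
    }
    return [name for name, ok in checks.items() if not ok]


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left; negative once the budget is exceeded.

    Assumes a request-based availability SLO, e.g. slo_target=0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


status = ServiceStatus(True, True, True, False, True, True)
gaps = unmet_requirements(status)
remaining = error_budget_remaining(0.999, 1_000_000, 400)  # roughly 0.6
print(gaps, round(remaining, 3))
```

A service with an empty gap list qualifies for engagement; a remaining budget that stays negative across review periods is the kind of signal the disengagement criteria describe.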

Team Sizing

Ratios

sizing_guidelines:
  sre_to_dev_ratio:
    google_guidance: Roughly 1 SRE per 10 developers (a commonly cited starting point, not a fixed rule)
    adjust_for:
      - Service complexity
      - Reliability requirements
      - Operational burden

  service_coverage:
    embedded: 1-2 SREs per critical service
    centralized: 5-10 SREs for small org
    platform: Depends on self-service maturity

  on_call:
    minimum: 5-6 people for sustainable rotation
    coverage: Consider timezone distribution
    escalation: Always have backup
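
The sizing guidelines reduce to simple arithmetic. This sketch assumes the 1-per-10 ratio as a baseline and takes the lower end of the 5-6 person on-call minimum; both defaults are parameters so they can be adjusted for service complexity, reliability requirements, and operational burden.

```python
import math


def baseline_sre_headcount(developers, devs_per_sre=10):
    """Baseline SRE headcount from a developer count, rounded up."""
    if developers < 0 or devs_per_sre <= 0:
        raise ValueError("developers must be >= 0 and devs_per_sre > 0")
    return math.ceil(developers / devs_per_sre)


def rotation_is_sustainable(on_call_engineers, minimum=5):
    """True when the rotation meets the minimum size for sustainable on-call."""
    return on_call_engineers >= minimum


print(baseline_sre_headcount(120))  # 12
print(baseline_sre_headcount(45))   # 5 (4.5 rounded up)
print(rotation_is_sustainable(4))   # False
```

The baseline is only a starting figure. A team whose headcount falls below the on-call minimum needs either shared rotations with developers or more staff, regardless of what the ratio suggests.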

Growth Stages

sre_growth_stages:
  startup:
    size: 0-50 engineers
    model: No dedicated SRE
    approach: Developers own reliability
    hire_when: Operational pain is clear and recurring

  scaling:
    size: 50-200 engineers
    model: First SRE hires
    approach: Centralized, foundational
    focus: Infrastructure, CI/CD, monitoring

  growth:
    size: 200-500 engineers
    model: Growing SRE team
    approach: Mix of centralized and embedded
    focus: Platform, automation, practices

  enterprise:
    size: 500+ engineers
    model: Full hybrid model
    approach: Platform team + embedded SREs
    focus: Self-service, scale, efficiency
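
The growth stages can be expressed as a lookup on engineering headcount. Because the size ranges above share their boundary values (50, 200, 500), this sketch assumes each upper bound is exclusive, so an organization of exactly 50 engineers counts as "scaling". That tie-break is an assumption of the example, not something the stages themselves specify.

```python
# (exclusive upper bound, stage name, model), taken from the stages above.
STAGES = [
    (50, "startup", "No dedicated SRE"),
    (200, "scaling", "First SRE hires"),
    (500, "growth", "Growing SRE team"),
]


def growth_stage(engineers):
    """Map an engineering headcount to a (stage, model) pair."""
    if engineers < 0:
        raise ValueError("engineers must be non-negative")
    for upper_bound, name, model in STAGES:
        if engineers < upper_bound:
            return name, model
    return "enterprise", "Full hybrid model"


for headcount in (30, 50, 350, 500):
    print(headcount, growth_stage(headcount))
```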

Making It Work

Clear Responsibilities

raci_example:
  slo_definition:
    responsible: Product team + SRE
    accountable: Product team
    consulted: SRE
    informed: Leadership

  incident_response:
    responsible: On-call (SRE or dev)
    accountable: Service owner
    consulted: Subject matter experts
    informed: Stakeholders

  production_changes:
    responsible: Deploying team
    accountable: Service owner
    consulted: SRE (for complex changes)
    informed: Affected teams

  capacity_planning:
    responsible: SRE
    accountable: Service owner
    consulted: Product team
    informed: Finance
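
A RACI matrix is easy to lint once it is written down as data. The sketch below checks three things: exactly one accountable party per activity, at least one responsible party, and no party listed as both responsible and consulted. The last rule is a convention this example assumes; applied to the matrix above, it flags the `slo_definition` row, where SRE appears in both roles.

```python
def raci_problems(matrix):
    """Return a list of human-readable problems found in a RACI matrix."""
    problems = []
    for activity, roles in matrix.items():
        accountable = roles.get("accountable", [])
        if len(accountable) != 1:
            problems.append(
                f"{activity}: expected exactly one accountable party, "
                f"found {len(accountable)}"
            )
        responsible = roles.get("responsible", [])
        if not responsible:
            problems.append(f"{activity}: no responsible party")
        # Assumed convention: a responsible party need not also be consulted.
        overlap = set(responsible) & set(roles.get("consulted", []))
        for party in sorted(overlap):
            problems.append(
                f"{activity}: {party} is both responsible and consulted"
            )
    return problems


raci = {
    "slo_definition": {
        "responsible": ["Product team", "SRE"],
        "accountable": ["Product team"],
        "consulted": ["SRE"],
        "informed": ["Leadership"],
    },
    "capacity_planning": {
        "responsible": ["SRE"],
        "accountable": ["Service owner"],
        "consulted": ["Product team"],
        "informed": ["Finance"],
    },
}
problems = raci_problems(raci)
for problem in problems:
    print(problem)
```

Running a check like this whenever the matrix changes keeps ownership unambiguous, which matters most for incident response and production changes.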

Communication

communication_patterns:
  regular:
    - Weekly reliability review
    - Monthly SLO review
    - Quarterly planning together

  as_needed:
    - Architecture review
    - Pre-launch review
    - Incident postmortem

  async:
    - Shared dashboards
    - Documentation
    - Chat channels

Career Development

sre_career_path:
  individual_contributor:
    - Junior SRE
    - SRE
    - Senior SRE
    - Staff SRE
    - Principal SRE

  management:
    - SRE Tech Lead
    - SRE Manager
    - Senior SRE Manager
    - Director of SRE

  rotation:
    - SRE to dev team
    - Dev to SRE team
    - Cross-team exposure
    - Leadership opportunities

Anti-Patterns

anti_patterns:
  ops_renamed:
    problem: "We renamed Ops to SRE"
    result: Same work, new name, no change
    fix: Adopt SRE practices, not just title

  developer_gatekeeping:
    problem: SRE blocks all changes
    result: Adversarial relationship, slow delivery
    fix: Error budgets, self-service, collaboration

  firefighting_only:
    problem: SRE only responds to incidents
    result: No prevention, burnout
    fix: Balance project work with ops

  no_dev_involvement:
    problem: Devs throw code over the wall
    result: Poor reliability, SRE burnout
    fix: Shared ownership, on-call rotation

  under_resourced:
    problem: 2 SREs for 500-person org
    result: Bottleneck, burnout, failure
    fix: Appropriate staffing or self-service

Key Takeaways

The right model enables reliability without becoming a bottleneck.