Site Reliability Engineering has proven its value, but organizational models vary widely. Should SRE be centralized or embedded? How does it interact with development teams? The right structure depends on your organization’s size, culture, and needs.
The four models below cover the common structures, with the conditions under which each works and where each tends to struggle.
SRE Organizational Models
Centralized SRE
centralized_model:
  structure:
    - Single SRE team
    - Supports all services
    - Centralized expertise
  works_well_when:
    - Organization is small/medium
    - Services share infrastructure
    - Need consistent practices
    - Building initial SRE capability
  challenges:
    - Can become bottleneck
    - May not understand all services deeply
    - '"Us vs them" dynamics'
    - Prioritization conflicts
  example_org:
    sre_team: 5-10 engineers
    coverage: All production services
    reporting: VP Engineering or CTO
Embedded SRE
embedded_model:
  structure:
    - SREs within product teams
    - Report to team leads
    - Dedicated to specific services
  works_well_when:
    - Organization is large
    - Services are complex/specialized
    - Teams have different needs
    - Strong team ownership desired
  challenges:
    - Inconsistent practices across teams
    - Career path unclear
    - Can become "ops person for the team"
    - Difficult to share learnings
  example_org:
    team_a: 1-2 embedded SREs
    team_b: 1-2 embedded SREs
    reporting: Product team lead
Platform/Consulting SRE
platform_model:
  structure:
    - Central platform team
    - SREs build tools/platforms
    - Dev teams self-serve
  works_well_when:
    - Mature infrastructure exists
    - Self-service is valued
    - Want to scale SRE practices
    - Dev teams can own reliability
  challenges:
    - Requires mature dev teams
    - Platform must be excellent
    - May not help struggling services
    - Indirect impact
  example_org:
    platform_team: Builds self-service tools
    dev_teams: Use platform, own reliability
    sre_role: Consulting and escalation
Hybrid Model
hybrid_model:
  structure:
    - Central SRE team for platform/shared
    - Embedded SREs for critical services
    - Rotation between central and embedded
  works_well_when:
    - Large organization
    - Mix of service criticality
    - Want best of both worlds
    - Can afford the structure
  components:
    central_sre:
      - Shared infrastructure
      - Tools and platform
      - Incident response coordination
      - Best practices and standards
    embedded_sre:
      - Critical/complex services
      - Deep service knowledge
      - Team-specific reliability
    rotation:
      - SREs rotate between roles
      - Cross-pollinate knowledge
      - Career development
Engagement Models
Service Ownership Spectrum
Full Dev Ownership         Shared         Full SRE Ownership
         │                    │                    │
         ▼                    ▼                    ▼
  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
  │ Dev owns    │      │ Shared      │      │ SRE owns    │
  │ everything  │      │ on-call     │      │ operations  │
  │             │      │ Shared      │      │ Dev         │
  │             │      │ SLOs        │      │ builds      │
  └─────────────┘      └─────────────┘      └─────────────┘
                              │
                        (Recommended)
SRE Engagement Levels
engagement_levels:
  advisory:
    involvement: Light touch
    activities:
      - Review architecture
      - Advise on reliability
      - Occasional consultation
    sre_time: 10-20%
    team_ownership: High
  collaborative:
    involvement: Regular partnership
    activities:
      - Joint SLO definition
      - Shared on-call (optional)
      - Regular reliability reviews
      - Incident response support
    sre_time: 30-50%
    team_ownership: Shared
  full_support:
    involvement: Deep engagement
    activities:
      - SRE owns production
      - Dedicated SRE resources
      - SRE on-call
      - Feature reliability review
    sre_time: 80-100%
    team_ownership: Low (reliability)
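The sre_time ranges above can be turned into a rough capacity estimate. The sketch below assumes each range is the share of one SRE's time spent per engaged service, which the levels do not state explicitly; the function name and the sample portfolio are hypothetical.

```python
# Time ranges copied from the engagement levels above.
# Assumption: each range is the share of ONE SRE's time per engaged service.
SRE_TIME = {
    "advisory": (0.10, 0.20),
    "collaborative": (0.30, 0.50),
    "full_support": (0.80, 1.00),
}


def sre_fte_range(engagements):
    """Return (low, high) SRE full-time equivalents for a service -> level mapping."""
    low = sum(SRE_TIME[level][0] for level in engagements.values())
    high = sum(SRE_TIME[level][1] for level in engagements.values())
    return round(low, 2), round(high, 2)


# Hypothetical portfolio: one service at each engagement level.
portfolio = {
    "checkout": "full_support",
    "search": "collaborative",
    "internal-wiki": "advisory",
}
print(sre_fte_range(portfolio))  # (1.2, 1.7)
```

Under these assumptions, three services at mixed levels already consume between one and two SREs, which is one way to see why full support has to be reserved for the services that need it.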
Earning SRE Support
# Google's model: Teams must qualify
sre_engagement_criteria:
  requirements:
    - Service has defined SLOs
    - Error budget exists
    - Service is well-documented
    - Runbooks are current
    - Sufficient test coverage
    - Deployment is automated
  ongoing:
    - Maintain reliability standards
    - Participate in postmortems
    - Address reliability improvements
    - Stay within error budget
  disengagement:
    - If standards not maintained
    - Error budget consistently exceeded
    - Team not responsive to issues
    - SRE can hand back to dev team
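The criteria above can be expressed as a simple checklist plus an error budget calculation. This is a minimal sketch rather than any organization's actual process: the ServiceStatus fields, the 30-day window, and the max_overruns threshold used to interpret "consistently exceeded" are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    # One flag per onboarding requirement listed above (field names are hypothetical).
    has_slos: bool
    has_error_budget: bool
    documented: bool
    runbooks_current: bool
    test_coverage_ok: bool
    deploy_automated: bool


def unmet_requirements(status):
    """List the onboarding requirements a service has not yet met."""
    return [name for name, ok in vars(status).items() if not ok]


def error_budget_minutes(slo, window_days=30):
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def should_hand_back(monthly_downtime_minutes, slo, max_overruns=2):
    """True if the budget was exceeded in more than max_overruns of the months given.

    max_overruns is an assumed threshold for "consistently exceeded".
    """
    budget = error_budget_minutes(slo)
    overruns = sum(1 for minutes in monthly_downtime_minutes if minutes > budget)
    return overruns > max_overruns


status = ServiceStatus(True, True, True, False, True, False)
print(unmet_requirements(status))             # ['runbooks_current', 'deploy_automated']
print(round(error_budget_minutes(0.999), 1))  # 43.2
print(should_hand_back([50, 60, 10, 70], slo=0.999))  # True
```

A 99.9% availability SLO leaves about 43 minutes of downtime per 30 days, so in the last call three of the four months overran the budget.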
Team Sizing
Ratios
sizing_guidelines:
  sre_to_dev_ratio:
    google_guidance: 1 SRE per 10 developers
    adjust_for:
      - Service complexity
      - Reliability requirements
      - Operational burden
  service_coverage:
    embedded: 1-2 SREs per critical service
    centralized: 5-10 SREs for small org
    platform: Depends on self-service maturity
  on_call:
    minimum: 5-6 people for sustainable rotation
    coverage: Consider timezone distribution
    escalation: Always have backup
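The ratio and the on-call minimum interact: a small organization can satisfy the 1:10 ratio and still have too few SREs to staff a rotation alone. The sketch below shows that arithmetic; the function name and return keys are hypothetical, and the defaults simply reuse the figures above without adjusting for complexity or operational burden.

```python
import math


def suggested_sre_headcount(developers, devs_per_sre=10, min_on_call=5):
    """Apply the 1:10 starting ratio and check it against the on-call minimum."""
    ratio_based = math.ceil(developers / devs_per_sre)
    return {
        "ratio_based": ratio_based,
        # False means the SREs cannot sustain a rotation without help.
        "can_staff_own_rotation": ratio_based >= min_on_call,
    }


print(suggested_sre_headcount(120))  # {'ratio_based': 12, 'can_staff_own_rotation': True}
print(suggested_sre_headcount(30))   # {'ratio_based': 3, 'can_staff_own_rotation': False}
```

When the second value is False, the rotation has to be shared with developers or the ratio has to be exceeded, which is one reason shared on-call suits smaller teams.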
Growth Stages
sre_growth_stages:
  startup:
    size: 0-50 engineers
    model: No dedicated SRE
    approach: Developers own reliability
    hire_when: Pain is clear
  scaling:
    size: 50-200 engineers
    model: First SRE hires
    approach: Centralized, foundational
    focus: Infrastructure, CI/CD, monitoring
  growth:
    size: 200-500 engineers
    model: Growing SRE team
    approach: Mix of centralized and embedded
    focus: Platform, automation, practices
  enterprise:
    size: 500+ engineers
    model: Full hybrid model
    approach: Platform team + embedded SREs
    focus: Self-service, scale, efficiency
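The stages above amount to a lookup on engineering headcount. In this sketch the size ranges overlap at 50, 200, and 500, so it assumes each upper bound is exclusive (an organization of exactly 50 counts as scaling); that tie-break is a choice made here, not something the stages specify.

```python
# (exclusive upper bound, stage, model), taken from the growth stages above.
STAGES = [
    (50, "startup", "No dedicated SRE"),
    (200, "scaling", "First SRE hires"),
    (500, "growth", "Growing SRE team"),
]


def growth_stage(engineers):
    """Map an engineering headcount to (stage, model)."""
    for upper, name, model in STAGES:
        if engineers < upper:
            return name, model
    return "enterprise", "Full hybrid model"


print(growth_stage(30))   # ('startup', 'No dedicated SRE')
print(growth_stage(50))   # ('scaling', 'First SRE hires')
print(growth_stage(800))  # ('enterprise', 'Full hybrid model')
```

Headcount is only a proxy: the hire_when note for startups ("Pain is clear") matters more than crossing any particular number.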
Making It Work
Clear Responsibilities
raci_example:
  slo_definition:
    responsible: Product team + SRE
    accountable: Product team
    consulted: SRE
    informed: Leadership
  incident_response:
    responsible: On-call (SRE or dev)
    accountable: Service owner
    consulted: Subject matter experts
    informed: Stakeholders
  production_changes:
    responsible: Deploying team
    accountable: Service owner
    consulted: SRE (for complex changes)
    informed: Affected teams
  capacity_planning:
    responsible: SRE
    accountable: Service owner
    consulted: Product team
    informed: Finance
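A RACI matrix like the one above is easy to lint. The sketch below checks the usual RACI convention that every activity has exactly one accountable party and at least one responsible party. The dictionary layout and function name are hypothetical, and only two rows from the example are reproduced.

```python
# Two rows from the RACI example above, with each role as a list of parties.
RACI = {
    "slo_definition": {
        "responsible": ["Product team", "SRE"],
        "accountable": ["Product team"],
        "consulted": ["SRE"],
        "informed": ["Leadership"],
    },
    "capacity_planning": {
        "responsible": ["SRE"],
        "accountable": ["Service owner"],
        "consulted": ["Product team"],
        "informed": ["Finance"],
    },
}


def raci_problems(matrix):
    """Return human-readable problems; an empty list means the matrix passes."""
    problems = []
    for activity, roles in matrix.items():
        if len(roles.get("accountable", [])) != 1:
            problems.append(f"{activity}: needs exactly one accountable party")
        if not roles.get("responsible"):
            problems.append(f"{activity}: needs at least one responsible party")
    return problems


print(raci_problems(RACI))  # []
```

Running a check like this whenever ownership changes catches the common failure where accountability is split between teams or left empty.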
Communication
communication_patterns:
  regular:
    - Weekly reliability review
    - Monthly SLO review
    - Quarterly planning together
  as_needed:
    - Architecture review
    - Pre-launch review
    - Incident postmortem
  async:
    - Shared dashboards
    - Documentation
    - Chat channels
Career Development
sre_career_path:
  individual_contributor:
    - Junior SRE
    - SRE
    - Senior SRE
    - Staff SRE
    - Principal SRE
  management:
    - SRE Tech Lead
    - SRE Manager
    - Senior SRE Manager
    - Director of SRE
  rotation:
    - SRE to dev team
    - Dev to SRE team
    - Cross-team exposure
    - Leadership opportunities
Anti-Patterns
anti_patterns:
  ops_renamed:
    problem: "We renamed Ops to SRE"
    result: Same work, new name, no change
    fix: Adopt SRE practices, not just title
  developer_gatekeeping:
    problem: SRE blocks all changes
    result: Adversarial relationship, slow delivery
    fix: Error budgets, self-service, collaboration
  firefighting_only:
    problem: SRE only responds to incidents
    result: No prevention, burnout
    fix: Balance project work with ops
  no_dev_involvement:
    problem: Devs throw code over wall
    result: Poor reliability, SRE burnout
    fix: Shared ownership, on-call rotation
  under_resourced:
    problem: 2 SREs for 500-person org
    result: Bottleneck, burnout, failure
    fix: Appropriate staffing or self-service
Key Takeaways
- No single SRE model fits all: choose based on org size, culture, and needs
- Centralized works for smaller orgs, embedded for specialized services
- Hybrid model combines benefits but requires more resources
- Shared ownership (dev + SRE) generally works better than extremes
- Teams should earn and maintain SRE engagement
- Clear responsibilities prevent conflicts
- Size appropriately: use a ~1:10 SRE:dev ratio as a starting point, then adjust for complexity and operational burden
- Career development keeps SREs engaged and growing
- Avoid anti-patterns: renamed ops, gatekeeping, firefighting only
- Evolve model as organization grows and matures
The right model enables reliability without becoming a bottleneck.