SRE

Definition

SRE coverage in this archive spans 8 posts from Oct 2017 to Nov 2021 and focuses on reliability, delivery speed, and cost discipline as one system, not three separate concerns. The strongest adjacent threads are reliability, devops, and incident management. Recurring title motifs include incident, sre, engineering, and outage.

What the archive argues

  • Most posts prioritize predictable operations over feature breadth or stack novelty.
  • Early posts lean on incident response and process, while newer posts lean on observability-driven development as constraints shifted.
  • This topic repeatedly intersects with reliability, devops, and incident management, so design choices here rarely stand alone.

Execution checklist

  • Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
  • Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
  • When boundary questions appear, cross-read reliability and devops before committing implementation details.
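As a rough illustration of the first checklist item, the sketch below turns an availability SLO into an error budget and reports how much of it is left. It is a minimal example under stated assumptions, not something taken from the posts: the `Slo` class, the function names, and the request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO measured over a window of requests."""
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def error_budget(slo: Slo, total_requests: int) -> int:
    """Failed requests the window can absorb before the SLO is missed."""
    if not 0.0 < slo.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return round((1.0 - slo.target) * total_requests)


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo, total_requests)
    if budget <= 0:
        raise ValueError("window is too small to carry an error budget")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Assumed numbers, for illustration only.
    slo = Slo(target=0.999)
    print(error_budget(slo, 1_000_000))
    print(budget_remaining(slo, 1_000_000, 250))
```

A remaining fraction near zero or below is the kind of signal that could gate risky deploys, which is one way to make the SLO drive tooling choices rather than the other way around.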

Common failure modes

  • Adding platform layers faster than the team can operate and debug them.
  • Chasing throughput gains without proving they improve end-user reliability.
  • Applying guidance written between 2017 and 2021 without revisiting its assumptions as context changed.

References

  • What a 3 AM Outage Taught Me About Incident Management
    Good incident response is not about preventing failure. It is about failing well. Lessons from a decade of on-call, including NATO and telecom-scale operations.
    Tags: incident-management, sre, on-call

  • Stop Renaming Your Ops Team to SRE
    Opinionated take on SRE team models from someone who has seen them all fail in interesting ways.
    Tags: sre, teams, organization

  • Database Reliability Engineering: What I've Learned the Hard Way
    Practical database reliability from running Postgres at the fintech startup and at large enterprises. Includes config examples, migration patterns, and the operational habits that actually prevent outages.
    Tags: databases, reliability, sre

  • Observability-Driven Development Is Just Instrumenting Your Code
    ODD sounds fancy. It's not. It means writing logs, metrics, and traces before you ship, not after your first outage.
    Tags: observability, monitoring, development

  • Most Chaos Engineering Is Theater
    Teams love saying they do chaos engineering. Few actually have hypotheses. Even fewer fix what they find.
    Tags: chaos-engineering, reliability, sre

  • Your SLOs Are Probably Useless (Here's How to Fix Them)
    Most SLOs are dashboards nobody acts on. Here's how to pick indicators that reflect real users, set targets grounded in data, and make error budgets actually change how your team ships.
    Tags: sre, slo, reliability

  • SRE Principles Are Great. The Cargo-Culting Is Not.
    The SRE hype train has everyone copying Google's playbook without asking whether it fits. Here's what actually matters when you're not running planet-scale infrastructure.
    Tags: sre, devops, reliability

  • Your Incident Process Will Break at 15 People. Here's What to Do.
    What I learned building incident management at the fintech startup, from five people shouting across a room to actual structured response.
    Tags: incident-management, devops, on-call