Reliability
Definition
Reliability coverage in this archive spans 18 posts from Jul 2016 to Jan 2026. It treats reliability, delivery speed, and cost discipline as one system rather than three separate concerns. The strongest adjacent threads are architecture, SRE, and AI. Recurring words in post titles include “production”, “AI”, “outage”, and “taught”.
Key claims
- Most posts prioritize predictable operations over feature breadth or stack novelty.
- Early post titles lean on “systems” and “production”, while newer ones lean on “engineering” and “outage”, a shift that tracks how constraints changed over time.
- This topic repeatedly intersects with architecture, SRE, and AI, so design choices here rarely stand alone.
Practical checklist
- Set SLOs first, then choose tooling that keeps deploy, observability, and rollback simple.
- Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
- When a question straddles topic boundaries, cross-read the architecture and SRE posts before committing to implementation details.
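To make the "set SLOs first" item concrete, here is a minimal sketch of turning an availability SLO into an error budget. It is not taken from any post in this archive; the function names `error_budget` and `budget_remaining` are hypothetical, and it assumes a simple time-based availability SLO measured over a fixed window. Under that assumption, a 99.9% target over 30 days allows about 43.2 minutes of downtime.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO allows over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> float:
    """Return the unspent fraction of the error budget.

    A negative result means the budget is already overspent.
    """
    budget = error_budget(slo_target, window)
    return 1.0 - downtime_so_far / budget


if __name__ == "__main__":
    window = timedelta(days=30)
    print("budget:", error_budget(0.999, window))
    print("remaining:", budget_remaining(0.999, window, timedelta(minutes=10)))
```

A number like this gives the later tooling decisions something to be measured against: a deploy or rollback process that routinely consumes a large share of the budget is a candidate for simplification.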
Failure modes
- Adding platform layers faster than the team can operate and debug them.
- Chasing throughput gains without proving they improve end-user reliability.
- Applying 2016-era guidance in 2026 without revisiting assumptions that have changed since.
Suggested reading path
- Start here (current state): Building Reliable AI Agents in Go
- Then read (operating middle): Your Load Tests Are Lying to You
- Finish with (foundational context): Building Resilient Systems: Lessons from Production Failures
Related posts
- Building Reliable AI Agents in Go
- AI Incidents Don’t Look Like Outages. That’s the Problem.
- Agentic Workflows: From Demo Magic to Production Reality
- Why I Run Multiple Models in Production
- The AWS us-east-1 Outage Was Predictable. Your Architecture Was Not Ready.
- What a 3 AM Outage Taught Me About Incident Management
- Database Reliability Engineering: What I’ve Learned the Hard Way
- Most Chaos Engineering Is Theater
References
21 entries tagged “Reliability”