Production

Definition

Production coverage in this archive spans 27 posts from Feb 2016 to Jul 2026 and treats production as an engineering discipline: evaluation loops, tool boundaries, escalation paths, and cost control. The strongest adjacent threads are ai, llm, and infrastructure. Recurring title motifs include ai, production, engineering, and kubernetes.

Key claims

  • The archive repeatedly argues that production practices only pay off when they are wired into an existing workflow.
  • Early posts lean on production and kubernetes, while newer posts lean on ai and production as constraints shifted.
  • This topic repeatedly intersects with ai, llm, and infrastructure, so design choices here rarely stand alone.

Practical checklist

  • Define quality gates up front: eval sets, guardrails, and explicit rollback criteria.
  • Start with the newest post to calibrate current constraints, then backtrack to older entries for first principles.
  • When boundary questions appear, cross-read ai and llm before committing implementation details.
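
The first checklist item can be made concrete with a small sketch. The Go code below is a minimal illustration, not taken from any post in the archive: the `Gate` and `Metrics` types, their field names, and the threshold values are all hypothetical. It assumes rollback criteria are written down as numeric thresholds before launch and checked against a canary snapshot.

```go
package main

import "fmt"

// Gate holds hypothetical thresholds agreed on before launch.
// The names and units are assumptions made for this sketch.
type Gate struct {
	MinEvalPassRate  float64 // minimum fraction of eval-set cases that must pass
	MaxP95LatencyMS  float64 // maximum acceptable p95 latency in milliseconds
	MaxCostPerReqUSD float64 // maximum acceptable cost per request in USD
}

// Metrics is an illustrative snapshot taken from a canary or shadow run.
type Metrics struct {
	EvalPassRate  float64
	P95LatencyMS  float64
	CostPerReqUSD float64
}

// Violations lists every rollback criterion the snapshot breaches.
// An empty result means the snapshot clears the gate.
func (g Gate) Violations(m Metrics) []string {
	var v []string
	if m.EvalPassRate < g.MinEvalPassRate {
		v = append(v, fmt.Sprintf("eval pass rate %.2f is below %.2f", m.EvalPassRate, g.MinEvalPassRate))
	}
	if m.P95LatencyMS > g.MaxP95LatencyMS {
		v = append(v, fmt.Sprintf("p95 latency %.0fms exceeds %.0fms", m.P95LatencyMS, g.MaxP95LatencyMS))
	}
	if m.CostPerReqUSD > g.MaxCostPerReqUSD {
		v = append(v, fmt.Sprintf("cost per request $%.3f exceeds $%.3f", m.CostPerReqUSD, g.MaxCostPerReqUSD))
	}
	return v
}

func main() {
	// Example thresholds and canary numbers, invented for illustration.
	gate := Gate{MinEvalPassRate: 0.95, MaxP95LatencyMS: 800, MaxCostPerReqUSD: 0.02}
	canary := Metrics{EvalPassRate: 0.91, P95LatencyMS: 640, CostPerReqUSD: 0.015}

	if v := gate.Violations(canary); len(v) > 0 {
		fmt.Println("roll back:", v)
		return
	}
	fmt.Println("promote")
}
```

Returning every violation rather than a single boolean keeps the rollback reason visible in logs, which helps when the thresholds themselves need revisiting.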

Failure modes

  • Shipping agent behavior without hard boundaries for tools, data access, and approvals.
  • Optimizing for model novelty while ignoring reliability, latency, or cost drift.
  • Applying guidance written between 2016 and 2026 without checking whether its assumptions still hold.
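
As a rough illustration of the first failure mode, the sketch below shows one way to put hard boundaries in front of agent tool calls. It is an assumption-laden example rather than the archive's implementation: `ToolPolicy`, `Call`, `Authorize`, and the tool and scope names are all hypothetical. The idea is a default-deny allowlist where each tool declares which data scopes it may touch and whether a human must approve the call first.

```go
package main

import (
	"errors"
	"fmt"
)

// ToolPolicy is a hypothetical per-tool boundary definition.
type ToolPolicy struct {
	AllowedScopes map[string]bool // data scopes this tool may touch
	NeedsApproval bool            // true if a human must sign off first
}

// Call is an illustrative tool invocation requested by a model.
type Call struct {
	Tool     string
	Scope    string
	Approved bool // set by an approval step outside this sketch
}

var (
	ErrUnknownTool = errors.New("tool is not on the allowlist")
	ErrScope       = errors.New("data scope is not permitted for this tool")
	ErrApproval    = errors.New("human approval is required")
)

// Authorize denies by default: a call proceeds only if the tool is
// listed, the scope is allowed, and any required approval is present.
func Authorize(policies map[string]ToolPolicy, c Call) error {
	p, ok := policies[c.Tool]
	if !ok {
		return ErrUnknownTool
	}
	if !p.AllowedScopes[c.Scope] {
		return ErrScope
	}
	if p.NeedsApproval && !c.Approved {
		return ErrApproval
	}
	return nil
}

func main() {
	// Example policies, invented for illustration.
	policies := map[string]ToolPolicy{
		"search_docs":  {AllowedScopes: map[string]bool{"public_kb": true}},
		"issue_refund": {AllowedScopes: map[string]bool{"billing": true}, NeedsApproval: true},
	}
	calls := []Call{
		{Tool: "search_docs", Scope: "public_kb"},
		{Tool: "search_docs", Scope: "billing"},
		{Tool: "issue_refund", Scope: "billing"},
		{Tool: "drop_table", Scope: "billing"},
	}
	for _, c := range calls {
		fmt.Printf("%s on %s: %v\n", c.Tool, c.Scope, Authorize(policies, c))
	}
}
```

The check runs in deterministic code outside the model, so a prompt cannot talk its way past it.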

Suggested reading path

References

  • AI Production Governance: A Maturity Model
    By mid-April 2026, the gap between teams shipping stable AI features and teams shipping chaos isn't tools—it's production governance. Here is how mature teams evaluate, deploy, and rollback.
    Tags: governance, ai, reliability

  • AI Security: Evolving Threats and Defenses
    As of late February 2026, AI security is defined by adaptive attacks and layered, operational defenses.
    Tags: security, ai, threats

  • AI-Native Architecture Patterns 2026: Production Guide
    Production AI architecture patterns for gateways, retrieval, evaluation, fallbacks, cost control, and ownership.
    Tags: architecture, ai, patterns

  • AI Video Applications in Practice
    Video AI is practical for scoped workflows. This post covers what works, how to design for reliability, and where human review still matters.
    Tags: video, ai, applications

  • AI Incidents Don't Look Like Outages. That's the Problem.
    Your AI system can return 200 OK and still be wrong, unsafe, or confidently hallucinating. Here's how to detect, contain, and learn from AI incidents -- drawing from the same IR principles that work for traditional systems.
    Tags: incident-management, ai, reliability

  • AI Workflow Automation: Decisions Are Cheap, Actions Are Expensive
    The trick to AI workflow automation is simple: let the model decide, let deterministic code act, and never confuse the two.
    Tags: automation, ai, workflow

  • AI Customer Support That Doesn't Make People Hate You
    Most AI support systems are built to deflect tickets. The ones that actually work are built around escalation, grounding, and the simple idea that customers aren't idiots.
    Tags: customer-support, ai, chatbot

  • AI Security: Same Principles, New Attack Surface
    AI systems are exposed APIs with real blast radius. The threats are injection, leakage, and tool misuse. The defenses are the same ones we've always needed -- just applied to a new surface.
    Tags: security, ai, threats

  • Testing AI Where It Actually Runs
    Offline evals are necessary but not sufficient. Here's how I test AI features in production with shadow mode, canaries, and rollback automation -- with Go code.
    Tags: testing, ai, production

  • Your AI System Looks Healthy. It Is Not.
    Traditional monitoring will tell you your AI service is up. It won't tell you it's returning confident garbage. Here's what observability actually looks like for AI.
    Tags: observability, ai, monitoring

  • Reasoning Models in Production: A Practical Guide
    Reasoning models are powerful but expensive and slow. Here's how I integrate them in Go services with routing, async patterns, and cost controls that actually work.
    Tags: reasoning, o1, llm

  • Your AI Infrastructure Is Not Special
    AI infrastructure at scale is just infrastructure. The same boring patterns -- gateways, caching, circuit breakers, budget enforcement -- solve the same boring problems.
    Tags: ai, infrastructure, scale

  • AI Safety Is Just Production Engineering
    AI safety in production isn't a research problem. It's defense in depth, the same way cyber defense works -- layered controls, assumed breach, observable boundaries.
    Tags: ai, safety, production

  • Function Calling Patterns That Survive Production
    Function calling is how LLMs touch real systems. Treat tools like APIs, arguments like untrusted input, and permissions like the model is an intern with root access.
    Tags: function-calling, llm, ai

  • Agentic Workflows: From Demo Magic to Production Reality
    AI agents that can take actions are fundamentally different from chatbots. The engineering bar must match the blast radius.
    Tags: agents, ai, production

  • Why I Run Multiple Models in Production
    Betting on a single model provider is like having a single database with no failover. Here is why multi-model is the only sane production strategy.
    Tags: ai, architecture, llm

  • AI Engineering Is Its Own Discipline Now
    AI engineering is not ML research with a product hat. It is the discipline of making models behave in production -- and it demands its own skill set.
    Tags: ai-engineering, career, skills

  • LLM Observability: Your Existing Monitoring Is Not Enough
    Traditional monitoring tells you the service is up. It doesn't tell you the model started confidently returning garbage last Tuesday. Here's how to actually observe LLM systems.
    Tags: observability, llm, ai

  • AI in Production Is Just Engineering. Treat It That Way.
    ChatGPT changed expectations overnight, but shipping AI features that actually work is an engineering problem, not a model problem.
    Tags: ai, production, engineering

  • Your Staging Environment Is Lying to You
    Staging never catches the real bugs. Here's how I learned to test in production without burning everything down.
    Tags: testing, production, feature-flags

  • The Boring Kubernetes Checklist That Actually Keeps Production Alive
    Most Kubernetes outages come from skipping the basics. Here's the checklist I use after running clusters at the fintech startup and now at Decloud.
    Tags: kubernetes, devops, infrastructure

  • GraphQL in Production Is Harder Than They Tell You
    After a year running GraphQL at the fintech startup, here's what the conference talks leave out.
    Tags: graphql, api, backend

  • Two Years of Kubernetes in Production — The Boring Parts Are the Hard Parts
    Year two of running Kubernetes at the fintech startup. The panic is gone. Now it's networking, resource tuning, and all the operational grunt work nobody blogs about.
    Tags: kubernetes, containers, devops

  • A Year Running Kubernetes in Production — What Actually Happened
    After a year of running Kubernetes in production, the wins are real but the sharp edges drew blood first. Here's what paid off, what bit us, and what I'd do differently.
    Tags: kubernetes, containers, devops

  • Why We Deleted 42 Grafana Panels
    Most teams monitor too much and alert on the wrong things. Five metrics are enough to run a startup backend.
    Tags: monitoring, observability, devops

  • Building Resilient Systems: Lessons from Production Failures
    Production incidents show where architecture bends and where it breaks. These lessons focus on designing for failure, limiting blast radius, and making recovery routine.
    Tags: reliability, resilience, architecture

  • Docker in Production: What We Learned Running Containers at Dropbyke
    Running Docker in production at Dropbyke forced us to get serious about image builds, container networking, log aggregation, and security. Here is what actually worked.
    Tags: docker, containers, devops