A year ago, we migrated our first production services to Kubernetes. The promise was compelling: declarative deployment, self-healing systems, efficient resource utilization, and simplified operations. The reality was more nuanced. Here’s what we learned.
## What Worked Well

### Declarative Configuration
Kubernetes’ declarative model—describing desired state and letting the system converge—is genuinely powerful. Deployments describe what we want; Kubernetes figures out how to get there.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: our-registry/api:v1.2.3
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```
Configuration in version control means:
- Changes are reviewed before application
- History shows what changed when
- Rollback is straightforward
### Self-Healing
Kubernetes automatically replaces failed pods, reschedules workloads from failed nodes, and respects health checks. This happened invisibly—we’d see pods restart in logs but users experienced no impact.
Before Kubernetes, server failures required human intervention. Now the system handles most failures automatically. The on-call experience improved measurably.
### Resource Efficiency
Bin-packing multiple services onto nodes improved utilization. Services that previously needed dedicated servers share infrastructure. We run similar workloads on fewer machines.
### Rolling Updates
Zero-downtime deployments became trivial:
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
Old approach: maintenance windows, careful coordination, anxious engineers. New approach: merge, deploy, and let Kubernetes roll the change out safely.
## What Surprised Us

### Networking Complexity
Kubernetes networking is complex. CNI plugins, service mesh considerations, network policies, ingress controllers—each decision has tradeoffs we didn’t initially understand.
We spent significant time debugging:
- Service discovery issues
- DNS resolution delays
- Network policy mistakes that broke communication
- Ingress configuration that dropped requests
The networking abstraction is powerful, but it demands a real investment: when things go wrong, you need to understand what's underneath it.
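To see how small the margin for error is, consider a default-deny policy; a sketch like the following (the name is illustrative) silently blocks all ingress to every pod in its namespace until matching allow policies exist:

```yaml
# Default-deny ingress for a namespace. Easy to apply, easy to forget:
# every service in the namespace now needs an explicit allow policy,
# or traffic silently stops flowing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}       # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress           # no ingress rules listed, so all ingress is denied
```

Several of our outages traced back to exactly this pattern: a deny policy applied without the corresponding allows.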
### Resource Requests and Limits
Getting resource configuration right took iteration:
- Requests too low: nodes overcommit, and pods get evicted or starved under pressure
- Limits too low: legitimate traffic spikes cause OOM kills and CPU throttling
- Requests too high: capacity is wasted and pods become hard to schedule
We ended up profiling applications under realistic load to set appropriate values. Default values from examples rarely matched our needs.
### Stateful Workloads
Kubernetes is designed for stateless workloads. Stateful applications (databases, caches, queues) require additional machinery:
- Persistent volumes and storage classes
- StatefulSets for ordered deployment and stable network identities
- Careful consideration of failure modes
We kept databases outside Kubernetes initially. Over time, we moved some stateful workloads in, but it required more planning than stateless services.
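For reference, that extra machinery looks roughly like this; a minimal StatefulSet sketch (the name, image, and storage size are illustrative):

```yaml
# Sketch of a StatefulSet: stable pod names (cache-0, cache-1, ...),
# ordered rollout, and a per-replica PersistentVolumeClaim that
# survives pod restarts and rescheduling.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache
spec:
  serviceName: cache          # headless Service providing stable DNS identities
  replicas: 3
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
        - name: cache
          image: our-registry/cache:v1.0.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/cache
  volumeClaimTemplates:       # one PVC per replica, retained across restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Compared with a Deployment, every piece here (the headless Service, the claim template, the ordered rollout) is another failure mode to reason about.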
### Debugging Difficulty
When things go wrong, debugging Kubernetes is harder than debugging traditional deployments. The fault can hide in any of several layers:
- Application code
- Container configuration
- Kubernetes scheduling
- Networking
- Storage
- Node issues
The layers of abstraction that provide power also obscure what’s happening. We invested in logging, tracing, and Kubernetes-specific observability.
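When triaging, we found it helps to walk those layers top to bottom with kubectl (the pod and node names here are illustrative):

```
# Work down the layers: application -> pod -> scheduling -> node.
kubectl logs api-7d4b9c-xk2p1 --previous       # app logs from the crashed container
kubectl describe pod api-7d4b9c-xk2p1          # events: image pulls, OOM kills, probe failures
kubectl get events --sort-by=.metadata.creationTimestamp   # scheduling and cluster events
kubectl describe node worker-3                 # node conditions, allocated resources
kubectl exec -it api-7d4b9c-xk2p1 -- sh        # shell into the container for network checks
```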
## Hard Lessons

### Start with Cluster Operations Expertise
We initially treated Kubernetes as a black box. That worked until it didn’t. Cluster upgrades, node failures, and capacity planning required understanding we didn’t have.
Recommendation: Invest in understanding Kubernetes internals before depending on it for production. Someone on the team needs deep knowledge.
### Managed Kubernetes Is Worth It
We started self-managing clusters. It consumed significant engineering time: etcd maintenance, control plane upgrades, node management, security patching.
Switching to managed Kubernetes (GKE, EKS, AKS) removed significant operational burden. For most teams, managed services are worth the cost.
### YAML Management Gets Out of Hand
YAML configuration proliferates. Without discipline, you end up with copy-pasted, inconsistent, hard-to-maintain configurations.
We adopted Helm for templating and Kustomize for environment-specific variations. This reduced duplication but added tooling complexity.
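As a sketch of what that tooling looks like in practice (paths and names are illustrative), a Kustomize overlay layers environment-specific changes over a shared base:

```yaml
# overlays/production/kustomization.yaml -- layers production-specific
# changes on top of a shared base instead of copy-pasting manifests.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared Deployment, Service, etc.
patches:
  - path: replica-count.yaml   # e.g. bump replicas for production
images:
  - name: our-registry/api
    newTag: v1.2.3             # pin the image tag per environment
```

The duplication went away, but now every change involves reasoning about base plus overlays rather than reading one file.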
### Secrets Management Is Harder

Kubernetes Secrets are base64 encoded, not encrypted; by default they sit unencrypted in etcd and are readable by anyone with sufficient cluster access. This isn't adequate for sensitive data.
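The encoding is trivially reversible; anyone who can read the Secret can recover the value in one command (the value here is made up):

```shell
# base64 is an encoding, not encryption: it reverses with one command.
echo -n 's3cr3t' | base64            # prints: czNjcjN0
echo -n 'czNjcjN0' | base64 -d       # prints: s3cr3t
```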
We integrated with external secrets management (Vault, cloud provider secrets) and used operators to sync secrets into Kubernetes. This added moving parts.
### Local Development Needs Attention
Developers need to test against Kubernetes-like environments. Options:
- Minikube/Kind for local clusters
- Telepresence for hybrid local/remote development
- Dedicated dev namespaces in shared clusters
Each has tradeoffs. We settled on dedicated namespaces with automated deployment from feature branches.
## Operational Practices

### GitOps
All configuration lives in Git. Changes merge to main and trigger deployment automatically. No manual kubectl commands in production.
Benefits:
- Audit trail for all changes
- Peer review before deployment
- Easy rollback via git revert
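That rollback path really is just git; here's a sketch in a throwaway repo standing in for the config repo:

```shell
# Sketch: rolling back a bad config change is one revert commit,
# which the GitOps pipeline then deploys like any other change.
cd "$(mktemp -d)"
git init -q && git config user.email ci@example.com && git config user.name ci
echo "replicas: 3" > deployment.yaml
git add . && git commit -qm "scale api to 3"
echo "replicas: 30" > deployment.yaml          # the bad change
git commit -qam "scale api to 30"
git revert --no-edit HEAD >/dev/null           # the rollback commit
cat deployment.yaml                            # prints: replicas: 3
```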
### Health Checks Matter
Proper readiness and liveness probes are essential:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
- Liveness: Restart pod if it’s stuck
- Readiness: Don’t send traffic until ready
Bad health checks cause cascading failures. We test health check behavior explicitly.
### Resource Quotas and Limits
Namespaces should have quotas preventing runaway resource consumption:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
```
Without quotas, one team’s mistake can affect the entire cluster.
### Pod Disruption Budgets
For high-availability services, PDBs ensure updates don’t take down too many replicas:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```
### Monitoring and Alerting
Kubernetes-specific monitoring matters:
- Cluster health (control plane, etcd)
- Node health (resources, conditions)
- Pod health (restarts, status, resources)
- Application metrics (RED metrics per service: rate, errors, duration)
Prometheus and Grafana are the de facto standard. Invest in dashboards and alerts specific to Kubernetes operations.
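As one concrete example, here's a sketch of a Prometheus alerting rule for crash-looping pods; it assumes kube-state-metrics is installed, and the threshold is only a starting point:

```yaml
# Alert when containers restart repeatedly -- usually a crash loop.
# Requires kube-state-metrics; tune the threshold to your workloads.
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```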
## Was It Worth It?
A year in: yes, but with caveats.
For whom it’s worth it:
- Teams deploying many services
- Organizations with container expertise or willingness to learn
- Workloads that benefit from dynamic scheduling
- Teams with resources to invest in Kubernetes operations
For whom it might not be:
- Small teams with few services
- Organizations without container expertise
- Simple workloads that don’t need orchestration
- Teams that can’t invest in learning and operations
Kubernetes solves real problems at scale. For smaller operations, simpler solutions might be more appropriate. Evaluate honestly whether you have the problems Kubernetes solves.
## Key Takeaways
- Declarative configuration, self-healing, and rolling updates deliver real value
- Networking, resource configuration, and debugging are harder than expected
- Invest in understanding Kubernetes internals, not just usage
- Consider managed Kubernetes to reduce operational burden
- Implement GitOps for configuration management
- Proper health checks, resource limits, and monitoring are essential, not optional
- Evaluate honestly whether your scale justifies Kubernetes complexity