Kubernetes in Production: A Year of Lessons Learned

January 16, 2017

A year ago, we migrated our first production services to Kubernetes. The promise was compelling: declarative deployment, self-healing systems, efficient resource utilization, and simplified operations. The reality was more nuanced. Here’s what we learned.

What Worked Well

Declarative Configuration

Kubernetes’ declarative model—describing desired state and letting the system converge—is genuinely powerful. Deployments describe what we want; Kubernetes figures out how to get there.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: api
        image: our-registry/api:v1.2.3
        resources:
          requests:           # what the scheduler reserves for this container
            memory: "256Mi"
            cpu: "250m"
          limits:             # hard ceiling enforced at runtime
            memory: "512Mi"
            cpu: "500m"

Configuration in version control means every change is reviewable, environments are reproducible from the repository, and rollback is a matter of reverting a commit.

Self-Healing

Kubernetes automatically replaces failed pods, reschedules workloads from failed nodes, and respects health checks. This happened invisibly—we’d see pods restart in logs but users experienced no impact.

Before Kubernetes, server failures required human intervention. Now the system handles most failures automatically. The on-call experience improved measurably.

Resource Efficiency

Bin-packing multiple services onto nodes improved utilization. Services that previously needed dedicated servers share infrastructure. We run similar workloads on fewer machines.

Rolling Updates

Zero-downtime deployments became trivial:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1          # at most one extra pod during the rollout
    maxUnavailable: 0    # never drop below the desired replica count

Old approach: maintenance windows, careful coordination, anxious engineers. New approach: merge and deploy; the rollout happens automatically and safely.

What Surprised Us

Networking Complexity

Kubernetes networking is complex. CNI plugins, service mesh considerations, network policies, ingress controllers—each decision has tradeoffs we didn’t initially understand.

We spent significant time debugging problems that cut across these layers, from pod-to-pod connectivity to ingress routing.

Investment: The networking abstraction is powerful but requires deep understanding when things go wrong.
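
To make the tradeoffs concrete, here is a minimal NetworkPolicy sketch that restricts ingress to the api pods; the labels and port are illustrative, and the policy only takes effect if the chosen CNI plugin actually enforces policies, which is exactly the kind of coupling that caught us out.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels:
      app: api              # pods this policy protects
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend     # only frontend pods may connect
    ports:
    - protocol: TCP
      port: 8080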

Resource Requests and Limits

Getting resource configuration right took iteration: requests set too low overcommit nodes and invite evictions, limits set too low cause OOM kills and CPU throttling, and values set too high waste capacity.

We ended up profiling applications under realistic load to set appropriate values. Default values from examples rarely matched our needs.
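
One piece that helped, sketched below with illustrative numbers rather than our real values, is a namespace-level LimitRange so that containers deployed without explicit requests or limits still get sane defaults.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - type: Container
    defaultRequest:         # applied when a container specifies no requests
      cpu: "100m"
      memory: "128Mi"
    default:                # applied when a container specifies no limits
      cpu: "500m"
      memory: "512Mi"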

Stateful Workloads

Kubernetes is designed for stateless workloads. Stateful applications (databases, caches, queues) require additional machinery: StatefulSets, PersistentVolumes and storage classes, and a real plan for backups and failover.

We kept databases outside Kubernetes initially. Over time, we moved some stateful workloads in, but it required more planning than stateless services.
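
For the workloads we did move, the shape is roughly the StatefulSet below: stable pod identities plus a PersistentVolumeClaim per replica. This is a sketch; the image, sizes, and names are placeholders, not our actual configuration.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache
spec:
  serviceName: cache        # headless Service that gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
      - name: cache
        image: redis:3.2    # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:     # one PersistentVolumeClaim per replica, retained across rescheduling
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi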

Debugging Difficulty

When things go wrong, debugging Kubernetes is harder than debugging traditional deployments. Problems could be in the application, the container image, pod scheduling, the network layer, or the cluster itself.

The layers of abstraction that provide power also obscure what’s happening. We invested in logging, tracing, and Kubernetes-specific observability.

Hard Lessons

Start with Cluster Operations Expertise

We initially treated Kubernetes as a black box. That worked until it didn’t. Cluster upgrades, node failures, and capacity planning required understanding we didn’t have.

Recommendation: Invest in understanding Kubernetes internals before depending on it for production. Someone on the team needs deep knowledge.

Managed Kubernetes Is Worth It

We started self-managing clusters. It consumed significant engineering time: etcd maintenance, control plane upgrades, node management, security patching.

Switching to managed Kubernetes (GKE, EKS, AKS) removed significant operational burden. For most teams, managed services are worth the cost.

YAML Management Gets Out of Hand

YAML configuration proliferates. Without discipline, you end up with copy-pasted, inconsistent, hard-to-maintain configurations.

We adopted Helm for templating and Kustomize for environment-specific variations. This reduced duplication but added tooling complexity.
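
As a rough sketch of the Kustomize side, an environment overlay references a shared base and patches only what differs; the directory layout and values here are illustrative.

# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../base                # shared Deployment, Service, and friends
patchesStrategicMerge:
- replica-count.yaml        # production-only overrides

# overlays/production/replica-count.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6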

Secrets Management Is Harder

Kubernetes Secrets are base64 encoded, not encrypted. They’re visible to anyone with cluster access. This isn’t adequate for sensitive data.
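
To make that concrete: the manifest below is an ordinary Secret, and the value is nothing more than base64 of a plain string that anyone with read access can decode (the key and value are made up for illustration).

apiVersion: v1
kind: Secret
metadata:
  name: api-credentials
type: Opaque
data:
  api-key: c3VwZXItc2VjcmV0    # base64 of "super-secret", trivially reversible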

We integrated with external secrets management (Vault, cloud provider secrets) and used operators to sync secrets into Kubernetes. This added moving parts.

Local Development Needs Attention

Developers need to test against Kubernetes-like environments. Options include local clusters (minikube or similar), a shared development cluster, and per-developer namespaces in a shared cluster.

Each has tradeoffs. We settled on dedicated namespaces with automated deployment from feature branches.

Operational Practices

GitOps

All configuration lives in Git. Changes merge to main, trigger deployment. No manual kubectl commands in production.

Benefits: no drift between what is declared and what is running, a complete audit trail of every change, and deployments that don't depend on whoever ran kubectl last.
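
We won't prescribe a tool here, but as one illustration of what "Git is the source of truth" can look like, this is a minimal Argo CD Application that keeps a namespace synced to a path in a config repository; the repository URL, paths, and namespaces are hypothetical.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/k8s-config   # hypothetical config repo
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # remove resources that were deleted from Git
      selfHeal: true    # revert manual changes back to the Git state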

Health Checks Matter

Proper readiness and liveness probes are essential:

livenessProbe:            # a failing liveness probe restarts the container
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:           # a failing readiness probe removes the pod from Service endpoints
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Bad health checks cause cascading failures. We test health check behavior explicitly.

Resource Quotas and Limits

Namespaces should have quotas preventing runaway resource consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"

Without quotas, one team’s mistake can affect the entire cluster.

Pod Disruption Budgets

For high-availability services, PDBs ensure updates don’t take down too many replicas:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2         # keep at least two pods up during voluntary disruptions
  selector:
    matchLabels:
      app: api

Monitoring and Alerting

Kubernetes-specific monitoring matters: pod restarts and crash loops, pending pods that cannot schedule, node resource pressure, and control plane health.

Prometheus and Grafana are the de facto standard. Invest in dashboards and alerts specific to Kubernetes operations.
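
As a sketch of the kind of Kubernetes-specific alert we mean, the Prometheus rule below fires on crash-looping pods; it assumes kube-state-metrics is installed, and the threshold and window are illustrative.

groups:
- name: kubernetes-workloads
  rules:
  - alert: PodCrashLooping
    # kube-state-metrics exposes container restart counts
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"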

Was It Worth It?

A year in: yes, but with caveats.

For whom it's worth it: teams running many services at a scale where scheduling, self-healing, and rolling deployments pay for the added complexity, and who can staff the platform work.

For whom it might not be: small teams with a handful of services that fit comfortably on a few servers and a simpler deployment pipeline.

Kubernetes solves real problems at scale. For smaller operations, simpler solutions might be more appropriate. Evaluate honestly whether you have the problems Kubernetes solves.

Key Takeaways