Kubernetes in Production: Lessons Learned After Two Years

January 22, 2018

Two years ago, we made the decision to move our production workloads to Kubernetes. It was early—Kubernetes 1.2 was current, and the ecosystem was immature. We made mistakes, learned lessons, and eventually built a platform that serves us well.

Here’s what we’ve learned that might save you time and pain.

What We Got Right

Starting with Non-Critical Workloads

We didn’t migrate production on day one. Our progression:

  1. Development environments (months 1-3)
  2. CI/CD infrastructure (months 3-6)
  3. Internal tools (months 6-9)
  4. Non-critical production (months 9-12)
  5. Critical production (months 12+)

This allowed us to learn incrementally, make mistakes safely, and build operational expertise before the stakes were high.

Investing in Observability Early

We prioritized observability from the start, getting metrics and tracing in place before production workloads arrived.

When things break in Kubernetes (and they will), observability is what saves you. Debugging distributed systems without metrics and traces is nearly impossible.

Standardizing on Helm Early

We adopted Helm for packaging applications. Despite its complexity, having a standard deployment format across all applications paid dividends.

The alternatives (raw manifests, or kustomize, which didn’t exist yet) would have left us with inconsistent, hard-to-maintain configurations.
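To make “standard deployment format” concrete: every chart can expose the same small set of values, so deploying a new service is mostly a matter of filling in fields. The layout below is an illustrative sketch, not our actual chart:

# values.yaml for a typical service chart (illustrative field names)
image:
  repository: registry.example.com/api
  tag: "1.4.2"
replicaCount: 3
resources:
  requests:
    cpu: 100m
    memory: 256Mi
service:
  port: 8080
ingress:
  enabled: true
  host: api.example.com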

What We Got Wrong

Underestimating Networking Complexity

Kubernetes networking is complex. We underestimated this badly.

CNI selection: We started with flannel for simplicity but hit performance issues and eventually migrated to Calico. This migration was painful.

Network policies: We didn’t implement network policies initially, leaving all pods able to communicate with all others. Adding network policies retroactively meant auditing every service’s communication patterns.
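Starting with a default-deny posture per namespace, then whitelisting flows as services come online, avoids that retroactive audit. A minimal sketch of the approach (namespace and labels are illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080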

DNS issues: CoreDNS (then kube-dns) scaling issues caused intermittent resolution failures under load. We spent weeks debugging what seemed like random application failures.

Lesson learned: Invest time understanding Kubernetes networking deeply before going to production. It’s not something you can learn incrementally.

Resource Requests and Limits

Getting resource requests and limits right is harder than it appears.

Initial approach: We set conservative limits based on guesses. Result: OOMKills for some services, wasted resources for others.

What we learned:

resources:
  requests:
    memory: "256Mi"  # Must be based on actual usage
    cpu: "100m"      # Determines scheduling
  limits:
    memory: "512Mi"  # OOMKill boundary
    cpu: "1000m"     # Throttling boundary (usually omit)

Key insights: requests drive scheduling, the memory limit is the OOMKill boundary, and CPU limits add throttling (we usually omit them). All of these numbers have to come from measured usage, not guesses.

We now require services to run in staging under realistic load, with resource profiling, before they are promoted to production.
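One tool that helps with that profiling step is the Vertical Pod Autoscaler running in recommendation-only mode: it watches real usage and suggests requests without touching the pods. A sketch, assuming the VPA components are installed in the cluster and a Deployment named api:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods

kubectl describe vpa api-vpa then shows the recommendations, which can be compared against what the manifests declare.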

StatefulSet Complexity

We tried running stateful workloads on Kubernetes too early.

What we tried: PostgreSQL, Redis, Elasticsearch on Kubernetes.

Where we landed: Kubernetes excels at stateless workloads. Stateful workloads are possible, but they require significant operational investment.
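To give a sense of what that investment involves, even a minimal stateful deployment brings extra moving parts: a headless Service, stable pod identities, and per-pod volumes. An illustrative sketch for something Redis-like (names and sizes are made up):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-headless   # requires a separate headless Service
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:4.0
        ports:
        - containerPort: 6379
  volumeClaimTemplates:         # one PersistentVolumeClaim per pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

And that is before backups, failover, and version upgrades, which is where most of the ongoing work actually lives.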

Ignoring Pod Disruption Budgets

We didn’t configure PodDisruptionBudgets initially. When we needed to drain nodes for maintenance, we discovered services going fully offline because every replica could be evicted at once.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # Or maxUnavailable: 1
  selector:
    matchLabels:
      app: api

PDBs are essential for safe cluster operations. Configure them for all production services.

Operational Lessons

Cluster Upgrades Are Major Events

Upgrading Kubernetes versions is not trivial.

Our approach:

  1. Read every release note and changelog
  2. Test upgrade in staging with production-like load
  3. Schedule maintenance window
  4. Upgrade control plane, then node pools incrementally
  5. Have rollback plan ready

We budget significant time for each upgrade cycle.

Multi-Cluster Is Worth It

Running multiple clusters provides isolation: a bad upgrade, a misconfiguration, or a capacity problem stays contained to one cluster.

The operational overhead of multiple clusters is lower than you might think, especially with infrastructure as code.

GitOps Changed Everything

Adopting GitOps (we use ArgoCD) transformed our deployment workflow.

Before GitOps, deployments were imperative commands. With GitOps, deployments are Git commits. The improvement in reliability and auditability is substantial.
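Concretely, each service becomes an ArgoCD Application that points at a path in a Git repository, and the controller keeps the cluster in sync with whatever is committed there. A sketch with illustrative repo URL and paths:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual changes made in the cluster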

Security Lessons

RBAC From Day One

Kubernetes RBAC is complex but essential. We initially ran with overly permissive configurations and had to tighten later.

Minimum viable RBAC:
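A reasonable starting point is namespace-scoped Roles bound to the specific groups or service accounts that need them, with cluster-wide rights reserved for a small operator group. An illustrative example (names are made up):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-deployer
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-deployer-binding
  namespace: production
subjects:
- kind: Group
  name: team-api           # illustrative group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-deployer
  apiGroup: rbac.authorization.k8s.io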

Pod Security Policies (Now Pod Security Standards)

Running containers as root, with host networking, or with privileged access creates security risks. Enforce restrictions:

apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true

Note: Pod Security Policies are deprecated in 1.21 and removed in 1.25. Plan for Pod Security Standards or alternatives like OPA Gatekeeper.
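Under Pod Security Standards, enforcement becomes a namespace label rather than a cluster-scoped policy object, for example:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted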

Secrets Management

Kubernetes Secrets are base64 encoded, not encrypted (by default). We learned this the hard way when secrets appeared in logs.

Better approaches:
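The usual options are encrypting Secrets at rest in etcd, sealing them so they can safely live in Git (Sealed Secrets and similar tools), or keeping them in an external store such as Vault and injecting them at runtime. As one concrete piece, encryption at rest is an API server configuration along these lines; in practice the key should come from a KMS, and managed platforms typically expose this as an envelope-encryption option instead:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}          # fallback for reading existing unencrypted data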

Performance Lessons

Java Applications Need Tuning

JVM-based applications required special attention:

containers:
- name: java-app
  env:
  - name: JAVA_OPTS
    value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
  resources:
    requests:
      memory: "1Gi"
    limits:
      memory: "1Gi"

Without container-aware JVM settings (available since Java 8u191), the JVM sizes its heap from host memory rather than the container limit, and may be OOMKilled.

DNS Caching Matters

Kubernetes DNS can become a bottleneck. Applications making many external DNS queries can overwhelm CoreDNS.

Mitigations:
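Common mitigations are lowering ndots so external lookups don’t fan out across the cluster search domains, caching DNS close to the application (or with a node-local cache), and scaling CoreDNS with cluster size. The ndots change is a per-pod setting (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"      # default is 5; fewer search-domain expansions for external names
  containers:
  - name: app
    image: registry.example.com/api:1.4.2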

Horizontal Pod Autoscaler Tuning

Default HPA settings are rarely optimal:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # illustrative; the Deployment this HPA scales
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Prevent flapping

Tune based on your application’s scaling characteristics. The defaults optimize for different use cases than yours.

What We’d Do Differently

Managed Kubernetes Earlier

We ran self-managed Kubernetes initially (kubeadm). Managing the control plane is significant work. We eventually moved to managed Kubernetes (EKS), and the reduction in operational burden was substantial.

Self-managed makes sense when you need deep control over the control plane or run somewhere a managed offering can’t reach.

Managed makes sense when, as for us, the value is in what runs on the cluster rather than in running the cluster itself.

Service Mesh Evaluation Timing

We adopted a service mesh (Istio) later than we should have. mTLS, traffic management, and observability features would have helped earlier.

However, we also see organizations adopting service mesh too early, before they understand base Kubernetes. Find the right timing for your journey.

More Investment in Developer Experience

We focused on infrastructure before developer experience. This slowed adoption.

What developers need is a paved path: templates, documentation, and tooling that let them ship services without becoming Kubernetes experts.

Better developer experience earlier would have driven faster adoption.

Key Takeaways

Kubernetes has been transformative for our infrastructure. The lessons came at a cost, but the resulting platform is reliable, scalable, and efficient. Learn from our mistakes.