Kubernetes in Production: Lessons Learned After Two Years

January 22, 2018

Two years ago, we made the decision to move our production workloads to Kubernetes. It was early—Kubernetes 1.2 was current, and the ecosystem was immature. We made mistakes, learned lessons, and eventually built a platform that serves us well.

Here’s what we’ve learned that might save you time and pain.

What We Got Right

Starting with Non-Critical Workloads

We didn’t migrate production on day one. Our progression:

  1. Development environments (months 1-3)
  2. CI/CD infrastructure (months 3-6)
  3. Internal tools (months 6-9)
  4. Non-critical production (months 9-12)
  5. Critical production (months 12+)

This allowed us to learn incrementally, make mistakes safely, and build operational expertise before the stakes were high.

Investing in Observability Early

We prioritized observability from the start, getting metrics and tracing in place before production workloads arrived.

When things break in Kubernetes (and they will), observability is what saves you. Debugging distributed systems without metrics and traces is nearly impossible.

Standardizing on Helm Early

We adopted Helm for packaging applications. Despite its complexity, having a standard deployment format across all applications paid dividends.

The alternatives (raw manifests, or kustomize, which didn’t exist yet) would have left us with inconsistent, hard-to-maintain configurations.
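To make “standard deployment format” concrete: every chart can expose the same small set of values, so deploying a new service is mostly a matter of filling in fields. The layout below is an illustrative sketch, not our actual chart:

# values.yaml for a typical service chart (illustrative field names)
image:
  repository: registry.example.com/api
  tag: "1.4.2"
replicaCount: 3
resources:
  requests:
    cpu: 100m
    memory: 256Mi
service:
  port: 8080
ingress:
  enabled: true
  host: api.example.com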

What We Got Wrong

Underestimating Networking Complexity

Kubernetes networking is complex. We underestimated this badly.

CNI selection: We started with flannel for simplicity but hit performance issues and eventually migrated to Calico. This migration was painful.

Network policies: We didn’t implement network policies initially, leaving all pods able to communicate with all others. Adding network policies retroactively meant auditing every service’s communication patterns.
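Starting with a default-deny posture per namespace, then whitelisting flows as services come online, avoids that retroactive audit. A minimal sketch of the approach (namespace and labels are illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080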

DNS issues: CoreDNS (then kube-dns) scaling issues caused intermittent resolution failures under load. We spent weeks debugging what seemed like random application failures.

Lesson learned: Invest time understanding Kubernetes networking deeply before going to production. It’s not something you can learn incrementally.

Resource Requests and Limits

Getting resource requests and limits right is harder than it appears.

Initial approach: We set conservative limits based on guesses. Result: OOMKills for some services, wasted resources for others.

What we learned:

resources:
  requests:
    memory: "256Mi"  # Must be based on actual usage
    cpu: "100m"      # Determines scheduling
  limits:
    memory: "512Mi"  # OOMKill boundary
    cpu: "1000m"     # Throttling boundary (usually omit)

Key insights: requests drive scheduling, the memory limit is the OOMKill boundary, and CPU limits add throttling (we usually omit them). All of these numbers have to come from measured usage, not guesses.

We now require services to run in staging under realistic load, with resource profiling, before they are promoted to production.
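One tool that helps with that profiling step is the Vertical Pod Autoscaler running in recommendation-only mode: it watches real usage and suggests requests without touching the pods. A sketch, assuming the VPA components are installed in the cluster and a Deployment named api:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # recommend only; never evict or resize pods

kubectl describe vpa api-vpa then shows the recommendations, which can be compared against what the manifests declare.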

StatefulSet Complexity

We tried running stateful workloads on Kubernetes too early.

What we tried: PostgreSQL, Redis, Elasticsearch on Kubernetes.

Where we landed: Kubernetes excels at stateless workloads. Stateful workloads are possible, but they require significant operational investment.
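To give a sense of what that investment involves, even a minimal stateful deployment brings extra moving parts: a headless Service, stable pod identities, and per-pod volumes. An illustrative sketch for something Redis-like (names and sizes are made up):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis-headless   # requires a separate headless Service
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:4.0
        ports:
        - containerPort: 6379
  volumeClaimTemplates:         # one PersistentVolumeClaim per pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

And that is before backups, failover, and version upgrades, which is where most of the ongoing work actually lives.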

Ignoring Pod Disruption Budgets

We didn’t configure PodDisruptionBudgets initially. When we needed to drain nodes for maintenance, we discovered services going fully offline because every replica could be evicted at once.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # Or maxUnavailable: 1
  selector:
    matchLabels:
      app: api

PDBs are essential for safe cluster operations. Configure them for all production services.

Operational Lessons

Cluster Upgrades Are Major Events

Upgrading Kubernetes versions is not trivial.

Our approach:

  1. Read every release note and changelog
  2. Test upgrade in staging with production-like load
  3. Schedule maintenance window
  4. Upgrade control plane, then node pools incrementally
  5. Have rollback plan ready

We budget significant time for each upgrade cycle.

Multi-Cluster Is Worth It

Running multiple clusters provides isolation: a bad upgrade, a misconfiguration, or a capacity problem stays contained to one cluster.

The operational overhead of multiple clusters is lower than you might think, especially with infrastructure as code.

GitOps Changed Everything

Adopting GitOps (we use ArgoCD) transformed our deployment workflow.

Before GitOps, deployments were imperative commands. With GitOps, deployments are Git commits. The improvement in reliability and auditability is substantial.
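Concretely, each service becomes an ArgoCD Application that points at a path in a Git repository, and the controller keeps the cluster in sync with whatever is committed there. A sketch with illustrative repo URL and paths:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments.git
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual changes made in the cluster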

Security Lessons

RBAC From Day One

Kubernetes RBAC is complex but essential. We initially ran with overly permissive configurations and had to tighten later.

Minimum viable RBAC:
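A reasonable starting point is namespace-scoped Roles bound to the specific groups or service accounts that need them, with cluster-wide rights reserved for a small operator group. An illustrative example (names are made up):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-deployer
  namespace: production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "update", "patch"]
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-deployer-binding
  namespace: production
subjects:
- kind: Group
  name: team-api           # illustrative group name
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-deployer
  apiGroup: rbac.authorization.k8s.io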

Pod Security Policies (Now Pod Security Standards)

Running containers as root, with host networking, or with privileged access creates security risks. Enforce restrictions:

apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
  - name: app
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true

Note: Pod Security Policies are deprecated in 1.21 and removed in 1.25. Plan for Pod Security Standards or alternatives like OPA Gatekeeper.
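Under Pod Security Standards, enforcement becomes a namespace label rather than a cluster-scoped policy object, for example:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted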

Secrets Management

Kubernetes Secrets are base64 encoded, not encrypted (by default). We learned this the hard way when secrets appeared in logs.

Better approaches:
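The usual options are encrypting Secrets at rest in etcd, sealing them so they can safely live in Git (Sealed Secrets and similar tools), or keeping them in an external store such as Vault and injecting them at runtime. As one concrete piece, encryption at rest is an API server configuration along these lines; in practice the key should come from a KMS, and managed platforms typically expose this as an envelope-encryption option instead:

apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: <base64-encoded 32-byte key>
  - identity: {}          # fallback for reading existing unencrypted data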

Performance Lessons

Java Applications Need Tuning

JVM-based applications required special attention:

containers:
- name: java-app
  env:
  - name: JAVA_OPTS
    value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
  resources:
    requests:
      memory: "1Gi"
    limits:
      memory: "1Gi"

Without container-aware JVM settings (available since Java 8u191), the JVM sizes its heap from host memory rather than the container limit, and may be OOMKilled.

DNS Caching Matters

Kubernetes DNS can become a bottleneck. Applications making many external DNS queries can overwhelm CoreDNS.

Mitigations:
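Common mitigations are lowering ndots so external lookups don’t fan out across the cluster search domains, caching DNS close to the application (or with a node-local cache), and scaling CoreDNS with cluster size. The ndots change is a per-pod setting (pod name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: api
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"      # default is 5; fewer search-domain expansions for external names
  containers:
  - name: app
    image: registry.example.com/api:1.4.2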

Horizontal Pod Autoscaler Tuning

Default HPA settings are rarely optimal:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api          # illustrative; the Deployment this HPA scales
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Prevent flapping

Tune based on your application’s scaling characteristics. The defaults optimize for different use cases than yours.

What We’d Do Differently

Managed Kubernetes Earlier

We ran self-managed Kubernetes initially (kubeadm). Managing the control plane is significant work. We eventually moved to managed Kubernetes (EKS), and the reduction in operational burden was substantial.

Self-managed makes sense when you need deep control over the control plane or run somewhere a managed offering can’t reach.

Managed makes sense when, as for us, the value is in what runs on the cluster rather than in running the cluster itself.

Service Mesh Evaluation Timing

We adopted a service mesh (Istio) later than we should have. mTLS, traffic management, and observability features would have helped earlier.

However, we also see organizations adopting service mesh too early, before they understand base Kubernetes. Find the right timing for your journey.

More Investment in Developer Experience

We focused on infrastructure before developer experience. This slowed adoption.

What developers need is a paved path: templates, documentation, and tooling that let them ship services without becoming Kubernetes experts.

Better developer experience earlier would have driven faster adoption.

Key Takeaways

Kubernetes has been transformative for our infrastructure. The lessons came at a cost, but the resulting platform is reliable, scalable, and efficient. Learn from our mistakes.