Two years ago, we made the decision to move our production workloads to Kubernetes. It was early—Kubernetes 1.2 was current, and the ecosystem was immature. We made mistakes, learned lessons, and eventually built a platform that serves us well.
Here’s what we’ve learned that might save you time and pain.
What We Got Right
Starting with Non-Critical Workloads
We didn’t migrate production on day one. Our progression:
- Development environments (months 1-3)
- CI/CD infrastructure (months 3-6)
- Internal tools (months 6-9)
- Non-critical production (months 9-12)
- Critical production (months 12+)
This allowed us to learn incrementally, make mistakes safely, and build operational expertise before stakes were high.
Investing in Observability Early
We prioritized observability from the start:
- Prometheus + Grafana for metrics
- EFK stack (Elasticsearch, Fluentd, Kibana) for logs
- Jaeger for distributed tracing
When things break in Kubernetes (and they will), observability is what saves you. Debugging distributed systems without metrics and traces is nearly impossible.
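As a concrete example of how services plug into this stack: most of our workloads expose a /metrics endpoint that Prometheus discovers through pod annotations. A minimal sketch, assuming a Prometheus scrape configuration that honors the conventional prometheus.io/* annotations (the service name, image, and port are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
      annotations:
        prometheus.io/scrape: "true"   # read by the Prometheus scrape config, not by Kubernetes itself
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: api
          image: example/api:1.0.0
          ports:
            - containerPort: 8080
```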
Standardizing on Helm Early
We adopted Helm for packaging applications. Despite its complexity, having a standard deployment format across all applications paid dividends:
- Consistent deployment patterns
- Reusable charts for common patterns
- Easier onboarding of new services
The alternatives (raw manifests, kustomize—which didn’t exist yet) would have left us with inconsistent, hard-to-maintain configurations.
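To illustrate what that standardization looks like in practice: each service supplies only a small values file to a shared chart. The chart name and values below are a hypothetical sketch, not our actual chart:

```yaml
# values.yaml for a hypothetical service consuming a shared "web-service" chart
replicaCount: 3
image:
  repository: example/api
  tag: "1.4.2"
service:
  port: 8080
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
```

Deploying any service then reduces to `helm upgrade --install api ./charts/web-service -f values.yaml`, regardless of which team owns it.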
What We Got Wrong
Underestimating Networking Complexity
Kubernetes networking is complex. We underestimated this badly.
CNI selection: We started with flannel for simplicity but hit performance issues and eventually migrated to Calico. This migration was painful.
Network policies: We didn’t implement network policies initially, leaving all pods able to communicate with all others. Adding network policies retroactively meant auditing every service’s communication patterns.
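In hindsight, we would ship a default-deny policy in every namespace from day one and add explicit allows per service, rather than auditing traffic after the fact. A minimal sketch (namespace and labels are illustrative):

```yaml
# Deny all ingress to pods in the namespace by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: payments
spec:
  podSelector: {}          # selects every pod in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow the traffic each service actually needs
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```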
DNS issues: Scaling problems in kube-dns (CoreDNS in later Kubernetes versions) caused intermittent resolution failures under load. We spent weeks debugging what seemed like random application failures.
Lesson learned: Invest time understanding Kubernetes networking deeply before going to production. It’s not something you can learn incrementally.
Resource Requests and Limits
Getting resource requests and limits right is harder than it appears.
Initial approach: We set conservative limits based on guesses. Result: OOMKills for some services, wasted resources for others.
What we learned:
```yaml
resources:
  requests:
    memory: "256Mi"   # Must be based on actual usage
    cpu: "100m"       # Determines scheduling
  limits:
    memory: "512Mi"   # OOMKill boundary
    cpu: "1000m"      # Throttling boundary (usually omit)
```
Key insights:
- Memory limits should be close to requests (OOMKill is brutal)
- CPU limits often cause more harm than good (throttling creates latency)
- Requests must be based on actual measurements, not guesses
- Vertical Pod Autoscaler helps but isn’t magic
We now require services to run under realistic load in staging, with resource profiling, before they go to production.
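On the Vertical Pod Autoscaler point: we find it most useful in recommendation-only mode, as a source of measurements rather than an actor. A sketch, assuming the VPA components are installed in the cluster (the target name is illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"   # produce recommendations only; never evict pods
```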
StatefulSet Complexity
We tried running stateful workloads on Kubernetes too early.
What we tried: PostgreSQL, Redis, Elasticsearch on Kubernetes.
What happened:
- Storage provisioning complexity
- Backup and recovery challenges
- Operational complexity during failures
- Performance issues with some storage backends
Where we landed:
- Managed services for databases (RDS, Cloud SQL)
- Redis on Kubernetes only for caching (not persistence)
- Elasticsearch on dedicated infrastructure
Kubernetes excels at stateless workloads. Stateful workloads are possible but require significant operational investment.
Ignoring Pod Disruption Budgets
We didn’t configure PodDisruptionBudgets initially. When we needed to drain nodes for maintenance, we discovered services going fully offline during rolling updates.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2   # Or maxUnavailable: 1
  selector:
    matchLabels:
      app: api
```
PDBs are essential for safe cluster operations. Configure them for all production services.
Operational Lessons
Cluster Upgrades Are Major Events
Upgrading Kubernetes versions is not trivial:
- API deprecations break manifests
- Node upgrades require workload migration
- etcd upgrades require careful handling
- CNI and other components may need updates
Our approach:
- Read every release note and changelog
- Test upgrade in staging with production-like load
- Schedule maintenance window
- Upgrade control plane, then node pools incrementally
- Have rollback plan ready
We budget significant time for each upgrade cycle.
Multi-Cluster Is Worth It
Running multiple clusters provides:
- Blast radius isolation
- Upgrade testing path
- Regional deployment options
- Easier disaster recovery
The operational overhead of multiple clusters is lower than you might think, especially with infrastructure as code.
GitOps Changed Everything
Adopting GitOps (we use ArgoCD) transformed our deployment workflow:
- All configuration in Git
- Automatic sync from Git to cluster
- Audit trail for all changes
- Easy rollback via Git revert
Before GitOps, deployments were imperative commands. With GitOps, deployments are Git commits. The improvement in reliability and auditability is substantial.
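A typical Application manifest looks roughly like the sketch below; the repository URL, paths, and namespaces are placeholders rather than our real setup:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/deployments.git
    targetRevision: main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```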
Security Lessons
RBAC From Day One
Kubernetes RBAC is complex but essential. We initially ran with overly permissive configurations and had to tighten later.
Minimum viable RBAC:
- Separate namespaces per team/application
- Service accounts per application (not default)
- Role bindings limited to necessary resources
- No cluster-admin for regular operations
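A sketch of the per-application baseline this implies (names, namespace, and resource lists are illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: payments
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-role
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["configmaps", "secrets"]
    verbs: ["get", "list", "watch"]   # read-only, and only what the app needs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-rolebinding
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: api
    namespace: payments
roleRef:
  kind: Role
  name: api-role
  apiGroup: rbac.authorization.k8s.io
```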
Pod Security Policies (Now Pod Security Standards)
Running containers as root, with host networking, or with privileged access creates security risks. Enforce restrictions:
```yaml
apiVersion: v1
kind: Pod
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
  containers:
    - name: app
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
```
Note: Pod Security Policies were deprecated in 1.21 and removed in 1.25. Plan for Pod Security Standards or alternatives like OPA Gatekeeper.
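With the built-in Pod Security admission controller (stable in 1.25), the equivalent enforcement is a set of namespace labels rather than a cluster-scoped policy object. For example, to enforce the restricted profile on a namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject pods that violate the restricted profile
    pod-security.kubernetes.io/warn: restricted      # warn clients on apply
    pod-security.kubernetes.io/audit: restricted     # record violations in audit logs
```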
Secrets Management
Kubernetes Secrets are base64 encoded, not encrypted (by default). We learned this the hard way when secrets appeared in logs.
Better approaches:
- Enable encryption at rest for etcd
- External secrets management (HashiCorp Vault, AWS Secrets Manager)
- Sealed Secrets or similar for GitOps
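To make the first point concrete: on a self-managed control plane, encryption at rest is configured through a file referenced by the API server. A sketch with a placeholder key; on managed offerings you typically enable envelope encryption (for example with a cloud KMS) through the provider rather than editing this file:

```yaml
# Referenced by the kube-apiserver flag --encryption-provider-config
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64-encoded 32-byte key>   # placeholder; generate with a secure RNG
      - identity: {}   # fallback so existing, unencrypted secrets remain readable
```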
Performance Lessons
Java Applications Need Tuning
JVM-based applications required special attention:
```yaml
containers:
  - name: java-app
    env:
      - name: JAVA_OPTS
        value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
    resources:
      requests:
        memory: "1Gi"
      limits:
        memory: "1Gi"
```
Without container-aware JVM settings (available since Java 8u191), the JVM sizes itself against host memory rather than the container limit and may be OOMKilled.
DNS Caching Matters
Kubernetes DNS can become a bottleneck. Applications making many external DNS queries can overwhelm CoreDNS.
Mitigations:
- Node-local DNS cache
- Application-level DNS caching
- CoreDNS scaling and tuning
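As one concrete tuning lever: the CoreDNS cache TTL lives in the Corefile, stored in the coredns ConfigMap in kube-system. The sketch below is trimmed, and the exact default Corefile varies by distribution and version:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 60          # cache answers for up to 60s instead of the common default of 30s
        loop
        reload
        loadbalance
    }
```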
Horizontal Pod Autoscaler Tuning
Default HPA settings are rarely optimal:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:           # the workload being scaled
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # Prevent flapping
```
Tune based on your application’s scaling characteristics. The defaults optimize for different use cases than yours.
What We’d Do Differently
Managed Kubernetes Earlier
We ran self-managed Kubernetes initially (kubeadm). Managing the control plane is significant work. We eventually moved to managed Kubernetes (EKS), and the reduction in operational burden was substantial.
Self-managed makes sense when:
- You need capabilities managed offerings don’t provide
- You’re running on-premises
- You have significant Kubernetes expertise
Managed makes sense when:
- You want to focus on applications, not infrastructure
- You don’t have deep Kubernetes expertise
- You’re running in a cloud that offers it
Service Mesh Evaluation Timing
We adopted a service mesh (Istio) later than we should have. Its mTLS, traffic management, and observability features would have helped earlier.
However, we also see organizations adopting service mesh too early, before they understand base Kubernetes. Find the right timing for your journey.
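For reference, the mTLS piece is small once the mesh is running. This sketch enforces strict mutual TLS mesh-wide, assuming Istio's default root namespace of istio-system:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # applying in the root namespace makes it mesh-wide
spec:
  mtls:
    mode: STRICT             # reject plaintext traffic between workloads
```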
More Investment in Developer Experience
We focused on infrastructure before developer experience. This slowed adoption.
What developers need:
- Simple deployment workflows
- Local development that matches production
- Clear documentation and examples
- Self-service for common operations
Better developer experience earlier would have driven faster adoption.
Key Takeaways
- Start with non-critical workloads and progress gradually to production
- Invest heavily in observability from day one
- Understand networking deeply before going to production
- Base resource requests on measurements, not guesses
- Keep stateful workloads on managed services unless you have strong reasons not to
- Configure PodDisruptionBudgets for all production services
- Plan significant time for cluster upgrades
- Adopt GitOps for reliable, auditable deployments
- Use managed Kubernetes unless you have specific reasons for self-managed
- Prioritize developer experience for adoption
Kubernetes has been transformative for our infrastructure. The lessons came at a cost, but the resulting platform is reliable, scalable, and efficient. Learn from our mistakes.