A year ago, we migrated our first production services to Kubernetes. The promise was compelling: declarative deployment, self-healing systems, efficient resource utilization, and simplified operations. The reality was more nuanced. Here’s what we learned.
## What Worked Well

### Declarative Configuration
Kubernetes’ declarative model—describing desired state and letting the system converge—is genuinely powerful. Deployments describe what we want; Kubernetes figures out how to get there.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: our-registry/api:v1.2.3
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```
Configuration in version control means:
- Changes are reviewed before application
- History shows what changed when
- Rollback is straightforward
### Self-Healing
Kubernetes automatically replaces failed pods, reschedules workloads from failed nodes, and respects health checks. This happened invisibly—we’d see pods restart in logs but users experienced no impact.
Before Kubernetes, server failures required human intervention. Now the system handles most failures automatically. The on-call experience improved measurably.
### Resource Efficiency
Bin-packing multiple services onto nodes improved utilization. Services that previously needed dedicated servers share infrastructure. We run similar workloads on fewer machines.
### Rolling Updates
Zero-downtime deployments became trivial:
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
```
Old approach: maintenance windows, careful coordination, anxious engineers. New approach: merge, deploy, and let Kubernetes roll the change out safely.
## What Surprised Us

### Networking Complexity
Kubernetes networking is complex. CNI plugins, service mesh considerations, network policies, ingress controllers—each decision has tradeoffs we didn’t initially understand.
We spent significant time debugging:
- Service discovery issues
- DNS resolution delays
- Network policy mistakes that broke communication
- Ingress configuration that dropped requests
The networking abstraction is powerful, but it demands a real investment: when things go wrong, you need to understand what's underneath it.
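To see how small the margin for error is, consider a default-deny policy; a sketch like the following (the name is illustrative) silently blocks all ingress to every pod in its namespace until matching allow policies exist:

```yaml
# Default-deny ingress for a namespace. Easy to apply, easy to forget:
# every service in the namespace now needs an explicit allow policy,
# or traffic silently stops flowing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}       # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress           # no ingress rules listed, so all ingress is denied
```

Several of our outages traced back to exactly this pattern: a deny policy applied without the corresponding allows.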
### Resource Requests and Limits
Getting resource configuration right took iteration:
- Requests too low: nodes overcommit, and pods get evicted or starved under pressure
- Limits too low: legitimate traffic spikes cause OOM kills and CPU throttling
- Requests too high: capacity is wasted and pods become hard to schedule
We ended up profiling applications under realistic load to set appropriate values. Default values from examples rarely matched our needs.
### Stateful Workloads
Kubernetes is designed for stateless workloads. Stateful applications (databases, caches, queues) require additional machinery:
- Persistent volumes and storage classes
- StatefulSets for ordered deployment and stable network identities
- Careful consideration of failure modes
We kept databases outside Kubernetes initially. Over time, we moved some stateful workloads in, but it required more planning than stateless services.
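For reference, that extra machinery looks roughly like this; a minimal StatefulSet sketch (the name, image, and storage size are illustrative):

```yaml
# Sketch of a StatefulSet: stable pod names (cache-0, cache-1, ...),
# ordered rollout, and a per-replica PersistentVolumeClaim that
# survives pod restarts and rescheduling.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cache
spec:
  serviceName: cache          # headless Service providing stable DNS identities
  replicas: 3
  selector:
    matchLabels:
      app: cache
  template:
    metadata:
      labels:
        app: cache
    spec:
      containers:
        - name: cache
          image: our-registry/cache:v1.0.0
          volumeMounts:
            - name: data
              mountPath: /var/lib/cache
  volumeClaimTemplates:       # one PVC per replica, retained across restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```

Compared with a Deployment, every piece here (the headless Service, the claim template, the ordered rollout) is another failure mode to reason about.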
### Debugging Difficulty
When things go wrong, debugging Kubernetes is harder than debugging traditional deployments. The fault can hide in any of several layers:
- Application code
- Container configuration
- Kubernetes scheduling
- Networking
- Storage
- Node issues
The layers of abstraction that provide power also obscure what’s happening. We invested in logging, tracing, and Kubernetes-specific observability.
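When triaging, we found it helps to walk those layers top to bottom with kubectl (the pod and node names here are illustrative):

```
# Work down the layers: application -> pod -> scheduling -> node.
kubectl logs api-7d4b9c-xk2p1 --previous       # app logs from the crashed container
kubectl describe pod api-7d4b9c-xk2p1          # events: image pulls, OOM kills, probe failures
kubectl get events --sort-by=.metadata.creationTimestamp   # scheduling and cluster events
kubectl describe node worker-3                 # node conditions, allocated resources
kubectl exec -it api-7d4b9c-xk2p1 -- sh        # shell into the container for network checks
```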
## Hard Lessons

### Start with Cluster Operations Expertise
We initially treated Kubernetes as a black box. That worked until it didn’t. Cluster upgrades, node failures, and capacity planning required understanding we didn’t have.
Recommendation: Invest in understanding Kubernetes internals before depending on it for production. Someone on the team needs deep knowledge.
### Managed Kubernetes Is Worth It
We started self-managing clusters. It consumed significant engineering time: etcd maintenance, control plane upgrades, node management, security patching.
Switching to managed Kubernetes (GKE, EKS, AKS) removed significant operational burden. For most teams, managed services are worth the cost.
### YAML Management Gets Out of Hand
YAML configuration proliferates. Without discipline, you end up with copy-pasted, inconsistent, hard-to-maintain configurations.
We adopted Helm for templating and Kustomize for environment-specific variations. This reduced duplication but added tooling complexity.
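As a sketch of what that tooling looks like in practice (paths and names are illustrative), a Kustomize overlay layers environment-specific changes over a shared base:

```yaml
# overlays/production/kustomization.yaml -- layers production-specific
# changes on top of a shared base instead of copy-pasting manifests.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # shared Deployment, Service, etc.
patches:
  - path: replica-count.yaml   # e.g. bump replicas for production
images:
  - name: our-registry/api
    newTag: v1.2.3             # pin the image tag per environment
```

The duplication went away, but now every change involves reasoning about base plus overlays rather than reading one file.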
### Secrets Management Is Harder

Kubernetes Secrets are base64 encoded, not encrypted; by default they sit unencrypted in etcd and are readable by anyone with sufficient cluster access. This isn't adequate for sensitive data.
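The encoding is trivially reversible; anyone who can read the Secret can recover the value in one command (the value here is made up):

```shell
# base64 is an encoding, not encryption: it reverses with one command.
echo -n 's3cr3t' | base64            # prints: czNjcjN0
echo -n 'czNjcjN0' | base64 -d       # prints: s3cr3t
```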
We integrated with external secrets management (Vault, cloud provider secrets) and used operators to sync secrets into Kubernetes. This added moving parts.
### Local Development Needs Attention
Developers need to test against Kubernetes-like environments. Options:
- Minikube/Kind for local clusters
- Telepresence for hybrid local/remote development
- Dedicated dev namespaces in shared clusters
Each has tradeoffs. We settled on dedicated namespaces with automated deployment from feature branches.
## Operational Practices

### GitOps
All configuration lives in Git. Changes merge to main and trigger deployment automatically. No manual kubectl commands in production.
Benefits:
- Audit trail for all changes
- Peer review before deployment
- Easy rollback via git revert
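That rollback path really is just git; here's a sketch in a throwaway repo standing in for the config repo:

```shell
# Sketch: rolling back a bad config change is one revert commit,
# which the GitOps pipeline then deploys like any other change.
cd "$(mktemp -d)"
git init -q && git config user.email ci@example.com && git config user.name ci
echo "replicas: 3" > deployment.yaml
git add . && git commit -qm "scale api to 3"
echo "replicas: 30" > deployment.yaml          # the bad change
git commit -qam "scale api to 30"
git revert --no-edit HEAD >/dev/null           # the rollback commit
cat deployment.yaml                            # prints: replicas: 3
```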
### Health Checks Matter
Proper readiness and liveness probes are essential:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
- Liveness: Restart pod if it’s stuck
- Readiness: Don’t send traffic until ready
Bad health checks cause cascading failures. We test health check behavior explicitly.
### Resource Quotas and Limits
Namespaces should have quotas preventing runaway resource consumption:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
spec:
  hard:
    requests.cpu: "10"
    requests.memory: "20Gi"
    limits.cpu: "20"
    limits.memory: "40Gi"
```
Without quotas, one team’s mistake can affect the entire cluster.
### Pod Disruption Budgets
For high-availability services, PDBs ensure updates don’t take down too many replicas:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api
```
### Monitoring and Alerting
Kubernetes-specific monitoring matters:
- Cluster health (control plane, etcd)
- Node health (resources, conditions)
- Pod health (restarts, status, resources)
- Application metrics (RED metrics per service: rate, errors, duration)
Prometheus and Grafana are the de facto standard. Invest in dashboards and alerts specific to Kubernetes operations.
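As one concrete example, here's a sketch of a Prometheus alerting rule for crash-looping pods; it assumes kube-state-metrics is installed, and the threshold is only a starting point:

```yaml
# Alert when containers restart repeatedly -- usually a crash loop.
# Requires kube-state-metrics; tune the threshold to your workloads.
groups:
  - name: kubernetes-pods
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```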
## Was It Worth It?
A year in: yes, but with caveats.
For whom it’s worth it:
- Teams deploying many services
- Organizations with container expertise or willingness to learn
- Workloads that benefit from dynamic scheduling
- Teams with resources to invest in Kubernetes operations
For whom it might not be:
- Small teams with few services
- Organizations without container expertise
- Simple workloads that don’t need orchestration
- Teams that can’t invest in learning and operations
Kubernetes solves real problems at scale. For smaller operations, simpler solutions might be more appropriate. Evaluate honestly whether you have the problems Kubernetes solves.
## Key Takeaways
- Declarative configuration, self-healing, and rolling updates deliver real value
- Networking, resource configuration, and debugging are harder than expected
- Invest in understanding Kubernetes internals, not just usage
- Consider managed Kubernetes to reduce operational burden
- Implement GitOps for configuration management
- Proper health checks, resource limits, and monitoring are essential, not optional
- Evaluate honestly whether your scale justifies Kubernetes complexity