Kubernetes makes it easy to request resources. Too easy. Without attention, clusters become massively over-provisioned. I’ve seen clusters running at 15% utilization while costing hundreds of thousands per month. Cost optimization is possible without sacrificing reliability.
Here’s how to right-size Kubernetes workloads.
The Over-Provisioning Problem
Why It Happens
over_provisioning_causes:
  developer_behavior:
    - "I'll just request 4GB to be safe"
    - Copy-paste from examples
    - No feedback on actual usage
  fear:
    - OOMKilled trauma
    - Don't want the pager to go off
    - Easier to over-provision
  no_visibility:
    - Can't see actual usage
    - No cost attribution
    - No accountability
  cluster_admin:
    - Large nodes for flexibility
    - No resource quotas
    - No enforcement
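Before building anything, the gap is easy to confirm on a live cluster by comparing what nodes have committed against what pods actually use. A quick check, assuming metrics-server is installed (kubectl top needs it) and substituting a real node name:

# Requests/limits committed on a node (always available)
kubectl describe node <node-name> | grep -A 7 "Allocated resources"

# Actual usage right now (requires metrics-server)
kubectl top node
kubectl top pods -A --sort-by=cpu | head -20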
The Cost Impact
typical_scenario:
  requested: 4 CPU, 8GB memory per pod
  actual_usage: 0.5 CPU, 1GB memory
  utilization: 12.5%
  waste: 87.5% of spend

at_scale:
  100_pods: wasting ~$50k/year
  1000_pods: wasting ~$500k/year
Measuring Utilization
Key Metrics
utilization_metrics:
  cpu:
    requested: sum(kube_pod_container_resource_requests{resource="cpu"})
    used: sum(rate(container_cpu_usage_seconds_total[5m]))
    utilization: used / requested
  memory:
    requested: sum(kube_pod_container_resource_requests{resource="memory"})
    used: sum(container_memory_working_set_bytes)
    utilization: used / requested
  node:
    allocatable: kube_node_status_allocatable
    requested: sum by node (pod requests)
    utilization: requested / allocatable
Prometheus Queries
# CPU utilization (actual vs requested)
# container!="" excludes the pod-level and pause-container series, which would double-count usage
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m]))
/
sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"})

# Memory utilization
sum(container_memory_working_set_bytes{namespace="production", container!=""})
/
sum(kube_pod_container_resource_requests{namespace="production", resource="memory"})

# Containers requesting at least 0.5 CPU more than they use (24h average)
# Aggregating both sides to the same labels makes the subtraction match correctly
(
  sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
  -
  sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[24h]))
) > 0.5
Dashboard
# Grafana dashboard panels
utilization_dashboard:
  cluster_overview:
    - Total CPU requested vs used
    - Total memory requested vs used
    - Cost estimate (CPU * price + memory * price)
  by_namespace:
    - CPU utilization per namespace
    - Memory utilization per namespace
    - Cost per namespace
  by_workload:
    - Top 10 over-provisioned deployments
    - Utilization heatmap
    - Recommendations
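For the "Top 10 over-provisioned deployments" panel, a true per-deployment ranking needs a kube_pod_owner join; a simpler per-namespace version of the same idea, usable as a Grafana table or bar panel, is sketched here:

# Top 10 namespaces by CPU cores requested but unused (24h average)
topk(10,
  sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
  -
  sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[24h]))
)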
Right-Sizing
Request/Limit Strategy
resource_strategy:
  requests:
    purpose: Scheduling and guaranteed resources
    guidance: Set to actual measured usage (queries below)
    cpu: p95 of actual usage
    memory: Peak usage + 10-25% buffer
  limits:
    cpu:
      recommendation: Often not needed
      rationale: CPU is compressible; throttling slows the pod instead of killing it
    memory:
      recommendation: Set one to prevent runaway memory use
      guidance: 2x requests, or based on max observed
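The p95 CPU and peak memory figures can be pulled from Prometheus over a representative window (a week or more, covering peak traffic). A sketch using the standard cAdvisor metrics; the namespace="production" and container="api" selectors are placeholders for your own workload:

# p95 of CPU usage over the last 7 days (subquery: 5m rate, sampled every 5m)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="production", container="api"}[5m])[7d:5m]
)

# Peak memory working set over the last 7 days
max_over_time(
  container_memory_working_set_bytes{namespace="production", container="api"}[7d]
)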
Right-Sizing Example
# Before: copy-pasted values
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

# After actual measurement (p95 CPU: 200m, peak memory: 800MB)
resources:
  requests:
    cpu: "250m"     # p95 + buffer
    memory: "1Gi"   # peak + 25%
  limits:
    memory: "2Gi"   # prevent runaway
    # No CPU limit - let it burst
VPA (Vertical Pod Autoscaler)
# VPA for automatic right-sizing recommendations
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "50m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
# Check VPA recommendations
kubectl describe vpa api-server-vpa

# Output:
#   Recommendation:
#     Container Recommendations:
#       Container Name: api
#       Lower Bound:    Cpu: 50m,  Memory: 200Mi
#       Target:         Cpu: 200m, Memory: 500Mi
#       Upper Bound:    Cpu: 500m, Memory: 1Gi
Node Optimization
Right-Size Nodes
node_strategy:
  small_nodes:
    pros:
      - Finer-grained scaling, so less capacity stranded per node
      - Lower blast radius when a node fails
      - Matches small workloads
    cons:
      - More per-node overhead (kubelet, system daemons, reserved resources)
      - Can't fit large pods
  large_nodes:
    pros:
      - Less relative overhead
      - Fits any workload
    cons:
      - Coarser scaling increments
      - Higher blast radius
      - An underutilized node strands more capacity
  recommendation:
    - Mix of node sizes
    - Node pools per workload type (sketched below)
    - Match nodes to workload profiles
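One way to express "node pools per workload type", sketched with eksctl managed node groups; the pool names, instance types, and labels are illustrative, and required fields such as the region are omitted for brevity:

# Illustrative eksctl node pools matched to workload profiles
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
managedNodeGroups:
  - name: general                  # typical stateless services
    instanceType: m5.large
    minSize: 2
    maxSize: 20
    labels:
      workload-type: general
  - name: memory-heavy             # caches, JVM services
    instanceType: r5.xlarge
    minSize: 0
    maxSize: 10
    labels:
      workload-type: memory-heavy
    taints:
      - key: workload-type
        value: memory-heavy
        effect: NoSchedule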
Spot/Preemptible Instances
spot_strategy:
  suitable_workloads:
    - Stateless applications with replicas
    - Batch jobs
    - Dev/staging environments
    - Workloads tolerant of interruption
  implementation:
    - Separate node pool for spot
    - Node affinity / tolerations
    - Multiple instance types
    - Graceful shutdown handling
  savings: 60-90% vs on-demand
# Spot node pool configuration (EKS)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
nodeGroups:
  - name: spot-workers
    instancesDistribution:
      instanceTypes: ["m5.large", "m5a.large", "m4.large"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    labels:
      node-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
# Workload configuration for spot
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - spot
      terminationGracePeriodSeconds: 30
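Graceful shutdown handling pairs well with a PodDisruptionBudget, so that node drains (including those triggered by a spot termination handler) never evict too many replicas at once. A minimal sketch; the app: api-server selector is an assumed pod label, so match it to your deployment:

# Limit how many replicas a node drain can evict at once
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 1          # or use maxUnavailable for larger deployments
  selector:
    matchLabels:
      app: api-server      # assumed pod label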
Autoscaling
HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Cluster Autoscaler
# Cluster autoscaler configuration
cluster_autoscaler:
scale_down:
enabled: true
delay_after_add: 10m
unneeded_time: 10m
utilization_threshold: 0.5
scale_up:
enabled: true
expander: least-waste
node_groups:
- name: workers
min: 2
max: 50
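That conceptual config maps onto real Cluster Autoscaler flags. The corresponding container args would look roughly like this, with static node-group registration for the "workers" group from above (auto-discovery is the usual alternative):

# cluster-autoscaler container args corresponding to the settings above
- --expander=least-waste
- --scale-down-enabled=true
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
- --scale-down-utilization-threshold=0.5
- --nodes=2:50:workers            # min:max:node-group-name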
Resource Quotas
Namespace Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
LimitRanges
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      max:
        cpu: "4"
        memory: "8Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      type: Container
Cost Visibility
Cost Allocation
cost_allocation:
  labels:
    required:
      - team
      - environment
      - application
    enforced: via OPA/Kyverno (policy sketch below)
  tools:
    - Kubecost
    - OpenCost
    - Cloud provider cost explorer
  reports:
    - Cost by namespace
    - Cost by team
    - Cost by application
    - Efficiency score
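As one way to enforce the required labels, a Kyverno policy sketch; the policy name and label set are illustrative, and an equivalent OPA Gatekeeper constraint works just as well:

# Illustrative Kyverno policy: reject pods missing cost-allocation labels
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-cost-labels
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must have team, environment, and application labels."
        pattern:
          metadata:
            labels:
              team: "?*"
              environment: "?*"
              application: "?*"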
Key Takeaways
- Measure utilization: you can’t optimize what you don’t measure
- Right-size requests based on actual usage, not guesses
- Consider removing CPU limits for most workloads
- Use VPA for recommendations, even if not auto-applying
- Mix node sizes to match workload profiles
- Use spot instances for suitable workloads (60-90% savings)
- Implement HPA to scale with demand
- Set resource quotas to prevent runaway costs
- Enforce labels for cost allocation
- Review and optimize monthly
Cost optimization and reliability aren’t opposites. Over-provisioned clusters often have worse reliability because problems are hidden by slack resources.