Kubernetes makes it easy to request resources. Too easy. Without attention, clusters become massively over-provisioned. I’ve seen clusters running at 15% utilization while costing hundreds of thousands per month. Cost optimization is possible without sacrificing reliability.
Here’s how to right-size Kubernetes workloads.
The Over-Provisioning Problem
Why It Happens
over_provisioning_causes:
  developer_behavior:
    - "I'll just request 4GB to be safe"
    - Copy-paste from examples
    - No feedback on actual usage
  fear:
    - OOMKilled trauma
    - Don't want the pager to go off
    - Easier to over-provision
  no_visibility:
    - Can't see actual usage
    - No cost attribution
    - No accountability
  cluster_admin:
    - Large nodes for flexibility
    - No resource quotas
    - No enforcement
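Before building anything, the gap is easy to confirm on a live cluster by comparing what nodes have committed against what pods actually use. A quick check, assuming metrics-server is installed (kubectl top needs it) and substituting a real node name:

# Requests/limits committed on a node (always available)
kubectl describe node <node-name> | grep -A 7 "Allocated resources"

# Actual usage right now (requires metrics-server)
kubectl top node
kubectl top pods -A --sort-by=cpu | head -20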
The Cost Impact
typical_scenario:
  requested: 4 CPU, 8GB memory per pod
  actual_usage: 0.5 CPU, 1GB memory
  utilization: 12.5%
  waste: 87.5% of spend

at_scale:
  100_pods: wasting ~$50k/year
  1000_pods: wasting ~$500k/year
Measuring Utilization
Key Metrics
utilization_metrics:
  cpu:
    requested: sum(kube_pod_container_resource_requests{resource="cpu"})
    used: sum(rate(container_cpu_usage_seconds_total[5m]))
    utilization: used / requested
  memory:
    requested: sum(kube_pod_container_resource_requests{resource="memory"})
    used: sum(container_memory_working_set_bytes)
    utilization: used / requested
  node:
    allocatable: kube_node_status_allocatable
    requested: sum by node (pod requests)
    utilization: requested / allocatable
Prometheus Queries
# CPU utilization (actual vs requested)
# container!="" excludes the pod-level and pause-container series, which would double-count usage
sum(rate(container_cpu_usage_seconds_total{namespace="production", container!=""}[5m]))
/
sum(kube_pod_container_resource_requests{namespace="production", resource="cpu"})

# Memory utilization
sum(container_memory_working_set_bytes{namespace="production", container!=""})
/
sum(kube_pod_container_resource_requests{namespace="production", resource="memory"})

# Containers requesting at least 0.5 CPU more than they use (24h average)
# Aggregating both sides to the same labels makes the subtraction match correctly
(
  sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
  -
  sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[24h]))
) > 0.5
Dashboard
# Grafana dashboard panels
utilization_dashboard:
  cluster_overview:
    - Total CPU requested vs used
    - Total memory requested vs used
    - Cost estimate (CPU * price + memory * price)
  by_namespace:
    - CPU utilization per namespace
    - Memory utilization per namespace
    - Cost per namespace
  by_workload:
    - Top 10 over-provisioned deployments
    - Utilization heatmap
    - Recommendations
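For the "Top 10 over-provisioned deployments" panel, a true per-deployment ranking needs a kube_pod_owner join; a simpler per-namespace version of the same idea, usable as a Grafana table or bar panel, is sketched here:

# Top 10 namespaces by CPU cores requested but unused (24h average)
topk(10,
  sum by (namespace) (kube_pod_container_resource_requests{resource="cpu"})
  -
  sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[24h]))
)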
Right-Sizing
Request/Limit Strategy
resource_strategy:
  requests:
    purpose: Scheduling and guaranteed resources
    guidance: Set to actual measured usage (queries below)
    cpu: p95 of actual usage
    memory: Peak usage + 10-25% buffer
  limits:
    cpu:
      recommendation: Often not needed
      rationale: CPU is compressible; throttling slows the pod instead of killing it
    memory:
      recommendation: Set one to prevent runaway memory use
      guidance: 2x requests, or based on max observed
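The p95 CPU and peak memory figures can be pulled from Prometheus over a representative window (a week or more, covering peak traffic). A sketch using the standard cAdvisor metrics; the namespace="production" and container="api" selectors are placeholders for your own workload:

# p95 of CPU usage over the last 7 days (subquery: 5m rate, sampled every 5m)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{namespace="production", container="api"}[5m])[7d:5m]
)

# Peak memory working set over the last 7 days
max_over_time(
  container_memory_working_set_bytes{namespace="production", container="api"}[7d]
)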
Right-Sizing Example
# Before: copy-pasted values
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "4"
    memory: "8Gi"

# After actual measurement (p95 CPU: 200m, peak memory: 800MB)
resources:
  requests:
    cpu: "250m"     # p95 + buffer
    memory: "1Gi"   # peak + 25%
  limits:
    memory: "2Gi"   # prevent runaway
    # No CPU limit - let it burst
VPA (Vertical Pod Autoscaler)
# VPA for automatic right-sizing recommendations
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # Recommendation only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: "50m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2"
          memory: "4Gi"
# Check VPA recommendations
kubectl describe vpa api-server-vpa

# Output:
#   Recommendation:
#     Container Recommendations:
#       Container Name: api
#       Lower Bound:    Cpu: 50m,  Memory: 200Mi
#       Target:         Cpu: 200m, Memory: 500Mi
#       Upper Bound:    Cpu: 500m, Memory: 1Gi
Node Optimization
Right-Size Nodes
node_strategy:
  small_nodes:
    pros:
      - Finer-grained scaling, so less capacity stranded per node
      - Lower blast radius when a node fails
      - Matches small workloads
    cons:
      - More per-node overhead (kubelet, system daemons, reserved resources)
      - Can't fit large pods
  large_nodes:
    pros:
      - Less relative overhead
      - Fits any workload
    cons:
      - Coarser scaling increments
      - Higher blast radius
      - An underutilized node strands more capacity
  recommendation:
    - Mix of node sizes
    - Node pools per workload type (sketched below)
    - Match nodes to workload profiles
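One way to express "node pools per workload type", sketched with eksctl managed node groups; the pool names, instance types, and labels are illustrative, and required fields such as the region are omitted for brevity:

# Illustrative eksctl node pools matched to workload profiles
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
managedNodeGroups:
  - name: general                  # typical stateless services
    instanceType: m5.large
    minSize: 2
    maxSize: 20
    labels:
      workload-type: general
  - name: memory-heavy             # caches, JVM services
    instanceType: r5.xlarge
    minSize: 0
    maxSize: 10
    labels:
      workload-type: memory-heavy
    taints:
      - key: workload-type
        value: memory-heavy
        effect: NoSchedule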
Spot/Preemptible Instances
spot_strategy:
  suitable_workloads:
    - Stateless applications with replicas
    - Batch jobs
    - Dev/staging environments
    - Workloads tolerant of interruption
  implementation:
    - Separate node pool for spot
    - Node affinity / tolerations
    - Multiple instance types
    - Graceful shutdown handling
  savings: 60-90% vs on-demand
# Spot node pool configuration (EKS)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
nodeGroups:
  - name: spot-workers
    instancesDistribution:
      instanceTypes: ["m5.large", "m5a.large", "m4.large"]
      onDemandBaseCapacity: 0
      onDemandPercentageAboveBaseCapacity: 0
      spotAllocationStrategy: "capacity-optimized"
    labels:
      node-type: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule
# Workload configuration for spot
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values:
                      - spot
      terminationGracePeriodSeconds: 30
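Graceful shutdown handling pairs well with a PodDisruptionBudget, so that node drains (including those triggered by a spot termination handler) never evict too many replicas at once. A minimal sketch; the app: api-server selector is an assumed pod label, so match it to your deployment:

# Limit how many replicas a node drain can evict at once
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 1          # or use maxUnavailable for larger deployments
  selector:
    matchLabels:
      app: api-server      # assumed pod label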
Autoscaling
HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
Cluster Autoscaler
# Cluster autoscaler configuration
cluster_autoscaler:
scale_down:
enabled: true
delay_after_add: 10m
unneeded_time: 10m
utilization_threshold: 0.5
scale_up:
enabled: true
expander: least-waste
node_groups:
- name: workers
min: 2
max: 50
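That conceptual config maps onto real Cluster Autoscaler flags. The corresponding container args would look roughly like this, with static node-group registration for the "workers" group from above (auto-discovery is the usual alternative):

# cluster-autoscaler container args corresponding to the settings above
- --expander=least-waste
- --scale-down-enabled=true
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
- --scale-down-utilization-threshold=0.5
- --nodes=2:50:workers            # min:max:node-group-name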
Resource Quotas
Namespace Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
LimitRanges
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "256Mi"
      max:
        cpu: "4"
        memory: "8Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      type: Container
Cost Visibility
Cost Allocation
cost_allocation:
  labels:
    required:
      - team
      - environment
      - application
    enforced: via OPA/Kyverno (policy sketch below)
  tools:
    - Kubecost
    - OpenCost
    - Cloud provider cost explorer
  reports:
    - Cost by namespace
    - Cost by team
    - Cost by application
    - Efficiency score
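As one way to enforce the required labels, a Kyverno policy sketch; the policy name and label set are illustrative, and an equivalent OPA Gatekeeper constraint works just as well:

# Illustrative Kyverno policy: reject pods missing cost-allocation labels
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-cost-labels
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must have team, environment, and application labels."
        pattern:
          metadata:
            labels:
              team: "?*"
              environment: "?*"
              application: "?*"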
Key Takeaways
- Measure utilization: you can’t optimize what you don’t measure
- Right-size requests based on actual usage, not guesses
- Consider removing CPU limits for most workloads
- Use VPA for recommendations, even if not auto-applying
- Mix node sizes to match workload profiles
- Use spot instances for suitable workloads (60-90% savings)
- Implement HPA to scale with demand
- Set resource quotas to prevent runaway costs
- Enforce labels for cost allocation
- Review and optimize monthly
Cost optimization and reliability aren’t opposites. Over-provisioned clusters often have worse reliability because problems are hidden by slack resources.