Cloud computing promised cost savings. The reality: industry surveys routinely estimate that 30-40% of cloud spend is wasted. The pay-as-you-go model makes it easy to spin up resources and forget about them. Cloud cost management is now a critical discipline.
Here’s how to optimize cloud costs systematically.
The Cost Problem
Why Cloud Bills Grow
cost_growth_factors:
  easy_provisioning:
    - No procurement delays
    - Self-service encourages experimentation
    - Resources accumulate
  complexity:
    - Pricing models are complex
    - Hidden costs (data transfer, API calls)
    - Hundreds of services
  lack_of_ownership:
    - No one responsible for costs
    - Developers provision, finance pays
    - Disconnect from business value
  overprovisioning:
    - Fear of under-sizing
    - "It might need it"
    - Default to large instances
Visibility First
Cost Allocation
tagging_strategy:
  required_tags:
    - environment: prod/staging/dev
    - team: engineering/data/ml
    - service: order-api/payment-service
    - cost-center: CC-1234
    - owner: team-email
  enforcement:
    - Tag policies (AWS Organizations)
    - Deny untagged resources
    - Automated tagging via IaC
# Terraform: Enforce tags
variable "required_tags" {
  type = map(string)

  validation {
    condition = alltrue([
      contains(keys(var.required_tags), "environment"),
      contains(keys(var.required_tags), "team"),
      contains(keys(var.required_tags), "service"),
    ])
    error_message = "Required tags: environment, team, service."
  }
}

resource "aws_instance" "example" {
  ami           = var.ami_id
  instance_type = var.instance_type

  tags = merge(var.required_tags, {
    Name = var.name
  })
}
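Because the rule is a variable validation, a missing tag fails at plan time, before anything is provisioned. Pair it with an AWS Organizations tag policy so resources created outside Terraform get caught too.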
Cost Dashboards
dashboard_components:
  overview:
    - Total spend vs budget
    - Month-over-month trend
    - Forecast for month end
  breakdown:
    - By service (EC2, RDS, S3)
    - By team/cost center
    - By environment
  anomalies:
    - Spike detection
    - New high-cost resources
    - Unusual patterns
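Spike detection doesn't require a vendor tool to get started. A minimal sketch using the Cost Explorer API (assumes Cost Explorer is enabled and the caller has ce:GetCostAndUsage; the dates and the 25% threshold are illustrative):

# Month-over-month spend by service via Cost Explorer (sketch)
import boto3

def spend_by_service(start, end):
    ce = boto3.client('ce')
    resp = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},  # 'YYYY-MM-DD'
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}],
    )
    costs = {}
    for period in resp['ResultsByTime']:
        for group in period['Groups']:
            service = group['Keys'][0]
            amount = float(group['Metrics']['UnblendedCost']['Amount'])
            costs[service] = costs.get(service, 0.0) + amount
    return costs

# Flag services whose spend jumped more than 25% month over month
current = spend_by_service('2024-05-01', '2024-06-01')
previous = spend_by_service('2024-04-01', '2024-05-01')
for service, cost in current.items():
    prior = previous.get(service, 0.0)
    if prior and cost / prior > 1.25:
        print(f"ANOMALY: {service} up {cost / prior - 1:.0%} (${cost:,.0f})")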
Right-Sizing
Instance Analysis
rightsizing_approach:
  collect_metrics:
    - CPU utilization (avg, max, p99)
    - Memory utilization
    - Network throughput
    - Disk I/O
  analysis_period: 14-30 days minimum
  decision_criteria:
    underutilized:
      cpu_avg: "< 20%"
      memory_avg: "< 30%"
      action: Downsize or consider serverless
    right_sized:
      cpu_avg: "20-60%"
      memory_avg: "40-70%"
      action: Monitor, no change needed
    constrained:
      cpu_avg: "> 80%"
      memory_avg: "> 85%"
      action: Upsize or optimize application
# Right-sizing analysis script
import boto3
from datetime import datetime, timedelta, timezone

def estimate_savings(instance_type):
    # Placeholder: compare the current type's price against the next
    # size down (e.g., via the AWS Pricing API) and return the delta.
    return 'n/a (see pricing API)'

def analyze_instance(instance_id, days=14):
    cloudwatch = boto3.client('cloudwatch')
    ec2 = boto3.client('ec2')

    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(days=days)

    # Get hourly CPU metrics over the analysis window
    cpu_stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,
        Statistics=['Average', 'Maximum']
    )
    datapoints = cpu_stats['Datapoints']
    if not datapoints:
        # Stopped or brand-new instance: nothing to analyze
        return {'instance_id': instance_id, 'recommendation': 'no data'}

    avg_cpu = sum(p['Average'] for p in datapoints) / len(datapoints)
    max_cpu = max(p['Maximum'] for p in datapoints)

    # Get instance details
    instance = ec2.describe_instances(InstanceIds=[instance_id])
    instance_type = instance['Reservations'][0]['Instances'][0]['InstanceType']

    # Recommend downsizing only when the average is low AND peaks are modest
    if avg_cpu < 20 and max_cpu < 50:
        return {
            'instance_id': instance_id,
            'current_type': instance_type,
            'recommendation': 'downsize',
            'avg_cpu': avg_cpu,
            'potential_savings': estimate_savings(instance_type)
        }
    return {'instance_id': instance_id, 'recommendation': 'keep'}
Database Right-Sizing
rds_optimization:
  metrics_to_watch:
    - DatabaseConnections
    - CPUUtilization
    - FreeableMemory
    - ReadIOPS/WriteIOPS
  common_issues:
    oversized_instances:
      symptom: Low CPU, high cost
      action: Downsize instance class
    underprovisioned_storage:
      symptom: High I/O wait
      action: Increase IOPS or use gp3
    unnecessary_multi_az:
      symptom: Dev/staging with Multi-AZ
      action: Disable for non-prod
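The Multi-AZ check is easy to automate. A minimal sketch with boto3, assuming the tagging scheme from earlier (an environment tag on every instance):

# Find Multi-AZ databases tagged as non-prod (sketch)
import boto3

rds = boto3.client('rds')
paginator = rds.get_paginator('describe_db_instances')
for page in paginator.paginate():
    for db in page['DBInstances']:
        if not db['MultiAZ']:
            continue
        tags = rds.list_tags_for_resource(
            ResourceName=db['DBInstanceArn'])['TagList']
        env = next((t['Value'] for t in tags if t['Key'] == 'environment'),
                   'unknown')
        if env != 'prod':
            print(f"{db['DBInstanceIdentifier']}: Multi-AZ but environment={env}")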
Reserved Capacity
Commitment Strategies
reservation_strategy:
  when_to_commit:
    - Stable, predictable workloads
    - Running 24/7 in production
    - At least 1 year planned usage
  commitment_levels:
    conservative:
      coverage: 50-60% of baseline
      term: 1 year
      payment: No upfront
      savings: ~20%
    moderate:
      coverage: 70-80% of baseline
      term: 1-3 years mixed
      payment: Partial upfront
      savings: ~30-40%
    aggressive:
      coverage: 90%+ of baseline
      term: 3 years
      payment: All upfront
      savings: ~50-60%
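Keep in mind that the headline discount applies only to the covered portion of the bill; the rest stays on demand. A quick back-of-the-envelope, using rough midpoints from the table above:

# Blended savings from partial commitment (illustrative arithmetic)
def blended_savings(coverage, discount):
    # The committed share gets the discount; the remainder is on-demand.
    return coverage * discount

for name, coverage, discount in [('conservative', 0.55, 0.20),
                                 ('moderate', 0.75, 0.35),
                                 ('aggressive', 0.90, 0.55)]:
    print(f"{name}: ~{blended_savings(coverage, discount):.0%} off the total bill")

That works out to roughly 11%, 26%, and 50% off the total compute bill, which is the number to weigh against the risk of being locked into a commitment you outgrow.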
Savings Plans vs Reserved Instances
comparison:
  reserved_instances:
    flexibility: Specific instance type and region
    discount: Higher (up to 72%)
    use_case: Known, stable workloads
  savings_plans:
    compute_savings_plan:
      flexibility: Any instance type, region, OS
      discount: Good (up to 66%)
      use_case: Flexible compute needs
    ec2_instance_savings_plan:
      flexibility: Instance family in a region
      discount: Higher than Compute SP
      use_case: Known region, flexible size
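AWS will compute a recommendation from your own usage history, so you don't have to size the commitment by hand. A sketch of pulling it programmatically (assumes Cost Explorer is enabled; treat the response parsing as indicative of the current API shape rather than definitive):

# Pull AWS's Savings Plans purchase recommendation (sketch)
import boto3

ce = boto3.client('ce')
resp = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType='COMPUTE_SP',
    TermInYears='ONE_YEAR',
    PaymentOption='NO_UPFRONT',
    LookbackPeriodInDays='THIRTY_DAYS',
)
summary = resp['SavingsPlansPurchaseRecommendation'][
    'SavingsPlansPurchaseRecommendationSummary']
print(f"Recommended commitment: ${summary['HourlyCommitmentToPurchase']}/hr")
print(f"Estimated monthly savings: ${summary['EstimatedMonthlySavingsAmount']}")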
Spot and Preemptible
Spot Instance Strategy
spot_strategy:
  suitable_workloads:
    - Batch processing
    - CI/CD workers
    - Stateless web tier (with fallback)
    - Data processing
  not_suitable:
    - Databases
    - Stateful services
    - Single points of failure
  best_practices:
    - Diversify instance types
    - Use multiple AZs
    - Handle interruption gracefully (see the sketch below)
    - Set appropriate max price
# Kubernetes spot node pool (Karpenter's legacy v1alpha5 API;
# newer Karpenter releases use a NodePool resource instead)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot-workers
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.xlarge", "m5a.large", "m5a.xlarge", "m6i.large"]
  limits:
    resources:
      cpu: 100
  ttlSecondsAfterEmpty: 30
  ttlSecondsUntilExpired: 604800  # 7 days
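Handling interruption gracefully, from the best-practices list above, comes down to watching for the two-minute notice. A minimal sketch that polls the EC2 instance metadata service; the actual shutdown action is app-specific, so a print stands in for it here:

# Watch for the spot interruption notice via IMDSv2 (sketch)
import time
import urllib.error
import urllib.request

IMDS = 'http://169.254.169.254/latest'

def imds_token():
    # IMDSv2 requires a session token for metadata reads
    req = urllib.request.Request(
        IMDS + '/api/token', method='PUT',
        headers={'X-aws-ec2-metadata-token-ttl-seconds': '21600'})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()

def interruption_pending(token):
    # Returns 200 with a JSON body once AWS schedules the stop; 404 otherwise
    req = urllib.request.Request(
        IMDS + '/meta-data/spot/instance-action',
        headers={'X-aws-ec2-metadata-token': token})
    try:
        with urllib.request.urlopen(req, timeout=2):
            return True
    except urllib.error.HTTPError:
        return False

token = imds_token()
while not interruption_pending(token):
    time.sleep(5)
# Real handlers would drain connections, checkpoint work, deregister
# from load balancers, etc., within the two-minute window
print("Interruption notice received; draining...")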
Storage Optimization
S3 Lifecycle Policies
s3_optimization:
  lifecycle_policy:
    - transition_to_ia:
        days: 30
        class: STANDARD_IA
        savings: ~45%
    - transition_to_glacier:
        days: 90
        class: GLACIER
        savings: ~80%
    - expire:
        days: 365
        action: Delete
        savings: 100%
{
  "Rules": [
    {
      "ID": "ArchiveOldLogs",
      "Status": "Enabled",
      "Filter": {
        "Prefix": "logs/"
      },
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        }
      ],
      "Expiration": {
        "Days": 365
      }
    }
  ]
}
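The same policy can be applied with boto3 instead of the console; a sketch with a hypothetical bucket name:

# Apply the lifecycle rules above programmatically (sketch)
import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='example-log-archive',  # hypothetical bucket
    LifecycleConfiguration={'Rules': [{
        'ID': 'ArchiveOldLogs',
        'Status': 'Enabled',
        'Filter': {'Prefix': 'logs/'},
        'Transitions': [
            {'Days': 30, 'StorageClass': 'STANDARD_IA'},
            {'Days': 90, 'StorageClass': 'GLACIER'},
        ],
        'Expiration': {'Days': 365},
    }]},
)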
EBS Optimization
ebs_optimization:
  volume_types:
    gp3:
      use_case: Most workloads
      advantage: Cheaper than gp2, configurable IOPS
    io2:
      use_case: High-performance databases
      consideration: Expensive, use only when needed
  cleanup:
    - Delete unattached volumes
    - Right-size over-provisioned volumes
    - Snapshot to S3 and delete old volumes
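Unattached volumes are the easiest win because they are unambiguous: status "available" means no instance is using them. A sketch that lists deletion candidates:

# Find unattached EBS volumes (sketch)
import boto3

ec2 = boto3.client('ec2')
paginator = ec2.get_paginator('describe_volumes')
for page in paginator.paginate(
        Filters=[{'Name': 'status', 'Values': ['available']}]):
    for vol in page['Volumes']:
        print(f"{vol['VolumeId']}: {vol['Size']} GiB {vol['VolumeType']}, "
              f"created {vol['CreateTime']:%Y-%m-%d}")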
Network Costs
Data Transfer Optimization
data_transfer_costs:
  expensive:
    - Cross-region transfer
    - Internet egress
    - Cross-AZ in some cases
  optimization:
    use_vpc_endpoints:
      benefit: Keeps S3/DynamoDB traffic off NAT gateways (no data processing charges)
      services: S3, DynamoDB, many others
    same_az_placement:
      benefit: Free data transfer
      trade_off: Reduced availability
    cdn_for_static:
      benefit: Cheaper egress from CloudFront
      savings: 40-60% vs direct S3
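Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge and take one API call to set up; the VPC and route table IDs below are hypothetical:

# Create a free S3 gateway endpoint (sketch)
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
ec2.create_vpc_endpoint(
    VpcId='vpc-0123456789abcdef0',          # hypothetical VPC
    ServiceName='com.amazonaws.us-east-1.s3',
    VpcEndpointType='Gateway',               # gateway endpoints are free
    RouteTableIds=['rtb-0123456789abcdef0'], # hypothetical route table
)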
FinOps Culture
Cost Accountability
finops_practices:
  showback:
    - Show teams their costs
    - Compare to budget
    - Trend visibility
  chargeback:
    - Actual billing to cost centers
    - Creates ownership
    - Requires accurate tagging
  cost_reviews:
    frequency: Weekly or bi-weekly
    participants: Engineering leads, finance
    agenda:
      - Review spend vs forecast
      - Identify anomalies
      - Discuss optimization opportunities
Developer Enablement
developer_tools:
  cost_visibility:
    - IDE plugins showing resource costs
    - PR comments with cost impact
    - Slack alerts for anomalies
  guardrails:
    - Instance type restrictions
    - Budget alerts
    - Auto-shutdown for dev environments (see the sketch below)
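Auto-shutdown is the highest-leverage guardrail and needs only a scheduled job. A sketch that stops every running instance tagged environment=dev, intended to run nightly via Lambda or cron:

# Nightly auto-shutdown for dev instances (sketch)
import boto3

ec2 = boto3.client('ec2')
resp = ec2.describe_instances(Filters=[
    {'Name': 'tag:environment', 'Values': ['dev']},
    {'Name': 'instance-state-name', 'Values': ['running']},
])
ids = [i['InstanceId']
       for r in resp['Reservations']
       for i in r['Instances']]
if ids:
    ec2.stop_instances(InstanceIds=ids)
    print(f"Stopped {len(ids)} dev instances: {ids}")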
Key Takeaways
- Tag everything for cost allocation and accountability
- Right-size based on actual utilization, not guesses
- Commit to reserved capacity for stable workloads (start by covering 50-70% of baseline)
- Use spot instances for fault-tolerant workloads
- Implement S3 lifecycle policies to reduce storage costs
- Optimize data transfer with VPC endpoints and CDN
- Build FinOps culture: visibility, accountability, optimization
- Review costs regularly with engineering and finance
- Automate cost alerts and enforcement
- Cost optimization is ongoing, not a one-time project
Cloud cost management is a discipline. Treat it like security or reliability—continuous attention.