With economic uncertainty squeezing tech budgets, infrastructure cost optimization has moved from nice-to-have to essential. But cost-cutting done wrong degrades reliability and slows development. The goal is efficiency, not just expense reduction.
Here are practical strategies that work.
Optimization Hierarchy
Where to Focus
optimization_priority:
  highest_impact:
    - Unused resources (immediate 100% savings)
    - Over-provisioned resources (50-70% savings)
    - Reserved capacity (30-50% savings)
  medium_impact:
    - Architecture optimization (20-40% savings)
    - Storage tiering (20-50% savings)
    - Network optimization (10-30% savings)
  lower_impact:
    - Instance type optimization (5-20% savings)
    - Spot instances (60-90% savings, limited scope)
    - Region selection (10-20% savings)
Quick Wins
Identify Waste
waste_identification:
  unused_resources:
    compute:
      - Stopped instances still incurring EBS storage costs
      - Unattached EBS volumes
      - Old AMIs and snapshots
      - Idle load balancers
      - Unattached Elastic IPs
    storage:
      - Orphaned S3 buckets
      - Stale EBS snapshots
      - Old ECR images
    databases:
      - Unused RDS instances
      - Over-provisioned read replicas
      - Idle ElastiCache clusters
  tools:
    - AWS Cost Explorer
    - Cloud Custodian
    - Spot.io
    - CloudHealth
Immediate Actions
# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}'

# Find snapshots older than 90 days (set the date to 90 days before today)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`2022-09-01`].{ID:SnapshotId,Size:VolumeSize}'

# Check a load balancer's target group for healthy targets (idle if none)
aws elbv2 describe-target-health \
  --target-group-arn $TARGET_GROUP_ARN
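The snapshot audit is also easy to script so the cutoff doesn't go stale. A minimal boto3 sketch, report-only by design; the delete call is deliberately left commented out.

import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client('ec2')
cutoff = datetime.now(timezone.utc) - timedelta(days=90)

paginator = ec2.get_paginator('describe_snapshots')
for page in paginator.paginate(OwnerIds=['self']):
    for snap in page['Snapshots']:
        if snap['StartTime'] < cutoff:
            print(f"{snap['SnapshotId']}: {snap['VolumeSize']} GiB, "
                  f"created {snap['StartTime']:%Y-%m-%d}")
            # ec2.delete_snapshot(SnapshotId=snap['SnapshotId'])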
Right-Sizing
Compute Optimization
rightsizing_approach:
  data_collection:
    duration: 2-4 weeks minimum
    metrics:
      - CPU utilization (avg, max, p95)
      - Memory utilization
      - Network throughput
      - Disk I/O
  analysis:
    underutilized:
      criteria: CPU avg < 20%, Memory avg < 30%
      action: Downsize or consolidate
    right_sized:
      criteria: CPU avg 30-60%, Memory avg 50-70%
      action: No change needed
    constrained:
      criteria: CPU or Memory > 80% sustained
      action: Upsize or optimize application
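As a sketch of how the analysis step can be automated, the snippet below pulls two weeks of CloudWatch CPU data and applies the thresholds above. Memory requires the CloudWatch agent, so only CPU is checked here; the instance ID is a placeholder.

import boto3
from datetime import datetime, timedelta, timezone

def classify_instance(instance_id, days=14):
    cw = boto3.client('cloudwatch')
    end = datetime.now(timezone.utc)
    stats = cw.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=end - timedelta(days=days),
        EndTime=end,
        Period=3600,                 # hourly datapoints
        Statistics=['Average', 'Maximum'],
    )
    points = stats['Datapoints']
    if not points:
        return 'no data'
    avg = sum(p['Average'] for p in points) / len(points)
    peak = max(p['Maximum'] for p in points)
    if avg > 80 or peak > 95:        # rough proxy for "sustained > 80%"
        return 'constrained: upsize or optimize'
    if avg < 20:
        return 'underutilized: downsize or consolidate'
    return 'right-sized'

print(classify_instance('i-0123456789abcdef0'))  # placeholder instance ID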
Database Right-Sizing
database_optimization:
  rds:
    check:
      - CPU utilization
      - Database connections
      - IOPS consumption
      - Storage utilization
    common_savings:
      - Dev/staging: Use smaller instances
      - Disable Multi-AZ for non-prod
      - Use gp3 instead of io1 where possible
      - Consider Aurora Serverless for variable loads
  dynamodb:
    check:
      - Consumed vs provisioned capacity
      - On-demand vs provisioned pricing
    optimization:
      - Use on-demand for unpredictable loads
      - Use provisioned with auto-scaling for predictable loads
      - Review reserved capacity
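The consumed-vs-provisioned check lends itself to the same scripted treatment. A sketch, assuming a provisioned-mode table (on-demand tables report zero provisioned capacity); the table name is a placeholder.

import boto3
from datetime import datetime, timedelta, timezone

TABLE = 'orders'  # placeholder

ddb = boto3.client('dynamodb')
cw = boto3.client('cloudwatch')

provisioned = ddb.describe_table(TableName=TABLE)['Table'][
    'ProvisionedThroughput']['ReadCapacityUnits']

end = datetime.now(timezone.utc)
stats = cw.get_metric_statistics(
    Namespace='AWS/DynamoDB',
    MetricName='ConsumedReadCapacityUnits',
    Dimensions=[{'Name': 'TableName', 'Value': TABLE}],
    StartTime=end - timedelta(days=7),
    EndTime=end,
    Period=3600,
    Statistics=['Sum'],
)
points = stats['Datapoints']
# The metric is a per-period sum; divide by seconds to get average RCUs
avg_rcu = sum(p['Sum'] for p in points) / (len(points) * 3600) if points else 0
print(f'avg consumed: {avg_rcu:.1f} RCU vs provisioned: {provisioned} RCU')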
Commitment Strategies
Reserved Capacity Planning
reservation_strategy:
  analyze_baseline:
    - Identify steady-state usage
    - Exclude temporary workloads
    - Account for planned changes
  coverage_approach:
    conservative: 50-60% of baseline
    moderate: 70-80% of baseline
    aggressive: 85-90% of baseline
  commitment_mix:
    1_year_no_upfront:
      discount: ~20%
      flexibility: Highest
      use: Uncertain growth
    1_year_partial_upfront:
      discount: ~30%
      flexibility: Medium
      use: Moderate confidence
    3_year_all_upfront:
      discount: ~50%
      flexibility: Lowest
      use: Stable, long-term workloads
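To make the trade-off concrete, here is the arithmetic for a hypothetical fleet; the on-demand rate is an illustrative assumption, not a quote.

# Worked example of the commitment_mix table: ten instances at an
# assumed $0.096/hr on-demand rate, 730 hours/month.
ON_DEMAND_HOURLY = 0.096
INSTANCES = 10
HOURS_PER_MONTH = 730

baseline = ON_DEMAND_HOURLY * INSTANCES * HOURS_PER_MONTH  # ~$700/month
for name, discount in [('on-demand', 0.00),
                       ('1yr no upfront', 0.20),
                       ('1yr partial upfront', 0.30),
                       ('3yr all upfront', 0.50)]:
    print(f'{name:<22} ${baseline * (1 - discount):,.2f}/month')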
Savings Plans
savings_plans:
  compute_savings_plan:
    flexibility: Any instance, any region
    discount: Up to 66%
    best_for: Mixed or changing workloads
  ec2_instance_savings_plan:
    flexibility: Instance family in a region
    discount: Up to 72%
    best_for: Known instance families
  recommendation:
    - Start with Compute Savings Plans
    - Layer EC2 Instance Savings Plans for stable workloads
    - Review coverage monthly
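The monthly coverage review can lean on Cost Explorer. A minimal sketch, assuming the ce:GetSavingsPlansCoverage permission; the dates are illustrative.

import boto3

ce = boto3.client('ce')
resp = ce.get_savings_plans_coverage(
    TimePeriod={'Start': '2022-11-01', 'End': '2022-12-01'},  # illustrative
    Granularity='MONTHLY',
)
for item in resp['SavingsPlansCoverages']:
    pct = item['Coverage']['CoveragePercentage']
    print(f"{item['TimePeriod']['Start']}: {pct}% of eligible spend covered")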
Architecture Optimization
Serverless Where Appropriate
serverless_evaluation:
  good_fit:
    - Variable, unpredictable load
    - Event-driven processing
    - Infrequent execution
    - Quick development cycles
  calculate:
    lambda_cost: invocations * duration * memory
    ec2_cost: instance_cost * hours
  break_even:
    - Calculate at current load
    - Project at growth scenarios
    - Consider operational overhead savings
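A back-of-envelope version of that break-even calculation. The rates are ballpark published us-east-1 prices at the time of writing (a t3.medium is assumed for the EC2 side); substitute your own.

LAMBDA_GB_SECOND = 0.0000166667   # per GB-second of compute
LAMBDA_REQUEST = 0.0000002        # per invocation ($0.20 per 1M)
EC2_HOURLY = 0.0416               # assumed t3.medium on-demand rate

def lambda_monthly(invocations, avg_duration_s, memory_gb):
    compute = invocations * avg_duration_s * memory_gb * LAMBDA_GB_SECOND
    return compute + invocations * LAMBDA_REQUEST

def ec2_monthly(instances=1, hours=730):
    return EC2_HOURLY * instances * hours

# 5M invocations/month at 200 ms and 512 MB vs one always-on instance
print(f'lambda: ${lambda_monthly(5_000_000, 0.2, 0.5):.2f}')  # ~$9
print(f'ec2:    ${ec2_monthly():.2f}')                        # ~$30

At this load Lambda wins by roughly 3x; rerun the numbers at projected growth, since heavy sustained traffic eventually flips the comparison.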
Caching Layers
caching_optimization:
  benefits:
    - Reduce database load
    - Lower compute requirements
    - Faster response times
  options:
    cdn:
      use: Static content, API caching
      cost: Often cheaper than serving from origin
    application_cache:
      use: Frequently accessed data
      options: ElastiCache, Memorystore
    query_cache:
      use: Repeated database queries
      options: Read replicas, materialized views
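For the application cache, the usual pattern is read-through: check the cache, fall back to the database, populate with a TTL. A sketch against Redis; the host, key scheme, and fetch_from_db stub are placeholders.

import json
import redis

r = redis.Redis(host='cache.example.internal', port=6379)  # placeholder host

def fetch_from_db(user_id):
    # Placeholder for the real database query
    return {'id': user_id, 'name': 'example'}

def get_user(user_id, ttl=300):
    key = f'user:{user_id}'
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)        # cache hit: no database round-trip
    user = fetch_from_db(user_id)        # cache miss: hit the database
    r.setex(key, ttl, json.dumps(user))  # populate with a 5-minute TTL
    return user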
Storage Optimization
Tiering Strategy
storage_tiering:
  s3_lifecycle:
    hot_data:
      storage: S3 Standard
      access: Frequent
      cost: Highest
    warm_data:
      storage: S3 Standard-IA
      access: Monthly
      cost: ~45% less
      transition: After 30 days
    cold_data:
      storage: S3 Glacier
      access: Rare
      cost: ~80% less
      transition: After 90 days
    archive:
      storage: S3 Glacier Deep Archive
      access: Yearly or less
      cost: ~95% less
      transition: After 180 days
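That table maps directly onto an S3 lifecycle configuration. A sketch via boto3; the bucket name is a placeholder.

import boto3

s3 = boto3.client('s3')
s3.put_bucket_lifecycle_configuration(
    Bucket='my-data-bucket',  # placeholder
    LifecycleConfiguration={'Rules': [{
        'ID': 'tiering',
        'Status': 'Enabled',
        'Filter': {'Prefix': ''},  # apply to the whole bucket
        'Transitions': [
            {'Days': 30,  'StorageClass': 'STANDARD_IA'},
            {'Days': 90,  'StorageClass': 'GLACIER'},
            {'Days': 180, 'StorageClass': 'DEEP_ARCHIVE'},
        ],
    }]},
)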
Database Storage
database_storage:
  ebs_optimization:
    - Use gp3 over gp2 (20% cheaper, better baseline performance)
    - Right-size provisioned IOPS
    - Consider io2 Block Express for high performance
  cleanup:
    - Delete old snapshots
    - Automate snapshot lifecycle
    - Review backup retention
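gp2 volumes can be converted in place with modify_volume, with no detach or downtime. A report-only sketch; uncomment the call to actually migrate.

import boto3

ec2 = boto3.client('ec2')
paginator = ec2.get_paginator('describe_volumes')
for page in paginator.paginate(
        Filters=[{'Name': 'volume-type', 'Values': ['gp2']}]):
    for vol in page['Volumes']:
        print(f"would migrate {vol['VolumeId']} ({vol['Size']} GiB) to gp3")
        # ec2.modify_volume(VolumeId=vol['VolumeId'], VolumeType='gp3')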
Environment Management
Non-Production Optimization
non_prod_savings:
  dev_environments:
    - Smaller instance sizes
    - Single-AZ deployments
    - Scheduled shutdown (nights/weekends)
    - Shared resources where possible
  staging:
    - Right-sized for testing needs
    - Shut down when not in use
    - Consider spot instances
  automation:
    schedule_shutdown:
      - Lambda function on a schedule
      - Tag-based targeting
      - Slack notification
# Lambda function to stop dev instances
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    # Find running instances tagged for auto-stop
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoStop', 'Values': ['true']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        return f'Stopped {len(instance_ids)} instances'
    return 'No instances to stop'
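Wiring that function to a schedule is one EventBridge rule. A sketch; the Lambda ARN is a placeholder, and the function also needs a resource policy allowing events.amazonaws.com to invoke it.

import boto3

events = boto3.client('events')
events.put_rule(
    Name='stop-dev-instances-nightly',
    ScheduleExpression='cron(0 19 ? * MON-FRI *)',  # 19:00 UTC, weekdays
)
events.put_targets(
    Rule='stop-dev-instances-nightly',
    Targets=[{
        'Id': 'stop-dev-instances',
        # Placeholder ARN for the function above
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:stop-dev-instances',
    }],
)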
FinOps Practices
Cost Accountability
finops_implementation:
  tagging:
    required:
      - Environment
      - Team
      - Service
      - Cost-center
    enforcement:
      - SCPs to require tags
      - Automated tagging
      - Regular audits
  reporting:
    - Weekly cost reports by team
    - Anomaly alerts
    - Trend analysis
    - Forecast vs actual
  ownership:
    - Teams see their costs
    - Budget allocation
    - Savings incentives
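The "regular audits" item can be scripted against the Resource Groups Tagging API. A sketch that flags any resource missing one of the required tags above.

import boto3

REQUIRED_TAGS = {'Environment', 'Team', 'Service', 'Cost-center'}

tagging = boto3.client('resourcegroupstaggingapi')
paginator = tagging.get_paginator('get_resources')
for page in paginator.paginate():
    for res in page['ResourceTagMappingList']:
        present = {tag['Key'] for tag in res['Tags']}
        missing = REQUIRED_TAGS - present
        if missing:
            print(f"{res['ResourceARN']} missing: {', '.join(sorted(missing))}")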
Key Takeaways
- Start with unused resources—immediate 100% savings
- Right-size based on actual metrics, not guesses
- Commit to reserved capacity for stable workloads (50-70% coverage)
- Evaluate serverless for variable workloads
- Implement storage tiering to reduce costs by 50-80%
- Shut down non-production environments when not in use
- Tag everything for cost visibility
- Make teams accountable for their costs
- Cost optimization is ongoing, not one-time
- Efficiency enables investment in what matters
Optimize costs to fund innovation, not just to cut expenses.