Infrastructure Cost Optimization in Uncertain Times

December 19, 2022

With economic uncertainty affecting tech budgets, infrastructure cost optimization has moved from nice-to-have to essential. But cost-cutting done wrong degrades reliability and slows development. The goal is efficiency, not just expense reduction.

Here are practical strategies that work.

Optimization Hierarchy

Where to Focus

optimization_priority:
  highest_impact:
    - Unused resources (immediate 100% savings)
    - Over-provisioned resources (50-70% savings)
    - Reserved capacity (30-50% savings)

  medium_impact:
    - Architecture optimization (20-40% savings)
    - Storage tiering (20-50% savings)
    - Network optimization (10-30% savings)

  lower_impact:
    - Instance type optimization (5-20% savings)
    - Spot instances (60-90% savings, limited scope)
    - Region selection (10-20% savings)

Quick Wins

Identify Waste

waste_identification:
  unused_resources:
    compute:
      - Stopped instances still incurring costs
      - Unattached EBS volumes
      - Old AMIs and snapshots
      - Idle load balancers

    storage:
      - Orphaned S3 buckets
      - Stale EBS snapshots
      - Unused Elastic IPs
      - Old ECR images

    databases:
      - Unused RDS instances
      - Over-provisioned read replicas
      - Idle ElastiCache clusters

  tools:
    - AWS Cost Explorer
    - Cloud Custodian
    - Spot.io
    - CloudHealth

Immediate Actions

# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}'

# Find snapshots older than 90 days (uses GNU date; adjust on macOS)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='$(date -d '90 days ago' +%Y-%m-%d)'].{ID:SnapshotId,Size:VolumeSize}"

# Check a target group for healthy targets (idle if none are healthy);
# repeat for each target group behind the load balancer
aws elbv2 describe-target-health \
  --target-group-arn $TARGET_GROUP_ARN

Right-Sizing

Compute Optimization

rightsizing_approach:
  data_collection:
    duration: 2-4 weeks minimum
    metrics:
      - CPU utilization (avg, max, p95)
      - Memory utilization
      - Network throughput
      - Disk I/O

  analysis:
    underutilized:
      criteria: CPU avg < 20%, Memory avg < 30%
      action: Downsize or consolidate

    right_sized:
      criteria: CPU avg 30-60%, Memory avg 50-70%
      action: No change needed

    constrained:
      criteria: CPU or Memory > 80% sustained
      action: Upsize or optimize application

Database Right-Sizing

database_optimization:
  rds:
    check:
      - CPU utilization
      - Database connections
      - IOPS consumption
      - Storage utilization

    common_savings:
      - Dev/staging: Use smaller instances
      - Disable Multi-AZ for non-prod
      - Use gp3 instead of io1 where possible
      - Consider Aurora Serverless for variable loads

  dynamodb:
    check:
      - Consumed vs provisioned capacity
      - On-demand vs provisioned pricing

    optimization:
      - Use on-demand for unpredictable loads
      - Use provisioned with auto-scaling for predictable
      - Review reserved capacity
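The on-demand vs provisioned trade-off comes down to utilization. A rough monthly comparison for writes, using assumed us-east-1 prices from late 2022 (on-demand $1.25 per million write request units, provisioned $0.00065 per WCU-hour; verify current rates before deciding):

```python
# Assumed prices -- check the current DynamoDB pricing page.
ON_DEMAND_PER_WRITE = 1.25 / 1_000_000   # $ per write request unit
PROVISIONED_WCU_HOUR = 0.00065           # $ per WCU-hour

def monthly_write_cost(writes_per_second, utilization=0.5, hours=730):
    """Return (on_demand_cost, provisioned_cost) per month.

    `utilization` is the fraction of provisioned WCUs actually
    consumed; provisioned capacity must cover the peak, so low
    utilization pushes the advantage toward on-demand.
    """
    total_writes = writes_per_second * 3600 * hours
    on_demand = total_writes * ON_DEMAND_PER_WRITE
    provisioned_wcus = writes_per_second / utilization  # sized for peak
    provisioned = provisioned_wcus * PROVISIONED_WCU_HOUR * hours
    return on_demand, provisioned
```

At a steady 100 writes/s and 50% utilization this yields roughly $328 on-demand vs $95 provisioned per month, which is why steady loads favor provisioned capacity with auto-scaling.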

Commitment Strategies

Reserved Capacity Planning

reservation_strategy:
  analyze_baseline:
    - Identify steady-state usage
    - Exclude temporary workloads
    - Account for planned changes

  coverage_approach:
    conservative: 50-60% of baseline
    moderate: 70-80% of baseline
    aggressive: 85-90% of baseline

  commitment_mix:
    1_year_no_upfront:
      discount: ~20%
      flexibility: Highest
      use: Uncertain growth

    1_year_partial_upfront:
      discount: ~30%
      flexibility: Medium
      use: Moderate confidence

    3_year_all_upfront:
      discount: ~50%
      flexibility: Lowest
      use: Stable, long-term workloads
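The coverage and discount levels above combine into a blended rate. A sketch of the trade-off, normalizing everything to baseline on-demand cost (the function and its arguments are illustrative):

```python
def effective_rate(coverage, discount, usage_vs_baseline=1.0):
    """Cost relative to baseline on-demand spend.

    coverage: fraction of baseline covered by commitments (e.g. 0.7)
    discount: commitment discount vs on-demand (e.g. 0.3 for 1yr partial)
    usage_vs_baseline: actual usage as a fraction of baseline; below
    coverage, committed capacity sits idle but is still billed.
    """
    committed = coverage * (1 - discount)                # always paid
    on_demand = max(usage_vs_baseline - coverage, 0.0)   # overflow at list price
    return committed + on_demand
```

`effective_rate(0.7, 0.3)` returns 0.79, i.e. about 21% savings at baseline usage. If usage falls to 60% of baseline the committed spend is still owed, which is the argument for conservative coverage under uncertain growth.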

Savings Plans

savings_plans:
  compute_savings_plan:
    flexibility: Any instance, any region
    discount: Up to 66%
    best_for: Mixed or changing workloads

  ec2_instance_savings_plan:
    flexibility: Instance family in region
    discount: Up to 72%
    best_for: Known instance families

  recommendation:
    - Start with Compute Savings Plans
    - Layer EC2 Instance Plans for stable workloads
    - Review coverage monthly

Architecture Optimization

Serverless Where Appropriate

serverless_evaluation:
  good_fit:
    - Variable, unpredictable load
    - Event-driven processing
    - Infrequent execution
    - Quick development cycles

  calculate:
    lambda_cost: invocations * (request_price + duration_s * memory_gb * gb_second_price)
    ec2_cost: hourly_rate * hours

  break_even:
    - Calculate at current load
    - Project at growth scenarios
    - Consider operational overhead savings
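Making those formulas concrete, a break-even sketch using assumed prices (Lambda at roughly $0.0000166667 per GB-second plus $0.20 per million requests, and a hypothetical t3.small at $0.0208/hour, us-east-1, late 2022; free tiers and EC2 utilization headroom ignored):

```python
# Assumed prices -- check current pricing pages before relying on these.
GB_SECOND = 0.0000166667        # $ per GB-second of Lambda duration
PER_REQUEST = 0.20 / 1_000_000  # $ per Lambda invocation

def lambda_monthly(invocations, duration_ms, memory_mb):
    """Monthly Lambda cost: duration charge plus request charge."""
    gb_seconds = invocations * (duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * GB_SECOND + invocations * PER_REQUEST

def ec2_monthly(hourly_rate=0.0208, hours=730):
    """Monthly cost of one always-on instance at the given rate."""
    return hourly_rate * hours
```

At 5M invocations/month of 200 ms at 512 MB, `lambda_monthly` comes to about $9.33 against roughly $15.18 for the always-on instance; rerun the numbers at your projected growth before committing either way.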

Caching Layers

caching_optimization:
  benefits:
    - Reduce database load
    - Lower compute requirements
    - Faster response times

  options:
    cdn:
      use: Static content, API caching
      cost: Often cheaper than origin

    application_cache:
      use: Frequently accessed data
      options: ElastiCache, Memorystore

    query_cache:
      use: Repeated database queries
      options: Read replicas, materialized views

Storage Optimization

Tiering Strategy

storage_tiering:
  s3_lifecycle:
    hot_data:
      storage: S3 Standard
      access: Frequent
      cost: Highest

    warm_data:
      storage: S3 Standard-IA
      access: Monthly
      cost: ~45% less
      transition: After 30 days

    cold_data:
      storage: S3 Glacier
      access: Rare
      cost: ~80% less
      transition: After 90 days

    archive:
      storage: Glacier Deep Archive
      access: Yearly or less
      cost: ~95% less
      transition: After 180 days
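The table above maps directly onto an S3 lifecycle configuration. A sketch in the shape boto3's put_bucket_lifecycle_configuration accepts (bucket name and prefix are placeholders; note the colder tiers carry retrieval fees and minimum storage durations):

```python
# Lifecycle rules matching the tiering table above.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-down-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # placeholder prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }
    ]
}

# Applied with, e.g.:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle)
```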

Database Storage

database_storage:
  ebs_optimization:
    - Use gp3 over gp2 (20% cheaper, better performance)
    - Right-size provisioned IOPS
    - Consider io2 Block Express for high performance

  cleanup:
    - Delete old snapshots
    - Automate snapshot lifecycle
    - Review backup retention
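The gp2-to-gp3 saving is simple arithmetic. Assuming 2022 us-east-1 rates of $0.10 (gp2) and $0.08 (gp3) per GB-month:

```python
# Assumed per-GB-month rates (us-east-1, 2022); gp3 also includes a
# free 3,000 IOPS / 125 MB/s baseline, where gp2 ties IOPS to size.
EBS_RATES = {"gp2": 0.10, "gp3": 0.08}

def ebs_monthly(size_gb, volume_type="gp3"):
    return size_gb * EBS_RATES[volume_type]

# A 500 GB volume: $50/month on gp2 vs $40/month on gp3 -- the 20%
# figure above, before any provisioned-IOPS differences.
```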

Environment Management

Non-Production Optimization

non_prod_savings:
  dev_environments:
    - Smaller instance sizes
    - Single-AZ deployments
    - Scheduled shutdown (nights/weekends)
    - Shared resources where possible

  staging:
    - Right-sized for testing needs
    - Shutdown when not in use
    - Consider spot instances

  automation:
    schedule_shutdown:
      - Lambda function on schedule
      - Tag-based targeting
      - Slack notification

# Lambda function to stop dev instances
import boto3

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')

    # Find instances with auto-stop tag
    instances = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:AutoStop', 'Values': ['true']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    instance_ids = []
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_ids.append(instance['InstanceId'])

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        return f'Stopped {len(instance_ids)} instances'

    return 'No instances to stop'

FinOps Practices

Cost Accountability

finops_implementation:
  tagging:
    required:
      - Environment
      - Team
      - Service
      - Cost-center

    enforcement:
      - SCPs to require tags
      - Automated tagging
      - Regular audits
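Audits are easy to automate once the required set is fixed. A minimal sketch against flattened tag dicts (the helper name is illustrative):

```python
REQUIRED_TAGS = {"Environment", "Team", "Service", "Cost-center"}

def missing_tags(resource_tags):
    """Return the required tags absent or empty on a resource.

    `resource_tags` is a dict of tag key -> value, as you get after
    flattening the Tags list from a describe_* call. An empty result
    means the resource passes the audit.
    """
    present = {k for k, v in resource_tags.items() if v}
    return sorted(REQUIRED_TAGS - present)
```

Run it over resource inventories on a schedule and route non-empty results to the owning team; SCPs then prevent new untagged resources while the audit catches drift on existing ones.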

  reporting:
    - Weekly cost reports by team
    - Anomaly alerts
    - Trend analysis
    - Forecast vs actual

  ownership:
    - Teams see their costs
    - Budget allocation
    - Savings incentives

Key Takeaways

Optimize costs to fund innovation, not just to cut expenses.