Terraform at Scale: Lessons from Managing 500+ Resources

September 23, 2019

Terraform works beautifully for managing a handful of resources. But as infrastructure grows to hundreds or thousands of resources across multiple environments, new challenges emerge: slow plans, state file conflicts, and organizational complexity.

Here’s what we’ve learned managing large-scale Terraform deployments.

State Organization

Split State Files

One state file doesn’t scale:

# Bad: Single state file
terraform/
└── main.tf  # 500 resources in one state

# Good: Split by concern
terraform/
├── network/          # VPC, subnets, NAT
├── security/         # IAM, security groups
├── database/         # RDS, ElastiCache
├── kubernetes/       # EKS cluster
└── services/
    ├── api/          # API service resources
    └── web/          # Web service resources
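
Each directory is its own root module with its own backend key, so every component gets an independent state file. A minimal sketch of the per-component backend (bucket name illustrative, matching the examples below):

# terraform/network/backend.tf
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"   # each component uses a distinct key
    region = "us-east-1"
  }
}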

Benefits of Split State

Faster operations:

# Full state: terraform plan takes 5 minutes
# Split state: terraform plan takes 30 seconds

Reduced blast radius:

Change to network? Only network state affected.
Database issue? Only database state needs recovery.

Team ownership:

Platform team → network, security, kubernetes
Database team → database
App teams → services/*

Data Sources for Cross-State References

# In services/api/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "api" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
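
For the data source above to resolve, the network configuration has to export the value as a root-level output, since terraform_remote_state only exposes root module outputs. A sketch (the subnet resource name is illustrative):

# In network/outputs.tf
output "private_subnet_id" {
  value       = aws_subnet.private.id   # illustrative resource name
  description = "ID of the private subnet, consumed by other states"
}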

Module Design

Composable Modules

Small, focused modules:

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block = var.cidr
}

output "vpc_id" {
  value = aws_vpc.main.id   # exposed so callers can wire other modules to this VPC
}

# modules/subnet/main.tf
resource "aws_subnet" "main" {
  vpc_id     = var.vpc_id
  cidr_block = var.cidr
}

# Compose in root
module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "public_subnet" {
  source = "./modules/subnet"
  vpc_id = module.vpc.vpc_id
  cidr   = "10.0.1.0/24"
}
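
Each module declares only the inputs it needs, which keeps the composition above explicit. A sketch of the subnet module's variables:

# modules/subnet/variables.tf
variable "vpc_id" {
  type        = string
  description = "ID of the VPC to place the subnet in"
}

variable "cidr" {
  type        = string
  description = "CIDR block for the subnet"
}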

Version Pinning

module "vpc" {
  source  = "git::https://github.com/org/terraform-modules.git//vpc?ref=v1.2.0"
}

# Or with registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.77.0"
}

Never point a module source at a moving reference like main or latest; pin an explicit tag or version so applies are reproducible.

Module Interface Design

# Good: Clear, minimal interface
variable "name" {
  type        = string
  description = "Name prefix for all resources"
}

variable "environment" {
  type        = string
  description = "Environment (dev, staging, production)"
}

variable "vpc_cidr" {
  type        = string
  default     = "10.0.0.0/16"
  description = "CIDR block for VPC"
}

# Outputs that consumers need
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "ID of the created VPC"
}
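
A small interface like this keeps call sites readable; callers only set what actually varies (values illustrative):

module "network" {
  source      = "./modules/network"
  name        = "payments"
  environment = "production"
  # vpc_cidr left at its default
}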

Performance Optimization

Targeted Operations

# Only plan/apply specific resources
terraform plan -target=module.api
terraform apply -target=aws_instance.api

Use sparingly: targeted operations skip the rest of the configuration, so routine use lets untargeted resources drift from what the code declares.

Parallelism

# Default parallelism is 10
terraform apply -parallelism=20

Higher parallelism means more concurrent provider API calls, so watch for rate limiting from your cloud provider.

State Optimization

# Remove resources from state without destroying them (cleanup)
terraform state rm aws_instance.old   # removes every index of a counted resource

# Move resources between states
terraform state mv -state=old.tfstate -state-out=new.tfstate aws_instance.app aws_instance.app
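
With remote backends on both sides, one pattern is to pull local copies of each state, run the move between them, then push the results back. A sketch (paths and resource address illustrative; pause other changes and keep the pulled files as backups while doing this):

# Take local copies of both remote states
(cd terraform/database && terraform state pull > /tmp/database.tfstate)
(cd terraform/services/api && terraform state pull > /tmp/api.tfstate)

# Move the resource between the local copies
terraform state mv \
  -state=/tmp/database.tfstate -state-out=/tmp/api.tfstate \
  aws_instance.app aws_instance.app

# Push the updated copies back to their backends
(cd terraform/database && terraform state push /tmp/database.tfstate)
(cd terraform/services/api && terraform state push /tmp/api.tfstate)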

Plan Files

# Generate plan
terraform plan -out=plan.tfplan

# Apply specific plan
terraform apply plan.tfplan

Ensures apply matches reviewed plan.
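
The saved plan can also be rendered for humans or exported for tooling:

# Human-readable rendering of a saved plan
terraform show plan.tfplan

# Machine-readable JSON (Terraform 0.12+), useful for policy checks
terraform show -json plan.tfplan > plan.json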

Workflows

Branch Strategy

main
  ├── feature/add-redis
  └── feature/update-vpc

Workflow:
1. Create branch
2. Make changes
3. Run terraform plan in CI (automated; see the sketch after this list)
4. Code review + plan review
5. Merge to main
6. Auto-apply to staging
7. Manual approval for production
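
The automated plan in step 3 is just a scripted init and plan whose output gets attached to the pull request. A tool-agnostic sketch (adapt to your CI system):

# CI plan step for one component
set -euo pipefail
cd terraform/services/api
terraform init -input=false
terraform plan -input=false -out=plan.tfplan
terraform show -no-color plan.tfplan > plan.txt   # post plan.txt to the PR for review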

Environment Management

terraform/
├── modules/             # Shared modules
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       └── terraform.tfvars
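
Each environment directory is a thin wrapper that calls the shared modules with environment-specific values, so environments differ only in variables, not in structure. A sketch (module name and values illustrative):

# environments/production/main.tf
variable "vpc_cidr" {
  type = string
}

module "network" {
  source   = "../../modules/network"
  name     = "production"
  vpc_cidr = var.vpc_cidr
}

# environments/production/terraform.tfvars
vpc_cidr = "10.2.0.0/16"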

Atlantis for Automation

# atlantis.yaml
version: 3
projects:
- name: network
  dir: terraform/network
  workflow: default
- name: api
  dir: terraform/services/api
  workflow: default

workflows:
  default:
    plan:
      steps:
      - init
      - plan
    apply:
      steps:
      - apply

Safety Practices

State Locking

terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"  # Locking
    encrypt        = true
  }
}
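
The lock table has to exist before the backend can use it, and the S3 backend expects a string hash key named exactly LockID. One option is to manage it from a small bootstrap configuration (a sketch):

# Bootstrap configuration for the lock table
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"   # required key name for the S3 backend's locking

  attribute {
    name = "LockID"
    type = "S"
  }
}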

Pre-Commit Hooks

# .pre-commit-config.yaml
repos:
- repo: https://github.com/antonbabenko/pre-commit-terraform
  rev: v1.12.0   # pin to a released tag
  hooks:
  - id: terraform_fmt
  - id: terraform_validate
  - id: terraform_tflint
  - id: terraform_docs

Policy Enforcement

# Using Sentinel (Terraform Enterprise) or OPA
policy "no-public-s3" {
  enforcement_level = "hard-mandatory"
}

# OPA example (Rego, evaluated against the plan JSON)
package terraform

deny[msg] {
  resource := input.planned_values.root_module.resources[_]
  resource.type == "aws_s3_bucket"
  resource.values.acl == "public-read"
  msg := "S3 buckets must not be public"
}
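
The Rego rule runs against the JSON form of a plan, which ties neatly into the saved-plan workflow above (a sketch; file names illustrative):

# Export the plan as JSON and evaluate the policy
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
opa eval --data policy.rego --input plan.json --format pretty "data.terraform.deny"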

Drift Detection

# Schedule regular drift checks
terraform plan -detailed-exitcode
# Exit code 2 = changes detected (drift)
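
A scheduled job can walk each split state and alert on drift. A minimal sketch (directory list and alerting are illustrative):

#!/usr/bin/env bash
# Nightly drift check across split states
set -u
for dir in network security database kubernetes services/api services/web; do
  (
    cd "terraform/$dir"
    terraform init -input=false > /dev/null
    terraform plan -detailed-exitcode -input=false > /dev/null
    rc=$?
    if [ "$rc" -eq 2 ]; then
      echo "Drift detected in terraform/$dir"   # replace with real alerting (Slack, pager, ...)
    fi
  )
done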

Team Practices

Code Ownership

terraform/
├── network/     # CODEOWNERS: @platform-team
├── security/    # CODEOWNERS: @security-team
└── services/
    ├── api/     # CODEOWNERS: @api-team
    └── web/     # CODEOWNERS: @web-team
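
In GitHub terms this maps directly onto a CODEOWNERS file (team handles illustrative):

# .github/CODEOWNERS
terraform/network/        @org/platform-team
terraform/security/       @org/security-team
terraform/kubernetes/     @org/platform-team
terraform/services/api/   @org/api-team
terraform/services/web/   @org/web-team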

Review Process

## Terraform PR Template

### Changes
- [ ] Describe what this changes

### Plan Output
<details>
<summary>terraform plan</summary>

[plan output here]


</details>

### Checklist
- [ ] Plan reviewed
- [ ] No unexpected destroys
- [ ] Sensitive resources marked
- [ ] Documentation updated

Documentation

# Use terraform-docs
/**
 * # Network Module
 *
 * Creates VPC with public and private subnets.
 *
 * ## Usage
 *
 * ```hcl
 * module "network" {
 *   source = "./modules/network"
 *   name   = "production"
 * }
 * ```
 */
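
terraform-docs then renders that header plus the module's variables and outputs into a README (a sketch; check the CLI syntax of your installed version):

# Generate/refresh module documentation
terraform-docs markdown ./modules/network > ./modules/network/README.md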

Key Takeaways

Large-scale Terraform is less about clever HCL than about discipline: split state by concern, pin module versions, automate plan and apply, enforce policy, and make ownership explicit. Invest in that structure early; retrofitting it onto hundreds of resources is far harder than starting with it.