Terraform at Scale: Lessons from Managing 500+ Resources

September 23, 2019

Terraform works beautifully for managing a handful of resources. But as infrastructure grows to hundreds or thousands of resources across multiple environments, new challenges emerge: slow plans, state file conflicts, and organizational complexity.

Here’s what we’ve learned managing large-scale Terraform deployments.

State Organization

Split State Files

One state file doesn’t scale:

# Bad: Single state file
terraform/
└── main.tf  # 500 resources in one state

# Good: Split by concern
terraform/
├── network/          # VPC, subnets, NAT
├── security/         # IAM, security groups
├── database/         # RDS, ElastiCache
├── kubernetes/       # EKS cluster
└── services/
    ├── api/          # API service resources
    └── web/          # Web service resources
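
Each directory is its own root module with its own backend key, so every component gets an independent state file. A minimal sketch of the per-component backend (bucket name illustrative, matching the examples below):

# terraform/network/backend.tf
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"   # each component uses a distinct key
    region = "us-east-1"
  }
}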

Benefits of Split State

Faster operations:

# Full state: terraform plan takes 5 minutes
# Split state: terraform plan takes 30 seconds

Reduced blast radius:

Change to network? Only network state affected.
Database issue? Only database state needs recovery.

Team ownership:

Platform team → network, security, kubernetes
Database team → database
App teams → services/*

Data Sources for Cross-State References

# In services/api/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "api" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
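
For the data source above to resolve, the network configuration has to export the value as a root-level output, since terraform_remote_state only exposes root module outputs. A sketch (the subnet resource name is illustrative):

# In network/outputs.tf
output "private_subnet_id" {
  value       = aws_subnet.private.id   # illustrative resource name
  description = "ID of the private subnet, consumed by other states"
}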

Module Design

Composable Modules

Small, focused modules:

# modules/vpc/main.tf
resource "aws_vpc" "main" {
  cidr_block = var.cidr
}

output "vpc_id" {
  value = aws_vpc.main.id   # exposed so callers can wire other modules to this VPC
}

# modules/subnet/main.tf
resource "aws_subnet" "main" {
  vpc_id     = var.vpc_id
  cidr_block = var.cidr
}

# Compose in root
module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "public_subnet" {
  source = "./modules/subnet"
  vpc_id = module.vpc.vpc_id
  cidr   = "10.0.1.0/24"
}
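
Each module declares only the inputs it needs, which keeps the composition above explicit. A sketch of the subnet module's variables:

# modules/subnet/variables.tf
variable "vpc_id" {
  type        = string
  description = "ID of the VPC to place the subnet in"
}

variable "cidr" {
  type        = string
  description = "CIDR block for the subnet"
}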

Version Pinning

module "vpc" {
  source  = "git::https://github.com/org/terraform-modules.git//vpc?ref=v1.2.0"
}

# Or with registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.77.0"
}

Never point a module source at a moving reference like main or latest; pin an explicit tag or version so applies are reproducible.

Module Interface Design

# Good: Clear, minimal interface
variable "name" {
  type        = string
  description = "Name prefix for all resources"
}

variable "environment" {
  type        = string
  description = "Environment (dev, staging, production)"
}

variable "vpc_cidr" {
  type        = string
  default     = "10.0.0.0/16"
  description = "CIDR block for VPC"
}

# Outputs that consumers need
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "ID of the created VPC"
}
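
A small interface like this keeps call sites readable; callers only set what actually varies (values illustrative):

module "network" {
  source      = "./modules/network"
  name        = "payments"
  environment = "production"
  # vpc_cidr left at its default
}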

Performance Optimization

Targeted Operations

# Only plan/apply specific resources
terraform plan -target=module.api
terraform apply -target=aws_instance.api

Use sparingly: targeted operations skip the rest of the configuration, so routine use lets untargeted resources drift from what the code declares.

Parallelism

# Default parallelism is 10
terraform apply -parallelism=20

Higher parallelism means more concurrent provider API calls, so watch for rate limiting from your cloud provider.

State Optimization

# Remove resources from state without destroying them (cleanup)
terraform state rm aws_instance.old   # removes every index of a counted resource

# Move resources between states
terraform state mv -state=old.tfstate -state-out=new.tfstate aws_instance.app aws_instance.app
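
With remote backends on both sides, one pattern is to pull local copies of each state, run the move between them, then push the results back. A sketch (paths and resource address illustrative; pause other changes and keep the pulled files as backups while doing this):

# Take local copies of both remote states
(cd terraform/database && terraform state pull > /tmp/database.tfstate)
(cd terraform/services/api && terraform state pull > /tmp/api.tfstate)

# Move the resource between the local copies
terraform state mv \
  -state=/tmp/database.tfstate -state-out=/tmp/api.tfstate \
  aws_instance.app aws_instance.app

# Push the updated copies back to their backends
(cd terraform/database && terraform state push /tmp/database.tfstate)
(cd terraform/services/api && terraform state push /tmp/api.tfstate)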

Plan Files

# Generate plan
terraform plan -out=plan.tfplan

# Apply specific plan
terraform apply plan.tfplan

Ensures apply matches reviewed plan.
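
The saved plan can also be rendered for humans or exported for tooling:

# Human-readable rendering of a saved plan
terraform show plan.tfplan

# Machine-readable JSON (Terraform 0.12+), useful for policy checks
terraform show -json plan.tfplan > plan.json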

Workflows

Branch Strategy

main
  ├── feature/add-redis
  └── feature/update-vpc

Workflow:
1. Create branch
2. Make changes
3. Run terraform plan in CI (automated; see the sketch after this list)
4. Code review + plan review
5. Merge to main
6. Auto-apply to staging
7. Manual approval for production
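
The automated plan in step 3 is just a scripted init and plan whose output gets attached to the pull request. A tool-agnostic sketch (adapt to your CI system):

# CI plan step for one component
set -euo pipefail
cd terraform/services/api
terraform init -input=false
terraform plan -input=false -out=plan.tfplan
terraform show -no-color plan.tfplan > plan.txt   # post plan.txt to the PR for review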

Environment Management

terraform/
├── modules/             # Shared modules
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   └── terraform.tfvars
│   └── production/
│       ├── main.tf
│       └── terraform.tfvars
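
Each environment directory is a thin wrapper that calls the shared modules with environment-specific values, so environments differ only in variables, not in structure. A sketch (module name and values illustrative):

# environments/production/main.tf
variable "vpc_cidr" {
  type = string
}

module "network" {
  source   = "../../modules/network"
  name     = "production"
  vpc_cidr = var.vpc_cidr
}

# environments/production/terraform.tfvars
vpc_cidr = "10.2.0.0/16"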

Atlantis for Automation

# atlantis.yaml
version: 3
projects:
- name: network
  dir: terraform/network
  workflow: default
- name: api
  dir: terraform/services/api
  workflow: default

workflows:
  default:
    plan:
      steps:
      - init
      - plan
    apply:
      steps:
      - apply

Safety Practices

State Locking

terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"  # Locking
    encrypt        = true
  }
}
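
The lock table has to exist before the backend can use it, and the S3 backend expects a string hash key named exactly LockID. One option is to manage it from a small bootstrap configuration (a sketch):

# Bootstrap configuration for the lock table
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"   # required key name for the S3 backend's locking

  attribute {
    name = "LockID"
    type = "S"
  }
}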

Pre-Commit Hooks

# .pre-commit-config.yaml
repos:
- repo: https://github.com/antonbabenko/pre-commit-terraform
  rev: v1.12.0   # pin to a released tag
  hooks:
  - id: terraform_fmt
  - id: terraform_validate
  - id: terraform_tflint
  - id: terraform_docs

Policy Enforcement

# Using Sentinel (Terraform Enterprise) or OPA
policy "no-public-s3" {
  enforcement_level = "hard-mandatory"
}

# OPA example (Rego, evaluated against the plan JSON)
package terraform

deny[msg] {
  resource := input.planned_values.root_module.resources[_]
  resource.type == "aws_s3_bucket"
  resource.values.acl == "public-read"
  msg := "S3 buckets must not be public"
}
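
The Rego rule runs against the JSON form of a plan, which ties neatly into the saved-plan workflow above (a sketch; file names illustrative):

# Export the plan as JSON and evaluate the policy
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
opa eval --data policy.rego --input plan.json --format pretty "data.terraform.deny"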

Drift Detection

# Schedule regular drift checks
terraform plan -detailed-exitcode
# Exit code 2 = changes detected (drift)
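
A scheduled job can walk each split state and alert on drift. A minimal sketch (directory list and alerting are illustrative):

#!/usr/bin/env bash
# Nightly drift check across split states
set -u
for dir in network security database kubernetes services/api services/web; do
  (
    cd "terraform/$dir"
    terraform init -input=false > /dev/null
    terraform plan -detailed-exitcode -input=false > /dev/null
    rc=$?
    if [ "$rc" -eq 2 ]; then
      echo "Drift detected in terraform/$dir"   # replace with real alerting (Slack, pager, ...)
    fi
  )
done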

Team Practices

Code Ownership

terraform/
├── network/     # CODEOWNERS: @platform-team
├── security/    # CODEOWNERS: @security-team
└── services/
    ├── api/     # CODEOWNERS: @api-team
    └── web/     # CODEOWNERS: @web-team
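
In GitHub terms this maps directly onto a CODEOWNERS file (team handles illustrative):

# .github/CODEOWNERS
terraform/network/        @org/platform-team
terraform/security/       @org/security-team
terraform/kubernetes/     @org/platform-team
terraform/services/api/   @org/api-team
terraform/services/web/   @org/web-team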

Review Process

## Terraform PR Template

### Changes
- [ ] Describe what this changes

### Plan Output
<details>
<summary>terraform plan</summary>

[plan output here]


</details>

### Checklist
- [ ] Plan reviewed
- [ ] No unexpected destroys
- [ ] Sensitive resources marked
- [ ] Documentation updated

Documentation

# Use terraform-docs
/**
 * # Network Module
 *
 * Creates VPC with public and private subnets.
 *
 * ## Usage
 *
 * ```hcl
 * module "network" {
 *   source = "./modules/network"
 *   name   = "production"
 * }
 * ```
 */
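
terraform-docs then renders that header plus the module's variables and outputs into a README (a sketch; check the CLI syntax of your installed version):

# Generate/refresh module documentation
terraform-docs markdown ./modules/network > ./modules/network/README.md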

Key Takeaways

Large-scale Terraform is less about clever HCL than about discipline: split state by concern, pin module versions, automate plan and apply, enforce policy, and make ownership explicit. Invest in that structure early; retrofitting it onto hundreds of resources is far harder than starting with it.