Terraform works beautifully for managing a handful of resources. But as infrastructure grows to hundreds or thousands of resources across multiple environments, new challenges emerge: slow plans, state file conflicts, and organizational complexity.
Here’s what we’ve learned managing large-scale Terraform deployments.
State Organization
Split State Files
One state file doesn’t scale:
# Bad: Single state file
terraform/
└── main.tf # 500 resources in one state
# Good: Split by concern
terraform/
├── network/ # VPC, subnets, NAT
├── security/ # IAM, security groups
├── database/ # RDS, ElastiCache
├── kubernetes/ # EKS cluster
└── services/
    ├── api/ # API service resources
    └── web/ # Web service resources
Benefits of Split State
Faster operations:
# Full state: terraform plan takes 5 minutes
# Split state: terraform plan takes 30 seconds
Reduced blast radius:
Change to network? Only network state affected.
Database issue? Only database state needs recovery.
Team ownership:
Platform team → network, security, kubernetes
Database team → database
App teams → services/*
Data Sources for Cross-State References
# In services/api/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "terraform-state"
    key    = "network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "api" {
  # ami, instance_type, and other required arguments omitted for brevity
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_id
}
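A remote state lookup only sees values that the producing configuration exports as root-level outputs. As a minimal sketch, the network state would need something like the following; the file name and the resource name aws_subnet.private are assumptions for illustration, not taken from the configuration above.

```hcl
# network/outputs.tf (hypothetical file name)

# Root-level outputs are the only values that terraform_remote_state exposes.
output "private_subnet_id" {
  description = "ID of the private subnet that service states attach to"
  value       = aws_subnet.private.id # assumes a resource named aws_subnet.private
}
```

Keeping this set of outputs small makes the contract between states explicit: anything not exported here cannot be depended on by the service states.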
Module Design
Composable Modules
Small, focused modules:
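# modules/vpc/main.tf
variable "cidr" { type = string }

resource "aws_vpc" "main" {
  cidr_block = var.cidr
}

# Exported so that other modules can reference the VPC
output "vpc_id" {
  value = aws_vpc.main.id
}

# modules/subnet/main.tf
variable "vpc_id" { type = string }
variable "cidr" { type = string }

resource "aws_subnet" "main" {
  vpc_id     = var.vpc_id
  cidr_block = var.cidr
}

# Compose in root
module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "public_subnet" {
  source = "./modules/subnet"
  vpc_id = module.vpc.vpc_id
  cidr   = "10.0.1.0/24"
}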
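Small modules also compose well with for_each, which stamps out one module instance per map entry. The sketch below reuses the subnet module from the example above and assumes that the vpc module exports a vpc_id output; the module name private_subnets and the CIDR values are hypothetical.

```hcl
# One instance of the subnet module per entry in the map.
module "private_subnets" {
  source = "./modules/subnet"

  # Hypothetical CIDR blocks, keyed by a short label
  for_each = {
    a = "10.0.10.0/24"
    b = "10.0.11.0/24"
  }

  vpc_id = module.vpc.vpc_id # assumes the vpc module defines this output
  cidr   = each.value
}
```

Each instance is addressed by its key, for example module.private_subnets["a"], so adding or removing one entry does not disturb the others.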
Version Pinning
module "vpc" {
  source = "git::https://github.com/org/terraform-modules.git//vpc?ref=v1.2.0"
}

# Or with registry
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "2.77.0"
}
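Never point a module at a moving reference such as the main branch, and never leave a registry module's version unconstrained, since that resolves to the latest release.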
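The same reasoning applies to Terraform itself and to providers. A minimal sketch, with version constraints that are placeholders rather than recommendations:

```hcl
terraform {
  # Hypothetical constraint: any 1.x release from 1.5 onward
  required_version = "~> 1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # hypothetical constraint: any 5.x release
    }
  }
}
```

Committing the generated .terraform.lock.hcl file alongside this pins the exact provider builds that were selected.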
Module Interface Design
# Good: Clear, minimal interface
variable "name" {
  type        = string
  description = "Name prefix for all resources"
}

variable "environment" {
  type        = string
  description = "Environment (dev, staging, production)"
}

variable "vpc_cidr" {
  type        = string
  default     = "10.0.0.0/16"
  description = "CIDR block for VPC"
}

# Outputs that consumers need
output "vpc_id" {
  value       = aws_vpc.main.id
  description = "ID of the created VPC"
}
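An interface can also reject bad input up front. As a sketch, the environment variable above could carry a validation block so that a typo fails at plan time instead of producing misnamed resources; the list of allowed values is taken from the variable's description.

```hcl
variable "environment" {
  type        = string
  description = "Environment (dev, staging, production)"

  validation {
    # Fails the plan unless the value is one of the three known environments
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}
```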
Performance Optimization
Targeted Operations
# Only plan/apply specific resources
terraform plan -target=module.api
terraform apply -target=aws_instance.api
Use sparingly: a targeted run skips every other pending change, so configuration and real infrastructure can drift apart.
Parallelism
# Default parallelism is 10
terraform apply -parallelism=20
Be careful with API rate limits: higher parallelism means more concurrent provider API calls, which can trigger throttling.
State Optimization
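# Stop tracking resources in state (the real infrastructure is not destroyed).
# Naming the resource without an index removes all of its instances.
terraform state rm aws_instance.old
# Move resources between local state files
terraform state mv -state=old.tfstate -state-out=new.tfstate aws_instance.app aws_instance.app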
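For changes within a single state, recent Terraform versions offer declarative alternatives to these commands: moved blocks (Terraform 1.1 and later) and removed blocks (Terraform 1.7 and later). The sketch below assumes a module named api exists in the same configuration and that the aws_instance.old resource block has already been deleted from it.

```hcl
# Record a refactor: the instance now lives inside module "api" (hypothetical).
# Terraform updates the state address instead of destroying and recreating.
moved {
  from = aws_instance.app
  to   = module.api.aws_instance.app
}

# Stop managing a resource without destroying the real infrastructure.
removed {
  from = aws_instance.old

  lifecycle {
    destroy = false
  }
}
```

Because these blocks live in the configuration, they go through code review and show up in the plan like any other change.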
Plan Files
# Generate plan
terraform plan -out=plan.tfplan
# Apply specific plan
terraform apply plan.tfplan
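Applying a saved plan ensures that the changes applied are exactly the ones that were reviewed. If the state has changed since the plan was created, Terraform rejects the plan as stale.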
Workflows
Branch Strategy
main
├── feature/add-redis
└── feature/update-vpc
Workflow:
1. Create branch
2. Make changes
3. Run terraform plan (automated)
4. Code review + plan review
5. Merge to main
6. Auto-apply to staging
7. Manual approval for production
Environment Management
terraform/
├── modules/ # Shared modules
└── environments/
    ├── dev/
    │   ├── main.tf
    │   └── terraform.tfvars
    ├── staging/
    │   ├── main.tf
    │   └── terraform.tfvars
    └── production/
        ├── main.tf
        └── terraform.tfvars
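In this layout, each environment's main.tf is typically a thin wrapper: it sets its own backend key and calls the shared modules with environment-specific values. A minimal sketch for production follows, assuming a shared modules/network module with the name, environment, and vpc_cidr inputs shown earlier; the state key and the CIDR value are hypothetical.

```hcl
# environments/production/main.tf
terraform {
  backend "s3" {
    bucket = "terraform-state"
    key    = "production/terraform.tfstate" # hypothetical key, one per environment
    region = "us-east-1"
  }
}

variable "vpc_cidr" {
  type        = string
  description = "CIDR block for this environment's VPC"
}

module "network" {
  source      = "../../modules/network"
  name        = "production"
  environment = "production"
  vpc_cidr    = var.vpc_cidr
}

# environments/production/terraform.tfvars would then contain, for example:
#   vpc_cidr = "10.2.0.0/16"
```

The dev and staging directories would differ only in the backend key, the name and environment values, and their terraform.tfvars.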
Atlantis for Automation
# atlantis.yaml
version: 3
projects:
  - name: network
    dir: terraform/network
    workflow: default
  - name: api
    dir: terraform/services/api
    workflow: default
workflows:
  default:
    plan:
      steps:
        - init
        - plan
    apply:
      steps:
        - apply
Safety Practices
State Locking
terraform {
  backend "s3" {
    bucket         = "terraform-state"
    key            = "network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # Locking
    encrypt        = true
  }
}
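The lock table itself has to exist before the backend can use it. The S3 backend expects a DynamoDB table whose partition key is a string attribute named LockID. A minimal sketch, usually kept in a separate bootstrap configuration; the resource name and the on-demand billing mode are assumptions.

```hcl
# Lock table for the S3 backend. The table name matches the
# dynamodb_table setting in the backend block.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST" # assumption: on-demand billing
  hash_key     = "LockID"          # key name required by the S3 backend

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Newer Terraform releases also document an S3-native lock file option (use_lockfile) for this backend; check the S3 backend documentation for the version in use before choosing between the two.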
Pre-Commit Hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.96.1 # example tag; pre-commit requires a pinned rev, so use a current release
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
      - id: terraform_docs
Policy Enforcement
# Using Sentinel (Terraform Enterprise) or OPA

# Sentinel: policy set configuration (sentinel.hcl)
policy "no-public-s3" {
  enforcement_level = "hard-mandatory"
}

# OPA (Rego) example, evaluated against the JSON form of a plan
deny[msg] {
  resource := input.planned_values.root_module.resources[_]
  resource.type == "aws_s3_bucket"
  resource.values.acl == "public-read"
  msg := "S3 buckets must not be public"
}
Drift Detection
# Schedule regular drift checks
terraform plan -detailed-exitcode
# Exit code 0 = no changes
# Exit code 1 = error
# Exit code 2 = changes detected (drift)
Team Practices
Code Ownership
terraform/
├── network/ # CODEOWNERS: @platform-team
├── security/ # CODEOWNERS: @security-team
└── services/
    ├── api/ # CODEOWNERS: @api-team
    └── web/ # CODEOWNERS: @web-team
Review Process
## Terraform PR Template
### Changes
- [ ] Describe what this changes
### Plan Output
<details>
<summary>terraform plan</summary>
[plan output here]
</details>
### Checklist
- [ ] Plan reviewed
- [ ] No unexpected destroys
- [ ] Sensitive resources marked
- [ ] Documentation updated
Documentation
# Use terraform-docs
/**
* # Network Module
*
* Creates VPC with public and private subnets.
*
* ## Usage
*
* ```hcl
* module "network" {
* source = "./modules/network"
* name = "production"
* }
* ```
*/
Key Takeaways
- Split state by concern and team ownership
- Use data sources for cross-state references
- Build small, composable, versioned modules
- Use targeted operations carefully
- Automate with Atlantis or similar
- Enable state locking and encryption
- Enforce policies with Sentinel or OPA
- Run regular drift detection
- Define code ownership clearly
- Document modules and require plan review
Large-scale Terraform requires discipline. Invest in structure early.