For years, our infrastructure existed in a state I’ll charitably call “organic.” Servers were provisioned through the AWS console. Configuration lived in a mix of shell scripts, wiki pages, and institutional memory. When someone asked how our production environment was configured, the honest answer was “check the console and hope the wiki is current.”
This worked—until it didn’t. A production incident required rebuilding infrastructure quickly, and we discovered our documentation was dangerously incomplete. What should have been a straightforward recovery became hours of archaeology, reconstructing configuration by examining running systems.
That incident convinced me: our infrastructure needed to be code. After evaluating options, we chose Terraform. Six months later, every piece of our infrastructure is defined in version-controlled configuration files. Here’s what we learned.
The Case for Infrastructure as Code
Infrastructure as code (IaC) treats infrastructure configuration the same way we treat application code: version-controlled, reviewed, tested, and automated.
The benefits are substantial:
Reproducibility. Infrastructure defined in code can be recreated exactly. Disaster recovery becomes running a command, not following a checklist. New environments mirror production precisely.
Visibility. Code review for infrastructure changes. Git history for audit trails. Diff commands for understanding what changed and when.
Consistency. The same configuration deploys everywhere. No manual deviations, no forgotten settings, no configuration drift.
Safety. Changes can be previewed before applying. Automated testing catches errors. Rollback means reverting to a previous commit.
Why Terraform
We evaluated several tools: CloudFormation, Ansible, Chef, and Terraform. Each has strengths, but Terraform aligned best with our needs.
Provider Agnostic
Terraform supports multiple cloud providers through its provider model. We’re primarily AWS, but we use some GCP services and manage DNS through Cloudflare. Terraform handles all three with consistent syntax.
CloudFormation, by contrast, is AWS-only: it could never cover our Cloudflare DNS or GCP usage, and Terraform keeps the door open if we go further multi-cloud.
Declarative Model
Terraform configuration describes desired state. You declare what should exist; Terraform figures out how to achieve it. This is easier to reason about than imperative scripts that describe a sequence of actions.
resource "aws_instance" "web" {
ami = "ami-abc123"
instance_type = "t2.micro"
tags = {
Name = "web-server"
}
}
This declares an EC2 instance should exist with these properties. Terraform handles creation, updates, and deletion.
State Management
Terraform maintains state: a record of what it has created and the mapping between configuration and real resources. This enables:
- Understanding what changes will occur before applying
- Detecting drift between configuration and reality
- Efficient updates (only changing what’s necessary)
State management has sharp edges (more on this later), but the capability is essential for managing real infrastructure.
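A refresh-only plan is one straightforward way to surface drift (assuming a reasonably recent Terraform release; the flag arrived in the 0.15 series). It reports differences between state and real infrastructure without proposing configuration changes:
$ terraform plan -refresh-only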
Plan Before Apply
Terraform’s plan command shows exactly what changes will occur:
$ terraform plan

  + aws_instance.web
      ami:           "ami-abc123"
      instance_type: "t2.micro"
      ...

Plan: 1 to add, 0 to change, 0 to destroy.
This preview eliminates surprises. You see what will be created, modified, or destroyed before it happens. For infrastructure changes that could cause downtime, this preview is invaluable.
HCL Language
Terraform’s HashiCorp Configuration Language (HCL) is designed for infrastructure definition. It’s more readable than JSON or YAML, supports variables and modules, and has just enough programming capability without becoming a full programming language.
variable "environment" {
description = "Environment name"
default = "production"
}
resource "aws_instance" "web" {
count = var.environment == "production" ? 3 : 1
ami = var.ami_id
instance_type = var.instance_type
tags = {
Name = "web-${var.environment}-${count.index}"
Environment = var.environment
}
}
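Variables also make it easy to preview a different sizing without touching the file; for example, a hypothetical staging run of the configuration above:
$ terraform plan -var="environment=staging"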
The Migration
Migrating existing infrastructure to Terraform requires care. You can’t just delete everything and recreate it.
Import Existing Resources
Terraform’s import command associates existing resources with Terraform configuration:
$ terraform import aws_instance.web i-1234567890abcdef0
This tells Terraform that the EC2 instance i-1234567890abcdef0 corresponds to the aws_instance.web configuration. Terraform then manages it going forward.
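One wrinkle: the resource block must already exist in configuration before you run the import. A minimal stub is enough to start (the values below are placeholders, not our real instance); after importing, iterate on the block until a plan reports no changes:
resource "aws_instance" "web" {
  # Placeholder attributes, reconciled against the real instance after import
  ami           = "ami-abc123"
  instance_type = "t2.micro"
}

$ terraform plan    # adjust the block until this shows no changes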
Importing is tedious—each resource requires a separate command—but it’s safer than recreating infrastructure.
Start with New Resources
We took a hybrid approach. New infrastructure was Terraform from the start. Existing infrastructure was imported gradually, prioritizing:
- Resources we modified frequently
- Resources that were complex or poorly documented
- Resources critical to production stability
Low-change, well-understood resources were imported last.
Module Everything
Terraform modules encapsulate related resources:
module "vpc" {
source = "./modules/vpc"
cidr_block = "10.0.0.0/16"
environment = var.environment
}
module "web_cluster" {
source = "./modules/web-cluster"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
instance_count = 3
}
Modules provide reusability and abstraction. Our VPC module creates a consistent network topology across environments. Our web cluster module creates auto-scaling groups with consistent configuration.
Build modules for your common patterns. The initial investment pays off in consistency and reduced duplication.
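A module is just a directory with its own variables, resources, and outputs. A heavily stripped-down sketch of a VPC module interface (illustrative only; subnets and routing omitted) might look like:
# modules/vpc/variables.tf
variable "cidr_block" {
  description = "CIDR range for the VPC"
}

variable "environment" {
  description = "Environment name, used in tags"
}

# modules/vpc/main.tf
resource "aws_vpc" "this" {
  cidr_block = var.cidr_block

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
  }
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.this.id
}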
State Management Realities
Terraform state is powerful and dangerous. Understanding state is essential for Terraform success.
Remote State
By default, Terraform stores state in a local file. This is fine for learning but terrible for teams. Local state can’t be shared, doesn’t support locking, and gets lost with your laptop.
Use remote state backends. S3 with DynamoDB locking is the common AWS choice:
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
Remote state enables team collaboration and provides durability.
State Locking
Concurrent Terraform runs can corrupt state. State locking ensures only one operation runs at a time. DynamoDB provides locking for S3 backends; other backends have their own locking mechanisms.
Always enable locking in team environments.
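The lock table itself is a small piece of Terraform-manageable infrastructure. For S3 backends it needs a string partition key named LockID; a minimal definition (table name chosen to match the backend block above) looks like:
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}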
State Secrets
Terraform state contains sensitive information: database passwords, API keys, and other secrets that appear in configuration. State files should be encrypted at rest and access-controlled strictly.
Never commit state files to version control. Never share state files casually. Treat state with the same sensitivity as production credentials.
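As a backstop, ignore local state and Terraform working files in Git; a typical .gitignore for a Terraform repository includes entries like:
# .gitignore
*.tfstate
*.tfstate.*
.terraform/
crash.log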
State Manipulation
Sometimes you need to modify state directly: renaming resources, moving resources between modules, or removing resources Terraform shouldn’t manage anymore.
terraform state commands provide this capability:
$ terraform state mv aws_instance.old aws_instance.new
$ terraform state rm aws_instance.manual
State manipulation is risky. Always backup state before manipulation. Prefer configuration refactoring over state manipulation when possible.
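Backups are cheap: terraform state pull writes the current remote state to stdout, and terraform state push can restore it in an emergency. A rough sequence:
$ terraform state pull > pre-refactor.tfstate
# ...run the risky state mv / rm commands...
# worst case, restore the backup (push typically needs -force, since the backup has an older serial):
$ terraform state push -force pre-refactor.tfstate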
Lessons Learned
Start with Conventions
Establish conventions before writing much Terraform:
- Directory structure (by environment? by service? by both?)
- Naming conventions for resources
- Tagging standards
- Module interface patterns
- State backend organization
Conventions are easier to establish early than to retrofit later.
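One common answer to the directory-structure question, shown purely as an illustration rather than a prescription, keeps per-environment roots next to shared modules:
terraform/
  modules/
    vpc/
    web-cluster/
  environments/
    development/
    staging/
    production/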
One Environment at a Time
Don’t try to Terraform everything simultaneously. Start with one environment (staging is less risky than production), learn the patterns, then expand.
Our path: development → staging → production. By the time we reached production, we’d made our mistakes in lower environments.
Embrace Modules Early
Resist the temptation to define everything in one big file. Create modules for:
- Common infrastructure patterns (VPC, database, cache)
- Service definitions (web cluster, worker fleet)
- Cross-cutting concerns (monitoring, alerting)
Modules add overhead but pay back in maintainability and reusability.
Plan Review Is Essential
Every Terraform change should be reviewed:
- Author runs terraform plan
- Plan output is included in code review
- Reviewers verify the plan matches expectations
- Only after approval does terraform apply run
Plan review catches mistakes before they affect infrastructure. We’ve caught numerous issues—wrong regions, missing dependencies, unintended deletions—through plan review.
Automate Application
Manual terraform apply from laptops doesn’t scale. Implement CI/CD for Terraform:
- Pull requests trigger plan runs
- Plan output posts to the PR for review
- Merged PRs trigger apply runs
- Apply output is logged and notifications sent
Automation ensures consistency and provides audit trails.
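The Terraform commands inside the pipeline stay simple; the wiring (PR comments, notifications, artifact handling) is whatever your CI system provides. A sketch of the two stages:
# On pull request: produce a plan for review
$ terraform init -input=false
$ terraform plan -input=false -out=tfplan
# post the plan output as a PR comment

# On merge: apply the reviewed plan
$ terraform apply -input=false tfplan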
Where We Are Now
Six months in, all our infrastructure is Terraform-managed. We can:
- Spin up identical environments for testing
- Review infrastructure changes like code
- Recover from disasters by applying configuration
- Understand exactly what’s deployed by reading configuration
The investment was significant—weeks of migration effort and ongoing learning. But the payoff is substantial: infrastructure that’s visible, reproducible, and safe to change.
If you’re still managing infrastructure through consoles and scripts, the transition is worth it. Your future self, recovering from a disaster at 2 AM, will thank you.
Key Takeaways
- Infrastructure as code provides reproducibility, visibility, and safety for infrastructure management
- Terraform’s declarative model, multi-provider support, and plan-before-apply workflow suited our needs
- Remote state with locking is essential for team environments; state contains secrets and requires protection
- Migrate incrementally: start with new resources, import existing resources gradually
- Establish conventions early; create modules for common patterns
- Automate Terraform execution through CI/CD for consistency and audit trails