IaC for Data Engineering with Terraform
Skill by ara.so — Data Skills collection.
This project provides Infrastructure-as-Code (IaC) templates and patterns for data engineers using Terraform to provision and manage AWS resources. It focuses on creating reproducible, version-controlled infrastructure for data platforms including S3 storage, EC2 compute instances, and IAM permissions.
What This Project Does
- Provides Terraform configurations for common data engineering infrastructure on AWS
- Demonstrates IaC best practices for S3 buckets, EC2 instances, and IAM roles
- Shows state management and lifecycle operations for data infrastructure
- Teaches reproducible infrastructure provisioning for data pipelines
Prerequisites
Before using this project, ensure you have:
- AWS account with administrative access (avoid using the root user for day-to-day work)
- Terraform CLI installed (installation guide)
- AWS CLI installed and configured (setup guide)
- AWS credentials configured via aws configure
AWS IAM Setup
Create an IAM user with appropriate permissions:
- Create IAM User: Navigate to AWS Console → IAM → Users → Create user
- Create Inline Policy: Attach a custom policy to the user
- Grant Permissions: For development/learning, grant full access to:
- Amazon S3
- Amazon EC2
- AWS IAM
⚠️ Security Note: Full service access is NOT recommended for production. Use least-privilege policies in production environments.
Project Structure
terraform/
├── main.tf # Main Terraform configuration
├── variables.tf # Input variables (if present)
├── outputs.tf # Output values (if present)
└── terraform.tfstate # State file (generated)
Key Terraform Commands
Initialize Terraform
Initialize the working directory and download provider plugins:
terraform -chdir=terraform init
Validate Configuration
Check if the configuration is syntactically valid:
terraform -chdir=terraform validate
Format Code
Automatically format Terraform files to canonical style:
terraform -chdir=terraform fmt
Plan Infrastructure Changes
Preview what Terraform will create/modify/destroy:
terraform -chdir=terraform plan
Apply Configuration
Create or update infrastructure:
terraform -chdir=terraform apply
Terraform will show a plan and ask for confirmation. Type yes to proceed.
Auto-approve (for automation)
terraform -chdir=terraform apply -auto-approve
Destroy Infrastructure
Remove all resources managed by Terraform:
terraform -chdir=terraform destroy
Configuration
Basic Terraform Configuration Example
Before applying, modify terraform/main.tf to customize resource names:
# terraform/main.tf
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
provider "aws" {
region = "us-east-1"
}
# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
bucket = "my-unique-data-engineering-bucket-12345"
tags = {
Name = "Data Engineering Bucket"
Environment = "dev"
ManagedBy = "Terraform"
}
}
# EC2 instance for data processing
resource "aws_instance" "data_processor" {
ami = "ami-0c55b159cbfafe1f0" # Example ID only: AMI IDs are region-specific, so replace it with a current Amazon Linux 2 AMI for your region
instance_type = "t2.micro"
tags = {
Name = "Data Processor"
Environment = "dev"
ManagedBy = "Terraform"
}
}
# IAM role for EC2 instance
resource "aws_iam_role" "ec2_s3_role" {
name = "ec2-s3-access-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}
]
})
}
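The role above only defines who may assume it; it grants no permissions until a policy is attached. The following is a minimal sketch of a least-privilege policy, assuming the instance only needs to list, read, and write objects in data_bucket. The policy name and the chosen actions are illustrative assumptions, not part of the original project.

```hcl
# Assumption: the EC2 role needs access to the data bucket only.
data "aws_iam_policy_document" "ec2_s3_access" {
  # Listing is granted on the bucket itself.
  statement {
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_bucket.arn]
  }

  # Object reads and writes are granted on keys inside the bucket.
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.data_bucket.arn}/*"]
  }
}

# Inline policy attached to the role defined above.
resource "aws_iam_role_policy" "ec2_s3_access" {
  name   = "ec2-s3-data-bucket-access" # hypothetical name
  role   = aws_iam_role.ec2_s3_role.id
  policy = data.aws_iam_policy_document.ec2_s3_access.json
}
```

Scoping the resources to one bucket ARN keeps the role aligned with the least-privilege guidance in the security note.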
Variables Configuration
Create terraform/variables.tf for reusable configurations:
variable "aws_region" {
description = "AWS region for resources"
type = string
default = "us-east-1"
}
variable "environment" {
description = "Environment name"
type = string
default = "dev"
}
variable "bucket_name" {
description = "S3 bucket name for data storage"
type = string
# Set via terraform.tfvars or -var flag
}
Use variables in main.tf:
provider "aws" {
region = var.aws_region
}
resource "aws_s3_bucket" "data_bucket" {
bucket = var.bucket_name
tags = {
Environment = var.environment
}
}
Create terraform/terraform.tfvars:
bucket_name = "my-unique-bucket-name-2026"
aws_region = "us-west-2"
environment = "production"
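Because terraform.tfvars can set any string, a validation block can catch typos in the environment name at plan time. This sketch would replace the existing environment declaration in variables.tf rather than sit alongside it; the list of allowed values is an assumption chosen to cover the "dev" and "production" values used in this document.

```hcl
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  # Reject values outside the expected set (assumed list).
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}
```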
State Management
Inspect State
List all resources in the state:
terraform -chdir=terraform state list
List the type and name of each resource directly from the local state file:
jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate
Remote State (Production Pattern)
For production, store state remotely in S3:
# terraform/backend.tf
terraform {
backend "s3" {
bucket = "my-terraform-state-bucket"
key = "data-platform/terraform.tfstate"
region = "us-east-1"
encrypt = true
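dynamodb_table = "terraform-state-lock" # DynamoDB table used for state locking; recent Terraform releases also support S3-native locking via use_lockfile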
}
}
Initialize with backend configuration:
terraform -chdir=terraform init -backend-config="bucket=${TERRAFORM_STATE_BUCKET}"
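The backend block assumes the state bucket and lock table already exist. One possible way to create them is a small, separate bootstrap configuration applied once with local state, sketched below. The resource labels are hypothetical; the bucket and table names mirror the backend example, and the LockID string hash key is what the S3 backend expects for DynamoDB locking.

```hcl
# Bootstrap sketch: apply this separately, before running init with the S3 backend.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning makes it possible to recover an earlier state file.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Lock table: the partition key must be a string attribute named LockID.
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```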
Verification Commands
Verify S3 Bucket Creation
aws s3 ls
Verify EC2 Instance
aws ec2 describe-instances \
--filters "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
--output table
Check Specific Resource
terraform -chdir=terraform state show aws_s3_bucket.data_bucket
Common Patterns for Data Engineering
Pattern 1: Data Lake with Multiple Buckets
# Raw data bucket
resource "aws_s3_bucket" "raw_data" {
bucket = "my-data-lake-raw-${var.environment}"
}
# Processed data bucket
resource "aws_s3_bucket" "processed_data" {
bucket = "my-data-lake-processed-${var.environment}"
}
# Enable versioning for data lineage
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
bucket = aws_s3_bucket.raw_data.id
versioning_configuration {
status = "Enabled"
}
}
# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "raw_data_lifecycle" {
bucket = aws_s3_bucket.raw_data.id
rule {
id = "archive-old-data"
status = "Enabled"
filter {} # Empty filter: apply the rule to every object in the bucket
transition {
days = 90
storage_class = "GLACIER"
}
}
}
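As the number of zones grows, the per-bucket resources above can be generated from a single definition. This is an alternative sketch using for_each, not the project's own layout; the local name data_lake_zones and the resource label data_lake are hypothetical. Switching an already-applied configuration to this form changes resource addresses, so it is best suited to new deployments.

```hcl
locals {
  # Assumed zone names, matching the raw and processed buckets above.
  data_lake_zones = toset(["raw", "processed"])
}

# One bucket per zone, e.g. my-data-lake-raw-dev.
resource "aws_s3_bucket" "data_lake" {
  for_each = local.data_lake_zones

  bucket = "my-data-lake-${each.key}-${var.environment}"
}

# Versioning enabled on every bucket created above.
resource "aws_s3_bucket_versioning" "data_lake" {
  for_each = aws_s3_bucket.data_lake

  bucket = each.value.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Individual buckets are then addressed by key, for example aws_s3_bucket.data_lake["raw"].id.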
Pattern 2: EC2 with Data Processing Tools
# Security group for data processor
resource "aws_security_group" "data_processor_sg" {
name = "data-processor-sg"
description = "Security group for data processing instances"
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # Restrict in production
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
# EC2 instance with user data for setup
resource "aws_instance" "data_processor" {
ami = var.ami_id # Requires an "ami_id" variable declared in variables.tf
instance_type = "t3.medium"
vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
iam_instance_profile = aws_iam_instance_profile.ec2_profile.name
user_data = <<-EOF
#!/bin/bash
yum update -y
yum install -y python3 python3-pip
pip3 install pandas boto3
EOF
tags = {
Name = "Data Processor Instance"
}
}
# IAM instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
name = "ec2-data-processor-profile"
role = aws_iam_role.ec2_s3_role.name
}
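Pattern 2 references var.ami_id, which is not among the variables defined earlier. The sketch below shows two possible ways to supply it: declaring the variable, or looking up a recent Amazon Linux 2 image with a data source. The data source label and the name filter are assumptions; adjust the filter for a different distribution or architecture.

```hcl
# Option A: declare the variable and set it in terraform.tfvars.
variable "ami_id" {
  description = "AMI ID for the data processing instance (region-specific)"
  type        = string
}

# Option B: look up the most recent Amazon Linux 2 AMI in the provider's region.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # assumed name pattern
  }
}
```

With Option B, the instance would use ami = data.aws_ami.amazon_linux_2.id instead of var.ami_id.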
Pattern 3: Outputs for Integration
# terraform/outputs.tf
output "s3_bucket_name" {
description = "Name of the S3 bucket"
value = aws_s3_bucket.data_bucket.id
}
output "s3_bucket_arn" {
description = "ARN of the S3 bucket"
value = aws_s3_bucket.data_bucket.arn
}
output "ec2_instance_id" {
description = "ID of the EC2 instance"
value = aws_instance.data_processor.id
}
output "ec2_public_ip" {
description = "Public IP of the EC2 instance"
value = aws_instance.data_processor.public_ip
}
Access outputs:
terraform -chdir=terraform output
terraform -chdir=terraform output -json | jq -r '.s3_bucket_name.value'
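Outputs can also be consumed by another Terraform configuration, for example a separate pipeline project that needs the bucket name. This sketch assumes the state is stored in the S3 backend shown in the Remote State section; the data source label data_platform and the local name are hypothetical.

```hcl
# Read the outputs of the data platform configuration from its remote state.
data "terraform_remote_state" "data_platform" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state-bucket"
    key    = "data-platform/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Refers to the s3_bucket_name output defined in outputs.tf.
  data_bucket_name = data.terraform_remote_state.data_platform.outputs.s3_bucket_name
}
```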
Troubleshooting
Issue: "Error acquiring the state lock"
Cause: Another Terraform process is running or a previous run didn't release the lock.
Solution:
# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>
Issue: "bucket name already exists"
Cause: S3 bucket names must be globally unique across all AWS accounts.
Solution: Change the bucket name in main.tf to something unique:
resource "aws_s3_bucket" "data_bucket" {
bucket = "my-unique-name-${random_id.bucket_suffix.hex}"
}
resource "random_id" "bucket_suffix" {
byte_length = 4
}
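The random_id resource comes from the hashicorp/random provider, so it helps to declare that provider next to the AWS one and run terraform init again. A minimal sketch, assuming the 3.x release line of the random provider:

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    # Needed for random_id.bucket_suffix (version constraint is an assumption).
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}
```

The generated suffix is stored in state, so the bucket name stays the same across later runs unless the random_id resource is replaced.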
Issue: "insufficient IAM permissions"
Cause: The IAM user doesn't have required permissions.
Solution: Verify IAM policy includes necessary actions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"ec2:*",
"iam:*"
],
"Resource": "*"
}
]
}
Issue: State file out of sync
Cause: Manual changes made outside Terraform.
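Solution: Reconcile the state with the real infrastructure using a refresh-only apply (this replaces the deprecated terraform refresh command):
terraform -chdir=terraform apply -refresh-only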
Or import existing resources:
terraform -chdir=terraform import aws_s3_bucket.data_bucket my-existing-bucket
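On Terraform 1.5 or later, the same import can be expressed declaratively, which lets it be reviewed in terraform plan before anything is written to state. A sketch using the address and bucket name from the command above; a matching aws_s3_bucket.data_bucket resource block must still exist in the configuration.

```hcl
# Declarative import: processed during the next plan/apply.
import {
  to = aws_s3_bucket.data_bucket
  id = "my-existing-bucket"
}
```

Once the import has been applied, the import block can be removed.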
Workflow Example
Complete workflow for setting up data infrastructure:
# 1. Export AWS credentials already set in your shell (skip this step if you ran aws configure)
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"
# 2. Customize configuration
cd terraform
# Edit main.tf to set unique bucket name
# 3. Initialize Terraform
terraform init
# 4. Validate configuration
terraform validate
# 5. Format code
terraform fmt
# 6. Preview changes
terraform plan
# 7. Apply configuration
terraform apply
# 8. Verify resources
aws s3 ls
aws ec2 describe-instances --output table
# 9. When done, clean up
terraform destroy
Best Practices for Data Engineering IaC
- Use variables for environment-specific values
- Enable S3 versioning for data lineage and recovery
- Tag all resources for cost tracking and management
- Store state remotely in S3 with encryption and locking
- Use modules to organize reusable infrastructure components
- Never commit .tfstate files or AWS credentials to version control
- Implement lifecycle rules on S3 for cost optimization
- Use IAM roles instead of access keys for EC2 instances
- Plan before apply to review changes
- Destroy unused resources to avoid unnecessary costs
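The tagging practice above can be applied in one place instead of per resource. This sketch assumes the aws_region and environment variables from variables.tf and uses the AWS provider's default_tags block; tags set on an individual resource are merged with these defaults and take precedence for the same key.

```hcl
provider "aws" {
  region = var.aws_region

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}
```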



