IaC for Data Engineering with Terraform

Skill by ara.so — Data Skills collection.

This project provides Infrastructure-as-Code (IaC) templates and patterns for data engineers using Terraform to provision and manage AWS resources. It focuses on creating reproducible, version-controlled infrastructure for data platforms including S3 storage, EC2 compute instances, and IAM permissions.

What This Project Does

Provides Terraform configurations for common data engineering infrastructure on AWS
Demonstrates IaC best practices for S3 buckets, EC2 instances, and IAM roles
Shows state management and lifecycle operations for data infrastructure
Teaches reproducible infrastructure provisioning for data pipelines

Prerequisites

Before using this project, ensure you have:

AWS account with administrator access (an IAM admin user or role; avoid using root credentials for day-to-day work)
Terraform CLI installed (installation guide)
AWS CLI installed and configured (setup guide)
AWS Credentials configured via aws configure

AWS IAM Setup

Create an IAM user with appropriate permissions:

Create IAM User: Navigate to AWS Console → IAM → Users → Create user
Create Inline Policy: Attach a custom policy to the user
Grant Permissions: For development/learning, grant full access to:

Amazon S3
Amazon EC2
AWS IAM

⚠️ Security Note: Full service access is NOT recommended for production. Use least-privilege policies in production environments.

Project Structure

terraform/
├── main.tf           # Main Terraform configuration
├── variables.tf      # Input variables (if present)
├── outputs.tf        # Output values (if present)
└── terraform.tfstate # State file (generated)

Key Terraform Commands

Initialize Terraform

Initialize the working directory and download provider plugins:

terraform -chdir=terraform init

Validate Configuration

Check that the configuration is syntactically valid and internally consistent:

terraform -chdir=terraform validate

Format Code

Automatically format Terraform files to canonical style:

terraform -chdir=terraform fmt

Plan Infrastructure Changes

Preview what Terraform will create/modify/destroy:

terraform -chdir=terraform plan

Apply Configuration

Create or update infrastructure:

terraform -chdir=terraform apply

Terraform will show a plan and ask for confirmation. Type yes to proceed.

Auto-approve (for automation)

terraform -chdir=terraform apply -auto-approve

Destroy Infrastructure

Remove all resources managed by Terraform:

terraform -chdir=terraform destroy
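
Because destroy removes every managed resource, a bucket that holds hard-to-replace data may warrant a guard. The following is a minimal sketch, assuming the data_bucket resource defined in terraform/main.tf; whether this particular bucket needs protection is an assumption, not something the project prescribes.

```hcl
# Sketch: the same bucket as in main.tf, with a destroy guard added.
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-data-engineering-bucket-12345"

  lifecycle {
    # Any plan that would delete this bucket (including terraform destroy)
    # fails with an error until this setting is removed.
    prevent_destroy = true
  }
}
```

With this in place, a full teardown requires removing the lifecycle block first, which makes deleting the bucket a deliberate step.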

Configuration

Basic Terraform Configuration Example

Before applying, modify terraform/main.tf to customize resource names:

# terraform/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-data-engineering-bucket-12345"
  
  tags = {
    Name        = "Data Engineering Bucket"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# EC2 instance for data processing
resource "aws_instance" "data_processor" {
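  # Example ID only: AMI IDs are region-specific, so substitute a current
  # Amazon Linux 2 AMI for the region configured above.
  ami           = "ami-0c55b159cbfafe1f0"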
  instance_type = "t2.micro"
  
  tags = {
    Name        = "Data Processor"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# IAM role for EC2 instance
resource "aws_iam_role" "ec2_s3_role" {
  name = "ec2-s3-access-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}
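
The role above only defines who may assume it; it grants no S3 permissions yet. A minimal sketch of an inline policy that could be attached, assuming the instance only needs to list, read, and write objects in data_bucket (the policy name and the action list are illustrative assumptions):

```hcl
# Assumption: the instance only needs to list, read, and write objects
# in the data bucket defined above.
resource "aws_iam_role_policy" "ec2_s3_access" {
  name = "ec2-s3-data-bucket-access" # hypothetical policy name
  role = aws_iam_role.ec2_s3_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Bucket-level action: applies to the bucket ARN itself
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = aws_s3_bucket.data_bucket.arn
      },
      {
        # Object-level actions: apply to keys inside the bucket
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "${aws_s3_bucket.data_bucket.arn}/*"
      }
    ]
  })
}
```

Scoping the statements to one bucket follows the least-privilege guidance in the security note, in contrast to the full-access policy used for learning.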

Variables Configuration

Create terraform/variables.tf for reusable configurations:

variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "bucket_name" {
  description = "S3 bucket name for data storage"
  type        = string
  # Set via terraform.tfvars or -var flag
}

Use variables in main.tf:

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_bucket" {
  bucket = var.bucket_name
  
  tags = {
    Environment = var.environment
  }
}

Create terraform/terraform.tfvars:

bucket_name = "my-unique-bucket-name-2026"
aws_region  = "us-west-2"
environment = "production"
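
Since environment feeds into tags and resource names, a typo in terraform.tfvars can silently create mislabeled resources. One possible safeguard is a validation block; the sketch below would replace the plain environment declaration in variables.tf, and the list of allowed names is an assumption to adapt.

```hcl
# Sketch: replaces the plain "environment" declaration in variables.tf.
variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"

  validation {
    # Assumed set of allowed names; adjust to your own environments.
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "The environment must be one of: dev, staging, production."
  }
}
```

With this declaration, terraform plan stops with the error message above if an unexpected value is supplied.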

State Management

Inspect State

List all resources in the state:

terraform -chdir=terraform state list

Summarize the resource types and names recorded in the local state file (this only works while state is stored locally):

jq -r '.resources[] | [.type, .name] | join(",")' terraform/terraform.tfstate

Remote State (Production Pattern)

For production, store state remotely in S3 with encryption enabled and a DynamoDB table for state locking:

# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

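Initialize the backend. A -backend-config value supplied on the command line overrides the matching setting in backend.tf, which keeps environment-specific bucket names out of the code: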

terraform -chdir=terraform init -backend-config="bucket=${TERRAFORM_STATE_BUCKET}"
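
The backend above expects the state bucket and the lock table to exist before init runs. A minimal sketch of a one-time bootstrap configuration that could create them, assuming it lives in its own directory with local state (the resource labels are hypothetical; the bucket and table names are taken from backend.tf):

```hcl
# Sketch of a separate bootstrap configuration (applied once, with local state).
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning keeps earlier state files recoverable.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # the S3 backend expects this exact key name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Keeping the bootstrap resources outside the main configuration avoids the circular dependency of a backend that stores state in a bucket it also manages.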

Verification Commands

Verify S3 Bucket Creation

aws s3 ls

Verify EC2 Instance

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`]|[0].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

Check Specific Resource

terraform -chdir=terraform state show aws_s3_bucket.data_bucket

Common Patterns for Data Engineering

Pattern 1: Data Lake with Multiple Buckets

# Raw data bucket
resource "aws_s3_bucket" "raw_data" {
  bucket = "my-data-lake-raw-${var.environment}"
}

# Processed data bucket
resource "aws_s3_bucket" "processed_data" {
  bucket = "my-data-lake-processed-${var.environment}"
}

# Enable versioning for data lineage
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
  bucket = aws_s3_bucket.raw_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "raw_data_lifecycle" {
  bucket = aws_s3_bucket.raw_data.id
  
  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}
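
Data lake buckets usually should not be publicly reachable and often need encryption at rest. As a hedged sketch, the raw bucket above could be hardened as follows; the resource labels are hypothetical, and SSE-S3 (AES256) is an assumed choice rather than a project requirement.

```hcl
# Block every form of public access to the raw bucket.
resource "aws_s3_bucket_public_access_block" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt new objects by default with S3-managed keys (assumed choice).
resource "aws_s3_bucket_server_side_encryption_configuration" "raw_data" {
  bucket = aws_s3_bucket.raw_data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

The same two resources can be repeated for processed_data by pointing bucket at aws_s3_bucket.processed_data.id.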

Pattern 2: EC2 with Data Processing Tools

# Security group for data processor
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing instances"
  
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Restrict in production
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

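# EC2 instance with user data for setup. This replaces the basic
# aws_instance.data_processor from main.tf (resource names must be unique),
# and var.ami_id must be declared as an input variable.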
resource "aws_instance" "data_processor" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name
  
  user_data = <<-EOF
              #!/bin/bash
              yum update -y
              yum install -y python3 python3-pip
              pip3 install pandas boto3
              EOF
  
  tags = {
    Name = "Data Processor Instance"
  }
}

# IAM instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-data-processor-profile"
  role = aws_iam_role.ec2_s3_role.name
}
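
The instance above reads var.ami_id, which variables.tf does not declare. One way to supply it, sketched here under the assumption that Amazon Linux 2 on x86_64 is wanted, is to declare the variable as an optional override and fall back to a data source lookup. The local value name is a hypothetical helper.

```hcl
variable "ami_id" {
  description = "Optional AMI ID override; when null, the latest Amazon Linux 2 AMI is looked up"
  type        = string
  default     = null
}

# Look up the most recent Amazon Linux 2 AMI in the provider's region.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

locals {
  # Hypothetical helper: use the override if given, else the looked-up AMI.
  ami_id = var.ami_id != null ? var.ami_id : data.aws_ami.amazon_linux_2.id
}
```

To use the fallback, set ami = local.ami_id in the aws_instance resource. Note that most_recent means a newly published AMI can show up as a planned instance replacement on a later run.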

Pattern 3: Outputs for Integration

# terraform/outputs.tf
output "s3_bucket_name" {
  description = "Name of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}

Access outputs:

terraform -chdir=terraform output
terraform -chdir=terraform output -json | jq -r '.s3_bucket_name.value'
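
Outputs can also be consumed by another Terraform configuration. The sketch below assumes the state is stored with the S3 backend settings from the Remote State section; the data source label and the local value name are hypothetical.

```hcl
# In a separate, downstream configuration: read this project's outputs.
data "terraform_remote_state" "data_platform" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state-bucket"
    key    = "data-platform/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # Resolves to the s3_bucket_name output defined in outputs.tf
  data_bucket_name = data.terraform_remote_state.data_platform.outputs.s3_bucket_name
}
```

Only root-module outputs are visible this way, so anything a downstream pipeline needs has to be declared in outputs.tf.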

Troubleshooting

Issue: "Error acquiring the state lock"

Cause: Another Terraform process is running or a previous run didn't release the lock.

Solution:

# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>

Issue: "bucket name already exists"

Cause: S3 bucket names must be globally unique across all AWS accounts.

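Solution: Change the bucket name in main.tf to something unique, for example by appending a random suffix. The random_id resource comes from the hashicorp/random provider, so re-run terraform init after adding it: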

resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-name-${random_id.bucket_suffix.hex}"
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}
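
Terraform can infer the random provider on its own, but declaring it alongside aws pins its version. A sketch of the combined block follows; it would replace the existing terraform block in main.tf, since a module may contain only one required_providers block, and the random version constraint is an assumption.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0" # assumed version constraint
    }
  }
}
```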

Issue: "insufficient IAM permissions"

Cause: The IAM user doesn't have required permissions.

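Solution: Verify that the IAM policy includes the necessary actions. The broad development policy described under AWS IAM Setup looks like this (not suitable for production):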

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "ec2:*",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}

Issue: State file out of sync

Cause: Manual changes made outside Terraform.

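Solution: Reconcile the state with the real infrastructure. The standalone terraform refresh command is deprecated; use a refresh-only apply, which shows the detected changes and asks for confirmation before updating the state:

terraform -chdir=terraform apply -refresh-only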

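Or import existing resources that were created outside Terraform (the matching resource block must already exist in the configuration):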

terraform -chdir=terraform import aws_s3_bucket.data_bucket my-existing-bucket
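
On Terraform 1.5 or later, the same import can be expressed declaratively, so it appears in plan output and can be reviewed before it touches state. A minimal sketch, reusing the resource address and bucket name from the command above:

```hcl
# Declarative equivalent of the CLI import (requires Terraform 1.5+).
import {
  to = aws_s3_bucket.data_bucket
  id = "my-existing-bucket"
}
```

The block can be deleted once an apply has completed the import.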

Workflow Example

Complete workflow for setting up data infrastructure:

# 1. Configure AWS credentials (export values already set in your shell; never hardcode secrets in scripts)
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"

# 2. Customize configuration
cd terraform
# Edit main.tf to set unique bucket name

# 3. Initialize Terraform
terraform init

# 4. Validate configuration
terraform validate

# 5. Format code
terraform fmt

# 6. Preview changes
terraform plan

# 7. Apply configuration
terraform apply

# 8. Verify resources
aws s3 ls
aws ec2 describe-instances --output table

# 9. When done, clean up
terraform destroy

Best Practices for Data Engineering IaC

Use variables for environment-specific values
Enable S3 versioning for data lineage and recovery
Tag all resources for cost tracking and management
Store state remotely in S3 with encryption and locking
Use modules to organize reusable infrastructure components
Never commit .tfstate files or AWS credentials to version control
Implement lifecycle rules on S3 for cost optimization
Use IAM roles instead of access keys for EC2 instances
Plan before apply to review changes
Destroy unused resources to avoid unnecessary costs
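
The tagging practice in the list above can be applied once at the provider level rather than repeated on every resource. A sketch, assuming the aws_region and environment variables from variables.tf; it would replace the existing provider block.

```hcl
provider "aws" {
  region = var.aws_region

  # Applied to every taggable resource this provider creates;
  # tags set on an individual resource override these on conflict.
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
    }
  }
}
```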
