Remote OpenClaw
Menu
SkillsMCPPluginsMarketplaceGuideAgentsAdvertise
Remote OpenClaw
SkillsMCPPluginsMarketplaceGuideAgentsAdvertise
Skills/aradotso/data-skills/iac-terraform-data-engineering

iac-terraform-data-engineering

aradotso/data-skills
612 installs1 stars

Installation

npx skills add https://github.com/aradotso/data-skills --skill iac-terraform-data-engineering

Summary

Infrastructure-as-Code fundamentals for data engineers using Terraform to provision AWS resources (S3, EC2, IAM)

SKILL.md

IaC for Data Engineering with Terraform

Skill by ara.so — Data Skills collection.

This project provides Infrastructure-as-Code (IaC) templates and patterns for data engineers using Terraform to provision and manage AWS resources. It focuses on creating reproducible, version-controlled infrastructure for data platforms including S3 storage, EC2 compute instances, and IAM permissions.

What This Project Does

  • Provides Terraform configurations for common data engineering infrastructure on AWS
  • Demonstrates IaC best practices for S3 buckets, EC2 instances, and IAM roles
  • Shows state management and lifecycle operations for data infrastructure
  • Teaches reproducible infrastructure provisioning for data pipelines

Prerequisites

Before using this project, ensure you have:

  1. AWS Account with root or admin access
  2. Terraform CLI installed (installation guide)
  3. AWS CLI installed and configured (setup guide)
  4. AWS Credentials configured via aws configure

AWS IAM Setup

Create an IAM user with appropriate permissions:

  1. Create IAM User: Navigate to AWS Console → IAM → Users → Create user
  2. Create Inline Policy: Attach a custom policy to the user
  3. Grant Permissions: For development/learning, grant full access to:
  • Amazon S3
  • Amazon EC2
  • AWS IAM

⚠️ Security Note: Full service access is NOT recommended for production. Use least-privilege policies in production environments.

Project Structure

terraform/
├── main.tf           # Main Terraform configuration
├── variables.tf      # Input variables (if present)
├── outputs.tf        # Output values (if present)
└── terraform.tfstate # State file (generated)

Key Terraform Commands

Initialize Terraform

Initialize the working directory and download provider plugins:

terraform -chdir=terraform init

Validate Configuration

Check if the configuration is syntactically valid:

terraform -chdir=terraform validate

Format Code

Automatically format Terraform files to canonical style:

terraform -chdir=terraform fmt

Plan Infrastructure Changes

Preview what Terraform will create/modify/destroy:

terraform -chdir=terraform plan

Apply Configuration

Create or update infrastructure:

terraform -chdir=terraform apply

Terraform will show a plan and ask for confirmation. Type yes to proceed.

Auto-approve (for automation)

terraform -chdir=terraform apply -auto-approve

Destroy Infrastructure

Remove all resources managed by Terraform:

terraform -chdir=terraform destroy

Configuration

Basic Terraform Configuration Example

Before applying, modify terraform/main.tf to customize resource names:

# terraform/main.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# S3 bucket for data storage
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-data-engineering-bucket-12345"
  
  tags = {
    Name        = "Data Engineering Bucket"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# EC2 instance for data processing
resource "aws_instance" "data_processor" {
  ami           = "ami-0c55b159cbfafe1f0"  # Amazon Linux 2
  instance_type = "t2.micro"
  
  tags = {
    Name        = "Data Processor"
    Environment = "dev"
    ManagedBy   = "Terraform"
  }
}

# IAM role for EC2 instance
resource "aws_iam_role" "ec2_s3_role" {
  name = "ec2-s3-access-role"
  
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

Variables Configuration

Create terraform/variables.tf for reusable configurations:

variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "bucket_name" {
  description = "S3 bucket name for data storage"
  type        = string
  # Set via terraform.tfvars or -var flag
}

Use variables in main.tf:

provider "aws" {
  region = var.aws_region
}

resource "aws_s3_bucket" "data_bucket" {
  bucket = var.bucket_name
  
  tags = {
    Environment = var.environment
  }
}

Create terraform/terraform.tfvars:

bucket_name  = "my-unique-bucket-name-2026"
aws_region   = "us-west-2"
environment  = "production"

State Management

Inspect State

List all resources in the state:

terraform -chdir=terraform state list

View detailed state information:

cat terraform/terraform.tfstate | jq -r '.resources[] | [.type, .name] | join(",")'

Remote State (Production Pattern)

For production, store state remotely in S3:

# terraform/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "data-platform/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

Initialize with backend configuration:

terraform -chdir=terraform init -backend-config="bucket=${TERRAFORM_STATE_BUCKET}"

Verification Commands

Verify S3 Bucket Creation

aws s3 ls

Verify EC2 Instance

aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId, Name:Tags[?Key==`Name`].Value, Type:InstanceType, State:State.Name, PublicIP:PublicIpAddress, PrivateIP:PrivateIpAddress}' \
  --output table

Check Specific Resource

terraform -chdir=terraform show aws_s3_bucket.data_bucket

Common Patterns for Data Engineering

Pattern 1: Data Lake with Multiple Buckets

# Raw data bucket
resource "aws_s3_bucket" "raw_data" {
  bucket = "my-data-lake-raw-${var.environment}"
}

# Processed data bucket
resource "aws_s3_bucket" "processed_data" {
  bucket = "my-data-lake-processed-${var.environment}"
}

# Enable versioning for data lineage
resource "aws_s3_bucket_versioning" "raw_data_versioning" {
  bucket = aws_s3_bucket.raw_data.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Lifecycle rules for cost optimization
resource "aws_s3_bucket_lifecycle_configuration" "raw_data_lifecycle" {
  bucket = aws_s3_bucket.raw_data.id
  
  rule {
    id     = "archive-old-data"
    status = "Enabled"
    
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
  }
}

Pattern 2: EC2 with Data Processing Tools

# Security group for data processor
resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing instances"
  
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Restrict in production
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# EC2 instance with user data for setup
resource "aws_instance" "data_processor" {
  ami           = var.ami_id
  instance_type = "t3.medium"
  
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  iam_instance_profile   = aws_iam_instance_profile.ec2_profile.name
  
  user_data = <<-EOF
              #!/bin/bash
              yum update -y
              yum install -y python3 python3-pip
              pip3 install pandas boto3
              EOF
  
  tags = {
    Name = "Data Processor Instance"
  }
}

# IAM instance profile
resource "aws_iam_instance_profile" "ec2_profile" {
  name = "ec2-data-processor-profile"
  role = aws_iam_role.ec2_s3_role.name
}

Pattern 3: Outputs for Integration

# terraform/outputs.tf
output "s3_bucket_name" {
  description = "Name of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_bucket.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}

Access outputs:

terraform -chdir=terraform output
terraform -chdir=terraform output -json | jq -r '.s3_bucket_name.value'

Troubleshooting

Issue: "Error acquiring the state lock"

Cause: Another Terraform process is running or a previous run didn't release the lock.

Solution:

# Force unlock (use with caution)
terraform -chdir=terraform force-unlock <LOCK_ID>

Issue: "bucket name already exists"

Cause: S3 bucket names must be globally unique across all AWS accounts.

Solution: Change the bucket name in main.tf to something unique:

resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-unique-name-${random_id.bucket_suffix.hex}"
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}

Issue: "insufficient IAM permissions"

Cause: The IAM user doesn't have required permissions.

Solution: Verify IAM policy includes necessary actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:*",
        "ec2:*",
        "iam:*"
      ],
      "Resource": "*"
    }
  ]
}

Issue: State file out of sync

Cause: Manual changes made outside Terraform.

Solution: Refresh the state:

terraform -chdir=terraform refresh

Or import existing resources:

terraform -chdir=terraform import aws_s3_bucket.data_bucket my-existing-bucket

Workflow Example

Complete workflow for setting up data infrastructure:

# 1. Configure AWS credentials
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}"
export AWS_DEFAULT_REGION="us-east-1"

# 2. Customize configuration
cd terraform
# Edit main.tf to set unique bucket name

# 3. Initialize Terraform
terraform init

# 4. Validate configuration
terraform validate

# 5. Format code
terraform fmt

# 6. Preview changes
terraform plan

# 7. Apply configuration
terraform apply

# 8. Verify resources
aws s3 ls
aws ec2 describe-instances --output table

# 9. When done, clean up
terraform destroy

Best Practices for Data Engineering IaC

  1. Use variables for environment-specific values
  2. Enable S3 versioning for data lineage and recovery
  3. Tag all resources for cost tracking and management
  4. Store state remotely in S3 with encryption and locking
  5. Use modules to organize reusable infrastructure components
  6. Never commit .tfstate files or AWS credentials to version control
  7. Implement lifecycle rules on S3 for cost optimization
  8. Use IAM roles instead of access keys for EC2 instances
  9. Plan before apply to review changes
  10. Destroy unused resources to avoid unnecessary costs

Featured

QwikClaw — one-click deploy OpenClaw logoQwikClaw — one-click deploy OpenClaw

Your own always-on OpenClaw agent, live in 60 seconds. No server, no setup — pick a model, connect Telegram, done.

Deploy your agent →
MoltAwards - Agent internet for government contracts + jobs. logoMoltAwards - Agent internet for government contracts + jobs.

MoltAwards is an agent-native social layer for matchawards.com.

Learn more →
CLN.Work — Stop prompting, start hiring AI employees logoCLN.Work — Stop prompting, start hiring AI employees

Turn your Claude agents into a real team — onboard them, assign tasks, and manage them like staff.

Hire AI employees →
Deploy your own AI agent logoDeploy your own AI agent

Launch OpenClaw or Hermes on Hostinger in about 60 seconds, keep your agent live 24/7, earn 20%-40% on your next referral up to $25-$45, and give your friend 20% off.

Launch on Hostinger →
AdvertiseGet your AI tool in front of 67,000+ AI enthusiastsSee placements & pricing →

Categories

Command ExecutionExternal Downloads
View on GitHub

Recommended skills

Browse all →

firebase-data-connect

firebase/agent-skills

90K installsInstall

find-skills

vercel-labs/skills

2.2M installsInstall

frontend-design

anthropics/skills

601K installsInstall

vercel-react-best-practices

vercel-labs/agent-skills

509K installsInstall

agent-browser

vercel-labs/agent-browser

492K installsInstall

web-design-guidelines

vercel-labs/agent-skills

423K installsInstall

Browse

Skills by category

Frontend250Git198Data154Testing120Design105Docs103Security96Automation87Backend76Devops37Productivity29Mcp23

Advertise on Remote OpenClaw

Get your AI tool in front of 67,000+ AI enthusiasts a month

See placements & pricing →

Remote OpenClaw

AI agent skills directory, marketplace, and workflow hub for OpenClaw, Hermes Agent, Claude Code, Codex, and MCP-powered operator stacks.

Explore

  • Home
  • Skills Directory
  • Claude Code Skills
  • Codex Skills
  • Marketplace
  • Hermes Ecosystem
  • Agents
  • Guide
  • Learn
  • Blog

More

  • Playbook
  • Free Tools
  • Shipping
  • Contact
  • Terms
  • Privacy
© 2026 Remote OpenClaw