Remote OpenClaw

iac-data-engineering-terraform

aradotso/data-skills
624 installs, 1 star

Installation

npx skills add https://github.com/aradotso/data-skills --skill iac-data-engineering-terraform

Summary

Infrastructure-as-Code patterns for data engineering with Terraform on AWS (S3, EC2, IAM)

SKILL.md

IaC for Data Engineering with Terraform

Skill by ara.so — Data Skills collection.

This project demonstrates Infrastructure-as-Code (IaC) fundamentals for data engineers using Terraform to provision AWS resources including S3 buckets, EC2 instances, and IAM configurations. It provides reusable patterns for managing data infrastructure declaratively.

What This Project Does

  • Provisions AWS S3 buckets for data storage
  • Creates and configures EC2 instances for data processing
  • Sets up an IAM role, inline policy, and instance profile that give EC2 read/write access to the data bucket
  • Manages infrastructure state with Terraform
  • Provides reproducible data engineering environments

Prerequisites

Before using this project, install Terraform and the AWS CLI (shown here with Homebrew) and configure AWS credentials:

# Install Terraform
brew tap hashicorp/tap
brew install hashicorp/tap/terraform

# Install AWS CLI
brew install awscli

# Configure AWS credentials
aws configure
# Enter your AWS Access Key ID, Secret Access Key, region, and output format

Alternatively, supply credentials through environment variables, which take precedence over the profile written by aws configure:

export AWS_ACCESS_KEY_ID=$YOUR_ACCESS_KEY
export AWS_SECRET_ACCESS_KEY=$YOUR_SECRET_KEY
export AWS_DEFAULT_REGION=us-east-1

Project Structure

terraform/
├── main.tf          # Main infrastructure definitions
├── variables.tf     # Input variables
├── outputs.tf       # Output values
└── terraform.tfstate # State file (auto-generated)

Core Terraform Commands

Initialize Terraform

# Initialize the working directory and download providers
terraform -chdir=terraform init

# Validate configuration syntax
terraform -chdir=terraform validate

# Format configuration files
terraform -chdir=terraform fmt
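
The configuration patterns later in this guide use the aws and random providers and a var.aws_region variable, but no provider configuration is shown. A minimal sketch of what terraform init would need to find, assuming a hypothetical providers.tf file and version constraints chosen only for illustration:

```hcl
# providers.tf (hypothetical file name, not part of the original layout)
terraform {
  # Assumed minimum CLI version; adjust to whatever your team standardizes on
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # assumed constraint
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0" # assumed constraint; used by random_id
    }
  }
}

# Region comes from the aws_region variable defined in variables.tf
provider "aws" {
  region = var.aws_region
}
```

After adding or changing provider constraints, rerun terraform init so the matching provider plugins are downloaded.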

Plan and Apply Infrastructure

# Preview changes without applying
terraform -chdir=terraform plan

# Apply infrastructure changes
terraform -chdir=terraform apply

# Auto-approve without prompts (use carefully)
terraform -chdir=terraform apply -auto-approve

Inspect Infrastructure

# List all resources in state
terraform -chdir=terraform state list

# Show detailed state information
terraform -chdir=terraform show

# Show all output values (append a name such as s3_bucket_name to print a single one)
terraform -chdir=terraform output

Destroy Infrastructure

# Destroy all managed infrastructure
terraform -chdir=terraform destroy

# Destroy a specific resource (the address must match a resource in your configuration)
terraform -chdir=terraform destroy -target=aws_s3_bucket.data_lake

Key Configuration Patterns

S3 Bucket for Data Storage

# main.tf
resource "aws_s3_bucket" "data_lake" {
  bucket = "my-data-engineering-bucket-${random_id.bucket_suffix.hex}"
  
  tags = {
    Environment = "dev"
    Purpose     = "data-engineering"
    ManagedBy   = "terraform"
  }
}

resource "random_id" "bucket_suffix" {
  byte_length = 4
}

# Enable versioning for data protection
resource "aws_s3_bucket_versioning" "data_lake_versioning" {
  bucket = aws_s3_bucket.data_lake.id
  
  versioning_configuration {
    status = "Enabled"
  }
}

# Configure lifecycle rules
resource "aws_s3_bucket_lifecycle_configuration" "data_lake_lifecycle" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "archive-old-data"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
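
Versioning and lifecycle rules protect against accidental deletion, but the bucket above does not explicitly block public access or set default encryption. A hedged sketch of two companion resources that are commonly paired with a data lake bucket; the resource names data_lake_public_access and data_lake_encryption are hypothetical, and AES256 (SSE-S3) is an assumed choice rather than a project requirement:

```hcl
# Block every form of public access to the data lake bucket
resource "aws_s3_bucket_public_access_block" "data_lake_public_access" {
  bucket = aws_s3_bucket.data_lake.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Encrypt new objects at rest with S3-managed keys (assumed algorithm)
resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake_encryption" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

Switching sse_algorithm to "aws:kms" with a kms_master_key_id is the usual variation when key-level access control is needed.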

EC2 Instance for Data Processing

# main.tf
resource "aws_instance" "data_processor" {
  ami           = "ami-0c55b159cbfafe1f0"  # Intended as Amazon Linux 2; AMI IDs are region-specific, so verify this ID in your region
  instance_type = "t3.medium"
  
  key_name = aws_key_pair.data_eng_key.key_name
  
  vpc_security_group_ids = [aws_security_group.data_processor_sg.id]
  
  iam_instance_profile = aws_iam_instance_profile.data_processor_profile.name
  
  user_data = <<-EOF
              #!/bin/bash
              yum update -y
              yum install -y python3 python3-pip
              pip3 install pandas boto3 awscli
              EOF
  
  tags = {
    Name        = "data-processor"
    Environment = "dev"
    ManagedBy   = "terraform"
  }
  
  root_block_device {
    volume_size = 50
    volume_type = "gp3"
  }
}

resource "aws_key_pair" "data_eng_key" {
  key_name   = "data-engineering-key"
  public_key = file("~/.ssh/id_rsa.pub")
}
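
Because AMI IDs differ between regions, a hardcoded ID can fail when aws_region changes. One possible alternative is to look the image up at plan time. This is a sketch only: the data source name amazon_linux_2 is hypothetical, and the name pattern is an assumption about how Amazon Linux 2 images are published.

```hcl
# Look up the most recent Amazon Linux 2 image in the configured region
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"] # assumed naming pattern
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

The instance would then use ami = data.aws_ami.amazon_linux_2.id. Note that most_recent = true means a newly published image changes the ID, which forces the instance to be replaced on the next apply.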

Security Group Configuration

resource "aws_security_group" "data_processor_sg" {
  name        = "data-processor-sg"
  description = "Security group for data processing EC2 instances"
  
  # SSH access
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Restrict in production
  }
  
  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "data-processor-sg"
  }
}
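
The "Restrict in production" comment can be enforced rather than left as a reminder. A minimal sketch, assuming a hypothetical ssh_allowed_cidr variable that rejects malformed values and the open 0.0.0.0/0 range:

```hcl
# variables.tf addition (hypothetical variable, not in the original project)
variable "ssh_allowed_cidr" {
  description = "CIDR block allowed to reach port 22, for example a VPN or office range"
  type        = string

  validation {
    # cidrhost() fails on malformed input, so can() turns that into false
    condition     = can(cidrhost(var.ssh_allowed_cidr, 0)) && var.ssh_allowed_cidr != "0.0.0.0/0"
    error_message = "Provide a valid CIDR block other than 0.0.0.0/0."
  }
}
```

In the SSH ingress block, cidr_blocks = [var.ssh_allowed_cidr] would replace the hardcoded list. Leaving the variable without a default makes Terraform prompt for it, so the range is always a deliberate choice.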

IAM Role for EC2 with S3 Access

resource "aws_iam_role" "data_processor_role" {
  name = "data-processor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ec2.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_role_policy" "s3_access_policy" {
  name = "s3-access-policy"
  role = aws_iam_role.data_processor_role.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject",
          "s3:ListBucket"
        ]
        Resource = [
          aws_s3_bucket.data_lake.arn,
          "${aws_s3_bucket.data_lake.arn}/*"
        ]
      }
    ]
  })
}

resource "aws_iam_instance_profile" "data_processor_profile" {
  name = "data-processor-profile"
  role = aws_iam_role.data_processor_role.name
}
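
The inline policy above grants all three actions on both the bucket ARN and the object ARNs. A somewhat tighter variant, sketched here with the aws_iam_policy_document data source, pairs each action with the resource type it applies to. The data source name s3_access and the sid values are hypothetical.

```hcl
data "aws_iam_policy_document" "s3_access" {
  # ListBucket is a bucket-level action, so it targets the bucket ARN
  statement {
    sid       = "ListDataLakeBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = [aws_s3_bucket.data_lake.arn]
  }

  # GetObject and PutObject are object-level actions, so they target keys
  statement {
    sid       = "ReadWriteDataLakeObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["${aws_s3_bucket.data_lake.arn}/*"]
  }
}
```

To use it, the role policy would set policy = data.aws_iam_policy_document.s3_access.json in place of the jsonencode call. The data source also lets terraform validate catch structural mistakes in the policy.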

Variables and Outputs

Define Variables

# variables.tf
variable "aws_region" {
  description = "AWS region for resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "dev"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
  default     = "data-engineering"
}
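
The resource examples earlier hardcode values such as "dev" and "t3.medium" that these variables are meant to supply. One way to wire them together is through a locals block. This is a sketch; the file name and the names common_tags and bucket_name_prefix are hypothetical.

```hcl
# locals.tf (hypothetical file name)
locals {
  # Tags shared by every resource in this configuration
  common_tags = {
    Environment = var.environment
    Purpose     = "data-engineering"
    ManagedBy   = "terraform"
  }

  # Bucket name stem built from the prefix and environment variables
  bucket_name_prefix = "${var.bucket_prefix}-${var.environment}"
}

# Example references inside the resources shown earlier:
#   instance_type = var.instance_type
#   tags          = merge(local.common_tags, { Name = "data-processor" })
#   bucket        = "${local.bucket_name_prefix}-${random_id.bucket_suffix.hex}"
```

With this in place, switching environments only requires changing variable values, not editing resource blocks.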

Configure Outputs

# outputs.tf
output "s3_bucket_name" {
  description = "Name of the created S3 bucket"
  value       = aws_s3_bucket.data_lake.id
}

output "s3_bucket_arn" {
  description = "ARN of the S3 bucket"
  value       = aws_s3_bucket.data_lake.arn
}

output "ec2_instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.data_processor.id
}

output "ec2_public_ip" {
  description = "Public IP of the EC2 instance"
  value       = aws_instance.data_processor.public_ip
}

output "ec2_private_ip" {
  description = "Private IP of the EC2 instance"
  value       = aws_instance.data_processor.private_ip
}

Remote State Management

For team collaboration, store state in an S3 backend and use a DynamoDB table for state locking:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-bucket-name"
    key            = "data-engineering/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

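Create the backend resources first, in a separate configuration or with local state, because the bucket and lock table must already exist when terraform init configures the backend: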

resource "aws_s3_bucket" "terraform_state" {
  bucket = "terraform-state-bucket-name"
  
  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
}
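
State files can contain sensitive values, and a bad write is easier to recover from when older versions are kept. A hedged sketch of companion resources for the state bucket; the resource names ending in _versioning, _encryption, and _public_access are hypothetical, and AES256 is an assumed choice:

```hcl
# Keep previous state versions so a bad write can be rolled back
resource "aws_s3_bucket_versioning" "terraform_state_versioning" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest with S3-managed keys (assumed algorithm)
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state_encryption" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# State should never be publicly readable
resource "aws_s3_bucket_public_access_block" "terraform_state_public_access" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```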

Verification Commands

After applying infrastructure:

# Verify S3 buckets
aws s3 ls

# Verify EC2 instances
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{ID:InstanceId,Name:Tags[?Key==`Name`].Value,Type:InstanceType,State:State.Name,PublicIP:PublicIpAddress,PrivateIP:PrivateIpAddress}' \
  --output table

# Check IAM roles
aws iam list-roles --query 'Roles[?contains(RoleName, `data-processor`)].RoleName'

# Inspect Terraform state
terraform -chdir=terraform state list
terraform -chdir=terraform state pull | jq -r '.resources[] | [.type, .name] | join(",")'

Common Patterns

Multi-Environment Setup

# environments/dev/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"
  
  environment   = "dev"
  instance_type = "t3.small"
  bucket_prefix = "dev-data"
}

# environments/prod/main.tf
module "data_infrastructure" {
  source = "../../modules/data-infra"
  
  environment   = "prod"
  instance_type = "t3.large"
  bucket_prefix = "prod-data"
}
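
Both environments point at ../../modules/data-infra, which is not shown. A minimal sketch of what that module might contain, assuming it accepts the three inputs passed above. The file layout and the module's contents are assumptions, and the EC2 resource that would consume instance_type is omitted for brevity.

```hcl
# modules/data-infra/variables.tf
variable "environment" {
  description = "Environment name"
  type        = string
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
}

variable "bucket_prefix" {
  description = "Prefix for S3 bucket names"
  type        = string
}

# modules/data-infra/main.tf
resource "random_id" "bucket_suffix" {
  byte_length = 4
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"

  tags = {
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# modules/data-infra/outputs.tf
output "s3_bucket_name" {
  description = "Name of the bucket created by this module"
  value       = aws_s3_bucket.data_lake.id
}
```

Each environment directory can then read module.data_infrastructure.s3_bucket_name, and because each directory has its own state, applying dev does not touch prod.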

Using terraform.tfvars

# terraform.tfvars
aws_region    = "us-west-2"
environment   = "staging"
instance_type = "t3.medium"
bucket_prefix = "staging-data-lake"

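Terraform loads a file named terraform.tfvars from the working directory automatically. The -var-file flag is needed only for files with other names, though passing it explicitly also works: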

terraform -chdir=terraform apply -var-file="terraform.tfvars"

Troubleshooting

State Lock Issues

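# Force unlock a stale lock (only when no other Terraform run is in progress);
# LOCK_ID is the ID printed in the lock error message
terraform -chdir=terraform force-unlock LOCK_ID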

# View current state
terraform -chdir=terraform show

S3 Bucket Name Conflicts

S3 bucket names are globally unique. If the name is already taken, append a random suffix:

# Use random suffix
resource "random_id" "bucket_suffix" {
  byte_length = 8
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.bucket_prefix}-${random_id.bucket_suffix.hex}"
}

Import Existing Resources

# Import existing S3 bucket
terraform -chdir=terraform import aws_s3_bucket.data_lake existing-bucket-name

# Import EC2 instance
terraform -chdir=terraform import aws_instance.data_processor i-1234567890abcdef0
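
Terraform 1.5 and later also support declaring imports in configuration, which lets the import show up in terraform plan before anything is written to state. A sketch of the equivalent of the first command above, reusing the same placeholder bucket name:

```hcl
# Declarative import: adopts the existing bucket into state on the next apply
import {
  to = aws_s3_bucket.data_lake
  id = "existing-bucket-name" # placeholder, as in the CLI example
}
```

A matching resource "aws_s3_bucket" "data_lake" block must exist in the configuration. Once the apply succeeds, the import block can be removed.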

Debugging Terraform

# Enable detailed logging
export TF_LOG=DEBUG
terraform -chdir=terraform apply

# Disable logging
unset TF_LOG

Refresh State

# Sync state with real infrastructure (replaces the deprecated `terraform refresh`)
terraform -chdir=terraform apply -refresh-only

# Force replacement of a degraded or misconfigured resource
terraform -chdir=terraform apply -replace=aws_instance.data_processor

Best Practices

  1. Always use variables for environment-specific values
  2. Enable S3 versioning for data protection
  3. Use IAM roles instead of access keys for EC2
  4. Tag all resources for cost tracking and management
  5. Store state remotely for team collaboration
  6. Use modules for reusable infrastructure patterns
  7. Run terraform plan before every apply
  8. Never commit .tfstate files or sensitive variables to Git
  9. Use .gitignore for Terraform files, but commit .terraform.lock.hcl so provider versions stay pinned across machines:
# .gitignore
.terraform/
*.tfstate
*.tfstate.backup
terraform.tfvars
*.auto.tfvars

© 2026 Remote OpenClaw