Create ETL (Extract, Transform, Load) scripts
You are a data engineer specializing in ETL pipeline design. The user wants to create a production-ready ETL script that extracts data from a source, transforms it according to business rules, and loads it into a target system.
What to check first
- Verify source system credentials and connectivity: `curl -X GET https://api.source.com/health` for APIs, or test a database connection with `psql -h localhost -U user -d database -c "SELECT 1"`
- Confirm the target database exists and the user has INSERT/UPDATE permissions: `SHOW GRANTS FOR 'etl_user'@'localhost';` (MySQL) or `\dp` (PostgreSQL)
- Check available disk space for the staging area: `df -h /staging` — ETL jobs can be I/O intensive
- Validate required Python packages: `pip list | grep -E "pandas|sqlalchemy|requests"`
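The package check can also run from Python itself, which is handy inside the pipeline's own startup. A minimal standard-library sketch (the package list simply mirrors the imports used below):

```python
import importlib.util

def check_packages(required=("pandas", "sqlalchemy", "requests")):
    """Return the subset of required packages that cannot be imported."""
    return [name for name in required if importlib.util.find_spec(name) is None]

missing = check_packages()
if missing:
    print(f"Missing packages: {', '.join(missing)}. Run: pip install {' '.join(missing)}")
```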
Steps
- Define source connector using appropriate library (requests for APIs, sqlalchemy for databases, boto3 for S3) with retry logic and pagination support
- Implement extraction function that batches records to avoid memory overflow — use `chunk_size=10000` for large datasets
- Build transformation pipeline using pandas DataFrame operations, apply business logic functions, and validate data quality rules
- Create data validation layer to check for nulls, duplicates, type mismatches using assertions or pandera schema validation
- Implement error handling with detailed logging at extraction, transform, and load stages — use Python's `logging` module with file handlers
- Build load function with upsert capability (INSERT ... ON DUPLICATE KEY UPDATE or MERGE) to handle incremental loads safely
- Add transaction rollback mechanism — wrap load operations in try/except with explicit rollback on constraint violations
- Schedule execution using cron jobs or Airflow DAGs with failure notifications and idempotency checks
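The upsert-with-rollback load described in steps 6 and 7 can be sketched with SQLAlchemy's ON CONFLICT support. SQLite is used here so the example is self-contained; PostgreSQL offers the same construct via `sqlalchemy.dialects.postgresql.insert`. The `orders` table and its columns are purely illustrative, not from the original:

```python
from sqlalchemy import Column, Float, Integer, MetaData, Table, create_engine, select
from sqlalchemy.dialects.sqlite import insert  # PostgreSQL: sqlalchemy.dialects.postgresql

engine = create_engine("sqlite://")  # in-memory database for the sketch
metadata = MetaData()
orders = Table(
    "orders", metadata,
    Column("id", Integer, primary_key=True),
    Column("amount", Float),
)
metadata.create_all(engine)

def upsert_batch(rows):
    """Insert rows, updating amount when the primary key already exists."""
    # engine.begin() commits on success and rolls back automatically on any exception,
    # which covers the transaction-rollback requirement from step 7
    with engine.begin() as conn:
        for row in rows:
            stmt = insert(orders).values(**row)
            stmt = stmt.on_conflict_do_update(
                index_elements=["id"],
                set_={"amount": stmt.excluded.amount},
            )
            conn.execute(stmt)

upsert_batch([{"id": 1, "amount": 10.0}])
upsert_batch([{"id": 1, "amount": 12.5}, {"id": 2, "amount": 3.0}])  # id 1 updated, id 2 inserted
```

Because the whole batch runs inside one `engine.begin()` block, a constraint violation on any row leaves the target table untouched.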
Code
```python
import pandas as pd
import logging
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError
import requests
from datetime import datetime
from typing import List, Dict, Any

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('/var/log/etl_pipeline.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)


class ETLPipeline:
    def __init__(self, source_url: str, db_connection: str, batch_size: int = 10000):
        self.source_url = source_url
        self.engine = create_engine(db_connection)
        self.batch_size = batch_size
        self.records_processed = 0
        self.records_failed = 0

    def extract(self) -> List[Dict[str, Any]]:
        """Extract data from source API with pagination and retry logic."""
        try:
            logger.info(f"Starting extraction from {self.source_url}")
            # ... (truncated in source)
```
Note: this example was truncated in the source. See the GitHub repo for the latest full version.
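One plausible continuation of the truncated extract stage, written here as a standalone function rather than the original method. This is a sketch, not the author's code: the `page`/`limit` parameter names and the JSON response shape are assumptions.

```python
import requests

def extract_pages(source_url, batch_size=10000, max_retries=3):
    """Paginated extraction with simple retry; stops when a page comes back empty."""
    records, page = [], 1
    while True:
        for attempt in range(max_retries):
            try:
                resp = requests.get(
                    source_url,
                    params={"page": page, "limit": batch_size},  # assumed parameter names
                    timeout=30,
                )
                resp.raise_for_status()
                break
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise  # all retries exhausted; let the caller's error handling log it
        batch = resp.json()
        if not batch:
            return records
        records.extend(batch)
        page += 1
```

Production code would typically add exponential backoff between retries and yield batches instead of accumulating everything in memory.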
Common Pitfalls
- Treating this skill as a one-shot solution — most workflows need iteration and verification
- Skipping the verification steps — you don't know it worked until you measure
- Applying this skill without understanding the underlying problem — read the related docs first
When NOT to Use This Skill
- When a simpler manual approach would take less than 10 minutes
- On critical production systems without testing in staging first
- When you don't have permission or authorization to make these changes
How to Verify It Worked
- Run the verification steps documented above
- Compare the output against your expected baseline
- Check logs for any warnings or errors — silent failures are the worst kind
Production Considerations
- Test in staging before deploying to production
- Have a rollback plan — every change should be reversible
- Monitor the affected systems for at least 24 hours after the change
Related Data & Analytics Skills
Other Claude Code skills in the same category — free to download.
CSV Parser
Parse and process CSV files
Data Transformer
Transform data between formats (JSON, XML, CSV)
Analytics Setup
Set up analytics tracking (GA4, Mixpanel, PostHog)
Data Pipeline
Create data processing pipeline
Report Generator
Generate reports from data
Chart Creator
Create charts and visualizations (Chart.js, D3)
Data Exporter
Export data in multiple formats
Data Validator
Validate data integrity and format