
Databricks ETL Pipeline


Build medallion architecture ETL pipelines (bronze/silver/gold)


You are a Databricks Data Engineer. The user wants to build a medallion architecture ETL pipeline with bronze (raw), silver (cleaned), and gold (aggregated) layers.

What to check first

  • Run databricks workspace ls / to confirm workspace access and CLI configuration
  • Verify you have a running Databricks cluster on Spark 3.x; Unity Catalog is optional but recommended
  • Check that spark.sql.warehouse.dir is set in your cluster config for Delta Lake table locations (a quick notebook check follows this list)
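
To confirm the warehouse setting from a notebook attached to a running cluster, a one-line check using the standard spark.conf accessor:

# Prints the configured warehouse directory, or a fallback if unset
print(spark.conf.get("spark.sql.warehouse.dir", "not set"))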

Steps

  1. Create a bronze layer by reading raw data (CSV, JSON, Parquet) with spark.read and writing to a Delta table with mode("overwrite") and option("path", "/mnt/bronze/...")
  2. Add metadata columns (_loaded_at, _source_file) using current_timestamp() and input_file_name() via withColumn()
  3. Build a silver layer by reading the bronze Delta table, applying transformations (deduplication, data quality checks, column renames) with dropDuplicates() and filter(), then writing to /mnt/silver/ (a silver sketch follows the Code example)
  4. Implement quality gates with a custom helper (e.g., an assert_valid_records() function or UDF) that counts failed rows and raises an exception if a threshold is exceeded
  5. Create a gold layer by reading silver tables, aggregating with groupBy() and agg(), joining related datasets, and writing to /mnt/gold/ with mode("overwrite") (a gold sketch follows the Code example)
  6. Define the pipeline as a Databricks Job with three tasks (bronze → silver → gold) linked by task dependencies, each running on a shared job cluster (a payload sketch follows this list)
  7. Add error handling with try/except blocks and log transformation metrics (rows processed, failures, duration) to a control table
  8. Schedule the job for 2 AM daily via the Jobs UI or API; note that the API expects Quartz cron syntax (e.g., "0 0 2 * * ?"), not the five-field "0 2 * * *" form
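
Steps 6 and 8 are orchestration rather than transformation code. A rough sketch of the job definition, using field names from the Databricks Jobs 2.1 JSON API (the job name, notebook paths, node type, and cluster size are illustrative assumptions):

# Hypothetical Jobs 2.1 payload: three chained tasks on a shared job cluster,
# scheduled daily at 2 AM (Databricks expects Quartz cron syntax)
job_payload = {
    "name": "medallion-etl",
    "job_clusters": [{
        "job_cluster_key": "shared",
        "new_cluster": {"spark_version": "14.3.x-scala2.12",
                        "node_type_id": "i3.xlarge", "num_workers": 2},
    }],
    "tasks": [
        {"task_key": "bronze", "job_cluster_key": "shared",
         "notebook_task": {"notebook_path": "/Repos/etl/bronze"}},
        {"task_key": "silver", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "bronze"}],
         "notebook_task": {"notebook_path": "/Repos/etl/silver"}},
        {"task_key": "gold", "job_cluster_key": "shared",
         "depends_on": [{"task_key": "silver"}],
         "notebook_task": {"notebook_path": "/Repos/etl/gold"}},
    ],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

Posting this payload to /api/2.1/jobs/create (or recreating it in the Jobs UI) gives the bronze → silver → gold dependency chain from step 6.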

Code

from pyspark.sql.functions import current_timestamp, input_file_name, col, to_date, row_number, monotonically_increasing_id, sum as spark_sum
from pyspark.sql.window import Window
from datetime import datetime

# ===== BRONZE LAYER =====
def load_bronze(source_path: str, table_name: str, catalog: str = "main", schema: str = "default"):
    """Load raw data into bronze layer"""
    try:
        df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(source_path)
        
        # Add metadata columns
        df_bronze = df.withColumn("_loaded_at", current_timestamp()) \
                      .withColumn("_source_file", input_file_name()) \
                      .withColumn("_bronze_id", monotonically_increasing_id())  # unique but non-consecutive ids; a row_number() over the whole frame would funnel all data through one partition
        
        # Write to Delta table
        bronze_table = f"{catalog}.{schema}.{table_name}_bronze"
        df_bronze.write.format("delta") \
                 .mode("overwrite") \
                 .saveAsTable(bronze_table)  # reconstructed: the source snippet was truncated mid-statement
        return bronze_table
    except Exception as e:
        print(f"Bronze load failed for {source_path}: {e}")
        raise

Note: the source truncated this example after the write call began; the final write and error handling above are a minimal reconstruction. See the GitHub repo for the latest full version.
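
The silver layer (steps 3 and 4) is missing from the truncated source. A minimal sketch, reusing the imports above and assuming an id column for the quality rule (both the column name and the 5% threshold are illustrative):

# ===== SILVER LAYER (sketch) =====
def load_silver(table_name: str, catalog: str = "main", schema: str = "default",
                max_failed_ratio: float = 0.05):
    """Deduplicate and quality-check bronze data, then write the silver table."""
    bronze_table = f"{catalog}.{schema}.{table_name}_bronze"
    df = spark.read.table(bronze_table)

    # Deduplicate, then drop rows failing a basic quality rule (illustrative: non-null id)
    df_clean = df.dropDuplicates().filter(col("id").isNotNull())

    # Quality gate (step 4): fail the task if too many rows were rejected
    total, kept = df.count(), df_clean.count()
    if total > 0 and (total - kept) / total > max_failed_ratio:
        raise ValueError(f"Quality gate failed: {total - kept} of {total} rows rejected")

    silver_table = f"{catalog}.{schema}.{table_name}_silver"
    df_clean.write.format("delta").mode("overwrite").saveAsTable(silver_table)
    return silver_table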

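The gold layer (step 5) follows the same pattern. This sketch assumes silver rows carry region and amount columns, both illustrative:

# ===== GOLD LAYER (sketch) =====
def load_gold(table_name: str, catalog: str = "main", schema: str = "default"):
    """Aggregate silver data into a reporting table."""
    silver_table = f"{catalog}.{schema}.{table_name}_silver"
    df = spark.read.table(silver_table)

    # Illustrative aggregation: total amount per region
    df_gold = df.groupBy("region").agg(spark_sum("amount").alias("total_amount"))

    gold_table = f"{catalog}.{schema}.{table_name}_gold"
    df_gold.write.format("delta").mode("overwrite").saveAsTable(gold_table)
    return gold_table
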
Common Pitfalls

  • Treating this skill as a one-shot solution — most workflows need iteration and verification
  • Skipping the verification steps — you don't know it worked until you measure
  • Applying this skill without understanding the underlying problem — read the related docs first

When NOT to Use This Skill

  • When a simpler manual approach would take less than 10 minutes
  • On critical production systems without testing in staging first
  • When you don't have permission or authorization to make these changes

How to Verify It Worked

  • Run the verification steps documented above
  • Compare the output against your expected baseline (a row-count sketch follows this list)
  • Check logs for any warnings or errors — silent failures are the worst kind
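
A minimal smoke test, assuming the Unity Catalog naming from the Code section (the sales table name is illustrative), is to compare row counts across layers:

# Row counts per layer; silver exceeding bronze signals a dedup or join bug
for layer in ("bronze", "silver", "gold"):
    n = spark.read.table(f"main.default.sales_{layer}").count()
    print(f"{layer}: {n} rows")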

Production Considerations

  • Test in staging before deploying to production
  • Have a rollback plan — every change should be reversible
  • Monitor the affected systems for at least 24 hours after the change

Quick Info

Category: Databricks
Difficulty: intermediate
Version: 1.0.0
Author: Claude Skills Hub
Tags: databricks, etl, pipeline

Install command:

curl -o ~/.claude/skills/databricks-etl.md https://clskills.in/skills/databricks/databricks-etl.md
