Build medallion architecture ETL pipelines (bronze/silver/gold)
You are a Databricks Data Engineer. The user wants to build a medallion architecture ETL pipeline with bronze (raw), silver (cleaned), and gold (aggregated) layers.
What to check first
- Run `databricks workspace ls /` to confirm workspace access and CLI configuration
- Verify you have a Databricks cluster running with Spark 3.x and Unity Catalog enabled (optional but recommended)
- Check that `spark.sql.warehouse.dir` is set in your cluster config for Delta Lake table locations (a quick notebook check follows this list)
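A minimal notebook-side sanity check, assuming the built-in `spark` session that Databricks notebooks provide:

```python
# Pre-flight check from a Databricks notebook (uses the built-in `spark` session)
print("Spark version:", spark.version)
print("Warehouse dir:", spark.conf.get("spark.sql.warehouse.dir", "<not set>"))

# If Unity Catalog is enabled, this should list your catalogs without errors
spark.sql("SHOW CATALOGS").show()
```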
Steps
- Create a bronze layer by reading raw data (CSV, JSON, Parquet) with `spark.read` and writing to a Delta table with `mode("overwrite")` and `option("path", "/mnt/bronze/...")`
- Add metadata columns (`_loaded_at`, `_source_file`) using `current_timestamp()` and `input_file_name()` in the SELECT clause
- Build a silver layer by reading the bronze Delta table, applying transformations (deduplication, data quality checks, column renames) with `dropDuplicates()` and `filter()`, then write to `/mnt/silver/` (see the silver sketch after the Code section)
- Implement quality gates with `assert_valid_records()` or a custom UDF that counts failed rows and raises an exception if a threshold is exceeded
- Create a gold layer by reading silver tables, aggregating data with `groupBy()` and `agg()`, joining related datasets, and writing to `/mnt/gold/` with `mode("overwrite")` (see the gold sketch after the Code section)
- Define the pipeline as a Databricks Job with three tasks (bronze → silver → gold) using job dependencies, setting each task to use a shared cluster (a job-definition sketch follows this list)
- Add error handling with try/except blocks and log transformation metrics (rows processed, failures, duration) to a control table (see the metrics sketch after the Code section)
- Schedule the job to run daily at 2 AM in the Databricks Jobs API or UI; the standard cron form is `0 2 * * *`, and the Jobs API expects the equivalent Quartz expression `0 0 2 * * ?`
Code
```python
from pyspark.sql.functions import current_timestamp, input_file_name, col, to_date, monotonically_increasing_id, sum as spark_sum
from datetime import datetime

# ===== BRONZE LAYER =====
def load_bronze(source_path: str, table_name: str, catalog: str = "main", schema: str = "default"):
    """Load raw data into the bronze layer"""
    try:
        df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(source_path)
        # Add metadata columns (monotonically_increasing_id avoids the unpartitioned window a row_number surrogate key would need)
        df_bronze = df.withColumn("_loaded_at", current_timestamp()) \
            .withColumn("_source_file", input_file_name()) \
            .withColumn("_bronze_id", monotonically_increasing_id())
        # Write to the Delta table (overwrite mode, per the Steps section)
        bronze_table = f"{catalog}.{schema}.{table_name}_bronze"
        df_bronze.write.format("delta").mode("overwrite").saveAsTable(bronze_table)
        return bronze_table
    except Exception as e:
        print(f"Bronze load failed for {source_path}: {e}")
        raise
```
Note: this example was truncated in the source. See the GitHub repo for the latest full version.
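The truncated example only covers bronze. A silver step following the same pattern might look like the sketch below; the `id` business key, the null-check quality rule, and the 5% failure threshold are illustrative assumptions rather than part of the original example.

```python
from pyspark.sql.functions import col, to_date

# ===== SILVER LAYER (sketch) =====
def load_silver(table_name: str, catalog: str = "main", schema: str = "default",
                max_failed_ratio: float = 0.05):
    """Clean bronze data into the silver layer; key, rules, and threshold are illustrative."""
    bronze_table = f"{catalog}.{schema}.{table_name}_bronze"
    df = spark.table(bronze_table)  # `spark` is the session Databricks notebooks provide

    # Quality gate: treat rows with a null business key as failed, stop if too many fail
    total = df.count()
    failed = df.filter(col("id").isNull()).count()
    if total > 0 and failed / total > max_failed_ratio:
        raise ValueError(f"Quality gate failed: {failed}/{total} rows rejected")

    df_silver = (
        df.filter(col("id").isNotNull())
          .dropDuplicates(["id"])  # deduplicate on the business key
          .withColumn("processed_date", to_date(col("_loaded_at")))
    )

    silver_table = f"{catalog}.{schema}.{table_name}_silver"
    df_silver.write.format("delta").mode("overwrite").saveAsTable(silver_table)
    return silver_table
```

The same threshold check also covers the quality-gate step; a custom UDF is only needed when the validation logic cannot be expressed as a simple filter.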
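A gold step then aggregates the silver table. The grouping columns (`customer_id`, `processed_date`) and the `amount` measure below are hypothetical; substitute your own dimensions and measures.

```python
from pyspark.sql.functions import sum as spark_sum, count

# ===== GOLD LAYER (sketch) =====
def load_gold(table_name: str, catalog: str = "main", schema: str = "default"):
    """Aggregate silver data into a gold table; grouping columns and measures are hypothetical."""
    silver_table = f"{catalog}.{schema}.{table_name}_silver"
    df = spark.table(silver_table)

    # Hypothetical aggregation: daily totals per customer
    df_gold = (
        df.groupBy("customer_id", "processed_date")
          .agg(spark_sum("amount").alias("total_amount"),
               count("*").alias("record_count"))
    )

    gold_table = f"{catalog}.{schema}.{table_name}_gold_daily"
    df_gold.write.format("delta").mode("overwrite").saveAsTable(gold_table)
    return gold_table
```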
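For the control-table step, one option is to append a row of metrics per run to a small Delta table; the `etl_run_metrics` table name and its schema are assumptions for this sketch.

```python
from datetime import datetime

def log_run_metrics(layer: str, table_name: str, rows_processed: int, rows_failed: int,
                    started_at: datetime, catalog: str = "main", schema: str = "default"):
    """Append one row of run metrics to a control table (table name and schema are assumptions)."""
    finished_at = datetime.utcnow()
    metrics_df = spark.createDataFrame(
        [(layer, table_name, rows_processed, rows_failed,
          (finished_at - started_at).total_seconds(), finished_at)],
        "layer string, table_name string, rows_processed long, rows_failed long, "
        "duration_s double, logged_at timestamp",
    )
    metrics_df.write.format("delta").mode("append").saveAsTable(f"{catalog}.{schema}.etl_run_metrics")
```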
Common Pitfalls
- Treating this skill as a one-shot solution — most workflows need iteration and verification
- Skipping the verification steps — you don't know it worked until you measure
- Applying this skill without understanding the underlying problem — read the related docs first
When NOT to Use This Skill
- When a simpler manual approach would take less than 10 minutes
- On critical production systems without testing in staging first
- When you don't have permission or authorization to make these changes
How to Verify It Worked
- Run the verification steps documented above (a concrete check for this pipeline is sketched after this list)
- Compare the output against your expected baseline
- Check logs for any warnings or errors — silent failures are the worst kind
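As a concrete example for this pipeline, counting rows per layer and inspecting the Delta history catches most silent failures. The `sales` table names below are hypothetical and match the naming pattern used in the sketches above.

```python
# Row counts per layer for a hypothetical `sales` pipeline (silver should not exceed bronze)
for layer in ("bronze", "silver", "gold_daily"):
    print(layer, spark.table(f"main.default.sales_{layer}").count())

# Delta history shows the latest writes, their timestamps, and operation metrics
spark.sql("DESCRIBE HISTORY main.default.sales_silver LIMIT 5").show(truncate=False)
```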
Production Considerations
- Test in staging before deploying to production
- Have a rollback plan — every change should be reversible
- Monitor the affected systems for at least 24 hours after the change
Related Databricks Skills
Other Claude Code skills in the same category — free to download.
Databricks Notebook
Write PySpark and SQL notebooks with widgets and visualizations
Databricks Delta Lake
Build Delta Lake tables with ACID transactions, time travel, and optimization
Databricks Unity Catalog
Configure Unity Catalog for data governance, lineage, and access control
Databricks MLflow
Track experiments, register models, and deploy with MLflow
Databricks Auto Loader
Ingest data incrementally with Auto Loader and cloud storage
Databricks SQL Warehouse
Query and visualize data with Databricks SQL warehouses and dashboards
Databricks Workflows
Orchestrate multi-task jobs with Databricks Workflows