
Databricks Auto Loader


Ingest data incrementally with Auto Loader and cloud storage

Works with OpenClaude

You are a Databricks data engineer. The user wants to set up incremental data ingestion using Databricks Auto Loader to monitor cloud storage and process new files automatically.

What to check first

  • Schema inference is enabled by default; its file sample size can be tuned with the spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles config if needed
  • Confirm cloud storage credentials are configured in your Databricks workspace (S3, ADLS Gen2, or GCS)
  • Confirm the cluster runs Databricks Runtime 7.3 or later — Auto Loader (the cloudFiles source) ships with the runtime, so there is no separate library to install
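The runtime check above can be scripted. A minimal sketch — the helper is hypothetical, and the spark.databricks.clusterUsageTags.sparkVersion conf key is the usual way to read the runtime version on a Databricks cluster:

```python
def meets_min_runtime(version_str: str, min_major: int, min_minor: int) -> bool:
    """Parse a Databricks Runtime version string like '13.3.x-scala2.12'
    and check it against a minimum (Auto Loader needs DBR 7.3+)."""
    release = version_str.split("-")[0]           # e.g. '13.3.x'
    major, minor = release.split(".")[:2]
    return (int(major), int(minor)) >= (min_major, min_minor)

# On a Databricks cluster:
# dbr = spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
# assert meets_min_runtime(dbr, 7, 3), "Auto Loader requires DBR 7.3+"

print(meets_min_runtime("13.3.x-scala2.12", 7, 3))  # True
print(meets_min_runtime("6.4.x-scala2.11", 7, 3))   # False
```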

Steps

  1. Define the cloud storage path as a variable: cloud_path = "s3://my-bucket/raw-data/" (use abfss:// for ADLS, gs:// for GCS)
  2. Create a target Delta table or use an external location where Auto Loader will write processed data
  3. Use spark.readStream.format("cloudFiles") with the cloudFiles.format option set to your file type (csv, json, parquet)
  4. Set cloudFiles.schemaLocation to a checkpoint directory where Auto Loader stores schema metadata and file tracking
  5. Add cloudFiles.schemaEvolutionMode set to "addNewColumns" or "rescue" to handle schema changes gracefully
  6. Call .option("cloudFiles.useNotifications", True) to enable file-notification mode, which scales better than directory listing on large buckets (it requires permissions to create queue/notification resources in your cloud account)
  7. Keep .option("cloudFiles.validateOptions", True) (the default) so Auto Loader rejects misspelled or unsupported cloudFiles options — it validates the options themselves, not the source files
  8. Chain .writeStream.format("delta").outputMode("append").option("checkpointLocation", "/path/to/checkpoint") and .start(path_to_target_table) (or .toTable(...) for a Unity Catalog table)

Code

from pyspark.sql.functions import col, current_timestamp

# Define cloud storage paths
source_path = "s3://my-bucket/raw-data/"
checkpoint_path = "/mnt/checkpoints/auto-loader-checkpoint"
schema_path = "/mnt/checkpoints/schema-location"
target_table = "catalog.schema.raw_events"

# Read streaming data with Auto Loader
df = (spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.useNotifications", True)   # file-notification mode; needs cloud queue permissions
    .option("cloudFiles.validateOptions", True)    # default: reject unknown cloudFiles options
    .option("cloudFiles.maxFileAge", "7d")         # ignore files older than 7 days (also expires tracking state)
    .load(source_path)
)

# Add ingestion metadata (add dropDuplicates on your business key if the feed can redeliver files)
processed_df = (df
    .withColumn("_ingestion_time", current_timestamp())
    .withColumn("_source_file", col("_metadata.file_path"))  # file metadata column, DBR 10.5+
)

# Write to the target Delta table (step 8 above)
query = (processed_df
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path)
    .toTable(target_table)
)
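Once the stream is running, query.lastProgress (on the handle returned by .start() or .toTable()) reports per-batch throughput and the Auto Loader backlog. A small helper, shown here against a plain dict so the payload shape is visible — numFilesOutstanding is the backlog metric Auto Loader reports in its source metrics, and the helper itself is just an illustration:

```python
def summarize_progress(progress: dict) -> str:
    """Summarize one StreamingQueryProgress payload (e.g. query.lastProgress)."""
    if not progress:
        return "no batches processed yet"
    rows = progress.get("numInputRows", 0)
    batch = progress.get("batchId", "?")
    # Auto Loader sources expose backlog metrics such as numFilesOutstanding
    outstanding = [
        src.get("metrics", {}).get("numFilesOutstanding", "?")
        for src in progress.get("sources", [])
    ]
    return f"batch {batch}: {rows} rows in, files outstanding: {outstanding}"

# On Databricks: print(summarize_progress(query.lastProgress))
sample = {"batchId": 4, "numInputRows": 1200,
          "sources": [{"metrics": {"numFilesOutstanding": "3"}}]}
print(summarize_progress(sample))  # batch 4: 1200 rows in, files outstanding: ['3']
```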

Common Pitfalls

  • Reusing one checkpoint or schemaLocation across streams — every Auto Loader stream needs its own, and deleting a checkpoint forces a full re-ingest of the source
  • Skipping the verification steps — you don't know ingestion worked until you count rows and inspect _rescued_data
  • Applying this skill without understanding the underlying problem — read the Auto Loader docs for your cloud provider first

When NOT to Use This Skill

  • When a simpler manual approach would take less than 10 minutes
  • On critical production systems without testing in staging first
  • When you don't have permission or authorization to make these changes

How to Verify It Worked

  • Run the verification steps documented above
  • Compare the output against your expected baseline
  • Check logs for any warnings or errors — silent failures are the worst kind
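One concrete check worth automating: with rescued data enabled, rows Auto Loader could not parse cleanly carry a non-null _rescued_data column, so a near-zero count there is a good health signal. A sketch — the helper name is made up, and the table name is the placeholder from the example above:

```python
def rescued_rows_query(table: str) -> str:
    """SQL to count rows where Auto Loader rescued unparseable data."""
    return (f"SELECT count(*) AS bad_rows FROM {table} "
            "WHERE _rescued_data IS NOT NULL")

# On Databricks: spark.sql(rescued_rows_query("catalog.schema.raw_events")).show()
print(rescued_rows_query("catalog.schema.raw_events"))
```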

Production Considerations

  • Test in staging before deploying to production
  • Have a rollback plan — every change should be reversible
  • Monitor the affected systems for at least 24 hours after the change
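A production-friendly pattern is to run Auto Loader as a scheduled batch job instead of an always-on stream: trigger(availableNow=True) drains whatever arrived since the last run and then exits, which keeps costs predictable and gives a natural monitoring window. A hedged sketch — the function is hypothetical, and availableNow needs Spark 3.3+ / a recent Databricks Runtime:

```python
def run_ingest(spark, source_path, schema_path, checkpoint_path, target_table):
    """Process every file that arrived since the last run, then stop."""
    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", schema_path)
          .load(source_path))
    return (df.writeStream
            .format("delta")
            .option("checkpointLocation", checkpoint_path)
            .trigger(availableNow=True)   # drain the backlog, then exit
            .toTable(target_table))

# On Databricks, e.g. from a scheduled job:
# query = run_ingest(spark, source_path, schema_path, checkpoint_path, target_table)
# query.awaitTermination()  # returns once the backlog is drained
```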

Quick Info

Category: Databricks
Difficulty: intermediate
Version: 1.0.0
Author: Claude Skills Hub
Tags: databricks, autoloader, ingestion

Install command:

curl -o ~/.claude/skills/databricks-autoloader.md https://clskills.in/skills/databricks/databricks-autoloader.md
