Run dlt in Databricks Notebooks – No cluster restart

Clashing of dlt and Databricks

Introduction

Note: this article is about running dlt (data loading tool) in Databricks; if you are here looking for Delta Live Tables, you are in the wrong table.

Due to poor decisions, Databricks called the Python module for their Delta Live Tables dlt, but assuming that it would only be used within the perimeter of the Databricks platform, they didn’t reserve the name on PyPI. When an open source project came along and used the name dlt for their library, which is also data related but works everywhere, data engineers had to deal with the naming collision. Fun ensued.

The dltHub team provides a comprehensive Databricks notebook initialization guide—which is the recommended way to bootstrap a cluster with the DLT Hub client while still preserving native Delta Live Tables functionality in Python (DLT Hub Databricks notebook instructions).

Sometimes, though, you just want to quickly experiment with dlt in a notebook, without restarting the entire cluster or relying on a custom init.sh. This short guide shows you exactly how to:

  1. Install the dltHub dlt module
  2. Quickly restart the Python environment
  3. Define & run a simple dlt.pipeline to load data into DuckDB

1. Install the DLT Hub dlt Client (and DuckDB)

Run this single shell cell in your notebook to relocate Databricks’ built-in dlt package, then install the DLT Hub client and DuckDB from PyPI:

%sh

# 1) Rename the built-in DLT package so it won't clash
mv /databricks/spark/python/dlt/ /databricks/spark/python/dlt_dbricks
find /databricks/spark/python/dlt_dbricks/ -type f \
  -exec sed -i 's/from dlt/from dlt_dbricks/g' {} \;

# 2) Upgrade pip and install the DLT Hub client + DuckDB
pip install --upgrade pip
pip install dlt duckdb

Why this works: Renaming dlt to dlt_dbricks prevents import collisions, letting the PyPI-installed dlt client be the one you import in Python.

At this point you will need to restart your Python environment (not the whole cluster), which can be conveniently done with:

%restart_python

After the restart, dlt should be installed and ready to use.
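
To double-check that importing dlt now resolves to the PyPI package and not the renamed Databricks module, you can run a quick sanity check in a Python cell. A minimal sketch (the exact path on your cluster will differ):

import dlt

# Should point at the pip-installed package under site-packages,
# not /databricks/spark/python/dlt
print(dlt.__file__)

# The PyPI dlt exposes pipeline(); the Databricks Delta Live Tables module does not
print(hasattr(dlt, "pipeline"))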

2. Define & Run a Simple Pipeline

In a fresh Python cell, paste the following code to load a tiny country dataset into DuckDB:

import dlt
import duckdb

# 1. Create a DuckDB-backed pipeline
pipeline = dlt.pipeline(
    pipeline_name="test_dlt",
    destination="duckdb",
    dataset_name="country_data",
)

# 2. Sample data to load
data = [
    {'country': 'USA',    'population': 331_449_281, 'capital': 'Washington, D.C.'},
    {'country': 'Canada', 'population':  38_005_238, 'capital': 'Ottawa'},
    {'country': 'Germany','population':  83_019_200, 'capital': 'Berlin'}
]

# 3. Run the pipeline (replace any existing table)
info = pipeline.run(data, table_name="countries", write_disposition="replace")
print(info)

In the cell output you should see something like:

Pipeline dlt_db_ipykernel_launcher load step completed in 2.03 seconds
1 load package(s) were loaded to destination duckdb and into dataset country_data
The duckdb destination used duckdb:////Workspace/Users/YOUR_EMAIL_HERE/test_dlt.duckdb location to store data
Load package 1746787358.816344 is LOADED and contains no failed jobs

The duckdb file location is important for the next step.
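
If you prefer not to copy the path out of the log line, you can derive it from the pipeline object. This is a sketch that assumes the duckdb destination's default behaviour of creating a <pipeline_name>.duckdb file in the current working directory (no custom credentials configured):

import os

# Assumption: default duckdb naming, <pipeline_name>.duckdb in the working directory
db_path = os.path.join(os.getcwd(), f"{pipeline.pipeline_name}.duckdb")
print(db_path)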

3. Checking the results

Your data should now be loaded, but where? If you keep your Workspace panel open, you should see a new duckdb file there.

You can peek inside the duckdb database with:

conn = duckdb.connect("/Workspace/Users/YOUR_EMAIL_HERE/test_dlt.duckdb")
conn.sql("SELECT * FROM country_data.countries").show()
conn.close()

After running the cell you should see your table contents:

┌─────────┬────────────┬──────────────────┬───────────────────┬────────────────┐
│ country │ population │     capital      │   _dlt_load_id    │    _dlt_id     │
├─────────┼────────────┼──────────────────┼───────────────────┼────────────────┤
│ USA     │  331449281 │ Washington, D.C. │ 1746787358.816344 │ jyV/+RNQCJceqg │
│ Canada  │   38005238 │ Ottawa           │ 1746787358.816344 │ IbjLpEgUxHGmDQ │
│ Germany │   83019200 │ Berlin           │ 1746787358.816344 │ MvqHOjfP4EHtlw │
└─────────┴────────────┴──────────────────┴───────────────────┴────────────────┘
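
If you would rather not hard-code the file path at all, you can also query the destination through the pipeline's SQL client. A minimal sketch, assuming the pipeline object from step 2 is still defined in the notebook session:

# The SQL client is scoped to the pipeline's dataset (country_data),
# so the bare table name should be enough.
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT country, population, capital FROM countries")
    for row in rows:
        print(row)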

Conclusion & Next Steps

You now have a fast, repeatable recipe for using dlt within a Databricks notebook. As mentioned at the start, for production workloads it is better to use the init script provided by the dltHub team (which also lets you keep using Delta Live Tables).

But is this really the best solution? The reality is that dlt will run on a single node, so you will probably be overpaying for the compute you use to run your Python script.

If you want to discuss alternatives and the best setup to run dlt alongside Databricks, feel free to reach out to us. Whether you are exploring init-script automation, dedicated ETL clusters, or hybrid architectures, we will be happy to advise on the best fit for your use case.
