We’re seeing more and more interest in, and adoption of, the Delta Lake format in large-scale data projects at Catapult Systems. One question we are often asked is how data scientists who may not be working in Databricks itself can access the data stored in a Databricks Delta Lake. Although there is the open source “deltalake” package in Python, that package requires PySpark for data access, adding complexity that a data scientist may not want.
Recently, Databricks released a new Python package called “databricks-sql-connector”. This package uses Databricks SQL endpoints, recently promoted to general availability, to query tables in a Delta Lakehouse with native SQL syntax, directly from Python.
In this guide, we will explain how to use this package in your own Python projects. We will assume that you have access to:
- An Azure Databricks workspace
- A SQL endpoint in your Azure Databricks workspace connected to a Delta Lake
- A Delta table that has been defined within your Databricks workspace
Step 1: Get Connection Data for the Databricks SQL Endpoint
Navigate to the SQL view in your Databricks workspace, and select SQL endpoints from the left-hand menu:
This will bring up a list of the SQL endpoints that are available to you. Click on the desired endpoint, and then click on “Connection details”. You will need to copy the “Server hostname” and the “HTTP path” fields:
Step 2: Create a Databricks Personal Access Token (PAT)
In the left-hand menu, navigate to Settings -> User Settings -> Personal Access Tokens. Click on “Generate new token” and follow the instructions.
Some important notes:
- All PATs expire. If you run production workloads using this connection, you will need a key rotation process to account for that.
- Your token will only be available to copy at creation time, so copy it before closing the window that it is shown in.
- We strongly recommend you store this token in a secure, encrypted location such as Azure Key Vault.
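As a minimal sketch of keeping the token out of your source code, it can be read from an environment variable at runtime. The variable name DATABRICKS_PAT below is an assumption for illustration; in production you would populate it from a secret store such as Azure Key Vault rather than a plain shell export:

```python
import os

def get_databricks_pat(var_name: str = "DATABRICKS_PAT") -> str:
    """Read the PAT from an environment variable rather than hard-coding it.

    "DATABRICKS_PAT" is an assumed name; in production, populate this from
    a secret store such as Azure Key Vault.
    """
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"{var_name} is not set; export it before running.")
    return token
```

Failing fast when the variable is missing gives a clearer error than a cryptic authentication failure later on.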
Step 3: Install the Databricks SQL Python Package
In your compute environment (for example, Azure Machine Learning Studio), bring up a terminal. Navigate to the appropriate virtual environment and install the Databricks SQL connector for Python. You must be running Python 3.7 or higher for this to work:
pip install databricks-sql-connector
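To confirm your interpreter meets the 3.7 requirement before installing, a quick sanity check:

```python
import sys

# The connector requires Python 3.7 or higher; fail fast if the
# interpreter running in this environment is older than that.
assert sys.version_info >= (3, 7), f"Python 3.7+ required, found {sys.version}"
```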
Step 4: Execute the SQL query
Sample Python code to execute a query against a table in the Delta Lake is below. In this example, we load the results into a pandas DataFrame, but this can be adapted to specific project needs.
from databricks import sql as dbsql
import pandas as pd

# Configuration variables - from your key vault and Databricks
databricks_pat = "databricks_pat"
server_hostname = "server_hostname"
http_path = "http_path"

# Define the connection to the SQL endpoint
connection = dbsql.connect(
    server_hostname=server_hostname,
    http_path=http_path,
    access_token=databricks_pat,
)

# Represent the query as a string
sql_query = "SELECT * FROM database_name.table_name LIMIT 1"

# Execute the query
with connection.cursor() as cursor:
    # Get the data
    cursor.execute(sql_query)
    data = cursor.fetchall()

    # Get the column names (SHOW COLUMNS returns one row per column,
    # so keep the first field of each row)
    cursor.execute("SHOW COLUMNS IN database_name.table_name")
    columns = [row[0] for row in cursor.fetchall()]

df = pd.DataFrame(data=data, columns=columns)
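Since fetchall() returns a list of row tuples, the final DataFrame-building step can be exercised offline with stand-in values. The rows and column names below are made up for illustration; against a real endpoint they would come from the two cursor calls above:

```python
import pandas as pd

# Stand-in for what cursor.fetchall() might return for the SELECT query;
# these values are invented for illustration only.
data = [(1, "widget", 9.99)]

# Stand-in for the SHOW COLUMNS result: one row per column, with the
# column name as the first field of each row.
column_rows = [("id",), ("name",), ("price",)]
columns = [row[0] for row in column_rows]

df = pd.DataFrame(data=data, columns=columns)
```

Unpacking the first field of each SHOW COLUMNS row matters: passing the raw row tuples straight to pandas would give you tuple-valued column labels instead of plain strings.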