Querying Azure Databricks Delta Tables Directly Using Python | Quisitive

We’re seeing more and more interest in, and adoption of, the Delta Lake format in large-scale data projects at Catapult Systems. One question that we are often asked is how data scientists, who may not be working in Databricks itself, can access the data stored in a Databricks Delta Lake. Although there is the open source “deltalake” package in Python, the package requires usage of PySpark for data access, adding complexity where a data scientist may not want it.  

Recently, Databricks released a new Python package called “databricks-sql-connector”. This package makes use of SQL endpoints, recently promoted to general availability, in Databricks to call tables in a Delta Lakehouse using native SQL syntax, directly within Python. 

In this guide, we will explain how to use this package in your own Python projects. We will assume that you have access to: 

Step 1 – Get Connection Data for the Databricks SQL Endpoint 

Navigate to the SQL view in your Databricks workspace, and select SQL endpoints from the left-hand menu: 

Databricks SQL Menu

This will bring up a list of the SQL endpoints that are available to you. Click on the desired endpoint, and then click on “Connection details”. You will need to copy the “Server hostname” and the “HTTP path” fields:

SQL Endpoint Connection Details

Step 2: Create a Databricks Personal Access Token (PAT)

In the left-hand menu, navigate to Settings -> User Settings -> Personal Access Tokens. Click on “Generate new token” and follow the instructions.

Some important notes:

Step 3: Install the Databricks SQL Python Package

In your compute environment (for example in Azure Machine Learning Studio), bring up a terminal. Navigate to the appropriate virtual environment and install the Databricks SQL connector for Python. You must be running Python 3.7 or higher this to work:

pip install databricks-sql-connector

Step 4: Execute the SQL query

Sample Python code to execute a query against a table in the Delta Lake is below. In this example, we output results to a Pandas DataFrame, but this can be tweaked depending on specific project needs.

from databricks import sql as dbsql 
import pandas as pd 

# Configuration variables - from your keyvault and Databricks 
databricks_pat = "databricks_pat" 
server_hostname = "server_hostname" 
http_path = "http_path" 

# Define the connection to the SQL endpoint  
connection = dbsql.connect( 
server_hostname = server_hostname 
http_path = http_path 
access_token = databricks_pat ) 

# Represent the query as a string 
sql_query = "SELECT * FROM database_name.table_name LIMIT 1"  

# Execute the query  
with connection.cursor() as cursor:
# Get the data
     data = cursor.fetchall()  

# Get the column names    
     cursor.execute("SHOW COLUMNS IN database_name.table_name")     
     columns = cursor.fetchall()  

columns = [column[0] for column in columns]  
df = pd.DataFrame(data=data, columns=columns)