Speeding up Databricks SQL queries
Retrieving data from a data warehouse is a common operation for any data scientist. In August 2021, Databricks released a blog post describing how they achieved high-bandwidth connectivity with BI tools. In it, they introduced Cloud Fetch, promising an experimental 12x speedup on a dataset with 4M rows and 20 columns, achieved mainly by doing downloads in parallel. When I read this I immediately dove head-first into the rabbit hole, hoping to reduce the time from running a SQL query to having the result inside a pandas dataframe. This blog post details how I achieved a significant speedup for our Databricks queries.
Update 2023: Since the databricks-sql-connector v2.8.0 release from July 2023, there is support for Cloud Fetch. Enabling it turned out to be the fastest method, and I have updated the benchmark and example code below.
The baseline
Our reference dataset is a sample of 2M rows from a table with 146 columns of mixed types. Initially I was using a basic setup of the Databricks SQL Connector. It was very easy to set up and worked great for smaller queries, but for larger queries it got slow quickly:
Method | Speed |
---|---|
Baseline | 6m57s |
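For context, my baseline looked roughly like the sketch below: the plain DB-API style of the Databricks SQL Connector with .fetchall(), followed by a manual conversion to pandas. The environment variable names and the table are placeholders, not my actual setup.

```python
import os

import pandas as pd
from databricks import sql

# Placeholder connection details, read from environment variables
connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_HOST"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
)
cursor = connection.cursor()
cursor.execute("select * from your_table")

# .fetchall() returns row objects; column names come from the cursor description
rows = cursor.fetchall()
df = pd.DataFrame(rows, columns=[col[0] for col in cursor.description])

cursor.close()
connection.close()
```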
I had already figured out one optimization while browsing the documentation: using .fetchall_arrow() (link) instead of .fetchall(). This “gets all (or all remaining) rows of a query, as a PyArrow table”. That helped a lot:
Method | Speed |
---|---|
Baseline | 6m57s |
.fetchall_arrow | 3m38s |
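The change itself is small; a sketch of the difference, using the same placeholder cursor as in the baseline sketch above:

```python
# Before: build the dataframe from Python row objects
df = pd.DataFrame(cursor.fetchall(), columns=[col[0] for col in cursor.description])

# After: fetch the result as a PyArrow table and convert it to pandas in one go
df = cursor.fetchall_arrow().to_pandas()
```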
Cloudfetch + Simba ODBC drivers
Hoping for blazing speeds, I set up the custom Databricks ‘Simba’ ODBC drivers as instructed. Getting the connection string exactly right together with Active Directory tokens took quite an effort, but once I got connected I ran the benchmark:
Method | Speed |
---|---|
Baseline | 6m57s |
.fetchall_arrow | 3m38s |
Cloudfetch | 4m24s |
Significantly slower! This was disappointing. I re-ran the benchmark at different times, but busy clusters could not explain the slower results. I had to dig deeper.
Reading everything I could find online about Cloud Fetch and the Databricks ODBC drivers, it seems you cannot see whether Cloud Fetch is actually enabled or working (update: you can now explicitly set the Cloud Fetch override on a cluster configuration). I did find a section stating Databricks automatically disables Cloud Fetch for S3 buckets that have versioning enabled. I checked with an infra engineer, and this was not the case for us. We tried running queries over a custom proxy to monitor traffic, and it did seem multiple connections were opened. The ODBC logs also showed the file connections.
So Cloud Fetch was likely working, but something else was going on.
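For reference, the connection string I ended up with looked roughly like the sketch below. The key names follow the Simba Spark ODBC driver documentation, but the host, HTTP path and token are placeholders, and your authentication mechanism may differ.

```python
import os

# Rough sketch of a Simba Spark ODBC connection string (all values are placeholders).
# AuthMech=11 with Auth_AccessToken is the token-based (e.g. Azure Active Directory) flow;
# AuthMech=3 with UID=token and PWD=<personal access token> is the simpler alternative.
connection_string = (
    "Driver={Simba Spark ODBC Driver};"
    f"Host={os.getenv('DATABRICKS_HOST')};"
    "Port=443;"
    f"HTTPPath={os.getenv('DATABRICKS_HTTP_PATH')};"
    "SSL=1;"
    "ThriftTransport=2;"
    "AuthMech=11;"
    "Auth_Flow=0;"
    f"Auth_AccessToken={os.getenv('AAD_ACCESS_TOKEN')};"
)
```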
Back to basics
I estimated the final pandas dataset size using pandas’ .memory_usage(deep=True) (link) to be ~1.5 GB. The benchmark timings translate to roughly 4.5 MB/s, while the Cloud Fetch blog post claims 500 MB/s.
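Assuming the query result is already in a dataframe df, the size estimate is a one-liner:

```python
# Total in-memory size, including object (string) columns
size_bytes = df.memory_usage(deep=True).sum()
print(f"{size_bytes / 1e9:.2f} GB")
```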
I ran a speedtest on my compute instance and confirmed bandwidth was not the problem (a comfortable 8000 MB/s down and 400 MB/s up).
Together with the infra engineer, we were not able to detect anything wrong with the Databricks / Cloud Fetch setup, so I tried something else.
arrow-odbc
Given that the first speedup in the Databricks SQL connector came from using Arrow tables, I searched around and found the arrow-odbc-py project. It “reads Apache Arrow batches from ODBC data sources in Python”.
Method | Speed |
---|---|
Baseline | 6m57s |
.fetchall_arrow | 3m38s |
Cloudfetch | 4m24s |
arrow-odbc | 1m25s |
That is a very nice speedup! This connector also lets you tweak the batch size, so as a proper data scientist I decided to run some more benchmarks and optimize the batch size parameter. I highly recommend the memo package for this kind of analysis. Tweaking the batch size helped, but the performance gains were not huge across data sizes.
For reference, this is some sample code for connecting via arrow-odbc:
```python
import pandas as pd
from arrow_odbc import read_arrow_batches_from_odbc


def read_sql(query: str) -> pd.DataFrame:
    reader = read_arrow_batches_from_odbc(
        query=query,
        connection_string=get_your_connection_string(),  # placeholder for your ODBC connection string
        batch_size=20_000,  # number of rows fetched per batch
    )
    if not reader:
        return None

    # Convert each Arrow batch to pandas and concatenate at the end
    dfs = []
    for arrow_batch in reader:
        dfs.append(arrow_batch.to_pandas(timestamp_as_object=True))
    if dfs:
        return pd.concat(dfs, ignore_index=True)
    return None
```
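The batch size benchmarks were essentially a sweep over a handful of values. The memo package gives you nicer bookkeeping, but a plain timing loop like the hypothetical sketch below (assuming a read_sql variant that accepts a batch_size argument) shows the idea:

```python
import time

# Hypothetical sweep over batch sizes; read_sql_with_batch_size is assumed to be a
# variant of the read_sql helper above that forwards batch_size to arrow-odbc.
results = []
for batch_size in [1_000, 5_000, 20_000, 100_000]:
    start = time.perf_counter()
    df = read_sql_with_batch_size("select * from your_table", batch_size=batch_size)
    elapsed = time.perf_counter() - start
    results.append({"batch_size": batch_size, "seconds": elapsed, "rows": len(df)})

for result in results:
    print(result)
```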
Turbodbc
Another project that should be mentioned is Turbodbc. It’s a python project which uses many optimizations (like arrow and batched queries) to offer superior performance over ‘vanilla’ ODBC connections.
Method | Speed |
---|---|
Baseline | 6m57s |
.fetchall_arrow | 3m38s |
Cloudfetch | 4m24s |
arrow-odbc | 1m25s |
turbodbc | 1m10s |
Using the memo package I tracked many tweaks to the settings, including using async I/O, strings_as_dictionary and adaptive_integers. The gains were minor, but it was still worth exploring the combinations.
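For reference, a sketch of what the Turbodbc variant can look like. The use_async_io option and the strings_as_dictionary / adaptive_integers flags on fetchallarrow() are part of the turbodbc API; the DSN is a placeholder for an ODBC data source configured with the Databricks Simba driver.

```python
import pandas as pd
from turbodbc import connect, make_options

# use_async_io overlaps fetching the next batch with converting the current one
options = make_options(use_async_io=True)

# "databricks" is a placeholder DSN configured for the Simba Spark ODBC driver
connection = connect(dsn="databricks", turbodbc_options=options)
cursor = connection.cursor()
cursor.execute("select * from your_table")

# fetchallarrow returns a PyArrow table; the two flags reduce memory and conversion cost
table = cursor.fetchallarrow(strings_as_dictionary=True, adaptive_integers=True)
df = table.to_pandas(timestamp_as_object=True)
```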
A downside of Turbodbc, however, is that you need additional software to compile the C++ code required for installation. The package is also available on conda, but installation was still less straightforward.
databricks-sql-connector
As of July 2023, the databricks-sql-connector v2.8.0 release supports Cloud Fetch. The option is not well documented, but here's a reference implementation:
```python
import os

import pandas as pd
from databricks import sql


def read_sql(query: str) -> pd.DataFrame:
    connection = sql.connect(
        server_hostname=os.getenv("DATABRICKS_HOST"),
        http_path=os.getenv("DATABRICKS_HTTP_PATH"),
        access_token=os.getenv("DATABRICKS_TOKEN"),
        use_cloud_fetch=True,  # <-- Make sure to specify this, as the default is False
    )
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        table = cursor.fetchall_arrow()  # PyArrow table
        df = table.to_pandas(timestamp_as_object=True)
    finally:
        cursor.close()
        connection.close()
    return df
```
This method is almost as fast as turbodbc:
Method | Speed |
---|---|
Baseline | 6m57s |
.fetchall_arrow | 3m38s |
Cloudfetch | 4m24s |
arrow-odbc | 1m25s |
databricks-sql-connector | 1m19s |
turbodbc | 1m10s |
In my benchmarks, the method is much faster for smaller queries as well. If I compare the time taken with the size of the final pandas dataframe, I measure speeds of ~80-150 MB/s (depending on the size of the query).
The final setup
Ease and reliability of installation are important: we need the connection to Databricks during various batch deployments, CI/CD builds, and in different development environments. So we decided not to go for turbodbc and instead opted for the much simpler to set up (and only slightly slower) databricks-sql-connector.
Conclusion
Investing some time in optimizing frequent and slow operations definitely pays off. In this case queries are >4x faster.
The rabbit hole is much deeper, however, and the potential for further speedups is still significant. For example, Databricks Delta tables use Parquet files under the hood, which means it might be possible to hook them up to DuckDB, which in turn has many optimizations for fetching data. And there’s the Apache Arrow Flight project, announced in February 2022, which aims to get rid of many intermediate steps and natively support columnar, batched data transfers.