Speeding up Databricks SQL queries
Retrieving data from a data warehouse is a common operation for any data scientist. In August 2021, Databricks published a blog post describing how it achieved high-bandwidth connectivity with BI tools. The post introduced Cloud Fetch, reporting a 12x speedup in an experiment on a dataset with 4M rows and 20 columns, achieved mainly by running downloads in parallel. When I read this, I immediately dove head-first into the rabbit hole, hoping to reduce the time between running a SQL query and having the result in a pandas DataFrame. This blog post details how I achieved a significant speedup for our Databricks queries.