Skip to content

2022

Speeding up databricks SQL queries

Retrieving data from a datawarehouse is a common operation for any data scientist. In August 2021 databricks released a blog post describing how [Databricks] achieved high-bandwidth connectivity with BI-tools. In it, they introduced cloud fetch, promising a 12x experimental speedup on a dataset with 4M rows and 20 columns, achieved mainly by doing downloads in parallel. When I read this I immediately dove head-first into the rabbit hole, hoping to reduce the time from running a SQL query to having it inside a pandas dataframe. This blogpost details the journey on how I achieved a significant speedup for our databricks queries.

Is XGBoost really all we need?

If you have experience building machine learning models on tabular data you will have experienced that gradient boosting based algorithms like catboost, lightgbm and xgboost are almost always superior.

It's not for nothing Bojan Tunguz (a quadruple kaggle grandmaster employed by Nvidia) states:

... but aren't we all fooling ourselves?

Adjusting the bootstrap in Random Forest

The RandomForest algorithm was introduced by Breiman back in 2001 (paper). In 2022 it is still a commonly used algorithm by many data scientists. The only difference is that the current scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class (source).

Reproducible Reports with MkDocs

In the post Using MkDocs for technical reporting I explained how MkDocs works and why it's a good choice for writing technical reports.

In this post I'll explain how to work with different MkDocs plugins to make your documentation more reproducible. I find the topic exciting as the combination of these plugins is especially powerful. That's also why I wrote multiple MkDocs plugins and contributed to many more to make the workflow even smoother.