Tim's blog

Speeding up Databricks SQL queries

Retrieving data from a data warehouse is a common operation for any data scientist. In August 2021 Databricks released a blog post describing how Databricks achieved high-bandwidth connectivity with BI tools. In it, they introduced Cloud Fetch, promising a 12x speedup in an experiment on a dataset with 4M rows and 20 columns, achieved mainly by downloading results in parallel. When I read this I immediately dove head-first into the rabbit hole, hoping to reduce the time between running a SQL query and having the result in a pandas DataFrame. This blog post details how I achieved a significant speedup for our Databricks queries.
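To make the idea of parallel downloads concrete, here is a minimal sketch using only the Python standard library. It is not the Databricks implementation: `download_chunk` is a hypothetical stand-in that simulates fetching one result file, and the chunk count and latency are made-up values for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download_chunk(chunk_id):
    """Hypothetical stand-in for downloading one result file from cloud storage."""
    time.sleep(0.05)  # simulated network latency (assumed value)
    return [(chunk_id, row) for row in range(3)]


def fetch_sequential(chunk_ids):
    """Download chunks one after another."""
    rows = []
    for chunk_id in chunk_ids:
        rows.extend(download_chunk(chunk_id))
    return rows


def fetch_parallel(chunk_ids, max_workers=8):
    """Download chunks concurrently; pool.map keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        chunks = list(pool.map(download_chunk, chunk_ids))
    return [row for chunk in chunks for row in chunk]


if __name__ == "__main__":
    ids = list(range(8))
    for fetch in (fetch_sequential, fetch_parallel):
        start = time.perf_counter()
        rows = fetch(ids)
        elapsed = time.perf_counter() - start
        print(f"{fetch.__name__}: {len(rows)} rows in {elapsed:.2f}s")
```

Because the work is I/O-bound, threads are enough here: the waiting time of the downloads overlaps, while the rows come back in the same order as in the sequential version.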

Is XGBoost really all we need?

If you have experience building machine learning models on tabular data, you will have noticed that gradient boosting algorithms like CatBoost, LightGBM and XGBoost are almost always superior.

It's not for nothing that Bojan Tunguz (a quadruple Kaggle grandmaster employed by NVIDIA) states:

... but aren't we all fooling ourselves?

Adjusting the bootstrap in Random Forest

The random forest algorithm was introduced by Breiman back in 2001 (paper). In 2022 it is still commonly used by many data scientists. The only difference from the original publication is that the current scikit-learn implementation combines classifiers by averaging their probabilistic predictions, instead of letting each classifier vote for a single class (source).
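The difference between voting and averaging is easy to see on a toy example. The sketch below is plain Python, not scikit-learn code, and the per-tree probabilities are made-up numbers chosen so that the two rules disagree.

```python
# Hypothetical class probabilities from 3 trees for a single sample (2 classes).
tree_probas = [
    [0.9, 0.1],  # tree 1 is very sure about class 0
    [0.4, 0.6],  # trees 2 and 3 lean slightly towards class 1
    [0.4, 0.6],
]


def hard_vote(probas):
    """Each tree votes for its most likely class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return max(set(votes), key=votes.count)


def soft_vote(probas):
    """Average the class probabilities over all trees, then take the argmax."""
    n_classes = len(probas[0])
    mean = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


if __name__ == "__main__":
    print("hard vote:", hard_vote(tree_probas))
    print("soft vote:", soft_vote(tree_probas))
```

With these numbers, majority voting picks class 1 (two votes against one), while averaging picks class 0, because the single confident tree outweighs the two hesitant ones.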

Reproducible Reports with MkDocs

In the post Using MkDocs for technical reporting I explained how MkDocs works and why it's a good choice for writing technical reports.

In this post I'll explain how to work with different MkDocs plugins to make your documentation more reproducible. I find the topic exciting as the combination of these plugins is especially powerful. That's also why I wrote multiple MkDocs plugins and contributed to many more to make the workflow even smoother.

Introducing Skorecard for building better logistic regression models

skorecard is an open source Python package that provides scikit-learn compatible tools for bucketing categorical and numerical features and building traditional credit risk acceptance models (scorecards) on top of them. These tools have applications outside the context of scorecards, and this blog post will show you how to use them to potentially improve your own machine learning models.

From Central Limit Theorem to Bayes's Theorem via Linear Regression

Take any statistics course and you'll hear about the central limit theorem. And you might have read about Bayes's theorem offering a different, more probabilistic method. In this long post I'll show how they are related, explaining concepts such as linear regression along the way. I'll use math, history, code, examples and plots to show you why both theorems are still very relevant for modern data scientists.
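As a quick reminder of what the central limit theorem says, here is a small simulation using only the standard library. The sample size, number of samples and seed are arbitrary choices for illustration: means of samples drawn from a uniform distribution cluster around 0.5, with a spread close to the theoretical sqrt(1/12) / sqrt(n).

```python
import math
import random
import statistics


def sample_means(n_samples, sample_size, seed=0):
    """Draw n_samples samples from Uniform(0, 1) and return the mean of each."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.random() for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


if __name__ == "__main__":
    sample_size = 30
    means = sample_means(n_samples=2000, sample_size=sample_size)
    # Uniform(0, 1) has mean 0.5 and variance 1/12.
    expected_sd = math.sqrt(1 / 12) / math.sqrt(sample_size)
    print(f"mean of sample means: {statistics.fmean(means):.4f}")
    print(f"sd of sample means:   {statistics.stdev(means):.4f}")
    print(f"theoretical sd:       {expected_sd:.4f}")
```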