In my career I've focused mostly on applying what is now called 'traditional machine learning': regression, classification, time series, anomaly detection and clustering algorithms. You could frame machine learning as applying an algorithmic 'constrained intelligence' to a specific business problem. The challenge has always been to 'unconstrain the intelligence' (e.g. by tuning hyperparameters) and to further specify the business problem (proper target definition, clean data, proper cross-validation schemes). The advent of large language models is starting to flip the equation: from 'unconstraining' intelligence to 'constraining' it instead.
scikit-learn has this nice feature where you can display an interactive visualization of a pipeline.
This post shows how to insert interactive diagrams into your MkDocs documentation, which is great for documenting your machine learning projects.
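To give a flavour, here's a minimal sketch of the idea: export scikit-learn's interactive HTML representation of a pipeline and drop the resulting file into your documentation. The pipeline steps, the `pipeline.html` file name and the `docs/` location are just illustrative.

```python
# Build a simple pipeline and export scikit-learn's interactive HTML
# representation so it can be embedded in a MkDocs page.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.utils import estimator_html_repr

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# In a notebook, sklearn.set_config(display="diagram") renders this
# interactively; here we write the same HTML to disk for the docs.
with open("docs/pipeline.html", "w", encoding="utf-8") as f:
    f.write(estimator_html_repr(pipe))
```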
Retrieving data from a data warehouse is a common operation for any data scientist. In August 2021, [Databricks] released a blog post describing how they achieved high-bandwidth connectivity with BI tools. In it, they introduced Cloud Fetch, promising a 12x experimental speedup on a dataset with 4M rows and 20 columns, achieved mainly by doing downloads in parallel. When I read this I immediately dove head-first into the rabbit hole, hoping to reduce the time from running a SQL query to having the result inside a pandas dataframe. This blog post details how I achieved a significant speedup for our Databricks queries.
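As a rough baseline (a sketch, not necessarily the final approach from the post), querying Databricks into pandas with the `databricks-sql-connector` package looks roughly like this. The hostname, HTTP path, token and table name are placeholders, and whether Cloud Fetch is actually used depends on the connector version and workspace configuration.

```python
# Run a SQL query against Databricks and load the result into pandas.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",        # placeholder
    http_path="<sql-endpoint-http-path>",      # placeholder
    access_token="<personal-access-token>",    # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM my_schema.my_table")
        # fetchall_arrow() returns a pyarrow Table, which converts to pandas
        df = cursor.fetchall_arrow().to_pandas()

print(df.shape)
```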
If you have experience building machine learning models on tabular data, you will know that gradient-boosting algorithms like catboost, lightgbm and xgboost are almost always superior.
It's not for nothing that Bojan Tunguz (a quadruple Kaggle grandmaster employed by Nvidia) states:
... but aren't we all fooling ourselves?
The RandomForest algorithm was introduced by Breiman back in 2001 (paper). In 2022 it is still commonly used by many data scientists. The main difference from the original algorithm is that the current scikit-learn implementation combines classifiers by averaging their probabilistic predictions, instead of letting each classifier vote for a single class (source).
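A quick sketch to illustrate the point: averaging the per-tree class probabilities and taking the argmax gives the same predictions as the fitted forest itself ("soft voting" rather than a majority vote over hard class predictions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree class probabilities and take the argmax ...
avg_proba = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
soft_vote = rf.classes_[np.argmax(avg_proba, axis=1)]

# ... which matches the forest's own predictions.
assert np.array_equal(soft_vote, rf.predict(X))
```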
In the post Using MkDocs for technical reporting I explained how MkDocs works and why it's a good choice for writing technical reports.
In this post I'll explain how to work with different MkDocs plugins to make your documentation more reproducible. I find this topic exciting because the combination of these plugins is especially powerful. That's also why I wrote multiple MkDocs plugins and contributed to many more, to make the workflow even smoother.
skorecard is an open-source Python package that provides scikit-learn-compatible tools for bucketing categorical and numerical features and for building traditional credit risk acceptance models (scorecards) on top of them. These tools have applications outside the context of scorecards, and this blog post will show you how to use them to potentially improve your own machine learning models.
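As a small taste (a sketch assuming skorecard's `DecisionTreeBucketer` API; check the package docs for the exact parameters), a bucketer can be dropped into a regular scikit-learn pipeline like any other transformer:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from skorecard.bucketers import DecisionTreeBucketer  # name taken from skorecard docs

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

# Bucket the numerical features, then fit a simple model on the bucket indices
pipeline = make_pipeline(
    DecisionTreeBucketer(max_n_bins=5),
    LogisticRegression(),
)
pipeline.fit(X, y)
print(pipeline.score(X, y))
```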
Take any statistics course and you'll hear about the central limit theorem. You might also have read about Bayes' theorem, which offers a different, more probabilistic approach. In this long post I'll show how they are related, explaining concepts such as linear regression along the way. I'll use math, history, code, examples and plots to show you why both theorems are still very relevant for modern data scientists.