In my career I've focused mostly on applying what is now called 'traditional machine learning': regression, classification, time series, anomaly detection and clustering algorithms. You could frame machine learning as applying an algorithmic 'constrained intelligence' to a specific business problem. The challenge has always been to 'unconstrain the intelligence' (e.g. by tuning hyperparameters) and to further specify the business problem (proper target definition, clean data, proper cross-validation schemes). The advent of large language models is starting to flip the equation: from 'unconstraining' intelligence to 'constraining' it instead.
scikit-learn has this nice feature where you can display an interactive visualization of a pipeline.
This post shows how to insert interactive diagrams into your MkDocs documentation, which is great for documenting your machine learning projects.
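To give a flavour, here's a minimal sketch of the idea: export scikit-learn's interactive HTML representation of a pipeline and drop the resulting file into your documentation. The pipeline steps, the `pipeline.html` file name and the `docs/` location are just illustrative.

```python
# Build a simple pipeline and export scikit-learn's interactive HTML
# representation so it can be embedded in a MkDocs page.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.utils import estimator_html_repr

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# In a notebook, sklearn.set_config(display="diagram") renders this
# interactively; here we write the same HTML to disk for the docs.
with open("docs/pipeline.html", "w", encoding="utf-8") as f:
    f.write(estimator_html_repr(pipe))
```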
Retrieving data from a data warehouse is a common operation for any data scientist. In August 2021, [Databricks] released a blog post describing how they achieved high-bandwidth connectivity with BI tools. In it, they introduced Cloud Fetch, promising a 12x experimental speedup on a dataset with 4M rows and 20 columns, achieved mainly by doing downloads in parallel. When I read this I immediately dove head-first into the rabbit hole, hoping to reduce the time from running a SQL query to having the result inside a pandas dataframe. This blog post details how I achieved a significant speedup for our Databricks queries.
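As a rough baseline (a sketch, not necessarily the final approach from the post), querying Databricks into pandas with the `databricks-sql-connector` package looks roughly like this. The hostname, HTTP path, token and table name are placeholders, and whether Cloud Fetch is actually used depends on the connector version and workspace configuration.

```python
# Run a SQL query against Databricks and load the result into pandas.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",        # placeholder
    http_path="<sql-endpoint-http-path>",      # placeholder
    access_token="<personal-access-token>",    # placeholder
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM my_schema.my_table")
        # fetchall_arrow() returns a pyarrow Table, which converts to pandas
        df = cursor.fetchall_arrow().to_pandas()

print(df.shape)
```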
If you have experience building machine learning models on tabular data, you will know that gradient-boosting algorithms like catboost, lightgbm and xgboost are almost always superior.
It's not for nothing that Bojan Tunguz (a quadruple Kaggle grandmaster employed by Nvidia) states:
... but aren't we all fooling ourselves?
The RandomForest algorithm was introduced by Breiman back in 2001 (paper). In 2022 it is still commonly used by many data scientists. The main difference from the original algorithm is that the current scikit-learn implementation combines classifiers by averaging their probabilistic predictions, instead of letting each classifier vote for a single class (source).
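A quick sketch to illustrate the point: averaging the per-tree class probabilities and taking the argmax gives the same predictions as the fitted forest itself ("soft voting" rather than a majority vote over hard class predictions).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Average the per-tree class probabilities and take the argmax ...
avg_proba = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
soft_vote = rf.classes_[np.argmax(avg_proba, axis=1)]

# ... which matches the forest's own predictions.
assert np.array_equal(soft_vote, rf.predict(X))
```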
In the post Using MkDocs for technical reporting I explained how MkDocs works and why it's a good choice for writing technical reports.
In this post I'll explain how to work with different MkDocs plugins to make your documentation more reproducible. I find this topic exciting because the combination of these plugins is especially powerful. That's also why I wrote multiple MkDocs plugins and contributed to many more, to make the workflow even smoother.
skorecard is an open-source Python package that provides scikit-learn-compatible tools for bucketing categorical and numerical features and for building traditional credit risk acceptance models (scorecards) on top of them. These tools have applications outside the context of scorecards, and this blog post will show you how to use them to potentially improve your own machine learning models.
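As a small taste (a sketch assuming skorecard's `DecisionTreeBucketer` API; check the package docs for the exact parameters), a bucketer can be dropped into a regular scikit-learn pipeline like any other transformer:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from skorecard.bucketers import DecisionTreeBucketer  # name taken from skorecard docs

X, y = make_classification(n_samples=1000, n_features=4, random_state=0)
X = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

# Bucket the numerical features, then fit a simple model on the bucket indices
pipeline = make_pipeline(
    DecisionTreeBucketer(max_n_bins=5),
    LogisticRegression(),
)
pipeline.fit(X, y)
print(pipeline.score(X, y))
```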
Take any statistics course and you'll hear about the central limit theorem. You might also have read about Bayes' theorem, which offers a different, more probabilistic approach. In this long post I'll show how they are related, explaining concepts such as linear regression along the way. I'll use math, history, code, examples and plots to show you why both theorems are still very relevant for modern data scientists.