Pau Labarta Bajo

@paulabartabajo_

11 Tweets 3 reads Dec 07, 2022
Tired of training lots of Machine Learning models, and not getting better results? 😵‍💫
This is how you solve this 🧠↓
A Machine Learning model is the output of a 3-step workflow where you:
1 → Fetch raw data, for example from an external database.
2 → Process the data into a tabular format, so you have N features and 1 target.
3 → Train ML models (e.g. XGBoost) and tune hyper-parameters.
If your ML model does not work, you have at least 1 of these 2 problems:
1 → The model is too simple to capture the patterns in the training data, and you need a more powerful model (step 3).
2 → The training data has no patterns, so no model will work (steps 1 and 2).
Problem #1: "How do I know if I need a more complex model?"
If a tabular dataset is solvable (aka there are patterns between the features and the target), a boosted-tree model (e.g. XGBoost) will find them.
If an XGBoost model does not work, the problem is not the model, but the data.
If you train a classifier and the data is very imbalanced, adjust one key hyper-parameter, `scale_pos_weight`, to help XGBoost focus on the least represented class.
Apart from that, it should work out of the box.
If it doesn't, then you have Problem 2.
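For the imbalanced case, here is a minimal sketch of how you might pick a starting value for `scale_pos_weight`. It assumes binary labels encoded as 0 (negative) and 1 (positive), and uses the negative-to-positive ratio that the XGBoost docs suggest as a typical value. The helper names `suggested_scale_pos_weight` and `build_classifier` are made up for illustration, not part of any library.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Return n_negative / n_positive, a common starting value (assumes 0/1 labels)."""
    counts = Counter(labels)
    n_pos, n_neg = counts[1], counts[0]
    if n_pos == 0:
        raise ValueError("no positive examples in labels")
    return n_neg / n_pos


def build_classifier(labels):
    """Hypothetical helper: an XGBoost classifier weighted for the rare class."""
    # Imported lazily so the ratio helper above works without xgboost installed.
    from xgboost import XGBClassifier

    return XGBClassifier(scale_pos_weight=suggested_scale_pos_weight(labels))


# Toy labels: 95 negatives, 5 positives.
y_train = [0] * 95 + [1] * 5
weight = suggested_scale_pos_weight(y_train)
print(weight)
```

Treat the ratio as a starting point and tune it like any other hyper-parameter.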
Problem #2: "Why are there no patterns in the training data?"
2 possible reasons:
1 → The problem is intrinsically very hard because the target is almost random (e.g. predict crypto prices). Not much you can do here.
2 → You missed predictive features. This is solvable 😉
How? 🤔
When you generate your training data, you typically write a long SQL query against an enterprise database that
- fetches,
- aggregates,
- and merges data
from many tables.
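A toy sketch of such a query, using Python's built-in `sqlite3` with an in-memory database. The tables (`customers`, `orders`), the columns, and the `churned` target are all invented for illustration; a real enterprise schema will look different and be much larger.

```python
import sqlite3


def build_training_table(conn):
    """Fetch, aggregate, and merge raw tables into one row per customer."""
    query = """
        SELECT
            c.customer_id,
            c.age,                                        -- feature from customers
            COUNT(o.order_id)            AS n_orders,     -- aggregated from orders
            COALESCE(SUM(o.amount), 0.0) AS total_spent,  -- aggregated from orders
            c.churned                    AS target
        FROM customers AS c
        LEFT JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.age, c.churned
        ORDER BY c.customer_id
    """
    return conn.execute(query).fetchall()


# Hypothetical toy data standing in for the enterprise database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, age INTEGER, churned INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 34, 0), (2, 51, 1);
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.5);
""")

rows = build_training_table(conn)
for row in rows:
    print(row)
```

Every table you forget to join here is a set of features your model never sees. In this sketch, dropping the `orders` join would leave `age` as the only feature.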
Enterprise databases are large collections of tables...
... and chances are, you are missing some important tables in your SQL query.
Ask around the team,
🧑‍🔬: "What features do you think are important for this problem?".
You often hear things you did not expect, which turn out to be pure gold for your ML project.
If these features are not yet in the database, talk to data engineers to see how they can be added.
Add them to the training data, and your models will start to work.
Voilà!
Wanna get more real-world Machine Learning tips and tricks?
Join my e-mail list and get precious advice right in your inbox
datamachines.xyz
Wanna become a professional Machine Learning engineer?
→ Follow me @paulabartabajo_
Wanna help?
Like/Retweet the first tweet below to spread the wisdom 🙏
