Machine Learning students try more complex ML models when they wanna improve their results.
So they miss the elephant in the room 🐘 ↓
An ML model is like a cake with 2 main ingredients:
→ a dataset
→ an ML algorithm, for example linear regression or XGBoost.
And the thing is, no matter what algorithm you choose, the resulting ML model can only be as good as the dataset you used to train it.
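To make that concrete, here's a minimal sketch in Python (assuming scikit-learn; the dataset is a synthetic stand-in for your real data, and sklearn's gradient boosting stands in for XGBoost). Same recipe, two different algorithms:

# Same "cake": one dataset, two algorithms. Quality is capped by the data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy dataset standing in for your real (and far messier) data
X, y = make_regression(n_samples=1_000, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for algo in (LinearRegression(), GradientBoostingRegressor()):
    model = algo.fit(X_train, y_train)          # dataset + algorithm = ML model
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{type(algo).__name__}: MAE = {mae:.2f}")

Swap the algorithm as much as you want; the ceiling is still set by X and y.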
The problem is that in online courses and ML competitions, you work with a static dataset that someone else has generated for you.
In real-world projects, there is no dataset waiting for you.
Instead, you need to create it.
And this is the most critical step in the project.
Most ML problems in the real world are solved in a supervised manner, which means your dataset contains
→ a collection of features that serve as inputs to your model
→ a target metric you want to predict, aka the model output.
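In code, that's just a table with feature columns plus one target column. A minimal sketch with pandas (all column names are made up for illustration):

# A supervised dataset = feature columns (inputs) + one target column (output)
import pandas as pd

df = pd.DataFrame({
    "customer_age": [34, 51, 22, 45],
    "monthly_spend": [120.0, 80.5, 45.0, 210.0],
    "num_support_tickets": [0, 3, 1, 0],
    "churned": [0, 1, 1, 0],            # the target metric you want to predict
})

X = df.drop(columns=["churned"])        # features -> model inputs
y = df["churned"]                       # target   -> model output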
Useful features bring information and signal relevant to the target you want to predict.
Useless features are just noise, and add no value to your ML model, no matter how complex your algorithm is.
Adding a useful feature to your model is the best way to improve it.
Adding two useful features works even better.
And having 3 of them is a blessing.
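You can see the signal-vs-noise effect in a quick sketch with synthetic data and scikit-learn (nothing here is real project code):

# Sketch: a useful feature lifts the score, a noise feature doesn't.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2_000
useful = rng.normal(size=n)                     # correlated with the target
noise = rng.normal(size=n)                      # pure noise
y = (useful + rng.normal(scale=0.5, size=n) > 0).astype(int)

base = rng.normal(size=(n, 3))                  # existing features with no signal
candidates = {
    "baseline": base,
    "+ noise feature": np.column_stack([base, noise]),
    "+ useful feature": np.column_stack([base, useful]),
}
for name, X in candidates.items():
    score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
    print(f"{name}: accuracy = {score:.2f}")

The noise column leaves the score where it was; the useful column moves it a lot.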
To add new useful features, you need to
→ think beyond the data available right now in the data warehouse.
→ talk to senior colleagues who have context about the business.
→ think outside of the box you put yourself into after 2 weeks of working on the model.
You'll often find pieces of information relevant to the problem scattered across the company's IT systems, or even outside it with a third-party vendor, that can greatly help your model.
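In practice, pulling that information in often comes down to a join. A sketch with pandas, where the file paths, tables, and columns are all invented for illustration:

# Sketch: expanding your dataset by joining data from another system
import pandas as pd

transactions = pd.read_parquet("warehouse/transactions.parquet")   # what's already in the warehouse
crm_segments = pd.read_csv("exports/crm_customer_segments.csv")    # lives in a different system

# One join = a brand-new feature (customer_segment) on every row
dataset = transactions.merge(
    crm_segments[["customer_id", "customer_segment"]],
    on="customer_id",
    how="left",
)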
In conclusion,
→ in real-world ML, the dataset is not set in stone. YOU have the power to expand it.
→ adding useful features to your dataset is the best way to improve your model.
→ improving ML models in the real world is more about data engineering than fancy ML models.
Wanna get more real-world ML content?
Subscribe to my newsletter and get for FREE my eBook
"How to become a freelance data scientist"
which has specific advice to help you become a freelance data scientist
↓↓↓
datamachines.xyz
That's all for today folks.
I hope you find this content useful for your path 🥾⛰️
Wanna connect? ↓
Follow me @paulabartabajo_
Wanna help?
Like/Retweet the first tweet below to spread the wisdom
↓↓↓