Nov 24, 2022
Comparing ๐ŸŽ to ๐ŸŽ; Comparing โ“ to โ“
How does the Decision Tree model choose the best questions?
Gini Impurity Index 🧵
Decision Tree - Day 2
If you missed the first 🧵 on Decision Trees, you can read it here 🔽
We know that for the best result, we need to ask great questions.
How do we know if a question is good?
We do measurements and comparisons.
In the case of Decision Trees, we usually use the Gini Impurity Index, or Gini Index for short.
Let's see how Gini Index works:
1/8
The Gini Index tells us how diverse a set is.
If a set contains mostly similar elements, it has a low Gini Index.
Let's consider two sets:
Set 1: 8 🍎, 2 🍌
Set 2: 4 🍎, 3 🍌, 2 🍉, 1 🍋
Set 1 seems more pure, meaning it has less diversity.
2/8
Set 1 contains mostly 🍎, while Set 2 has many different fruits.
We can see this difference, but we need to use numbers for easy comparison - we need a measurement.
Impurity is based on probability.
Let's see what is happening in the background.
3/8
What is the probability that 2 fruits picked at random from a set will be different?
- For Set 1, the probability is low, since it contains mainly 🍎
- For Set 2, the probability is higher, because it is more diverse.
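We can check this probability directly by counting pairs. A quick sketch (the function name is my own; it draws ordered pairs with replacement, which is what the Gini formula assumes):

```python
from itertools import product

def prob_two_random_picks_differ(items):
    """Fraction of ordered pairs (with replacement) whose elements differ."""
    pairs = list(product(items, repeat=2))
    return sum(a != b for a, b in pairs) / len(pairs)

set1 = ["apple"] * 8 + ["banana"] * 2
set2 = ["apple"] * 4 + ["banana"] * 3 + ["watermelon"] * 2 + ["lemon"]

print(prob_two_random_picks_differ(set1))  # 0.32 -> low
print(prob_two_random_picks_differ(set2))  # 0.7  -> higher
```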
4/8
Here is the Gini Index formula 🔽
(Note: I will not go deep into the probability calculations here)
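For reference, the Gini Index of a set is:

```latex
\mathrm{Gini} = 1 - \sum_{i=1}^{C} p_i^2
```

where p_i is the fraction of elements belonging to class i and C is the number of classes. This is exactly 1 minus the probability that two random picks (with replacement) land on the same class.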
5/8
For our sets the calculation looks like this ๐Ÿ”ฝ
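A minimal sketch of that calculation (helper name is my own, not from the thread):

```python
from collections import Counter

def gini(items):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    counts = Counter(items)
    total = len(items)
    return 1 - sum((c / total) ** 2 for c in counts.values())

set1 = ["apple"] * 8 + ["banana"] * 2
set2 = ["apple"] * 4 + ["banana"] * 3 + ["watermelon"] * 2 + ["lemon"]

print(gini(set1))  # 1 - (0.8^2 + 0.2^2)                       = 0.32
print(gini(set2))  # 1 - (0.4^2 + 0.3^2 + 0.2^2 + 0.1^2)       = 0.70
```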
The Index is lower for Set 1.
As we expected, Set 1 is more pure.
6/8
But how do we use the Gini Index to decide which is the best question?
1. For each candidate question, calculate the Gini Index of every leaf it creates
2. Take the average of the leaves, weighted by how many elements land in each, to get the Gini Index of the split
3. Pick the question with the lowest Gini Index
Lower Gini Index = Better split!
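The steps above can be sketched like this (the two candidate "questions" and all helper names are my own illustration, not from the thread):

```python
from collections import Counter

def gini(items):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    counts = Counter(items)
    return 1 - sum((c / len(items)) ** 2 for c in counts.values())

def split_gini(left, right):
    """Average of the two leaves' Gini Indexes, weighted by leaf size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Each candidate question splits the fruits into two leaves.
# Question A separates the classes cleanly; Question B mixes them.
question_a = (["apple"] * 8, ["banana"] * 2)
question_b = (["apple"] * 5 + ["banana"], ["apple"] * 3 + ["banana"])

best = min([question_a, question_b], key=lambda leaves: split_gini(*leaves))
# Question A wins: both of its leaves are pure, so its split Gini is 0.
```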
7/8
That's it for today.
I hope you've found this thread helpful.
Like/Retweet the first tweet below for support and follow @levikul09 for more Data Science threads.
Thanks 😉
8/8
