PCA's ability to reduce the dimensionality of a dataset motivates several other use cases.
Below are some:
◆ To visualize high-dimensional datasets, which are otherwise impractical to plot directly.
◆ To select the most useful features while discarding redundant or uninformative ones.
This is not guaranteed, though: sometimes useful information is lost as well, especially if the original data was already clean and contained little noise.
◆ To apply PCA before training a typical machine learning model purely to speed up training, since the reduced training data no longer contains redundant features.
A speed-up is not guaranteed, but in some cases it helps, as sketched below.
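A minimal sketch of that use case, assuming scikit-learn is installed; the synthetic dataset and the 95% variance threshold are illustrative choices, not from the thread.

```python
# Illustrative sketch: PCA before a classifier, chained in a pipeline.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: 1000 samples, 50 features, many of them redundant.
X, y = make_classification(n_samples=1000, n_features=50,
                           n_informative=10, random_state=0)

# PCA runs first and keeps enough components to explain 95% of the variance,
# so the classifier trains on fewer, less redundant features.
model = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=1000))
model.fit(X, y)
```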
In many ML resources, you will find PCA in the category of unsupervised learning algorithms.
Below is a simple reason 👇
PCA reduces the dimensionality of a dataset without any labels telling it how to do so; the only instruction it receives is the number of principal components to keep, much like specifying the number of clusters in KMeans clustering.
To reduce the dimensionality of the dataset, we have to specify the number of principal components.
Think of principal components as the new coordinates (axes) onto which we project the data, or as reduced features that hold most of the information in the dataset.
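A minimal sketch of specifying the number of components, assuming scikit-learn; the matrix X is just a random placeholder for whatever numeric feature matrix you are working with.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 5)   # hypothetical data: 100 samples, 5 features

pca = PCA(n_components=2)          # keep 2 principal components
X_reduced = pca.fit_transform(X)   # project the data onto those components

print(X_reduced.shape)             # (100, 2): same samples, only 2 features left
```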
The explained variance ratio in our case is [0.99809123, 0.00173592].
It means that about 99.8% of the dataset's variance lies along the first component, and the remaining ~0.17% lies along the second.
If you look back at the heatmap above, these ratios make sense against its y axis.
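For reference, here is how those numbers can be read; the array below simply hard-codes the values quoted above (after fitting, scikit-learn exposes them as the explained_variance_ratio_ attribute).

```python
import numpy as np

# The ratios quoted above; each entry is the fraction of total variance
# captured by one principal component.
explained = np.array([0.99809123, 0.00173592])

print(explained * 100)   # [99.809..., 0.173...] percent of the variance
print(explained.sum())   # ~0.9998: the two components retain nearly all of it
```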
As a bonus, let's also use PCA to visualize the digits dataset. It has 64 dimensions, since each digit image is 8×8 pixels.
We can use PCA to project those 64 dimensions down to 2 components, as sketched below.
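A minimal sketch of that projection, assuming scikit-learn's built-in digits dataset and matplotlib for the scatter plot:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                       # 1797 images, 64 features each (8x8 pixels)

pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)   # shape: (1797, 2)

# Scatter the two components, coloring each point by its digit label (0-9).
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, cmap='tab10', s=10, alpha=0.7)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.colorbar(label='Digit label')
plt.show()
```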
This is fantastic: we can visualize all 10 digits in a single plot, simply because we reduced their dimensionality from 64 to 2.
This is the end of the thread.
The thread was about PCA. There is a whole lot of math behind it, but much of the time, a high-level understanding like this is enough to make things work.
Here are the key takeaways:
PCA is a dimensionality reduction algorithm. It reduces the dimensions of a dataset while preserving as much information as possible in fewer components.
It can also be used to:
◆ Visualize large datasets
◆ Remove redundant features
◆ Speed up model training (in some cases) when applied to the input data before training.
Thank you for reading!
I am actively writing about machine learning techniques, concepts, and ideas.
You can support me by following @Jeande_d and sharing the first tweet with your friends who are interested in ML content.
More content to come 🙌🏻