This is my personal learning note on the book, Python Data Science Cookbook.
Tip
Before working through the following example, we need a basic understanding of the principle of PCA:
- For the principle of PCA in English, see https://en.wikipedia.org/wiki/Principal_component_analysis
- For the principle of PCA in Chinese, see http://blog.codinglabs.org/articles/pca-tutorial.html
The dataset used in this example is described at https://archive.ics.uci.edu/ml/datasets/Iris
Let’s use the Iris dataset to understand how PCA can be used to reduce the dimensionality of a dataset. The Iris dataset contains measurements for 150 iris flowers from three different species. The three classes in the Iris dataset are as follows:
- Iris Setosa
- Iris Versicolor
- Iris Virginica
The following are the four features in the Iris dataset:
- The sepal length in cm
- The sepal width in cm
- The petal length in cm
- The petal width in cm
Can we use, say, two columns instead of all four columns to express most of the variation in the data? Can we reduce the number of columns from four to two and still achieve good accuracy for our classifier?
The steps of the PCA algorithm
If you understand the principle of PCA, the following steps are easy to follow:
- Standardize the dataset to have zero mean and unit standard deviation.
- Find the correlation matrix for the dataset.
- Decompose the correlation matrix into its Eigenvectors and Eigenvalues.
- Select the top n Eigenvectors based on the Eigenvalues sorted in descending order.
- Project the input data onto the new subspace formed by these Eigenvectors.
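Treat the following as a minimal sketch of the steps above, assuming NumPy, SciPy, and scikit-learn's bundled copy of the Iris dataset; it is my own reconstruction, so variable names and details may differ from the book's recipe.

```python
#!/usr/bin/env python
import numpy as np
import scipy.linalg as LA
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

# Load the Iris data: 150 samples, 4 features
data = load_iris()
x, y = data['data'], data['target']

# Step 1: standardize the features to zero mean and unit standard deviation
x_s = scale(x, with_mean=True, with_std=True, axis=0)

# Step 2: correlation matrix of the four features (a 4 x 4 matrix)
corr = np.corrcoef(x_s.T)

# Step 3: decompose the correlation matrix into Eigenvalues and Eigenvectors
eig_val, eig_vec = LA.eig(corr)
eig_val, eig_vec = eig_val.real, eig_vec.real   # the matrix is symmetric, so these are real
print("Eigen values \n%s" % eig_val)

# Step 4: sort by Eigenvalue in descending order and keep the top two Eigenvectors
order = np.argsort(eig_val)[::-1]
w = eig_vec[:, order[:2]]

# Step 5: project the standardized data onto the new two-dimensional subspace
x_reduced = x_s.dot(w)          # shape (150, 2)
```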
output
```
Eigen values
```
How many components/dimensions should we choose?
The following are two ways to select the number of components more empirically:
The Eigenvalue criterion:
An Eigenvalue of one means that the component explains about one variable’s worth of variability. So, according to this criterion, a component should explain at least one variable’s worth of variability, and we include only those components whose Eigenvalue is greater than or equal to one. You can set the threshold based on your dataset; in a very high-dimensional dataset, a component that explains only one variable’s worth of variability may not be very useful.
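A quick way to apply this criterion, assuming the `eig_val` array from the sketch above:

```python
# Eigenvalue criterion: keep only components that explain at least
# one variable's worth of variability (Eigenvalue >= 1).
n_keep = sum(1 for v in eig_val if v >= 1.0)
print("Components kept by the Eigenvalue criterion: %d" % n_keep)
# For the Iris data this keeps only the first component (Eigenvalue 2.91).
```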
The proportion of the variance explained criterion:
Let’s run the following code:
1 | print "Component, Eigen Value, % of Variance, Cummulative %" |
The output is as follows:
```
Component, Eigen Value, % of Variance, Cummulative %
```
For each component, we print the Eigenvalue, the percentage of the variance explained by that component, and the cumulative percentage of the variance explained. For example, component 1 has an Eigenvalue of 2.91; 2.91/4 gives the percentage of the variance explained, which is 72.80%. If we include the first two components, we can explain 95.80% of the variance in the data.
The decomposition of a correlation matrix into its Eigenvectors and Eigenvalues is a general technique that can be applied to any matrix. In this case, we apply it to a correlation matrix in order to understand the principal axes of the data distribution, that is, the axes along which the maximum variation in the data is observed.
A drawback of PCA
A drawback of PCA worth mentioning here is that it is a computationally expensive operation. Finally, a point about numpy’s corrcoef function: corrcoef standardizes the data internally as part of its calculation, but since we want to state the reason for scaling explicitly, we have included it in our recipe.
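A quick check of that point about `np.corrcoef`, as a small illustration (not from the book):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import scale

x = load_iris()['data']
# corrcoef standardizes each variable internally, so the correlation matrix
# of the raw data equals the correlation matrix of the explicitly scaled data.
print(np.allclose(np.corrcoef(x.T), np.corrcoef(scale(x).T)))   # True
```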
When would PCA work?
The input dataset should have correlated columns for PCA to work effectively. If the input variables are not correlated, PCA cannot help us.
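As a small illustration (not from the book): when the columns are uncorrelated, the correlation matrix is close to the identity, so every Eigenvalue is close to one and no component explains more than a single variable's worth of variability.

```python
import numpy as np

rng = np.random.RandomState(0)
uncorrelated = rng.randn(500, 4)          # four independent, essentially uncorrelated columns
eig_val = np.linalg.eigvalsh(np.corrcoef(uncorrelated.T))
print(eig_val)                            # all four Eigenvalues are close to 1
```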