Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a dimensionality reduction technique and an unsupervised learning algorithm. Dimensionality refers to the number of input features (columns) in a dataset. PCA reduces the number of input features by combining them into new features, while preserving as much information as possible. By default, PCA produces as many new features (components) as there are input features, and we then choose how many to keep (see the sketch after the list below). The reasons for using this technique are -
- Provide a smaller set of input features to the model, after removing unwanted columns and columns that have no effect on the output
- Group columns that are redundant, highly correlated, or that capture the same underlying concept. Keeping such extra columns leads to overfitting and unnecessary complexity.
- Regularize the model. PCA does lose some information while reducing the features, but it can make the model simpler to understand and can improve validation accuracy.
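A minimal sketch of the component-count behaviour, using scikit-learn on made-up random data (the array shapes and seed are purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # 100 rows, 4 input features

pca = PCA()                    # no n_components -> keep all components
pca.fit(X)
print(pca.n_components_)       # 4, same as the number of input features

pca2 = PCA(n_components=2)     # keep only the first two components
X_reduced = pca2.fit_transform(X)
print(X_reduced.shape)         # (100, 2)
```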
PCA is computed from the covariance matrix of the data. The first component accounts for the highest variance in the data, the second component (which is orthogonal to the first) accounts for the second-highest variance, and so on. We can choose how many components to keep for further analysis. The percentage of variance explained by each component can be viewed in a scree plot.
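One way to draw the scree plot is with PCA's `explained_variance_ratio_` attribute and matplotlib; the random data below is only a placeholder:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_   # variance explained per component

plt.plot(range(1, len(ratios) + 1), ratios, marker="o")
plt.xlabel("Component number")
plt.ylabel("Proportion of variance explained")
plt.title("Scree plot")
plt.show()
```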
It is good practice to bring every feature to the same scale before applying PCA. Also, PCA assumes linear correlation between input features; if the correlation between features is non-linear, it is better to transform the features before applying PCA. PCA outputs uncorrelated components (new features) that are linear combinations of the input features.
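A common way to handle the scaling step is to chain `StandardScaler` and `PCA` in a scikit-learn pipeline; the feature matrix here is invented, with deliberately mismatched column scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) * [1, 100, 0.01]  # wildly different scales

# Standardize each column to zero mean and unit variance, then apply PCA
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_transformed = pipeline.fit_transform(X)
print(X_transformed.shape)  # (50, 2)
```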
New components calculation - the new components are the eigenvectors of the covariance matrix, and the coefficients of the input variables that make up each component are called loadings. The loadings matrix can be accessed via the ".components_" attribute of the PCA object in Python.
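A short sketch of inspecting the loadings; the feature names are hypothetical, chosen only to make the output readable. `components_` has shape (n_components, n_features):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
feature_names = ["height", "weight", "age"]  # hypothetical names

pca = PCA().fit(X)
loadings = pd.DataFrame(
    pca.components_,
    columns=feature_names,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings)  # each row: coefficients of the input features in one component
```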
Python implementation: from sklearn.decomposition import PCA
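Putting the pieces together, here is a hedged end-to-end sketch on scikit-learn's built-in iris dataset: scale the features, fit PCA, and check how much variance two components retain:

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)       # 150 samples, 4 features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                          # (150, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```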