Posts from January, 2021

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique. It is an unsupervised learning algorithm. Dimensionality refers to the number of input features or columns in a dataset. PCA reduces the number of input features in the model by grouping them together to create new features while preserving as much information as possible. The number of new features (components) produced by PCA is equal to the number of input features. The reasons for using this technique are:
- To provide a smaller set of input features to the model, after removing unwanted columns and columns that have no effect on the output.
- To group columns that are redundant, highly correlated, and depict the same underlying concept. Having these extra columns leads to overfitting and unnecessary complexity.
- For model regularization.
PCA does lead to some information loss while reducing the features, but it can make the model simpler to understand and can increase validation accuracy. PCA uses the covariance matrix. The first co...
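
As a quick illustration, here is a minimal sketch of PCA using scikit-learn. The Iris dataset and the choice of two components are assumptions made for the demo, not part of the post:

```python
# Minimal PCA sketch; dataset and component count are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)             # 4 input features
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=2)                     # keep only the first 2 components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                        # (150, 2): fewer features, same rows
print(pca.explained_variance_ratio_)          # information retained per component
```

The `explained_variance_ratio_` attribute shows how much information each component preserves, which is how the trade-off mentioned above is usually judged.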

Precision Recall Curve

The Precision-Recall (PR) curve is used to analyze the results of a classification model. A classification model outputs probabilities, and a cutoff decides whether a probability is classified as a positive or a negative output. By default, this cutoff is assumed to be 0.5: a probability below 0.5 belongs to the negative class, and a probability greater than or equal to 0.5 belongs to the positive class. But this is only an assumption. A PR curve helps us vary this assumption to best suit the problem at hand, i.e. whether we want to decrease or increase the number of positives in the prediction. The cutoff is tuned between 0 and 1 to plot the two metrics, Precision and Recall, in a PR curve; the corresponding probability cutoffs are kept separately in a table. Now coming to the concept of Precision and Recall: for all predicted positives, each can be either true (right) or false (wrong), and for all predicted negatives, each can be either true (right) or false (wrong). Precision is a...
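
A minimal sketch of plotting a PR curve with scikit-learn follows. The synthetic dataset and the logistic-regression model are illustrative assumptions:

```python
# PR curve sketch; the dataset and model here are assumptions for the demo.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class

# Precision and recall at every candidate cutoff; the cutoffs themselves
# come back separately in `thresholds`, matching the table described above.
precision, recall, thresholds = precision_recall_curve(y_test, probs)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall curve")
plt.show()
```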

Transfer Learning

Transfer learning refers to using pre-trained deep learning models on datasets similar to those they were trained on. Pre-trained deep learning models have weights (or coefficients or parameters) determined beforehand by training and optimizing them on extremely large datasets. These models have some of the highest test accuracies for the specific dataset types they have been trained on, and were the de-facto standard, state-of-the-art (SOTA) models when they were first introduced. These models are built using variations of either Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs). CNNs and RNNs work extremely well on, and are thus mostly used for, datasets that have correlated patterns amongst nearby input values, e.g. images, language, music, videos, time-series and text data. A few examples are VGG16, ResNet50 and InceptionV3 for images; BERT, GPT-2, XLNet and ELMo for NLP. Typically these CNNs and RNNs consist of numerous training layers, with each layer consisting of multiple n...
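
Here is a minimal transfer-learning sketch with Keras using VGG16, one of the pre-trained models named above. The input size, the ten-class head, and the frozen-base setup are assumptions for illustration:

```python
# Transfer-learning sketch; input shape and class count are assumed for the demo.
from tensorflow import keras
from tensorflow.keras import layers

# Load VGG16 pre-trained on ImageNet, without its original classifier head.
base = keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # freeze the pre-trained weights

# Attach a new head for a hypothetical 10-class target task.
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Only the small new head is trained on the target dataset; the frozen base reuses the features learned on the huge original dataset, which is the core idea of transfer learning.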