Posts

Factor Analysis

Factor analysis is a dimensionality reduction technique. It tries to uncover the inherent latent factors behind the input features. It is based on the correlation matrix of the input features and requires a large sample size, since correlations stabilize only after a large number of data points. It is more exploratory in nature than the other common dimensionality reduction technique, PCA. It also differs from PCA in that PCA is based on the explained-variance concept of the input features. Another difference is that PCA only produces orthogonal components, whereas factor analysis has both variants - oblique rotations (allowing correlated factors - direct oblimin, promax) and orthogonal rotations (uncorrelated factors - varimax, equimax, quartimax). Factor analysis with oblique rotation is generally preferred, as it can still produce orthogonal output if the input features happen to be uncorrelated. Both FA and PCA use the variance-covariance matrix for calculation, but the diagonal values differ. PCA uses 1 in the diagonal values, hen...
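
A minimal sketch, assuming scikit-learn is available (its FactorAnalysis supports only orthogonal rotations such as varimax; oblique rotations like promax or oblimin typically need a separate package such as factor_analyzer):

    # Factor analysis sketch with an orthogonal (varimax) rotation.
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import FactorAnalysis

    X = load_iris().data                          # 4 correlated numeric features
    X_std = StandardScaler().fit_transform(X)     # FA works on standardized features

    fa = FactorAnalysis(n_components=2, rotation="varimax")
    scores = fa.fit_transform(X_std)              # latent factor scores per sample

    print(fa.components_)                         # factor loadings (factors x features)
    print(scores.shape)                           # (150, 2)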

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique. It is an unsupervised learning algorithm. Dimensionality refers to the number of input features or columns in a dataset. PCA reduces the number of input features in the model by grouping them together to create new features while preserving as much information as possible. The number of new features (components) produced by PCA is equal to the number of input features. The reasons for using this technique are - (1) provide a smaller set of input features to the model, after having removed unwanted columns and columns having no effect on the output; (2) group columns which are redundant, highly correlated and depict the same underlying concept, since having these extra columns leads to overfitting and unnecessary complexity; (3) model regularization. PCA does lead to some information loss while reducing the features, but it can make the model simpler to understand and increase validation accuracy. PCA uses the covariance matrix. The first co...
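
A minimal sketch, assuming scikit-learn is available, showing how the explained variance per component guides how many components to keep:

    # PCA sketch: fit all components, then inspect explained variance.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = load_iris().data                          # 4 input features
    X_std = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale

    pca = PCA(n_components=4)                     # as many components as input features
    X_pca = pca.fit_transform(X_std)

    print(pca.explained_variance_ratio_)             # variance captured by each component
    print(np.cumsum(pca.explained_variance_ratio_))  # cumulative share, to pick a cutoff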

Precision Recall Curve

Precision Recall (PR) curve is used for the analysis of results from a classification model. A classification model outputs probabilities. A cutoff decides whether to classify a probability as a positive or a negative output. By default, this is assumed to be 0.5, so a probability below 0.5 belongs to the negative class and a probability greater than or equal to 0.5 belongs to the positive class. But this is an assumption. A precision-recall (PR) curve helps us play with this assumption to best suit the problem at hand, i.e. whether we want to decrease or increase the number of positives from the prediction. This cutoff is varied between 0 and 1 to plot the two metrics, Precision and Recall, in a PR curve. The corresponding probability cutoffs are kept separately in a table. Now coming to the concept of Precision and Recall - for all predicted positives, they can be either true (right) or false (wrong); for all predicted negatives, they can be either true (right) or false (wrong). Precision is a...
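
A minimal sketch, assuming scikit-learn is available (the classifier and synthetic data below are illustrative):

    # Precision-recall curve: precision and recall at each probability cutoff.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_curve

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = clf.predict_proba(X_test)[:, 1]       # probability of the positive class

    precision, recall, thresholds = precision_recall_curve(y_test, probs)
    # precision vs recall is what gets plotted; thresholds holds the matching cutoffs
    for p, r, t in list(zip(precision, recall, thresholds))[:5]:
        print(f"cutoff={t:.2f}  precision={p:.2f}  recall={r:.2f}")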

Transfer Learning

Transfer learning refers to using pre-trained deep learning models on datasets similar to the ones they were originally trained on. Pre-trained deep learning models have weights (or coefficients or parameters) determined beforehand by training and optimizing them on extremely huge datasets. These models have some of the highest test accuracies for the specific dataset types they have been trained on and were the de-facto standard, state-of-the-art (SOTA) models when they were first introduced. These models are built using a variation of either Convolutional neural nets (CNNs) or Recurrent neural nets (RNNs). CNNs and RNNs work extremely well on, and are thus mostly used for, datasets that have correlated patterns amongst nearby input values e.g. images, languages, music, videos, time-series and text data. A few examples are - VGG16, ResNet50, InceptionV3 for images; BERT, GPT-2, XLNet, ELMO for NLP. Typically these CNNs and RNNs consist of numerous training layers, with each layer consisting of multiple n...
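
A minimal sketch of the idea, assuming TensorFlow/Keras is available and using the VGG16 model mentioned above (the 10-class head and input size are illustrative assumptions):

    # Transfer learning: reuse frozen pre-trained layers, train only a new head.
    import tensorflow as tf

    base = tf.keras.applications.VGG16(weights="imagenet",    # pre-trained ImageNet weights
                                       include_top=False,     # drop the original classifier head
                                       input_shape=(224, 224, 3))
    base.trainable = False                                     # freeze the pre-trained layers

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),       # new head for the new task
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=5)          # trains only the new head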

Precision, Recall and Accuracy for Classification models

The accuracy of a classification model is judged by whether the predicted class is the same as the actual class. Let us assume the levels of a class are either positive or negative; then - For all predicted positives, they can be either true (right) or false (wrong). For all predicted negatives, they can be either true (right) or false (wrong). Precision is a ratio defined as (True Positives)/(Predicted Positives). Ideally 1. Recall is a ratio defined as (True Positives)/(Actual Positives). It is also called Sensitivity or True Positive Rate (TPR). Ideally 1. Specificity is a ratio defined as (True Negatives)/(Actual Negatives). It is also called Selectivity. Ideally 1. Accuracy is a ratio defined as (True Positives + True Negatives)/(Total Predictions). Ideally 1. Python Implementation: sklearn.metrics -> classification_report
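
A minimal sketch of the classification_report mentioned above, assuming scikit-learn is available (the labels below are illustrative):

    # Precision, recall and accuracy from predicted vs actual classes.
    from sklearn.metrics import classification_report, confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # predicted classes

    print(confusion_matrix(y_true, y_pred))       # [[TN, FP], [FN, TP]]
    print(classification_report(y_true, y_pred))  # precision, recall, f1-score, accuracy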

Receiver Operating Characteristic (ROC) Curve

Receiver Operating Characteristic (ROC) curve is used for the analysis of results from a classification model. A classification model outputs probabilities. A cutoff decides whether to classify a probability as a positive or a negative output. By default, this is assumed to be 0.5, so a probability below 0.5 belongs to the negative class and a probability greater than or equal to 0.5 belongs to the positive class. But this is an assumption. An ROC curve helps us play with this assumption to best suit the problem at hand, i.e. whether we want to decrease or increase the number of positives from the prediction. This cutoff is varied between 0 and 1 to plot two metrics in an ROC curve, namely - True Positive Rate (TPR) and False Positive Rate (FPR). Only TPR and FPR values are plotted on the graph; the corresponding probability cutoffs are kept separately in a table. Now coming to the concept of TPR and FPR - For all predicted positives, they can be either true (right) or false (wrong). For all predicted negatives, they...
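
A minimal sketch, assuming scikit-learn is available (classifier and synthetic data are illustrative):

    # ROC curve: TPR vs FPR for each probability cutoff, plus the AUC summary.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve, roc_auc_score

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = clf.predict_proba(X_test)[:, 1]           # probability of the positive class

    fpr, tpr, thresholds = roc_curve(y_test, probs)   # FPR and TPR at each cutoff
    print("AUC:", roc_auc_score(y_test, probs))       # area under the ROC curve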

Analysis of Variance (ANOVA)

The ANOVA technique deals with ascertaining differences between groups within a population. The dependent variable is continuous and the independent variable is categorical. Assumptions - (1) the data is normally distributed; (2) the population and group variances are homogeneous; (3) the samples are random and independent. It uses the F-value statistic, which is the ratio of 'variance among groups'/'variance within groups'. A higher F-value leads to rejection of the null hypothesis. The null hypothesis states that there is no difference between groups; the alternate hypothesis states that at least one group is different. One-way ANOVA: 1 continuous dependent; 1 categorical independent (> 2 levels) (* Special case - t-test: 1 continuous dependent; 1 categorical independent (2 levels)). Two-way ANOVA: 1 continuous dependent; 2 or more categorical independents. Python Implementation: scipy.stats, statsmodels
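
A minimal one-way ANOVA sketch using scipy.stats as mentioned above (the three groups are illustrative data):

    # One-way ANOVA: does at least one group mean differ from the others?
    from scipy import stats

    group_a = [23, 25, 27, 22, 26]
    group_b = [30, 31, 29, 33, 32]
    group_c = [24, 26, 25, 27, 23]

    f_value, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f"F = {f_value:.2f}, p = {p_value:.4f}")   # small p-value -> reject the null hypothesis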