Posts

Showing posts from December, 2020

Precision, Recall and Accuracy for Classification models

The performance of a classification model is judged by whether the predicted class is the same as the actual class. Assume the levels of a class are either positive or negative; then every predicted positive is either true (right) or false (wrong), and every predicted negative is either true (right) or false (wrong).

- Precision is a ratio defined as (True Positives)/(Predicted Positives). Ideally 1.
- Recall is a ratio defined as (True Positives)/(Actual Positives). It is also called Sensitivity or True Positive Rate (TPR). Ideally 1.
- Specificity is a ratio defined as (True Negatives)/(Actual Negatives). It is also called Selectivity. Ideally 1.
- Accuracy is a ratio defined as (True Positives + True Negatives)/(Total Predictions). Ideally 1.

Python Implementation: sklearn.metrics -> classification_report
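A small sketch of these ratios using sklearn.metrics, with hypothetical labels chosen so the counts are easy to verify by hand:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions for illustration
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# TP = 3, FN = 1, FP = 1, TN = 5
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(accuracy_score(y_true, y_pred))   # (3 + 5) / 10 = 0.8

# classification_report prints precision, recall and f1 per class
print(classification_report(y_true, y_pred))
```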

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve is used to analyse the results of a classification model. A classification model outputs probabilities, and a cutoff decides whether a probability is classified as a positive or a negative output. By default, this cutoff is assumed to be 0.5: a probability below 0.5 belongs to the negative class and one greater than or equal to 0.5 belongs to the positive class. But this is only an assumption. An ROC curve helps us play with this assumption to best suit the problem at hand, i.e. whether we want to decrease or increase the number of positives in the prediction. The cutoff is tuned between 0 and 1 to plot two metrics in an ROC curve, namely True Positive Rate (TPR) and False Positive Rate (FPR). Only the TPR and FPR values are plotted on the graph; the corresponding probability cutoffs are kept separately in a table. Now coming to the concept of TPR and FPR - For all predicted positives, they can be either true (right) or false (wrong). For all predicted negatives, they...
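The tuning described above can be sketched with sklearn's roc_curve, which returns the TPR/FPR pairs for the plot and the cutoffs separately, as noted. The labels and scores below are hypothetical:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.35, 0.8, 0.9, 0.7]

# TPR and FPR are what gets plotted; the cutoffs live in a separate array
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, c in zip(fpr, tpr, thresholds):
    print(f"cutoff={c:.2f}  TPR={t:.2f}  FPR={f:.2f}")

# Area under the ROC curve summarises the whole trade-off in one number
print("AUC:", roc_auc_score(y_true, y_score))
```

Lowering the cutoff moves up the curve (more positives, so higher TPR but also higher FPR); raising it does the opposite.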

Analysis of Variance (ANOVA)

The ANOVA technique deals with ascertaining differences between groups within a population. The dependent variable is continuous and the independent variable is categorical. Assumptions:

- Data is normally distributed.
- Group variances are homogeneous, i.e. the groups have similar variance.
- Samples are random and independent.

It uses the F-value statistic, the ratio of 'variance among groups' to 'variance within groups'. A higher F-value leads to rejection of the null hypothesis. The null hypothesis states that there is no difference between groups; the alternate hypothesis states that at least one group is different.

One-way ANOVA: 1 continuous dependent; 1 categorical independent (> 2 levels). (* Special case - t-test: 1 continuous dependent; 1 categorical independent (2 levels))
Two-way ANOVA: 1 continuous dependent; 2 or more categorical independent.

Python Implementation: scipy.stats, statsmodels
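A minimal one-way ANOVA sketch with scipy.stats; the three groups below are hypothetical measurements, deliberately given different means so the F-value comes out large:

```python
from scipy.stats import f_oneway

# Hypothetical measurements for three groups of a categorical factor
group_a = [20, 21, 19, 22, 20]
group_b = [30, 29, 31, 28, 30]
group_c = [25, 24, 26, 25, 24]

# One-way ANOVA: F = variance among groups / variance within groups
f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
# A small p-value (e.g. < 0.05) rejects the null that all group means are equal
```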

Type-2 Error in Hypothesis Testing

A type-2 error occurs when 'a criminal is acquitted and goes scot-free'. By default, the null hypothesis supports a homogeneous distribution, a no-effect property and an idealistic situation. Rejecting a null hypothesis requires strong evidence. When, in reality, the null hypothesis is wrong but the test fails to reject it, we commit a type-2 error. The impact of a type-2 error depends on the cost associated with committing one. For example, courts all across the world start with the null hypothesis that the accused is innocent. As per a type-2 error, the court, during the final judgement, fails to reject the null hypothesis, i.e. the accused is held innocent and goes scot-free. But in reality, the accused has committed the crime. In this case, the cost of the type-2 error is huge. P.S. A null hypothesis is never 'accepted'. Either we 'reject a null hypothesis' or we 'fail to reject a null hypothesis'.
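The type-2 error rate (often called beta) can be estimated by simulation. In this hypothetical sketch the null is actually false (the two populations really differ by 0.5), so every 'fail to reject' is a type-2 error; the effect size and sample size are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_ind

# The null is actually false here: the two populations differ by 0.5
rng = np.random.default_rng(0)
trials = 1000
failures_to_reject = 0
for _ in range(trials):
    control   = rng.normal(loc=0.0, scale=1.0, size=20)
    treatment = rng.normal(loc=0.5, scale=1.0, size=20)  # real effect exists
    _, p = ttest_ind(control, treatment)
    if p >= 0.05:                 # we fail to reject a false null: type-2 error
        failures_to_reject += 1

beta = failures_to_reject / trials
print(f"Estimated type-2 error rate (beta): {beta:.2f}")
print(f"Estimated power (1 - beta): {1 - beta:.2f}")
```

With a modest effect and only 20 observations per group, beta is large; increasing the sample size drives it down.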

Type 1 Error in Hypothesis Testing

A type-1 error occurs when 'we declare an innocent person guilty'. By default, the null hypothesis supports a homogeneous distribution and a no-effect property. Rejecting a null hypothesis requires a fair amount of certainty. If the null hypothesis is actually true but our result shows otherwise, i.e. we reject the null hypothesis, we commit a type-1 error. The impact of this error depends on the cost associated with committing one. For example, suppose I am shopping at a store that has only good apples. I believe a particular apple is bad and hence replace the one in my bag. Here I have committed a type-1 error, but it costs me nothing since the replacement apple, too, is good. On the other hand, this error would be costly if the lot had many bad apples and we removed a good apple from our bag thinking it to be bad.
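The type-1 error rate is, by construction, the significance level alpha. A hypothetical simulation makes this concrete: both samples come from the same population, so the null is true and every rejection is a type-1 error:

```python
import numpy as np
from scipy.stats import ttest_ind

# The null is actually true here: both samples come from the same population
rng = np.random.default_rng(42)
trials = 2000
rejections = 0
for _ in range(trials):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = ttest_ind(a, b)
    if p < 0.05:                  # we reject a true null: type-1 error
        rejections += 1

print(f"Estimated type-1 error rate: {rejections / trials:.3f}")
# close to the chosen significance level alpha = 0.05
```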

Central Limit Theorem (CLT)

The Central Limit Theorem states that as the sample size increases above a certain threshold, the frequency distribution of sample means follows a normal distribution irrespective of the underlying distribution of the population.

- For any population distribution, a sample size > 30 will satisfy the CLT.
- For a slightly skewed/fairly symmetric population distribution, a sample size > 15 will satisfy the CLT.
- For a normal population distribution, any sample size will satisfy the CLT.
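A quick hypothetical demonstration: the population below is exponential (heavily skewed), yet the means of samples of size 30 cluster normally around the population mean, with spread shrinking as 1/sqrt(n):

```python
import numpy as np

# Exponential population with scale 1: mean = 1, standard deviation = 1
rng = np.random.default_rng(7)
n = 30
sample_means = [rng.exponential(scale=1.0, size=n).mean() for _ in range(5000)]

mean_of_means = np.mean(sample_means)
std_of_means = np.std(sample_means)
print(f"mean of sample means: {mean_of_means:.3f}")  # ~1.0 (population mean)
print(f"std of sample means:  {std_of_means:.3f}")   # ~1/sqrt(30) = 0.183
```

A histogram of sample_means would look bell-shaped even though the population itself is far from normal.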

SMOTE for oversampling minority class in an Imbalanced Dataset

SMOTE stands for Synthetic Minority Oversampling Technique. It uses the k-nearest neighbours of the minority class to create synthetic rows for that class. We can specify how much data to synthesize by mentioning the ratio to the majority class. We can also specify which classes to resample: majority, minority, not majority, not minority, or all classes. There are other variants of SMOTE in which the minority class is synthesized using the SVM algorithm or KMeans clustering.

Python Implementation: pip install imbalanced-learn ; from imblearn.over_sampling import SMOTE ; <SMOTE object>.fit_resample(X, Y)
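The core interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration of the technique, not the imbalanced-learn implementation: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point somewhere on the segment between them (the function name and data are hypothetical):

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Minimal SMOTE-style interpolation over the minority class only."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Four hypothetical minority-class rows; synthesize four more
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
new_rows = smote_sketch(minority, n_synthetic=4)
print(new_rows)  # each synthetic row lies between two real minority rows
```

Because every synthetic point is an interpolation, it stays inside the region spanned by the real minority samples, which is what distinguishes SMOTE from simply duplicating rows.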