Classification Model Scoring with Scikit-Learn

Introduction

I wish evaluating classification models were a matter of getting as close to 1.0 as possible and calling it a day. The problem is not so much the number of evaluation angles, such as precision vs recall, as the difficulty of internalising the notions that underpin them.

This is a slow, long tutorial, poisonous to the TLDR crowd, so you’ve been warned. But if you’ve been struggling for a while with the different metrics used for evaluating classification models, I’m sure you’ll find something here that makes a concept or two click.

Unlike other tutorials in my Scikit-Learn series, this one is more focused on the underlying terminology and notions, rather than Python code, which is the easy bit.

Obligatory Imports and Boilerplate

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # used further down to beautify the confusion matrix
from IPython.display import display, Markdown
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, auc
from sklearn.linear_model import LogisticRegression

Sample Data Sets

In the evaluation of classification models, we don’t care about the data set’s features, but only about the y values.

In other words, we only need the values found in y_pred, compared to the actual correct values:

model = InstantiateModel()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

As such, we have defined the correct result as y_actual, and four sample predicted results (i.e., versions of y_pred) as follows:

y_actual  = [1,1,0,0] # Actual (like y_test)
y_100_t   = [1,1,1,1] # 100% True
y_100_f   = [0,0,0,0] # 100% False
y_075_t   = [1,1,1,0] #  75% True
y_075_f   = [1,0,0,0] #  75% False

Every time we discuss a scoring metric, we will reference the above values. To display them in a user-friendly manner, we will use the code below.

Note: You don’t need to be concerned with the implementation details, so please feel free to skip ahead.

def val_type(actual,predicted):
    predictions = []
    for a,p in zip(actual,predicted):
        if a == 0 and p == 0:
            predictions.append("0 (TN)")
        if a == 0 and p == 1:
            predictions.append("1 (FP)")
        if a == 1 and p == 0:
            predictions.append("0 (FN)")
        if a == 1 and p == 1:
            predictions.append("1 (TP)")
    return predictions

def result(method):
    display(Markdown("Reference values"))
    df = pd.DataFrame({"Actual"      : y_actual,
                       "100% True"   : val_type(y_actual,y_100_t),
                       "100% False"  : val_type(y_actual,y_100_f),
                       "75% True"    : val_type(y_actual,y_075_t),
                       "75% False"   : val_type(y_actual,y_075_f)})
    display(df.style.hide_index())
    display(Markdown("Results for `{}()`".format(method.__name__)))
    if isinstance(method(y_actual,y_100_t),np.ndarray):
        m = lambda a,p: list(map(str,method(a,p)))
    else:
        m = lambda a,p: [method(a,p)]
    dfr = pd.DataFrame({"Actual"     : m(y_actual,y_actual),
                        "100% True"  : m(y_actual,y_100_t),
                        "100% False" : m(y_actual,y_100_f),
                        "75% True"   : m(y_actual,y_075_t),
                        "75% False"  : m(y_actual,y_075_f)})
    display(dfr.style.hide_index())

Confusion Matrix

Before we can discuss metrics such as accuracy, precision, and recall, we need to talk about confusion matrices first.

In case you were wondering, a confusion matrix has nothing to do with a certain Chinese philosopher.

Confucius

A confusion matrix is a table that describes the performance of a classification model in four dimensions: true negatives, true positives, false positives, and false negatives.

Confusion Matrix

This matrix is surprisingly hard to memorise, despite consisting of only four boxes. Let’s make it easier:

Backslash for slashing success!

Use this mnemonic device to remember that the boxes that contain the correct matches (true negatives and true positives) are those that form a backslash.

Note: Scikit-Learn’s confusion matrix presents—along the backslash—first the true negatives, and then the true positives. Outside the Python world, the converse is true. If not looking at Scikit-Learn’s confusion matrix’s output, be sure to check how the matrix is arranged. In most cases, the true positives are likely to come first.
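
If you would rather not memorise the layout at all, scikit-learn lets you unpack the four cells of a binary confusion matrix by name with ravel(). A minimal sketch, reusing the 75% False sample prediction:

from sklearn.metrics import confusion_matrix

# For a binary problem, scikit-learn's row-major order is TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix([1,1,0,0],          # actual
                                  [1,0,0,0]).ravel()  # predicted (75% False)
print(tn, fp, fn, tp)  # -> 2 0 1 1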

Now that you know how the matrix is arranged, let’s cover each combination in detail. But before we do that, let’s see what the confusion matrices look like for the sample values that we have defined:

result(confusion_matrix)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for confusion_matrix()

Actual   100% True   100% False   75% True   75% False
[2 0]    [0 2]       [2 0]        [1 1]      [2 0]
[0 2]    [0 2]       [2 0]        [0 2]      [1 1]

True Negatives

[*  ] <- Upper-left corner
[   ]

True negatives are the values that were predicted to be false, and turned out to be false.

For example, for [1,1,0,0], the predicted values [1,1,1,0] (75% True) match exactly one negative (the last zero), and hence we see 1 in the upper-left corner.

confusion_matrix([1,1,0,0],
                 [1,1,1,0])
array([[1, 1],
       [0, 2]])

True Positives

[   ]
[  *] <- Bottom-right corner

True positives are the values that were predicted to be true, and were effectively true.

For example, for [1,1,0,0], the predicted values [1,0,0,0] (75% False) result in exactly one true positive, and hence we see 1 in the bottom-right corner.

confusion_matrix([1,1,0,0],
                 [1,0,0,0])
array([[2, 0],
       [1, 1]])

False Positives

[  *] <- Upper-right corner
[   ]

A false positive is a value that was predicted to be true, but turned out to be false.

For example, for [1,1,0,0], the predicted values [1,1,1,1] result in two true positives, and two false positives, and hence the following confusion matrix.

confusion_matrix([1,1,0,0],
                 [1,1,1,1])
array([[0, 2],
       [0, 2]])

False Negatives

[   ]
[*  ] <- Bottom-left corner

A false negative is a value that was predicted to be false, but turned out to be true. For example, for [1,1,0,0], the predicted values [0,0,0,0] result in two true negatives, and two false negatives, and hence the following confusion matrix:

confusion_matrix([1,1,0,0],
                 [0,0,0,0])
array([[2, 0],
       [2, 0]])

Key Metrics

Accuracy

Accuracy is the metric closest to one’s intuition, but not necessarily a good yardstick for model ‘performance’.

Accuracy is essentially the proportion of correct predictions over all predictions, regardless of whether the correct predictions are of true or false values.

Accuracy = (TP+TN) / (TP+TN+FP+FN)

Let’s see how the accuracy_score() function behaves against our sample data sets:

result(accuracy_score)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for accuracy_score()

Actual     100% True   100% False   75% True   75% False
1.000000   0.500000    0.500000     0.750000   0.750000

As you can see, both [1,1,1,1] and [0,0,0,0] result in an accuracy of 0.5 (i.e., 50%). This is because in the first case, the two ones match, whereas in the latter, the two zeros match.

Key Insights

Dummy Models can Provide High Accuracy

An array filled with zeros or ones provides 50% accuracy for a balanced data set like ours. However, if the data set were to be unbalanced, accuracy could be even higher. For example, if the reference data set were [0,1,1,1], a fixed prediction like [1,1,1,1], would achieve 75% accuracy.

This means that an ineffective model, which behaves no better than a dummy model, can provide misguided high levels of confidence, when accuracy is used as a metric.
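
To make the point concrete, scikit-learn even ships a DummyClassifier that implements exactly this kind of ‘always predict the majority class’ strategy. A minimal sketch (the tiny data set below is made up, and its feature values are irrelevant to the dummy model):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X = [[0], [0], [0], [0]]   # features are ignored by the dummy model
y = [0, 1, 1, 1]           # unbalanced labels: 75% ones

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy.predict(X))                      # [1 1 1 1]
print(accuracy_score(y, dummy.predict(X)))   # 0.75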

There Is No Discrimination Between True and False Matches

Both [1,1,1,1] and [0,0,0,0] result in a 50% level of accuracy, against [1,1,0,0], but this result is OK only if one and zero have the same semantic weighting.

In the case of [1,1,1,1], the last two ones are false positives. In the case of [0,0,0,0], the first two zeros are false negatives.

Why should you care? Say that the class represents the chance of a fraudulent transaction, so that 1 means the transaction is fraudulent and 0 means it is good to go. Whereas [1,1,1,1] incorrectly flags 50% of transactions as fraudulent, [0,0,0,0] lets 100% of the fraudulent transactions through.

In the fraud detection scenario, false positives (which normally lead to a manual review process) have much greater levels of tolerance than false negatives. The same principle applies in other scenarios such as cancer detection.
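
A quick way to see this asymmetry in code is to compare the confusion matrices of the two dummy predictions. Both score 50% accuracy, but the kinds of mistakes they make are very different:

from sklearn.metrics import confusion_matrix

# y_actual = [1,1,0,0], where 1 = fraudulent, 0 = legitimate

# Flag everything as fraud: two false positives (manual reviews), no fraud missed.
print(confusion_matrix(y_actual, [1, 1, 1, 1]))
# [[0 2]
#  [0 2]]

# Flag nothing as fraud: two false negatives, all the fraud goes through.
print(confusion_matrix(y_actual, [0, 0, 0, 0]))
# [[2 0]
#  [2 0]]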

Custom Calculation 1

The basic formula boils down to taking the number of matches and dividing it by the number of values.

def accuracy_1(y_actual,y_predicted):
    matches = sum([ y1 == y2 for y1,y2 in zip(y_actual,y_predicted) ])
    total = len(y_actual)
    return matches / total

result(accuracy_1)

Custom Calculation 2 (For Intuition)

This formula only aims to express accuracy in terms of true positives, true negatives, false positives, and false negatives.

def accuracy_2(y_actual,y_predicted):
    # not using bitwise operators for clarity!
    tp = sum([ y1==1 and y2==1 for y1,y2 in zip(y_actual,y_predicted) ])
    tn = sum([ y1==0 and y2==0 for y1,y2 in zip(y_actual,y_predicted) ])
    fp = sum([ y1==0 and y2==1 for y1,y2 in zip(y_actual,y_predicted) ])
    fn = sum([ y1==1 and y2==0 for y1,y2 in zip(y_actual,y_predicted) ])
    return (tp+tn)/(tp+tn+fp+fn)

result(accuracy_2)

Precision

In machine learning, precision, unlike accuracy, is a metric that only cares about the correct identification of true positives.

It follows that this metric does not reward the identification of negative values, but penalises any misjudged true values (false positives):

Precision = TP / (TP + FP)

Let’s run Scikit-Learn’s precision_score() against our sample data sets, so that we can discuss the results.

def precision_score_(a,p):
    # zero_division=0 silences the warning (and returns 0) when there are
    # no predicted positives at all, as with the 100% False data set
    return precision_score(a,p,zero_division=0)
result(precision_score_)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for precision_score_()

Actual     100% True   100% False   75% True   75% False
1.000000   0.500000    0.000000     0.666667   1.000000

Let’s see why for [1,1,0,0], [1,0,0,0] (75% False) was rated as having 100% precision:

confusion_matrix([1,1,0,0],
                 [1,0,0,0])
array([[2, 0],
       [1, 1]])

Here we see that there is only one true positive in the bottom-right corner, and zero false positives in the upper-right corner:

1 TP / (1 TP + 0 FP) = 1.0 (100% Precision)

Let’s see, however, what happens if the prediction is [1,1,1,0] (75% True). Then, we have one false positive in the upper-right corner:

confusion_matrix([1,1,0,0],
                 [1,1,1,0])
array([[1, 1],
       [0, 2]])

So, even though we now have two true positives in the bottom-right corner, the score gets penalised by the false positive in the upper-right corner:

2 TP / (2 TP + 1 FP) = 2 / 3 = 0.6666… (67% Precision)

precision_score([1,1,0,0],
                [1,1,1,0])
0.6666666666666666

Custom Calculation

def precision_1(y_actual,y_predicted):
    tp = sum([ y1==1 and y2==1 for y1,y2 in zip(y_actual,y_predicted) ])
    fp = sum([ y1==0 and y2==1 for y1,y2 in zip(y_actual,y_predicted) ])
    if tp + fp != 0:
        return tp / (tp + fp)
    return 0.0

result(precision_1)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for precision_1()

Actual     100% True   100% False   75% True   75% False
1.000000   0.500000    0.000000     0.666667   1.000000

Recall

Arnold Schwarzenegger in Total Recall (1990)

Recall? Total recall? No, it’s not an Arnold Schwarzenegger movie! Recall is a metric similar to precision, in that it gives kudos for true positives, but it penalises false negatives rather than false positives.

Recall = TP / (TP + FN)

For this reason, recall is a metric that is preferred in use cases such as cancer detection, in which false positives are tolerable, but false negatives are not.

Recall is also known as True Positive Rate (TPR), sensitivity, or simply probability of detection.

Let’s see the results of applying recall_score() to our sample data sets.

result(recall_score)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for recall_score()

Actual     100% True   100% False   75% True   75% False
1.000000   1.000000    0.000000     1.000000   0.500000

Let’s see why [1,0,0,0] (75% False) results in a 0.5 recall score against the actual values [1,1,0,0].

In the case of actual=1, predicted=1, we have one true positive:

Recall = 1 TP / (1 TP + … )

Then, in the case of actual=1, predicted=0, we have one false negative:

Recall = 1 TP / (1 TP + 1 FN) = 1 / 2 = 0.5

The last two cases, in which actual=0 and predicted=0, result in true negatives (TN), which recall, as a metric, doesn’t care about.

Custom Calculation

def recall_1(y_actual,y_predicted):
    # not using bitwise operators for clarity!
    tp = sum([ y1==1 and y2==1 for y1,y2 in zip(y_actual,y_predicted) ])
    fn = sum([ y1==1 and y2==0 for y1,y2 in zip(y_actual,y_predicted) ])
    # prevent division by zero
    if tp+fn != 0:
        return tp/(tp+fn)
    else:
        return 0.0

result(recall_1)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for recall_1()

Actual     100% True   100% False   75% True   75% False
1.000000   1.000000    0.000000     1.000000   0.500000

F1

Formula 1

It seems that these machine learning guys are obsessed with celebrities. First Arnie, now Hamilton… what on earth is an F1 score? Does it keep the time for each lap? Of course not.

F1 provides a balance between the precision and recall scores. The math magic behind it (the harmonic mean) prevents a high value on one side from masking a low value on the other:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Let’s feed our sample data sets and see F1 in action:

result(f1_score)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for f1_score()

Actual     100% True   100% False   75% True   75% False
1.000000   0.666667    0.000000     0.800000   0.666667

Let’s see why [1,0,0,0] resulted in an F1 score of 0.666667.

First, we calculate precision:

Precision = 1 TP / (1 TP + 0 FP) = 1.0

Then, we calculate Recall:

Recall = 1 TP / (1 TP + 1 FN) = 0.5

Finally, we apply the F1 formula:

F1 = 2 * (1 Precision * 0.5 Recall) / (1 Precision + 0.5 Recall)

F1 = 2 * 0.5 / 1.5

F1 = 1 / 1.5

F1 = 0.66666…

Custom Calculation

def f1_1(y_actual,y_predicted):
    precision = precision_1(y_actual,y_predicted)
    recall    = recall_1(y_actual,y_predicted)
    # prevent division by zero
    if precision + recall != 0:
        return 2 * ((precision * recall) / (precision + recall))
    else:
        return 0.0

result(f1_1)

Reference values

Actual   100% True   100% False   75% True   75% False
1        1 (TP)      0 (FN)       1 (TP)     1 (TP)
1        1 (TP)      0 (FN)       1 (TP)     0 (FN)
0        1 (FP)      0 (TN)       1 (FP)     0 (TN)
0        1 (FP)      0 (TN)       0 (TN)     0 (TN)

Results for f1_1()

Actual     100% True   100% False   75% True   75% False
1.000000   0.666667    0.000000     0.800000   0.666667

Sensitivity and Specificity

Sensitivity, which is also known as the true positive rate, is the same as recall:

Sensitivity = Recall = TP / (TP + FN)

Specificity, also known as true negative rate, is a metric that rewards the correct identification of negatives—true negatives—but penalises the misclassification of positives—false positives:

Specificity = TN / (TN + FP)

Specificity is the counterpart of sensitivity.

In a disease diagnosis test scenario we use sensitivity to describe how well the test can identify the presence of the disease, and specificity to tell how well the test can identify the absence of the disease.
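
Scikit-learn does not ship a dedicated specificity function, but since specificity is simply the recall of the negative class, one way to obtain it is to pass pos_label=0 to recall_score(). A minimal sketch against the 75% False sample prediction:

from sklearn.metrics import recall_score

sensitivity = recall_score([1,1,0,0], [1,0,0,0])               # TP / (TP + FN) = 1/2
specificity = recall_score([1,1,0,0], [1,0,0,0], pos_label=0)  # TN / (TN + FP) = 2/2
print(sensitivity, specificity)  # 0.5 1.0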

Comprehensive Classification Report

What if we wanted all the metrics, including precision, recall, and F1, in one single stroke? Is there an app (er, a Scikit-Learn function) for that? Yes, and it’s simply called classification_report()!

Let’s feed our sample data sets into it:

for name, y_predicted in [("Actual",y_actual),
                          ("100% True",y_100_t),
                          ("100% False",y_100_f),
                          ("75% True",y_075_t),
                          ("75% False", y_075_f)]:
    display(Markdown("## {}".format(name)))
    display(Markdown("```\n   Actual: {}\nPredicted: {}\n```"
                     .format(y_actual,y_predicted)))
    print(classification_report(y_actual,
                                y_predicted,
                                target_names=['False', 'True'],
                                zero_division=0))

Actual

   Actual: [1, 1, 0, 0]
Predicted: [1, 1, 0, 0]
              precision    recall  f1-score   support

       False       1.00      1.00      1.00         2
        True       1.00      1.00      1.00         2

    accuracy                           1.00         4
   macro avg       1.00      1.00      1.00         4
weighted avg       1.00      1.00      1.00         4

100% True

   Actual: [1, 1, 0, 0]
Predicted: [1, 1, 1, 1]
              precision    recall  f1-score   support

       False       0.00      0.00      0.00         2
        True       0.50      1.00      0.67         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4

100% False

   Actual: [1, 1, 0, 0]
Predicted: [0, 0, 0, 0]
              precision    recall  f1-score   support

       False       0.50      1.00      0.67         2
        True       0.00      0.00      0.00         2

    accuracy                           0.50         4
   macro avg       0.25      0.50      0.33         4
weighted avg       0.25      0.50      0.33         4

75% True

   Actual: [1, 1, 0, 0]
Predicted: [1, 1, 1, 0]
              precision    recall  f1-score   support

       False       1.00      0.50      0.67         2
        True       0.67      1.00      0.80         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4

75% False

   Actual: [1, 1, 0, 0]
Predicted: [1, 0, 0, 0]
              precision    recall  f1-score   support

       False       0.67      1.00      0.80         2
        True       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4

Precision-Recall Curve

When I first saw a so-called precision-recall curve, I said to myself “Oh, I get it”, except that I didn’t. I couldn’t really grasp the mechanics behind it, or make my intuition ‘operational’. The reason is that I first had to master some foundational concepts. This is what we will do now. I told you, this tutorial is toxic for the TLDR folks.

Decision Threshold

In a binary classification scenario, most models don’t simply conclude “Aha, it is True” on a whim. Instead, each True/False conclusion is the result of comparing a predicted probability (from 0.0 to 1.0) against a decision threshold.

The decision threshold in most models is typically set to 50%. For example:

Probability   Class
0.45          False
0.51          True
0.78          True
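
In code, that thresholding step is nothing more than comparing each predicted probability against the cut-off. A minimal sketch with the probabilities from the table above:

probabilities = [0.45, 0.51, 0.78]   # e.g., the second column of predict_proba()
threshold = 0.5                      # the typical default

classes = [1 if p >= threshold else 0 for p in probabilities]
print(classes)  # [0, 1, 1] -> False, True, True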

This all makes sense, you may be thinking, so what is the issue here? And how does it relate to precision-recall curves? We’ll get there, but let’s start by explaining why 50% is not necessarily an appropriate threshold.

First, we will generate some sample data containing True and False classes. You don’t need to be concerned with the code below—unless you are interested—so feel free to scroll straight to the visualisation down below.

samples = 1000

f1 = np.random.rand(samples)
f2 = np.random.rand(samples)

label = [ 1 if ((np.random.rand()/2)+x) > 0.8 else 0 for x in f2 ]

df = pd.DataFrame({"f1":f1,
                   "f2":f2,
                   "label" : label})

def decoration():
    plt.xticks([0,0.2,0.4,0.6,0.8,1])
    plt.yticks([0,0.2,0.4,0.6,0.8,1])
    plt.legend(loc='upper left')

def visualise(df1=None,model=None,decision_threshold=0.5):

    plt.rcParams['figure.figsize'] = [8, 4]
    plt.rcParams['figure.dpi'] = 100

    plt.subplot(1, 2, 1)
    plt.title("Original Data Set")
    df1_false = df1[df1['label'] == 0]
    df1_true  = df1[df1['label'] == 1]
    plt.scatter(df1_true['f1'],df1_true['f2'],label='True',s=5,color='b',alpha=0.3)
    plt.scatter(df1_false['f1'],df1_false['f2'],label='False',s=5,color='r',alpha=0.3)
    decoration()

    plt.subplot(1, 2, 2)
    plt.title("Decision Threshold: {}".format(decision_threshold))
    for m in np.linspace(0.03,0.90,10):
        for n in np.linspace(0.05,0.95,15):
            proba = model.predict_proba([[m,n]])[:,1][0]
            colour = "blue" if proba >= decision_threshold else "red"
            plt.text(m,n,"{:.2f}".format(proba),
                     fontdict={"fontsize" : 7,
                               "color" : colour})

X_train, X_test, y_train, y_test = train_test_split(df[['f1','f2']].values, df['label'].values, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
visualise(df,model,0.5)

Above, to the left, we see that the sample data set consists of True values at the top, and False values at the bottom, which mingle at the middle. By ‘middle’ we mean the region where the Y axis (feature 2) is between 0.2 and 0.8.

Above, to the right, we see the probability of True corresponding to each (x,y) coordinate, coloured according to a threshold of 0.5 (50%). For example, a dot right at the bottom has pretty much a 0% probability of being True:

# predict_proba() returns one column per class: the first contains the
# probability of the False class, the second the probability of the
# True class. We take the latter.
# 0.5 is the middle of the figure (x axis)

model.predict_proba([[0.5,0.0]])[:,1][0]
0.012990843213615034

Conversely, a coordinate right at the top is almost certainly of the True class:

model.predict_proba([[0.5,1.0]])[:,1][0]
0.9705947563393811

The figure above to the right is created simply by traversing the coordinates of a 10x15 grid, and calling model.predict_proba() for each coordinate.

Now, setting aside the particulars of a logistic regression model versus others such as an SVM, you can see that the decision threshold might produce predictions that defy the model’s intended purpose.

What I mean to say is that the default decision threshold (50%) results in many false negatives and therefore the following recall score:

y_true = label # for clarity
y_pred = [ 1 if model.predict_proba([[f1,f2]])[:,1][0] >= 0.5 else 0
            for f1,f2 in zip(f1,f2) ]
recall_score(y_true, y_pred)
0.8600451467268623

Let’s suppose that True represents the probability of a tumour. In this case, any chance of a tumour, no matter how small, we want to classify as True. We want to keep false negatives down, pretty much, to zero, which means pushing recall up towards 1.0. We should therefore decrease the decision threshold, say, to 15%:

visualise(df,model,0.15) # 0.15 = 15%

It is worth noting that altering the decision threshold means, in most cases, determining the class manually using model.predict_proba(), rather than relying on model.predict(). For example, a coordinate at a height of 0.4 (y axis = feature 2) has a probability above 0.15, so it should be classified as True:

model.predict_proba([[0.5,0.4]])[:,1][0]
0.23155237166325443

However, model.predict() uses 0.5 as its internal decision threshold, and produces a False result (represented by the class 0):

model.predict([[0.5,0.4]])[0]
0

What about recall with a decision threshold of 15%?

y_pred = [ 1 if model.predict_proba([[f1,f2]])[:,1][0] >= 0.15 else 0
            for f1,f2 in zip(f1,f2) ]
recall_score(label, y_pred)
0.9977426636568849

As you might have expected, almost perfect.

Decision Threshold Take Away

The message that needs to hit home here is that a default decision threshold of 50% might not be what best serves the prediction objective at hand.

It may be because we want a model with rock-bottom false negatives, and therefore sky-high recall (like the tumour detection example), or sky-high precision (like in the case of search engine results). But, in certain cases, we may also want some compromise between the two. This is what precision-recall curves are for.
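
If you adjust thresholds often, it can help to wrap the manual step into a small helper. The function below is just an illustrative sketch (predict_with_threshold is not a scikit-learn API):

import numpy as np

def predict_with_threshold(model, X, threshold=0.5):
    # Probability of the positive class, then a custom cut-off, instead of
    # the fixed 0.5 used internally by model.predict().
    proba = model.predict_proba(X)[:, 1]
    return (proba >= threshold).astype(int)

# e.g. y_pred = predict_with_threshold(model, df[['f1','f2']].values, threshold=0.15)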

Prerequisites

Rather than trial and error (or implementing our own algorithm), Scikit-Learn provides a handy function called precision_recall_curve() which provides the decision threshold that is required to achieve each possible precision/recall combination. However, it needs to be fed with a couple of arguments before doing our bidding:

precision_recall_curve(y_true, probas_pred,...)

  • y_true is in our case the label column.
  • probas_pred is the predicted probability of the positive class for each sample, i.e., the second column returned by model.predict_proba()

Note: In certain models, the class probability is best calculated using model.decision_function() rather than model.predict_proba(). The choice between the two is beyond the scope of this tutorial.
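
For reference, a model such as a linear SVM does not expose predict_proba() unless probability calibration is enabled; its decision_function() returns an uncalibrated score instead, and precision_recall_curve() accepts such scores just as happily. A hedged sketch, reusing the train/test split from earlier:

from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_curve

svm = LinearSVC().fit(X_train, y_train)
svm_scores = svm.decision_function(X_test)   # signed distances, not probabilities
prec_svm, rec_svm, thr_svm = precision_recall_curve(y_test, svm_scores)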

Ok, let’s do our homework. First, y_true which is as easy as:

y_true = df['label']
y_true.head().values
array([1, 0, 0, 0, 0])

Next is proba_pred, which is an array containing the probability of the positive (True) class for each sample:

proba_pred = model.predict_proba(df[['f1','f2']].values)[:, 1]
proba_pred[0:5]
array([0.85057493, 0.05799033, 0.29637915, 0.17768424, 0.09430635])

At Last, The Curve!

I know you are getting impatient, so let’s get on with it. First we get the precision, recall, and thresholds arrays, which we will then use to plot the curve:

precision, recall, thresholds = precision_recall_curve(y_true, proba_pred)
pd.DataFrame({"precision":precision[:-1],
              "recall" : recall[:-1],
              "thresholds":thresholds
}).head()

precision recall thresholds
0 0.646715 1.000000 0.137600
1 0.646199 0.997743 0.139335
2 0.647145 0.997743 0.139684
3 0.648094 0.997743 0.141756
4 0.649046 0.997743 0.145679

Why precision[:-1] and recall[:-1]? Because the length of thresholds is always one element shorter than that of precision and recall. See for yourself:

for a in [precision,recall,thresholds]:
    print(len(a))
686
686
685

The last value of precision is 1.0, and the last value of recall is 0.0, which do not result in a computable threshold. This is why there is a discrepancy in the arrays’ lengths. Blame the mathematicians, not me!

precision[-1]
1.0
recall[-1]
0.0

What about the plot?

Sorry about that! Here we go. As usual, skip straight ahead to the plot, and then we’ll discuss.

def visualise_precision_recall_curve():
    plt.rcParams['figure.figsize'] = [4, 4]
    plt.rcParams['figure.dpi'] = 100
    plt.xlim([0.0, 1.2])
    plt.ylim([0.0, 1.2])
    plt.xticks([0.0,0.2,0.4,0.6,0.8,1.0])
    plt.yticks([0.0,0.2,0.4,0.6,0.8,1.0])
    plt.plot(precision, recall, label='Precision-Recall Curve')
    plt.xlabel('Precision')
    plt.ylabel('Recall')
    plt.grid(alpha=0.2)
    last_y = 0.0
    for i in range(0,len(precision)-1):
        if i == 0 or last_y - recall[i] > 0.055:
            plt.text(precision[i]+0.03,
                     recall[i]+0.021,
                     "{:.2f}".format(thresholds[i]),alpha=0.8)
            last_y = recall[i]
            plt.plot(precision[i],recall[i],'-ro',markersize=3)

visualise_precision_recall_curve()

The above plot shows how a low decision threshold such as 13% results in better recall, and how high decision thresholds, such as 97%, result in better precision, at the expense of compromised recall.

What about the best compromise between recall and precision? Is that perhaps 60%, or 67%? Your intuition about the usefulness of precision-recall curves is right, but the decision thresholds I’ve plotted above are just a few samples for the sake of visualisation.

The best ‘compromise’ between precision and recall is precisely what the F1 score is for, which we covered a few sections back. However, what we want is the threshold that produces the highest F1 score. How do we do that?

All we have to do is to calculate the score for all thresholds and then pick the highest one:

f1_scores = (2 * precision * recall) / (precision + recall)
i = np.argmax(f1_scores)
print("Decision threshold: {:.2f}, precision: {:.2f}, recall: {:.2f}"
        .format(thresholds[i],precision[i],recall[i]))
Decision threshold: 0.47, precision: 0.84, recall: 0.88

Can we see where that point is along the curve?

Naturally. Spot the purple cross down below.

visualise_precision_recall_curve()
plt.plot(precision[i],recall[i],'x',color='purple',markersize=12)
plt.show()

Receiver Operating Characteristic (ROC) Curve

Yet another curve. Isn’t the precision-recall curve enough for most use cases? Yes, when it comes to the trade-off between precision and recall, but what about recall vs specificity?

To recap, specificity (referred to as the true negative rate in ROC curves) tells us, in the medical field, how effective the model is at identifying the absence of a disease. Recall (referred to as the true positive rate in ROC curves) tells us how effective the model is at identifying the presence of a disease.

In short, a ROC Curve is what you would use to evaluate the adequacy of a Covid-19 test.

On a ROC curve, you have the false positive rate (1 - specificity) on the x axis, and the true positive rate (recall) on the y axis. Similarly to the precision-recall curve, each coordinate is associated with a given decision threshold.

There are other useful insights that can be drawn from ROC curves, which we will explore in a second. Let’s first get into visualising our first curve.

Prerequisites

The function roc_curve(y_true, y_score) requires the same input arguments as precision_recall_curve(), which we have already discussed and computed in the last section. Search ‘proba_pred’.

y_score = proba_pred # just for clarity

The ROC Curve

First, we obtain the arrays for the x axis (false positive rate), the y axis (true positive rate), and the decision threshold for each combination:

fpr, tpr, thresholds_roc = roc_curve(y_true, y_score)
pd.DataFrame({"False Positive Rate":fpr,
              "True Positive Rate" : tpr,
              "Threshold":thresholds_roc
}).head()

False Positive Rate True Positive Rate Threshold
0 0.000000 0.000000 1.972006
1 0.000000 0.002257 0.972006
2 0.000000 0.485327 0.867504
3 0.001795 0.485327 0.866751
4 0.001795 0.519187 0.849187

I hope you spotted the threshold of 1.972006 when both fpr and tpr are 0.0. That first threshold is artificial: scikit-learn sets it just above the highest predicted score (in the version used here, the maximum score plus 1), so that the curve has a starting point at which nothing is classified as positive.

Now we have all we need to plot the curve:

def visualise_ROC():
    plt.xlim([-0.2, 1.02])
    plt.ylim([-0.01, 1.1])
    plt.grid()
    plt.plot(fpr, tpr, label='current classifier')

    plt.xlabel('False Positive Rate / 1-Specificity')
    plt.ylabel('True Positive Rate / Sensitivity')
    last_y = 0.0
    for i in range(1,len(tpr)-1):
        if i==0 or tpr[i] - last_y >= 0.055:
            plt.text(fpr[i]-0.15,
                     tpr[i]+0.021,
                     "{:.2f}".format(thresholds_roc[i]),
                     alpha=0.8)
            last_y = tpr[i]
            plt.plot(fpr[i],tpr[i],'-ro',markersize=3)

visualise_ROC()

You may be wondering, why 1-Specificity? Because the closer a specificity score gets to 1, the better; on the ROC curve’s x axis we need it the other way around, so that 0 is the best value and 1 is the worst.

Further Visualisation Aids

Most ROC curve visualisations include a diagonal line which indicates the curve for a so-called no skill or random classifier—one that ‘flips a coin’ rather than making a prediction.

It is also helpful to include a dot in the top left corner, to see how close the curve is to a perfect classifier.

Last, as we did with the precision-recall curve, we also want to determine which threshold results in the best trade-off between high true positives, and low false positives. This may be accomplished like so:

# g-means approach
i = np.argmax(np.sqrt(tpr * (1-fpr)))
print("Decision threshold: {:.2f}, tpr: {:.2f}, fpr: {:.2f}"
          .format(thresholds_roc[i], tpr[i], fpr[i]))
Decision threshold: 0.47, tpr: 0.88, fpr: 0.14

Now we add all of these extra helpful references to our visualisation:

def visualise_ROC_reference():
    # random classifier curve
    plt.plot([0.0, 1.0], [0.0, 1.0], color='blue', linestyle='--', alpha=0.3, label="random classifier")
    # perfect classifier
    plt.plot(0.0,1.0,'o',color='blue',label="perfect classifier")
    # optimal threshold
    plt.plot(fpr[i],tpr[i],'x',color='purple',markersize=12,label="threshold: {:.2f}".format(thresholds_roc[i]))

visualise_ROC()
visualise_ROC_reference()
plt.legend()
plt.show()

Area Under The Curve (AUC)

This is a metric that boils down the ROC curve’s performance to a single number, typically between 0.5 (no better than a random guess) and 1.0 (perfect classification). Scikit-learn provides a handy function simply called auc() for this purpose:

auc(fpr, tpr)
0.9544358482843027
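
As an aside, scikit-learn also provides roc_auc_score(), which computes the same area directly from the labels and the predicted scores, without building the curve first:

from sklearn.metrics import roc_auc_score

roc_auc_score(y_true, y_score)  # ~0.954, matching auc(fpr, tpr) above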

We can visualise AUC as follows:

visualise_ROC()
visualise_ROC_reference()
plt.gca().fill_between(fpr,tpr,facecolor='purple',alpha=0.1,label="AUC")
plt.legend()
plt.show()

Multi-Class Model Scoring

The evaluation of a multi-class model is no different than that of a binary one, in that metrics such as precision and recall are still computed one class at a time. How so? Let’s recap by looking at the results for our recent binary classification model, provided by classification_report():

print(classification_report(y_test,
                            model.predict(X_test),
                            target_names=['False', 'True'],
                            zero_division=0))
              precision    recall  f1-score   support

       False       0.94      0.90      0.92       145
        True       0.86      0.91      0.89       105

    accuracy                           0.90       250
   macro avg       0.90      0.91      0.90       250
weighted avg       0.91      0.90      0.90       250

Let’s now add one extra class to our model, so that for a given combination of features f1 and f2, in addition to labels 0 and 1, we may also have label 2:

samples = 1000
random.seed(4)

f1 = np.random.rand(samples)
f2 = np.random.rand(samples)

label = [ 1 if ((np.random.rand()/2)+x) > 0.8 else
          2 if x-(np.random.rand()/10) < 0.2 else 0
           for x in f2 ]

df = pd.DataFrame({"f1":f1,
                   "f2":f2,
                   "label" : label})
display(df.head())

f1 f2 label
0 0.354469 0.967415 1
1 0.199742 0.087077 2
2 0.740130 0.984166 1
3 0.027857 0.399047 0
4 0.276840 0.487641 0

As usual, it is useful to visualise the resulting data set, so feel free to skip the code and scroll until you see the picture with the green dots.

plt.rcParams['figure.figsize'] = [4, 4]
plt.rcParams['figure.dpi'] = 100
plt.title("Multi-class Data Set")
df_red    = df[df['label'] == 0]
df_blue   = df[df['label'] == 1]
df_green  = df[df['label'] == 2]
plt.scatter(df_blue['f1'],df_blue['f2'],label='Blue (Former True)',s=5,color='b',alpha=0.3)
plt.scatter(df_red['f1'],df_red['f2'],label='Red (Former False)',s=5,color='r',alpha=0.3)
plt.scatter(df_green['f1'],df_green['f2'],label='Green (New class!)',s=5,color='g',alpha=0.3)
ticks = [0,0.2,0.4,0.6,0.8,1.0]
plt.xticks(ticks)
plt.yticks(ticks)
plt.xlabel("f1")
plt.ylabel("f2")
plt.legend(loc='upper left')
plt.show()

If we fit the data—using a logistic regression model as we did last time—and run a general report, using classification_report(), we will note the extra class:

X_train, X_test, y_train, y_test = (
    train_test_split(df[['f1','f2']].values,
                     df['label'].values,
                     random_state=0))
model = LogisticRegression().fit(X_train, y_train)

names = ['Red (Former False)', 'Blue (Former True)','Green']
print(classification_report(y_test,
                            model.predict(X_test),
                            target_names=names,
                            zero_division=0))
                    precision    recall  f1-score   support

Red (Former False)       0.85      0.68      0.76        76
Blue (Former True)       0.81      0.92      0.86       108
             Green       0.99      1.00      0.99        66

          accuracy                           0.87       250
         macro avg       0.88      0.87      0.87       250
      weighted avg       0.87      0.87      0.86       250

Confusion Matrix

Multi-class confusion matrices are actually easier to read than binary ones because, in the default representation, the whole true positive versus true negative dichotomy is gone. Forget about the whole TP, TN, FP, FN ‘confusion’.

We still have a grid in which the y axis is for the actual classes, and the x axis is for the predicted ones, starting in ascending order from the top left corner.

Multi-class Confusion Matrix

Then, for each prediction, a count is added to the cell at the intersection of the actual class (row) and the predicted class (column). That’s it.

Let’s now see multi-class confusion matrices in action.

Example 1: Perfect Accuracy

If we match all the classes (we do it three times for each class), then we get a 3 in each cell along the backslash:

confusion_matrix([0,0,0,1,1,1,2,2,2], # actual
                 [0,0,0,1,1,1,2,2,2]) # prediction
array([[3, 0, 0],
       [0, 3, 0],
       [0, 0, 3]])

Example 2: Misclassification of one class

Now, let’s take the example of predicting 0 always incorrectly. Below, we first predict it to be a 1, and then to be a 2, two times. As a result, we have a 1 in the cell (actual 0, predicted 1), and a 2 in the cell (actual 0, predicted 2):

confusion_matrix([0,0,0,1,1,1,2,2,2], # actual
                 [1,2,2,1,1,1,2,2,2]) # prediction
array([[0, 1, 2],
       [0, 3, 0],
       [0, 0, 3]])

Note also that the cell (actual 0, predicted 0) has a 0 rather than a 3, unlike Example 1. This is because we have not predicted 0 correctly even once.

Confusion Matrix (Beautified)

It is customary to visualise complex multi-class confusion matrices using a heatmap so that it is easy to spot the classes that are best—and worst—predicted.

For this example, we will use the more complex data set we had introduced at the beginning of this section:

confusion = confusion_matrix(y_test, model.predict(X_test))

The first step is to pack the values into a DataFrame:

names = ["red","blue","green"]
dfc = pd.DataFrame(confusion,
                   index = names,
                   columns = names)
dfc

red blue green
red 52 23 1
blue 9 99 0
green 0 0 66

Then, we use the seaborn library to save a few keystrokes:

plt.figure(figsize=(3.5,3))
sns.heatmap(dfc, annot=True, cmap="YlGnBu")
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
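
If you would rather not depend on seaborn, recent versions of scikit-learn (1.0 and later) include ConfusionMatrixDisplay, which produces a similar plot straight from the labels and predictions:

from sklearn.metrics import ConfusionMatrixDisplay

# Build and plot the confusion matrix in one go.
ConfusionMatrixDisplay.from_predictions(y_test, model.predict(X_test),
                                        display_labels=["red","blue","green"])
plt.show()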

Micro vs Macro Metrics

If we apply metric functions such as precision_score(), recall_score(), or f1_score() to a multi-class array, we’ll get an error, given that these functions expect a binary array—i.e., one that consists of exactly two classes—by default.

try:
    recall_score([0,1,2],
                 [0,1,2])
except ValueError as ve:
    print(ve)
Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

Before we get into Micro vs Macro, let’s first see what happens if we simply try None:

recall_score([0,1,2],
             [0,1,2],
             average=None)
array([1., 1., 1.])

The problem goes away, but we get an individual recall score for each class. Let’s look at an example in which we have a progressively lower recall score for each class:

recall_score([0,0,0,0,1,1,2,2],
             [0,0,0,0,1,0,0,0],
             average=None)
array([1. , 0.5, 0. ])

In the above example, we have the following results for the recall formula Recall = TP / (TP + FN):

Class TP FN Result
0 4 0 4/(4+0) = 4/4 = 1
1 1 1 1/(1+1) = 1/2 = 0.5
2 0 2 0/(0+2) = 0/2 = 0

Macro Average

What if we wanted a single recall score for the entire data set? Well, we need to decide how to sum up the recall score of each individual class. The easiest thing to do is simply to sum up all individual scores and divide the result by the number of classes:

(1.0 + 0.5 + 0.0) / 3 = 1.5/3 = 0.5

That’s called the macro average!

recall_score([0,0,0,0,1,1,2,2],
             [0,0,0,0,1,0,0,0],
             average='macro')

A macro average gives equal weight to each class, which is both an advantage and a disadvantage, depending on our objective. For example, in the data set below, we predict all the instances of class 0 incorrectly, but because its weight is 1/4 regardless, we still obtain a high recall score:

recall_score([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,2,3,3],
             [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,3],
             average='macro')
0.75

Micro Average

The micro average does not care about the scores for each individual class, but the overall number of correct classifications.

recall_score([0,0,0,0,1,1,2,2],
             [0,0,0,0,1,0,0,0],
             average='micro')

The above result, 0.625, is the result of:

Actual Predicted TP FN
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
1 1 1 0
1 0 0 1
2 0 0 1
2 0 0 1

Recall = TP / (TP + FN)

Recall = 5 / (5 + 3)

Recall = 5 / 8

Recall = 0.625

But wait a second, what would count as a TN (True Negative) or False Positive (FP) in the above example? We treat this next.

Micro average in the context of TP, FP, TN, and FN

Earlier, we demonstrated how recall is calculated when average='micro', showing the count of TPs and that of FNs. But there’s something not quite right here. Say that we wanted to compute precision, which uses FPs rather than FNs. What would we count as FPs?

Actual Predicted TP FP
0 0 1 0
0 0 1 0
0 0 1 0
0 0 1 0
1 1 1 0
1 0 0 1
2 0 0 1
2 0 0 1

As you might have guessed, when average='micro' there is no difference between recall and precision, and F1 is no different either.

recall_score([0,0,0,0,1,1,2,2],
             [0,0,0,0,1,0,0,0],
             average='micro')

precision_score([0,0,0,0,1,1,2,2],
                [0,0,0,0,1,0,0,0],
                average='micro')

f1_score([0,0,0,0,1,1,2,2],
         [0,0,0,0,1,0,0,0],
         average='micro')

In all of the above cases, the score metrics just boil down to:

Score = Correct / (Correct + Incorrect)
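
Put differently, for single-label multi-class problems the micro average collapses into plain accuracy, which is easy to verify:

from sklearn.metrics import accuracy_score, recall_score

print(accuracy_score([0,0,0,0,1,1,2,2], [0,0,0,0,1,0,0,0]))                 # 0.625
print(recall_score([0,0,0,0,1,1,2,2], [0,0,0,0,1,0,0,0], average='micro'))  # 0.625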

Summary

Metrics

The four ways in which the result of a single prediction (in a binary model) may be interpreted are:

Actual Predicted Result
0 0 True Negative (TN)
1 1 True Positive (TP)
1 0 False Negative (FN)
0 1 False Positive (FP)

When evaluating multiple predictions, accuracy is a metric that simply measures the proportion of correct predictions over all predictions, but depending on the prediction’s objective, we may want a more nuanced metric:

Precision

We use precision when we want to be absolutely sure that if we got a positive, it is indeed a true positive.

Take the case of a criminal court whose goal is incarcerating as many criminals as possible, but not at the cost of sending innocent people to jail:

Actual Court Decision Outcome
Not Guilty (0) Not Guilty (0) TN - Justice is served
Guilty (1) Guilty (1) TP - A criminal is incarcerated. (Kudos!)
Guilty (1) Not Guilty (0) FN - A criminal goes free
Not Guilty (0) Guilty (1) FP - An innocent person is incarcerated (Bad!)

Precision = TP / (TP + FP)

The trade-off is that some criminals go free.

Recall / Sensitivity / True Positive Rate

We use recall when we want to predict as many positives as possible (and misjudged ones, i.e., false positives, don’t hurt), but where failing to catch a positive—a false negative scenario—has terrible consequences.

Take the case of an effective Covid-19 test. We want to detect as many infections as possible, but what is unacceptable is that we misclassify an infected person as uninfected, and let them board a plane, infecting everyone else.

Actual Test Result Outcome
Non-infected (0) Non-infected (0) TN - Non-infected person boards plane
Infected (1) Infected (1) TP - Infected person is denied boarding (Kudos!)
Infected (1) Non-Infected (0) FN - Infected person boards plane (Bad!)
Non-infected (0) Infected (1) FP - Non-infected person, unfairly, is denied boarding

Recall = TP / (TP + FN)

The trade-off is that some non-infected subjects are unfairly prevented from flying.

F1

F1 is a balance between precision and recall. For a reasonably balanced data set it might not tell you much more than simple accuracy, but it becomes more informative when the classes, or the costs of the two error types, are not balanced.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Specificity / True Negative Rate

Specificity is a metric that, unlike recall, is used to rate the certainty of ‘absence’, as opposed to the certainty of presence. While both precision and recall give ‘kudos’ for every true positive, specificity, instead, gives kudos for true negatives.

Take, again, the case of Covid-19 but let’s suppose that a certain country is interested in reporting low infection numbers to the world.

In this case, the health authorities would insist on specificity, so that every non-infected person is counted as a success, but counting a non-infected person as infected would be an unforgivable mistake in the eyes of the ruling party.

This is why specificity is also called the true negative rate: it measures the success of classifying negatives correctly.

Actual Test Result Outcome
Non-infected (0) Non-infected (0) TN - These are the ones we want to count (Kudos!)
Infected (1) Infected (1) TP - We are not reporting on those
Infected (1) Non-infected (0) FN - We are happy to misreport these ones
Non-infected (0) Infected (1) FP - Unacceptable mistake (Bad!)

Specificity = TN / (TN + FP)

Curves

The precision-recall curve helps determine the trade-off between precision and recall, for each decision threshold.

The Receiver Operating Characteristic (ROC) curve helps determine the trade-off between the true positive rate (the same as recall) and the false positive rate (which is 1 minus specificity) for each decision threshold.

Final Note

If you find any errors, please get in touch at ernesto@garba.org. I will give you credit for your feedback.

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
