Dummy Models with Scikit-Learn

Jul 24, 2022 data coding machine learning python

Introduction

Obligatory Imports and Boilerplate

import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.dummy import DummyClassifier

warnings.filterwarnings('always')

Sample Data Set

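def generate_data_set(samples, ratio):

    # Two independent features drawn uniformly from [0, 1)
    f1 = np.random.rand(samples)
    f2 = np.random.rand(samples)

    # A sample is True (1) when f2 falls below `ratio`,
    # so roughly `ratio` of all samples end up True
    label = [ 1 if x < ratio else 0 for x in f2 ]

    df = pd.DataFrame({"f1":f1,
                       "f2":f2,
                       "label" : label})

    return df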

Visualisation

def decoration():
    # Shared axis ticks and legend for both panels
    plt.xticks([0,0.2,0.4,0.6,0.8,1])
    plt.yticks([0,0.2,0.4,0.6,0.8,1])
    plt.legend(loc='upper left')

# df1 is required; df2 (the predictions) is optional
def visualise(df1,df2=None):

    plt.rcParams['figure.figsize'] = [8, 4]
    plt.rcParams['figure.dpi'] = 100

    # Left panel: the actual labels
    plt.subplot(1, 2, 1)

    df1_false = df1[df1['label'] == 0]
    df1_true  = df1[df1['label'] == 1]
    plt.scatter(df1_false['f1'],df1_false['f2'],label='Actual False',s=5,color='r',alpha=0.3)
    plt.scatter(df1_true['f1'],df1_true['f2'],label='Actual True',s=5,color='b',alpha=0.3)
    decoration()

    # Right panel: the labels predicted by the dummy model, if provided
    plt.subplot(1, 2, 2)
    if df2 is not None:
        df2_false = df2[df2['label'] == 0]
        df2_true  = df2[df2['label'] == 1]
        plt.scatter(df2_false['f1'],df2_false['f2'],color='red',s=5,label="Predicted False",alpha=0.3)
        plt.scatter(df2_true['f1'],df2_true['f2'],color='blue',s=5,label="Predicted True",alpha=0.3)
        decoration()

Dummy Prediction Scores

def dummy_predictions(df,strategy):

    X_train, X_test, y_train, y_test = train_test_split(df[['f1','f2']].values, df['label'].values, random_state=0)

    dummy = DummyClassifier(strategy = strategy).fit(X_train, y_train)

    # Build a 20x20 grid over the feature space and label it with the dummy's predictions
    f1 = []
    f2 = []
    for m in np.linspace(0,1,20):
        for n in np.linspace(0,1,20):
            f1.append(m)
            f2.append(n)
    df2 = pd.DataFrame({ "f1" : f1,
                        "f2" : f2 })
    df2['label'] = dummy.predict(df2[['f1','f2']].values)

    y_test_predicted = dummy.predict(X_test)

    # score() calls predict() again, so for the random strategies (stratified, uniform)
    # these scores can differ from the accuracy computed from y_test_predicted below
    print("Train Score: {}".format(dummy.score(X_train,y_train)))
    print(" Test Score: {}".format(dummy.score(X_test,y_test)))
    confusion = confusion_matrix(y_test, y_test_predicted)
    print(confusion)

    # zero_division=0 is passed throughout for consistency when a class is never predicted
    print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_test_predicted)))
    print('Precision: {:.2f}'.format(precision_score(y_test, y_test_predicted,zero_division=0)))
    print('Recall: {:.2f}'.format(recall_score(y_test, y_test_predicted,zero_division=0)))
    print('F1: {:.2f}'.format(f1_score(y_test, y_test_predicted,zero_division=0)))
    print(classification_report(y_test, y_test_predicted, target_names=['False', 'True'],zero_division=0))

    return dummy,df2
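
One detail of the function above is worth a quick check: the Train/Test scores and the accuracy figure come from separate predict() calls, so for the random strategies they need not agree. The sketch below is only an illustration on made-up labels; it assumes that passing an integer random_state (a DummyClassifier parameter the function above does not set) makes repeated calls reproducible.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (an assumption for illustration): DummyClassifier ignores the
# feature values, only the labels matter.
X = np.zeros((200, 2))
y = np.array([1] * 100 + [0] * 100)

# Without random_state, each predict() call draws fresh random labels.
unseeded = DummyClassifier(strategy="uniform").fit(X, y)
first = unseeded.predict(X)
second = unseeded.predict(X)
print("Unseeded calls agree:", np.array_equal(first, second))

# With an integer random_state, repeated calls give the same labels.
seeded = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
seeded_first = seeded.predict(X)
seeded_second = seeded.predict(X)
print("Seeded calls agree:  ", np.array_equal(seeded_first, seeded_second))
```

If reproducible numbers matter, passing random_state to DummyClassifier inside dummy_predictions would be one way to get them.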

Main Strategies

Here we will focus on the three key strategies used to generate a dummy model:

Most Frequent: Simply predict the most frequent class
Stratified: Predict classes based on their occurrence frequency
Uniform: Predict all classes ’evenly’ regardless of their occurrence frequency
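
As a rough illustration of the three strategies listed above, the following sketch fits each one on made-up labels (90% True, 10% False) and reports how often it predicts True. The toy arrays and the fixed random_state are assumptions for the sake of a repeatable example, not part of the article's helper functions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 90% True (1), 10% False (0).
y = np.array([1] * 900 + [0] * 100)
X = np.zeros((len(y), 1))  # feature values are ignored by DummyClassifier

true_share = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    predictions = dummy.predict(X)
    # Fraction of predictions that are True (label 1)
    true_share[strategy] = predictions.mean()
    print(f"{strategy:>13}: share of True predictions = {true_share[strategy]:.2f}")
```

One would expect a share of exactly 1 for most_frequent, roughly 0.9 for stratified and roughly 0.5 for uniform.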

Most Frequent

In this case, “the predict method always returns the most frequent class label in the observed y argument passed to fit.”

In the example below, 90% of the values are True (label = 1), therefore this is exactly what the dummy classifier predicts.

df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'most_frequent')
visualise(df1,df2)

Train Score: 0.912
 Test Score: 0.912
[[  0  22]
 [  0 228]]
Accuracy: 0.91
Precision: 0.91
Recall: 1.00
F1: 0.95
              precision    recall  f1-score   support

       False       0.00      0.00      0.00        22
        True       0.91      1.00      0.95       228

    accuracy                           0.91       250
   macro avg       0.46      0.50      0.48       250
weighted avg       0.83      0.91      0.87       250

The high score shows why data sets in which one class is heavily over-represented make it hard to judge a model’s performance by accuracy alone.

In other words, for a class (True) that appears 90% of the time, a dummy model which simply spits out a fixed value is pretty much 90% accurate. This means that a proper model (say, SVM) needs to do much better than this.
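
Put differently, the accuracy of the most_frequent strategy is just the share of the majority class in the data it is scored on. A minimal sketch, assuming toy labels drawn with NumPy's default_rng rather than the article's generate_data_set:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.9).astype(int)  # roughly 90% True
X = np.zeros((len(y), 1))                 # features are ignored by the dummy

# Share of the most common label, computed directly from the label counts
majority_share = np.bincount(y).max() / len(y)

# Accuracy of the most_frequent dummy on the same labels
dummy_accuracy = DummyClassifier(strategy="most_frequent").fit(X, y).score(X, y)

print(f"Share of majority class: {majority_share:.3f}")
print(f"most_frequent accuracy:  {dummy_accuracy:.3f}")
```

Both numbers should coincide, which is why this share is a natural floor for any real model.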

For further intuition, the example below inverts the occurrence ratio.

df1 = generate_data_set(1000,0.1) # 10% True 90% False
model,df2=dummy_predictions(df1,'most_frequent')
visualise(df1,df2)

Train Score: 0.8986666666666666
 Test Score: 0.876
[[219   0]
 [ 31   0]]
Accuracy: 0.88
Precision: 0.00
Recall: 0.00
F1: 0.00
              precision    recall  f1-score   support

       False       0.88      1.00      0.93       219
        True       0.00      0.00      0.00        31

    accuracy                           0.88       250
   macro avg       0.44      0.50      0.47       250
weighted avg       0.77      0.88      0.82       250

Stratified

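In this case, predict_proba randomly draws a one-hot vector for each input according to the class frequencies seen during fit, and “the predict method returns the class label which got probability one in the one-hot vector of predict_proba.”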
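
To make the one-hot wording concrete, here is a small sketch on made-up labels (90 True, 10 False). The fixed random_state is an assumption added so that predict_proba and predict draw the same random vectors.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 90 True (1), 10 False (0).
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((len(y), 1))  # features are ignored by the dummy

dummy = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)

# Each row is a one-hot vector: a single 1.0 in the column of the sampled class.
proba = dummy.predict_proba(X[:5])
# predict() returns the class whose column holds the 1.0.
labels = dummy.predict(X[:5])

print(proba)
print(labels)
```

The columns follow the order of dummy.classes_, so a 1.0 in the second column corresponds to the label 1 (True).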

In our example, this means that the probability of predicting True matches the share of True labels in the training data. Therefore, for an even split between True and False classes, the accuracy is roughly 50%:

df1 = generate_data_set(1000,0.5) # 50% True 50% False
model,df2=dummy_predictions(df1,'stratified')
visualise(df1,df2)

Train Score: 0.5413333333333333
 Test Score: 0.476
[[63 58]
 [61 68]]
Accuracy: 0.52
Precision: 0.54
Recall: 0.53
F1: 0.53
              precision    recall  f1-score   support

       False       0.51      0.52      0.51       121
        True       0.54      0.53      0.53       129

    accuracy                           0.52       250
   macro avg       0.52      0.52      0.52       250
weighted avg       0.52      0.52      0.52       250

However, if we increase the proportion of True values to 90%, we find that the accuracy rises as well. Since both the labels and the predictions are True about 90% of the time, and the two are independent, the expected accuracy is 0.9 × 0.9 + 0.1 × 0.1 = 0.82.

This shows, again, how problematic a data set with an unbalanced number of class instances is. In this case, a ‘proper’ model with an accuracy of around 82% would be no better than a random dummy generator like this.

df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'stratified')
visualise(df1,df2)

Train Score: 0.8133333333333334
 Test Score: 0.852
[[  3  15]
 [ 26 206]]
Accuracy: 0.84
Precision: 0.93
Recall: 0.89
F1: 0.91
              precision    recall  f1-score   support

       False       0.10      0.17      0.13        18
        True       0.93      0.89      0.91       232

    accuracy                           0.84       250
   macro avg       0.52      0.53      0.52       250
weighted avg       0.87      0.84      0.85       250
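
The accuracy of 0.84 is in line with what a short calculation suggests. If a fraction p of the labels is True and the dummy independently predicts True with probability p, a prediction is correct with probability p² + (1 − p)². The sketch below compares that formula with a simulation; the helper name expected_stratified_accuracy and the toy labels are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def expected_stratified_accuracy(p):
    """Chance that an independent p-weighted guess matches a p-weighted label."""
    return p ** 2 + (1 - p) ** 2

# Hypothetical toy labels: exactly 90% True, 10% False.
y = np.array([1] * 9000 + [0] * 1000)
X = np.zeros((len(y), 1))  # features are ignored by the dummy

dummy = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)
simulated = accuracy_score(y, dummy.predict(X))

print(f"Expected:  {expected_stratified_accuracy(0.9):.3f}")
print(f"Simulated: {simulated:.3f}")
```

For p = 0.5 the same formula gives 0.5, matching the even-split example earlier in this section.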

Uniform

The uniform strategy “generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability”.

This means that each class is treated equally regardless of the number of occurrences. For a 50/50 distribution, the stratified and uniform strategies are equivalent:

df1 = generate_data_set(1000,0.5) # 50% True 50% False
model,df2=dummy_predictions(df1,'uniform')
visualise(df1,df2)

Train Score: 0.508
 Test Score: 0.464
[[58 67]
 [67 58]]
Accuracy: 0.46
Precision: 0.46
Recall: 0.46
F1: 0.46
              precision    recall  f1-score   support

       False       0.46      0.46      0.46       125
        True       0.46      0.46      0.46       125

    accuracy                           0.46       250
   macro avg       0.46      0.46      0.46       250
weighted avg       0.46      0.46      0.46       250

However, if we change the proportion of True values to 90%, we see that the dummy model still predicts each class on an equal basis:

df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'uniform')
visualise(df1,df2)

Train Score: 0.5333333333333333
 Test Score: 0.444
[[ 16  19]
 [100 115]]
Accuracy: 0.52
Precision: 0.86
Recall: 0.53
F1: 0.66
              precision    recall  f1-score   support

       False       0.14      0.46      0.21        35
        True       0.86      0.53      0.66       215

    accuracy                           0.52       250
   macro avg       0.50      0.50      0.44       250
weighted avg       0.76      0.52      0.60       250

Because the predictions are random, repeating the experiment on a freshly generated data set gives different numbers but the same picture:

df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'uniform')
visualise(df1,df2)

Train Score: 0.516
 Test Score: 0.532
[[  5  15]
 [123 107]]
Accuracy: 0.45
Precision: 0.88
Recall: 0.47
F1: 0.61
              precision    recall  f1-score   support

       False       0.04      0.25      0.07        20
        True       0.88      0.47      0.61       230

    accuracy                           0.45       250
   macro avg       0.46      0.36      0.34       250
weighted avg       0.81      0.45      0.56       250
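
With two classes, a uniform guess matches the true label half of the time whatever the class balance, so the expected accuracy stays near 0.5. A small simulation along these lines, using made-up labels and a fixed random_state (both assumptions for the sake of a repeatable sketch):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

n_samples = 10000
X = np.zeros((n_samples, 1))  # features are ignored by the dummy

uniform_accuracy = {}
for true_share in (0.5, 0.9):
    # Hypothetical labels with the requested share of True values
    n_true = round(true_share * n_samples)
    y = np.array([1] * n_true + [0] * (n_samples - n_true))

    dummy = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
    uniform_accuracy[true_share] = accuracy_score(y, dummy.predict(X))
    print(f"{true_share:.0%} True: accuracy = {uniform_accuracy[true_share]:.3f}")
```

The run-to-run spread in the article's outputs (0.46, 0.52, 0.45) is what one would expect from only 250 test samples.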

Other Strategies

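prior is similar to most_frequent in that predict always returns the most frequent class, but predict_proba() returns the empirical class distribution rather than a one-hot vector.
constant always predicts a constant label that is provided by the user through the constant parameter.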
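
The difference between these strategies is easiest to see side by side. The sketch below uses made-up labels (90 True, 10 False); the variable names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: 90 True (1), 10 False (0).
y = np.array([1] * 90 + [0] * 10)
X = np.zeros((len(y), 1))  # features are ignored by the dummy

most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
prior = DummyClassifier(strategy="prior").fit(X, y)
# The label to predict is passed through the `constant` parameter.
constant = DummyClassifier(strategy="constant", constant=0).fit(X, y)

mf_proba = most_frequent.predict_proba(X[:1])   # one-hot on the majority class
prior_proba = prior.predict_proba(X[:1])        # empirical class distribution
constant_labels = constant.predict(X[:3])       # always the user-supplied label

print(mf_proba)         # [[0. 1.]]
print(prior_proba)      # [[0.1 0.9]]
print(constant_labels)  # [0 0 0]
```

Both most_frequent and prior predict label 1 here; only their probability outputs differ, which matters for metrics computed from predict_proba such as log loss.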

Conclusion

Dummy models are commonly used as a benchmark, the main motivation being to answer the question “is my model better than random guessing?” Scikit-Learn’s choice of strategies allows finer control over the predictions produced by the dummy model. Predicting the majority class, for example, is useful in situations where the data set is unbalanced.
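
In practice the comparison might look like the sketch below. It rebuilds data with the same labelling rule as generate_data_set(1000, 0.9) and uses a DecisionTreeClassifier as a stand-in for the ‘proper’ model; the choice of model is an assumption made for illustration, and any other classifier could take its place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 2))          # columns play the role of f1 and f2
y = (X[:, 1] < 0.9).astype(int)    # True when f2 < 0.9, roughly 90% True

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: always predict the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# Stand-in for a real model (an assumption, not the article's choice)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

baseline_score = baseline.score(X_test, y_test)
model_score = model.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_score:.3f}")
print(f"Model accuracy:    {model_score:.3f}")
```

Only the gap between the two scores says anything about the model; a figure around 0.9 on its own would not.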
