Dummy Models with Scikit-Learn

Introduction
Obligatory Imports and Boilerplate
import re
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.dummy import DummyClassifier

# Show every warning each time it is triggered
warnings.filterwarnings('always')
Sample Data Set
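def generate_data_set(samples, ratio):
    # Two features drawn uniformly at random from [0, 1)
    f1 = np.random.rand(samples)
    f2 = np.random.rand(samples)

    # A sample is True (1) when f2 falls below ratio, so roughly
    # ratio * samples of the rows end up labelled True
    label = [1 if x < ratio else 0 for x in f2]

    df = pd.DataFrame({"f1": f1,
                       "f2": f2,
                       "label": label})

    return df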
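Because generate_data_set draws from NumPy's global random state, every call returns different data, which is why the scores further down vary from run to run. The sketch below is a hypothetical seeded variant (the name generate_data_set_seeded and its seed parameter are assumptions, not part of the original code) that can be used when reproducible numbers are wanted.

```python
import numpy as np
import pandas as pd

def generate_data_set_seeded(samples, ratio, seed=0):
    # Hypothetical variant of generate_data_set: same labelling rule,
    # but a seeded generator makes repeated calls return identical data.
    rng = np.random.default_rng(seed)
    f1 = rng.random(samples)
    f2 = rng.random(samples)
    # f2 is uniform on [0, 1), so P(f2 < ratio) = ratio
    label = (f2 < ratio).astype(int)
    return pd.DataFrame({"f1": f1, "f2": f2, "label": label})

df = generate_data_set_seeded(1000, 0.9)
# Fraction of True labels; expected to be close to 0.9
print(df["label"].mean())
```

Calling the function twice with the same seed yields the same DataFrame, so any difference between runs then comes from the classifier alone.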
Visualisation
def decoration():
    # Shared axis ticks and legend position for both subplots
    plt.xticks([0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.yticks([0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.legend(loc='upper left')

def visualise(df1=None, df2=None):

    plt.rcParams['figure.figsize'] = [8, 4]
    plt.rcParams['figure.dpi'] = 100

    # Left panel: actual labels
    plt.subplot(1, 2, 1)

    df1_false = df1[df1['label'] == 0]
    df1_true = df1[df1['label'] == 1]
    plt.scatter(df1_false['f1'], df1_false['f2'], label='Actual False', s=5, color='r', alpha=0.3)
    plt.scatter(df1_true['f1'], df1_true['f2'], label='Actual True', s=5, color='b', alpha=0.3)
    decoration()

    # Right panel: predicted labels (only drawn when df2 is given)
    plt.subplot(1, 2, 2)
    if df2 is not None:
        df2_false = df2[df2['label'] == 0]
        df2_true = df2[df2['label'] == 1]
        plt.scatter(df2_false['f1'], df2_false['f2'], color='red', s=5, label="Predicted False", alpha=0.3)
        plt.scatter(df2_true['f1'], df2_true['f2'], color='blue', s=5, label="Predicted True", alpha=0.3)
        decoration()
Dummy Prediction Scores
def dummy_predictions(df, strategy):

    # Split features and labels into train and test sets (75% / 25% by default)
    X_train, X_test, y_train, y_test = train_test_split(
        df[['f1', 'f2']].values, df['label'].values, random_state=0)

    dummy = DummyClassifier(strategy=strategy).fit(X_train, y_train)

    # Build a regular 20 x 20 grid over the feature space and predict a label for each point
    f1 = []
    f2 = []
    for m in np.linspace(0, 1, 20):
        for n in np.linspace(0, 1, 20):
            f1.append(m)
            f2.append(n)
    df2 = pd.DataFrame({"f1": f1,
                        "f2": f2})
    df2['label'] = dummy.predict(df2[['f1', 'f2']].values)

    y_test_predicted = dummy.predict(X_test)

    # Note: score() calls predict() again, so for the random strategies
    # (stratified, uniform) it can differ from the metrics printed below
    print("Train Score: {}".format(dummy.score(X_train, y_train)))
    print(" Test Score: {}".format(dummy.score(X_test, y_test)))
    confusion = confusion_matrix(y_test, y_test_predicted)
    print(confusion)

    print('Accuracy: {:.2f}'.format(accuracy_score(y_test, y_test_predicted)))
    print('Precision: {:.2f}'.format(precision_score(y_test, y_test_predicted, zero_division=0)))
    print('Recall: {:.2f}'.format(recall_score(y_test, y_test_predicted)))
    print('F1: {:.2f}'.format(f1_score(y_test, y_test_predicted)))
    print(classification_report(y_test, y_test_predicted, target_names=['False', 'True'], zero_division=0))

    return dummy, df2
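In the outputs further down, the Test Score and Accuracy lines sometimes disagree for the stratified and uniform strategies. This is because each call to predict (including the one inside score) draws fresh random labels. The following is a minimal sketch, on made-up data, of how passing an integer random_state to DummyClassifier appears to make repeated calls agree; the variable names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Illustrative data: two uniform features, label is 1 when the second is below 0.5
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = (X[:, 1] < 0.5).astype(int)

# An integer random_state re-seeds the random draws on every predict() call
seeded = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)

first = seeded.predict(X)
second = seeded.predict(X)

# Both comparisons are expected to hold when the classifier is seeded
print(np.array_equal(first, second))
print(seeded.score(X, y) == accuracy_score(y, first))
```

Without random_state, the same two comparisons would usually fail for the random strategies, which matches the mismatched scores seen in this article.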
Main Strategies
Here we will focus on the three key strategies used to generate a dummy model:
- Most Frequent: Simply predict the most frequent class
- Stratified: Predict classes based on their occurrence frequency
- Uniform: Predict all classes ‘evenly’ regardless of their occurrence frequency
Most Frequent
In this case, “the predict method always returns the most frequent class label in the observed y argument passed to fit”.
In the example below, 90% of the values are True (label = 1), therefore this is exactly what the dummy classifier predicts.
df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'most_frequent')
visualise(df1,df2)
Train Score: 0.912
 Test Score: 0.912
[[  0  22]
 [  0 228]]
Accuracy: 0.91
Precision: 0.91
Recall: 1.00
F1: 0.95
              precision    recall  f1-score   support
       False       0.00      0.00      0.00        22
        True       0.91      1.00      0.95       228
    accuracy                           0.91       250
   macro avg       0.46      0.50      0.48       250
weighted avg       0.83      0.91      0.87       250
The high score shows how misleading accuracy can be when one class is heavily over-represented in the data set.
In other words, for a class (True) that appears 90% of the time, a dummy model that simply outputs a fixed value is roughly 90% accurate. This means that a proper model (say, an SVM) needs to do much better than this.
For further intuition, the example below inverts the occurrence ratio.
df1 = generate_data_set(1000,0.1) # 10% True 90% False
model,df2=dummy_predictions(df1,'most_frequent')
visualise(df1,df2)
Train Score: 0.8986666666666666
 Test Score: 0.876
[[219   0]
 [ 31   0]]
Accuracy: 0.88
Precision: 0.00
Recall: 0.00
F1: 0.00
              precision    recall  f1-score   support
       False       0.88      1.00      0.93       219
        True       0.00      0.00      0.00        31
    accuracy                           0.88       250
   macro avg       0.44      0.50      0.47       250
weighted avg       0.77      0.88      0.82       250
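To make the comparison with a proper model concrete, the sketch below fits a most_frequent dummy and an SVC on the same split and prints accuracy next to balanced accuracy. The data, the use of stratify=y and the choice of SVC with default settings are assumptions made for illustration; balanced accuracy (the mean of the per-class recalls) is swapped in here as one metric that is not inflated by class imbalance.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Illustrative data: roughly 90% True, mirroring the first example above
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
y = (X[:, 1] < 0.9).astype(int)

# stratify=y keeps the class proportions equal in both splits (an assumption
# made here; the article's own helper does not stratify)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = SVC().fit(X_train, y_train)

for name, clf in [("dummy", baseline), ("svc", model)]:
    pred = clf.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, pred),
          "balanced accuracy:", balanced_accuracy_score(y_test, pred))
```

The dummy's balanced accuracy is 0.5 because it gets every True sample right and every False sample wrong, which matches the 0.50 macro-average recall in the reports above.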
Stratified
In this case, “the predict method returns the class label which got probability one in the one-hot vector of predict_proba.”
In our example, this means that the probability of predicting True matches the proportion of True labels seen during fit. Therefore, for an even split between True and False classes, the accuracy is roughly 50%:
df1 = generate_data_set(1000,0.5) # 50% True 50% False
model,df2=dummy_predictions(df1,'stratified')
visualise(df1,df2)
Train Score: 0.5413333333333333
 Test Score: 0.476
[[63 58]
 [61 68]]
Accuracy: 0.52
Precision: 0.54
Recall: 0.53
F1: 0.53
              precision    recall  f1-score   support
       False       0.51      0.52      0.51       121
        True       0.54      0.53      0.53       129
    accuracy                           0.52       250
   macro avg       0.52      0.52      0.52       250
weighted avg       0.52      0.52      0.52       250
However, if we increase the proportion of True values to 90%, the accuracy rises as well, to roughly 80-85% in the run below (the expected value is 0.9 × 0.9 + 0.1 × 0.1 = 0.82).
This shows, again, how problematic a data set with an unbalanced number of class instances is. In this case, a ‘proper’ model with an accuracy in the low 80s would be no better than a random dummy generator like this.
df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'stratified')
visualise(df1,df2)
Train Score: 0.8133333333333334
 Test Score: 0.852
[[  3  15]
 [ 26 206]]
Accuracy: 0.84
Precision: 0.93
Recall: 0.89
F1: 0.91
              precision    recall  f1-score   support
       False       0.10      0.17      0.13        18
        True       0.93      0.89      0.91       232
    accuracy                           0.84       250
   macro avg       0.52      0.53      0.52       250
weighted avg       0.87      0.84      0.85       250
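The stratified accuracies above can be checked against a simple expectation. If a fraction p of the labels is True, the dummy predicts True with probability p and is then right with probability p, and likewise for False with 1 - p, giving an expected accuracy of p² + (1 - p)². The sketch below compares that formula with a simulation; the helper name expected_stratified_accuracy and the sample size are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

def expected_stratified_accuracy(p):
    # P(predict True) * P(actual True) + P(predict False) * P(actual False)
    return p ** 2 + (1 - p) ** 2

rng = np.random.default_rng(0)
n = 100_000
X = np.zeros((n, 1))                    # DummyClassifier ignores the features
y = (rng.random(n) < 0.9).astype(int)   # roughly 90% True

dummy = DummyClassifier(strategy="stratified", random_state=0).fit(X, y)

print("formula:   ", expected_stratified_accuracy(0.9))   # about 0.82
print("simulation:", dummy.score(X, y))
```

For p = 0.5 the same formula gives 0.5, in line with the roughly 50% accuracy of the balanced example.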
Uniform
The uniform strategy “generates predictions uniformly at random from the list of unique classes observed in y, i.e. each class has equal probability”.
This means that each class is treated equally regardless of the number of occurrences. For a 50/50 distribution, the stratified and uniform strategies are equivalent:
df1 = generate_data_set(1000,0.5) # 50% True 50% False
model,df2=dummy_predictions(df1,'uniform')
visualise(df1,df2)
Train Score: 0.508
 Test Score: 0.464
[[58 67]
 [67 58]]
Accuracy: 0.46
Precision: 0.46
Recall: 0.46
F1: 0.46
              precision    recall  f1-score   support
       False       0.46      0.46      0.46       125
        True       0.46      0.46      0.46       125
    accuracy                           0.46       250
   macro avg       0.46      0.46      0.46       250
weighted avg       0.46      0.46      0.46       250
However, if we change the proportion of True values to 90%, we see that the dummy model still predicts each class on an equal basis:
df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'uniform')
visualise(df1,df2)
Train Score: 0.5333333333333333
 Test Score: 0.444
[[ 16  19]
 [100 115]]
Accuracy: 0.52
Precision: 0.86
Recall: 0.53
F1: 0.66
              precision    recall  f1-score   support
       False       0.14      0.46      0.21        35
        True       0.86      0.53      0.66       215
    accuracy                           0.52       250
   macro avg       0.50      0.50      0.44       250
weighted avg       0.76      0.52      0.60       250
A second run on a freshly generated data set gives different numbers, since both the data and the predictions are random, but the predictions remain split roughly evenly between the two classes:
df1 = generate_data_set(1000,0.9) # 90% True 10% False
model,df2=dummy_predictions(df1,'uniform')
visualise(df1,df2)
Train Score: 0.516
 Test Score: 0.532
[[  5  15]
 [123 107]]
Accuracy: 0.45
Precision: 0.88
Recall: 0.47
F1: 0.61
              precision    recall  f1-score   support
       False       0.04      0.25      0.07        20
        True       0.88      0.47      0.61       230
    accuracy                           0.45       250
   macro avg       0.46      0.36      0.34       250
weighted avg       0.81      0.45      0.56       250
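With only 250 test samples the uniform results above are noisy. The sketch below repeats the experiment on a much larger, made-up sample to illustrate the underlying behaviour: with two classes, about half of the predictions are True and the accuracy sits near 0.5 whatever the class balance. The sample size and the results dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)
n = 100_000
X = np.zeros((n, 1))   # DummyClassifier ignores the features

results = {}
for ratio in (0.5, 0.9):
    y = (rng.random(n) < ratio).astype(int)
    dummy = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
    pred = dummy.predict(X)
    # Store (share of True predictions, accuracy) for this class balance
    results[ratio] = (pred.mean(), (pred == y).mean())

for ratio, (share_true, accuracy) in results.items():
    print(ratio, "predicted True:", share_true, "accuracy:", accuracy)
```

This is the key contrast with stratified, whose accuracy climbs with the imbalance, while uniform stays near 0.5 for a two-class problem.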
Other Strategies
- prior: similar to most_frequent, but with a different predict_proba() behaviour.
- constant: always predicts a constant label that is provided by the user.
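The difference between these strategies is easiest to see in predict_proba. The sketch below, on a tiny made-up label vector with nine True values and one False, compares most_frequent, prior and constant; the variable names are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

X = np.zeros((10, 1))              # features are ignored by DummyClassifier
y = np.array([1] * 9 + [0])        # 90% True, 10% False

most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)
prior = DummyClassifier(strategy="prior").fit(X, y)
# The constant label must be one of the classes seen during fit
constant = DummyClassifier(strategy="constant", constant=0).fit(X, y)

sample = X[:1]
# most_frequent: predicts 1, probabilities are one-hot -> [[0. 1.]]
print(most_frequent.predict(sample), most_frequent.predict_proba(sample))
# prior: also predicts 1, but probabilities follow the class prior -> [[0.1 0.9]]
print(prior.predict(sample), prior.predict_proba(sample))
# constant: always predicts the user-supplied label 0 -> [[1. 0.]]
print(constant.predict(sample), constant.predict_proba(sample))
```

The prior strategy is therefore the more natural baseline for metrics that consume probabilities, while most_frequent and prior are interchangeable for metrics based on predicted labels.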
Conclusion
Dummy models are commonly used as a benchmark for answering the question, “is my model better than random guessing?”. Scikit-Learn’s choice of strategies allows finer control over the predictions produced by the dummy model. Predicting the majority class, for example, is a useful baseline when the data set is unbalanced.