Using Scikit-Learn's Multi-layer Perceptron Classifier (MLP) with Small Data.

Introduction
In my previous blog post, I suggested that Decision Tree Ensembles, and in particular XGBoost, could offer an effective—and faster—alternative to deep learning for certain data sets.
In this lab we take Scikit-Learn’s humble Multi-layer Perceptron Classifier and push it to come up with a predictive model for the same data set that we threw at XGBoost in my last post.
Imports and Boilerplate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import timeit
import warnings
import sys
import os

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"
Data Set
We create a synthetic data set consisting of red, blue, green, and purple dots. Do not worry too much about the code below; jump straight to the plot and we’ll continue the conversation.
np.random.seed(0)
samples = 500

# Red and blue dots follow two overlapping, noisy sine waves
red_x = np.linspace(0,1.0,samples)
red_y = [ .85+np.sin(x*10)* .15-(np.random.rand() * 0.1) for x in red_x ]

blue_x = np.linspace(0,1.0,samples)
blue_y = [ .75+np.sin(x*10)* .15-(np.random.rand() * 0.1) for x in blue_x ]

# Green and purple dots fill the lower-left and lower-right regions
green_x = np.linspace(0,.5,samples)
green_y = [ np.random.rand()*.5 for v in green_x ]

purple_x = np.linspace(.5,1.0,samples)
purple_y = [ np.random.rand()*.5 for v in purple_x ]

# Features are (x, y) coordinates; labels are 0=red, 1=blue, 2=green, 3=purple
X = np.concatenate((
    np.array( [ [x,y] for (x,y) in zip(red_x,red_y)]),
    np.array( [ [x,y] for (x,y) in zip(blue_x,blue_y)]),
    np.array( [ [x,y] for (x,y) in zip(green_x,green_y)]),
    np.array( [ [x,y] for (x,y) in zip(purple_x,purple_y)]),
    ))
y = np.concatenate((
    np.repeat(0,len(red_x)),
    np.repeat(1,len(blue_x)),
    np.repeat(2,len(green_x)),
    np.repeat(3,len(purple_x))
    ))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot(X_, y_):
    plt.rcParams['figure.figsize'] = [4, 4]
    plt.rcParams['figure.dpi'] = 100
    for i, v in enumerate(['red','blue','green','purple']):
        Xs = [ t[0][0] for t in zip(X_,y_) if t[1] == i ]
        ys = [ t[0][1] for t in zip(X_,y_) if t[1] == i ]
        plt.scatter(Xs,ys, color=v, s=1)

plot(X,y)
Note above that while the green and purple dots gather perfectly in the plot’s corners, the red and blue dots have an irregular, wave-like pattern.
The lower portion of this data set is purposely created to make tree-based models (like XGBoost) happy, given that the green and purple dots live within well-demarcated regions which can be predicted by navigating a simple split point hierarchy such as
        Y > 0.5
     YES       NO
     ...     X > 0.5
            NO     YES
          Green   Purple
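Written as code, such a hierarchy is just a pair of nested comparisons. The sketch below is only an illustration: the function name `classify_by_splits` is made up, and the thresholds of 0.5 are assumptions taken from the data generation code, where the green and purple dots are drawn with y below 0.5 and x on either side of 0.5. A fitted tree is not guaranteed to choose exactly these values, and the upper branch (red versus blue) is left unresolved because it would need many more splits.

```python
def classify_by_splits(x, y):
    """Hypothetical hand-written split hierarchy for the lower half of the plot."""
    if y > 0.5:
        # Red and blue dots live here; telling them apart needs further splits.
        return "red or blue"
    # Below y = 0.5, a single split on x separates the two remaining classes.
    return "purple" if x > 0.5 else "green"

print(classify_by_splits(0.2, 0.1))
print(classify_by_splits(0.8, 0.3))
print(classify_by_splits(0.5, 0.9))
```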
However, a Multi-Layer Perceptron (MLP) does not learn specific split points. Instead, it applies a so-called activation function to each of its perceptrons, and these functions, while not necessarily linear, behave continuously rather than discretely.
We will use one hidden layer and observe the effect of each activation function with small layer sizes of 1, 2, and 4, to understand how the size affects the shape of the predictive model.
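To make the role of the hidden layer size concrete, here is a minimal NumPy sketch of the forward pass of a network with two inputs, one hidden layer, and four output classes. The weights are random placeholders rather than fitted values, the softmax output is an assumption about how a multi-class network typically produces probabilities, and the helper names (`forward`, `count_parameters`) are made up for this illustration. It does not reproduce Scikit-Learn's internals.

```python
import numpy as np

def forward(X, W1, b1, W2, b2, activation=np.tanh):
    """Forward pass: inputs -> hidden layer -> softmax class probabilities."""
    hidden = activation(X @ W1 + b1)                      # (n_samples, hidden_size)
    logits = hidden @ W2 + b2                             # (n_samples, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def count_parameters(n_inputs, hidden_size, n_classes):
    """Weights plus biases for a network with a single hidden layer."""
    hidden_params = n_inputs * hidden_size + hidden_size
    output_params = hidden_size * n_classes + n_classes
    return hidden_params + output_params

rng = np.random.default_rng(0)
points = np.array([[0.25, 0.25], [0.75, 0.25], [0.5, 0.8]])

for hidden_size in [1, 2, 4]:
    # Placeholder weights; a trained model would have learned these values.
    W1 = rng.normal(size=(2, hidden_size))
    b1 = rng.normal(size=hidden_size)
    W2 = rng.normal(size=(hidden_size, 4))
    b2 = rng.normal(size=4)
    probabilities = forward(points, W1, b1, W2, b2)
    print(hidden_size, count_parameters(2, hidden_size, 4), probabilities.shape)
```

With hidden sizes of 1, 2, and 4 this network has 11, 18, and 32 trainable parameters respectively. Every step is a continuous function of the inputs, so nudging a point changes the class probabilities gradually instead of flipping them at an axis-aligned cut.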
Visualisation Code
The code below helps compare three different models, showing the predicted ‘regions’ for each class (in a light colour), versus the coordinates of the actual dots in the test set. Understanding the code below is not important. What matters is our discussion further down, so feel free to skip ahead.
def fit_visualise(models,names):

    plt.rcParams['figure.figsize'] = [8, 4]
    plt.rcParams['figure.dpi'] = 100
    plt.subplots_adjust(wspace=0.5)

    # Top row: predicted regions (light colours) with the test dots on top
    for i,m in enumerate(models):
        m.fit(X_train,y_train)

        plt.subplot(2, 3, i+1)
        plt.title("%s\nAccuracy = (%.2f)" %
                  (names[i], accuracy_score(y_test, m.predict(X_test))))
        plt.xticks(np.linspace(0,1,11),fontsize="xx-small")
        if i == 0:
            plt.yticks(np.linspace(0,1,11),fontsize="xx-small")
        else:
            plt.yticks([])

        # Predict the class of every point on a 40x40 grid
        classes = [ { 'colour' : v, 'x' : [], 'y' : [] }
                    for v in ['#ff8181','#8181ff','#81a781','#a782a7'] ]
        for gx in np.linspace(0,1,40):
            for gy in np.linspace(0,1,40):
                c = classes[m.predict([[gx,gy]])[0]]
                c['x'].append(gx)
                c['y'].append(gy)

        for c in classes:
            plt.scatter(c['x'],
                        c['y'],
                        color=c['colour'],
                        edgecolors='none',
                        marker='s',
                        s=10)

        # Overlay the actual test dots in their true colours
        for k, v in enumerate(['red','blue','green','purple']):
            Xs = [ t[0][0] for t in zip(X_test,y_test) if t[1] == k ]
            ys = [ t[0][1] for t in zip(X_test,y_test) if t[1] == k ]
            plt.scatter(Xs,ys, color=v, s=1)

    # Bottom row: row-normalised confusion matrix for each model
    for i,m in enumerate(models):
        plt.subplot(2, 3, i+4)
        labels = ["r","b","g","p"]
        confusion = confusion_matrix(y_test, m.predict(X_test), normalize='true')
        dfc = pd.DataFrame(confusion,
                           index = labels,
                           columns = labels)
        sns.heatmap(dfc, annot=True, cmap="YlGnBu", fmt='.2f', annot_kws={"size":8})
        plt.ylabel('Actual')
        plt.xlabel('Predicted')
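The grid loop in the function above queries the model one point at a time. A possible alternative, sketched below, builds the whole 40 x 40 grid with `np.meshgrid` and asks for all predictions in a single call. `QuadrantModel` and `predict_grid` are stand-ins written for this example so that the snippet runs on its own; any fitted classifier exposing `predict` could be passed instead.

```python
import numpy as np

class QuadrantModel:
    """Hypothetical stand-in classifier: labels 0..3 by quadrant of the unit square."""
    def predict(self, X):
        X = np.asarray(X)
        return (X[:, 0] > 0.5).astype(int) + 2 * (X[:, 1] > 0.5).astype(int)

def predict_grid(model, steps=40):
    """Return the grid coordinates and one predicted label per grid point."""
    axis = np.linspace(0, 1, steps)
    gx, gy = np.meshgrid(axis, axis)
    grid = np.column_stack([gx.ravel(), gy.ravel()])   # (steps * steps, 2)
    return grid, model.predict(grid)

grid, labels = predict_grid(QuadrantModel())
print(grid.shape, labels.shape)

# Points predicted as class 0, ready to be passed to plt.scatter
region_0 = grid[labels == 0]
print(region_0.shape)
```

A boolean mask such as `labels == 0` then replaces the per-class lists of x and y coordinates.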
Activation Function Safari
Here we will see the nature of each activation function, as well as its effect when using 1, 2, and 4 perceptrons.
Rectified Linear Unit Function (ReLU)
This function, which in Python may be expressed as y = np.maximum(0, x), returns 0 for any negative value and the value itself otherwise.
for x in [-1000,-100,-10,-5,-1,-0.5,0,0.5,1,10,100,1000]:
    print("max(0,%2.2f) = %2.2f" % (x,np.maximum(0,x)))
max(0,-1000.00) = 0.00
max(0,-100.00) = 0.00
max(0,-10.00) = 0.00
max(0,-5.00) = 0.00
max(0,-1.00) = 0.00
max(0,-0.50) = 0.00
max(0,0.00) = 0.00
max(0,0.50) = 0.50
max(0,1.00) = 1.00
max(0,10.00) = 10.00
max(0,100.00) = 100.00
max(0,1000.00) = 1000.00
x_values = np.linspace(-1, 1, 100)
y_values = np.maximum(0, x_values)

plt.plot(x_values, y_values, label = 'relu')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.legend()
plt.show()
models = []
titles = []

for i in [1,2,4]:
    models.append(
        MLPClassifier(random_state = 0, hidden_layer_sizes = [i], activation="relu", solver='lbfgs')
    )
    titles.append(
        "1 Layer, size = %d" % i
    )

fit_visualise(models,titles)
The value of hidden_layer_sizes must be at least 2 for relu (and for all the other activation functions) to predict the green and purple dots correctly. However, with a size of 4, this particular activation function does not do any better.
Hyperbolic Tangent (TanH)
This function, which in Python may be expressed as y = np.tanh(x), returns a value between -1 and 1 for any value of x, no matter how large or small it may be.
for x in [-1000,-100,-10,-5,-1,-0.5,0,0.5,1,10,100,1000]:
    print("tanh(%2.2f) = %2.2f" % (x,np.tanh(x)))
tanh(-1000.00) = -1.00
tanh(-100.00) = -1.00
tanh(-10.00) = -1.00
tanh(-5.00) = -1.00
tanh(-1.00) = -0.76
tanh(-0.50) = -0.46
tanh(0.00) = 0.00
tanh(0.50) = 0.46
tanh(1.00) = 0.76
tanh(10.00) = 1.00
tanh(100.00) = 1.00
tanh(1000.00) = 1.00
x_values = np.linspace(-5, 5, 100)
y_values = np.tanh(x_values)

plt.plot(x_values, y_values, label = 'tanh')
plt.xticks([-5,0,5])
plt.yticks([-1,0,1])
plt.show()
models = []
titles = []

for i in [1,2,4]:
    models.append(
        MLPClassifier(random_state = 0, hidden_layer_sizes = [i], activation="tanh", solver='lbfgs')
    )
    titles.append(
        "1 Layer, size = %d" % i
    )

fit_visualise(models,titles)
In the case of tanh, neither a size of 2 nor a size of 4 achieves a perfect score for the purple dots (you can see above that some fall in the green region). However, with a size of 4, the model fits the red and blue dots quite impressively, which results in an overall accuracy score of 0.92. This is the best activation function for this data set.
Sigmoid Function (Logistic)
This function, which in Python may be expressed as y = 1 / (1 + np.exp(-x)), returns a value between 0 and 1 for any value of x, no matter how large or small it may be.
for x in [-1000,-100,-10,-5,-1,-0.5,0,0.5,1,10,100,1000]:
    print("sigmoid(%2.2f) = %2.2f" % (x,1 / (1 + np.exp(-x))))
sigmoid(-1000.00) = 0.00
sigmoid(-100.00) = 0.00
sigmoid(-10.00) = 0.00
sigmoid(-5.00) = 0.01
sigmoid(-1.00) = 0.27
sigmoid(-0.50) = 0.38
sigmoid(0.00) = 0.50
sigmoid(0.50) = 0.62
sigmoid(1.00) = 0.73
sigmoid(10.00) = 1.00
sigmoid(100.00) = 1.00
sigmoid(1000.00) = 1.00
x_values = np.linspace(-5, 5, 100)
y_values = 1 / (1 + np.exp(-x_values))

plt.plot(x_values, y_values, label = 'sigmoid')
plt.xticks([-5,0,5])
plt.yticks([0,1])
plt.show()
models = []
titles = []

for i in [1,2,4]:
    models.append(
        MLPClassifier(random_state = 0, hidden_layer_sizes = [i], activation="logistic", solver='lbfgs')
    )
    titles.append(
        "1 Layer, size = %d" % i
    )

fit_visualise(models,titles)
The logistic activation function provides a better result with a size of 2 than with a size of 4, but it does not do a better job than tanh even when given larger layer sizes.
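One practical detail about the expression `1 / (1 + np.exp(-x))`: for large negative inputs such as -1000, `np.exp(-x)` overflows to infinity. The result still comes out as 0, but NumPy would normally emit an overflow warning, which is hidden in this post by the warning filter set up at the start. A common workaround, sketched here with a made-up helper name `stable_sigmoid`, is to use an algebraically equivalent form that only ever exponentiates a non-positive number. The last line also checks the known identity tanh(x) = 2 * sigmoid(2x) - 1, which relates the two functions compared above.

```python
import numpy as np

def stable_sigmoid(x):
    """Sigmoid that avoids overflow by never exponentiating a positive number."""
    x = np.asarray(x, dtype=float)
    z = np.exp(-np.abs(x))                      # always in the range [0, 1]
    return np.where(x >= 0, 1 / (1 + z), z / (1 + z))

xs = np.array([-1000, -5, -0.5, 0, 0.5, 5, 1000])
print(np.round(stable_sigmoid(xs), 2))

# tanh is a shifted and rescaled sigmoid
print(np.allclose(np.tanh(xs), 2 * stable_sigmoid(2 * xs) - 1))
```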
Optimal Activation Function and Single Layer Size
In the previous section, we tried to discover the ‘shape’ of the MLP’s predictive model for each activation function, where the single hidden layer had sizes {1, 2, 4}.
Below, we use GridSearchCV to find the best combination of activation function and size for a single hidden layer. We do not iterate through different regularisation hyperparameters.
parameters = {'activation' : ['relu','tanh','logistic'],
              'hidden_layer_sizes' : [ [s] for s in range(1,50) ],
}
model = MLPClassifier(random_state = 0, solver='lbfgs')
clf = GridSearchCV(model, parameters)
clf.fit(X_train,y_train)
best_model = clf.best_estimator_
print(" Activation Function: %s" % clf.best_estimator_.activation)
print(" Hidden Layer Sizes: %s" % clf.best_estimator_.hidden_layer_sizes)
print(" Alpha: %f" % clf.best_estimator_.alpha)
print(" Accuracy: %f" % accuracy_score(y_test,best_model.predict(X_test)))
fit_visualise([best_model],[""])
Activation Function: tanh
Hidden Layer Sizes: [40]
Alpha: 0.000100
Accuracy: 0.994000
Here we have evidence that tanh works well with our data set, and that a single hidden layer of size 40 provides more than acceptable accuracy levels.
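To get a feel for the amount of work behind this search, the candidate combinations can be enumerated with plain Python. The sketch assumes that `GridSearchCV` is left at its defaults of 5-fold cross-validation and a final refit on the whole training set, as in the call above; the variable names are illustrative.

```python
from itertools import product

activations = ['relu', 'tanh', 'logistic']
hidden_layer_sizes = [[s] for s in range(1, 50)]   # sizes 1 to 49 inclusive

candidates = list(product(activations, hidden_layer_sizes))
folds = 5                                          # assumed GridSearchCV default

total_fits = len(candidates) * folds + 1           # +1 for the final refit
print("Candidates: %d" % len(candidates))
print("Model fits: %d" % total_fits)
```

Under these assumptions the search trains several hundred models, a cost worth keeping in mind whenever the tuned MLP is compared with an untuned alternative.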
XGBoost vs MLP
We will now compare the training performance of XGBoost and MLP.
Please note that whereas XGBoost does not take any non-default argument, MLP takes the best parameters found by GridSearchCV. In other words, hidden_layer_sizes = [40], and activation='tanh'.
xg_model = XGBClassifier(random_state=0)
fit_visualise([xg_model,best_model],["XGBoost","MLP"])
Although MLP appears to be the ‘superior’ choice for the small data set at hand, given its edge in accuracy over the gradient boosted model, we have to note that the MLP model seems more arbitrary: it expects green and purple dots in regions where blue dots are more likely, at least as far as pure human intuition goes.
# Performance comparison
%timeit -n 10 XGBClassifier(random_state=0).fit(X_train,y_train)
%timeit -n 10 best_model.fit(X_train,y_train)
487 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
257 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Above we can see that the MLP model is roughly two times faster to train, with the caveat that we spent significant time searching for the best model first. In the case of XGBoost, we have just taken the defaults without trying to optimise any hyperparameter.
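The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a roughly equivalent measurement can be taken with the standard `timeit` module. In the sketch below, `train` is a placeholder workload standing in for a call such as `best_model.fit(X_train, y_train)`, and the layout of 7 runs of 10 loops mirrors the figures reported above. The way the spread is computed here may differ slightly from what IPython reports.

```python
import statistics
import timeit

def train():
    """Placeholder workload; replace the body with the model's fit call."""
    return sum(i * i for i in range(10_000))

loops, runs = 10, 7
totals = timeit.repeat(train, number=loops, repeat=runs)   # seconds per run
per_loop_ms = [t / loops * 1000 for t in totals]

print("%.3f ms ± %.3f ms per loop (mean ± std. dev. of %d runs, %d loops each)"
      % (statistics.mean(per_loop_ms), statistics.stdev(per_loop_ms), runs, loops))
```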
Conclusion
Although techniques such as MLP are seldom used (or recommended) for small data sets, for the sample provided in this tutorial, the results were not disappointing.
As noted in the section comparing XGBoost with Scikit-Learn’s MLP Classifier, we should not immediately jump to the conclusion that MLP does a better job given its superior score.
Nor is this experiment a fair benchmark, in the sense that we haven’t done any hyperparameter searching for XGBoost, and in the case of MLP itself we have only iterated through three activation functions and experimented with 49 sizes for a single layer.
What this experiment shows, though, is that MLP can provide outstanding results, albeit from a predictive model that makes less sense than one produced by a traditional approach (such as decision trees), as seen in the odd regions where it would have predicted green and purple dots.