Using Scikit-Learn's Multi-layer Perceptron Classifier (MLP) with Small Data.


Introduction

In my previous blog post, I suggested that Decision Tree Ensembles, and in particular XGBoost, could offer an effective—and faster—alternative to deep learning for certain data sets.

In this lab we take Scikit-Learn’s humble Multi-layer Perceptron Classifier and push it to come up with a predictive model for the same data set that we threw at XGBoost in my last post.

Imports and Boilerplate

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import timeit
import warnings
import sys
import os

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

if not sys.warnoptions:
    warnings.simplefilter("ignore")
    os.environ["PYTHONWARNINGS"] = "ignore"

Data Set

We create a synthetic data set consisting of red, blue, green, and purple dots. Do not worry too much about the code below; jump straight to the plot and we’ll continue the conversation.

np.random.seed(0)
samples = 500

# Red and blue dots: two noisy, wave-like bands in the upper half of the plot.
red_x = np.linspace(0,1.0,samples)
red_y = [ .85+np.sin(x*10)* .15-(np.random.rand() * 0.1) for x in red_x ]

blue_x = np.linspace(0,1.0,samples)
blue_y = [ .75+np.sin(x*10)* .15-(np.random.rand() * 0.1) for x in blue_x ]

# Green and purple dots: two rectangular regions in the lower half of the plot.
green_x = np.linspace(0,.5,samples)
green_y = [ np.random.rand()*.5 for v in green_x ]

purple_x = np.linspace(.5,1.0,samples)
purple_y = [ np.random.rand()*.5 for v in purple_x ]

X = np.concatenate((
                    np.array( [ [x,y] for (x,y) in zip(red_x,red_y)]),
                    np.array( [ [x,y] for (x,y) in zip(blue_x,blue_y)]),
                    np.array( [ [x,y] for (x,y) in zip(green_x,green_y)]),
                    np.array( [ [x,y] for (x,y) in zip(purple_x,purple_y)]),
                  ))
y = np.concatenate((
                    np.repeat(0,len(red_x)),
                    np.repeat(1,len(blue_x)),
                    np.repeat(2,len(green_x)),
                    np.repeat(3,len(purple_x))
                   ))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot(X_, y_):
  plt.rcParams['figure.figsize'] = [4, 4]
  plt.rcParams['figure.dpi'] = 100
  for i, v in enumerate(['red','blue','green','purple']):
    Xs = [ t[0][0] for t in zip(X_,y_) if t[1] == i ]
    ys = [ t[0][1] for t in zip(X_,y_) if t[1] == i ]
    plt.scatter(Xs,ys, color=v, s=1)

plot(X,y)

Note above that while the green and purple dots fill two well-demarcated rectangles in the lower half of the plot, the red and blue dots follow an irregular, wave-like pattern.

The lower portion of this data set is purposely created to make tree-based models (like XGBoost) happy, given that the green and purple dots live within well-demarcated regions which can be predicted by navigating a simple split-point hierarchy such as

       y > 0.5
   YES        NO
   ...      x > 0.5
          NO      YES
        Green    Purple
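
To see this concretely, here is a minimal sketch (an aside, not part of the original lab) that fits a shallow decision tree on the same training data and prints the split-point hierarchy it learns, using DecisionTreeClassifier and export_text from Scikit-Learn's tree module.

# A depth-2 tree is enough to illustrate the split-point hierarchy sketched above.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(export_text(tree, feature_names=['x', 'y']))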

However, a Multi-Layer Perceptron (MLP) does not learn specific split points. Instead, it applies a so-called activation function to each of its perceptrons, and these functions, while not necessarily linear, behave continuously rather than discretely.
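
The contrast can be sketched in a couple of lines (an illustrative aside, not part of the original lab): a tree-style split is a hard step at a threshold, while a perceptron passes a weighted sum of its inputs through a smooth activation such as tanh.

# Hard split at x > 0.5 (tree-style) versus a smooth tanh ramp centred on the same point.
xs = np.linspace(0, 1, 100)
step = (xs > 0.5).astype(float)
smooth = np.tanh(10 * (xs - 0.5)) * 0.5 + 0.5
plt.plot(xs, step, label='split point (tree)')
plt.plot(xs, smooth, label='tanh perceptron')
plt.legend()
plt.show()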

We will use one hidden layer and observe the effect of each activation function with layer sizes of 1, 2, and 4, to understand how it affects the shape of the predictive model.

Visualisation Code

The code below helps compare three different models, showing the predicted ‘regions’ for each class (in a light colour), versus the coordinates of the actual dots in the test set. Understanding the code below is not important. What matters is our discussion further down, so feel free to skip ahead.

def fit_visualise(models, names):

    plt.rcParams['figure.figsize'] = [8, 4]
    plt.rcParams['figure.dpi'] = 100
    plt.subplots_adjust(wspace=0.5)

    # Top row: predicted class regions (light colours) overlaid with the test dots.
    for i, m in enumerate(models):
      m.fit(X_train, y_train)

      plt.subplot(2, 3, i+1)
      plt.title("%s\nAccuracy = (%.2f)" %
                (names[i], accuracy_score(m.predict(X_test), y_test)))
      plt.xticks(np.linspace(0,1,11), fontsize="xx-small")
      if i == 0:
        plt.yticks(np.linspace(0,1,11), fontsize="xx-small")
      else:
        plt.yticks([])

      if m is not None:
        classes = [{}] * 4
        for j, v in enumerate(['#ff8181','#8181ff','#81a781','#a782a7']):
          classes[j] = { 'colour' : v, 'x' : [], 'y' : [] }

        # Predict every point of a 40x40 grid to paint the class regions.
        for x in np.linspace(0,1,40):
          for y in np.linspace(0,1,40):
            predicted = m.predict([[x,y]])[0]
            classes[predicted]['x'].append(x)
            classes[predicted]['y'].append(y)

        for c in classes:
          plt.scatter(c['x'],
                      c['y'],
                      color=c['colour'],
                      edgecolors='none',
                      marker='s',
                      s=10)
        for j, v in enumerate(['red','blue','green','purple']):
          Xs = [ t[0][0] for t in zip(X_test,y_test) if t[1] == j ]
          ys = [ t[0][1] for t in zip(X_test,y_test) if t[1] == j ]
          plt.scatter(Xs, ys, color=v, s=1)

    # Bottom row: normalised confusion matrices for the same models.
    for i, m in enumerate(models):
      plt.subplot(2, 3, i+4)
      labels = ["r","b","g","p"]
      confusion = confusion_matrix(y_test, m.predict(X_test), normalize='true')
      dfc = pd.DataFrame(confusion,
                         index = labels,
                         columns = labels)
      sns.heatmap(dfc, annot=True, cmap="YlGnBu", fmt='.2f', annot_kws={"size":8})
      plt.ylabel('Actual')
      plt.xlabel('Predicted')

Activation Function Safari

Here we will look at the nature of each activation function, as well as its effect when using 1, 2, and 4 perceptrons.

Rectified Linear Unit Function (ReLU)

This function, which in Python may be expressed as y = np.maximum(0, x), returns 0 for any negative value and the value itself otherwise.

for x in [-1000,-100,-10,-5,-1,-0.5,0,0.5,1,10,100,1000]:
    print("max(0,%2.2f) = %2.2f" % (x,np.maximum(0,x)))
max(0,-1000.00) = 0.00
max(0,-100.00) = 0.00
max(0,-10.00) = 0.00
max(0,-5.00) = 0.00
max(0,-1.00) = 0.00
max(0,-0.50) = 0.00
max(0,0.00) = 0.00
max(0,0.50) = 0.50
max(0,1.00) = 1.00
max(0,10.00) = 10.00
max(0,100.00) = 100.00
max(0,1000.00) = 1000.00
x_values = np.linspace(-1, 1, 100)
y_values = np.maximum(0, x_values)

plt.plot(x_values, y_values, label = 'relu')
plt.xticks([-1,0,1])
plt.yticks([-1,0,1])
plt.legend()
plt.show()

models = []
titles = []

for i in [1,2,4]:
  models.append(
    MLPClassifier(random_state = 0, hidden_layer_sizes = [i], activation="relu", solver='lbfgs')
  )
  titles.append(
    "1 Layer, size = %d" % i
  )

fit_visualise(models,titles)

The value of hidden_layer_sizes must be at least 2 for relu (and all the other activation functions) to predict the green and purple dots correctly. However, with a size of 4, this particular activation function does not do any better.
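
As an aside (not part of the original lab), we can peek inside the fitted size-1 model to see why it struggles: MLPClassifier exposes its learned weights via the coefs_ and intercepts_ attributes, and with a single hidden unit both inputs are squeezed through one weighted sum before reaching the four output units.

# models[0] is the hidden_layer_sizes=[1] relu model fitted by fit_visualise above.
relu_1 = models[0]
for layer, (W, b) in enumerate(zip(relu_1.coefs_, relu_1.intercepts_)):
    print("Layer %d: weights %s, biases %s" % (layer, W.shape, b.shape))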

Hyperbolic Tangent (TanH)

This function, which in Python may be expressed as y = np.tanh(x), returns a value between -1 and 1 for any value of x, no matter how large or small it may be.

for x in [-1000,-100,-10,-5,-1,-0.5,0,0.5,1,10,100,1000]:
    print("tanh(%2.2f) = %2.2f" % (x,np.tanh(x)))
tanh(-1000.00) = -1.00
tanh(-100.00) = -1.00
tanh(-10.00) = -1.00
tanh(-5.00) = -1.00
tanh(-1.00) = -0.76
tanh(-0.50) = -0.46
tanh(0.00) = 0.00
tanh(0.50) = 0.46
tanh(1.00) = 0.76
tanh(10.00) = 1.00
tanh(100.00) = 1.00
tanh(1000.00) = 1.00
x_values = np.linspace(-5, 5, 100)
y_values = np.tanh(x_values)

plt.plot(x_values, y_values, label = 'tanh')
plt.xticks([-5,0,5])
plt.yticks([-1,0,1])
plt.legend()
plt.show()

models = []
titles = []

for i in [1,2,4]:
  models.append(
    MLPClassifier(random_state = 0, hidden_layer_sizes = [i], activation="tanh", solver='lbfgs')
  )
  titles.append(
    "1 Layer, size = %d" % i
  )

fit_visualise(models,titles)

In the case of tanh, neither a size of 2 nor a size of 4 achieves a perfect score for the purple dots (you can see above that some fall in the green region). However, with a size of 4, the model predicts the red and blue dots quite impressively, which results in an overall accuracy score of 0.92. This is the best activation function for this data set.

Sigmoid Function (Logistic)

This function, which in Python may be expressed as y = 1 / (1 + np.exp(-x)), returns a value between 0 and 1 for any value of x, no matter how large or small it may be.

for x in [-1000,-100,-10,-5,-1,-0.5,0,0.5,1,10,100,1000]:
    print("sigmoid(%2.2f) = %2.2f" % (x,1 / (1 + np.exp(-x))))
sigmoid(-1000.00) = 0.00
sigmoid(-100.00) = 0.00
sigmoid(-10.00) = 0.00
sigmoid(-5.00) = 0.01
sigmoid(-1.00) = 0.27
sigmoid(-0.50) = 0.38
sigmoid(0.00) = 0.50
sigmoid(0.50) = 0.62
sigmoid(1.00) = 0.73
sigmoid(10.00) = 1.00
sigmoid(100.00) = 1.00
sigmoid(1000.00) = 1.00
x_values = np.linspace(-5, 5, 100)
y_values = 1 / (1 + np.exp(-x_values))

plt.plot(x_values, y_values, label = 'sigmoid')
plt.xticks([-5,0,5])
plt.yticks([0,1])
plt.legend()
plt.show()

models = []
titles = []

for i in [1,2,4]:
  models.append(
    MLPClassifier(random_state = 0, hidden_layer_sizes = [i], activation="logistic", solver='lbfgs')
  )
  titles.append(
    "1 Layer, size = %d" % i
  )

fit_visualise(models,titles)

The logistic activation function provides a better result with a size of 2 than with a size of 4, but it does not do a better job than tanh even at larger sizes.

Optimal Activation Function and Single Layer Size

In the previous section, we tried to discover the ‘shape’ of the MLP’s predictive model for each activation function whereby the single hidden layer had sizes {1, 2, 4}.

Below, we use GridSearchCV to find the best activation function and the best size for a single hidden layer. We do not iterate through different regularisation hyperparameters.

parameters = {'activation' : ['relu','tanh','logistic'],
              'hidden_layer_sizes' : [ [s] for s in range(1,50) ],
}
model = MLPClassifier(random_state = 0, solver='lbfgs')
clf = GridSearchCV(model, parameters)
clf.fit(X_train,y_train)
best_model = clf.best_estimator_
print("  Activation Function: %s" % clf.best_estimator_.activation)
print("   Hidden Layer Sizes: %s" % clf.best_estimator_.hidden_layer_sizes)
print("                Alpha: %f" % clf.best_estimator_.alpha)
print("             Accuracy: %f" % accuracy_score(y_test,best_model.predict(X_test)))
fit_visualise([best_model],[""])
  Activation Function: tanh
   Hidden Layer Sizes: [40]
                Alpha: 0.000100
             Accuracy: 0.994000

Here we have confirmation that tanh works well with our data set, and that a single hidden layer of size 40 provides more than acceptable accuracy levels.
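
If you are curious how sensitive this result is to the layer size, a short sketch like the one below (not part of the original lab) plots the mean cross-validated accuracy that GridSearchCV recorded for each tanh layer size in clf.cv_results_.

# Plot mean cross-validated accuracy versus hidden layer size for the tanh runs.
results = pd.DataFrame(clf.cv_results_)
tanh_runs = results[results['param_activation'] == 'tanh']
sizes = [s[0] for s in tanh_runs['param_hidden_layer_sizes']]
plt.plot(sizes, tanh_runs['mean_test_score'])
plt.xlabel('Hidden layer size')
plt.ylabel('Mean CV accuracy')
plt.show()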

XGBoost vs MLP

We will now compare XGBoost's and MLP's predictions and training performance.

Please note that whereas XGBoost does not take any non-default arguments, MLP takes the best parameters found by GridSearchCV, in other words hidden_layer_sizes = [40] and activation='tanh'.

xg_model = XGBClassifier(random_state=0)
fit_visualise([xg_model,best_model],["XGBoost","MLP"])

Although MLP appears to be the 'superior' choice for the small data set at hand, given its slightly higher accuracy than the gradient boosted model, we have to note that the MLP model seems more arbitrary, expecting green and purple dots in regions where blue dots are more likely, at least as far as human intuition goes.

# Performance comparison
%timeit -n 10 XGBClassifier(random_state=0).fit(X_train,y_train)
%timeit -n 10 best_model.fit(X_train,y_train)
487 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
257 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Above we can see that training the MLP model is roughly two times faster, but with the caveat that we spent significant time searching for the best model first. In the case of XGBoost we simply took the defaults without trying to optimise any hyperparameter.
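
To put a number on that caveat, one could time the grid search itself with a rough sketch like this (not part of the original lab; it simply refits the same GridSearchCV defined earlier):

import time

# Time a full refit of the grid search over activations and layer sizes.
start = time.perf_counter()
GridSearchCV(MLPClassifier(random_state=0, solver='lbfgs'), parameters).fit(X_train, y_train)
print("Grid search took %.1f seconds" % (time.perf_counter() - start))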

Conclusion

Although techniques such as MLP are seldom used (or recommended) for small data sets, for the sample provided in this tutorial the results were not disappointing.

As noted in the section comparing XGBoost with Scikit-Learn's MLP Classifier, we should not jump to the conclusion that MLP does a better job simply because of its superior score.

This experiment isn’t neither a fair benchmark, in the sense that we haven’t done any hyperparameter searching for XGBoost, and in the case of MLP itself we have only iterated through three activation functions, and experimented with 50 sizes for a single layer.

What this experiment does show, though, is that MLP can provide outstanding results, even when its predictive model makes less intuitive sense than one produced by a traditional approach (such as decision trees), as seen in the odd regions where it would have predicted green and purple dots.

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
