Gaussian Naive Bayes Classification with Scikit-Learn

Introduction
Scikit-Learn's Gaussian Naive Bayes classifier has the advantage, over the likes of logistic regression, that it can be fed partial data in 'chunks' using the partial_fit(X, y, classes)
method. Also, given its 'Gaussian' nature, the decision boundary between classes is a quadratic curve rather than a straight line, which may be more useful for some data sets.
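As a quick preview of that method (toy arrays only; the tutorial's real data set is built in the sections below), training in chunks looks roughly like this:
# Minimal sketch of incremental training on hypothetical chunks of data.
# The full class list must be given on (at least) the first partial_fit call.
import numpy as np
from sklearn.naive_bayes import GaussianNB

chunks = [
    (np.array([[1.0, 2.0], [2.0, 3.0]]), np.array([0, 1])),
    (np.array([[3.0, 1.0], [4.0, 5.0]]), np.array([0, 1])),
]

model = GaussianNB()
for X_chunk, y_chunk in chunks:
    model.partial_fit(X_chunk, y_chunk, classes=[0, 1])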
Note: This tutorial follows the same structure as the SVM one, in that this classifier is presented in terms of its differences from logistic regression.
Imports and Boilerplate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from copy import deepcopy

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

plt.rcParams['figure.figsize'] = [8, 7]
plt.rcParams['figure.dpi'] = 100
A Classification Problem. Red Dots vs Blue Dots.
We use classification to predict the likelihood of one class versus another, given one or more features.
Let us contemplate two classes (i.e., a binary/binomial classification problem), consisting of red and blue dots. Do not worry too much about the code below; jump straight to the plot and we'll continue the conversation.
np.random.seed(0)
samples = 80

# Red dots: x in 1..8, with y pushed towards the top of the plot.
red_x = np.linspace(1, 8, samples)
red_y = [v + np.random.rand() * (10 - v) for v in red_x]

# Blue dots: x in 1..10, with y pulled towards the bottom of the plot.
blue_x = np.linspace(1, 10, samples)
blue_y = [v - (np.random.rand() * v) for v in blue_x]

X = np.concatenate((
    np.array([[x, y] for (x, y) in zip(red_x, red_y)]),
    np.array([[x, y] for (x, y) in zip(blue_x, blue_y)])
))
y = np.concatenate((
    np.repeat(0, len(red_x)),   # class 0 = red
    np.repeat(1, len(blue_x))   # class 1 = blue
))


def visualise(models, names):
    plt.scatter(red_x, red_y, color='red')
    plt.scatter(blue_x, blue_y, color='blue')
    plt.xticks(range(0, 11))
    plt.yticks(range(0, 11))
    plt.xlabel("X")
    plt.ylabel("Y")
    for i, m in enumerate(models):
        # Scan a 200x200 grid and keep the points where the predicted
        # probability of class 1 is (almost) exactly 0.5: the decision boundary.
        class_boundary = [(x, y)
                          for x in np.linspace(0, 10, 200)
                          for y in np.linspace(0, 10, 200)
                          if abs((m.predict_proba([[x, y]])[0][1]) - 0.5)
                          <= 0.001]
        plt.plot([t[0] for t in class_boundary],
                 [t[1] for t in class_boundary],
                 label=names[i])
    if len(models) > 0:
        plt.legend(loc='upper left')


visualise([], [])
OK, above we have X and Y axes (both in the 0..10 range); depending on where a point falls, it is more likely to be either a red or a blue dot.
Logistic Regression
An easy predictive model that we can use is logistic regression (check my tutorial), which will predict (barring regularisation, etc.) either red or blue, depending on whether a point falls above or below the teal decision boundary line.
model_logistic = LogisticRegression()
model_logistic.fit(X, y)
visualise([model_logistic], ["Logistic"])
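Since the fitted logistic model stores its weights in the standard scikit-learn attributes coef_ and intercept_, the straight-line boundary can also be read off directly; the printout below is just for illustration.
# The linear boundary is w0*x + w1*y + b = 0, with w in coef_ and b in intercept_.
w = model_logistic.coef_[0]
b = model_logistic.intercept_[0]
print("Boundary: %.3f*x + %.3f*y + %.3f = 0" % (w[0], w[1], b))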
Gaussian Naive Bayes Classifier
Here we see the difference between the logistic classifier and the Bayes one. In this particular example, the Bayes classifier is not necessarily more accurate than the linear one, but this is down to the particular synthetic data set we are using for the example.
gnb = GaussianNB()
gnb.fit(X, y)
l_accuracy = accuracy_score(y, model_logistic.predict(X))
g_accuracy = accuracy_score(y, gnb.predict(X))
visualise([model_logistic, gnb],
          ["Logistic (%s)" % l_accuracy, "Bayes (%s)" % g_accuracy])
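The scores above are computed on the very data the models were trained on. For a less optimistic comparison, one could score a held-out split instead; a quick sketch (the 25% test fraction and random_state are arbitrary choices):
from sklearn.model_selection import train_test_split

# Hold out a quarter of the points and score both models on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
print("Logistic:", accuracy_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict(X_te)))
print("Bayes:   ", accuracy_score(y_te, GaussianNB().fit(X_tr, y_tr).predict(X_te)))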
Progressive Training
As explained in the beginning, an advantage offered by this classifier is that it can be trained progressively, unlike other classifiers which require the entire data set to be loaded into memory.
In the example below we divide the data set into three folds (we use the test indices rather than the train ones because they represent the smaller slice of the data), and then feed each fold progressively to the same model. Each 'round' or 'pass' makes the model more accurate.
For intuition, we represent the results in a combined visualisation.
gnb1 = GaussianNB()

kf = KFold(n_splits=3, shuffle=True, random_state=2)
gen = kf.split(X)

# First third of the data.
_, test = next(gen)
gnb1.partial_fit(X[test], y[test], classes=[0, 1])

# Snapshot the model, then feed it the second third.
gnb2 = deepcopy(gnb1)
_, test = next(gen)
gnb2.partial_fit(X[test], y[test], classes=[0, 1])

# Snapshot again, then feed the final third.
gnb3 = deepcopy(gnb2)
_, test = next(gen)
gnb3.partial_fit(X[test], y[test], classes=[0, 1])

visualise([model_logistic, gnb1, gnb2, gnb3],
          ["Logistic", "Bayes 1/3", "Bayes 2/3", "Bayes 3/3"])
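To put a number on the "each pass makes the model more accurate" claim, the three snapshots can be scored against the full data set (a quick check, not part of the plot above):
# Accuracy of each snapshot on the whole data set.
for name, m in [("1/3", gnb1), ("2/3", gnb2), ("3/3", gnb3)]:
    print("Bayes %s accuracy: %.3f" % (name, accuracy_score(y, m.predict(X))))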
Conclusion
The Gaussian Naive Bayes classifier is useful for obtaining fast, preliminary results on data that may arrive as a stream and cannot be processed in memory all at once. Its accuracy is often below that of plain logistic regression, but this weakness may be compensated by its space and time advantages, when applicable.
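As a closing sketch of that streaming use case (using the pandas and GaussianNB imports from the top of the tutorial): the file name, column names and chunk size below are made up for illustration, but the pattern of reading a chunk and calling partial_fit is the same one used above.
# Hypothetical example: train incrementally from a CSV too large to load at once.
# 'data.csv', its columns 'x', 'y', 'label', and the chunk size are assumptions.
stream_model = GaussianNB()
for chunk in pd.read_csv("data.csv", chunksize=10000):
    X_chunk = chunk[["x", "y"]].values
    y_chunk = chunk["label"].values
    stream_model.partial_fit(X_chunk, y_chunk, classes=[0, 1])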