Gaussian Naive Bayes Classification with Scikit-Learn

Introduction
Scikit-Learn's Gaussian Naive Bayes classifier has the advantage, over the likes of logistic regression, that it can be fed partial data in 'chunks' using the partial_fit(X, y, classes)
method. Also, given its 'Gaussian' nature, the decision boundary between classes is a quadratic curve rather than a straight line, which may be more useful for some data sets.
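As a quick preview of that method (toy arrays only; the tutorial's real data set is built in the sections below), training in chunks looks roughly like this:
# Minimal sketch of incremental training on hypothetical chunks of data.
# The full class list must be given on (at least) the first partial_fit call.
import numpy as np
from sklearn.naive_bayes import GaussianNB

chunks = [
    (np.array([[1.0, 2.0], [2.0, 3.0]]), np.array([0, 1])),
    (np.array([[3.0, 1.0], [4.0, 5.0]]), np.array([0, 1])),
]

model = GaussianNB()
for X_chunk, y_chunk in chunks:
    model.partial_fit(X_chunk, y_chunk, classes=[0, 1])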
Note: This tutorial follows the same structure as the SVM one, in that this classifier is presented in terms of its differences from logistic regression.
Imports and Boilerplate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from copy import deepcopy

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

plt.rcParams['figure.figsize'] = [8, 7]
plt.rcParams['figure.dpi'] = 100
A Classification Problem. Red Dots vs Blue Dots.
We use classification to predict the likelihood of one class versus another, given one or more features.
Let us contemplate two classes (i.e., a binary/binomial classification problem), consisting of red and blue dots. Do not worry too much about the code below; jump straight to the plot and we'll continue the conversation.
np.random.seed(0)
samples = 80

# Red dots: x in 1..8, with y pushed towards the top of the plot.
red_x = np.linspace(1, 8, samples)
red_y = [v + np.random.rand() * (10 - v) for v in red_x]

# Blue dots: x in 1..10, with y pulled towards the bottom of the plot.
blue_x = np.linspace(1, 10, samples)
blue_y = [v - (np.random.rand() * v) for v in blue_x]

X = np.concatenate((
    np.array([[x, y] for (x, y) in zip(red_x, red_y)]),
    np.array([[x, y] for (x, y) in zip(blue_x, blue_y)])
))
y = np.concatenate((
    np.repeat(0, len(red_x)),   # class 0 = red
    np.repeat(1, len(blue_x))   # class 1 = blue
))


def visualise(models, names):
    plt.scatter(red_x, red_y, color='red')
    plt.scatter(blue_x, blue_y, color='blue')
    plt.xticks(range(0, 11))
    plt.yticks(range(0, 11))
    plt.xlabel("X")
    plt.ylabel("Y")
    for i, m in enumerate(models):
        # Scan a 200x200 grid and keep the points where the predicted
        # probability of class 1 is (almost) exactly 0.5: the decision boundary.
        class_boundary = [(x, y)
                          for x in np.linspace(0, 10, 200)
                          for y in np.linspace(0, 10, 200)
                          if abs((m.predict_proba([[x, y]])[0][1]) - 0.5)
                          <= 0.001]
        plt.plot([t[0] for t in class_boundary],
                 [t[1] for t in class_boundary],
                 label=names[i])
    if len(models) > 0:
        plt.legend(loc='upper left')


visualise([], [])
OK, above we have X and Y axes (both in the 0..10 range); depending on where a point falls, it is more likely to be either a red or a blue dot.
Logistic Regression
An easy predictive model that we can use is logistic regression (check my tutorial), which will predict (barring regularisation, etc.) either red or blue, depending on whether a point falls above or below the teal decision boundary line.
model_logistic = LogisticRegression()
model_logistic.fit(X, y)
visualise([model_logistic], ["Logistic"])
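Since the fitted logistic model stores its weights in the standard scikit-learn attributes coef_ and intercept_, the straight-line boundary can also be read off directly; the printout below is just for illustration.
# The linear boundary is w0*x + w1*y + b = 0, with w in coef_ and b in intercept_.
w = model_logistic.coef_[0]
b = model_logistic.intercept_[0]
print("Boundary: %.3f*x + %.3f*y + %.3f = 0" % (w[0], w[1], b))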
Gaussian Naive Bayes Classifier
Here we see the difference between the logistic classifier and the Bayes one. In this particular example, the Bayes classifier is not necessarily more accurate than the linear one, but this is down to the particular synthetic data set we are using for the example.
gnb = GaussianNB()
gnb.fit(X, y)
l_accuracy = accuracy_score(y, model_logistic.predict(X))
g_accuracy = accuracy_score(y, gnb.predict(X))
visualise([model_logistic, gnb],
          ["Logistic (%s)" % l_accuracy, "Bayes (%s)" % g_accuracy])
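The scores above are computed on the very data the models were trained on. For a less optimistic comparison, one could score a held-out split instead; a quick sketch (the 25% test fraction and random_state are arbitrary choices):
from sklearn.model_selection import train_test_split

# Hold out a quarter of the points and score both models on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
print("Logistic:", accuracy_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict(X_te)))
print("Bayes:   ", accuracy_score(y_te, GaussianNB().fit(X_tr, y_tr).predict(X_te)))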
Progressive Training
As explained in the beginning, an advantage offered by this classifier is that it can be trained progressively, unlike other classifiers which require the entire data set to be loaded into memory.
In the example below we divide the data set into three folds (we use the test indices rather than the train ones because they represent the smaller slice of the data), and then feed each fold progressively to the same model. Each 'round' or 'pass' makes the model more accurate.
For intuition, we represent the results in a combined visualisation.
gnb1 = GaussianNB()

kf = KFold(n_splits=3, shuffle=True, random_state=2)
gen = kf.split(X)

# First third of the data.
_, test = next(gen)
gnb1.partial_fit(X[test], y[test], classes=[0, 1])

# Snapshot the model, then feed it the second third.
gnb2 = deepcopy(gnb1)
_, test = next(gen)
gnb2.partial_fit(X[test], y[test], classes=[0, 1])

# Snapshot again, then feed the final third.
gnb3 = deepcopy(gnb2)
_, test = next(gen)
gnb3.partial_fit(X[test], y[test], classes=[0, 1])

visualise([model_logistic, gnb1, gnb2, gnb3],
          ["Logistic", "Bayes 1/3", "Bayes 2/3", "Bayes 3/3"])
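To put a number on the "each pass makes the model more accurate" claim, the three snapshots can be scored against the full data set (a quick check, not part of the plot above):
# Accuracy of each snapshot on the whole data set.
for name, m in [("1/3", gnb1), ("2/3", gnb2), ("3/3", gnb3)]:
    print("Bayes %s accuracy: %.3f" % (name, accuracy_score(y, m.predict(X))))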
Conclusion
The Gaussian Naive Bayes classifier is useful for obtaining fast, preliminary results on data that may arrive as a stream and cannot be processed in memory all at once. Its accuracy is often below that of plain logistic regression, but this weakness may be compensated by its space and time advantages, when applicable.
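As a closing sketch of that streaming use case (using the pandas and GaussianNB imports from the top of the tutorial): the file name, column names and chunk size below are made up for illustration, but the pattern of reading a chunk and calling partial_fit is the same one used above.
# Hypothetical example: train incrementally from a CSV too large to load at once.
# 'data.csv', its columns 'x', 'y', 'label', and the chunk size are assumptions.
stream_model = GaussianNB()
for chunk in pd.read_csv("data.csv", chunksize=10000):
    X_chunk = chunk[["x", "y"]].values
    y_chunk = chunk["label"].values
    stream_model.partial_fit(X_chunk, y_chunk, classes=[0, 1])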