Linear Regression with Scikit-Learn

Introduction
In the world of neural networks and quantum machines, it may seem as though linear regression was yesterday’s news. Nothing could be further from the truth.
Many business and real-world phenomena do, in fact, exhibit an approximately linear relationship.
In this tutorial, we will use Scikit-Learn to create predictive models using simple linear regression, polynomial regression, Ridge regression, and Lasso regression.
Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

plt.rcParams['figure.figsize'] = [8, 7]
plt.rcParams['figure.dpi'] = 100
Simple Linear Regression
This model, fitted using the least squares method, works out the coefficient m and the intercept b for a linear equation of the form y = mx + b, given a set of data.
First, we will rearrange the equation using the format preferred in the machine learning community, given that we’ll cover polynomials soon after, and that a regular linear equation can be understood as a polynomial regression whose degree is one:
y = β_0 + β_1x^1
- β_0 is the intercept (b in the traditional equation)
- β_1 is the coefficient (m in the traditional equation)
We begin by generating some random target values y that form an upward trend as x increases:
samples = 40
np.random.seed(0)

# Range of numbers between 0 and 1
x = np.linspace(0.0, 1.0, samples)
# Random numbers between 0 and 1
r = np.random.rand(samples)
# Random increasing y values
y = np.array(list(map(lambda t: t[0]/2 + t[1]/5, zip(x, r))))
# X needs to be in the shape [[v1], [v2], ...] rather than [v1, v2, ...]
X = x.reshape(samples, -1)

_ = pd.DataFrame({"values": y}, index=x).plot(style='.')
We then split the data into a training set and a test set, and train the model with the training set. LinearRegression.fit() calculates the intercept β_0 and the coefficient β_1, which we can plug into our own prediction equation:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

b_0 = model.intercept_
b_1 = model.coef_[0]

print('Intercept: {:+.5f}'.format(b_0))
print('Coefficient: {:+.5f}'.format(b_1))

def our_predict(b_0, b_1, x):
    y = b_0 + b_1*x
    return y
Intercept: +0.12147
Coefficient: +0.48881
Finally, we plot the sample data again, together with the line produced by the linear equation. We can see here that the prediction is easy to implement on our own, since it is just a high-school formula. Scikit-Learn’s help lies primarily in working out the intercept and the coefficient that best approximate the training data.
use_our_predict = True

plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data')
plt.xlabel("X - features")
plt.ylabel("y - target values")
equation = "y={:.3f}{:+.3f}x".format(b_0, b_1)
if use_our_predict:
    plt.plot(X, our_predict(b_0, b_1, X), '-r', label=equation)
else:
    plt.plot(X, model.predict(X), '-r', label=equation)
plt.legend(loc='upper left')
plt.show()

print('Score (Training): {:.3f}'.format(model.score(X_train, y_train)))
print('Score (Test): {:.3f}'.format(model.score(X_test, y_test)))
Score (Training): 0.885
Score (Test): 0.820
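As a side note (not spelled out in the original article), the score() method of Scikit-Learn regressors returns the coefficient of determination, R², so the same value can be computed explicitly with r2_score():

from sklearn.metrics import r2_score

# This should print the same value as model.score(X_test, y_test) above
print('R2 (Test): {:.3f}'.format(r2_score(y_test, model.predict(X_test))))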
Polynomial Regression
A polynomial regression is still a linear regression, except that it has extra term(s), each with an increasing power:
y = β_0 + β_1x^1 + β_2x^2 + β_3x^3 + …
As we said before, a linear regression is a polynomial regression whose degree is 1, whereas a typical polynomial regression has a degree of at least 2. The degree is the highest power used by one of the terms.
Polynomial regression is useful for data sets whose shape has ‘bends’, from simple parabolas to swings and roundabouts, for which a straight line is simply not a useful predictive model.
Let’s start by generating some sample random data whose shape is sort of ‘bent’:
samples = 101
np.random.seed(0)

# Range of numbers between 0 and 1
x = np.linspace(0.0, 1.0, samples)
# Random numbers between 0 and 1
r = np.random.rand(samples)
# Random increasing y values
y = np.array(list(map(lambda t: (t[0]**2) + t[1]/2, zip(x, r))))
# X needs to be in the shape [[v1], [v2], ...] rather than [v1, v2, ...]
X = x.reshape(-1, 1)
_ = pd.DataFrame({"values": y}, index=x).plot(style='.')
We obtain training and test data sets as usual:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
And now comes the interesting part. As we said before, a polynomial regression is still a linear regression. The difference is that we must treat each extra power as having the ability to describe a different bend or segment of the data. This is a non-mathematical explanation but hold onto that thought for a moment.
Say that we choose a degree of two. In this case, a given y target value is best approximated not only by the best choice of β_1x^1 but also of β_2x^2, in addition to the intercept. In the simple linear regression, which produces straight lines, x^1 simply equals x. In other words, we only have one feature and one target value:
X -> y
But with an extra term, we effectively have two features associated with each target value y:
(X, X^2) -> y
You can think of every extra term (and therefore each extra power) as providing finer resolution, and the ability to capture more intricate bends. You can type y = x^2, y = x^3 and so on into a graphing calculator (or Google) to get an idea of what kind of shape each power produces. By combining multiple coefficients and powers in one equation, we get more complex shapes that, hopefully, describe the data more accurately, but we’ll get back to that in a moment.
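For instance, here is a tiny illustration (with made-up values, not the article’s data set) of turning a single feature X into the (X, X^2) pair described above, for a degree of two:

X_demo = np.array([[0.1], [0.5], [0.9]])
# Each row becomes [x, x^2]
X_demo_features = np.hstack([X_demo, X_demo**2])
print(X_demo_features)
# -> [[0.1, 0.01], [0.5, 0.25], [0.9, 0.81]]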
Let’s return to practicalities. First, we need to choose a degree. We’ll choose a degree of 9, an excessive one, for didactic purposes.
The next step is to take the one-dimensional sequence of X values and turn it into multiple features, whereby each feature is a version of X raised to one of the powers 1..9.
Let’s first see the kind of plain X values that we have:
X_train[0:5]  # Display only the first five elements
array([[0.06],
       [1.  ],
       [0.82],
       [0.76],
       [0.6 ]])
Let’s extract the features for each power, and see the results for the first value, which is 0.06:
X_features = np.array([np.power(x, range(1, 10)) for x in X_train])
X_features[0]  # show first element only
array([6.0000000e-02, 3.6000000e-03, 2.1600000e-04, 1.2960000e-05,
       7.7760000e-07, 4.6656000e-08, 2.7993600e-09, 1.6796160e-10,
       1.0077696e-11])
Now, we don’t need to do this ourselves; the PolynomialFeatures.fit_transform() method does it for us:
pf = PolynomialFeatures(degree=9, include_bias=False)
X_features = pf.fit_transform(X_train)
X_features[0]
array([6.0000000e-02, 3.6000000e-03, 2.1600000e-04, 1.2960000e-05,
       7.7760000e-07, 4.6656000e-08, 2.7993600e-09, 1.6796160e-10,
       1.0077696e-11])
Please note that this is an oversimplification of what fit_transform() really does under the covers. For example, if more input features were included, fit_transform() would also produce the interaction terms between features, which our feature extraction code would miss. In short, always use fit_transform() rather than the range()-based example seen a few moments ago.
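To illustrate the point (this snippet is not part of the original example, and get_feature_names_out() assumes a reasonably recent version of Scikit-Learn), with two input features a and b and a degree of 2, fit_transform() also produces the interaction term a·b:

pf2 = PolynomialFeatures(degree=2, include_bias=False)
print(pf2.fit_transform(np.array([[2.0, 3.0]])))
# [[2. 3. 4. 6. 9.]]  i.e. a, b, a^2, a*b, b^2
print(pf2.get_feature_names_out(['a', 'b']))
# ['a' 'b' 'a^2' 'a b' 'b^2']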
Let’s now train the model based on the new feature set:
model = LinearRegression().fit(X_features, y_train)
We won’t be using our own prediction formula this time; instead, we rely directly on Scikit-Learn’s predict() method. Note that we need to predict on the polynomial features (those derived from our x values), as opposed to the set of x values directly:
plt.rcParams['figure.figsize'] = [10, 8]
plt.rcParams['figure.dpi'] = 100
plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data')
plt.xlabel("X - features")
plt.ylabel("y - target values")
# Remember that model.predict() takes a set of features!
plt.plot(X, model.predict(pf.fit_transform(X)), color='red')
plt.legend(loc='upper left')
plt.show()

print('Score (Training): {:.3f}'.format(model.score(pf.fit_transform(X_train), y_train)))
print('Score (Test): {:.3f}'.format(model.score(pf.fit_transform(X_test), y_test)))
Score (Training): 0.795
Score (Test): 0.747
Above we can see that the model is relatively effective in capturing the shape of the training data, but that it then struggles with the test set. This is because the model is overfit; in other words, it is too tightly aligned to the training data set, and it doesn’t generalise well when it comes to new, unforeseen data, like our test set.
This shows that high-degree polynomials are often problematic (they also don’t scale well when the source training set has multiple features). A solution here is to use a lower degree. But which one? Let’s find out!
print('Degree  Training  Test')
for d in range(1, 10):
    pf = PolynomialFeatures(degree=d)
    X_features = pf.fit_transform(X_train)
    model = LinearRegression().fit(X_features, y_train)
    print('  {}      {:+.3f}    {:+.3f}'
          .format(d,
                  model.score(pf.fit_transform(X_train), y_train),
                  model.score(pf.fit_transform(X_test), y_test)))
Degree  Training  Test
  1      +0.721    +0.659
  2      +0.782    +0.803
  3      +0.785    +0.792
  4      +0.786    +0.791
  5      +0.787    +0.788
  6      +0.793    +0.760
  7      +0.794    +0.758
  8      +0.794    +0.754
  9      +0.795    +0.747
Here we can see that a degree of two achieves the best balance between training and test scores. This is a small sample set, though, and the squiggly line produced by a degree of 9 is intentional, as we will see in the next section.
Ridge Regression
So far we’ve seen that a polynomial regression is a flavour of ‘regular’ linear regression that typically has terms with powers of two or greater.
Similarly, a Ridge regression is still a linear regression; it provides regularisation by adding a ‘penalty’ that has the effect of shrinking the coefficients that make the model overfit the training set and generalise poorly when confronted with unseen data.
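For reference (this formulation is standard, though not spelled out in the original article), Ridge regression minimises the usual sum of squared errors plus a penalty proportional to the sum of the squared coefficients:

Σ (y_i − ŷ_i)^2 + α · (β_1^2 + β_2^2 + … + β_n^2)

The larger α is, the more strongly the coefficients are pushed toward zero (the intercept β_0 is not penalised).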
In fact, in the last section, we chose a polynomial with a degree of 9 to show this very effect. This so-called ‘penalty’ is normally called lambda; in Scikit-Learn, however, it is provided through an argument called alpha. Let’s start off by having no penalty whatsoever, which means that alpha=0.0:
model = Ridge(alpha=0.0).fit(X_features, y_train)
plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data')
plt.xlabel("X - features")
plt.ylabel("y - target values")
plt.plot(X, model.predict(pf.fit_transform(X)), '-r')
plt.legend(loc='upper left')
plt.show()

print('Score (Training): {:.3f}'.format(model.score(pf.fit_transform(X_train), y_train)))
print('Score (Test): {:.3f}'.format(model.score(pf.fit_transform(X_test), y_test)))
Score (Training): 0.795
Score (Test): 0.747
The above result is no different from the polynomial regression we saw before. Let’s now set alpha to 1.0, which is a significant penalty, and has the effect of smoothing out the bends toward specific points in the training data set:
model2 = Ridge(alpha=1.0).fit(X_features, y_train)
plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data')
plt.xlabel("X - features")
plt.ylabel("y - target values")
plt.plot(X, model2.predict(pf.fit_transform(X)), '-r')
plt.legend(loc='upper left')
plt.show()

print('Alpha             0.0    1.0')
print('Score (Training): {:.3f}  {:.3f}'.format(model.score(pf.fit_transform(X_train), y_train), model2.score(pf.fit_transform(X_train), y_train)))
print('Score (Test):     {:.3f}  {:.3f}'.format(model.score(pf.fit_transform(X_test), y_test), model2.score(pf.fit_transform(X_test), y_test)))
Alpha             0.0    1.0
Score (Training): 0.795  0.782
Score (Test):     0.747  0.790
We see that with alpha=1.0, the penalty is too high, and the training score drops noticeably. We will use a cross-validation search, iterating through fine-grained alpha values, to find the best one:
params = {"alpha": np.linspace(0.0, 0.1, num=1000)}
gs_cv = GridSearchCV(Ridge(), params, cv=3)
gs_cv.fit(X_features, y_train)
model3 = gs_cv.best_estimator_

plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data')
plt.xlabel("X - features")
plt.ylabel("y - target values")
plt.plot(X, model3.predict(pf.fit_transform(X)), '-r')
plt.legend(loc='upper left')
plt.show()

print('Alpha             0.0    1.0    {}'.format(gs_cv.best_params_['alpha']))
print('Score (Training): {:.3f}  {:.3f}  {:.3f}'.format(model.score(pf.fit_transform(X_train), y_train), model2.score(pf.fit_transform(X_train), y_train), model3.score(pf.fit_transform(X_train), y_train)))
print('Score (Test):     {:.3f}  {:.3f}  {:.3f}'.format(model.score(pf.fit_transform(X_test), y_test), model2.score(pf.fit_transform(X_test), y_test), model3.score(pf.fit_transform(X_test), y_test)))
Alpha             0.0    1.0    0.1
Score (Training): 0.795  0.782  0.785
Score (Test):     0.747  0.790  0.795
Above we see that with alpha=0.1 we get a better test score, although a slightly worse training score. This is typical of models using Ridge regression. We want the model to generalise well to unforeseen, i.e. new, data, rather than achieving an excellent score on the training set (because it is too closely tuned to it), only to fall off a cliff when presented with new data.
Lasso Regression
Lasso regression is similar to Ridge regression in that it involves adding a penalty which, as the lambda (the alpha argument) increases, results in a model that is less ‘skewed’ toward specific points in the data set (it is more biased), and that generalises better to new data (it has less variance).
The difference lies in the mathematical approach to the penalty. Lasso regression adds a penalty proportional to the sum of the absolute values of the coefficients (the L1 norm), which can shrink some of them all the way to zero so that they are effectively eliminated. Ridge regression, instead, always keeps all coefficients.
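We can see this effect directly by counting how many coefficients each model drives to exactly zero (this check is not part of the original article, and the alpha values here are illustrative assumptions):

# Refit both models on bias-free degree-9 features so that only the
# x^1 .. x^9 coefficients are compared; the alpha values are illustrative.
pf_nb = PolynomialFeatures(degree=9, include_bias=False)
feats = pf_nb.fit_transform(X_train)

ridge_coefs = Ridge(alpha=0.1).fit(feats, y_train).coef_
lasso_coefs = Lasso(alpha=0.001, max_iter=1000000, tol=1e-2).fit(feats, y_train).coef_

print('Ridge coefficients exactly zero:', np.sum(ridge_coefs == 0))
print('Lasso coefficients exactly zero:', np.sum(lasso_coefs == 0))

With settings like these, Lasso typically zeroes out several of the nine coefficients, whereas Ridge merely shrinks them.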
In the example below, we use a fixed alpha value (rather than iterating through various values to find the best one, as we did in the case of Ridge regression). For this particular data set, the results are similar. We leave a cross-validation search as an exercise for the reader; a possible starting point is sketched after the results below.
model4 = Lasso(alpha=0.000025, max_iter=1000000, tol=1e-2)
model4.fit(X_features, y_train)

plt.scatter(X_train, y_train, label='Training data')
plt.scatter(X_test, y_test, label='Test data')
plt.xlabel("X - features")
plt.ylabel("y - target values")
plt.plot(X, model4.predict(pf.fit_transform(X)), '-r')
plt.legend(loc='upper left')
plt.show()

print('Model             Reg    Ridge  Lasso')
print('Score (Training): {:.3f}  {:.3f}  {:.3f}'.format(model.score(pf.fit_transform(X_train), y_train), model3.score(pf.fit_transform(X_train), y_train), model4.score(pf.fit_transform(X_train), y_train)))
print('Score (Test):     {:.3f}  {:.3f}  {:.3f}'.format(model.score(pf.fit_transform(X_test), y_test), model3.score(pf.fit_transform(X_test), y_test), model4.score(pf.fit_transform(X_test), y_test)))
Model             Reg    Ridge  Lasso
Score (Training): 0.795  0.785  0.786
Score (Test):     0.747  0.795  0.796
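As a starting point for that exercise, one could reuse GridSearchCV just as in the Ridge section; the alpha range below is an illustrative assumption, not a value from the original article:

# Sketch of a cross-validation search over Lasso alpha values
lasso_params = {"alpha": np.linspace(0.00001, 0.001, num=100)}
lasso_cv = GridSearchCV(Lasso(max_iter=1000000, tol=1e-2), lasso_params, cv=3)
lasso_cv.fit(X_features, y_train)

print('Best alpha: {}'.format(lasso_cv.best_params_['alpha']))
print('Score (Test): {:.3f}'.format(lasso_cv.best_estimator_.score(pf.fit_transform(X_test), y_test)))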
Feature Normalisation using the MinMax Scaler
Sometimes features come, in their original form, on very different scales, even though they map to the same target values or labels. Take the example of the data points below:
x = np.arange(70, 100)
y = np.array([t[0] + t[1] for t in
              zip(range(0, 30), np.random.randint(95, 101, 30))])

plt.scatter(x, y)
plt.gca().set_xticks(range(0, 101, 20))
plt.gca().set_yticks(range(0, 141, 20))
plt.show()
A common approach is to normalise the data set so that the minimum value becomes 0.0 and the maximum value becomes 1.0, using MinMaxScaler.fit_transform():
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x.reshape(-1, 1))
y_scaled = scaler.fit_transform(y.reshape(-1, 1))
plt.scatter(x_scaled, y_scaled)
plt.show()
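For reference, with its default feature_range of (0, 1), MinMaxScaler applies the transformation (x − x_min) / (x_max − x_min) to each column; a quick by-hand check (not part of the original article) confirms this:

# By-hand equivalent of MinMaxScaler with the default feature_range of (0, 1)
manual_x = (x - x.min()) / (x.max() - x.min())
print(np.allclose(manual_x, x_scaled.ravel()))  # True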
Conclusion
Linear regression is not limited to ‘straight line’ scenarios. Data with lots of curves or bends can be addressed with polynomial regression, and the resulting models can be fine-tuned with either the Ridge or Lasso flavours to adjust bias and variance.