Decision Tree Ensembles Compared

Introduction
A Decision Tree ‘Ensemble’ is a method in which multiple trees, as opposed to a single one, are used to arrive at a prediction. The most popular of these methods are Random Forests and Gradient Boosted Decision Trees. Our objective here is not to spend too much time on theory, but to see, in practice, how these two ensembles differ from a plain Decision Tree, using Scikit-Learn, and XGBoost too!
In a nutshell, Random Forests and Gradient Boosted Decision Trees produce better results than regular, simple Decision Trees, but to what degree? Let’s find out.
Regular Decision Trees
A regular Decision Tree is a model in which the features are used to define split points, so that, during inference, we arrive at the predicted class (in the case of classification) by recursively navigating from each split point to the left or to the right.
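If you want to see what those split points look like in a fitted tree, Scikit-Learn’s export_text helper prints the hierarchy as plain text. The snippet below is only an illustration, trained on the bundled Iris data set rather than on anything used later in this tutorial.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a tiny tree and print its split points; prediction follows the
# branches left/right until a leaf (class) is reached.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))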
Random Forest
A Random Forest uses multiple decision trees as opposed to one. But where do these trees come from?
Through a process called bootstrapping, the Random Forest algorithm creates ‘random data sets’ to train each tree; the number of trees is defined by a hyperparameter named n_estimators, which is set to 100 by default in Scikit-Learn’s Random Forest implementation. The ‘random data sets’ used to train each tree are derived from the original data set by sampling with replacement, which consists of picking random rows from the original data set, even if the same rows have been picked before; this is what ‘with replacement’ means.
Furthermore, only a random subset of features is considered when each tree looks for its best splits. The maximum number of features to consider is defined by another hyperparameter called max_features in Scikit-Learn’s Random Forest implementation, which is set to the square root of the total number of features by default. For example, if there are 9 features, then only 3 randomly chosen features would be considered at each split.
During model inference, all the trees (each of which has been built from its own bootstrapped data set, using only a subset of the features) are evaluated and the majority class prediction is selected; this process is called aggregation. In a regression scenario, the mean would be used instead.
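To make the bootstrapping-plus-aggregation idea concrete, here is a minimal, hand-rolled sketch of a toy forest built from plain DecisionTreeClassifier objects. The helper names (build_toy_forest, majority_vote) are ours, not part of Scikit-Learn, and the sketch leaves out everything else a real Random Forest does.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_toy_forest(X, y, n_trees=10, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    for _ in range(n_trees):
        # Bootstrapping with replacement: draw as many rows as the original
        # data set, allowing duplicates.
        idx = rng.integers(0, len(X), size=len(X))
        # max_features='sqrt' mirrors the Random Forest default: only a
        # random subset of features is considered at each split.
        tree = DecisionTreeClassifier(max_features='sqrt', random_state=random_state)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def majority_vote(trees, X):
    # Aggregation: every tree votes and the most common class wins.
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)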
Gradient Boosted Trees
Gradient Boosted Trees are also an ‘ensemble’ of trees (the n_estimators hyperparameter is also used in Scikit-Learn’s implementation) like Random Forests, but the training process is sequential: each new tree is trained on the prediction errors left by the trees before it, or more precisely, on the gradient of the loss with respect to the current predictions, which is where the ‘gradient’ part of the name comes from. Training continues for a fixed number of boosting rounds, or, when early stopping is used, until further rounds no longer improve a validation metric.
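As a rough illustration of the ‘fit the next tree on the previous errors’ idea, here is a hand-rolled sketch for regression with squared error, where the residuals are exactly the negative gradient of the loss. The function name and the learning_rate value are illustrative, not Scikit-Learn defaults.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    # Start from a constant prediction: the mean of the targets.
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_rounds):
        # For squared error, the residuals are the negative gradient of the
        # loss with respect to the current prediction.
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth)  # a 'weak learner'
        tree.fit(X, residuals)
        # Each new tree nudges the prediction a little closer to the targets.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees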
Although Scikit-Learn has its own Gradient Boosted Trees implementation, XGBoost is more popular due to its superior efficiency and extra features. We will cover both.
Imports and Boilerplate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import timeit

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
Data Set
We create a synthetic data set consisting of red, blue, green, and purple dots. Do not worry too much about the code below; jump straight to the plot and we’ll continue the conversation.
np.random.seed(0)
samples = 500

# Red and blue dots trace two overlapping sine waves in the upper half of
# the plot; green and purple dots fill the lower-left and lower-right regions.
red_x = np.linspace(0, 10, samples)
red_y = [8.5 + np.sin(x) * 1.5 - np.random.rand() for x in red_x]

blue_x = np.linspace(0, 10, samples)
blue_y = [7.5 + np.sin(x) * 1.5 - np.random.rand() for x in blue_x]

green_x = np.linspace(0, 5, samples)
green_y = [np.random.rand() * 5 for _ in green_x]

purple_x = np.linspace(5, 10, samples)
purple_y = [np.random.rand() * 5 for _ in purple_x]

X = np.concatenate((
    np.array([[x, y] for (x, y) in zip(red_x, red_y)]),
    np.array([[x, y] for (x, y) in zip(blue_x, blue_y)]),
    np.array([[x, y] for (x, y) in zip(green_x, green_y)]),
    np.array([[x, y] for (x, y) in zip(purple_x, purple_y)]),
))
y = np.concatenate((
    np.repeat(0, len(red_x)),
    np.repeat(1, len(blue_x)),
    np.repeat(2, len(green_x)),
    np.repeat(3, len(purple_x))
))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def plot(X_, y_):
    plt.rcParams['figure.figsize'] = [4, 4]
    plt.rcParams['figure.dpi'] = 100
    for i, v in enumerate(['red', 'blue', 'green', 'purple']):
        Xs = [t[0][0] for t in zip(X_, y_) if t[1] == i]
        ys = [t[0][1] for t in zip(X_, y_) if t[1] == i]
        plt.scatter(Xs, ys, color=v, s=1)

plot(X, y)
Note above that while the green and purple dots gather perfectly in the plot’s corners, the red and blue dots have an irregular, wave-like pattern.
Regular trees work well with classes that live within well demarcated regions, like the green and purple dots, which can be predicted by navigating a simple split point hierarchy such as
            Y > 5
         YES     NO
         ...    X > 5
              NO     YES
            Green   Purple
Given that regular trees ‘like’ data that is clustered together in either horizontal or vertical regions, we produced the red and blue dot wave pattern with the intention of throwing the learning algorithm ‘off track’, so that we can gauge the efficacy of Random Forests and Gradient Boosted Trees versus traditional trees.
Visualisation Code
The code below helps compare three different models, showing the predicted ‘regions’ for each class (in a light colour), versus the coordinates of the actual dots in the test set. Understanding the code below is not important. What matters is our discussion further down, so feel free to skip ahead.
def fit_visualise(models, names):

    plt.rcParams['figure.figsize'] = [8, 4]
    plt.rcParams['figure.dpi'] = 100
    plt.subplots_adjust(wspace=0.5)

    # Top row: predicted 'regions' (light colours) with the test dots overlaid.
    for i, m in enumerate(models):
        m.fit(X_train, y_train)

        plt.subplot(2, 3, i+1)
        plt.title("%s\nAccuracy = (%.2f)" %
                  (names[i], accuracy_score(y_test, m.predict(X_test))))
        plt.xticks(range(0, 11))
        if i == 0:
            plt.yticks(range(0, 11))
        else:
            plt.yticks([])

        if m is not None:
            classes = [{}] * 4
            for j, v in enumerate(['#ff8181', '#8181ff', '#81a781', '#a782a7']):
                classes[j] = {'colour': v, 'x': [], 'y': []}

            # Predict the class of every point on a 40x40 grid.
            for x in np.linspace(0, 10, 40):
                for y in np.linspace(0, 10, 40):
                    predicted = m.predict([[x, y]])[0]
                    classes[predicted]['x'].append(x)
                    classes[predicted]['y'].append(y)

            for c in classes:
                plt.scatter(c['x'],
                            c['y'],
                            color=c['colour'],
                            edgecolors='none',
                            marker='s',
                            s=10)
            for j, v in enumerate(['red', 'blue', 'green', 'purple']):
                Xs = [t[0][0] for t in zip(X_test, y_test) if t[1] == j]
                ys = [t[0][1] for t in zip(X_test, y_test) if t[1] == j]
                plt.scatter(Xs, ys, color=v, s=1)

    # Bottom row: normalised confusion matrix for each model.
    for i, m in enumerate(models):
        plt.subplot(2, 3, i+4)
        labels = ["r", "b", "g", "p"]
        confusion = confusion_matrix(y_test, m.predict(X_test), normalize='true')
        dfc = pd.DataFrame(confusion,
                           index=labels,
                           columns=labels)
        sns.heatmap(dfc, annot=True, cmap="YlGnBu", fmt='.2f', annot_kws={"size": 8})
        plt.ylabel('Actual')
        plt.xlabel('Predicted')
Model Comparison
In regular decision trees, max_depth is a crucial hyperparameter. Low values result in low accuracy (underfitting), whereas high values result in the decision tree memorising the training data, which is not memory efficient and generalises poorly; in other words, the resulting model is likely to suffer from high variance (overfitting).
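A quick, illustrative way to see this trade-off on our data set is to sweep max_depth and compare training versus test accuracy, reusing the split defined above; the candidate depths below are arbitrary.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train accuracy keeps climbing with depth, while test accuracy flattens
# (or drops) once the tree starts memorising the training data.
for depth in [1, 2, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(random_state=0, max_depth=depth).fit(X_train, y_train)
    print(depth,
          round(accuracy_score(y_train, tree.predict(X_train)), 3),
          round(accuracy_score(y_test, tree.predict(X_test)), 3))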
Max Depth = 3, Estimators = Default
In Gradient Boosted Trees, max_depth is 3 by default, given that each individual tree is considered a so-called ‘weak learner’. It is the tree ensemble (the evaluation of multiple weak learners) which results in an overall strong learner. Both regular decision trees and Random Forests have a variable depth by default, to be discussed further on.
Estimators are only applicable to Random Forests and Gradient Boosted Trees. We will also discuss this in a moment.
Let’s compare the results now.
dt_model = DecisionTreeClassifier(random_state=0, max_depth=3)
rf_model = RandomForestClassifier(random_state=0, max_depth=3)
gb_model = GradientBoostingClassifier(random_state=0)  # max_depth = 3 by default

fit_visualise([dt_model, rf_model, gb_model],
              ["Decision Tree", "Random Forest", "Gradient Boost"])
As expected, all models dealt well with the green and purple dots, which live in perfectly square regions.
However, for the red and blue dots, the story is different. With a shallow depth of 3, the single decision tree has no choice but to create a sweeping split point for the red and blue ‘regions’. For Random Forests, even though there are 100 estimators, such a shallow maximum depth prevents the model from capturing the nuances of the wave shape formed by the red and blue dots.
In terms of performance, the Random Forest is roughly two orders of magnitude slower to fit than the regular decision tree, while the Gradient Boosted Trees model is roughly four and a half times slower than the Random Forest.
# Performance comparison
%timeit -n 10 DecisionTreeClassifier(random_state=0, max_depth=3).fit(X_train, y_train)
%timeit -n 10 RandomForestClassifier(random_state=0, max_depth=3).fit(X_train, y_train)
%timeit -n 10 GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
1.46 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
145 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
649 ms ± 6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Max Depth = None, Estimators = Default (All Defaults)
If we don’t specify max_depth, then Gradient Boost will use 3 by default, but regular Decision Trees and Random Forests will “expand nodes until all leaves are pure or until all leaves contain less than min_samples_split samples”, as per Scikit-Learn’s documentation.
For Random Forests and Gradient Boosted Trees, n_estimators = 100 by default.
dt_model = DecisionTreeClassifier(random_state=0)
rf_model = RandomForestClassifier(random_state=0)
gb_model = GradientBoostingClassifier(random_state=0)

fit_visualise([dt_model, rf_model, gb_model],
              ["Decision Tree", "Random Forest", "Gradient Boost"])
For our simple data set, the regular decision tree now achieves accuracy levels comparable to those of the Random Forest and the Gradient Boosted Trees, but without a restricted max_depth value we can expect such a default setting to produce large, slow trees that overfit (high variance) in real-world applications.
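If you are curious how deep the unrestricted tree actually grows here, a fitted DecisionTreeClassifier exposes get_depth() and get_n_leaves(); the two-liner below is just an aside, reusing the training split from earlier.

# Inspect the size of the fully grown (unrestricted) tree.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(unrestricted.get_depth(), unrestricted.get_n_leaves())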
Performance is comparable to that seen when using max_depth = 3.
# Performance comparison
%timeit -n 10 DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
%timeit -n 10 RandomForestClassifier(random_state=0).fit(X_train, y_train)
%timeit -n 10 GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
1.94 ms ± 124 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 4.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
650 ms ± 4.65 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Max Depth = None, Estimators = 50
In Random Forests, n_estimators is a hyperparameter which specifies the number of trees in the forest, whereas in Gradient Boosted Trees it specifies the number of boosting stages to perform. In regular decision trees, well … there is only one estimator.
The reason Random Forests and Gradient Boosted Trees take so long to fit is that n_estimators is set to 100 by default. According to Scikit-Learn’s documentation, “Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance”, so some accuracy degradation might be expected if we halve this value.
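As an aside before the comparison below, Scikit-Learn’s GradientBoostingClassifier exposes staged_predict, which yields the ensemble’s predictions after each boosting stage without refitting; this makes it cheap to see how accuracy evolves as trees are added.

# Test accuracy after each boosting stage, without refitting the model.
gb = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
staged = [accuracy_score(y_test, pred) for pred in gb.staged_predict(X_test)]
print(staged[9], staged[49], staged[99])  # after 10, 50 and 100 stages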
dt_model = DecisionTreeClassifier(random_state=0)
rf_model = RandomForestClassifier(random_state=0, n_estimators=50)
gb_model = GradientBoostingClassifier(random_state=0, n_estimators=50)

fit_visualise([dt_model, rf_model, gb_model],
              ["Decision Tree", "Random Forest", "Gradient Boost"])
By halving n_estimators, we don’t see a degradation in accuracy, but, as expected, we reduce the fitting time roughly proportionally, by about 50%.
# Performance comparison
%timeit -n 10 DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
%timeit -n 10 RandomForestClassifier(random_state=0, n_estimators=50).fit(X_train, y_train)
%timeit -n 10 GradientBoostingClassifier(random_state=0, n_estimators=50).fit(X_train, y_train)
1.95 ms ± 98.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
85.5 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
327 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Max Depth = None, Estimators = 10
We now reduce the number of estimators from 100 down to 10. Please note that Scikit-Learn’s Random Forest default was 10 until version 0.22, so this reduction should not be considered ‘dramatic’.
In other words, n_estimators=10 was considered a sensible default by the Scikit-Learn team for quite a while.
dt_model = DecisionTreeClassifier(random_state=0)
rf_model = RandomForestClassifier(random_state=0, n_estimators=10)
gb_model = GradientBoostingClassifier(random_state=0, n_estimators=10)

fit_visualise([dt_model, rf_model, gb_model],
              ["Decision Tree", "Random Forest", "Gradient Boost"])
Setting n_estimators to just 10 does not degrade the accuracy of the Random Forest model, and it has the benefit of saving roughly 90% of the fitting time, but the Gradient Boosted Trees model takes a hit in accuracy.
# Performance comparison
%timeit -n 10 DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
%timeit -n 10 RandomForestClassifier(random_state=0, n_estimators=10).fit(X_train, y_train)
%timeit -n 10 GradientBoostingClassifier(random_state=0, n_estimators=10).fit(X_train, y_train)
1.9 ms ± 106 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
17.5 ms ± 457 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.3 ms ± 3.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Scikit-Learn’s Gradient Boosting vs XGBoost
While most bread-and-butter machine learning algorithms are well implemented in Scikit-Learn, Extreme Gradient Boosting (XGBoost) is often preferred in the case of Gradient Boosted Trees. For complex problems, it is quite usual to see XGBoost emerge as the ‘winner’ when an AutoML framework tries to find the best model for a given data set.
Below, we compare Extreme Gradient Boosting against Scikit-Learn’s implementation, with Random Forests included for reference.
rf_model = RandomForestClassifier(random_state=0)
gb_model = GradientBoostingClassifier(random_state=0)
xg_model = XGBClassifier(random_state=0)

fit_visualise([rf_model, gb_model, xg_model],
              ["Random Forest", "Gradient Boost", "Extreme GB"])
Here we can see that XGBoost provides just a marginal accuracy improvement over Scikit-Learn’s implementation, though bear in mind that our simple data set consists of only two numerical (non-categorical) features.
What is most interesting is the difference in performance, whereby XGBoost provides a 20-30% improvement in speed when fitting the data.
# Performance comparison
%timeit -n 10 RandomForestClassifier(random_state=0).fit(X_train, y_train)
%timeit -n 10 GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
%timeit -n 10 XGBClassifier(random_state=0).fit(X_train, y_train)
169 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
655 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
469 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
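Before concluding, here is one example of the ‘extra features’ mentioned earlier: early stopping. The sketch below assumes XGBoost 1.6 or later, where early_stopping_rounds is a constructor argument; in a real project you would hold out a separate validation set rather than reuse X_test as the evaluation set.

# Keep adding trees only while the validation metric keeps improving.
xgb_model = XGBClassifier(n_estimators=500, early_stopping_rounds=10, random_state=0)
xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print(xgb_model.best_iteration)  # the best boosting round found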
Conclusion
This tutorial introduced the two most popular decision tree ensembles, Random Forests and Gradient Boosted Trees, using regular decision trees as the baseline.
While the synthetic data set provided doesn’t do justice to the power offered by Gradient Boosted Trees, it does show that simpler models (like Random Forests) can be just as accurate as, and faster than, the ‘smarter’ alternative.
Lastly, we have not delved into the extensive number of hyperparameters available in both cases, except for max_depth, which is familiar to users of conventional decision trees, and n_estimators, which is the first ‘go to’ hyperparameter in the case of Random Forests and Gradient Boosted Trees.
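If you want to take tuning further, a cross-validated grid search over these two hyperparameters is a natural next step; the sketch below is illustrative and its candidate values are arbitrary.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Cross-validated search over the two hyperparameters discussed above.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={'n_estimators': [10, 50, 100],
                                  'max_depth': [3, 5, None]},
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)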