Learn Random Forest using Excel


Quick facts about Random Forest

  • Random forest algorithm consists of a random collection of decision trees
  • Random subset of training data provided to each decision tree
  • Bagging or bootstrap aggregating is used. It’s a general procedure that can be used to reduce the variance of algorithms that have high variance
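  • Generally stronger at classification than at regression: a regression forest averages values seen in training, so it cannot predict outside the range of the training targets
  • Works largely as a black box: apart from the input data and a few settings, such as the number of trees, there is little you can control about what happens inside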
  • Maintains accuracy, even when data is missing
  • Can handle large datasets with a large number of attributes

What is a Random Forest?

You have probably guessed the answer already if you have learned about decision trees. Just as a forest is a collection of trees, a random forest is a collection of decision trees. Decision trees that are grown very deep often overfit the training data, so their predictions can change a lot after even a small change in the input data. Because they are so sensitive to the specific data on which they are trained, they are prone to errors on test data sets. A random forest grows many such trees, each trained on a different random sample of the training data, and combines their outputs, which reduces the variance. To classify a new object, pass its input vector down each of the trees in the forest. Each tree gives a classification (a vote), and the forest chooses the class with the most votes (the mode); for regression, the forest returns the average of the trees’ predictions.

Fig: Working of Random Forest. The final classification value is the average (or mode) of the many decision trees.

How are trees grown in a Random Forest?

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. To see what bootstrap means, suppose we have a sample of 50 data values. A mean calculated from such a small sample is an uncertain estimate. We can take a large number of random samples of the data with replacement, each the same size as the original, calculate the mean of each, and then average those means; the spread of the resampled means shows how much the estimate varies from sample to sample. Bagging applies the same idea to models: training many trees on a single training set would give strongly correlated trees, whereas training each tree on a different bootstrap sample de-correlates them, so averaging them reduces variance.
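The bootstrap idea for the mean can be tried out in a few lines of Python. This is only a sketch: the 50 values are synthetic (generated here for illustration, not taken from the worksheet), and the function name `bootstrap_means` and the choice of 1000 resamples are assumptions.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n values WITH replacement: some values repeat, others are left out.
        resample = [rng.choice(data) for _ in range(n)]
        means.append(statistics.fmean(resample))
    return means

# Hypothetical sample of 50 values (synthetic, for illustration only).
data_rng = random.Random(42)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

means = bootstrap_means(sample)
bootstrap_estimate = statistics.fmean(means)  # average of the resample means
spread = statistics.stdev(means)              # variation between resamples

print(f"sample mean:             {statistics.fmean(sample):.3f}")
print(f"average of resample means: {bootstrap_estimate:.3f}")
print(f"spread of resample means:  {spread:.3f}")
```

The average of the resample means stays close to the plain sample mean; the useful extra information is the spread, which indicates how much a single estimate could move if the data were slightly different.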

Random Forest is the same as the original bagging algorithm, with one difference. When a tree is grown, each split is chosen from a random subset of the input features rather than from all of them. Without this restriction, a few strong features would be picked near the top of almost every tree, and the trees would look alike even though they were trained on different bootstrap samples. Restricting the features at each split reduces the correlation among the trees further.

If all of the previous material seems daunting to you, worry not. Let us look at a very simple example to see what it all means.

Fig: Data and Scatterplot

From the scatter plot we can see that no single split partitions the data cleanly into two halves (as we did in the decision tree). So, the idea here is to train multiple trees and then take the mean (or mode) of all the predictions.

Let us take three tree splits as follows:
Model 1: X1<9
Model 2: X1<6
Model 3: X2>9

Let’s see what each tree would predict for the case (8, 6).

Model 1 predicts 0

Model 2 predicts 1

Model 3 predicts 1

Using Model 1 alone gives us the wrong answer. But if we take the majority of the predictions of all three, we get the right answer. Let’s look at another example, say (9, 17).

Model 1 predicts 0
Model 2 predicts 1
Model 3 predicts 0

In this case, Model 2 predicts wrong. But again, taking the majority gives the correct answer. We see that each of the trees fails for some cases, but the combination (the forest) gives the correct answer in both of these examples. This is the idea of random forests: combining the predictions of multiple trees.
(Please refer to the section on decision trees and the Excel worksheet for the detailed calculation of each tree.)
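The majority vote in the two examples can be written out as a short Python sketch. The per-tree predictions are copied from the text above; the function name `majority_vote` is just an illustrative choice.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Predictions of Model 1, Model 2 and Model 3, as listed in the text.
votes_for_8_6 = [0, 1, 1]    # case (8, 6)
votes_for_9_17 = [0, 1, 0]   # case (9, 17)

print(majority_vote(votes_for_8_6))    # 1
print(majority_vote(votes_for_9_17))   # 0
```

Using an odd number of trees for a two-class problem avoids tied votes.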

Let us summarize the steps in classification or regression using random forests. Suppose we have a training set X = {x₁, x₂, …, xₙ} with class labels (or, for regression, target values).

  1. First, we draw B bootstrap samples from the original data, each by sampling at random with replacement. Each sample functions as the training set for growing one tree.
  2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
  3. The decision/regression trees (let’s say functions Fi, one per bootstrap sample) are each trained on their own sample. Each tree is grown to the largest extent possible, without pruning.
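The three steps above can be sketched in plain Python. This is a deliberately simplified illustration, not a full implementation: each "tree" here is a single split (a stump) rather than a fully grown tree, so the random feature subset is drawn once per tree instead of at every node. The toy data and all function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """Best single split 'row[f] < t', searching only the given features."""
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] < t]
            right = [lab for row, lab in zip(X, y) if row[f] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    if best is None:  # no valid split: fall back to a constant prediction
        label = majority(y)
        return (0, float("inf"), label, label)
    return best[1:]

def fit_forest(X, y, n_trees=25, m=1, seed=0):
    rng = random.Random(seed)
    n, M = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample of size n, drawn with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # Step 2: m of the M features, chosen at random for the split.
        feats = rng.sample(range(M), m)
        # Step 3 (simplified): one split per tree instead of a full tree.
        forest.append(fit_stump(Xb, yb, feats))
    return forest

def predict(forest, row):
    votes = [left if row[f] < t else right for f, t, left, right in forest]
    return majority(votes)

# Hypothetical toy data: two well-separated groups in two features.
X = [(1, 2), (2, 1), (2, 3), (3, 2), (7, 8), (8, 7), (8, 9), (9, 8)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict(forest, (1, 1)), predict(forest, (10, 10)))
```

An individual stump can be poor, for example when its bootstrap sample happens to contain only one class, but the vote over 25 stumps is far more reliable than any single one.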

Prediction in Random Forest

If it is a regression problem, the prediction for a test sample xt is the mean of the predictions of all B trees:

f(xt) = (1/B) × [F1(xt) + F2(xt) + … + FB(xt)]
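As a small illustration of the regression case, the sketch below averages the outputs of three stand-in trees. The functions F1, F2 and F3 are hypothetical piecewise-constant rules written by hand, not trees learned from the worksheet data.

```python
from statistics import fmean

def forest_predict(trees, x_t):
    """Regression forest prediction: the mean of all tree predictions for x_t."""
    return fmean([tree(x_t) for tree in trees])

# Hypothetical stand-ins for trained regression trees F1, F2, F3.
F1 = lambda x: 10.0 if x < 5 else 20.0
F2 = lambda x: 12.0 if x < 4 else 22.0
F3 = lambda x: 8.0 if x < 6 else 18.0
trees = [F1, F2, F3]

print(forest_predict(trees, 7))    # (20 + 22 + 18) / 3 = 20.0
print(forest_predict(trees, 4.5))  # (10 + 22 + 8) / 3
```

Each stand-in tree can only output one of two constants, while the average can take values in between, which is how a forest produces smoother predictions than any single tree.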

If it is a classification problem, the class predicted by the majority of the trees is taken.

How many trees does a Random Forest need?

The number of samples/trees B typically ranges from a few hundred to several thousand, depending on the size and nature of the training set. It can also be chosen using cross-validation, or by observing the out-of-bag error. The out-of-bag error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample.
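The out-of-bag error can be sketched as follows for a classification problem. This is a minimal illustration under stated assumptions: `fit` is any function that takes a bootstrap sample and returns a predictor, and `fit_midpoint` is a hypothetical stand-in for a real tree learner that only works on this one-feature toy data, where class 0 has the smaller values.

```python
import random
from collections import Counter

def oob_error(X, y, fit, n_trees=50, seed=0):
    """Fraction of training samples misclassified by their out-of-bag votes."""
    rng = random.Random(seed)
    n = len(X)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (indices)
        in_bag = set(bag)
        model = fit([X[i] for i in bag], [y[i] for i in bag])
        for i in range(n):
            if i not in in_bag:                      # sample i is out-of-bag here
                votes[i][model(X[i])] += 1
    scored = [i for i in range(n) if votes[i]]       # samples with at least one vote
    if not scored:
        raise ValueError("no sample was ever out-of-bag; use more trees")
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)

def fit_midpoint(Xb, yb):
    """Hypothetical learner: split feature 0 halfway between the two class means."""
    zeros = [row[0] for row, lab in zip(Xb, yb) if lab == 0]
    ones = [row[0] for row, lab in zip(Xb, yb) if lab == 1]
    if not zeros or not ones:          # bootstrap sample contains a single class
        only = yb[0]
        return lambda row: only
    t = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda row: 0 if row[0] < t else 1

# Hypothetical toy data with one feature.
X = [(1,), (2,), (3,), (4,), (11,), (12,), (13,), (14,)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

err = oob_error(X, y, fit_midpoint)
print(f"out-of-bag error: {err:.3f}")
```

Because every sample is left out of roughly a third of the bootstrap samples, this estimate comes almost for free during training and does not need a separate validation set.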

Number of features?

The number of features considered at each split point (m) must be supplied to the algorithm. The following rules of thumb give a recommended starting value for m, where p is the total number of features (called M in the steps above).

  1. For classification problems with p features, m = √p.
  2. For regression problems with p features, m = p/3.
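The two rules of thumb are easy to compute. In the sketch below, rounding down and never going below 1 are assumptions made for illustration (the rules above do not say how to round), and `default_m` is a hypothetical helper name.

```python
import math

def default_m(p, task):
    """Rule-of-thumb number of features to try at each split."""
    if task == "classification":
        return max(1, math.isqrt(p))   # floor of sqrt(p), at least 1
    if task == "regression":
        return max(1, p // 3)          # floor of p/3, at least 1
    raise ValueError("task must be 'classification' or 'regression'")

print(default_m(16, "classification"))  # 4
print(default_m(12, "regression"))      # 4
print(default_m(2, "regression"))       # 1
```

These values are only starting points; m can be tuned like the number of trees, for example by comparing out-of-bag error for a few candidate values.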

Why are Random Forests used?

Random forests are very widely used because they have some very desirable properties. First of all, they reduce the overfitting problem that plagues single decision trees. They are often among the most accurate general-purpose algorithms and can run on very large datasets. They also have an effective method for estimating missing data and can maintain accuracy when a large proportion of the data is missing.

Pros and Cons of Random Forest:

👍

  • As we mentioned earlier a single decision tree tends to overfit the data. The process of averaging or combining the results of different decision trees helps to overcome the problem of overfitting.
  • Random forest also has less variance than a single decision tree. This means that it works correctly for a larger range of data items than a single decision tree.
  • Random forests are flexible: the same method handles both classification and regression, and it is often highly accurate.
  • They also require little preparation of the input data. For example, you do not have to scale the features.
  • They also maintain accuracy even when a large proportion of the data is missing.

👎

  • The main disadvantage of random forests is their complexity. They are much harder and more time-consuming to construct than single decision trees.
  • They also require more computational resources and are less intuitive. When you have a large collection of decision trees, it is hard to have an intuitive grasp of the relationships existing in the input data.
  • In addition, prediction with a random forest is more time-consuming than with a single decision tree, because every tree in the forest has to be evaluated.