Model Tuning: A Random Forest Example

Dec 04

Random forest is an ensemble model which can be visualized as a combination of multiple decision trees.

Before explaining random forest, I’ll start with two simple yet familiar examples of ensembles from everyday life:

- in a movie, a number of actors perform together to tell us a story
- in a soccer game, a number of players contribute toward the common goal of winning the match

In these examples and many others, the error of one player or actor is often absorbed by the rest of the group. That is why the average performance of all the actors or players together is much higher than that of any individual performing alone.

In simple terms, if we can combine several individual models into a superior model that works better than any of them on its own, the result is called an ensemble model.

That is how random forest works. By growing many different trees and averaging or voting their outcomes across the group, a random forest can often give better results than an individual decision tree.
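The voting idea can be illustrated with a small, hypothetical sketch in base R (the data here is made up for illustration): three imperfect "models" each misclassify a different case, but a majority vote across them recovers the true labels.

```r
# Hypothetical 0/1 target values for eight cases
truth <- c(1, 0, 1, 1, 0, 1, 0, 0)

# Three imperfect "models"; each one is wrong on a different single case
model_a <- c(1, 0, 1, 0, 0, 1, 0, 0)  # wrong on case 4
model_b <- c(1, 1, 1, 1, 0, 1, 0, 0)  # wrong on case 2
model_c <- c(1, 0, 0, 1, 0, 1, 0, 0)  # wrong on case 3

# Majority vote: a case is predicted 1 if at least 2 of the 3 models say 1
ensemble <- as.integer((model_a + model_b + model_c) >= 2)

mean(model_a != truth)   # individual error rate: 0.125
mean(ensemble != truth)  # ensemble error rate: 0
```

Because the models make their mistakes on different cases, the vote of the other two overrules each individual error, which is exactly the intuition behind averaging many trees.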

To learn how the individual decision trees are grown to carry out this voting, you can visit the post on how decision trees work. Here I’ve taken a simple example to illustrate the use of random forest in R.

We will apply the model to unseen data (other than the training data) for which we know the actual values of the target variable, and then compare the model’s predictions with those actual values to calculate the accuracy.

The data set below is a subset of the income prediction problem data. We will train a random forest model on the train data and make predictions on the test data. Please note that we created the test data from the train data values, so we know the actual Income values.

The test data looks like below:

You may have noticed that the test data we have chosen is part of the original sample, so the data set already has the values for income. We will add another column of predicted income values to this data set.

We’ll fit the random forest with three different values of the ntree parameter to predict the income. We’ll then calculate the absolute difference between the predicted and actual values using the abs function to find the error in the predictions. This tells us how far our predicted values deviate from the actual data.
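The error calculation itself is simple enough to show on toy values first (the vectors here are hypothetical, not taken from the income data): with both columns coded as 0/1, abs gives 1 exactly where a prediction is wrong, so the mean is the misclassification rate.

```r
# Hypothetical actual and predicted income classes, coded as 0/1
actual    <- c(0, 1, 1, 0, 1)
predicted <- c(0, 1, 0, 0, 1)

# abs() is 1 where the prediction differs from the actual value, 0 where it matches
difference <- abs(actual - predicted)
difference        # 0 0 1 0 0

# The mean of the per-row errors is the fraction of wrong predictions
mean(difference)  # 0.2
```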

Below are the steps that we’ll follow:

```r
library(randomForest)

train1 <- combi[which(combi$id > 0), ]
trainmodel <- randomForest(Income ~ ., data = train1, ntree = 25)
prediction <- predict(trainmodel, train1, type = "class")
train1 <- cbind(train1, prediction)

# We’ll change 50000+ to 1 and -50000 to 0 in the prediction and Income columns
train1$prediction <- as.integer(train1$prediction) - 1
train1$Income <- as.integer(train1$Income) - 1

# We’ll find the absolute difference between predicted values and actual values
train1$difference <- abs(train1$Income - train1$prediction)

# The mean of the error over all rows gives us an average error value
mean(train1$difference)
```

We can repeat the code above on the same data set with different values of ntree, taking ntree as 50 and then 100. Comparing the average error (the mean of train1$difference), we see that for ntree = 100 the average is 0.25, for ntree = 50 it is 0.3, and for ntree = 25 it is 0.

That means that for this data set, taking ntree = 25 gives a perfect prediction on our test data.
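Rather than copying the code three times, the ntree comparison can be written as a loop. The combi data from the post isn’t reproduced here, so this sketch uses R’s built-in iris data set with the randomForest package purely to show the pattern; the error values it prints apply to iris, not to the income data.

```r
library(randomForest)

set.seed(42)
for (n in c(25, 50, 100)) {
  model <- randomForest(Species ~ ., data = iris, ntree = n)
  pred  <- predict(model, iris, type = "class")
  err   <- mean(pred != iris$Species)
  cat("ntree =", n, " average error =", err, "\n")
}
```

Note that, like the original code, this evaluates the model on the same rows it was trained on, so the reported error is optimistic; predicting on genuinely held-out rows gives a fairer comparison between ntree values.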