Techniques at a glance
Before I explain the advantages of classification models, I would like to start with the example of a political system used to elect the representatives who make laws and run a country. All modern Western-style democracies are representative democracies: groups of people elect officials, who are in turn responsible for making laws. There can also be direct democracies, in which people decide on policy initiatives directly. Direct democracy is arguably the purest form of representation, since people make their own laws, but it also demands far more resources, because a referendum has to be held whenever a policy decision is to be made. Likewise, classification models make decisions through a system of voting. Instead of constituencies, they have bins. Each bin has its own voting pattern, determined by its constituents, which differs from the overall voting pattern of the complete data set. And instead of voting on policies, these bins vote to predict the value of the target variable.
All the classification models that we’ll explore next are based on some modification of this basic idea. Limited availability of data may not always leave us free to try out the fancier models in our analysis. Some models work very well with large datasets but may not be as useful when applied to a small set of data.
To explain decision trees in simple terms, I’ll take the example of a person looking for a burger house. Let’s say you are that person and you have a smartphone assistant that works on a decision tree. When you ask the assistant for the best possible burger house, it follows an algorithm like this:
In the past you’ve mostly visited burger houses within 2 miles of your neighborhood, so it looks for something within 2 miles. You’ve preferred French burgers 80% of the time, so it looks for a place serving French burgers. 90% of the time you’ve gone for a restaurant in the medium price range, so it looks for a restaurant in that price range. It also knows that on most past occasions you’ve visited a burger house with a group. Since you are currently in Midtown Manhattan with two of your friends and you still want a French burger, it uses this past analysis to suggest a place called Brasserie. You’re impressed.
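To make the assistant’s chain of rules concrete, here is a minimal sketch in Python. The function name, the field names and every restaurant attribute (including Brasserie’s distance and price) are assumptions made up for illustration; only the four rules come from the story above.

```python
def recommend(restaurants, max_miles=2.0, cuisine="French",
              price="medium", group_friendly=True):
    """Walk through the learned preferences one by one, like a path
    down a decision tree, and return the first place that passes all."""
    for r in restaurants:
        if r["miles"] > max_miles:
            continue  # rule 1: stay within the usual distance
        if r["cuisine"] != cuisine:
            continue  # rule 2: preferred cuisine
        if r["price"] != price:
            continue  # rule 3: usual price range
        if r["group_friendly"] != group_friendly:
            continue  # rule 4: you usually go with a group
        return r["name"]
    return None  # nothing satisfies every rule


# Hypothetical candidates; all attributes are invented for this example.
restaurants = [
    {"name": "Hypothetical Far Bistro", "miles": 2.5, "cuisine": "French",
     "price": "medium", "group_friendly": True},
    {"name": "Hypothetical Diner", "miles": 0.5, "cuisine": "American",
     "price": "low", "group_friendly": True},
    {"name": "Brasserie", "miles": 1.2, "cuisine": "French",
     "price": "medium", "group_friendly": True},
]

print(recommend(restaurants))
```

Notice that the bistro at 2.5 miles is rejected by the very first rule, however good it might be. The rules are rigid.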
But what if any of these basic assumptions fails in a slightly different context? Maybe the reason you haven’t travelled more than 2 miles in the past was heavy traffic. What if you’re not in the mood for a French burger and are willing to travel a few extra miles to try something different? The model will not take these circumstances into consideration at all. What it just did was overfit: it looked at every piece of information from your past and used all of it to make a decision. To make a better judgement, it should also weigh your preference for saving travel time against how much you want to try something different. To carry out such a weighted analysis, what you need is not just one smartphone assistant but a number of them, each working on a different set of data points.
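The idea of several assistants, each looking at a different set of data points and then voting, can be sketched roughly as follows. This is only an illustration: drawing each assistant’s data by sampling with replacement is my assumption about how the “different sets” are formed, and the toy assistant and the visit history are invented.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the most common prediction (ties go to the one seen first)."""
    return Counter(predictions).most_common(1)[0][0]


def bootstrap_sample(records, rng):
    """Draw len(records) records at random, with replacement."""
    return [rng.choice(records) for _ in records]


def simple_assistant(sample):
    """Toy 'assistant': recommends the cuisine it saw most often."""
    return majority_vote([visit["cuisine"] for visit in sample])


# Hypothetical history: 8 French visits and 2 American ones.
past_visits = [{"cuisine": "French"}] * 8 + [{"cuisine": "American"}] * 2

rng = random.Random(0)  # fixed seed so the run is repeatable
votes = [simple_assistant(bootstrap_sample(past_visits, rng))
         for _ in range(25)]
recommendation = majority_vote(votes)
print(Counter(votes), recommendation)
```

Each assistant sees a slightly different slice of your history, so no single quirk of the past gets to decide everything.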
As discussed, decision trees do their voting in binary terms. They divide the data points into segments based on questions that can be answered with yes or no, and the segments are formed so that either yes or no dominates in each one. Suppose we are trying to predict whether a person’s income is more than 50K. We start with a training data set in which 40% of the people have an income above 50K, and keep dividing it into smaller segments based on what we know about these people. We might first check whether they hold a higher degree such as a master’s or a PhD, then what industry they belong to, then their work profile, age group, family size and so on. Based on all these variables we create bins such as, say, people older than 45 who have a PhD and work in the financial sector. Let’s say that in this particular group 92% of the people have an income above 50K. The bin is far purer than the full data set, and this reduction in uncertainty about the target is what we call information gain.
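Information gain is usually computed as the drop in entropy from the parent data set to the weighted average of its segments. Here is a small sketch using the 40% and 92% figures from the example; the head counts (1000 people in total, 100 in the bin) are assumptions I made up so that the arithmetic has something to work with.

```python
from math import log2


def entropy(p):
    """Binary entropy (in bits) of a group where a share p is positive."""
    if p in (0.0, 1.0):
        return 0.0  # a pure group carries no uncertainty
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(parent_pos, parent_n, children):
    """Entropy of the parent minus the size-weighted entropy of the children.

    children is a list of (positives, size) pairs that partition the parent.
    """
    parent_h = entropy(parent_pos / parent_n)
    weighted_h = sum(n / parent_n * entropy(pos / n) for pos, n in children)
    return parent_h - weighted_h


# Assumed counts: 1000 people, 400 of them (40%) earn more than 50K.
# The bin holds 100 people, 92 of them (92%) above 50K; the other 900
# people therefore contain the remaining 308 high earners.
gain = information_gain(400, 1000, [(92, 100), (308, 900)])
print(f"information gain: {gain:.4f} bits")
```

A split that leaves every segment at the same 40% would score a gain of zero, which is why the tree prefers questions that produce lopsided bins.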
A major problem with decision trees is that they follow a greedy approach. While a tree creates segments or bins to maximize the concentration of a particular target value, it goes first for the variables that give the best split at that step. In this case it might isolate people with PhDs first to get a good split. But it may still miss the dropouts with excellent academic records who chose to pursue their dreams at an early age. In that group, 100% of the people might fall in the 50K-plus income category. We may never find out, because decision trees do not consider all possible scenarios.
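The following sketch shows what “greedy” means in practice: at each step the tree scores one variable at a time and picks the best. The data set is deliberately contrived (the variable names `x1`, `x2` and the target `y` are placeholders, not from the income example): each variable alone tells us nothing, yet the two together predict the target perfectly.

```python
from math import log2


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def split_gain(rows, features, target="y"):
    """Information gain from splitting on the joint values of `features`."""
    groups = {}
    for row in rows:
        key = tuple(row[f] for f in features)
        groups.setdefault(key, []).append(row[target])
    parent = [row[target] for row in rows]
    weighted = sum(len(g) * entropy(g) for g in groups.values()) / len(rows)
    return entropy(parent) - weighted


def greedy_best_feature(rows, features, target="y"):
    """One greedy step: score each feature on its own and keep the best."""
    return max(features, key=lambda f: split_gain(rows, [f], target))


# Contrived data: y is 1 exactly when x1 and x2 differ.
rows = [
    {"x1": 0, "x2": 0, "y": 0},
    {"x1": 0, "x2": 1, "y": 1},
    {"x1": 1, "x2": 0, "y": 1},
    {"x1": 1, "x2": 1, "y": 0},
]

print(split_gain(rows, ["x1"]), split_gain(rows, ["x2"]))  # one at a time
print(split_gain(rows, ["x1", "x2"]))                      # both together
```

Looking at either variable alone, the greedy step sees no reason to split at all, even though a pure segmentation exists. Real data is rarely this extreme, but the same blind spot is what can hide a small, very pure group.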
If you have followed the discussions on Kaggle or other analytics communities, you might already be familiar with random forest. It is an advanced way of using decision trees that overcomes some of their shortcomings.
Random forest, as the name suggests, is a combination of multiple trees. Each tree is based on
cforest has a few advantages over randomForest, one being that it can handle factors with more levels. As we’ll see in our solved examples, this can address a real challenge when dealing with variables that have more than 53 levels. Though we sometimes have the flexibility to reduce the number of levels through feature engineering, that might not always be possible.
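One common way to reduce the number of levels, when it is acceptable for the problem, is to keep the most frequent levels and lump the rest into a single catch-all. The sketch below is in Python purely for illustration; the function name, the `"Other"` label and the sample data are all assumptions, and it is not necessarily the approach used in the solved examples.

```python
from collections import Counter


def lump_rare_levels(values, max_levels, other_label="Other"):
    """Keep the (max_levels - 1) most frequent levels and map every
    remaining level to other_label, so at most max_levels remain.

    Assumes other_label is not already one of the levels.
    """
    counts = Counter(values)
    if len(counts) <= max_levels:
        return list(values)  # already within the limit
    keep = {level for level, _ in counts.most_common(max_levels - 1)}
    return [v if v in keep else other_label for v in values]


# Hypothetical variable with 60 levels; level i appears (60 - i) times.
values = [f"level_{i}" for i in range(60) for _ in range(60 - i)]
lumped = lump_rare_levels(values, max_levels=53)
print(len(set(values)), "levels before,", len(set(lumped)), "levels after")
```

The price is that the rare levels can no longer be told apart, which is exactly why this kind of reduction is not always possible.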