The Kaggle platform is a wonderful entry point to the field of data science. It offers a wide array of competitions to choose from, depending on your area of interest. Whether you are a trader, a scientist, a programmer with some domain experience, a student, a manager, an analyst, a researcher or just an enthusiast, chances are Kaggle has something interesting in store for you. Another interesting feature of the platform is that you can submit entries to past competitions just to check where you would have placed on the leaderboard. The leaderboard itself is an interesting concept that I’ve explained in a separate post. In short, it is the list of participants ranked in order of their performance.
For beginners, I’ve chosen an interesting competition that asks you to predict a person’s income using census data. The actual competition link is here (https://inclass.kaggle.com/c/competition-1-mipt-fivt-ml-spring-2015).
A person’s income always intrigues us. We try to guess the income of celebrities, bosses, colleagues and, more importantly, our potential partners from their occupation, education, houses, family and so on. So I thought it would be interesting to work on this income prediction problem with the help of R, a powerful tool, and understand which factors can effectively predict income.
In Kaggle problems (as in most practical problem-solving scenarios) it’s important to divide the dataset we have, which consists of past observations, into two parts: training data and test data. Training data is the part on which we build our models.
Once the model is built, we have to test it on a different dataset to know how it performs on data it has not seen. We test the model on the test dataset because we already have the actual outcomes for it, so it’s easy to compare the results predicted by the model with the actual results. Let’s take an example to make this point clear.
Let’s say we are analyzing a number of people for their propensity to respond to a marketing campaign aimed at a particular age group. We have data on 10,000 respondents who have in the past either responded or not responded to similar campaigns. We also have the demographic and other details of all these 10,000 people.
We now have to build our model on this data and test its performance. But where can we test the model? Can we test it on the same data on which the model was built? If we do that, our model might show 100% accuracy and we will never know whether it is overfitted. Overfitting means over-emphasizing patterns in the training data that do not generalize to other samples of the actual data.
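The point above can be illustrated with a small sketch. The tutorials on this site use R; the snippet below is written in Python purely for illustration. The 10,000 respondents are synthetic, and the 70/30 split ratio, the function names and the "memorizing" model are all assumptions made for the example, not part of any real campaign data or Kaggle API.

```python
import random

random.seed(42)  # fixed seed so the shuffle is reproducible

# Synthetic stand-in for the 10,000 respondents: (person_id, age, responded).
# The ages and the 30% response rate are made up for illustration only.
respondents = [
    (person_id, random.randint(18, 70), random.random() < 0.3)
    for person_id in range(10_000)
]


def train_test_split(rows, test_fraction=0.3):
    """Shuffle a copy of the rows, then split it into (train, test)."""
    shuffled = list(rows)
    random.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of rows for which the prediction matches the actual outcome."""
    return sum(predict(row) == row[2] for row in rows) / len(rows)


train, test = train_test_split(respondents)

# A deliberately overfitted "model": it memorizes the outcome of every
# training row and guesses "did not respond" for anyone it has not seen.
memorized = {person_id: responded for person_id, _, responded in train}


def memorizing_model(row):
    return memorized.get(row[0], False)


print("training rows:", len(train), "test rows:", len(test))
print("accuracy on training data:", accuracy(memorizing_model, train))
print("accuracy on test data:", accuracy(memorizing_model, test))
```

The memorizing model scores perfectly on the rows it was built from, yet that score says nothing about people it has never seen. Only the accuracy on the held-out test rows gives a fair picture.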
In this post I’ll briefly discuss the error metrics Kaggle uses to evaluate submissions. A problem can be a classification problem, in which we predict which class the target variable belongs to rather than an exact value, or a regression problem, in which we predict a continuous variable. For example, in the income prediction problem that we will take up next, we will predict whether a person’s income is above or below 50,000. So this is a classification problem, where we predict which of two classes the target variable falls into. If we were to predict the actual income of the person (which is a continuous variable), it would be a regression problem. Metrics commonly used by Kaggle for measuring error in regression include mean absolute error and root mean squared error. Error metrics for classification problems include mean F score, mean consequential error and multi-class log loss. Visit the Kaggle wiki for descriptions of the error metrics: https://www.kaggle.com/wiki/Metrics
A detailed discussion of each of these methods is available in the tutorials, where you will see them applied in practice.
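To make a few of the metrics named above concrete, here is a minimal sketch using their textbook definitions. It is written in Python for illustration (the tutorials themselves use R), the function names and sample numbers are invented for the example, and Kaggle's exact definitions (for instance how probabilities are clipped or how F scores are averaged across rows or classes) may differ per competition.

```python
import math


def mean_absolute_error(actual, predicted):
    """Regression: average absolute difference between actual and predicted."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Regression: square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def log_loss(actual, probabilities, eps=1e-15):
    """Two-class log loss; `probabilities` are predicted chances of class 1."""
    total = 0.0
    for y, p in zip(actual, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)


def f1_score(actual, predicted):
    """Two-class F score: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Regression view: predicting the actual income (made-up numbers).
incomes = [40000, 52000, 61000]
predicted_incomes = [42000, 50000, 65000]
print("MAE :", mean_absolute_error(incomes, predicted_incomes))
print("RMSE:", root_mean_squared_error(incomes, predicted_incomes))

# Classification view: 1 means income above 50,000, 0 means below.
labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 1]
predicted_probabilities = [0.9, 0.2, 0.4, 0.8, 0.6]
print("F1      :", f1_score(labels, predicted_labels))
print("log loss:", log_loss(labels, predicted_probabilities))
```

Note that RMSE penalizes the single large miss (4,000) more heavily than MAE does, which is why the two numbers differ on the same predictions.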
Git is a free and open-source version control system that works well for small as well as large projects. GitHub is a web-based Git repository hosting service that offers revision control as well as source code management functionality. I would recommend that everyone create a GitHub account, which will help you access and store code in a simple and efficient manner. All the code that we use in the tutorials on this site will also be made available on GitHub.
Models: these are mathematical representations of a solution to a problem.
Some frequently used functions:
Here are some useful shortcuts that will be handy while writing code in RStudio. The most important of all is navigating the console history using the up and down arrow keys. You can go back to any previously typed command by pressing the up arrow key multiple times. Ctrl+L clears the console.