- Setting current directory
- Loading data sets
- Working with datasets
- Summary view of data
- Finding missing data
- Replacing missing data
- Modifying variables like date etc.
- Combining and separating data sets
- Handling factor values
Part 1 - Starting the data exploration for Income Prediction
In the previous posts we covered installation, loading packages, useful shortcuts, loading data, cross-validation techniques and some basic functions. Now it’s time to roll up your sleeves and get started with an actual problem. I chose to start with an income prediction problem because it’s one that many analysts face when solving analytical problems in marketing, economics, finance and so on.
I’ll kick-start the problem-solving process by slicing and dicing the data. Then we’ll apply a few models to find out whether models make a better prediction than our own algorithms. Our aim is to try out various optimizations to derive meaning out of the data and make predictions. I’ll show you how you can land in the top 3 slots of a Kaggle competition with the help of a few powerful functions and models. In future tutorials I’ll demonstrate how we can use feature engineering to improve the performance of our models and how to select the best model for a given data set.
Though I’ll demonstrate dozens of models over the next few tutorials, analysts must realize that models are just tools, and they can be used effectively only when the analyst has a command of the data and of statistics. The more powerful models are usually less interpretable, so it’s best to start our analytics journey with a manual analysis that gives us a better understanding of the data. Never underestimate the power of the human element in machine learning.
I’ll try to explain and apply statistical concepts in the simplest manner throughout this tutorial and I’d welcome your views in the form of queries or suggestions to improve the post. You can also post any bugs or typos that you detect. All types of questions are welcome.
How to get started?
As with most Kaggle competitions, we’ll download 3 files here: one with training data, in which values of the target variable are available; one with test data, for which the target variable is to be predicted; and one for submission. To learn more about training and test data and the concept of overfitting you may read this post.
What should we look for in the data?
In analyzing the data the four most important parameters that will determine the prediction are: sufficiency, accuracy, predictability and exclusivity. So our goal is to find out the most important variables for making a prediction and how to create new variables or combine existing variables to achieve sufficiency, accuracy, predictability and exclusivity.
As we will see in the discussions ahead, predictions are seldom perfectly accurate. There is always a degree of inaccuracy. That’s why improving the quality of predictions by even the slightest bit may at times result in millions of dollars of value. Predictive analytics is all about increasing the accuracy of our predictions by various means.
Accuracy is measured through validation techniques. Predictability is a subjective factor that requires human intervention.
Sufficiency and exclusivity are measured through statistical techniques and can be improved by various optimization techniques. Let’s say education and occupation are two different fields that each indicate who is a high-income person with 90% certainty. But if, on comparing the results predicted by these two variables, we find that 90% of the people with proper ‘education’ and higher income also have a decent ‘occupation’, we can no longer consider ‘education’ and ‘occupation’ exclusive predictors. This means that if we are already using ‘occupation’ for our prediction, adding ‘education’ might not improve the result much. In statistical terms this overlap between predictors is known as multicollinearity, and we’ll deal with it in later discussions.
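To make the idea of overlapping predictors concrete, here is a minimal sketch on a tiny made-up data set. The data frame toy and its values are invented for illustration and are not taken from the Kaggle files; a cross-tabulation is only one simple way to look at overlap, not a full multicollinearity check.

```r
# Toy data, invented for illustration only (not the Kaggle income data).
toy <- data.frame(
  education  = c("high", "high", "high", "high", "low",
                 "low", "low", "low", "high", "low"),
  occupation = c("good", "good", "good", "good", "poor",
                 "poor", "poor", "poor", "poor", "good"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two predictors. Large counts in the high/good and
# low/poor cells mean the two variables mostly carry the same information.
overlap <- table(toy$education, toy$occupation)
print(overlap)

# Share of rows in which the two variables "agree":
# high education with a good occupation, or low education with a poor one.
agreement <- mean((toy$education == "high") == (toy$occupation == "good"))
print(agreement)
```

When the agreement is close to 1, the second variable adds little that the first does not already tell us.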
Sufficiency, on the other hand, measures whether a particular variable is useful for making a prediction, either individually or in conjunction with some other variable. If we find on analysis that very few of the variables in their current form can predict the outcome with the desired level of certainty, the quality of our prediction will be compromised. In such cases we explore ways to increase sufficiency by combining multiple variables. The interplay of multiple variables makes predictive analytics all the more interesting.
Here’s a simple example: let’s say we can divide the data into males and females, and that all males with a certain education earn more and all females with a certain occupation earn more. In this case education gives us a prediction only for males and occupation only for females, but combining the two variables gives a prediction for the complete data. This is how we achieve sufficiency.
Then there is predictability. Some data give us no concrete information about the target variable. Marital status, for example, may tell us very little about income: people get married irrespective of whether they earn less or more than 50000. So marital status is a field with low predictive value. Yet when we delve deeper into this variable, for instance by finding the age at marriage, we may find that people who marry later earn more. The reason could be that these people devoted more time to their careers before getting married. However accurate your prediction is, there is always room to improve the predictability of variables for more accuracy.
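The male/female example can be sketched in a few lines of R. The data frame toy, its column names and its six rows are all made up to match the story in the text; they are assumptions for illustration, not the real data.

```r
# Made-up data: for males education decides income,
# for females occupation decides income.
toy <- data.frame(
  sex         = c("male", "male", "male", "female", "female", "female"),
  education   = c("degree", "none", "degree", "none", "degree", "none"),
  occupation  = c("clerical", "managerial", "clerical",
                  "managerial", "clerical", "managerial"),
  high_income = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Prediction from each variable on its own.
pred_education  <- toy$education == "degree"
pred_occupation <- toy$occupation == "managerial"

# Combined rule: use education for males and occupation for females.
pred_combined <- ifelse(toy$sex == "male", pred_education, pred_occupation)

# Fraction of rows each rule gets right.
acc_education  <- mean(pred_education == toy$high_income)
acc_occupation <- mean(pred_occupation == toy$high_income)
acc_combined   <- mean(pred_combined == toy$high_income)
print(c(education = acc_education,
        occupation = acc_occupation,
        combined = acc_combined))
```

In this toy data each variable alone is right for only half of the rows, while the combined rule covers the complete data.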
As this guide is for people with zero experience in machine learning, before starting the analysis of the data I’ll introduce some basic functions that we will use throughout the discussions. If you already know these functions, click here to go directly to the data analysis.
Reading and writing data using R
We shall start with reading the data from the file. Here I will introduce a few basic functions that will help you read data from files in csv format and use it for analysis.
I’ll start by finding out my current working directory. The getwd() function returns the absolute path of the current working directory of the R process.
In this case the path it returns is that of the ‘Documents’ folder.
Now let’s change the current directory to the Problem Solving folder on the Desktop, where I’ve saved the files downloaded from Kaggle. The setwd function does this; it takes the path of the target folder as its argument.
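As a rough sketch of these two steps: the folder below is a hypothetical stand-in created inside R's temporary directory so the example runs on any machine. On your own computer you would pass the real path of your ‘Problem Solving’ folder to setwd instead.

```r
# Remember where we started so we can return afterwards.
original_dir <- getwd()
print(original_dir)

# Hypothetical stand-in for the 'Problem Solving' folder, created inside
# R's temporary directory (an assumption made for this example).
project_dir <- file.path(tempdir(), "Problem Solving")
dir.create(project_dir, showWarnings = FALSE)

# Change the working directory and confirm the change.
setwd(project_dir)
current_name <- basename(getwd())
print(current_name)

# Go back to the original directory.
setwd(original_dir)
```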
Now that we have set the directory to the folder containing the training and test data, let’s read the data. To read a csv file we’ll use the read.csv function. You can store the result in a new object called train, and the test data in an object called test. Whatever modifications you make to the data stored in train, the original data in the csv file will not be affected.
train <- read.csv("train.csv")
test <- read.csv("test.csv")
Once we have created a data frame or any other data object, we can write it to a file inside our current directory. With write.csv the data is written to a csv file. We will use this function when we have predictions for the test data to write to the submission file. For now, here is just the syntax:
write.csv(train, file = "MyData.csv")
where train is an object where we have stored a data frame in the previous step. You can see that in the ‘Problem Solving’ folder another file named MyData.csv has been created.
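Here is a small round-trip sketch of writing a data frame and reading it back. The data frame demo and the output path are made up for illustration. The row.names = FALSE argument is an addition to the call shown above: without it, write.csv also writes the row numbers as an extra first column.

```r
# Small made-up data frame standing in for 'train'.
demo <- data.frame(
  id     = 1:3,
  income = c("<=50K", ">50K", "<=50K"),
  stringsAsFactors = FALSE
)

# Hypothetical output path inside R's temporary directory.
out_file <- file.path(tempdir(), "MyData.csv")

# row.names = FALSE stops write.csv from adding a column of row numbers.
write.csv(demo, file = out_file, row.names = FALSE)

# Read the file back to check what was written.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```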
Once you have created a data frame you might need to drop a column. This is required especially when you are merging two data frames and one of them has additional columns. In that case you may use the function c inside the [ ] operator with a minus sign.
For example, x <- x[, -c(2, 10)] removes the 2nd and 10th columns from the data frame x. Note that the result has to be assigned back to x; the [ ] operator on its own returns a new data frame and leaves x unchanged.
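A minimal sketch of dropping columns, using a made-up four-column data frame (so the 2nd and 4th columns are dropped here rather than the 2nd and 10th). Dropping by name is shown as an alternative that keeps working if the column order changes.

```r
# Made-up data frame with four columns.
x <- data.frame(a = 1:2, b = 3:4, c = 5:6, d = 7:8)

# Drop columns by position: negative indices exclude columns 2 and 4.
x_pos <- x[, -c(2, 4)]
print(names(x_pos))

# Drop the same columns by name instead of by position.
x_name <- x[, !(names(x) %in% c("b", "d"))]
print(names(x_name))
```

If only one column would remain, add drop = FALSE inside the brackets to keep the result a data frame rather than a plain vector.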
To add a column to a data frame we need the values that will go into that column. We can have these values stored in a vector, or we can create the column from an existing one. I’ll show you both ways. In this example I’ll create a column to store an additional variable derived from the given data.
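Both ways can be sketched as follows. The data frame people and its columns are made up for illustration, and the 40-hour threshold is an arbitrary assumption, not something taken from the data.

```r
# Made-up data frame.
people <- data.frame(
  age            = c(25, 40, 58),
  hours_per_week = c(40, 50, 20)
)

# Way 1: add a column from a vector of the same length as the data frame.
people$id <- c(101, 102, 103)

# Way 2: derive a new column from an existing one.
# Assumption for this example: 40 or more hours per week counts as full time.
people$full_time <- people$hours_per_week >= 40

print(people)
```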
As we have already specified the current directory, let’s find out which files are in it. For that we use the dir() command.
dir()
"sampleSubmission.csv" "test.csv" "train.csv"
This means there are 3 different files in the directory.
View the data columns