Even though the data Kaggle provides is fairly tidy, it's worth doing some basic formatting and editing, such as renaming columns and removing redundant fields, to make it more useful for our analysis. But before that, I'll show you how to make a submission on Kaggle and check your score and rank.
Starting with a Sample submission
Making a sample submission will give you a basic idea of what we are required to submit and how the submission file should be prepared. We have downloaded the three files test.csv, train.csv and samplesubmission.csv from Kaggle for predicting the income of a set of people. We'll start our journey with a dummy submission now, and as we proceed with our analysis you'll notice big gains in the accuracy of our predictions.
To make the first submission, I'll populate the sample submission file with randomly generated values. We've already stored the test data in a test object using:
test <- read.csv("test.csv")
We’ll now populate the Income variable with random values:
test$Income <- sample(0:1, 4800, replace=TRUE)
This function randomly picks values from the specified set (in this case 0 and 1) and creates 4800 results, one for each row of the test data. Now we'll store the sample submission file in an object called 'submission' and replace the Prediction column in the sample submission with the Income values from the test data.
submission <- read.csv("samplesubmission.csv")
submission$Prediction <- test$Income
We can just write the file to our current folder and submit it now.
write.csv(submission, file = "randomsubmission.csv", row.names = FALSE)
If we submit the file on the Kaggle site for the competition https://inclass.kaggle.com/c/competition-1-mipt-fivt-ml-spring-2015, using the 'Make a submission' link in the left panel under Dashboard, then specify the path of the submission file (or just drag the file over the upload area) and click the Submit button, we'll get a score of 0.51198, which is below the lowest score of 0.24766 on the leaderboard.
Let’s try improving the score using some analysis.
Finding summary of the data using R
Number of columns in the data frame
Earlier we stored the data from "train.csv" in the 'train' object using the read.csv function. In order to make better predictions, we first have to understand the available data. We can find the number of rows and columns in the data frame using the ncol and nrow functions:
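The calls look like this, using the train object we created earlier:

```r
# Dimensions of the training data frame
ncol(train)   # number of columns: 42 in our data
nrow(train)   # number of rows: 10,000 in our data
```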
So we found that the file has 42 columns and 10,000 rows. That is a considerably large number of fields, and many of them may be redundant for our analysis. At the same time, the data set we will be analyzing is fairly small at just 10,000 rows.
Be careful with uppercase and lowercase letters while writing function and variable names, as R is case sensitive.
Now that we know the dimension of the table let’s take a snapshot of the table to have an overview of what is there in the top and bottom few rows of each column, basically to get a feel of the data.
We’ll use the head and tail function here:
head gives us the top 6 rows of each column and tail the bottom 6.
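In our case the calls are simply:

```r
head(train)   # first 6 rows of every column
tail(train)   # last 6 rows of every column
```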
I've just shown a glimpse of the output, but if you run these functions yourself, you'll get an idea of what's stored in the 42 columns. We already know that our aim is to predict income, so Income will be our target variable, and many of the other variables in the dataset will serve as predictors. Let's first see what kinds of fields we are dealing with, using a few simple functions.
Now that we have pretty much got the gist of the data, let's take another step and use the str() function, which gives us some more details about the data.
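The call itself is a one-liner:

```r
str(train)   # column names, classes, sample values and factor levels
```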
This function shows us a summary of the data: the total number of rows and columns, along with the column names, classes, a few sample values, and the levels of factor columns. This should give us a lot to start our analysis of the columns.
But it's still not ideal. The column names are not meaningful enough to understand the data, so how about renaming the columns to make them more descriptive? We can look at the definitions of the columns given on the Kaggle site and rename the columns as below:
Naming and renaming columns in a dataframe
Before renaming, let's find out the current names first. We can use the names function on the data frame:
names(train)
 "AAGE"     "ACLSWKR"  "ADTIND"   "ADTOCC"   "AHGA"     "AHRSPAY"  "AHSCOL"
 "AMARITL"  "AMJIND"   "AMJOCC"   "ARACE"    "AREORGN"  "ASEX"     "AUNMEM"
 "AUNTYPE"  "AWKSTAT"  "CAPGAIN"  "CAPLOSS"  "DIVVAL"   "FILESTAT" "GRINREG"
 "GRINST"   "HHDFMX"   "HHDREL"   "MARSUPWT" "MIGMTR1"  "MIGMTR3"  "MIGMTR4"
 "MIGSAME"  "MIGSUN"   "NOEMP"    "PARENT"   "PEFNTVTY" "PEMNTVTY" "PENATVTY"
 "PRCITSHP" "SEOTR"    "TAXINC"   "VETQVA"   "VETYN"    "WKSWORK"  "CLASS"
There are several ways to rename the columns in a data frame. I’ll demonstrate two useful functions to achieve the renaming:
The first is the colnames function, which is the simplest, but you have to be careful about the order of the columns while renaming. You also have to specify all the column names when using this method; otherwise the remaining columns will be named NA. For example, if we have 10 columns in the data frame and we specify only 6 names in the vector, the last 4 columns will be named NA. On the other hand, if we specify more than 10 column names, it will throw an error. In our example we will use colnames to rename all the columns, like below:
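As a quick illustration of that pitfall, on a throwaway data frame (not our Kaggle data):

```r
df <- data.frame(a = 1, b = 2, c = 3)
colnames(df) <- c("x", "y")   # only two of the three names supplied
names(df)                     # the third column name is now NA
```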
colnames(train) <- c("age", "class_of_worker", "industry_code", "occupation_code",
    "education", "wage_per_hour", "enrolled_inst_last_wk", "marital_status",
    "major_industry_code", "major_occupation_code", "Race", "AREORGN", "sex",
    "union_member", "reason_unemployment", "full_part_time", "capital_gains",
    "capital_losses", "dividends_from_stocks", "tax_filer_status",
    "region_of_previous_residence", "state_of_previous_residence", "family_stat",
    "household_summary", "MARSUPWT", "MIGMTR1", "MIGMTR3", "MIGMTR4", "MIGSAME",
    "MIGSUN", "num_persons_worked_for_employer", "PARENT", "country_of_birthfather",
    "CoB_mother", "CoB_self", "citizenship", "own_business", "taxable_income",
    "VETQVA", "veterans_benefits", "weeks_worked_in_year", "Income")
The other method is the rename function from the plyr package (load it first with library(plyr)). With this function we rename columns by pairing each old name with its new name. Since we have a large number of columns to rename, we have broken the calls into batches of a few columns each for the sake of simplicity; it can be achieved in one step as well.
train <- rename(train, c("AAGE" = "age", "ACLSWKR" = "class_of_worker", "ADTIND" = "industry_code", "ADTOCC" = "occupation_code", "AHGA" = "education", "AHRSPAY" = "wage_per_hour"))
train <- rename(train, c("AHSCOL" = "enrolled_inst_last_wk", "AMARITL" = "marital_status"))
train <- rename(train, c("AMJIND" = "major_industry_code", "AMJOCC" = "major_occupation_code", "ARACE" = "Race", "ASEX" = "sex", "AUNMEM" = "union_member", "AUNTYPE" = "reason_unemployment"))
train <- rename(train, c("AWKSTAT" = "full_part_time", "CAPGAIN" = "capital_gains", "CAPLOSS" = "capital_losses", "DIVVAL" = "dividends_from_stocks", "FILESTAT" = "tax_filer_status", "GRINREG" = "region_of_previous_residence", "GRINST" = "state_of_previous_residence", "HHDFMX" = "family_stat", "HHDREL" = "household_summary"))
train <- rename(train, c("NOEMP" = "num_persons_worked_for_employer", "PEFNTVTY" = "country_of_birthfather", "PEMNTVTY" = "CoB_mother", "PENATVTY" = "CoB_self", "PRCITSHP" = "citizenship", "SEOTR" = "own_business", "TAXINC" = "taxable_income", "VETYN" = "veterans_benefits", "WKSWORK" = "weeks_worked_in_year"))
train <- rename(train, c("CLASS" = "Income"))
Remember, we have to carry out the same renaming on our test data set as well to keep the data consistent. Also note that the target variable is named CLASS in the raw data; we have renamed it to Income.
Now it looks somewhat better. Next we will create two new columns: age bracket (age_bracket) and source of income (inc_source). Both are derived from existing columns and will be used in later parts of the discussion.
We introduce a new column by simply assigning a vector of values to it. The syntax is:
dataframe$new_column_name <- vector with the same number of values as rows in the dataframe
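For example, on a small throwaway data frame (the names here are made up for illustration):

```r
df <- data.frame(id = 1:3)
df$score <- c(10, 20, 30)   # the vector length must equal nrow(df)
```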
First, let's have a look at the age values in this table. We can find all the distinct age values using the unique function:
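The call is simply (sorting makes gaps in the values easier to spot):

```r
sort(unique(train$age))   # every distinct age value, in ascending order
```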
Since the data contains age values all the way up to 90, we can divide them into groups. We will use these age brackets: under 18 as 'under', 18-35 as 'young', 36-49 as 'mid', 50-65 as 'senior', and above 65 as 'retirement'.
Though we have created the age brackets with a conjecture, in later tutorials we’ll use more sophisticated ways to analyze these buckets.
Now we will create our new column for age brackets. We'll start by setting every value to 'under' and then reassign values step by step based on age, like below:
train$age_bracket <- 'under'                                     # default: under 18
train$age_bracket[train$age >= 18 & train$age <= 35] <- 'young'
train$age_bracket[train$age >= 36 & train$age <= 49] <- 'mid'
train$age_bracket[train$age >= 50 & train$age <= 65] <- 'senior'
train$age_bracket[train$age > 65] <- 'retirement'
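The same bucketing can be done in one step with cut(); this is an equivalent sketch assuming the brackets above (right-closed intervals, so 35 falls in 'young' and 65 in 'senior'):

```r
train$age_bracket <- as.character(cut(train$age,
                                      breaks = c(-Inf, 17, 35, 49, 65, Inf),
                                      labels = c("under", "young", "mid",
                                                 "senior", "retirement")))
```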
Checking the first few values of the new column gives results like:
 "mid" "under" "mid" "mid" "mid" "young"
Let's see how our training data looks now; we'll use the str function again.
As we can see, some columns might not have much impact on our analysis. Let's weed out those columns to get a clean, compact table.
Finding fields with just one unique value
Next we will find the fields that don't add much to our analysis because they contain only one unique value. We can follow the steps below to find such fields:
field.value.count <- sapply(train, function(x) length(unique(x)))
This gives us an output like below:
As you can see, the sapply function returns a named vector giving each column name and the number of unique values it contains. To learn about the apply family of functions in detail, you may visit this post.
To get the fields with just one unique value we’ll use the code:
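One way to write that filter, reusing the field.value.count vector from the sapply step above:

```r
# Columns whose value is the same in every row
single.value.fields <- names(field.value.count[field.value.count == 1])
single.value.fields
```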
This gives us the fields with just one unique value.
We can ignore these fields for further analysis.
Removing columns from the data
At this point we have enough to start our analysis. We can proceed by deciding which columns are more important to look into. At first glance it looks like 'wage per hour' and 'occupation code' will be good predictors of Income. But there are also fields that add very little value to our analysis and may introduce redundancy while carrying out our predictive analysis. For example, if we look at the data summary carefully, we will find the field 'weeks_worked_in_year':
Apparently this field has just one value. On analyzing a little further we find that:
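One way to check is a five-number summary of the column:

```r
summary(train$weeks_worked_in_year)
```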
Min. 1st Qu. Median Mean 3rd Qu. Max.
94 94 94 94 94 94
This means all the values in the field are the same. The variable doesn't provide any new information, so we can drop this column.
On a cursory look at the data we can say that some other fields with little predictive value are:
marital_status, Race, AREORGN (hispanic origin), union_member, reason_unemployment, region_of_previous_residence, state_of_previous_residence, family_stat,
MIGMTR1 (migration code-change in msa), MIGMTR3 (migration code-change in reg), MIGMTR4 (migration code-move within reg), MIGSUN (migration prev res in sunbelt), country_of_birthfather, CoB_mother, citizenship and own_business
As you might have observed by now, marital status has little to do with whether income is above or below 50,000; people get married irrespective of whether their paycheck is under or above 50k. The same holds for the other fields I've listed above, like race, migration codes, and the birth country of the father and mother. We will still explore these fields at a later stage of our analysis to see if we can draw additional insights from them. For now, we'll just drop these fields with the following operation using the column numbers:
train <- train[ ,-c(8,11, 12, 14, 15, 21, 22, 23, 26, 27, 28, 30, 33, 34, 36, 37 ) ]
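A name-based alternative (a hypothetical sketch, equivalent to the column numbers above given our renaming) is less fragile if the column order ever changes:

```r
# Drop low-value columns by name instead of position
drop_cols <- c("marital_status", "Race", "AREORGN", "union_member",
               "reason_unemployment", "region_of_previous_residence",
               "state_of_previous_residence", "family_stat", "MIGMTR1",
               "MIGMTR3", "MIGMTR4", "MIGSUN", "country_of_birthfather",
               "CoB_mother", "citizenship", "own_business")
train <- train[, !(names(train) %in% drop_cols)]
```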
There may be a few more fields with low predictive value, and we will find them in a later part of the analysis. The data now looks much cleaner: