Once I came across the question “How to Learn R in a day?”. Though it sounds like an impossible task, you can surely gain a basic understanding of the tool in a very short time. Interestingly, R has a gentle learning curve at the beginning, but once you proceed to advanced topics the curve gets steeper and steeper, partly due to the statistical knowledge those topics require.
This post is for those who want to gain an initial understanding of R in a short span of time with some hands-on tutorials. I’ll start with a Kaggle problem. Kaggle is one of the many places to find interesting problems in data science. As you have decided to learn R, you probably already know that it is a free and powerful statistical programming language. For this tutorial I’ll use the R console. You may install R here if you have not already done so.
Once you have R installed, you can download the ‘test’ and ‘train’ data files for the competition Titanic: Machine Learning from Disaster and you are all set to begin.
As I’m a strong believer in the role of questions in learning and thinking, I’ve tried to follow a question and answer method in this post. I’m sure many more questions will arise in your mind and I’ll be happy to answer them in the comments section. We’ll start with 3 basic questions which I believe you should ask at the beginning of each Kaggle problem:
1. What is being predicted?
2. What data do we have?
3. What’s the prime motive or behavior in play here?
We are predicting the fate of all the passengers who were aboard the ocean liner ‘Titanic’ on its fatal maiden voyage. In predictive analytics terminology we are predicting the variable ‘survival’, which is also called the target variable.
The train data consists of passenger details, each row holding several pieces of information about one passenger, such as class (which denotes socio-economic status and not the class of journey), sex, age etc., and most importantly whether the passenger survived. In the test data the value of the variable Survived is not given and we have to predict it.
The prime motive or behavior is that, since there was a shortage of lifeboats, a pecking order would be followed to decide who had access to the lifeboats first. Our instinct says women and children will be the first ones to be saved, followed by the elderly. Somewhere in between, the influential people in society will cut in (like Hockley in the movie). Let’s find out with our analysis.
We’ll begin the analysis by asking a few questions about the data:
How to read the data in R?
# store the path to the current directory in a variable cur_dir
cur_dir <- "location of the folder where you stored the train and test files"
# set the variable cur_dir as your working directory using the setwd command
setwd(cur_dir)
# read the train data and store it in the train object using the read.csv command
# stringsAsFactors is a logical that was TRUE by default before R 4.0.0 (FALSE since then)
# by making it FALSE we indicate strings shouldn’t be treated as factor variables
train <- read.csv("train.csv", stringsAsFactors=FALSE)
We’ll discuss the stringsAsFactors argument later. For now just note that it was TRUE by default in R versions before 4.0.0 (it is FALSE by default from 4.0.0 onward), and we set it to FALSE explicitly so as to avoid conversion of all strings to factors.
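To see what the stringsAsFactors argument changes without touching the Titanic files, here is a minimal sketch. The three-row CSV text is made up for illustration, and the names csv_text, as_chr and as_fct are my own.

```r
# Hypothetical three-row sample, inlined so the example needs no file on disk
csv_text <- "PassengerId,Sex
1,male
2,female
3,female"

# Same data read twice, once with each setting
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Sex)    # text kept as plain character strings
class(as_fct$Sex)    # text converted to a factor
levels(as_fct$Sex)   # the distinct categories of the factor
```

Passing the argument explicitly, as done for the train data, gives the same result regardless of which R version you have installed.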
Let’s try finding answers to the next set of questions:
- How to summarize the data to understand it properly?
- What are the values in the Target Variable?
- What are the most important predictors influencing the target variable?
- Based on behavior of data identified earlier what predictors can be found?
- What are the characteristics of the predictor variables: are they able to predict the Target Variable independently, or should they be combined with other predictors?
In order to summarize the data we will use the str function, called as str(train), and the output will be as below:
Let’s start with the data types here. The ‘int’ data type indicates integer variables, which means the variable can only have whole number values. ‘num’ indicates numeric variables, which can take decimal values. ‘factors’ are like categories. In R versions before 4.0.0 all text is imported as factors by default, but if we specify stringsAsFactors=FALSE as we did in the read.csv function, text is imported as ‘chr’ and not as factors. ‘chr’ is a character (string) variable.
The str function also gives us the number of observations in the data as well as the number of variables. It also shows the first few values of each column of the data, as you can see here.
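If you want to see these data types side by side before looking at the real data, the following sketch builds a small made-up data frame (toy and its columns are hypothetical, not the Titanic data) with one column of each type discussed above.

```r
# Hypothetical toy data frame with one column per data type
toy <- data.frame(
  Id    = 1:3,                             # whole numbers, reported as int
  Fare  = c(7.25, 71.28, 8.05),            # decimals, reported as num
  Sex   = c("male", "female", "female"),   # text, reported as chr
  Class = factor(c("3", "1", "3")),        # categories, reported as Factor
  stringsAsFactors = FALSE
)

str(toy)    # one header line with the dimensions, then one line per column
nrow(toy)   # number of observations
ncol(toy)   # number of variables
```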
To find out what values can be taken by the target variable, we need to check the unique values in the ‘Survived’ column.
We can access the Survived column using a dollar sign as: train$Survived (R is case-sensitive, so the capital S matters).
train$Survived will give you a vector of all the values present in the column. To find the unique values present in the column we need to use the unique function:
unique(train$Survived)
will give us the output as:
[1] 0 1
This means this is a binary classification problem where the target variable can take only two values: 0 and 1.
The unique function gives us the unique values in a column, but what if we want to know how many values of each type are present in the column? For that we need the table function.
table(train$Survived)
gives us output as below:
Now, we know the number of 0’s and 1’s in the data.
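The difference between unique and table is easy to try on a small vector. This is only a sketch on made-up values (the vector survived below is hypothetical), not output from the Titanic data.

```r
survived <- c(0, 1, 1, 0, 0, 1, 0)   # hypothetical outcomes

unique(survived)           # distinct values, in order of first appearance
counts <- table(survived)  # how many times each distinct value occurs
counts
names(counts)              # the values that were counted
as.vector(counts)          # the counts themselves
```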
The next question is to find out the predictors. From the list of variables available in the summary, which do you think is a likely predictor? Well, intuitively we can say that Age and Sex should be likely predictors, as women and children are given preference in rescue operations during a tragedy. But how shall we check this assumption? We can do that by just adding another column to our table function:
table(train$Survived, train$Sex)
    female male
  0     81  468
  1    233  109
Now we have the number of males and females who survived the tragedy. It will be better if we can see this as a proportion. That can be done by using the prop.table function, which converts the values of a table into proportions:
prop.table(table(train$Survived, train$Sex))
        female       male
  0 0.09090909 0.52525253
  1 0.26150393 0.12233446
We found the proportions, but they are computed over the complete table. We would prefer the proportions of males and of females that survived, and for that we need column-wise proportions. To specify column-wise proportions, we have to add 2 as the second argument of this function:
prop.table(table(train$Survived, train$Sex), 2)
       female      male
  0 0.2579618 0.8110919
  1 0.7420382 0.1889081
Now we have a more meaningful representation. 74.2% of the females survived the disaster compared to 18.89% of the males. So our initial assumption was correct. Let’s now check the Age variable. If you use the unique function on the Age column you will find there are 89 values for age. It will be wise to reduce the number of values so as to get an easy comparison.
Let’s consider the following age groups: 12 and under, 13-18, 19-35 and above 35. This classification is not strictly based on any reasoning; it’s just for initial analysis. Later on we’ll find better age brackets with proper techniques. We’ll now learn the bracket operator, which helps us find the values in a column that meet a certain condition, and we’ll create a new variable Age_grp to store the age group values:
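The effect of the second argument of prop.table is easiest to see on a tiny table. The counts below are made up for illustration (the matrix counts and its labels are hypothetical), so treat this as a sketch rather than Titanic output.

```r
# Hypothetical counts: rows are the outcome (0/1), columns are two groups
counts <- matrix(c(10, 30,    # group "a": 10 zeros, 30 ones
                   40, 20),   # group "b": 40 zeros, 20 ones
                 nrow = 2,
                 dimnames = list(outcome = c("0", "1"), group = c("a", "b")))

prop.table(counts)      # every cell divided by the grand total
prop.table(counts, 1)   # row-wise: each row sums to 1
prop.table(counts, 2)   # column-wise: each column sums to 1
```

With the column-wise version, the "1" row reads directly as the share of each group with outcome 1, which is the kind of comparison we want between females and males.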
Here’s how we create Age_grp variable:
train$Age_grp[train$Age<13] <- "grp1"
train$Age_grp[train$Age>12 & train$Age<19] <- "grp2"
train$Age_grp[train$Age>18 & train$Age<36] <- "grp3"
train$Age_grp[train$Age>35] <- "grp4"
Now, if we use the column-wise prop.table call with our Age_grp values, as in prop.table(table(train$Survived, train$Age_grp), 2), we’ll get new insights about the data:
grp1 grp2 grp3 grp4
0 0.4202899 0.5714286 0.6173184 0.6175115
1 0.5797101 0.4285714 0.3826816 0.3824885
As you can see, 57.97% of the children aged 12 and under and 42.86% of those aged 13 to 18 survived, compared to about 38% of the adults. So, apparently some effort was made to save children.
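As an aside, base R also has the cut function, which bins a numeric vector in a single call. The sketch below runs on made-up ages (the vector age is hypothetical); with these breaks it should assign the same groups as the four bracket assignments above for positive ages, and it leaves missing ages as NA.

```r
age <- c(0.42, 12, 13, 18, 19, 35, 35.5, 36, 80, NA)   # hypothetical ages

# Intervals are closed on the right by default: (0,12], (12,18], (18,35], (35,Inf]
age_grp <- cut(age,
               breaks = c(0, 12, 18, 35, Inf),
               labels = c("grp1", "grp2", "grp3", "grp4"))

table(age_grp, useNA = "ifany")   # counts per group, with missing ages shown separately
```

The result is a factor rather than a character vector, which table and prop.table handle in the same way.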
Now, before we proceed to the other questions, based on our analysis so far we will create a submission file and upload it to Kaggle to see our score. Here’s how we’ll make our prediction for the submission:
All females survived and all males below the age of 12 survived.
We’ll first read the test file, then modify the values in the Survived column, and finally write the result out.
# read the test data, just as we read the train data
test <- read.csv("test.csv", stringsAsFactors=FALSE)
# first we make all the values in the Survived column as 0
test$Survived <- 0
# then we modify those values which we predicted as survived as 1
test$Survived[test$Sex=="female"] <- 1
test$Survived[test$Sex=="male" & test$Age<12] <- 1
# now we’ll write the value stored in the test object to the file test.csv in current directory
As a last step before submission we have to create the submit object and write it in a submission file as below:
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "first_submission.csv", row.names = FALSE)
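Before running this on the real test file, it can help to rehearse the whole rule on a few made-up rows. The sketch below uses a hypothetical data frame toy_test and writes to a temporary file (out_file) so nothing in your working directory is overwritten. It also wraps the age condition in which, an addition of mine that makes it explicit that passengers with a missing age are left at 0.

```r
# Hypothetical stand-in for the test data, including missing ages
toy_test <- data.frame(PassengerId = 1:5,
                       Sex = c("male", "female", "male", "male", "female"),
                       Age = c(30, 25, 8, NA, NA),
                       stringsAsFactors = FALSE)

toy_test$Survived <- 0                                # start with nobody surviving
toy_test$Survived[toy_test$Sex == "female"] <- 1      # all females survive
# which() keeps only the rows where the condition is TRUE, dropping NA results
toy_test$Survived[which(toy_test$Sex == "male" & toy_test$Age < 12)] <- 1

submit <- data.frame(PassengerId = toy_test$PassengerId,
                     Survived = toy_test$Survived)
out_file <- tempfile(fileext = ".csv")                # temporary path for this rehearsal
write.csv(submit, file = out_file, row.names = FALSE)
read.csv(out_file)                                    # read it back to check the layout
```

The file has exactly two columns, PassengerId and Survived, which is the layout built by the submit object above.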
If we now go to the Make a Submission link for Titanic tutorial we’ll see a submission area like below:
We just have to click on the highlighted button, select the first_submission.csv file from our current directory and upload the file. Then click on the submit button as indicated and you’re done.
You’ve just scored 0.77033 and left behind almost 1000 competitors by predicting the obvious. Congratulations!! In the next tutorials, as we fine-tune our models, we will see a huge jump in our ranking.
Now, coming to the next question, what are the other predictors we can single out based on the behavior of the data? In data mining you’ll often find that algorithms mine the data to find the behavior. But in data science you can use your judgement to find certain predictors very easily, and this will save you a lot of effort when there is a large number of variables that could be predictors.
Now, based on our understanding of society, we know that people having substantial social influence have a better shot at accessing the lifeboats. How can we identify such people? These people must have purchased tickets at a higher price and they must belong to a higher class. Let’s test this assumption.
In order to test this assumption, I’ll introduce an amazing function called aggregate. Let’s find out how it can be used to analyze the Pclass and Fare variables.
aggregate(Survived~Pclass, data=train, mean)
  Pclass  Survived
1      1 0.6296296
2      2 0.4728261
3      3 0.2423625
In the above function I’ve used the mean function to find the fraction of people who survived. Since the Survived variable has values 0 and 1, mean will add up all the 1’s and divide the result by the total number of values (the count of 0’s and 1’s). Effectively we’ll get the fraction of people who survived. You can get a detailed understanding of the aggregate function here.
It’s apparent that people belonging to class 3 have a much lower survival rate of 24.2% compared to class 1, which has a 62.96% survival rate. Let’s now check the Fare variable. There are bound to be many distinct fares, which would produce a long list of aggregate values. Let’s find out how many with length(unique(train$Fare)):
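To convince yourself that the mean of a 0/1 column really is the survival fraction, here is a small sketch on made-up data (the data frame toy and the name rates are hypothetical).

```r
# Hypothetical passengers: 3 in class 1, 2 in class 2, 4 in class 3
toy <- data.frame(Pclass   = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                  Survived = c(1, 1, 0, 1, 0, 0, 0, 1, 0))

# One row per class, holding the mean of Survived within that class
rates <- aggregate(Survived ~ Pclass, data = toy, FUN = mean)
rates

# The same number for class 1 computed by hand: survivors / passengers
sum(toy$Survived[toy$Pclass == 1]) / sum(toy$Pclass == 1)
```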
So, we have 248 values. Let’s classify these values into more meaningful categories of high and low fares. For that we have to plot a histogram to see how the values are distributed.
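A histogram of the fares can be drawn with base R’s hist function, for example hist(train$Fare). The sketch below uses a handful of made-up fares (the vector fare is hypothetical) and asks hist for the bin counts instead of a plot, so you can see the numbers behind the bars.

```r
fare <- c(7.25, 8.05, 13, 26, 30, 45, 71.28, 80, 120, 263, 512.33)  # hypothetical fares

# plot = FALSE returns the histogram as an object instead of drawing it
h <- hist(fare, breaks = seq(0, 550, by = 50), plot = FALSE)
h$breaks   # the bin edges: 0, 50, 100, ..., 550
h$counts   # how many fares fall into each 50-wide bin
```

Dropping plot = FALSE draws the usual bar chart from the same bins.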
We can see that most of the passengers paid a fare of less than 50. Only a small fraction paid above 500, and some paid between 200 and 500. So we can heuristically assign the following categories for the fares:
0 to 50, 50 to 100, 100 to 150, 150 to 500 and 500+
We’ll introduce another variable called Fare_type and apply the aggregate function on it:
train$Fare_type[train$Fare>50 & train$Fare<=100]<-"med1"
train$Fare_type[train$Fare>100 & train$Fare<=150]<-"med2"
train$Fare_type[train$Fare>150 & train$Fare<=500]<-"high"
# now applying the aggregate function on Fare_type
aggregate(Survived~Fare_type, data=train, mean)
  Fare_type  Survived
1      high 0.6538462
2       low 0.3191781
3      med1 0.6542056
4      med2 0.7916667
5     vhigh 1.0000000
As you can see, all the people who paid very high fares survived compared to only 31.9% of those who paid low fares.
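The five fare categories can also be built in one step with cut and then passed to aggregate. This is a sketch on made-up rows (toy and fare_rates are hypothetical names), and I am assuming each band includes its upper limit, so results on the real data may differ slightly at the boundaries.

```r
# Hypothetical fares and outcomes
toy <- data.frame(Fare     = c(7.25, 50, 60, 100, 120, 200, 512.33, 8.05),
                  Survived = c(0,    0,  1,  1,   1,   0,   1,      1))

# Bands: up to 50, (50,100], (100,150], (150,500] and above 500
toy$Fare_type <- cut(toy$Fare,
                     breaks = c(-Inf, 50, 100, 150, 500, Inf),
                     labels = c("low", "med1", "med2", "high", "vhigh"))

fare_rates <- aggregate(Survived ~ Fare_type, data = toy, FUN = mean)
fare_rates
```

Because cut returns a factor, the rows come out in band order (low first) rather than alphabetically.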
Based on the above methods you can try all variables and see for yourself how you can improve the accuracy of your prediction.
In the next post we’ll discuss the various models that can be applied to this problem for better predictions.
I’d seriously suggest you work out these steps on your own as we move forward with this tutorial, to get first-hand experience of the programming and modeling. Have fun with R and Kaggle.