# Category Archives for "Tutorials"

Dec 05

## Dealing with NA Values in a Dataset

Removing the NA values present in the data addresses several challenges at once during analysis. It lets you use models that don’t work with NA values, and it can also improve the accuracy of the predictions.

### Remove NA Values for Better Analysis

There are multiple approaches for getting rid of NA values, and I’ll discuss some of them in this post. Below is a sample file that we will use for this tutorial. Some NA values are highlighted in the table.

Ideally you should carry out this exercise during data preprocessing, after merging the test and train datasets.

Data set for analysis

If we use the str function on this data we’ll get the below view (I’ve highlighted some of the NA values in this view):

### Imputing NA Values

To get rid of the NA values we have to either delete all the observations that contain an NA value, or impute them. In the code below, NA values in numeric columns are replaced by -1, and NA values in character columns are replaced with “NAvalue”, which then appears as a separate level in the data.

```
# run a for loop over all columns
for (i in seq_len(ncol(train))) {
  if (is.numeric(train[, i])) {
    # numeric columns: replace NA with -1
    train[is.na(train[, i]), i] <- -1
  } else {
    # character columns: replace NA with the level "NAvalue"
    train[, i] <- as.character(train[, i])
    train[is.na(train[, i]), i] <- "NAvalue"
    train[, i] <- as.factor(train[, i])
  }
}
```

On applying str function again we get the below view:

You may observe that the number of observations is still the same. The data has not been truncated. We have only imputed the NA values wherever applicable.
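For comparison, the other option mentioned earlier (deleting the observations that contain NA values) can be sketched with base R's `complete.cases()`. The small `train` data frame below is a hypothetical stand-in for the tutorial's data, not the original file.

```r
# Hypothetical toy data standing in for the tutorial's train data
train <- data.frame(
  age    = c(25, NA, 40, 31),
  income = c(50000, 62000, NA, 58000),
  city   = c("Pune", NA, "Delhi", "Mumbai"),
  stringsAsFactors = FALSE
)

# Count the NA values in each column
na_counts <- colSums(is.na(train))
print(na_counts)

# Keep only the rows that have no NA in any column
train_complete <- train[complete.cases(train), ]
print(nrow(train))           # rows before deletion
print(nrow(train_complete))  # rows after deletion
```

Unlike imputation, this approach shrinks the data set, so it is worth checking how many rows survive before committing to it.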

You may try out some other approaches as well and post your experiences here.

Dec 05

## The Ultimate Guide to Data Visualization

“The greatest value of a picture is when it forces us to notice what we never expected to see.” – John Tukey

This post will cover:

• Types of variables
• Types of charts and graphs

## Some Terminology

Longitudinal data tracks the same type of information on the same subjects at multiple points in time. For example the data could contain a few stores and their sales over a period of 8 consecutive quarters.

A covariate is a variable that is possibly predictive of the outcome under study. In other words a variable that can be used for predicting the value of target variable is called covariate.

Cross-sectional data is a type of data collected by observing many subjects at the same point of time without regard to differences in time. For example data may be collected for different students from different states.

## Types of variables

In this post we will discuss cardinal, ordinal and nominal (categorical) variables.

Cardinal variables are those whose values can be added, subtracted and multiplied. Examples: age, quantity, weight, count etc.

Ordinal variables can be compared using greater than, smaller than or equal to, but other mathematical operations like addition or multiplication are not meaningful for them. Examples are dates and responses to survey questions such as strongly agree, agree or disagree.

Nominal variables give us information for classifying observations, but they can’t be ordered and we can’t apply mathematical operations like addition or subtraction to them. Examples are race, gender etc.
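As a rough illustration, the three kinds of variables map onto R types as sketched below. All values and variable names here are made up for the example.

```r
# Cardinal: a plain numeric vector, arithmetic is meaningful
age <- c(23, 35, 41)
mean_age <- mean(age)

# Ordinal: an ordered factor, comparisons work but arithmetic does not
response <- factor(
  c("agree", "disagree", "strongly agree"),
  levels  = c("disagree", "agree", "strongly agree"),
  ordered = TRUE
)
agree_or_more <- response >= "agree"

# Nominal: an unordered factor, only equality checks are meaningful
blood_group <- factor(c("A", "O", "B"))
is_group_o <- blood_group == "O"

print(mean_age)
print(agree_or_more)
print(is_group_o)
```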

## Types of charts and graphs

### Scatter plot

Scatter plots can be used when we want a direct comparison between two variables to find out how much one variable is affected by the other. Here we’ll start by picking up the freeny dataset from R and demonstrate how different variables can be plotted to understand their relationship:

A snapshot of freeny is below:

If we now plot the lagged quarterly revenue on the Y-axis against the price index on the X-axis, we see a clear relation between the two variables: the revenue decreases as the price index increases. R command:
> plot(freeny$price.index, freeny$lag.quarterly.revenue)
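The visual impression can also be checked numerically. A minimal sketch using base R's `cor()` on the same two columns of the built-in `freeny` data:

```r
# Pearson correlation between price index and lagged quarterly revenue
r <- cor(freeny$price.index, freeny$lag.quarterly.revenue)
print(r)

# A negative sign matches the downward trend seen in the scatter plot
print(sign(r))
```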

Now we will plot each variable against the other variables in the table to see how they are related. In real-life problems, scatter plots often come in handy when we want to test how the target variable changes with various other variables.

Before that, let me introduce Color Brewer:
> install.packages("RColorBrewer")
> library(RColorBrewer)

Color Brewer will help you choose sensible color schemes for the plots. brewer.pal makes the color palettes from ColorBrewer available as R palettes. display.brewer.pal() displays the selected palette in a graphics window. Example:

> display.brewer.pal(3, "Set1")

The output will be the below palette in the graphics window:

In a similar fashion you may try out different palettes like Set2, Set3 etc. and vary the number of colors with the function display.brewer.pal(num_of_color, Palette_name) to get an idea of the different palettes. Once you know which palette to use and how many colors, simply pass col=brewer.pal(num_of_color, Palette_name) to the plot function.
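A minimal sketch of that workflow, assuming the RColorBrewer package is already installed: `brewer.pal()` returns a character vector of hex color codes, which can be inspected and then passed to `col`.

```r
library(RColorBrewer)

# Three colors from the Set1 palette, returned as hex strings
cols <- brewer.pal(3, "Set1")
print(cols)

# Use one of the palette colors for a single scatter plot
plot(freeny$price.index, freeny$lag.quarterly.revenue,
     col = cols[2], pch = 19,
     xlab = "price.index", ylab = "lag.quarterly.revenue")
```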

Now, if we plot the freeny data with 3 colors from the Set1 palette for each pair of variables:

> plot(freeny, col=brewer.pal(3, "Set1"))

In the above function, 3 is the number of different colors in the palette (minimum value is 3). Set1 is a palette name (there are multiple palettes available for your plots in R).

How to Read a Two-Dimensional Scatter Plot Matrix?

In each panel, the variable named in the same row is plotted on the Y-axis and the variable named in the same column is plotted on the X-axis. So, the rightmost graph in the second row has lag.quarterly.revenue on the Y-axis and market.potential on the X-axis. The interesting fact about the above set of plots is that every pair of variables is either positively or negatively correlated.

An advantage of the scatter plot is that we can look at the correlation visually. The difference between visual and mathematical correlation can be of great significance. Let’s say the computed correlation between two variables is 0.2, which is weak. But when we plot them we may find that over a specific interval the two variables are perfectly correlated, while in other intervals they show no correlation. This finding can be of great help if we are interested in predicting or forecasting values in that particular interval only.
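This effect can be reproduced with simulated data. The sketch below is purely illustrative: `y` follows `x` exactly for the first 20 points and is random noise afterwards, and the seed and cut-off are arbitrary choices.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

x <- 1:100
# y equals x in the interval x <= 20, and is uniform noise elsewhere
y <- ifelse(x <= 20, x, runif(100, min = 0, max = 100))

# Correlation over the whole range versus inside the interval
overall_cor  <- cor(x, y)
interval_cor <- cor(x[x <= 20], y[x <= 20])
print(overall_cor)
print(interval_cor)

# The scatter plot makes the well-behaved interval visible at a glance
plot(x, y)
```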

However, if one of your variables is a logical (TRUE/FALSE) variable, a scatter plot would not make sense. It will give a result like below:

This plot shows only two values on the x-axis: 0 and 1. For almost every value of the variable on the Y-axis a corresponding point is plotted, so this plot is meaningless. In order to derive meaningful results from binary variables using a scatter plot, we need to use the aggregate function.
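A minimal sketch of the aggregate approach, using the built-in `mtcars` data as a stand-in (the `manual` column is created here for illustration): the numeric variable is summarised per group first, and the summary is plotted instead of the raw points.

```r
cars <- mtcars
# Logical version of the 0/1 transmission flag (TRUE = manual)
cars$manual <- cars$am == 1

# Mean miles per gallon for each value of the logical variable
mean_mpg <- aggregate(mpg ~ manual, data = cars, FUN = mean)
print(mean_mpg)

# Two bars are far easier to read than two vertical strips of points
barplot(mean_mpg$mpg, names.arg = as.character(mean_mpg$manual),
        xlab = "manual", ylab = "mean mpg")
```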

### Bar Charts

Bar charts are most suitable when we have different classes or groups in the data that we want to compare. For example, they can be used for depicting time series where we want to compare a certain variable across decades, or we can break the classes into further groups for a more detailed analysis, such as comparing the data for various regions within the same decade. The classes/groups can be broken down in many ways, as I’ll demonstrate here.
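As a sketch of breaking classes into further groups, base R's `barplot()` can draw grouped bars from a matrix. The built-in `VADeaths` matrix (death rates by age group and population group) is used here only as a convenient stand-in for the decade and region example.

```r
# Rows are age groups, columns are population groups
print(dim(VADeaths))

# beside = TRUE draws one cluster of bars per column, one bar per row
bar_positions <- barplot(VADeaths, beside = TRUE,
                         legend.text = rownames(VADeaths),
                         ylab = "Deaths per 1000",
                         main = "Death rates by population group")
```

Setting `beside = FALSE` instead stacks the age groups within each bar, which is another way of breaking down the same classes.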

The below table (JohnsonJohnson) shows Quarterly Earnings per Johnson & Johnson Share:

hist(JohnsonJohnson)

will give us the histogram of all the values:

In the above plot there are 9 bars, each denoting the frequency of values in a particular interval. The first bar shows the number of values in the 0-2 interval. If we count in the table, the number of such values is 33. So 33 is the frequency of the first interval, which is shown on the Y-axis. Of course we can change the number of bars, colors etc. for this plot.
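A small sketch of such adjustments, using standard arguments of `hist()`. Note that `breaks` is treated as a suggestion, so R may pick a slightly different number of bars; the color and labels are arbitrary choices.

```r
# Ask for roughly 5 intervals and fill the bars with a color
h <- hist(JohnsonJohnson, breaks = 5, col = "steelblue",
          main = "Quarterly earnings per share",
          xlab = "Earnings (dollars)")

# The returned object holds the interval boundaries and frequencies
print(h$breaks)
print(h$counts)
```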

### Pie Charts

I personally prefer using pie charts when comparing two outcomes of a variable. For example, the responses to a survey in which respondents answered a ‘Yes’ or ‘No’ question can be represented well using a pie chart.

Pie charts are suitable for depicting the results of a single variable that has few outcomes, where the outcomes are expressed as percentages rather than absolute numbers. They provide a visual aid for grasping the contrast in those outcomes. That’s why you’d observe them mostly in polls etc.
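A minimal sketch of the survey example with base R's `pie()`. The response counts are hypothetical.

```r
# Hypothetical survey counts for a Yes/No question
responses <- c(Yes = 62, No = 38)

# Express each outcome as a percentage of all responses
pct <- round(100 * responses / sum(responses))

# Label each slice with its outcome and percentage
pie(responses,
    labels = paste0(names(responses), " (", pct, "%)"),
    col = c("steelblue", "grey80"),
    main = "Survey responses")
```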

### Bubble Charts

Bubble charts are often useful to show a ripple effect. For example if we want to depict the assets under consideration at different stages of the financial crisis we can visualize it well using a bubble chart.
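One way to sketch a bubble chart in base R is `symbols()`, which draws a circle per observation. The stages, losses and asset values below are invented for illustration; the radius is derived from the square root of the value so that the bubble area, rather than the radius, is proportional to it.

```r
# Hypothetical values for four stages of a crisis
stage  <- 1:4
loss   <- c(2, 5, 9, 4)
assets <- c(10, 40, 90, 25)

# Radius chosen so that circle area is proportional to assets
radius <- sqrt(assets / pi)

symbols(stage, loss, circles = radius, inches = 0.4,
        bg = "lightblue", xlab = "Stage", ylab = "Loss",
        main = "Assets by stage (bubble area)")
```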

### Raster Plot

These are suitable for visualizing 3-dimensional data, where we have 3 variables to depict in a two-dimensional space. It’s best to put two variables on the two axes and represent the 3rd variable using color coding. The 3rd variable should be a categorical variable in this case.
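The color-coding idea can be sketched with a plain base R plot. This is not a raster plot as such, only an illustration of placing two variables on the axes and mapping a categorical third variable to color, using the built-in `iris` data and arbitrary colors.

```r
# One color per level of the categorical variable
species_cols <- c("red", "darkgreen", "blue")

# Two numeric variables on the axes, species shown through color
plot(iris$Sepal.Length, iris$Petal.Length,
     col = species_cols[as.integer(iris$Species)], pch = 19,
     xlab = "Sepal length", ylab = "Petal length")

legend("topleft", legend = levels(iris$Species),
       col = species_cols, pch = 19)
```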

### Two Important Functions Used for Visualization in R

qplot function in R

We’ll start discussing qplot with the ChickWeight dataset available in R that looks like below:

The very basic plot that we can draw with qplot is by supplying the variables to be plotted in x-axis and y-axis and passing the dataset name like below:

qplot(Time, weight, data=ChickWeight)

The plot that is rendered is as below:

This plot is not very helpful for making any inference from the data. So let’s go a step further and find out how we can differentiate between the various diets by looking at this plot. We shall use different colors to plot the values of the different diets by using the parameter colour = Diet.

That’s more helpful. We can see that Diet 3 is most effective in increasing weight and diet 1 is least effective. However Diet 2 gives mixed results.
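The color-coded plot described above can be sketched as follows, assuming ggplot2 is installed. Since `qplot()` is marked as deprecated in recent ggplot2 releases, the runnable line uses the equivalent `ggplot()` call and the `qplot()` form is kept as a comment. The per-diet means at day 21 are added as a numeric check of the visual reading.

```r
library(ggplot2)

# qplot form: qplot(Time, weight, data = ChickWeight, colour = Diet)
p <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet)) +
  geom_point()
print(p)

# Mean weight per diet on the last day of the experiment
final_weight <- aggregate(weight ~ Diet, data = ChickWeight,
                          subset = Time == 21, FUN = mean)
print(final_weight)
```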

ggplot2 library in R

Last but not least, picking the right chart or graph depends on its purpose. If the end user of the graph is statistically savvy, you may opt for a complex chart that conveys more information in a short window. On the other hand, if the end user is only interested in a high-level view, you should keep your graphs and charts simple and focus only on end results that are actionable.

For example if you are presenting the sales for different stores of a retail chain across the states to the higher management, it makes sense to restrict your charts to actionable items like in which states the sales have decreased, which stores were out of stock during the peak season and so on.

You may visit some nice blogs about data visualization from around the internet

Dec 04

## Model Tuning: A Random Forest Example

Random forest is an ensemble model which can be visualized as a combination of multiple decision trees.

Before explaining random forest, I’ll start with two simple yet popular examples of ensembles from our life:

• In a movie, a number of actors perform together to tell us a story
• In a soccer game, a number of players contribute towards the common goal of winning a match

In all these examples and many others, the error of one player or actor is often compensated for by the others in the group. That is why the combined performance of all these actors or players is much better than that of any individual actor or player performing alone.

In simple terms, if a combination of individual models works better than any of the individual models, it is called an ensemble model.

That’s how random forest works. By growing a lot of different trees and averaging or voting on their outcomes across the group, random forest can often give better results than individual decision trees.
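The voting idea can be illustrated with a small simulation. This is a simplified sketch, not how randomForest is implemented: it assumes 25 hypothetical models that are each right 70% of the time and make their errors independently of each other, which real trees only approximate.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_cases  <- 1000  # number of observations to classify
n_models <- 25    # number of hypothetical individual models
p_right  <- 0.7   # assumed accuracy of each individual model

# correct[i, j] is 1 when model j classifies case i correctly
correct <- matrix(rbinom(n_cases * n_models, size = 1, prob = p_right),
                  nrow = n_cases, ncol = n_models)

# Average accuracy of the individual models
individual_acc <- mean(colMeans(correct))

# For a two-class problem the majority vote is right
# whenever more than half of the models are right
ensemble_acc <- mean(rowSums(correct) > n_models / 2)

print(individual_acc)
print(ensemble_acc)
```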

To know how random decision trees are grown to carry out this voting, you can visit the post on how decision trees work. Here I’ve taken a simple example to illustrate the use of random forest in R.

We will apply the model on unseen data (other than the train data) for which we have the actual values for the target variable, and then compare the results given by the model with the actual values to calculate the accuracy.

The below data set is a subset taken from the income prediction problem data. We will train a random forest model on the train data and apply the prediction on the test data. Please note that we have created the test data from the train data values, so that we know the actual Income values.

The test data looks like below:

You may have noticed that the train data that we have chosen is part of the actual sample. The data set already has the values for income. We will add another column with predicted income values to this data set.

We’ll use random forest with three different values of the ntree parameter to predict the income. We’ll then calculate the absolute difference between the predicted values and the actual values using the abs function to find the error in the predicted values. This will tell us how much our predicted values vary from the actual data.

Below are the steps that we’ll follow:

```
library(randomForest)

# keep only the rows with a positive id
train1 <- combi[which(combi$id > 0), ]
trainmodel <- randomForest(Income ~ ., data = train1, ntree = 25)
prediction <- predict(trainmodel, train1, type = "class")
train1 <- cbind(train1, prediction)

# We'll change 50000+ to 1 and -50000 to 0 for prediction and income columns
train1$prediction <- as.integer(train1$prediction)
train1$prediction <- train1$prediction - 1
train1$Income <- as.integer(train1$Income)
train1$Income <- train1$Income - 1

# we'll find the difference between predicted values and actual values
train1$difference <- abs(train1$Income - train1$prediction)

# mean of the error for all the rows will give us an average error value
mean(train1$difference)
```

We can repeat the above code for the same data set with different values of ntree, that is, by taking ntree as 50 and 100. We can then compare the average error (the mean of train1$difference), and we will see that for ntree=100 the average is 0.25, for ntree=50 the average is 0.3, and for ntree=25 the average is 0.

That means that, for this data set, taking ntree=25 gives a perfect prediction for our test data. Keep in mind that the test rows were created from the train data, so this result may be optimistic for data the model has never seen.
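To compare ntree values on rows the model has not seen during training, the same steps can be wrapped in a loop over a held-out split. The sketch below assumes the randomForest package is installed and uses the built-in `iris` data as a stand-in, since the income data is not included here. The split size and seed are arbitrary.

```r
library(randomForest)

set.seed(123)  # arbitrary seed so the split is reproducible

# Hold out 50 rows that the model never sees during training
test_rows <- sample(nrow(iris), 50)
train_set <- iris[-test_rows, ]
test_set  <- iris[test_rows, ]

# Fit one forest per ntree value and record the held-out error rate
ntrees <- c(25, 50, 100)
errors <- sapply(ntrees, function(n) {
  model <- randomForest(Species ~ ., data = train_set, ntree = n)
  pred  <- predict(model, test_set)
  mean(pred != test_set$Species)
})
names(errors) <- paste0("ntree=", ntrees)
print(errors)
```

Because each forest is grown from random samples, the error rates can shift from one seed to another, so small differences between ntree values should not be over-interpreted.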