Dealing with NA Values in a Dataset

Dec 05

Removing the NA values present in your data can help you address several challenges during analysis. It not only lets you use models that don’t work with NA values, it can also improve the accuracy of the predictions.

Remove NA Values for Better Analysis

There are multiple approaches for getting rid of NA values, and I’ll discuss some of them in this post. Below is the sample file that we will use for this tutorial. Some NA values are highlighted in the table.

Ideally, you should carry out this exercise during data preprocessing, after merging the train and test datasets.
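As a rough sketch of what merging before preprocessing could look like, the snippet below stacks a train and a test set, leaves a placeholder for the NA handling, and splits them again. The data frames, the column names (`age`, `city`, `target`) and the helper column `source` are all hypothetical and only serve as an illustration.

```r
# Hypothetical toy data; column names are illustrative only
train <- data.frame(age = c(25, NA, 40),
                    city = c("Pune", NA, "Delhi"),
                    target = c(1, 0, 1))
test <- data.frame(age = c(NA, 33),
                   city = c("Delhi", "Pune"))

# Tag each row with its origin so the sets can be separated again later
train$source <- "train"
test$source <- "test"

# The test set has no target column, so add it as NA before stacking
test$target <- NA

# Stack the two sets, with the test columns in the same order as train
combined <- rbind(train, test[names(train)])

# ... handle the NA values in the predictor columns of `combined` here ...

# Split back into the original sets and drop the helper column
train_clean <- combined[combined$source == "train",
                        names(combined) != "source"]
test_clean <- combined[combined$source == "test",
                       !(names(combined) %in% c("source", "target"))]
```

One thing to watch with this approach: the target column is NA for every test row, so it should be left out of whatever NA handling you apply to the combined data.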

Data set for analysis

If we use the str function on this data, we get the view below (I’ve highlighted some of the NA values in this view):
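If you want to reproduce this kind of inspection yourself, a minimal sketch might look like the following. The small data frame is a hypothetical stand-in for the sample file, and counting NA values per column with `colSums(is.na(...))` is an extra step that is not part of the original walkthrough.

```r
# Hypothetical stand-in for the sample file used in this post
train <- data.frame(
  id = 1:5,
  age = c(23, NA, 31, 45, NA),
  city = c("Pune", "Delhi", NA, "Mumbai", "Delhi"),
  stringsAsFactors = TRUE
)

# Show the structure: column types plus the first few values, including NA
str(train)

# Count the NA values in each column
na_counts <- colSums(is.na(train))
print(na_counts)

# Count the rows that contain at least one NA
rows_with_na <- sum(!complete.cases(train))
print(rows_with_na)
```

Knowing which columns hold the NA values, and how many rows are affected, helps you decide whether deleting or imputing is the better option.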

Imputing NA Values

To get rid of the NA values, we have to either delete all the observations that contain an NA value or impute them. In the code below, NA values in numeric columns are replaced by -1, and NA values in character columns are replaced with “NAvalue”, which appears as a separate level in the data. Note that -1 only works as a placeholder if it is not already a legitimate value in the column.

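```r
# Assumes `train` is a data frame that has already been loaded
# Loop over every column
for (i in seq_along(train)) {
  if (is.numeric(train[[i]])) {
    # Numeric column: replace NA with the placeholder value -1
    train[[i]][is.na(train[[i]])] <- -1
  } else {
    # Any other column is treated as character: convert it so a new
    # value can be added, replace NA with "NAvalue", then make it a factor
    train[[i]] <- as.character(train[[i]])
    train[[i]][is.na(train[[i]])] <- "NAvalue"
    train[[i]] <- as.factor(train[[i]])
  }
}
```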

On applying the str function again, we get the view below:

You may observe that the number of observations is still the same. The data has not been truncated; we have only imputed the NA values wherever applicable.
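For comparison, the other option mentioned earlier, deleting every observation that contains an NA value, can be sketched with base R's `na.omit`. The toy data frame and its column names are hypothetical and are not the sample file from this post.

```r
# Hypothetical toy data with NA values in two different rows
train <- data.frame(
  age = c(23, NA, 31, 45),
  city = c("Pune", "Delhi", NA, "Mumbai")
)

# Drop every observation that contains at least one NA
train_dropped <- na.omit(train)

# Compare the number of observations before and after
rows_before <- nrow(train)         # 4
rows_after <- nrow(train_dropped)  # 2
```

Unlike imputation, this approach does truncate the data: here half of the rows are lost, which is why deletion tends to be a reasonable choice only when few observations are affected.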

You may try out some other approaches as well and post your experiences here.