Category Archives for "Tutorials"

Oct 06

Handling Missing Data in Predictive Analytics

By Ani Rud | Tutorials

Real-world data often contains missing values, which can cause a variety of problems when analyzing the data and deriving patterns from it. In this post we’ll discuss some of the ways in which we can deal with missing values.

In R, NA stands for a missing value. Though NA is not a string or a numeric value, it can be stored in a vector as a placeholder for a missing value.

v1 <- c(1, 2, NA, 3, NA)

If we now use the is.na function to find out which values in this vector are missing, we get the following result:

is.na(v1)

## [1] FALSE FALSE  TRUE FALSE  TRUE
In a similar way we can count the number of missing values in a column of a table.
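For example, combining is.na with sum gives the count of missing values in a column (dat and Col1 below are placeholder names):

```r
# Placeholder data frame; any column works the same way
dat <- data.frame(Col1 = c(1, NA, 3, NA))

# is.na gives a logical vector; sum counts the TRUEs
sum(is.na(dat$Col1))   # 2
```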

For a data frame we sometimes need to get rid of the NA values in order to make our predictions more reliable. Some of the models we will come across for predictive modelling don’t work with missing values. In those cases we can use na.omit or na.exclude to get rid of the missing values. Let’s say df is a data frame with two observations:

##   Col1 Col2 Col3
## 1    1    2  Yes
## 2   NA    3   No

We can use either na.omit or na.exclude to remove the rows containing NA, as below:

na.omit(df)

##   Col1 Col2 Col3
## 1    1    2  Yes

na.exclude(df)

##   Col1 Col2 Col3
## 1    1    2  Yes
In some R functions, one of the arguments the user can provide is the na.action. For example, if you look at the help for the lm command, you can see that na.action is one of the listed arguments. By default, it will use the na.action specified in the R options. If you wish to use a different na.action for the regression, you can indicate the action in the lm command.

Two common options with lm are na.omit (the default) and na.exclude. Both drop observations with missing values, but na.exclude additionally keeps their positions in the residuals and fitted values.
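As a minimal sketch (the data frame d and its columns are made up for illustration), na.exclude keeps the dropped row’s slot in the fitted values:

```r
# Toy data with one missing response value (made-up example)
d <- data.frame(y = c(1, 2, NA, 4, 5), x = c(1, 2, 3, 4, 5))

# na.exclude drops the NA row for fitting but pads the residuals and
# fitted values back to the original length, with NA in the missing slot
fit <- lm(y ~ x, data = d, na.action = na.exclude)
length(fitted(fit))   # 5, with NA at position 3
```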

If you wish to calculate the mean of the non-missing values in a vector, you can set the na.rm argument (which is, by default, FALSE) to TRUE. Suppose x1 is a vector with two missing values:

x1 <- c(1, 3, 4, NA, NA)

mean(x1, na.rm = TRUE)

## [1] 2.67

Two common commands used in data management and exploration are summary and table. The summary command (when used with numeric vectors) reports the number of NAs in a vector, but the table command ignores NAs by default.

summary(x1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    1.00    2.00    3.00    2.67    3.50    4.00       2

table(x1)

## x1
## 1 3 4
## 1 1 1

To see NA among the table output, you can set the useNA argument to "ifany" or "always". The first will show NA in the output only if there is some missing data in the object. The second will include NA in the output regardless.

table(x1, useNA = "ifany")

## x1
##    1    3    4 <NA>
##    1    1    1    2

table(1:3, useNA = "always")

##
##    1    2    3 <NA>
##    1    1    1    0

The most common approaches for dealing with missing features involve imputation. The main idea of imputation is that if an important feature is missing for a particular instance, it can be estimated from the data that are present. There are two main families of imputation approaches: (predictive) value imputation and distribution-based imputation. Value imputation estimates a value to be used by the model in place of the missing feature. Distribution-based imputation estimates the conditional distribution of the missing value, and predictions will be based on this estimated distribution.

Create flags for missing values

missing_val_var <- function(data, variable, new_var_name) {
  # flag each observation: 1 if the variable is missing, 0 otherwise
  data[[new_var_name]] <- ifelse(is.na(data[[variable]]), 1, 0)
  return(data)
}

Impute Numeric Missing values

numeric_impute <- function(data, variable) {
  # replace missing values with the mean of the non-missing ones
  mean1 <- mean(data[[variable]], na.rm = TRUE)
  data[[variable]] <- ifelse(is.na(data[[variable]]), mean1, data[[variable]])
  return(data)
}

Similarly, impute categorical variables so that all missing values are coded as a single level, say “Null”
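One way to sketch such a helper, mirroring numeric_impute above (categorical_impute is a hypothetical name, not a built-in function):

```r
# Hypothetical helper: recode missing values of a categorical
# (factor or character) column as the single level "Null"
categorical_impute <- function(data, variable) {
  col <- as.character(data[[variable]])
  col[is.na(col)] <- "Null"
  data[[variable]] <- factor(col)
  return(data)
}
```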

Pass the imputed variable into the modelling process

#Challenge: Try to Integrate a K-fold methodology in this step

create_model <- function(trainData, target) {
  myglm <- glm(target ~ . , data = trainData, family = "binomial")
  return(myglm)
}

Make predictions

score <- predict(myglm, newdata = testData, type = "response")

score_train <- predict(myglm, newdata = complete, type = "response")

Check performance
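A minimal way to check performance is a confusion matrix at a 0.5 threshold. In this sketch, score and actual are simulated stand-ins for the predicted probabilities and the test-set target:

```r
# Simulated stand-ins for the predicted probabilities and true labels
score  <- c(0.9, 0.2, 0.7, 0.4)
actual <- c(1, 0, 1, 1)

# Threshold the probabilities at 0.5 and tabulate against the truth
predicted_class <- ifelse(score > 0.5, 1, 0)
table(actual = actual, predicted = predicted_class)

# Overall accuracy: share of correctly classified observations
accuracy <- mean(predicted_class == actual)
```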


Sep 20

How to start data exploration in R

By Ani Rud | R Programming, Tutorials


  1. Setting current directory
  2. Loading data sets
  3. Working with datasets
  4. Summary view of data
  5. Finding missing data
  6. Replacing missing data
  7. Modifying variables like date etc.
  8. Combining and separating data sets
  9. Handling factor values

Part 1- Starting with the data exploration for Income Prediction

In the previous posts we have covered the installation, loading of packages, useful shortcuts, loading data, cross validation techniques and some basic functions. Now it’s time to roll up your sleeves and get started with an actual problem. I chose to start with the income prediction problem as it’s one faced by many analysts solving analytical problems in marketing, economics, finance etc.

I’ll kick-start the problem-solving process by slicing and dicing the data. Then we’ll apply a few models to find out if models make a better prediction than our own algorithms. Our aim will be to try out various optimizations to derive meaning out of the data and make predictions. I’ll show you how you can land in the top 3 slots of a Kaggle competition with the help of a few powerful functions and models. In future tutorials I’ll demonstrate how we can use feature engineering to improve the performance of our models and how to select the best model for a given dataset.

Though I’ll demonstrate the applications of dozens of models over the next few tutorials, analysts must realize that models are just some tools that can be used effectively only when the analyst has a command over the data and statistics. The more powerful models are usually less interpretable. So it’s best to start our analytics journey with a manual analysis which will give us better understanding of data. Never undermine the power of the human element in machine learning.

I’ll try to explain and apply statistical concepts in the simplest manner throughout this tutorial and I’d welcome your views in the form of queries or suggestions to improve the post. You can also post any bugs or typos that you detect. All types of questions are welcome.

How to get started?

As with most Kaggle competitions, we’ll download 3 files here: one with training data, in which values for the target variable are available; one with test data, for which the target variable is to be predicted; and one for submission. To learn more about training and test data and the concept of overfitting you may read this post.

What should we look for in the data?

In analyzing the data the four most important parameters that will determine the prediction are: sufficiency, accuracy, predictability and exclusivity. So our goal is to find out the most important variables for making a prediction and how to create new variables or combine existing variables to achieve sufficiency, accuracy, predictability and exclusivity.

As we will witness in the discussions ahead predictions are seldom accurate. There is always a degree of inaccuracy. That’s why increasing the quality of predictions by the slightest bit may result in millions of dollars of value at times. Predictive analytics is all about increasing the accuracy of our predictions by various means.

Accuracy is measured through validation techniques. Predictability is a subjective factor that requires human intervention.

Sufficiency and exclusivity are measured through statistical techniques and can be improved by various optimization techniques. Let’s say education and occupation are two different fields, each of which indicates who can be a high-income person with 90% certainty. But if, on comparing the results predicted by these two variables, we find that 90% of people with a proper ‘education’ and a higher income also have a decent ‘occupation’, we can’t consider ‘education’ and ‘occupation’ exclusive predictors any more. That means if we are already using ‘occupation’ for our prediction, adding ‘education’ might not give us the desired result. This is also called multicollinearity in statistical terms, and we’ll deal with it in later discussions.

Sufficiency, on the other hand, measures whether a particular variable is useful in making a prediction individually or in conjunction with some other variable. If we find on analysis that very few of these variables in their current form can predict the outcome with the desired level of certainty, the quality of our prediction will be compromised. In such cases we would explore various ways to increase sufficiency by combining multiple variables. The interplay of multiple variables makes predictive analytics all the more interesting.

Here’s a simple example: let’s say we can divide the data into male and females and all males with certain education earn higher and all females with certain occupation earn higher. In this case education can give us the prediction only for males and occupation only for females. But combining the two variables can make prediction for the complete data. This is how we achieve sufficiency.

Then there is predictability. Certain data don’t give us any concrete information about the target variable. For example, marital status may tell us very little about income: people get married irrespective of whether they earn less than or more than 50,000. So marital status is a field with low predictive value. Yet when we delve deeper into this variable, by finding the age at marriage, we may find that people who marry at a higher age earn more. The reason could be that these people devoted more time to their career before getting married. However accurate your prediction is, there is always room to improve the predictability of variables for more accuracy.

As this guide is for people with zero experience in machine learning, before starting with the analysis of data I’ve mentioned some basic functions that we will use throughout the discussions. If you are aware of these functions already click here to directly go to the data analysis.

Reading and writing data using R

We shall start with reading the data from the file. Here I will introduce a few basic functions that will help you to read data from excel files in csv format and use it for analysis.

Current directory

I’ll start with finding out my current directory. The getwd function returns an absolute filename representing the current working directory of the R process.

getwd()

## [1] "C:/Users/dataspoc/Desktop/Documents"

This command returns the path of your current directory, which in this case is ‘Documents’.

Change directory

Now let’s change the current directory to Problem Solving folder in Desktop where I’ve saved the downloaded file from Kaggle

setwd("C:/Users/dataspoc/Desktop/Problem Solving")

Read data

Now that we have set the directory to the folder containing the training and test data, let’s read the training data. To read a csv file we’ll use the read.csv function. You can store the returned value in a new object called train. Whatever modifications we make to the data stored in train, the original data in the csv file will not be affected.

train <- read.csv("train.csv")

test <- read.csv("test.csv")

Write data

Now, once we create a dataframe or any other data object we can write it into a file inside our current directory. If we are using write.csv the data will be written to a csv file. We will use this function when we have some prediction for the test data that we will write in submission file. For now, just giving you the syntax:

write.csv(train, file = "MyData.csv")

where train is an object where we have stored a data frame in the previous step. You can see that in the ‘Problem Solving’ folder another file named MyData.csv has been created.

Drop column

Once you have created a dataframe you might need to drop a column. This is required especially when you are merging two dataframes and one of them has additional columns. In that case you may use the function c inside the [ ] operator with a - sign, as below:

x <- x[, -c(2, 10)]

This modifies the dataframe x by removing the 2nd and 10th columns.

Add column

To add column to a data frame we will require the values to be added to that column. We can have these values stored inside a vector or we can create a column from an existing one. I’ll show you both the ways. In this example I’ll create a column to store an additional variable derived from the given data.


As we have already specified the current directory, let’s find out what files are in it. For that we use the dir command.

dir()

## [1] "sampleSubmission.csv" "test.csv"             "train.csv"

This means there are 3 files in the directory.

View the data columns

colnames(train)
