Tag Archives for " r-tutorial "

Dec 05

Dealing with NA Values in a Dataset

By ganpati | Tutorials

Handling the NA values present in your data addresses several challenges during analysis. It lets you use models that don’t work with NA values, and it can improve the accuracy of the predictions.

Remove NA Values for Better Analysis

There are multiple approaches available for getting rid of NA values. I’ll discuss some of them in this post. Below is a sample file that we will use for this tutorial. Some NA values are highlighted in this table.

Ideally, you should carry out this exercise during data preprocessing, after merging the train and test datasets.
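One way to read this advice is sketched below: stack the two sets, preprocess them together, then split them again. The data frames, the column names, and the `source` marker column are all hypothetical and only serve the illustration.

```r
# Hypothetical toy data; column names are illustrative only
train <- data.frame(id = 1:3, age = c(25, NA, 40), city = c("A", NA, "B"),
                    target = c(1, 0, 1), stringsAsFactors = FALSE)
test  <- data.frame(id = 4:5, age = c(NA, 31), city = c("B", NA),
                    stringsAsFactors = FALSE)

# Give test the missing target column and tag each row with its origin
test$target  <- NA
train$source <- "train"
test$source  <- "test"

# Stack the two sets so preprocessing is applied identically to both
combined <- rbind(train, test)

# ... handle NA values in the feature columns of `combined` here ...

# Split back into the original sets afterwards
train_clean <- combined[combined$source == "train", ]
test_clean  <- combined[combined$source == "test", ]
```

Note that the target column of the test rows is NA by construction in this sketch, so it should be left out of any imputation step.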

Data set for analysis

If we use the str function on this data we’ll get the below view (I’ve highlighted some of the NA values in this view):
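A comparable inspection can be sketched on a small made-up data frame. The `train` values below are hypothetical and are not the sample file from this post; `str`, `is.na`, `colSums`, and `complete.cases` are base R functions.

```r
# Hypothetical toy data standing in for the sample file
train <- data.frame(age = c(25, NA, 40, 31),
                    city = c("A", NA, "B", NA),
                    stringsAsFactors = FALSE)

# Compact overview: column types plus the first few values, NA included
str(train)

# Number of NA values in each column
na_counts <- colSums(is.na(train))
print(na_counts)

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(train))
print(rows_with_na)
```

Counting NA values per column this way is often quicker than scanning the `str` output by eye when the data has many columns.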


Imputing NA Values

To get rid of the NA values, we have to either delete all the observations that contain an NA value or impute them. In the code below, NA values in numeric columns are replaced by -1, and NA values in character columns are replaced with “NAvalue”, which appears as a separate level in the data.

# run a for loop over all columns
for (i in seq_len(ncol(train))) {
  if (is.numeric(train[, i])) {
    # numeric columns: replace NA with -1
    train[is.na(train[, i]), i] <- -1
  } else {
    # character (or factor) columns: replace NA with the "NAvalue" level
    train[, i] <- as.character(train[, i])
    train[is.na(train[, i]), i] <- "NAvalue"
    train[, i] <- as.factor(train[, i])
  }
}

On applying the str function again, we get the view below:


You may observe that the number of observations is still the same. The data has not been truncated. We have only imputed the NA values wherever applicable.
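This can be checked directly rather than by eye. The sketch below wraps the replacement rule from this post in a hypothetical helper, `impute_na`, and runs it on a made-up data frame; neither the helper name nor the data comes from the original tutorial.

```r
# Hypothetical helper wrapping the replacement rule described in this post
impute_na <- function(df) {
  for (i in seq_len(ncol(df))) {
    if (is.numeric(df[[i]])) {
      # numeric columns: replace NA with -1
      df[[i]][is.na(df[[i]])] <- -1
    } else {
      # other columns: replace NA with "NAvalue" and store as a factor
      values <- as.character(df[[i]])
      values[is.na(values)] <- "NAvalue"
      df[[i]] <- as.factor(values)
    }
  }
  df
}

# Made-up data for the check
train <- data.frame(age = c(25, NA, 40), city = c("A", NA, "B"),
                    stringsAsFactors = FALSE)
imputed <- impute_na(train)

same_rows  <- nrow(imputed) == nrow(train)  # observations are unchanged
any_na     <- anyNA(imputed)                # FALSE when nothing is missing
has_level  <- "NAvalue" %in% levels(imputed$city)
```

If `same_rows` is TRUE and `any_na` is FALSE, the imputation has filled every gap without dropping a row.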

You may try out some other approaches as well and post your experiences here.
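As a starting point, here is a minimal sketch of two such alternatives on made-up data: deleting incomplete observations with base R's `na.omit`, and imputing a numeric column with its median. Median imputation is not discussed in this post; it is shown only as one common option, and the data frame is hypothetical.

```r
# Made-up data for illustration
train <- data.frame(age = c(25, NA, 40, 31),
                    city = c("A", "B", NA, "B"),
                    stringsAsFactors = FALSE)

# Approach 1: delete every observation that contains an NA
train_deleted <- na.omit(train)

# Approach 2: impute numeric NAs with the column median instead of -1
train_median <- train
age_median <- median(train$age, na.rm = TRUE)
train_median$age[is.na(train_median$age)] <- age_median
```

Deletion shrinks the dataset (here from four rows to two), while median imputation keeps every row; which trade-off is acceptable depends on how much data you can afford to lose.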

Sep 03

Getting Started With Predictive Analytics

By Ani Rud | Getting Started

So you are all excited about making predictions and ready to get started with predictive analytics!

Well, this is just the right place for you to begin. I’m going to introduce you to a free and powerful statistical programming language called R and make you awesome with predictive analytics.

If you are wondering why I’m using R and not another language to demonstrate the problem solving, here’s the reason: why not SAS or Python.

Over the next 12 posts on Problem Solving with R basics, I’ll ease you into R and its syntax, step by step. If you sincerely follow these steps, you’ll be able to write your own algorithms and crack problems on your own within a week or two. The list of these 12 posts is available at the end of this post. Also, at the end of each post you will find the link to the next suggested post.

The basics are either for those who are new to data science and are learning the tricks of the trade, or for those who have already learnt the R language but are still finding ways to reach the top 5% of Kaggle.

If you have already learnt the basics and are gearing up to climb the Kaggle competition ranks, or to undertake some freelance consulting or internships, you might be interested in our mentorship for dream analytics job program. It offers direct exposure to analytic product development, teaming up for Kaggle with competent partners, advice on which analytics skills best suit your existing profile, participation in in-house competitions, mock interviews, and guidance on building a strong analytics resume.

Getting Started with Analytics

In the initial few posts, I’ll start with the installation of R, some important packages, basic tips on syntax, working with Github, some useful datasets, creating a Kaggle account and basic handling of a Kaggle dataset. If you already know the basics and want to jump straight to the problem-solving part, visit predictive modelling problems. The predictive modelling section will introduce you to some challenging problems from Kaggle, covering the selection of algorithms and frequently used feature engineering concepts that will give you a wide range of choices to attack a problem. Here is a guide to all the 12 tutorials:

  1. Fire up your analytics skills with R
  • RStudio and GUIs
  • Installing R
  • Installing and loading important packages in R (ggplot2, party, Hmisc, car, MASS, plyr)
  • Running R 32 bit vs 64 bit
  • Useful datasets
  2. Get familiar with the World of Analytics
  • Creating a Kaggle account (for practice)
  • Training and test datasets
  • Github basics
  • Some commonly used terms
  • Shortcuts (keyboard and others)

Techniques at a glance

  3. Regression vs Classification techniques in R
  • Examples of a few regression techniques
  • Examples of a few classification techniques

Approach to Predictive Analytics

  4. Understanding Predictive Analytics
  • Difference from other forms of analytics
  • Emphasis on large data sets
  • Types of predictive analytics problems
  • Challenges in predictive analytics
  5. 5 Steps for Mastering Data Analytics
  6. How to start data exploration in R
  • Setting the current directory
  • Loading data sets
  • Working with datasets
  • Summary view of data
  • Finding missing data
  • Replacing missing data
  • Modifying variables like date etc.
  • Combining and separating data sets
  • Handling factor values
  7. Basic Statistics for Data Science
  • Basic rules of probability
  • Expected value
  • Sample and population quantities
  • Signal and noise
  • Probability densities and mass functions
  • Variability and distribution
  • Statistical meaning of overfitting
  8. Success criterion for a Predictive Model
  9. Understanding Data sets using visualization in R

Predictive analytics tutorials

  10. Slicing and Dicing the data with R
  11. Understanding the models in R
  12. Starting with feature engineering