Jan 01

## Start Solving Kaggle Problem With R: One Hour Tutorial

I once came across the question “How can I learn R in a day?”. Though it sounds like an impossible task, you can surely gain a basic understanding of the tool in a very short time. Interestingly, R has an easy learning curve at the beginning, but once you proceed to advanced topics the curve gets steeper and steeper, partly due to the statistical knowledge required for advanced learning.

This post is for those who want to gain an initial understanding of R in a short span of time through some hands-on tutorials. I’ll start with a Kaggle problem. Kaggle is one of the many places to find interesting problems in data science. Since you’ve decided to learn R, you probably already know that it is a free and powerful statistical programming language. For this tutorial I’ll use the R console. You may install R here if you have not already done so.

Once you have R installed, you can download the ‘test’ and ‘train’ data files for the competition Titanic: Machine Learning from Disaster, and you are all set to begin.

As I’m a strong believer in the role of questions in learning and thinking, I’ve tried to follow a question-and-answer method in this post. I’m sure many more questions will arise in your mind, and I’ll be happy to answer them in the comments section. We’ll start with 3 basic questions which I believe you should ask at the beginning of every Kaggle problem:

1. What is being predicted?
2. What data do we have?
3. What’s the prime motive or behavior in play here?

We are predicting the fate of all the passengers who were aboard the ocean liner ‘Titanic’ on its fatal maiden voyage. In predictive analytics terminology, we are predicting the variable ‘Survived’, also called the target variable.

The train data consists of details of passengers, with each row holding several pieces of information about a passenger, such as class (which denotes socio-economic status, not the class of journey), sex, age, and most importantly whether the passenger survived. In the test data the value of the variable Survived is not given, and we have to predict it.

The prime motive or behavior in play is this: since there is a shortage of lifeboats, a pecking order will decide who gets access to the lifeboats first. Our instinct says women and children will be the first ones to be saved, followed by the elderly, and somewhere in between the influential people in society will cut in (like Hockley in the movie). Let’s find out with our analysis.

We’ll begin the analysis by asking a few questions about the data-

### How to read the data in R?

```r
# store the path to the folder containing the train and test files in cur_dir
cur_dir <- "location of the folder where you stored the train and test files"

# set cur_dir as your working directory using the setwd command
setwd(cur_dir)

# read the train data and store it in the train object using read.csv
# stringsAsFactors is a logical argument that is TRUE by default;
# making it FALSE indicates strings shouldn't be treated as factor variables
train <- read.csv("train.csv", stringsAsFactors = FALSE)
```

We’ll discuss the stringsAsFactors argument later. For now just note that it’s TRUE by default, and we set it to FALSE to avoid converting all strings to factors.
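As a toy illustration (a made-up two-row data frame, not the Titanic files), here is how the same text column behaves with and without stringsAsFactors:

```r
# the same text column, once as a factor and once as plain characters
df_factor <- data.frame(name = c("Alice", "Bob"), stringsAsFactors = TRUE)
df_char   <- data.frame(name = c("Alice", "Bob"), stringsAsFactors = FALSE)

class(df_factor$name)  # "factor"
class(df_char$name)    # "character"
```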

Let’s try finding answers to the next set of questions:

1. How to summarize the data to understand it properly?
2. What are the values in the Target Variable?
3. What are the most important predictors influencing the target variable?
4. Based on the behavior of the data identified earlier, what predictors can be found?
5. What are the characteristics of the predictor variables: can they predict the target variable independently, or should they be combined with other predictors?

In order to summarize the data we will use the str function and the output will be as below:

```r
str(train)
```

Let’s start with the data types here. The ‘int’ data type indicates integer variables, which can only hold whole-number values. ‘num’ indicates numeric variables, which can take decimal values. ‘Factors’ are like categories. By default all text is imported as factors, but if we specify stringsAsFactors = FALSE, as we did in the read.csv call, text is imported as ‘chr’ (character strings) rather than factors.

The str function also gives us the number of observations in the data as well as the number of variables. It also gives few initial values for each column of the data as you can see here.
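The types str reports can be sketched in isolation (a toy illustration, not the Titanic data):

```r
# one value of each type that str() may report
v_int <- 3L              # 'int'  - integer
v_num <- 3.14            # 'num'  - numeric (decimals allowed)
v_chr <- "male"          # 'chr'  - character string
v_fac <- factor("male")  # 'Factor' - categorical

class(v_int); class(v_num); class(v_chr); class(v_fac)
```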

To find out what values the target variable can take, we need to check the unique values in the ‘Survived’ column.

We can access the Survived column using the dollar sign: train$Survived

train$Survived will give you a vector of all the values present in the column. To find the unique values present in the column we need to use the unique function:

```r
unique(train$Survived)
```

will give us the output:

```
[1] 0 1
```

This means it is a binary classification problem, where the target variable can take only two values: 0 and 1.

The unique function gives us the distinct values in a column, but what if we want to know how many values of each type are present in the column? For that we need the table function.

```r
table(train$Survived)
```

gives us output as below:

```
  0   1
549 342
```

Now, we know the number of 0’s and 1’s in the data.

The next question is to find the predictors. From the list of variables in the summary, which do you think is a likely predictor? Intuitively we can say that Age and Sex should be likely predictors, as women and children are given preference in rescue operations during a tragedy. But how shall we check this assumption? We can do that by just adding another column to our table function.

```r
table(train$Survived, train$Sex)
```

```
    female male
  0     81  468
  1    233  109
```

Now we have the number of males and females who survived the tragedy. It will be better if we can see this as a proportion, which can be done using the prop.table function:

```r
prop.table(table(train$Survived, train$Sex))
```

```
        female       male
  0 0.09090909 0.52525253
  1 0.26150393 0.12233446
```

We found the proportions, but they are relative to the complete table. We would rather see what proportion of males and what proportion of females survived, and for that we need column-wise proportions. To specify column-wise proportions, we add 2 as a second argument:

```r
prop.table(table(train$Survived, train$Sex), 2)
```

```
       female      male
  0 0.2579618 0.8110919
  1 0.7420382 0.1889081
```
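To see exactly what that second argument does, here is a toy table (made-up data, not the Titanic set): with margin 2, each cell is divided by its column total, so every column sums to 1.

```r
# toy outcome/sex table; prop.table(m) divides by the grand total,
# prop.table(m, 2) divides each cell by its column total
m <- table(outcome = c(0, 0, 1, 1, 1), sex = c("f", "m", "f", "f", "m"))
prop.table(m, 2)
colSums(prop.table(m, 2))  # each column sums to 1
```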

Now we have a more meaningful representation: 74.2% of the females survived the disaster compared to 18.89% of the males. So our initial assumption was correct. Let’s now check the Age variable. If you use the unique function on it you will find there are 89 distinct values. It will be wise to reduce the number of values so as to get an easy comparison.

Let’s consider the following age groups: 1-12, 13-18, 19-35 and 35+. This classification is not strictly based on any reasoning; it’s just for initial analysis. Later on we’ll find better age brackets with proper techniques. We’ll now learn the bracket operator, which helps us find the values in a column that meet a certain condition, and we’ll create a new variable Age_grp to store the age-group values:

Here’s how we create Age_grp variable:

```r
train$Age_grp[train$Age < 13] <- "grp1"
train$Age_grp[train$Age > 12 & train$Age < 19] <- "grp2"
train$Age_grp[train$Age > 18 & train$Age < 36] <- "grp3"
train$Age_grp[train$Age > 35] <- "grp4"
```

Now, if we use the proportion table with our Age_grp values we’ll get new insights about the data:

```r
prop.table(table(train$Survived, train$Age_grp), 2)
```

```
         grp1      grp2      grp3      grp4
  0 0.4202899 0.5714286 0.6173184 0.6175115
  1 0.5797101 0.4285714 0.3826816 0.3824885
```

As you can see, 57.97% of the children (grp1) and 42.86% of the teenagers (grp2) survived, compared to roughly 38% of the adults. So, apparently, some effort went into saving children.
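As an aside, the same binning can be done in one step with the cut function; this is only a sketch on made-up ages, with breaks mirroring the groups above:

```r
# cut() assigns each value to an interval: (0,12] -> grp1, (12,18] -> grp2,
# (18,35] -> grp3, (35,Inf] -> grp4
ages <- c(5, 15, 25, 40)
age_grp <- cut(ages, breaks = c(0, 12, 18, 35, Inf),
               labels = c("grp1", "grp2", "grp3", "grp4"))
as.character(age_grp)  # "grp1" "grp2" "grp3" "grp4"
```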

Now, before we proceed to the other questions, based on our analysis so far we will create a submission file and upload it to Kaggle to see our score. Here’s how we’ll make our prediction for the submission:

All females survived and all males below the age of 12 survived.

We’ll first read the test file, fill in the values of the Survived column, and write the file out again.

```r
# read the test file
test <- read.csv("test.csv", stringsAsFactors = FALSE)

# first we make all the values in the Survived column 0
test$Survived <- 0

# then we set the values we predicted as survived to 1
test$Survived[test$Sex == "female"] <- 1
test$Survived[test$Sex == "male" & test$Age < 12] <- 1

# now we write the value stored in the test object back to test.csv
write.csv(test, file = "test.csv", row.names = FALSE)
```

As a last step before submission we have to create the submit object and write it in a submission file as below:

```r
submit <- data.frame(PassengerId = test$PassengerId, Survived = test$Survived)
write.csv(submit, file = "first_submission.csv", row.names = FALSE)
```

If we now go to the Make a Submission link for Titanic tutorial we’ll see a submission area like below:

We just have to click on the highlighted button, select the first_submission.csv file from our current directory and upload the file. Then click on the submit button as indicated and you’re done.

You’ve just scored 0.77033 and left behind almost 1,000 competitors by predicting the obvious. Congratulations! In the next tutorials, as we fine-tune our models, we will see a huge jump in our ranking.

Now, coming to the next question: what other predictors can we single out based on the behavior of the data? In data mining you’ll often find that algorithms mine the data to find the behavior. But in data science you can use your judgment to identify certain predictors very easily, and this will save you a lot of effort when a large number of variables could serve as predictors.

Now, based on our understanding of society, we know that people with substantial social influence had a better shot at accessing the lifeboats. How can we identify such people? These people must have purchased tickets at a higher price, and they must have belonged to a higher class. Let’s test this assumption.

In order to test this assumption, I’ll introduce an amazing function called aggregate. Let’s find out how it can be used to analyze the Pclass and Fare variables.

```r
aggregate(Survived ~ Pclass, data = train, mean)
```

```
  Pclass  Survived
1      1 0.6296296
2      2 0.4728261
3      3 0.2423625
```

In the call above I’ve used the mean function to find the fraction of people who survived. Since the Survived variable has values 0 and 1, mean adds up all the 1’s and divides by the total number of values (the count of 0’s and 1’s), which is effectively the fraction of people who survived. You can get a detailed understanding of the aggregate function here.
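The arithmetic is easy to verify on a tiny made-up vector:

```r
# mean of a 0/1 vector equals the fraction of 1's
survived <- c(1, 0, 0, 1, 1)
mean(survived)  # 3 survivors / 5 passengers = 0.6
```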

It’s apparent that people belonging to class 3 had a much lower survival rate of 24.2% compared to class 1, which had a 62.96% survival rate. Let’s now check the Fare variable. But there may be many distinct fares, which would produce a long list of aggregate values. Let’s find out:

```r
length(unique(train$Fare))
```

```
[1] 248
```

So, we have 248 values. Let’s classify these values into more meaningful categories of high and low fares. For that we have to plot a histogram to see how the values are distributed.

```r
hist(train$Fare)
```

We can see that most of the passengers paid a fare less than 50. Only a small fraction paid above 500, and some paid between 200 and 500. So we can heuristically assign the following categories for the fares:

0 to 50, 50 to 100, 100 to 150, 150 to 500 and 500+

We’ll introduce another variable called Fare_type and apply the aggregate function on it:

```r
train$Fare_type[train$Fare < 50] <- "low"
train$Fare_type[train$Fare > 50 & train$Fare <= 100] <- "med1"
train$Fare_type[train$Fare > 100 & train$Fare <= 150] <- "med2"
train$Fare_type[train$Fare > 150 & train$Fare <= 500] <- "high"
train$Fare_type[train$Fare > 500] <- "vhigh"

# now applying the aggregate function on Fare_type
aggregate(Survived ~ Fare_type, data = train, mean)
```

```
  Fare_type  Survived
1      high 0.6538462
2       low 0.3191781
3      med1 0.6542056
4      med2 0.7916667
5     vhigh 1.0000000
```

As you can see, all the people who paid very high fares survived compared to only 31.9% of those who paid low fares.

Based on the above methods you can try all variables and see for yourself how you can improve the accuracy of your prediction.

In the next post we’ll discuss the various models that can be applied to this problem for better predictions.

I’d seriously suggest you work out these steps on your own as we move forward with this tutorial, to get first-hand experience of the programming and modeling. Have fun with R and Kaggle.

Dec 05

## The Ultimate Guide to Data Visualization

“The greatest value of a picture is when it forces us to notice what we never
expected to see.” – John Tukey

This post will cover:

• Types of variables
• Types of charts and graphs

## Some Terminologies:

Longitudinal data tracks the same type of information on the same subjects at multiple points in time. For example the data could contain a few stores and their sales over a period of 8 consecutive quarters.

A covariate is a variable that is possibly predictive of the outcome under study. In other words a variable that can be used for predicting the value of target variable is called covariate.

Cross-sectional data is a type of data collected by observing many subjects at the same point of time without regard to differences in time. For example data may be collected for different students from different states.

## Types of variables

In this post we will discuss cardinal, ordinal and nominal variables.

Cardinal variables are the ones that can be added, subtracted and multiplied. Examples- age, quantity, weight, count etc.

Ordinal variables can be compared with greater than or smaller than or equal to signs, but we can’t have other mathematical operations like addition or multiplication on them. Examples are dates, response to survey questions like- strongly agree, agree or disagree.

Nominal variables give us some information for classifying observations, but they can’t be ordered, and we can’t apply mathematical operations like addition or subtraction to them. Examples are race, gender etc.

## Types of charts and graphs

### Scatter plot

Scatter plots can be used when we want a direct comparison between two variables to find out how much one variable is affected by the other. Here we’ll start by picking up the freeny dataset from R and demonstrate how different variables can be plotted to understand their relationship:

A snapshot of freeny is below:

If we now plot the lagged quarterly revenue on the Y-axis against the price index on the X-axis, we see a clear relationship between the two variables: revenue decreases as the price index increases. R command:

```r
plot(freeny$price.index, freeny$lag.quarterly.revenue)
```

Now we will plot each variable against the other variables in the dataset to see how they are related to each other. In real-life problems, scatter plots often come in handy when we need to test how the target variable changes with various other variables.

Before that, let me introduce the color brewer:

```r
install.packages("RColorBrewer")
library(RColorBrewer)
```

Color Brewer will help you choose sensible color schemes for the plots. brewer.pal makes the color palettes from ColorBrewer available as R palettes. display.brewer.pal() displays the selected palette in a graphics window. Example:

```r
display.brewer.pal(3, "Set1")
```

output will be the below palette in graphics window:

In a similar fashion you may try out different palettes like Set2, Set3 etc. and vary the number of colors with display.brewer.pal(num_of_color, Palette_name) to get an idea of the different palettes. Once you know which palette to use and how many colors, simply pass col = brewer.pal(num_of_color, Palette_name) to the plot function.

Now, if we plot the freeny data with 3 colors from the Set1 palette for each pair of variables:

```r
plot(freeny, col = brewer.pal(3, "Set1"))
```

In the above function, 3 is the number of different colors in the palette (minimum value is 3). Set1 is a palette name (there are multiple palettes available for your plots in R).

How do we read a two-dimensional scatter-plot matrix?

For any graph in the matrix, the variable named in the same column is plotted on the X-axis and the variable named in the same row on the Y-axis. So the rightmost graph in the second row has market.potential on the X-axis and lag.quarterly.revenue on the Y-axis. The interesting fact about this set of plots is that every pair of variables is either directly or inversely correlated.

The advantage of a scatter plot is that we can look at the correlation visually. The difference between visual and mathematical correlation can be of great significance. Let’s say we derive the mathematical correlation between two variables as 0.2, which is not significant. But when we plot them we find that over a specific interval the two variables are strongly correlated, while over other intervals they show no correlation. This finding can be of great help if we are interested in predicting or forecasting values in that particular interval only.

However, if one of your variables is a logical (TRUE/FALSE) variable, a scatter plot would not make sense. It will give a result like below:

This plot shows only two values on the X-axis, 0 and 1, and for almost every value of the Y-axis variable a corresponding point is plotted, so the plot is meaningless. In order to derive meaningful results for binary variables, we need the aggregate function instead of a scatter plot.
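For instance (a toy sketch with made-up data, not the plot above): instead of scattering raw points against a 0/1 variable, aggregate the numeric variable by the binary one and compare the group means.

```r
# made-up data: flag is binary, y is numeric
d <- data.frame(flag = c(0, 0, 1, 1), y = c(2, 4, 6, 8))
agg <- aggregate(y ~ flag, data = d, mean)
agg  # one mean per group: 3 for flag 0, 7 for flag 1
```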

### Bar Charts

Bar charts are most suitable when we have different classes or groups in the data that we want to compare. For example it can be used for depicting time series, where we want to compare certain variable across the decades or we can break the classes into further groups for a detailed analysis by comparing the data for various regions in the same decade. The breaking down of the classes/groups is possible in many ways as I’ll demonstrate here.

The below table (JohnsonJohnson) shows Quarterly Earnings per Johnson & Johnson Share:

```r
hist(JohnsonJohnson)
```

will give us the histogram of all the values:

In the above histogram there are 9 bars, each denoting the frequency of values in a particular interval. The first bar covers the 0-2 interval; if we count in the table, the number of such values is 33. So 33 is the frequency of the first interval, which is shown on the Y-axis. Of course we can change the number of bars, colors etc. for this plot, as we’ll discuss next.
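To vary the number of bars, pass a suggestion via the breaks argument; a small sketch (plot = FALSE just returns the counts without drawing):

```r
# hist() treats the JohnsonJohnson time series as a plain numeric vector
h <- hist(JohnsonJohnson, breaks = 4, plot = FALSE)
h$counts       # frequency in each interval
sum(h$counts)  # equals the number of observations (84)
```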

### Pie Charts

I personally prefer pie charts when comparing two outcomes of a single variable. For example, responses to a survey where a ‘yes’ or ‘no’ question was answered by the respondents can be represented well using a pie chart.

Pie Charts are suitable for depicting the results of a single variable that has few outcomes and the outcomes are expressed as percentages rather than absolute numbers. It provides a visual aid to grasp the contrast in those outcomes. That’s why you’d observe them mostly in polls etc.

### Bubble Charts

Bubble charts are often useful to show a ripple effect. For example if we want to depict the assets under consideration at different stages of the financial crisis we can visualize it well using a bubble chart.

### Raster Plot

These are suitable for visualizing 3-dimensional data: we have 3 variables to depict in a two-dimensional space. It’s best to place two variables on the two axes and represent the 3rd variable using color coding. The 3rd variable should be a categorical variable in this case.
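In base R this idea can be sketched with the image function, which colors a grid of values; here z is a made-up 3rd variable over an x-y grid:

```r
# z[i, j] holds the 3rd variable's value at grid point (i, j)
z <- outer(1:10, 1:10, "+")
# image(1:10, 1:10, z, col = heat.colors(12))  # draws the raster plot
dim(z)  # a 10 x 10 grid
```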

### Two Important Functions Used for Visualization in R

qplot function in R

We’ll start discussing qplot with the ChickWeight dataset available in R that looks like below:

The very basic plot that we can draw with qplot (from the ggplot2 package, loaded with library(ggplot2)) is by supplying the variables to be plotted on the x-axis and y-axis and passing the dataset name:

```r
qplot(Time, weight, data = ChickWeight)
```

The plot that is rendered is as below:

This plot is not very helpful in making any inference from the data. So, now let’s go a step ahead and find out how we can differentiate between various diets by looking at this plot. We shall use different colors to plot the values of different diets by using the parameter colour= Diet.

That’s more helpful. We can see that Diet 3 is most effective in increasing weight and diet 1 is least effective. However Diet 2 gives mixed results.

ggplot2 library in R

Last but not least, picking the right chart or graph will depend on its purpose. If the end user of the graph is statistically savvy you may opt for a complex chart that conveys more information in a small window. On the other hand, if the end user is only interested in a high-level view, you should keep your graphs and charts simple and focus only on end results that are actionable.

For example if you are presenting the sales for different stores of a retail chain across the states to the higher management, it makes sense to restrict your charts to actionable items like in which states the sales have decreased, which stores were out of stock during the peak season and so on.

You may visit some nice blogs about data visualization from around the internet

Dec 01

## How to Become A Data Scientist

Over the past few years the role of the predictive modeler has broadened and received a lot of attention, enough to be called the sexiest job of the century. The job of a Data Scientist, as we now call the role, is rewarding both in terms of salary and recognition.

Data Scientists are now required to possess not just statistical and coding knowledge but also an understanding of the industry and its products. This post primarily gives a bird’s-eye view of topics, with related resources for building expertise through hands-on tutorials. There is a separate post dedicated to the thought process and traits of a data scientist. If you’re self-studying data science you should cover all the topics listed below, plus other relevant topics from the free ebooks available for data science.

You may also like the infographic on Data Scientist’s resources

# Data Science Prerequisites

## Mathematics and Statistics:

### Probability

• Random variables and expectations
• Probability mass functions
• Probability density functions
• Expected values
• Expected values for PDFs
• Independent and dependent variables
• Conditional Probability
• Bayes’ rule
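As a quick numeric sketch of Bayes’ rule, with made-up probabilities (a rare condition and an imperfect test):

```r
# P(A|B) = P(B|A) * P(A) / P(B), with P(B) from the law of total probability
p_a           <- 0.01  # prior P(A)
p_b_given_a   <- 0.90  # P(B | A)
p_b_given_not <- 0.05  # P(B | not A)

p_b <- p_b_given_a * p_a + p_b_given_not * (1 - p_a)
p_a_given_b <- p_b_given_a * p_a / p_b
round(p_a_given_b, 3)  # 0.154: even after a positive test, P(A|B) stays small
```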

### Basic Statistics

You may go through the video tutorials available on SAS for the below topics on basic statistics:

• Distribution Analysis
• One-Way Frequency Analysis
• Table Analysis
• Correlation Analysis
• Sample Tests (One-Sample, Two-sample, Paired t Tests)
• ANOVA (One-Way, N-way, Nonparametric One-Way)
• Linear Regression (Simple, Multiple )
• Analysis of Covariance
• Binary Logistic Regression
• Generalized Linear Models

The topics below are most important for analyzing data statistically. A detailed discussion of these topics is available in the book Think Stats (free PDF) and the MOOC Design and Interpretation of Clinical Trials from Coursera.

### Statistical Approach

In our day-to-day life we come across a lot of anecdotal evidence, based on data that is unpublished and usually personal. But in a professional setup we want evidence that is more persuasive, and anecdotal evidence usually fails because of a small number of observations, selection bias, confirmation bias and inaccuracy. A statistical approach helps us overcome these shortcomings and make convincing arguments in front of the different stakeholders in the business.

### Descriptive Statistics

Means and averages, Variance, distributions and various visualization techniques.

### Experimental Design and Associated Methods

The simplest example of a statistical experiment is a coin toss: there is more than one possible outcome, each possible outcome (heads or tails) can be listed in advance, and there is an element of chance due to the uncertainty of the outcome.

Some practical areas of significance are marketing campaigns and clinical trials. Results from randomized clinical trials are usually considered the highest level of evidence for determining whether a treatment is effective.

On the other hand, marketers use this technique to increase the number of variables tested in a single campaign (product offers, messages, incentives, mail formats and so on) and to test multiple offers in the market simultaneously. Marketers learn exactly which variables entice consumers to act. As a result, response rates rise dramatically, the effectiveness of future campaigns improves and overall return on spending increases.

Deriving insights from data requires an understanding of proper experimental design. You’ll need to be able to interpret when causal effects are significant, and to work with more complicated setups involving sequential, overlapping or multiple experiments.

Associated analysis methods, ranging from t-tests to ANOVAs, are useful for making inferences from these experiments.

There are many questions that can’t be answered with observational data alone, especially questions like: how does this new feature impact latency, conversion or CTR?

### Quasi-experiments or observational studies

These are more difficult than experiments to analyze, and require a very strong understanding of assumptions and causal models in order to draw convincing conclusions from them. Not everything can be randomized, so you’ll often have to face these observational studies or quasi-experiments.

### Time Series

Most of the BI metrics we’ll discuss throughout the tutorials come as time series. With time series analysis you can use past data to forecast the future.

### Linear modeling

It is important and common for regression and classification exercises. It gives you a good framework to combine information from covariates and predict an outcome or class.
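A minimal linear-model sketch on R’s built-in cars dataset (stopping distance vs. speed):

```r
# fit a simple linear model and inspect the coefficients
fit <- lm(dist ~ speed, data = cars)
coef(fit)               # intercept and slope
summary(fit)$r.squared  # fraction of variance explained
```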

Some other statistical topics of interest are: Survey analysis, Causal inference, Bayesian data analysis, Nonparametric methods etc.

### Statistical Methods used in Machine Learning

For classification problems discriminant analysis seems to be used a lot by data scientists.

Techniques such as clustering, trees, etc. also help with regression, classification, and more complicated modeling questions such as understanding text, pictures, and audio.

## Tools (one or more of below tools):

• R
• Python
• SAS
• Matlab
• SPSS and many others
### Learning Python programming:

Step by step guide to learn Python

### Learning R programming:

Why is R a language of choice for data scientists?

Here is a step by step guide with 12 tutorials that help you learn Data Science with R

Here is a trove of learning materials from IDRE, UCLA

### Learning SAS:

Visit SAS website for tutorials

## Database Tools:

As a data scientist you must know how data is stored, queried, managed, handled, cleaned, modelled and reported, though you may not need to create data pipelines or manage data warehouses yourself.

As a student you’ll most often work with text files or CSVs to access and store data. But once you join the industry you’ll be required to query databases to fetch your data and do the analysis; MySQL, MongoDB and Cassandra are a few of the names. It’s important that you know at least the basics of querying data and other concepts of relational databases (data models etc.), which will give you the ease of working in a real-life setup.

## Mastering Machine Learning Algorithms

Once you have gained a basic understanding of the statistical topics and tools you should focus on various techniques used in machine learning. There are 3 broad types of learning: Supervised Learning, Unsupervised Learning and Semi-Supervised Learning. Below is a list of various types of algorithms that you should learn for Data Science applications:

• Decision Tree Algorithms (Conditional Decision Trees, CART etc.)
• Bayesian Algorithms (Naive Bayes etc.)
• Clustering Algorithms (k-means, Hierarchical Clustering etc.)
• Artificial Neural Network Algorithms (Perceptron etc.)
• Deep Learning Algorithms (Deep Belief Networks etc.)
• Dimensionality Reduction Algorithms (PCA, PCR etc.)
• Regression Algorithms (Linear Regression, Logistic Regression etc.)
• Instance-based Algorithms (KNN etc.)
• Regularization Algorithms (Ridge Regression etc.)
• Ensemble Algorithms (Random Forest, GBM etc.)
• Other Algorithms

Here’s an excellent blog post that takes you on a tour of machine learning algorithms.

## Data Cleaning, Wrangling, Visualization

Data scrubbing, also called data cleansing, is the process of amending or removing data in a database that is incorrect, incomplete, improperly formatted, or duplicated.

Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

ggplot2 is a powerful tool in R that comes handy for Data Scientists. Here’s a cheatsheet on ggplot2.

## Industry Metrics

Engagement and retention rates, churn, conversion, cancellation, competitive-product and duplicate matching, spam detection, behavioral traits, and how to measure them.

Cost functions: Log-loss, other entropy-based, DCG/NDCG, etc.

## Other Useful Topics

### A Grasp of End-to-End Development

The things that Data Scientists build (graphs, charts, reports, analyses and presentations) are meant for consumption by product managers and other decision makers across the company. These ideas need to be integrated into other systems that are either already functional or will be introduced in the future. So, in order to make practical suggestions to the decision makers, it’s important to understand the end-to-end processes in your company and how things work.

### Topics related to Computer Science

• Theory of computation / Analysis of algorithms
• Data structures and algorithms
• Software engineering
• Parallel programming / Massive computation (for processing huge datasets)
• Network analysis

Data Science is a huge subject area, and even if you master a lot of tools and techniques you’ll still need a good deal of intuition and rational guesswork to complement the scientific knowledge. That’s because in a practical scenario you’ll often work with inadequate data, inefficient systems that may not let you process data above a certain limit, strict timelines and so on. Working on some real-life puzzles and guesstimation problems will open your eyes to developing your own thought process for tackling such situations.

### Familiarity with Big Data Tools and Frameworks

While working with data that runs into petabytes, you may need distributed processing, as such volumes can’t be handled by a single machine. The volume, velocity and variety of the data require a different approach than what you would apply to ordinary data. Familiarity with some of the Big Data tools and frameworks like Hadoop, MapReduce and Apache Spark may prove to be a plus depending on which company you’re aiming to work for.

### Economics

A few nice topics to cover are: Behavioral economics, Game theory, Auction design

## Free Online Courses

I’ve listed here some highly recommended courses that are available online for free. If you are looking for a thorough analysis of different online courses please visit the post on analysis of 12 online data science courses.

### Data Science

CS109 Data Science @ Harvard
Estimated efforts: 120 to 200 hours
Tool used: Python, d3

Introduction to Data Science @ COURSERA
Estimated efforts: 100 to 140 hours
Certificate for \$49
Tool used: Python, R, SQL

### Data Analysis Courses Online

Data Analysis: Take It to the MAX() @ EDX
Estimated efforts: 30 to 50 hours
Certificate for \$50
Tool used: MS-Excel, python

The Analytics Edge @ EDX
Data Analysis & Statistics
Estimated efforts: 120 to 180 hours
Certificate for \$100
Tool used: R

### Machine Learning

Learning From Data @ CALTECH
Estimated efforts: 120 to 140 hours
Tool used: Any tool

Machine Learning (by Andrew Ng) @ COURSERA
Machine Learning
Estimated efforts: 80 to 130 hours
Certificate for \$49
Tool used: Octave

## Hands-on with Kaggle

Kaggle is the best platform for learning by doing. Whether you are a beginner, a learner, a professional or a maestro, Kaggle has something in store for you. Besides letting you work on real-life problems, it also helps you get noticed by prospective employers. But as with every other platform, the key to using Kaggle to your best advantage is forming the right team and choosing the right problems. We have started a community to help you get into the right team based on your expertise and interests. Just fill in this simple form and we’ll get in touch with you.

If you are just thinking of trying your hand at some of the Kaggle problems, the best advice is to start by asking some basic questions and exploring the data before moving on to advanced analysis.

Sep 03

## Getting Started With Predictive Analytics

So you are all excited about making predictions and ready to get started with predictive analytics!

Well, this is just the right place for you to begin. I’m going to introduce you to a free and powerful statistical programming language called R and make you awesome at predictive analytics.

If you are wondering why I’m using R and not any other language to demonstrate the problem solving, here’s the reason: why not SAS or Python.

Over the next 12 posts on problem solving with R basics, I’ll ease you into R and its syntax step by step, and you’ll be able to write your own algorithms to crack problems on your own within a week or two if you sincerely follow these steps. The list of these 12 posts is available at the end of this post. Also, at the end of each post you will find a link to the next suggested post.

The basics are for those who are either new to data science and learning the tricks of the trade, or for those who have already learnt the R language but are still looking for ways to reach the top 5% of Kaggle.

If you have already learnt the basics and are gearing up to climb the Kaggle competition ranks, or to undertake freelance consulting or internships, you might be interested in our mentorship for dream analytics job program, where you get direct exposure to analytics product development, teaming up for Kaggle with competent partners, advice on which analytics skills best suit your existing profile, participation in in-house competitions, mock interviews and guidance on building a strong analytics resume.

### Getting Started with Analytics

In the initial few posts, I’ll start with the installation of R, some important packages, basic tips on syntax, working with GitHub, some useful datasets, creating a Kaggle account and basic handling of a Kaggle dataset. If you already know the basics and want to jump straight to the problem-solving part, visit the predictive modelling problems. The predictive modelling section will introduce you to some challenging problems from Kaggle: selection of algorithms and frequently used feature engineering concepts that will give you a wide range of choices for attacking a problem. Here is a guide to all the 12 tutorials:

• RStudio and GUIs
• Installing R
• Installing and loading important packages in R (ggplot2, party, Hmisc, car, MASS, plyr)
• Running R: 32-bit vs 64-bit
• Useful datasets
• Creating Kaggle account (for practice)
• Training and test datasets
• GitHub basics
• Some commonly used terms
• Shortcuts (keyboard and others)

### Techniques at a glance

• Examples of a few regression techniques
• Examples of a few classification techniques

### Approach to Predictive Analytics

• Difference with other forms of analytics
• Emphasis on large data sets
• Types of predictive analytics problems
• Challenges in predictive analytics
1. 5 Steps for Mastering Data Analytics
2. How to start data exploration in R
• Setting current directory
• Working with datasets
• Summary view of data
• Finding missing data
• Replacing missing data
• Modifying variables like date etc.
• Combining and separating data sets
• Handling factor values
• Basic rules of probability
• Expected value
• Sample and population quantities
• Signal and noise
• Probability density and mass functions
• Variability and distribution
• Statistical meaning of overfitting
3. Success criteria for a predictive model
4. Understanding datasets using visualization in R

### Predictive analytics tutorials

1. Slicing and Dicing the data with R
2. Understanding the models in R
3. Starting with feature engineering
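
Several of the data-exploration steps listed above (setting the current directory, working with a dataset, getting a summary view, finding and replacing missing data, handling factor values) can be sketched in a few lines of R. This is only an illustrative sketch, assuming the Titanic `train.csv` from Kaggle has been saved to a hypothetical `~/kaggle/titanic` folder; median imputation is just one simple choice for filling in the missing ages.

```r
# Assumed path: adjust to wherever you saved the Kaggle files
setwd("~/kaggle/titanic")            # setting the current directory
train <- read.csv("train.csv")       # working with the dataset

str(train)                           # structure: variables and their types
summary(train)                       # summary view of the data

colSums(is.na(train))                # finding missing data, column by column
train$Age[is.na(train$Age)] <-       # replacing missing ages with the median
  median(train$Age, na.rm = TRUE)

train$Survived <- factor(train$Survived)  # handling factor values
```

Each of these one-liners is covered in depth in the tutorials above.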

Jan 22

## What is Business Analytics: A Hiring Manager’s Perspective

As per Wikipedia, the term Business Analytics is defined as “the skills, technologies, practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning.”

Since this definition is a broad one, we looked into job postings from a number of industries, analyzing over 100 postings from companies to shortlist some common attributes of a Business Analytics professional (which includes titles such as Business Analytics Manager, Business Analytics Consultant, Sr. Analyst - Business Analytics, etc.). Based on these, a business analytics professional should have the following attributes:

Transform business questions into fact based analysis that delivers clear insights and actionable recommendations

This is one of the most important attributes of a senior business analytics professional, who serves as a bridge between the business and the analytics department. Business managers face many problems that may have a solution hidden in data. A business analytics professional should be able to understand those problems and turn them into fact-based analyses that deliver clear insights and actionable recommendations.

Recommend appropriate performance measures to be produced including lifts, efficiency, confidence intervals, and other statistical metrics.

Once the business questions have been identified, the next job is to think of the appropriate metrics to track (like ROI for sales or attrition rate for HR) so that any change in performance can be measured, and to come up with suggestions on how much improvement can be achieved in a certain time horizon based on historical or industry data.

Analyze and process data, build and maintain models and report templates, and develop dynamic, data-driven solutions.

This is the most challenging and technical part of an analytics manager’s job. They are expected to create models based on data that can justify taking certain measures to bring about a positive change. For example, in order to check high attrition rates a manager may have to find ways to collect new data points like the average compensation of people who are leaving the organization, their experience level and their average stay in the company. They then have to show how some of these factors contribute to the metric being targeted (that is, attrition). They may come up with suggestions on what measures could lead to a positive change, based on industry data or external research reports.

Provide business clients with detailed, actionable reports documenting the findings from data processing and data analysis.

Different functions in an organization often track different metrics. Sales, marketing, operations and HR may each have their own set of objectives, and consequently their own metrics to track. It’s therefore critical for business analytics professionals to have a bird’s-eye view of all these functions and cater to their business needs (which can at times be conflicting) in a balanced manner. In larger organizations there can be multiple analytics teams across the functions crunching data for them. In that case it’s important for the Business Analytics manager to partner with cross-functional analytics teams to develop a well-rounded perspective and bring insights together to tell a cohesive story to senior leadership and drive strategic decisions.

Consult on using business intelligence data for predictive analytics and facilitate implementation of new tools and data marts.

Last but not least, the job of a business analytics manager is to always look ahead, in terms of both technology and the changing business environment, to ensure smarter decision making for the organization. They may often have to take cognizance of the changes surrounding them and advise the leadership on an appropriate strategy for adopting new tools, data sources and so on.

As we just observed, the duties of a business analytics professional often transcend functional or departmental boundaries. Companies also hire junior-level analytics professionals who mostly focus on data from one business group or function. Instead of looking at the organization-wide picture, they are often concerned with finding patterns in particular data sets and conveying their analysis to senior professionals. For example, an analyst in the marketing department may have to analyze market and account planning data and market intelligence data, draw the right insights and communicate them to the managers. They also need to work closely with business and function leaders to scope the strategic focus, targets and metrics for the group or unit they serve.

Jan 20

## Important Mistakes Most Managers Make with Analytics

There is a lot of hype surrounding data and analytics. Firms are constantly exhorted to set strategies in place to collect and analyze big data, and warned about the potential negative consequences of not doing so. For example, the Wall Street Journal recently suggested that companies sit on a treasure trove of customer data but for the most part do not know how to use it. In this article we explore why.