Category Archives for "Statistics"

Jan 06

How to Differentiate Between Causation and Correlation

By ganpati | Data Analysis , Statistics

Many a time we need fast answers with limited resources. We need to prove something for which we don't have enough evidence, or, as the case may be, enough data. Under such circumstances it is easy to fall into the trap of interpreting correlation as causation. To understand the difference between causation and correlation, it's important to look at how completely unrelated data points can show spurious correlation.

Once you do that, you will see that two very unrelated data sets, such as per capita cheese consumption and the number of people who died by getting entangled in their bed sheets, can be correlated. So can the number of suicides by strangling and suffocation and US spending on science, space and technology. It is obvious from these examples that one of these factors can't be the cause of the other; so even though there is a visible correlation, causation is very unlikely.

How do we establish a cause-effect (causal) relationship?

Once we understand that correlation and causation are different, we can explore what criteria should be met to establish a causal relationship. Generally, three criteria must be satisfied before we can say that there is evidence for a causal relationship:

Temporal Precedence: First, you have to be able to show that your cause happened before your effect.

Covariation of the Cause and Effect: If you observe that whenever X is present, Y is also present, and whenever X is absent, Y is too, then you have demonstrated that there is a relationship between X and Y. We can quantify this using a correlation coefficient.

Rule out Plausible Alternative Explanations: Just because you show there's a relationship doesn't mean it's a causal one. It's possible that some other variable or factor is causing the outcome. This is sometimes referred to as the "third variable" or "missing variable" problem, and it's at the heart of the issue of internal validity.

So, in order to argue that there is internal validity, and hence a causal relationship, we have to rule out the plausible alternative explanations.
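The covariation criterion above can be checked numerically. A minimal sketch in R, using made-up vectors (the variable names and data here are hypothetical):

```r
# hypothetical measurements of a suspected cause (x) and effect (y)
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1)  # roughly 2 * x

# Pearson correlation coefficient: close to 1 or -1 means strong
# linear covariation; close to 0 means little or none
r <- cor(x, y)
round(r, 2)
```

A high value of r here shows covariation only; temporal precedence and the absence of third variables still have to be argued separately.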

Dec 27

Statistical Experiments for Data Scientists

By ganpati | Statistics

Designing and interpreting experiments to understand customer behavior, sales, profitability, advertisement effectiveness etc. is part and parcel of a data scientist's job. Here I'll illustrate with a simple example how statistical experiments can be used in a business context.


Observation: Cab cancellation in city A is 150% higher in the month of November than the average cab cancellation over the year.

Hypothesis: The higher cancellation is due to a seasonal factor affecting cancellations in November.

Experiment: Calculate average cancellation rates during the year and during the month of November, and test whether the difference is statistically significant.

Hypothesis Testing: Based on the results of this experiment the Data Scientist can determine whether or not the difference is significant enough to warrant increased attention and focus from the management.
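As a sketch of how such a test could be run in R, assume we have counts of cancelled and total bookings for November and for the rest of the year (the numbers below are invented for illustration); a two-sample proportion test then tells us whether the November rate is significantly higher:

```r
# hypothetical booking counts for city A
nov_cancelled  <- 300;  nov_total  <- 2000    # November
rest_cancelled <- 900;  rest_total <- 22000   # rest of the year

# two-sample test of equal cancellation proportions,
# alternative: November proportion is greater
res <- prop.test(c(nov_cancelled, rest_cancelled),
                 c(nov_total, rest_total),
                 alternative = "greater")

res$p.value  # a small p-value suggests the November rate really is higher
```

With real data, the cancelled/total counts would come from the booking logs rather than being typed in by hand.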

Nov 16

Part 2 - Basic Statistics and Data Crunching

By ganpati | Statistics

Statistics using summary

To start with, we will pick some fields that intuitively appear to be good predictors of income. In a later part of the analysis we'll validate those assumptions through various techniques and refine them. We'll start with the occupation code, as it appears to be a good predictor of income.

summary (train$occupation_code)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.00 0.00 5.00 11.91 23.00 46.00

We can see that occupation_code takes a range of values from 0 to 46. If we arrange all the 10,000 occupation values in ascending order, the 2,500th value will be the 1st quartile (0), the 5,000th value will be the median (5), and the 7,500th value will be the 3rd quartile (23).
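The quartile positions described above can be sketched by sorting and indexing directly (R's quantile function interpolates slightly, but the idea is the same); the shuffled vector here is just a stand-in for the real data:

```r
# stand-in for 10,000 values, in random order
x  <- sample(1:10000)
xs <- sort(x)

xs[2500]   # 1st quartile position
xs[5000]   # median position
xs[7500]   # 3rd quartile position
```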

Out of these different occupations, surely a few would be much more rewarding in terms of income than the others. We can find that out by digging a little deeper with slicing and dicing.

However, the ranking of occupation codes doesn't really matter in this case, because occupation_code is a nominal variable. Let's look at the types of variables in detail before analyzing the occupation codes.

Cardinal, ordinal and nominal variables

To put it in the simplest terms, cardinal numbers represent a count, ordinal numbers represent a rank, and nominal numbers are just used to identify things like gender, months, religion etc.

The occupation_code in this case is nominal data. The value of occupation_code doesn't represent any count or ranking; for example, we can't say occupation_code 2 is better than occupation_code 3. Similarly, industry_code, sex etc. are nominal. Age, on the other hand, is a cardinal number. Education can be considered ordinal, as it ranks the stages of qualification if used in the correct order.

By understanding cardinal, ordinal and nominal variables, we can decide which functions will work better while analyzing them. For example, functions based on classification work well with nominal variables, while functions that use counts, proportions, averages etc. work with cardinal variables.

Finding average age of persons with more than 50000 income makes sense but finding average occupation_code values of people with more than 50000 income doesn’t.
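In R, the practical consequence is to store nominal variables as factors, so that numeric summaries aren't accidentally applied to them. A minimal sketch with a made-up data frame (the column names mirror the ones used in this post, but the values are invented):

```r
# made-up sample in the shape of the training data
train_sample <- data.frame(
  occupation_code = c(2, 3, 2, 5),
  age             = c(30, 45, 52, 28)
)

# nominal variable: treat the codes as labels, not numbers
train_sample$occupation_code <- as.factor(train_sample$occupation_code)

mean(train_sample$age)                    # meaningful: age is cardinal
is.factor(train_sample$occupation_code)   # TRUE: mean() would now refuse it
```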

Slicing and dicing the data with Table function

Now, to find out those occupation codes that help people earn more than 50K, let's start with the table function:

The table function gives us a picture like the one below, breaking up each occupation code into the number of values under and above the 50,000 income level:
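The expression producing that table, sketched here with a toy stand-in for the train data frame (the real data has many more rows and occupation codes):

```r
# toy stand-in for the training data
train <- data.frame(
  occupation_code = c(1, 1, 2, 2, 2, 3),
  Income          = c("50000+", "-50000", "50000+", "50000+", "-50000", "-50000")
)

# rows = occupation codes, columns = income levels, cells = counts
table(train$occupation_code, train$Income)
```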

Though the table function tells us, for each occupation, exactly how many people are under and above the 50,000 income level, we are better off with a proportional representation. The function prop.table does the job if we use a nested expression like the one below. Nesting means writing one function within the parentheses of another, so that the outer function is applied to the values returned by the inner one. We'll see more use of nesting as we proceed with the tutorials.

In this case we apply prop.table to the values returned by the table function, to show the values in the table as proportions.

Proportions across rows and columns

prop.table(table(train$occupation_code, train$Income))

This gives us a list like this (I’ve just shown the top 19 values of the list):

But that's still not very useful: as you can see, the function has provided proportions over all cells of the table. We have to tell the function that we need the proportions of 50000+ and -50000 within each row, so that for each occupation code we can see whether the majority of people make more or less than 50000. So let's add 1 as the last argument to obtain row-wise proportions.

Tip: in many R functions (such as prop.table and apply), 1 is used to specify rows and 2 to specify columns.

prop.table(table(train$occupation_code, train$Income), 1)

We get a row-wise proportion table like the one below, where the values in each row sum to 1:

Now, that's something interesting. It clearly shows that for occupations 0, 29, 30, 31, 32, 40, 41 and 44, at least 80% of people earn -50000, and for 1, 2, 4, 5, 6, 7, 9, 11, 15, 17 and 18, at least 80% earn 50000+. Wow, that's a lot of occupations that can make one rich! The only rows that are not very conclusive in predicting income (because the distributions of 50000+ and -50000 are close to 50% each) are 10, 13, 22, 34, 38 and 43. For the rest of the occupations we can predict which side of 50000 that occupation leads people to.

We can also find column-wise proportions using the expression prop.table(table(train$occupation_code, train$Income), 2)

For each column, the proportions will sum to 1, but that's not going to add much value to our analysis. Our goal is to find occupation codes leading to more than 50000 income, and column proportions will not reveal much about that. However, if we wanted to find the occupation codes where most 50000+ earners lie, column proportions would be useful. We can illustrate the difference between column and row proportions with a simple example with 100 income earners, as below:

High income (I1) Low Income (I2)
Occupation code1 (o1) 40 (o1I1) 40 (o1I2)
Occupation code2 (o2) 16 (o2I1) 4 (o2I2)

Now if we find column proportions, o1I1 has 72% and o2I1 has 28% of the high-income population. But on the basis of this observation we can't say that o1 is a high-income occupation, because o1 also has 91% of the low-income population, as you can see in the table below:

Column-wise distribution

High income (I1) Low Income (I2)
Occupation code1 (o1) 72% (o1I1) 91% (o1I2)
Occupation code2 (o2) 28% (o2I1) 9% (o2I2)

But if we find row-wise proportions, o2I1 will have 80% and o2I2 will have 20% of the population for that occupation, while o1I1 and o1I2 will have 50% each for o1. Based on this we can correctly identify occupation 2 as a high-income occupation, because 80% of o2's people earn a high income.

Row-wise distribution

High income (I1) Low Income (I2)
Occupation code1 (o1) 50% (o1I1) 50% (o1I2)
Occupation code2 (o2) 80% (o2I1) 20% (o2I2)
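Both tables above can be reproduced directly with prop.table and its margin argument, using the counts from the 100-earner example (note that the 72%/28% figures in the text are rounded; the exact shares are 40/56 and 16/56):

```r
# counts from the example: rows = occupations, cols = income levels
m <- matrix(c(40, 16,    # high income (I1)
              40, 4),    # low income  (I2)
            nrow = 2,
            dimnames = list(c("o1", "o2"), c("I1", "I2")))

round(prop.table(m, 2), 2)  # column-wise: each occupation's share of an income group
round(prop.table(m, 1), 2)  # row-wise: income split within each occupation
```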

How to select relevant values from a table (using table index)

In order to extract the relevant values from the table we just created, we need to find the index of element(s) in the table. Now we'll use another function to find the index of the rows for which at least 80% of people earn more than 50000. Since the occupation codes are integers in serial order, we can find the occupation code just by finding the index of the row where the proportion of 50000+ is greater than 80%. We can achieve that in two ways. The first technique is to use the which function, which can be called in two ways:

Syntax 1

which( vector <condition> value)

syntax 2

which( vector %in% values)

A vector is nothing but a set of values. In this case, all the values in a particular column of the table constitute a vector.

Let's try out syntax 1 first:

which(prop.table(table(train$occupation_code, train$Income), 1)[,2]>0.8)

Here we call the which function, passing it all the proportions in the second column of the table (specified by [,2]) obtained earlier, with the condition "greater than 0.8". This gives us the list of occupation codes where more than 80% of people earn more than 50000 (as the second column is for 50000+ income):

The values in the second row of the output are the occupation codes with more than 80% high-income population. But this representation is not very neat. Moreover, we may need to go a step further and use functions other than proportions to get a better picture. For that we will use the aggregate function, discussed later in this post.
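On the small two-occupation example from earlier, the same which pattern looks like this (note the >= rather than >, since o2 sits at exactly 80% in that toy data):

```r
# counts from the earlier example: rows = occupations, cols = income levels
m <- matrix(c(40, 16, 40, 4), nrow = 2,
            dimnames = list(c("o1", "o2"), c("I1", "I2")))

# row-wise proportions, then select rows where the high-income share >= 80%
high_share <- prop.table(m, 1)[, "I1"]
which(high_share >= 0.8)   # named index pointing at o2
```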