Category Archives for "Data Analysis"

Jan 06

How to Differentiate Between Causation and Correlation

By ganpati | Data Analysis , Statistics

Often we need fast answers with limited resources. We need to prove something for which we don’t have enough evidence or, as the case may be, enough data. Under such circumstances it’s easy to fall into the trap of interpreting correlation as causation. To understand the difference between causation and correlation, it’s important to look at how completely unrelated data points can show spurious correlation.

Once you do that, you will see that two entirely unrelated data sets, such as per capita cheese consumption and the number of people who died by getting entangled in their bed sheets, can be correlated. So can the number of suicides by strangling and suffocation and US spending on science, space and technology. These examples make it obvious that one of these factors can’t be the cause of the other. Hence, even though there is a visible correlation, causation is very unlikely.

How do we establish a cause-effect (causal) relationship?

Once we understand that correlation and causation are different, we can ask what criteria must be met to establish a causal relationship. Generally, three criteria must be satisfied before we can say that there is evidence for a causal relationship:

Temporal Precedence: First, you have to be able to show that your cause happened before your effect.

Covariation of the Cause and Effect: If you observe that whenever X is present, Y is also present, and whenever X is absent, Y is too, then you have demonstrated that there is a relationship between X and Y. The strength of this relationship can be measured with a correlation coefficient.

Rule Out Plausible Alternative Explanations: Just because you show there’s a relationship doesn’t mean it’s a causal one. It’s possible that there is some other variable or factor that is causing the outcome. This is sometimes referred to as the “third variable” or “missing variable” problem and it’s at the heart of the issue of internal validity.

So, in order to argue that there is internal validity — and that there’s a causal relationship — we have to “rule out” the plausible alternative explanations.
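The covariation criterion can be checked numerically. Below is a minimal sketch in Python, assuming two invented series (they are not the cheese or bed-sheet figures mentioned above). Both series simply drift upward over time, so the correlation coefficient comes out close to 1 even though neither one causes the other. The function name pearson_r is chosen here for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented yearly values for illustration only: both rise over time.
series_a = [10, 11, 13, 14, 16, 17, 19, 20]
series_b = [205, 230, 228, 260, 270, 300, 310, 335]

r = pearson_r(series_a, series_b)
print(f"r = {r:.2f}")  # a value near 1 shows covariation, not causation
```

A high r satisfies only the second criterion. Temporal precedence and the ruling out of alternative explanations still have to be argued separately.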

Oct 23

An Analysis of eCommerce Businesses

By ganpati | Data Analysis

RJMetrics looked at the revenue breakdown between the largest and smallest of these sites to try to figure out the industry landscape. Based on our dataset, ecommerce clearly breaks down into three distinct groups.

The largest ecommerce sites on the internet make up about 1% of the total population and generate 34% of the total revenue.
A distinct middle tier of ecommerce sites makes up 51% of the total population and generates 63% of the total revenue.
Small ecommerce sites make up 48% of the total population and generate 3% of the total revenue.


Aug 22

Making Predictions Using Time Series Models: A Comprehensive Tutorial

By ganpati | Data Analysis

To explain time series analysis intuitively (and with as few mathematical symbols as possible), the data set I prefer is quarterly beer sales. Below is a plot of beer sales in the USA for consecutive three-month periods.

What does this plot tell us? Just by looking at these numbers, can we predict the next number, or the next few numbers, in the series? Perhaps yes! The question is which statistical method to pick and with what degree of confidence we can predict those numbers. The statistical method we will use is time series analysis. We’ll start with the definition of a time series:

Definition: Time series data are a collection of ordered observations, each recorded at a specific time, for instance hourly, monthly, or yearly. Most often, the observations are made at regular time intervals.

As in most other analyses, time series analysis assumes that the data consist of a systematic pattern (usually a set of identifiable components) and random noise (error), which usually makes the pattern difficult to identify. The purpose of this article is to show, through examples, how to make that pattern more prominent by filtering out the noise.

Time series analysis also accounts for the fact that data points taken over time may have an internal structure, such as autocorrelation, trend or seasonal variation.

Components of a Time Series

Any time series can contain some or all of the following components:
1. Trend (T)
2. Cyclical (C)
3. Seasonal (S)
4. Irregular (I)

Now, let’s again have a look at our diagram to understand what these 4 components mean.

Trend component: The trend is the long term pattern of a time series. A trend can be positive or negative depending on whether the time series exhibits an increasing long term pattern or a decreasing long term pattern. If a time series does not show an increasing or decreasing pattern then the series is stationary in the mean.

Cyclical component: Any pattern showing an up and down movement around a given trend is identified as a cyclical pattern. The duration of a cycle depends on the type of business or industry being analyzed.

Seasonal component: Seasonality occurs when the time series exhibits regular fluctuations during the same month (or months) every year, or during the same quarter every year. For instance, retail sales peak during the month of December (and beer sales in summer!).

Irregular component: This component is unpredictable. Every time series has some unpredictable component that makes it a random variable. In prediction, the objective is to “model” all the components to the point that the only component that remains unexplained is the random component.

Now that we know our time series has all four components, what can we infer about making predictions with this data (the beer sales data points)?

Three things:

1. To make the predictions mathematically, we will eliminate the cyclical factor of the data and simplify the other parts so that we can arrive at the predicted values with just a few additions, multiplications and divisions.
2. We can draw a trend line through the data and make a prediction for each quarter.
3. We then find out how much the actual data at each point deviates from the value predicted by the trend line. This deviation of the actual value from the trend line at each point will be explained by the seasonal and irregular components.
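Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, assuming an ordinary least-squares line and a short invented quarterly series (not the beer sales data); the names fit_trend, sales and deviations are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through (1, y1), (2, y2), ...; returns (slope, intercept)."""
    n = len(values)
    ts = range(1, n + 1)
    mean_t = sum(ts) / n
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, values))
    sxx = sum((t - mean_t) ** 2 for t in ts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept

# Invented quarterly sales for illustration: upward trend plus a summer bump.
sales = [20, 30, 26, 22, 24, 34, 30, 26, 28, 38, 34, 30]

slope, intercept = fit_trend(sales)

# Step 2: value of the trend line at each observed quarter.
trend = [intercept + slope * t for t in range(1, len(sales) + 1)]

# Step 3: deviation of each actual value from the trend line.
deviations = [y - f for y, f in zip(sales, trend)]

# Trend-only prediction for the next quarter (no seasonal adjustment yet).
next_quarter = intercept + slope * (len(sales) + 1)
print(round(slope, 3), round(next_quarter, 2))
```

The deviations list is what the seasonal and irregular components have to explain. Second-quarter entries are positive in this toy series because of the built-in summer bump.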

Starting with regression

We already have the data for 48 quarters in the series. Now, let’s say we want to predict the amount of sales in the 49th quarter. Our regression line says the value will be somewhere near the red dotted point on the line.


The seasonality in the data arises because each quarter follows a similar pattern from year to year. Beer sales go up in summer, go down in winter, and stay somewhere in the middle in the other quarters. With the help of moving averages, we can compute a seasonal index for each quarter that will help us predict the numbers for future quarters.
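One common way to get such indices is the ratio-to-moving-average approach. The sketch below assumes that method and an invented quarterly series (the post does not say which variant it uses, and these are not the beer sales figures). The function names are hypothetical.

```python
def centered_moving_average(values, period=4):
    """Centered moving average for an even period; None where it is undefined."""
    half = period // 2
    cma = [None] * len(values)
    for i in range(half, len(values) - half):
        first = sum(values[i - half:i + half]) / period
        second = sum(values[i - half + 1:i + half + 1]) / period
        cma[i] = (first + second) / 2
    return cma

def seasonal_indices(values, period=4):
    """Average ratio of actual value to centered moving average, per quarter."""
    cma = centered_moving_average(values, period)
    ratios = [[] for _ in range(period)]
    for i, (y, m) in enumerate(zip(values, cma)):
        if m is not None:
            ratios[i % period].append(y / m)
    raw = [sum(r) / len(r) for r in ratios]
    scale = period / sum(raw)  # normalise so the indices average to 1
    return [x * scale for x in raw]

# Invented quarterly sales for illustration: the second quarter is the peak.
sales = [20, 30, 26, 22, 24, 34, 30, 26, 28, 38, 34, 30]

indices = seasonal_indices(sales)
print([round(x, 3) for x in indices])
```

An index above 1 marks a quarter that typically runs above the moving average, and an index below 1 marks a quarter that runs below it. Multiplying a trend-line forecast by the matching index gives a seasonally adjusted prediction.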

To answer that question we need to understand a few more statistical aspects of the series.

Now that we have understood how to break down the time series for making predictions, we’ll move on to another important idea: how do we account for interrelations in data recorded at different points in time?

Assigning Weights to the Observations

1. The “simple” average or mean of all past observations is only a useful estimate for forecasting when there are no trends. If there are trends, use different estimates that take the trend into account.
2. The average “weighs” all past observations equally. For example, the average of the values 3, 4, 5 is 4. We know, of course, that an average is computed by adding all the values and dividing the sum by the number of values. Another way of computing the average is by adding each value divided by the number of values, or
3/3 + 4/3 + 5/3 = 1 + 1.3333 + 1.6667 = 4.
The multiplier 1/3 is called the weight.
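The same arithmetic can be written as a weighted sum. This is a small sketch: the equal weights reproduce the example above, while the unequal weights are hypothetical values chosen here to show how recent observations could be emphasised.

```python
def weighted_average(values, weights):
    """Sum of value * weight; the weights are expected to sum to 1."""
    return sum(v * w for v, w in zip(values, weights))

values = [3, 4, 5]

# Equal weights of 1/3 give the ordinary mean.
equal_weights = [1 / 3, 1 / 3, 1 / 3]
simple_mean = weighted_average(values, equal_weights)

# Hypothetical unequal weights that favour the most recent observation.
recent_weights = [0.2, 0.3, 0.5]
weighted_mean = weighted_average(values, recent_weights)

print(round(simple_mean, 4), round(weighted_mean, 4))
```

Changing the weights while keeping their sum at 1 is the basic idea behind the smoothing methods that weight recent observations more heavily.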

The interesting thing about time series analysis is that it challenges you to think about the phenomena that caused the pattern in the data. It also helps you to forecast and monitor.
To do that, we identify the pattern in the observed time series data and describe it mathematically. We’ll discuss the mathematics later. Let’s first have a look at the applications. Time series analysis is used mainly in finance, economics and operations management, as well as in other fields. The applications are:
– Economic Forecasting
– Sales Forecasting
– Budgetary Analysis
– Stock Market Forecasts
– Yield Projections
– Process and Quality Control
– Inventory Studies
– Workload Projections
– Utility Studies
– Census Analysis
– Seismological Predictions

Source: NIST/SEMATECH e-Handbook of Statistical Methods.
Before getting into any theory, have a look at this graph, which depicts the time series plot of one very important measure: total retail sales in the United States.

Note that data are in millions of dollars, not adjusted for inflation and not seasonally adjusted (“NSA”). The chart and data were obtained from, a highly recommended source for economic data.

Now the math

Important Definitions

There are a few other terms, like auto-regression and moving averages, that will be used frequently in this post.
Don’t be intimidated by these terms. Let me try to simplify them.
Auto-regression (AR): The idea behind auto-regression is that the value of a variable x(t) can be expressed as a linear function of its past values. For example, if the total ice-cream sales for a brand in any month are related to its ice-cream sales in the previous month, the series is autoregressive.
The order of the AR model tells how many lagged past values are included. The simplest AR model is the first-order autoregressive, or AR(1), model. In our case,
Sales of ice-cream in May = k * Sales of ice-cream in April + error(t)
is an AR(1) model because sales depend only on the sales from the previous month.
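To make the AR(1) idea concrete, here is a minimal sketch that simulates a series of the form x(t) = k * x(t-1) + error(t) and then recovers k from the data by least squares. The value k = 0.7, the normally distributed errors, the series length and the function names are all assumptions made for illustration; this is not ice-cream sales data.

```python
import random

def simulate_ar1(k, n, seed=0):
    """Generate n values of x(t) = k * x(t-1) + error(t), starting from 0."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(k * x[-1] + rng.gauss(0, 1))
    return x

def estimate_k(x):
    """Least-squares estimate of k in x(t) = k * x(t-1) + error(t), no intercept."""
    num = sum(cur * prev for prev, cur in zip(x, x[1:]))
    den = sum(prev * prev for prev in x[:-1])
    return num / den

series = simulate_ar1(k=0.7, n=2000)
k_hat = estimate_k(series)
print(round(k_hat, 3))  # should land close to the k used in the simulation
```

A one-step-ahead forecast from such a model is simply k_hat times the latest observed value.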

There can be many underlying reasons for auto-regression. Last month’s sales were probably driven by factors such as easy availability and the weather conditions in the area. Some impact of these underlying factors will persist over time and affect later values of the variable. Hence, we have auto-regression.
What are Smoothing Techniques?
Most data collected over time have some form of random variation. Smoothing techniques reduce or cancel the effect of this random variation. “Smoothing” the data reveals the underlying trend and cyclic components, and it is frequently used in industry.
There are two distinct groups of smoothing methods:
– Averaging Methods
– Exponential Smoothing Methods
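A minimal sketch of one method from each group follows: a simple moving average and single exponential smoothing. The input values, the window of 3 and the smoothing constant alpha = 0.5 are assumptions picked for illustration.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

def exponential_smoothing(values, alpha):
    """Single exponential smoothing, seeded with the first observation."""
    smoothed = [values[0]]
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

# Invented noisy observations for illustration.
noisy = [10, 12, 9, 13, 11, 14, 10, 15]

ma3 = moving_average(noisy, 3)
ses = exponential_smoothing(noisy, alpha=0.5)
print([round(v, 2) for v in ma3])
print([round(v, 2) for v in ses])
```

The moving average weights the last few observations equally and ignores the rest, while exponential smoothing gives every past observation a weight that shrinks the further back it lies.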

The Natural Ordering of Observations

Time series data have a natural temporal ordering. This makes time series analysis distinct from cross-sectional studies, in which there is no natural ordering of the observations (e.g. explaining people’s wages by reference to their respective education levels, where the individuals’ data could be entered in any order). Time series analysis is also distinct from spatial data analysis where the observations typically relate to geographical locations (e.g. accounting for house prices by the location as well as the intrinsic characteristics of the houses). A stochastic model for a time series will generally reflect the fact that observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time so that values for a given period will be expressed as deriving in some way from past values, rather than from future values.
