Category Archives for "Uncategorized"

Nov 16

Analyzing sales through visualization

By ganpati | Uncategorized

Our aim is to close the gap between data and decision makers by developing a thriving community of analytics. Before ending the post I have two requests for the readers:

  1. If you have come across an interesting piece of R code that others can learn from, please post it in the comments.
  2. If a fellow community member asks about a concept or a piece of R code, please take the initiative to help them.

In this post we will analyze the sales for a retail chain with the help of machine learning principles. We will use visualization techniques to understand the data. Once you grasp these techniques you can apply them to many other problems. I’ll start with simple techniques (like the plot function) and slowly move towards more complex techniques for analyzing the data. Feel free to suggest scenarios that can be added to the post, and ask any question you have about visualizing a data science problem.

I’ll first introduce you to a new function, attach, that lets you drop the $ notation. The attach function is a convenient way to reference data columns by name.
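Here is a minimal sketch of what attach does, using a made-up data frame (the name sales and its values are only for illustration). It also shows with(), a base R alternative that gives the same convenience without changing the search path.

```r
# Toy data frame standing in for the sales data (values are made up)
sales <- data.frame(Store = c(1, 1, 2),
                    Weekly_Sales = c(100, 150, 90))

# Without attach: columns are referenced with the $ notation
mean_with_dollar <- mean(sales$Weekly_Sales)

# With attach: columns can be referenced by name alone
attach(sales)
mean_with_attach <- mean(Weekly_Sales)
detach(sales)  # detach when done so the names do not linger on the search path

# with() evaluates an expression inside the data frame without attaching it
mean_with_with <- with(sales, mean(Weekly_Sales))
```

Remember to detach once you are done, otherwise the attached column names can silently mask other objects with the same names.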

We’ll start with our usual way of reading the test and train data, which we’ll later combine into a single data object.

train <- read.csv("train.csv")

test <- read.csv("test.csv")

Inspecting the train data (for example with str(train)) shows that there are 5 columns in the data. We’ll have to predict the values of the Weekly_Sales column in the test data.

To start with our graphical analysis we’ll first remove the weekly sales column values from the train data, as if we don’t know the values for weekly sales to begin with, and combine the test and train data to observe the pattern within the various variables.

Once we observe patterns that we think will help us predict the values of Weekly_Sales, we will add the Weekly_Sales values back to the data and start working on a model.

# we’ll start by including the library ggplot2 that we’ve installed earlier

library (ggplot2)

Now, let’s convert the Date field to character and extract the year and month values (the positions used below assume dates of the form YYYY-MM-DD)

train$Date <- as.character (train$Date)

train$year <- substr(train$Date, 1, 4)

train$month <- substr(train$Date, 6, 7)
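As a small check of the character positions, here is a sketch on a few made-up dates, assuming the YYYY-MM-DD layout. It also shows the same extraction through the Date class, which does not depend on counting characters.

```r
# Toy dates, assumed to be in "YYYY-MM-DD" form
dates <- c("2010-02-05", "2011-11-25", "2012-12-28")

# Characters 1-4 hold the year, 6-7 the month, 9-10 the day
year  <- substr(dates, 1, 4)
month <- substr(dates, 6, 7)
day   <- substr(dates, 9, 10)

# The same month via the Date class, independent of character positions
parsed <- as.Date(dates, format = "%Y-%m-%d")
month_from_date <- format(parsed, "%m")
```

Both routes return the parts as character strings, so month behaves as a categorical variable when plotted.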

Let’s start with a simple plot where we’ll see how the sales trend is over the months

qplot(month, Weekly_Sales, data= train)

This gives us a graph like below:

What does the graph tell us? It doesn’t convey anything meaningful because of overplotting.

It might appear from the chart that sales are almost constant from January to October and jump during November and December. But this is just an illusion created by overlapping data points, which we shall show by making the dots more transparent. Even at a very high level we can’t conclude anything from the graph, because the top of each column of points is set by the store with the maximum sales. The median value of the weekly sales may lie somewhere else, and it is the median that gives us a fair idea of how sales vary by month. We’ll demonstrate it through boxplots in the later part of this post.

In order to understand the plots in detail we have to understand which geometrical shape to use for each graph under different conditions, and how scatterplots or bubble plots differ from line graphs.

A scatterplot represents each observation as a point, positioned according to the two variables plotted on the x and y axes. The number of points and their relative positions help us decide which plot can give us a better representation of the data. In the example above, for a single value of x (month) there can be multiple values of y (store sales). So in this case we might need a graph that gives us a combined picture of all y-axis values for a particular x-axis value. On the other hand, if the x axis held a continuous variable and there were just one y-axis value for each x-axis value, a line (or curve) would have been sufficient to visualize the data. There may also be cases where the x axis represents a categorical variable or a variable with a certain range of values. In that case a bar chart could be the best representation of the data. Bar charts are also useful when we are looking at the values of a single variable and examining its frequency in different x-axis intervals. Now, let’s get back to the current problem and check how different sets of variables can be graphically represented to facilitate our analysis.

We’ll use qplot with geom_smooth() to fit a smooth line through the middle of the data. After the data is mapped to aesthetics, it is passed to a statistical transformation (a stat), which manipulates the data in order to render a more comprehensible and meaningful graph. There are different types of stats, such as the loess smoother, one- and two-dimensional binning, group means, quantile regression and contouring.

So far we have dealt only with qplot, which allows only a single dataset and a single aesthetic mapping. What if we want to create more complex graphs with multiple aesthetic features? That’s where we will use ggplot. In this post I shall describe how to use layers, geoms, statistics and position adjustments to visualize large and complex data with ggplot. We will use layers to add multiple characteristics like statistics and position of the data. For example, the simplest layer just specifies a geom. Layers are added to ggplot by using a simple + sign like below:
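A minimal sketch of adding layers with +, on a made-up data frame (toy and its values are illustrative, not the real sales data):

```r
library(ggplot2)

# Toy data standing in for the sales data (values are made up)
toy <- data.frame(month = c("01", "01", "02", "02", "03", "03"),
                  Weekly_Sales = c(120, 80, 95, 140, 60, 110))

# ggplot() sets up the data and the default aesthetic mapping but draws nothing
p <- ggplot(data = toy, mapping = aes(x = month, y = Weekly_Sales))

# Each + adds one layer: a boxplot geom, then semi-transparent points on top
p_layers <- p +
  geom_boxplot() +
  geom_point(alpha = 0.5)

length(p_layers$layers)  # number of layers added so far
```

Printing p_layers draws the plot; the base object p on its own has no layers and so shows only empty axes.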

Layers and facets of data

The scatterplot uses points, but if we draw lines across these points we would get a line plot. If we use bars, we’d get a bar plot.

Points, lines and bars are all geometric objects and are termed as geoms in ggplot2. A plot may have single or multiple geoms that we shall discuss going forward.

So far we have been plotting the values based on the data points that we have. Now we will discuss a few concepts for representing the data after applying a scale transformation or a statistical transformation. Scale transformations are applied before statistical transformations, so that we bring some uniformity to the data points before applying statistics.

Let’s apply two types of point geoms – scatterplot and bubblechart to our data to analyze it further.

The graph that we plotted earlier was subject to overplotting because of the huge number of data points. Overplotting means plotting at the same point multiple times. So what if a point is drawn over 10 times? Visually we would only observe it once. In order to deal with this problem we use transparency. Once we make the points more transparent, the areas where multiple points are located become prominent compared to areas where single points are located.

The components of a layer are an optional mapping, a dataset, a geom or stat with its parameters, and a position adjustment. The examples below illustrate how layers are added in different scenarios:
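As one such scenario, here is a sketch of a single layer with every component named explicitly. The data frame is made up, and the choice of jittering and alpha = 0.3 is just for illustration.

```r
library(ggplot2)

# Toy data (values are made up)
toy <- data.frame(month = c("01", "01", "02", "02"),
                  Weekly_Sales = c(120, 80, 95, 140))

# One layer with each component named:
# mapping, data, stat, position, and a geom parameter (alpha)
pts <- geom_point(mapping = aes(x = month, y = Weekly_Sales),
                  data = toy,
                  stat = "identity",
                  position = "jitter",
                  alpha = 0.3)

# The layer carries its own data and mapping, so ggplot() can be left empty
p <- ggplot() + pts
```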


In this post I’ve explicitly named all arguments, but in actual code you may rely on positional matching. Still, it’s always advisable to name arguments in order to make the code more readable.

We will use the summary function for inspecting the structure of a plot to understand the information about the plot defaults, and then each layer.


Layers are regular R objects and hence we will store them as variables to keep our code clean and reduce duplication. We’ll reuse these layers with a set of different plots as and when applicable.
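A short sketch of storing a layer in a variable and reusing it, again on made-up data (the names toy and faint_points are hypothetical):

```r
library(ggplot2)

# Toy data (values are made up)
toy <- data.frame(month = c("01", "01", "02", "02"),
                  Weekly_Sales = c(120, 80, 95, 140),
                  Store = c("1", "2", "1", "2"))

# A layer stored in a variable, with no data of its own
faint_points <- geom_point(alpha = 0.3)

# The same layer reused in two different plots
by_month <- ggplot(toy, aes(x = month, y = Weekly_Sales)) + faint_points
by_store <- ggplot(toy, aes(x = Store, y = Weekly_Sales)) + faint_points

# summary() prints the plot defaults first, then each layer
summary(by_month)
```

Because the stored layer has no data or mapping of its own, it inherits both from whichever plot it is added to.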

An important point to keep in mind is that ggplot can only be used with data frames.

I’ve used the plyr or reshape packages to give proper shape to the data so that ggplot2 can focus on plotting it.

The aes function is used to describe how data is mapped in the plot. The aes function takes a list of aesthetic-variable pairs.

Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters. Aesthetics may vary for each observation being plotted, but parameters do not. We can map an aesthetic to a variable (e.g., (aes(colour = month))) or set it to a constant (e.g., colour = “red”).
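The difference between mapping and setting can be sketched as follows, on made-up data:

```r
library(ggplot2)

# Toy data (values are made up)
toy <- data.frame(month = c("01", "01", "02", "02"),
                  Weekly_Sales = c(120, 80, 95, 140))

p <- ggplot(toy, aes(x = month, y = Weekly_Sales))

# Mapping: colour varies with the month column, and a legend is drawn
mapped <- p + geom_point(aes(colour = month))

# Setting: every point is red, and no legend is needed
fixed <- p + geom_point(colour = "red")
```

A common slip is to write aes(colour = "red"), which maps colour to a constant string instead of setting it, so the points come out in a default palette colour with a legend entry labelled "red".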

Faceting is a global operation (i.e., it works on all layers), and it needs a base dataset that defines the set of facets for all datasets.

The aes function also takes a grouping variable. Once we specify store as the grouping variable we get a separate line for each store’s sales. If we leave it out we get a plot where the sales of all stores are lumped together, and it becomes impossible to reach any definite conclusion from those combined values.

This data contains categorical variables as well as continuous variables like Weekly_Sales, and you will probably be interested to know how the values of the continuous variables vary with the levels of a categorical variable. Here’s how we can plot continuous variables against categorical variables with the help of boxplots and jittered plots. Boxplots highlight two important characteristics: the spread of values and the centre of the distribution. Boxplots summarise the bulk of the distribution with only five numbers, while jittered plots show every point but can suffer from overplotting. We can reduce the effect of overplotting by using semi-transparent points, via the alpha argument.
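To see the five numbers behind a boxplot, here is a small base R sketch on made-up values:

```r
# Made-up weekly sales for one month
sales <- c(12, 15, 18, 20, 22, 25, 30, 95)

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(sales)

# boxplot.stats() gives the whisker ends and hinges that a boxplot draws,
# plus the points flagged as outliers (beyond 1.5 times the box length)
bs <- boxplot.stats(sales)
bs$stats
bs$out
```

Here the value 95 lies far beyond the upper hinge, so the boxplot would draw it as a separate outlier point and stop the upper whisker at 30.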

library(mgcv) # mgcv depends on the nlme package, so install nlme first

qplot(month, Weekly_Sales, data = train, geom = c("boxplot"))

qplot(month, Weekly_Sales, data = train, geom = c("jitter"))

qplot(month, Weekly_Sales, data = train, geom = c("jitter"), alpha = I(1 / 10))

qplot(month, Weekly_Sales, data = train, geom = c("jitter"), alpha = I(1 / 100))

So far we have been plotting two variables against each other to see how they relate and how one varies depending on the other. What if we want to plot a single variable to find out in which range most of its values lie? For example, if we plot weekly sales we will get a plot like below:

qplot(Weekly_Sales, data = train, geom = "histogram")

You will also see a message like below:

stat_bin: binwidth defaulted to range/30. Use ‘binwidth = x’ to adjust this.

We can set the values of binwidth and xlim to make the plots more meaningful to our analysis.

Let’s start with adjusting the binwidth (which is the width of the bars) to 1 to see the changes in the plot.

Setting the binwidth to 1 makes the bars almost invisible, which means the width of the bars should ideally be derived from the scale of the window. So, let’s adjust the x-axis limits and binwidth together to get a better representation of the histogram.

The maximum value of weekly sales can be found using

max(train$Weekly_Sales)

[1] 693099.4

xlim can be used to specify the x-axis limits for the plot. It takes two values, the minimum and maximum value of x, and the y axis is rescaled to fit the data that remains. As you can see, the x-axis values may go up to 700,000. Let’s scale that limit down by 100,000, 10,000 and 1,000 and set the upper limit of xlim to 7, 70 and 700.

qplot(Weekly_Sales, data = train, geom = "histogram", xlim = c(0, 7))

qplot(Weekly_Sales, data = train, geom = "histogram", xlim = c(0, 70))

qplot(Weekly_Sales, data = train, geom = "histogram", xlim = c(0, 700))

Now, if we set the binwidth to 1, we get something like below:

qplot(Weekly_Sales, data = train, geom = "histogram", binwidth = 1, xlim = c(0, 700))

The initial bin width was range/30 by default. By reducing it to 1 we get a lot of new peaks in the plot, as some unusual spikes are now displayed. That is because earlier, when the bin width was, say, 20, all values between 0 and 20, 20 and 40, etc. were counted together in a single bar. So even if many observations share one particular value within a bin, they are pooled with their neighbours and we get a smooth outline. As we decrease the bin width, the counts at individual values are plotted separately, and those extreme or unusual values affect our plot. We need to balance the xlim and binwidth values to get a smooth outline that represents the pattern. By setting the bin width to 7, that is 1/100th of the xlim range, we can clearly see the distribution of sales values.

qplot(Weekly_Sales, data = train, geom = "histogram", binwidth = 7, xlim = c(0, 700))
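The effect of bin width can be sketched in base R with hist(), which returns the bin counts when plot = FALSE. The values below are made up, with a deliberate cluster of repeated values at 100.

```r
# Made-up sales values with a cluster of repeated values at 100
sales <- c(5, 12, 18, 33, 47, 100, 100, 100, 100, 151, 180)

# Wide bins (width 50): each bar counts everything that falls inside it
wide <- hist(sales, breaks = seq(0, 200, by = 50), plot = FALSE)

# Narrow bins (width 10): the cluster at 100 now stands out as a spike
narrow <- hist(sales, breaks = seq(0, 200, by = 10), plot = FALSE)

wide$counts
narrow$counts
```

Both histograms account for the same 11 observations; only the way they are grouped into bars changes.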

The distribution resembles a hyperbola: it falls off rapidly at first, then flattens out at higher values. It tells us that in most weeks sales are at the lower end of this range. To understand the pattern in detail we have to add other parameters to our graph.

Next we will use the density plot to analyze the sales.

qplot(Weekly_Sales, data = train, geom = "density", colour = month)

Here ‘colour = month’ means that the sales for each month are represented by a curve of a different colour.

As you can see, this gives us information similar to what we found from the histograms. It also shows us that the sales distribution doesn’t vary much by month. The curves for all the months more or less overlap in this graph.

A different representation can be obtained from histograms like:

Here different months are represented by different colours, and the breakdown of sales within a particular range is more prominent. For example, in the lowest bucket the share of March to May is quite high.

In R, users can define new infix operators by giving a function a name of the form %value%.
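For instance, here is a hypothetical operator, %paste%, defined purely for illustration, next to the built-in %in%, which has the same form:

```r
# A hypothetical infix operator that pastes two strings together
`%paste%` <- function(lhs, rhs) paste0(lhs, rhs)
label <- "Weekly" %paste% "_Sales"

# %in% is a built-in operator of the same form
is_holiday_month <- c("03", "11", "12") %in% c("11", "12")
```

The backticks are needed only when defining the operator; when using it, the name sits between its two arguments like any other operator.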

We will now illustrate the time series values for the sales data using line and path plots.

Line plots usually show how a single variable changes over time. We shall analyze the Weekly_Sales using the line plot to begin with.

qplot(Date, Weekly_Sales, data = train, geom = "line")

It gives us a graph like below, which shows that sales are roughly constant over the weeks except for some occasional spikes. We’ll analyze the reason for the spikes by digging deeper into the data.

In order to understand how two variables have changed over time we can use the path plot which we’ll take up in a different post.

So far we have been using qplot, and it’s an amazing function that’ll take care of many of your plotting needs. But to address needs that are more complex, requiring a wider range of plots, multiple sources of data and better customization, you need to learn the underlying grammar of graphics. I’ll illustrate the grammar with some simple steps.

In order to understand the sales data in detail we have to combine the train and test datasets with the features dataset and store data set. The features dataset contains additional fields like the temperature, Fuel price, Markdown, CPI and unemployment rates for different weeks. Markdown is a temporary discount offered for certain products and it would be interesting to analyze the impact of such promotions on the overall sales for a store.

Once we get a complete view of the data set we’ll get a clear understanding of each store’s properties (like location etc.) and relate them to the weekly sales. Taking this into consideration, we will be able to answer multiple questions about the data: how are factors like temperature and Weekly_Sales related, do certain types of stores show higher sales for a particular promotion, and how have sales changed for different types of stores over the months? While answering these questions we will also learn more about R graphics.

The datasets can be combined with the following steps:

First we’ll combine the train data with the stores data:

trainStr <- merge(x=train, y=stores, all.x=TRUE)

Then we’ll combine the new data set obtained in the previous step with the features data to obtain a complete dataset containing all the values:

trainAll <- merge (x=trainStr, y=features, all.x=TRUE)
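To see what merge does with all.x = TRUE, here is a sketch on two made-up data frames. The column names Store and Type and all values are assumptions for illustration; the real stores and features files would first have to be read in with read.csv.

```r
# Toy stand-ins for the train and stores data (values are made up)
train_toy <- data.frame(Store = c(1, 1, 2, 3),
                        Date = c("2010-02-05", "2010-02-12",
                                 "2010-02-05", "2010-02-05"),
                        Weekly_Sales = c(100, 120, 90, 70))
stores_toy <- data.frame(Store = c(1, 2),
                         Type = c("A", "B"))

# merge() joins on the columns the two data frames share by name (here Store);
# all.x = TRUE keeps every row of x, filling NA where y has no match
joined <- merge(x = train_toy, y = stores_toy, all.x = TRUE)
```

Store 3 has no entry in stores_toy, so its Type comes out as NA instead of the row being dropped.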

Now that we have all values in one data set we can find out how Weekly_Sales varies with temperature:

qplot(Temperature, Weekly_Sales, data = trainAll, geom = c("point", "smooth"), span = 1)

We get the message:

geom_smooth: method=”auto” and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = “cs”). Use ‘method = x’ to change the smoothing method.

The graph shows that sales increase with temperature up to a certain level and then start decreasing at very high temperatures.

Let’s now plot the variation in sales with another important variable, MarkDown1.

The graph shows that although sales increase initially with the markdown, after a certain point they become steady and further markdown doesn’t have much impact on sales. We’ll now plot MarkDown2 against weekly sales to understand how it affects the sales.

Here we observe that MarkDown2 steadily increases the sales after an initial dip for low markdowns. This is in line with our conventional understanding that as we keep increasing discounts people tend to make more and more purchases. We’ll continue a similar analysis with the other markdowns in order to see if this conventional understanding holds in all cases and, if not, what the reason behind the deviations is.

Nov 16

Parametric vs Non-parametric models

By ganpati | Uncategorized

In predictive analytics we will be using both parametric and non-parametric algorithms in our models. Parametric algorithms usually assume that the data follows a known distribution, such as the normal distribution.

A parametric model captures all information about the data as parameters. All you need to know for predicting a future data value from the current state of the model is just its parameters.

For example, in the case of linear regression we assume that the errors around the fitted line are normally distributed. If we know the predictor, the fitted coefficients give us the expected value of the dependent variable, and the standard deviation of the errors tells us how far actual values are likely to fall from it.

On the other hand, a non-parametric model can capture finer aspects of the data points rather than only their distribution. As it knows very little about the data a priori, it follows an iterative process to find a relationship for making the prediction. There is no guarantee that an optimal relationship can be found.

A non-parametric model lets more information pass from the data seen so far into the prediction of future data. The parameters are usually said to be infinite in dimension, so they can express the characteristics of the data much better than those of parametric models. Having more degrees of freedom makes the model more flexible. A Gaussian mixture model, for example, has the flexibility to describe the data as a combination of multiple Gaussian distributions. Observing more data will help you make better predictions about future data.

In short, a parametric model predicts new data knowing just the parameters (think of linear regression based on a set of coefficients). For a non-parametric model, predicting future data is based not just on the parameters but also on the current state of the data that has been observed.

Think of decision trees that will build a model based on available data. It will make no assumption about the distribution of the data at the beginning. It will build a model based on its learning from the data.

Another example is KNN, where all the original training data is retained in the model in order to make predictions. A non-parametric model does not necessarily retain the training data, though it may store some aspects of it. For example, artificial neural networks (ANN) are often treated as non-parametric models, yet they do not retain the training data.
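The contrast can be sketched in a few lines of base R. The data is simulated, and knn_predict is a hand-rolled helper written for this illustration, not a library function.

```r
set.seed(1)
x <- seq(1, 10, length.out = 20)
y <- 2 * x + 1 + rnorm(20, sd = 0.5)  # simulated data around a straight line

# Parametric: after fitting, two coefficients are all we need to predict
fit <- lm(y ~ x)
coefs <- coef(fit)
pred_lm <- coefs[["(Intercept)"]] + coefs[["x"]] * 5.5

# Non-parametric (k-nearest neighbours): the training data itself
# is consulted at prediction time
knn_predict <- function(x_new, x_train, y_train, k = 3) {
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  mean(y_train[nearest])
}
pred_knn <- knn_predict(5.5, x, y, k = 3)
```

Once fit is available, x and y could be thrown away and pred_lm could still be computed from coefs alone, whereas knn_predict cannot work without the training vectors.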

Nov 16

Feature engineering 1

By ganpati | Uncategorized

In this tutorial we shall discuss how we can make the most of the data we have using feature engineering (data pre-processing). Especially when we have limited data, we need to make the most of it so that our algorithms perform better. That’s why feature engineering is also known as the human element of machine learning. This is the part that really differentiates one analytical model from another, through understanding, intuition, experience and creativity.

In feature engineering we will primarily carry out addition, deletion and transformation of the data.

  • Transformation is required to reduce the impact of skewness and outliers in the data
  • Addition generally involves creating new variables, such as dummy variables or combinations of multiple predictors
  • Deletion involves removal of variables that add no value as predictors

Between-Predictor Correlations

Collinearity is the technical term for the situation where a pair of predictor variables have a substantial correlation with each other. It is also possible to have relationships between multiple predictors at once (called multicollinearity). For example, if we are predicting luxury goods purchases, being in the higher income bracket and shopping in high-end malls might be highly correlated.
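A quick way to spot collinear pairs is a correlation matrix. The sketch below uses simulated predictors (income, mall_visits and age are made-up names and values), and the cutoff of 0.8 is an arbitrary choice for illustration.

```r
set.seed(42)
income <- rnorm(100, mean = 50, sd = 10)
mall_visits <- 0.5 * income + rnorm(100, sd = 2)  # built to track income
age <- rnorm(100, mean = 40, sd = 8)              # unrelated to the others

predictors <- data.frame(income, mall_visits, age)
cor_matrix <- cor(predictors)

# Flag each pair once (upper triangle) whose absolute correlation exceeds 0.8
high <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
high
```

Each row of high gives the row and column positions of one highly correlated pair in cor_matrix.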

What’s the best way to process raw data to make it perfect for machine learning? Actually, there isn’t any single answer to this question. All we can do is develop an intuition about the models and about what data works best with which type of model. In the context of this example, we’ll break some of the values into parts and combine some values to make them more meaningful. An engineered variable is more easily consumed by a machine learning algorithm than raw data.

For the purpose of this tutorial I’ll select a few variables for applying the feature engineering techniques. Before we do that let’s have a look at the advantages of removing unwanted variables from data.

The three major advantages of removing unwanted variables are:

Decreased computational time and complexity- the fewer the variables, the more easily they will be digested and processed by any model

Easy interpretations- while analyzing a model’s performance it is important to understand how a particular variable contributes towards the prediction. Under such circumstances removing correlated variables makes the model easier to interpret without losing its predictive power.

Predictors with degenerate distributions- you may consider a degenerate distribution as the distribution of a variable having a single value. For example, what happens when we are trying to predict the winner in a baseball match of Cubs vs Royals and one of the variables is whether the team has blue in its jersey? Since both teams have blue, this variable doesn’t add any value as a predictor. The degenerate variable is localized at a point. These are also problematic variables in predictive analytics, and getting rid of them during feature engineering may increase the model performance significantly.
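Degenerate columns are easy to detect by counting distinct values. A minimal sketch on a made-up data frame based on the jersey example (the runs_per_game numbers are invented):

```r
# Toy data for the jersey example (numbers are made up)
teams <- data.frame(team = c("Cubs", "Royals"),
                    has_blue_jersey = c(TRUE, TRUE),
                    runs_per_game = c(4.9, 4.2))

# A column is degenerate when it holds a single distinct value
is_degenerate <- vapply(teams, function(col) length(unique(col)) == 1, logical(1))

# Keep only the columns that actually vary
teams_reduced <- teams[, !is_degenerate, drop = FALSE]
```

On real data you may also want to flag near-degenerate columns, where one value accounts for almost all rows, but the threshold for that is a judgment call.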

Reading data in R and viewing the summary

We will start with reading the data as usual:

train <- read.csv("Kaggle_YourCabs_training.csv")

We can use the str function to find out what lies inside the data:

str(train)

Now, let’s find out how the test data looks:

test <- read.csv("Kaggle_YourCabs_score.csv")

Inferring from the combined dataset

Now that we have read the data and stored the test and train data in two different objects, let’s combine the two datasets to see the pattern of each field over a bigger set of records. For that we need to use the rbind function, which attaches the rows of one data frame to the rows of another.

But before using rbind we need to drop the extra columns to make the two datasets similar. If you look at the test and train datasets you’ll find that there are two extra columns at the end of test (x and x.1) and two extra columns in the train dataset (Car_Cancellation and Cost_of_error, which are the value to be predicted and the evaluation method).


> ncol(test)

[1] 20

This gives us the total number of columns in the table as 20. We need to drop the 19th and 20th columns:

test1 <- test [-c(19:20)]

> ncol(train)

[1] 20

train1 <- train [-c(19:20)]

Now let’s combine the data in test1 and train1:

data <- rbind(test1, train1)

Remember, rbind will work only when both datasets have the same number of columns. For data frames it matches columns by name and stops with an error if the names differ, so make sure before applying this function that the columns are exactly the same in both datasets.
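Here is a sketch of a safer way to prepare two data frames for rbind, selecting the shared columns by name instead of by position. The toy data frames and their values are made up for illustration.

```r
# Toy stand-ins: train has an extra column, test has its columns in another order
train_toy <- data.frame(id = 1:2,
                        vehicle_model_id = c(12, 65),
                        Car_Cancellation = c(0, 1))
test_toy <- data.frame(vehicle_model_id = c(85, 28),
                       id = 3:4)

# Binding the raw data frames fails because the column counts differ
failed <- inherits(try(rbind(train_toy, test_toy), silent = TRUE), "try-error")

# Keep only the columns the two data frames share, in one common order
common <- intersect(names(train_toy), names(test_toy))
train1_toy <- train_toy[, common]
test1_toy <- test_toy[, common]

combined <- rbind(test1_toy, train1_toy)
```

Selecting by name avoids silently dropping the wrong columns if the column order in the files ever changes.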


gives us the following overview of the data


unique(data$vehicle_model_id)

[1] 12 65 85 28 24 54 23 87 75 17 36 30 43 86 10 89 64 90 76 13 91 72 1 69 14

[26] 70 39

This shows us that there are 27 unique values for vehicle_model_id.

Now, if we plot histograms for each of the numeric columns we will find results like below:

hist(data$vehicle_model_id, breaks=27)

These histograms tell us a story about the data. Let’s summarize it here:

  1. Most vehicle models have very little demand and only one vehicle model dominates the bookings.
  2. Similarly for travel type, there is one travel type that dominates the overall bookings.
  3. There is a substantial number of online bookings.
  4. The number of mobile bookings is small compared to the overall number of bookings.

Further analysis of travel_type_id, using the prop.table function to calculate the percentage of cancellations for each travel type, gives us an interesting insight.

prop.table(table(train$travel_type_id, train$Car_Cancellation),1)

            0          1
1  0.98678414 0.01321586
2  0.91907734 0.08092266
3  0.95549669 0.04450331
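To make the role of the second argument clear, here is a sketch on made-up bookings. With margin 1, each count is divided by its row total, so every row of the result sums to 1.

```r
# Made-up bookings: travel type and whether the booking was cancelled
travel_type <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
cancelled   <- c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1)

counts <- table(travel_type, cancelled)

# margin = 1 gives row proportions: the share of cancellations within each travel type
row_share <- prop.table(counts, 1)
row_share
```

Using margin 2 instead would give column proportions, which answer a different question: of all cancelled bookings, what share belongs to each travel type.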

Another interesting piece of analysis would be to get the proportion of cancellations for online and non-online bookings. The prop.table function is really handy for finding such information:


            0          1
0  0.95635808 0.04364192
1  0.87537656 0.12462344

That’s interesting! The rate of cancellation for online bookings is about 12%, compared to non-online bookings, which witness just about 4% cancellation. People are not using the internet in the right way!!

The travel type breakdown shows us that more popular travel types generate more frequent cancellations than niche travel types. This makes sense. Let’s say you are traveling to a special destination (your much deserved vacation, or a friend’s birthday party). Chances are you’ll plan well for the journey, and the chances of cancellation are lower. But in case you are undertaking a journey that’s quite usual, you might plan less, and the chances of cancellation are higher (your meeting gets postponed, you decide to share a ride with a friend back from the office).

But what about the data that is available as factors? How can we draw a picture of that valuable data?

Let’s tackle these factors one by one. The easiest and most interesting piece of information is the from city and to city. If we convert the city ID to a numeric value using as.integer, we can plot the city ID as a histogram. Let’s try that:

data$from_area_id = as.integer(data$from_area_id)

hist(data$from_area_id, breaks=605)

This histogram shows us that some areas witness a high number of bookings while others can be classified as medium and low. Can we draw any insight about the cancellations from this data alone? Probably not. We have to dig a little deeper. Let’s draw plots for to_area_id in a similar way and see if we can compare them. If we place the plots of from and to cities side by side we can get some more insights.

hist(data$to_area_id, breaks=576)

We have set breaks to 576 because there are 576 factor levels and each level will be represented by a column in the histogram.
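One thing to keep in mind when a column really is a factor: as.integer returns the internal level codes, not the original ID numbers. A small sketch on made-up area IDs:

```r
# Made-up area IDs stored as a factor
area <- factor(c("1021", "455", "1021", "83"))

# as.integer() on a factor returns the level codes (1, 2, 3, ...), not the labels
codes <- as.integer(area)

# Going through as.character() first recovers the original ID numbers
ids <- as.integer(as.character(area))
```

For a histogram with one column per level the codes are exactly what we want, but the x-axis positions are then level numbers and should not be read as area IDs.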

The areas that see the most trip origins are different from the areas that see the most destinations. But how can we relate this to cancellations? Would it not be logical to assume that areas with the most originating trips must have a larger supply of cabs than other areas? If so, people booking from these areas will have more choice of cabs and hence can afford to cancel. Let’s see how our assumption works out on actual data.

A similar function is cbind that combines columns from one dataset to the columns of another dataset. Let’s discuss it in our next post.