Our aim is to close the gap between data and decision makers by developing a thriving analytics community. Before ending the post I have two requests for the readers:
- If you have come across any interesting piece of R code that you think can be used as a learning resource, please post it in the comments.
- If a fellow community member seeks a response regarding any concept or R code, please take the initiative to help them.
In this post we will analyze the sales for a retail chain with the help of machine learning principles. We will use visualization techniques to understand the data. Once you grasp these techniques you can apply these principles to many other problems. I'll start with simple techniques (like the plot function) and slowly move towards more complex techniques for analyzing the data. Feel free to cite scenarios that can be added to the post and ask questions if you have any query related to visualizing a data science problem.
I'll first introduce you to a new function, attach(), that will help you get rid of the $ notation. The attach function is a convenient way to reference data columns.
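A minimal sketch of how attach() works (the toy data frame here is my own, not the actual train data):

```r
# attach() puts a data frame's columns on the search path, so they
# can be referenced without the df$ prefix
df <- data.frame(Store = c(1, 2), Weekly_Sales = c(24924, 50605))
attach(df)
mean(Weekly_Sales)  # same as mean(df$Weekly_Sales)
detach(df)          # detach when done, so the columns don't mask other objects
```

One caveat on this design choice: attach() can silently mask objects with the same name, so many R users prefer with(df, ...) for one-off expressions.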
We’ll start with our usual way of reading the test and train data and we’ll combine the test and train data to a data object.
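If you're coding along, that step might look like this (the file names train.csv and test.csv are my assumption about how the competition files are saved):

```r
# read the train and test files; the file names are assumptions
train <- read.csv("train.csv")
test  <- read.csv("test.csv")
str(train)  # inspect the columns and their types
```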
The output shows that there are five columns in the data. We'll have to predict the values of the Weekly_Sales column in the test data.
To start with our graphical analysis we’ll first remove the weekly sales column values from the train data, as if we don’t know the values for weekly sales to begin with, and combine the test and train data to observe the pattern within the various variables.
Once we observe certain patterns that we think will help us predict the values of Weekly_Sales, we will start working on a model after adding the Weekly_Sales values back to the data.
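A sketch of that step, assuming the remaining columns of test and train match:

```r
# keep a copy of the target so we can add it back later
weekly_sales <- train$Weekly_Sales
train$Weekly_Sales <- NULL
# stack the two datasets; rbind assumes identical column names
data <- rbind(train, test)
```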
# we'll start by including the library ggplot2 that we installed earlier
library(ggplot2)
Now, let's convert the Date field to character and extract the year, month and day values:
train$Date <- as.character(train$Date)
train$year <- substr(train$Date, 1, 4)
train$month <- substr(train$Date, 6, 7)
train$day <- substr(train$Date, 9, 10)
Let's start with a simple plot to see how the sales trend varies over the months:
qplot(month, Weekly_Sales, data = train)
This gives us a graph like below:
What does the graph tell us? It doesn’t convey anything meaningful because of overplotting.
It might appear from the chart that sales are almost constant from January till October and jump during November and December. But this is just an illusion created by overlapping data points, which we shall demonstrate by making the dots more transparent. Even at a very high level we can't conclude anything from the graph, because the top of each column of points is determined by the store with the maximum sales. The median of the weekly sales may lie somewhere else, which would give us a fairer idea of how sales vary with the months. We'll demonstrate this through boxplots in a later part of this post.
In order to understand the plots in detail, we have to understand which geometric shape we should use to represent a graph under different conditions, and how scatterplots or bubble plots differ from line graphs.
A scatterplot represents each observation as a point, positioned according to the two variables plotted on the x and y axes. The number of points and their relative positions help us decide which plot gives a better representation of the data. In the example above, for a single value of x (month) there can be multiple values of y (store sales), so we might need a graph that gives us a combined picture of all y-axis values for a particular x-axis value. On the other hand, if the x axis carried a continuous variable with just one y value for each x value, a line (or curve) would have been sufficient to visualize the data. There may also be cases where the x axis represents a categorical variable, or a variable with a certain range of values; in that case a bar chart could be the best representation of the data. Bar charts are also useful when we are looking at the values of a single variable and examining its frequency in different x-axis intervals. Now, let's get back to the current problem and check out how different sets of variables can be graphically represented to facilitate our analysis.
We'll use qplot with geom_smooth() to create a plot that fits a smooth line through the middle of the data. After mapping the data to aesthetics, the data is passed to a statistical transformation, which manipulates it to render a more comprehensible and meaningful graph. There are different types of stats, such as the loess smoother, one- and two-dimensional binning, group means, quantile regression and contouring.
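As a quick illustration of the smoother, here's a qplot call on a toy data frame (my own data, standing in for the train set):

```r
library(ggplot2)
# a noisy quadratic trend; the "smooth" geom fits a line through the middle
df <- data.frame(x = 1:100, y = (1:100)^2 + rnorm(100, sd = 500))
qplot(x, y, data = df, geom = c("point", "smooth"))
```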
So far we dealt only with qplot, which allows only a single dataset and a single set of aesthetic mappings. What if we want to create more complex graphs with multiple aesthetic features? That's where we use ggplot. In this post I shall describe how to use layers, geoms, statistics and position adjustments to visualize large and complex data with ggplot. We will use layers to add multiple characteristics, like the statistics and position of the data. For example, the simplest layer just specifies a geom. Layers are added to a ggplot object using a simple + sign like below:
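For example (a toy data frame, not the actual train data):

```r
library(ggplot2)
# ggplot() builds the plot object; the + sign adds a layer to it
df <- data.frame(month = c("01", "02", "03"), sales = c(10, 20, 15))
p <- ggplot(df, aes(x = month, y = sales))  # defaults only, no layers yet
p + geom_point()                            # adding a point layer with +
```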
Layers and facets of data
The scatterplot uses points, but if we draw lines across these points we would get a line plot. If we use bars, we’d get a bar plot.
Points, lines and bars are all geometric objects and are termed geoms in ggplot2. A plot may have a single geom or multiple geoms, as we shall discuss going forward.
So far we had been plotting the raw data points that we have. Now we will discuss a few concepts for representing the data after applying a scale transformation or a statistical transformation. Scale transformations are applied before statistical transformations, so that the data points have some uniformity before statistics are applied.
Let's apply two types of point geoms, the scatterplot and the bubble chart, to our data to analyze it further.
The graphs that we plotted earlier suffered from overplotting because of the huge number of data points. Overplotting is when multiple observations are plotted at the same point: if a point is drawn 10 times, visually we still observe it only once. In order to deal with this problem we use transparency. Once we make the points more transparent, areas where multiple points are located become prominent compared to areas with single points.
The components of a layer are an optional mapping, a dataset, parameters for the geom or stat, the geom or stat itself, and a position adjustment. The examples below illustrate how layers are added in different scenarios:
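Here's one sketch of a layer with everything named explicitly (toy data again; the parameter choices are mine):

```r
library(ggplot2)
df <- data.frame(x = rnorm(1000))
# a fully spelled-out layer: geom, stat, position and their parameters
ggplot(df, aes(x = x)) +
  layer(
    geom = "bar",             # geometric object to draw
    stat = "bin",             # statistical transformation to apply
    position = "identity",    # position adjustment
    params = list(binwidth = 0.25, fill = "steelblue")
  )
```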
In this post I've explicitly named all arguments, but in actual code you may rely on positional matching. It's always advisable to name arguments, though, in order to make the code more readable.
We will use the summary function to inspect the structure of a plot: it shows the plot defaults first, and then each layer.
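For instance, on a small toy plot:

```r
library(ggplot2)
df <- data.frame(x = 1:10, y = (1:10)^2)
p <- ggplot(df, aes(x = x, y = y)) + geom_point()
summary(p)  # prints the plot defaults first, then each layer
```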
Layers are regular R objects and hence we will store them as variables to keep our code clean and reduce duplication. We’ll reuse these layers with a set of different plots as and when applicable.
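A small sketch of that reuse pattern (toy data frames, with a red linear-fit layer as the shared piece):

```r
library(ggplot2)
df1 <- data.frame(x = 1:10, y = (1:10)^2)
df2 <- data.frame(x = 1:10, y = sqrt(1:10))
# define the layer once...
best_fit <- geom_smooth(method = "lm", se = FALSE, colour = "red")
# ...and reuse it across different plots
ggplot(df1, aes(x, y)) + geom_point() + best_fit
ggplot(df2, aes(x, y)) + geom_point() + best_fit
```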
An important point to keep in mind is that ggplot can only be used with data frames.
I've used the plyr and reshape packages to give proper shape to the data so that ggplot2 can focus on plotting it.
The aes function is used to describe how data is mapped in the plot; it takes a list of aesthetic-variable pairs.
Instead of mapping an aesthetic property to a variable, you can set it to a single value by specifying it in the layer parameters. Aesthetics may vary for each observation being plotted, but parameters do not. We can map an aesthetic to a variable (e.g., aes(colour = month)) or set it to a constant (e.g., colour = "red").
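The two forms side by side, on a toy data frame:

```r
library(ggplot2)
df <- data.frame(month = rep(c("01", "02"), each = 5),
                 sales = c(rnorm(5, mean = 10), rnorm(5, mean = 20)))
# mapping: colour varies with the month variable, and a legend is drawn
ggplot(df, aes(month, sales)) + geom_point(aes(colour = month))
# setting: every point is red, and there is no legend
ggplot(df, aes(month, sales)) + geom_point(colour = "red")
```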
Faceting is a global operation (i.e., it works on all layers) and it needs to have a base dataset which defines the set of facets for all datasets.
The aes function also takes a grouping variable. Once we specify the grouping variable as store, we get a separate line for each store's sales. If we leave this out, all the stores' sales will be clubbed together and it will be impossible to reach any definite conclusion from the combined values.
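A sketch with a toy data frame (two hypothetical stores, A and B):

```r
library(ggplot2)
df <- data.frame(month = rep(1:3, times = 2),
                 sales = c(10, 12, 11, 20, 22, 19),
                 store = rep(c("A", "B"), each = 3))
# group = store draws one line per store instead of one clubbed line
ggplot(df, aes(month, sales, group = store)) + geom_line()
```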
This data contains categorical variables as well as continuous variables like Weekly_Sales, and you will probably be interested in how the values of the continuous variables vary with the levels of the categorical variables. Here's how we can plot categorical variables against continuous variables with the help of box plots and jittered plots. Box plots highlight two important characteristics: the spread of the values and the centre of the distribution. Boxplots summarise the bulk of the distribution with only five numbers, while jittered plots show every point but can suffer from overplotting. We can reduce the effect of overplotting by using semi-transparent points via the alpha argument.
library(mgcv) # install the mgcv package first (it depends on the nlme package)
qplot(month, Weekly_Sales, data = train, geom = c("boxplot"))
qplot(month, Weekly_Sales, data = train, geom = c("jitter"))
qplot(month, Weekly_Sales, data = train, geom = c("jitter"), alpha = I(1/10))
qplot(month, Weekly_Sales, data = train, geom = c("jitter"), alpha = I(1/100))
So far we have been plotting two variables against each other to see how they relate and how one variable varies depending on the other. What if we want to plot a single variable to find out in which range most of the values lie? For example, if we plot weekly sales we get a plot like below:
qplot(Weekly_Sales, data = train, geom = "histogram")
You will also see a message like below:
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
We can set the values of binwidth and xlim to make the plots more meaningful to our analysis.
Let’s start with adjusting the binwidth (which is the width of the bars) to 1 to see the changes in the plot.
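The call for that step (my reconstruction of the code behind the chart):

```r
# same histogram as before, but with the bar width forced to 1
qplot(Weekly_Sales, data = train, geom = "histogram", binwidth = 1)
```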
Setting the binwidth to 1 makes the bars almost invisible, which means the size of the bars should ideally be derived from the scale of the window. So let's adjust the x-axis values and the binwidth simultaneously to get a better representation of the histogram.
The maximum value of weekly sales can be found using:
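In code, that's simply:

```r
max(train$Weekly_Sales)  # largest weekly sales figure in the train data
```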
xlim can be used to specify the x-axis limits for the current axes. It takes two arguments, the minimum and maximum values of x, and the y-axis values are scaled accordingly. As the x-axis values may go up to 700,000, let's scale down the sales values by 100,000, 10,000 and 1,000 and fix the upper xlim values to 7, 70 and 700 respectively.
qplot(Weekly_Sales, data = train, geom = "histogram", xlim = c(0,7))
qplot(Weekly_Sales, data = train, geom = "histogram", xlim = c(0,70))
qplot(Weekly_Sales, data = train, geom = "histogram", xlim = c(0,700))
Now, if we make the binwidth as 1, we get something like below:
qplot(Weekly_Sales, data = train, geom = "histogram", binwidth = 1, xlim = c(0,700))
The initial binwidth was range/30 by default. By reducing it to 1 we get a lot of new peaks in the data, as some unusual spikes also get displayed. That is because earlier, when the bin width was, say, 20, all values between 0 and 20, 20 and 40, and so on were aggregated into a single bar, so even a high value inside a particular bin got smoothed out and we saw a smooth curve. As we decrease the bin width, the individual values also get plotted and those extreme or unusual values affect our plot. We need to balance the xlim and binwidth values to get a smooth curve that represents the pattern. By setting the bin width to 7, that is 1/100th of the xlim range, we can clearly see the distribution of the sales values.
qplot(Weekly_Sales, data = train, geom = "histogram", binwidth = 7, xlim = c(0,700))
The distribution resembles a hyperbolic curve: it falls off rapidly at first, then flattens out at higher values. It tells us that in most weeks sales are at a lower level. To understand the pattern in detail we have to add other parameters to our graph.
Next we will use the density plot to analyze the sales.
qplot(Weekly_Sales, data = train, geom = "density", colour = month)
Here colour = month means the sales for each month will be represented by a curve of a different colour.
As you can see, this gives us similar information to what we found from the histograms. It also shows that the sales distribution doesn't vary much with the month: the curves for all the months more or less overlap in this graph.
A different representation can be obtained from histograms:
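One way to produce that stacked view (the fill = month mapping is my guess at what generated the original chart):

```r
# colour the histogram bars by month, stacking the months within each bin
qplot(Weekly_Sales, data = train, geom = "histogram",
      binwidth = 7, xlim = c(0, 700), fill = month)
```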
Here different months are represented by different colours and the breakdown of sales within a particular range is more prominent. For example, in the lower bucket the share of March to May is quite high.
We will now illustrate the time series of the sales data using line and path plots.
Line plots usually show how a single variable changes over time. We shall analyze the Weekly_Sales using the line plot to begin with.
qplot(Date, Weekly_Sales, data = train, geom = "line")
It gives us a graph like below, which shows that sales are roughly constant over the weeks except for some occasional spikes. We'll analyze the reason for the spikes by digging deeper into the data.
In order to understand how two variables have changed over time we can use the path plot which we’ll take up in a different post.
So far we have been using qplot, and it's an amazing function that'll take care of many of your programming needs. But to address needs that are more complex, requiring a wider range of plots, multiple sources of data and better customization, you need to learn the underlying grammar of graphics. I'll illustrate the grammar with some simple steps.
In order to understand the sales data in detail we have to combine the train and test datasets with the features dataset and the stores dataset. The features dataset contains additional fields like the temperature, fuel price, markdowns, CPI and unemployment rate for different weeks. A markdown is a temporary discount offered on certain products, and it would be interesting to analyze the impact of such promotions on the overall sales of a store.
Once we get a complete view of the dataset we'll get a clear understanding of each store's properties (like location) and can relate them to the weekly sales. Once we take this into consideration we will be able to answer multiple questions about the data, such as how temperature and Weekly_Sales are related, whether certain types of stores show higher sales for a particular promotion, and how sales have changed for different types of stores over the months. While answering these questions we will also learn more about R graphics.
The datasets can be combined with the following steps:
First, we'll combine the train data with the stores data:
trainStr <- merge(x=train, y=stores, all.x=TRUE)
Then we'll combine the new dataset obtained in the previous step with the features data to obtain a complete dataset containing all the values:
trainAll <- merge(x=trainStr, y=features, all.x=TRUE)
Now that we have all the values in one dataset, we can find out how Weekly_Sales varies with temperature:
qplot(Temperature, Weekly_Sales, data = trainAll, geom = c("point", "smooth"), span = 1)
We get the message:
geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
The graph shows that sales increase with temperature up to a certain level and then start decreasing at very high temperatures.
Let's now plot the variation in sales with another important variable, MarkDown1.
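Mirroring the temperature plot, the call would look like this (MarkDown1 is the column name I'm assuming from the features file):

```r
# sales against the first markdown, with a smoother over the points
qplot(MarkDown1, Weekly_Sales, data = trainAll,
      geom = c("point", "smooth"), span = 1)
```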
The graph shows that although sales increase initially with the markdown, after a certain point they become steady and a further markdown doesn't have much impact on sales. We'll now plot MarkDown2 against weekly sales to understand how it affects the sales.
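Again mirroring the earlier calls (MarkDown2 assumed to be the column name in the features data):

```r
# sales against the second markdown, with the same smoother
qplot(MarkDown2, Weekly_Sales, data = trainAll,
      geom = c("point", "smooth"), span = 1)
```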
Here we observe that MarkDown2 steadily increases the sales after an initial dip at low markdown values. This is in line with our conventional understanding that as we keep increasing discounts, people tend to make more and more purchases. We'll continue a similar analysis with the other markdowns in order to see if this conventional understanding holds ground in all cases and, if not, what's the reason behind the deviations.