“The greatest value of a picture is when it forces us to notice what we never expected to see.” – John Tukey
This post will cover:
Longitudinal data tracks the same type of information on the same subjects at multiple points in time. For example, the data could contain a few stores and their sales over a period of 8 consecutive quarters.
A covariate is a variable that is possibly predictive of the outcome under study. In other words, any variable that can be used to predict the value of the target variable is called a covariate.
Cross-sectional data is collected by observing many subjects at the same point in time, without regard to differences in time. For example, data may be collected for different students from different states.
In this post we will discuss cardinal, ordinal and nominal (categorical) variables.
Cardinal variables are the ones that can be added, subtracted and multiplied. Examples: age, quantity, weight, count etc.
Ordinal variables can be compared using greater than, smaller than or equal to, but other mathematical operations such as addition or multiplication are not meaningful on them. Examples are dates and responses to survey questions such as strongly agree, agree or disagree.
Nominal variables give us information for classifying observations, but they can’t be compared and no mathematical operation such as addition or subtraction applies to them. Examples are race, gender etc.
Scatter plots can be used when we want a direct comparison between two variables to find out how much one variable is affected by the other. Here we’ll start by picking up the freeny dataset from R and demonstrate how different variables can be plotted to understand their relationship:
A snapshot of freeny is below:
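The snapshot itself is not reproduced here, but it can be regenerated. A minimal sketch, relying only on the fact that freeny ships with R's built-in datasets package:

```r
# freeny is part of the 'datasets' package that loads with R,
# so nothing needs to be installed.
data(freeny)

# First few rows: quarterly revenue (y) and four explanatory variables.
head(freeny)

# Column names and types at a glance.
str(freeny)
```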
If we now plot the revenue on the Y-axis against the price index on the X-axis, we see a clear relationship between the two variables: the revenue decreases as the price index increases. R command:
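The exact command from the original post is not preserved. A sketch of what it plausibly looked like, assuming revenue is the column y and the price index is the column price.index (the axis labels and title are my own):

```r
# Revenue (y) against price index, using base graphics.
plot(freeny$price.index, freeny$y,
     xlab = "Price index",
     ylab = "Quarterly revenue",
     main = "Revenue vs. price index (freeny)")

# The correlation coefficient confirms the direction seen in the plot.
revenue_price_cor <- cor(freeny$price.index, freeny$y)
revenue_price_cor
```

A negative coefficient here agrees with the downward slope in the scatter plot.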
Now we will plot each variable against the other variables in the table to see how they are related to each other. In real-life problems, scatter plots often come in handy when we want to test how the target variable changes with various other variables.
Before that, let me introduce Color Brewer.
Color Brewer will help you choose sensible color schemes for the plots. brewer.pal makes the color palettes from ColorBrewer available as R palettes. display.brewer.pal() displays the selected palette in a graphics window. Example:
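The example from the original post is missing. A minimal sketch, assuming the RColorBrewer package is installed and picking the Set1 palette with 3 colors to match the rest of this post:

```r
# install.packages("RColorBrewer")  # run once if the package is missing
library(RColorBrewer)

# Show the 3-color version of the Set1 palette in a graphics window.
display.brewer.pal(3, "Set1")

# brewer.pal() returns the same colors as a character vector of hex codes,
# which is what plotting functions accept in their col argument.
pal <- brewer.pal(3, "Set1")
pal
```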
In a similar fashion you may try out different palettes like Set2, Set3 etc. and vary the number of colors with display.brewer.pal(num_of_color, Palette_name) to get an idea of the different palettes. Once you know which palette to use and how many colors, simply pass col=brewer.pal(num_of_color, Palette_name) to the plot function.
Now, let’s plot the freeny data with 3 colors from the Set1 palette for each pair of variables:
In the above function, 3 is the number of different colors in the palette (the minimum value is 3). Set1 is a palette name (there are multiple palettes available for your plots in R).
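The original call is not shown, so this is a sketch under the assumption that the whole data frame was passed to plot(), which draws a scatter plot for every pair of columns:

```r
library(RColorBrewer)

# Three colors from the Set1 palette.
set1_colors <- brewer.pal(3, "Set1")

# Calling plot() on a data frame draws a scatter plot matrix.
# The three colors are recycled across the points.
plot(freeny, col = set1_colors)
```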
How to Read Two-Dimensional Scatter Plots?
For any graph, the variable named in the same row is plotted on the Y-axis and the variable named in the same column is plotted on the X-axis. So, the rightmost graph in the second row has lag.quarterly.revenue on the Y-axis and market.potential on the X-axis. The interesting fact about the above set of plots is that each combination of variables is either directly correlated or inversely correlated.
An advantage of the scatter plot is that we can look at the correlation visually. The difference between visual and mathematical correlation can be of great significance. Let’s say we compute the correlation coefficient between two variables as 0.2, which indicates only a weak linear relationship. But when we plot them, we may find that over a specific interval the two variables are perfectly correlated, while in other intervals they show no correlation at all. This finding can be of great help if we are interested in predicting or forecasting values in that particular interval only.
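To make this concrete, here is a small sketch with made-up data (not from any real dataset). By construction, y follows x exactly on the first 20 points and merely oscillates afterwards:

```r
# Synthetic data, invented purely for illustration.
x <- 1:100
y <- ifelse(x <= 20, x, 10.5 + 10 * sin(x))

# Correlation over the whole range versus inside the interval 1..20.
overall     <- cor(x, y)
in_interval <- cor(x[x <= 20], y[x <= 20])
c(overall = overall, in_interval = in_interval)

# The plot makes the structure obvious: a straight line, then scatter.
plot(x, y, pch = 19,
     col = ifelse(x <= 20, "red", "grey40"),
     main = "Perfect correlation in one interval only")
```

The single overall coefficient hides the straight-line segment that the plot shows immediately.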
This plot shows only two values on the X-axis, 0 and 1, and a point is plotted for almost every value of the variable on the Y-axis, so the plot tells us very little. To derive meaningful results from binary variables with a scatter plot, we first need to summarize the data using the aggregate function.
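The data behind that plot is not available, so here is a sketch using the built-in mtcars dataset as a stand-in (my choice, not the original author's). Its am column is binary (0 or 1):

```r
# Raw scatter plot: all points pile up at x = 0 and x = 1.
plot(mtcars$am, mtcars$mpg,
     xlab = "am (0 or 1)", ylab = "Miles per gallon")

# aggregate() collapses the data to one summary value per group.
avg_mpg <- aggregate(mpg ~ am, data = mtcars, FUN = mean)
avg_mpg

# The aggregated values are much easier to compare.
barplot(avg_mpg$mpg, names.arg = avg_mpg$am,
        xlab = "am (0 or 1)", ylab = "Mean miles per gallon")
```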
Bar charts are most suitable when we have different classes or groups in the data that we want to compare. For example, they can be used to depict time series where we want to compare a certain variable across decades, or we can break the classes into further groups for a detailed analysis, such as comparing the data for various regions in the same decade. The classes or groups can be broken down in many ways, as I’ll demonstrate here.
We’ll start with the simplest plot:
In the above chart there are 9 bars, each denoting the frequency of values in a particular interval. The first bar shows the number of values in the 0-2 interval. If we count them in the table, there are 33 such values, so 33 is the frequency of the first interval, shown on the Y-axis. Of course, we can change the number of bars, colors etc. for this plot, as we’ll discuss next.
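The data behind that chart is not included in this post, so its counts cannot be reproduced here. As a sketch of the same kind of plot, this uses the weight column of the built-in ChickWeight dataset (my substitution) and shows how the number of bars and the color can be changed:

```r
w <- ChickWeight$weight

# Simplest version: R picks the intervals itself.
hist(w)

# breaks is a suggested number of intervals (R may round it to "pretty"
# cut points); col sets the fill color of the bars.
h <- hist(w, breaks = 9, col = "steelblue",
          main = "Distribution of chick weights",
          xlab = "Weight")

# Frequency in each interval, i.e. the bar heights.
h$counts
```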
I personally prefer using pie charts when comparing two outcomes. For example, the responses to a survey in which respondents answered a ‘Yes’ or ‘No’ question can be represented well using a pie chart.
Pie charts are suitable for depicting the results of a single variable that has few outcomes, where the outcomes are expressed as percentages rather than absolute numbers. They provide a visual aid for grasping the contrast between those outcomes. That’s why you’d observe them mostly in polls etc.
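A minimal sketch of such a chart. The survey counts below are hypothetical, invented only to have something to draw:

```r
# Hypothetical yes/no survey counts (not real data).
responses <- c(Yes = 62, No = 38)

# Convert counts to percentages for the labels.
pct <- round(100 * responses / sum(responses))
slice_labels <- paste0(names(responses), " (", pct, "%)")

pie(responses, labels = slice_labels,
    col = c("steelblue", "tomato"),
    main = "Survey response")
```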
Bubble charts are often useful for showing a ripple effect. For example, if we want to depict the assets under consideration at different stages of the financial crisis, we can visualize it well using a bubble chart.
These charts are suitable for visualizing 3-dimensional data, where 3 variables have to be depicted in a two-dimensional space. It’s best to place two variables on the two axes and represent the 3rd variable using color coding. The 3rd variable should be a categorical variable in this case.
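As a sketch of the color-coding idea, the example below uses the built-in mtcars dataset (my choice of data): weight and mileage go on the axes and the number of cylinders, treated as a categorical variable, sets the color.

```r
# Third variable: number of cylinders, converted to a factor.
cyl <- factor(mtcars$cyl)

# One color per category, looked up by the factor's integer codes.
cols <- c("red", "darkgreen", "blue")
point_cols <- cols[as.integer(cyl)]

plot(mtcars$wt, mtcars$mpg, col = point_cols, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
legend("topright", legend = levels(cyl), col = cols, pch = 19,
       title = "Cylinders")
```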
qplot function in R
We’ll start discussing qplot with the ChickWeight dataset available in R that looks like below:
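The snapshot of the dataset is not reproduced in this post. It can be viewed with a few lines, since ChickWeight is part of R's built-in datasets:

```r
# First few rows: weight, Time, Chick and Diet.
head(ChickWeight)

# Structure of the columns.
str(ChickWeight)

# Number of observations recorded under each diet.
diet_counts <- table(ChickWeight$Diet)
diet_counts
```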
The most basic plot we can draw with qplot takes the variables to be plotted on the X-axis and Y-axis, plus the name of the dataset:
library(ggplot2)
qplot(Time, weight, data = ChickWeight)
The plot is rendered as below:
This plot is not very helpful for making any inference from the data. So let’s go a step further and find out how we can differentiate between the various diets by looking at this plot. We shall use different colors to plot the values of different diets by using the parameter colour = Diet.
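A sketch of that call, assuming the ggplot2 package is installed. Recent ggplot2 releases mark qplot() as deprecated, so the equivalent ggplot() call is shown next to it:

```r
# install.packages("ggplot2")  # run once if the package is missing
library(ggplot2)

# qplot with one color per diet.
p_quick <- qplot(Time, weight, data = ChickWeight, colour = Diet)

# The same plot written with the full ggplot() grammar.
p_full <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet)) +
  geom_point()

print(p_quick)
print(p_full)
```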
That’s more helpful. We can see that Diet 3 is the most effective in increasing weight and Diet 1 is the least effective, while Diet 2 gives mixed results.
ggplot2 library in R
Last but not least, picking the right chart or graph depends on its purpose. If the end user of the graph is statistically savvy, you may opt for a complex chart that conveys more information in a short window. On the other hand, if the end user is only interested in a high-level view, you should keep your graphs and charts simple and focus only on end results that are actionable.
You may visit some nice blogs about data visualization from around the internet