We often come across variables that show some form of correlation: temperature and ice cream sales, month of the year and beer sales, credit score and the probability of defaulting on a loan, and so on. When two correlated variables can be listed as (x, y) pairs in a two-column data table, and we need to predict one of them using the other, regression is the tool of choice.
Like any other tool, regression has its disadvantages, and we'll discuss them in a while. But first let me showcase the positive side of regression analysis, which makes it figure in many of the top-10 algorithm lists created by various experts.
Applying a regression model to data is easy and takes only a few minutes if you know the right functions to use (I've included some R and Python functions here). But you should be aware of a few ground rules before applying regression analysis, so that you are not misled by the simplicity of the method. I usually recommend the following checks before applying regression:
1. Type of data: Be careful about spurious correlations. Use your sense of judgement and experience and don't just go by the numbers; books like How to Lie with Statistics show how misleading raw numbers can be.
2. Assumptions: We often ignore assumptions when we get the results we want. I've listed below the key assumptions and why they are important. Once you make it a habit to check that all assumptions are satisfied, you'll rarely go wrong with regression, or with any other algorithm for that matter.
3. Select the right regression model: There are various types of regression that suit different scenarios. We'll discuss many of them in a separate post; for the purpose of this post, I'll use only linear regression.
4. Test the result: After you derive the results, tick some mental checkboxes to make sure the prediction is solid. Depending on the complexity of the data, this may take some time.
It's also important to understand the assumptions behind a linear regression model in order to understand its strengths and weaknesses. Here I've listed some of the assumptions and how to be careful with them.
The variable that we want to predict (y) is called the dependent variable, and the variable we predict it from (x) is the independent variable. "Independent" means that we have control over this variable during the experiment: we can collect values of y for known values of x in order to derive the coefficient and y-intercept of the model under certain assumptions. The equation looks like this:
y = a + bx + e
a is the y-intercept
b is the slope of the line
e is an error term
In practice, we do not know the value of the error term, so we use the following form of the equation:
y = a + bx
When we draw a regression line, some points are above the line and some are below it. Points above the line have positive errors and points below have negative errors. If we could fit a line through all of these points, it would be a regression line without any error; but with real data such a line rarely exists. So our objective here is to reduce the error to a minimum.
So, what should be the criterion for best fit?
One idea is to take the sum of the errors, both positive and negative, and minimize it to get the best-fit line. The problem is that large positive and large negative errors cancel each other out: the sum can be near zero even when the model has large individual errors.
We can use several methods to measure the error of a regression line. Two of the most common are:
Mean absolute error
Sum of squares of error
But the sum of squares is the most popular one, as it magnifies large errors and squared errors cannot cancel each other out.
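A small numeric sketch (toy numbers made up for illustration) shows why the plain sum of errors is a poor criterion while the sum of squares is not:

```python
import numpy as np

# Observed values and a deliberately poor flat prediction line
y_obs = np.array([1.0, 5.0, 2.0, 6.0])
y_pred = np.full_like(y_obs, y_obs.mean())  # predict the mean everywhere

errors = y_obs - y_pred          # [-2.5, 1.5, -1.5, 2.5]
print(errors.sum())              # 0.0 -- positives and negatives cancel
print((errors ** 2).sum())       # 17.0 -- squaring exposes the misfit
```

The raw errors sum to zero even though every single prediction is off, while the sum of squares stays large.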
The difference between the observed value of the dependent variable (y) and the predicted value (y-hat) is called the residual (e). Each data point has one residual.
Residual = Observed value – Predicted value
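As a quick sketch (the data points are made up for illustration), residuals can be computed in Python with numpy's least-squares line fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

b, a = np.polyfit(x, y, 1)   # slope b and intercept a of the fitted line
y_hat = a + b * x            # predicted values
residuals = y - y_hat        # one residual per data point
```

Note that for an ordinary least-squares fit with an intercept, the residuals always sum to (numerically) zero.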
When a residual plot shows a random pattern, it indicates a good fit for a linear model.
The difference between the height of each man in the sample and the unobservable population mean is a statistical error, whereas
The difference between the height of each man in the sample and the observable sample mean is a residual.
There are four principal assumptions which justify the use of linear regression models for purposes of inference or prediction:
(i) linearity and additivity of the relationship between dependent and independent variables:
(a) The expected value of dependent variable is a straight-line function of each independent variable, holding the others fixed.
(b) The slope of that line does not depend on the values of the other variables.
(c) The effects of different independent variables on the expected value of the dependent variable are additive.
(ii) statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)
(iii) homoscedasticity (constant variance) of the errors
(a) versus time (in the case of time series data)
(b) versus the predictions
(c) versus any independent variable
(iv) normality of the error distribution.
If any of these assumptions is violated (i.e., if there are nonlinear relationships between dependent and independent variables or the errors exhibit correlation, heteroscedasticity, or non-normality), then the forecasts, confidence intervals, and scientific insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.
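As a rough illustration (not a substitute for formal diagnostics such as the Durbin-Watson or Shapiro-Wilk tests), simple numeric checks on the residuals can flag violations of assumptions (ii)-(iv). The data here is simulated so the assumptions hold by construction:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, size=x.size)  # linear truth + i.i.d. noise

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

# (ii) independence: lag-1 autocorrelation of residuals should be near 0
lag1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# (iii) homoscedasticity: residual variance should be similar across halves
half = resid.size // 2
var_ratio = resid[:half].var() / resid[half:].var()

# (iv) normality: residual skewness should be near 0 (a crude check)
skew = ((resid - resid.mean()) ** 3).mean() / resid.std() ** 3
```

If the lag-1 autocorrelation is large, the variance ratio far from 1, or the skewness far from 0, the corresponding assumption deserves a closer look.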
Linear regression models depend linearly on their unknown parameters and are easier to fit than models which are non-linearly related to their parameters. Also, because of this, the statistical properties of the resulting estimators are easier to determine in case of linear regression.
There are two additional pieces of information that are of importance for us: r and r^2
The correlation coefficient, r, indicates the nature and strength of the relationship between x and y. Values of r range from -1 to +1. A correlation coefficient of 0 means that there is no linear relationship. A value of -1 indicates a perfect negative correlation, and a value of +1 a perfect positive correlation.
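The two extremes are easy to demonstrate with made-up data that is exactly linear:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y is an exact increasing linear function of x: perfect positive correlation
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # +1.0

# y is an exact decreasing linear function of x: perfect negative correlation
r_neg = np.corrcoef(x, -2 * x + 1)[0, 1]  # -1.0
```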
A bike rental company is analyzing the weather-related factors that affect the number of bikes rented out during any season. You are conducting the regression analysis and include humidity and temperature as your two predictors. The following results were obtained from R using the formula
fit <- lm(formula = count ~ temp + humidity, data = bike_rental)
The regression results show you that both predictors are significant because of their low p-values. Together, the two predictors explain 22.72% of the variance of bike rentals. Specifically:
• For each 1 degree increase in temperature, the number of bike rentals is expected to increase by 0.909%.
• For each 1 percentage-point increase in humidity, the number of bike rentals is expected to decrease by 3.1.
Regression results identify the direction, size, and statistical significance of the relationship between a predictor and response.
• The sign of each coefficient indicates the direction of the relationship.
• Coefficients represent the mean change in the response for one unit of change in the predictor while holding other predictors in the model constant.
• The P-value for each coefficient tests the null hypothesis that the coefficient is equal to zero (no effect). Therefore, low p-values indicate the predictor is a meaningful addition to your model.
• The equation predicts new observations given specified predictor values.
The above example showed how to use regression in R. Below is the equivalent skeleton in Python:
# Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
# Load train and test datasets
# Identify feature and response variable(s); values must be numeric numpy arrays
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
score = linear.score(x_train, y_train)
# Equation coefficients and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
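For completeness, here is a self-contained version of that skeleton with simulated data (the true slope of 2.5 and intercept of 4.0 are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulate data from y = 4.0 + 2.5*x plus a little noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 4.0 + 2.5 * X[:, 0] + rng.normal(0, 0.5, 100)

model = LinearRegression()
model.fit(X, y)

print('Coefficient:', model.coef_)     # close to 2.5
print('Intercept:', model.intercept_)  # close to 4.0
print('R^2 score:', model.score(X, y))
```

With low noise, the fitted coefficient and intercept land close to the true values and the R^2 score is near 1.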
In simple linear regression, the standardized regression coefficient is the same as Pearson's correlation coefficient.
The square of Pearson's correlation coefficient is the same as the R^2 in simple linear regression.
Neither simple linear regression nor correlation answers questions of causality directly. This point is important, because I've met people who think that simple regression can magically allow an inference that X causes Y.
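Both identities are easy to verify numerically; this sketch uses simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.7 * x + rng.normal(size=300)

r = np.corrcoef(x, y)[0, 1]

# Standardize both variables; the OLS slope then equals r
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
slope_std = np.polyfit(zx, zy, 1)[0]

# R^2 of the simple regression equals r^2
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
r2 = 1 - resid.var() / y.var()
```

Both equalities hold to floating-point precision for any simple (one-predictor) least-squares fit.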
Second, some differences:
With correlation, it doesn’t matter which of the two variables you call “X” and which you call “Y”. You’ll get the same correlation coefficient if you swap the two.
The regression equation (i.e., a + bX) can be used to make predictions of Y based on values of X
While correlation typically refers to the linear relationship, it can refer to other forms of dependence, such as polynomial or truly nonlinear relationships
While correlation typically refers to Pearson’s correlation coefficient, there are other types of correlation, such as Spearman’s.
Regression can't compensate for poor data quality. Data should be prepared well to handle missing values, and if the errors don't follow a normal distribution, the validity of the model's inferences suffers.
There can be collinearity problems: if two or more independent variables are strongly correlated, they eat into each other's predictive power, as demonstrated later. The model will not automatically choose between highly collinear variables.
The model may become unreliable if a large number of variables is included. All variables included in the model appear in the equation, irrespective of their contribution to the overall prediction.
The model doesn't automatically capture nonlinearity. It's up to the modeler to add additional terms (for example, polynomial or interaction terms) that might be needed to improve the fit of the model.
Regression doesn't work directly with categorical variables that take multiple values. Such variables need to be converted to dummy (yes/no) variables before they are used in regression models.
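A minimal sketch of that conversion using pandas (the column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'season': ['summer', 'winter', 'summer', 'spring'],
    'rentals': [120, 40, 150, 90],
})

# One yes/no indicator column per category; drop_first avoids perfect collinearity
dummies = pd.get_dummies(df, columns=['season'], drop_first=True)
print(dummies.columns.tolist())
```

The resulting frame keeps the numeric 'rentals' column and replaces 'season' with indicator columns such as 'season_summer' and 'season_winter', which can then enter the regression directly.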
Linear regression is used when the desired output takes a continuous value based on whatever input is given to the algorithm. Suppose you want to write a program that predicts the average temperature of, say, tomorrow, based on certain features such as the average, minimum, and maximum temperatures of the past week. Since this problem needs a continuous value as output, it is a linear regression problem.
Linear regression is usually solved by minimizing the squared error of the model on the data, so large errors are penalized quadratically. Logistic regression behaves differently: its logistic loss function penalizes large errors by an amount that is asymptotically constant.
Now suppose your problem is not to output the temperature, but the type of weather that tomorrow might have (e.g., sunny, cloudy, stormy, rainy). This problem produces an output belonging to a predefined set of values, so it is basically classifying the output into categories. Classification problems can be either binary (yes/no, 0/1) or multiclass (like the problem just described). Logistic regression is used for classification problems in machine learning.
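A minimal sketch of a logistic-regression classifier on a made-up binary problem (classifying days as "hot" or "cold" from a single temperature feature), using scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: label 0 = cold day, label 1 = hot day
temps = np.array([[2.0], [5.0], [8.0], [22.0], [25.0], [30.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(temps, labels)

# Predict the class for two new temperatures
print(clf.predict([[4.0], [28.0]]))  # expected: [0 1]
```

Unlike LinearRegression, the output here is a discrete class label rather than a continuous value; `predict_proba` would give the underlying class probabilities instead.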