By Lionel Hertzog at Rbloggers, DataScience+
PhD student at the Technical University of Munich (DE). As data and stats are everywhere nowadays, Lionel finds that developing statistical literacy is of great interest for any scientist.
Before going into complex model building, looking at how your variables relate to one another is a sensible step towards understanding your data. Correlation looks at trends shared between two variables, while regression looks at the relation between a predictor (independent variable) and a response (dependent) variable.
As mentioned above, correlation looks at the global movement shared between two variables: when one variable increases and the other increases as well, the two variables are said to be positively correlated; when one increases and the other decreases, they are negatively correlated. In the case of no correlation, no pattern is seen between the two variables.
Let’s look at some code before introducing correlation measures:
x <- sample(1:20, 20) + rnorm(10, sd = 2)      # 20 values with added noise (the length-10 noise vector is recycled)
y <- x + rnorm(10, sd = 3)                     # y is built from x, so the two should be correlated
z <- (sample(1:20, 20)/2) + rnorm(20, sd = 5)  # an independent, noisier variable
df <- data.frame(x, y, z)
plot(df[, 1:3])                                # pairwise scatterplots
Here is the plot:
From the plot we see that when we plot the variable y against x, the points form some kind of line: when the value of x gets bigger, the value of y gets proportionally bigger too, so we can suspect a positive correlation between x and y.
The measure of this correlation is called the coefficient of correlation, and it can be calculated in different ways. The most usual measure is the Pearson coefficient: the covariance of the two variables divided by the product of their standard deviations. It ranges from 1 (a perfect positive correlation) to -1 (a perfect negative correlation), with 0 indicating no correlation. We can get the Pearson coefficient of correlation using the function cor():
cor(df, method = "pearson")
           x          y          z
x  1.0000000  0.8736874 -0.2485967
y  0.8736874  1.0000000 -0.2376243
z -0.2485967 -0.2376243  1.0000000

cor(df[, 1:3], method = "spearman")
           x          y          z
x  1.0000000  0.8887218 -0.3323308
y  0.8887218  1.0000000 -0.2992481
z -0.3323308 -0.2992481  1.0000000
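To make the definition above concrete, here is a minimal sketch computing the Pearson coefficient "by hand" on the df built earlier (no random seed was set, so your exact values will differ from the output shown):

cov(df$x, df$y) / (sd(df$x) * sd(df$y))   # covariance divided by the product of the standard deviations
cor(df$x, df$y)                           # should give exactly the same value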
From these outputs our suspicion is confirmed: x and y have a high positive correlation. But, as always in statistics, we can test whether this coefficient is significant, either under parametric assumptions (Pearson: dividing the coefficient by its standard error gives a value that follows a t-distribution) or, when the data violate parametric assumptions, using the Spearman rank coefficient.
cor.test(df$x, df$y, method = "pearson")

	Pearson's product-moment correlation

data:  df$x and df$y
t = 7.6194, df = 18, p-value = 4.872e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7029411 0.9492172
sample estimates:
      cor 
0.8736874 

cor.test(df$x, df$y, method = "spearman")

	Spearman's rank correlation rho

data:  df$x and df$y
S = 148, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.8887218 
A useful extension of the Pearson coefficient of correlation is that squaring it gives the proportion of variation in y explained by x (this is not true for the Spearman rank-based coefficient, where squaring has no statistical meaning). In our case around 76% of the variance in y is explained by x.
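As a quick sketch of this point, squaring the Pearson coefficient in R gives that proportion directly:

cor(df$x, df$y, method = "pearson")^2   # proportion of variance in y explained by x, ~0.76 here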
However, such results do not allow any explanation of the effect of x on y: x could act on y in various ways that are not always direct. All we can say from the correlation is that these two variables are linked somehow. To really explain and measure the effect of x on y we need to use regression methods, which come next.
Regression is different from correlation because it tries to put the variables into an equation and thus explain the relationship between them. The simplest linear equation is written Y = aX + b: for every change of one unit in X, the value of Y changes by a. Because we are trying to explain natural processes by equations that represent only part of the whole picture, we are actually building a model, which is why linear regression is also called linear modelling.
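As an illustration of what the coefficients a and b mean (a hypothetical sketch, not part of the original example), we can simulate data from a known line and let lm() recover them:

set.seed(1)                             # for reproducibility of this sketch
X <- runif(100, 0, 10)                  # predictor
Y <- 2 * X + 5 + rnorm(100, sd = 1)     # true slope a = 2, intercept b = 5, plus noise
coef(lm(Y ~ X))                         # estimates should be close to 5 and 2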
In R we can build and test the significance of linear models.
m1 <- lm(mpg ~ cyl, data = mtcars)
summary(m1)

Call:
lm(formula = mpg ~ cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9814 -2.1185  0.2217  1.0717  7.5186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared:  0.7262,	Adjusted R-squared:  0.7171 
F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
The basic function to build a linear model (linear regression) in R is lm(): you provide it a formula of the form y~x and, optionally, a data argument.
Using the summary() function we get all the information about our model: the formula called, the distribution of the residuals (the errors of our model), the values of the coefficients and their significance, plus an indication of the overall model performance with the adjusted R-squared (0.71 in our case), which represents the proportion of variation in y explained by x. So 71% of the variation in ‘mpg’ can be explained by the variable ‘cyl’.
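If you need these numbers later on, they can also be pulled out of the model object directly (a small sketch using standard extractor functions):

coef(m1)                    # intercept and slope estimates
confint(m1)                 # 95% confidence intervals for the coefficients
summary(m1)$adj.r.squared   # adjusted R-squared, about 0.71 here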
Before shouting ‘Eureka’ we should first check that the model assumptions are met. Linear models make a few assumptions about your data: the residuals should be normally distributed, the variance in y should be homogeneous over all x values (sometimes called homoscedasticity), and the observations should be independent, meaning that a y value at a certain x value should not influence other y values.
There is a marvelous built-in method to check all this for linear models:
par(mfrow = c(2, 2))
plot(m1)
The call to par() is just there to put all four graphs in one window; the plot() function does the real work.
Here is the plot:
The graphs in the first column look at variance homogeneity, among other things; normally you should see no pattern in the dots, just a random cloud of points. In this example this is clearly not the case: the spread of the dots is not constant across the fitted values, so our homogeneity assumption is violated and we have to go back to the beginning and build a new model; this one cannot be interpreted… Sorry m1, you looked so great…
For the record, the graph at the top right checks the normality assumption: if the residuals are normally distributed the points should fall (more or less) on a straight line. In this case they look normal.
The final graph shows how much each observation influences the model: each point is removed in turn and the refitted model is compared to the one including that point; if the point is very influential it will have a high leverage value. Points with too high a leverage value should be removed from the dataset to remove their outlying effect on the model.
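If you want a formal complement to these visual checks, a couple of standard base-R tests can be run as well (a sketch; they complement rather than replace the diagnostic plots):

shapiro.test(residuals(m1))                      # Shapiro-Wilk test for normality of the residuals
bartlett.test(mpg ~ factor(cyl), data = mtcars)  # homogeneity of variance across the cyl groups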
There are a few basic mathematical transformations that can be applied to non-normal or heterogeneous data; usually it is a trial-and-error process:
mtcars$Mmpg <- log(mtcars$mpg)
plot(Mmpg ~ cyl, mtcars)
Here is the plot we get:
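Other transformations can be tried in exactly the same way (a sketch; which one works best depends entirely on your data):

par(mfrow = c(1, 3))
plot(sqrt(mpg) ~ cyl, mtcars, main = "square root")
plot(log(mpg) ~ cyl, mtcars, main = "log")
plot(1/mpg ~ cyl, mtcars, main = "inverse")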
In our case the log transformation looks OK, but we can still remove the two outliers in the 8-cylinder category of ‘cyl’:
n <- rownames(mtcars)[mtcars$Mmpg != min(mtcars$Mmpg[mtcars$cyl == 8])]
mtcars2 <- subset(mtcars, rownames(mtcars) %in% n)
The first line asks for the row names of ‘mtcars’ (rownames(mtcars)) but only returns the ones where the value of the variable ‘Mmpg’ is not equal (!=) to the minimum value of ‘Mmpg’ within the 8-cylinder category. The vector ‘n’ then contains all these row names, and the next step builds a new data frame that only contains rows whose names are present in ‘n’.
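An equivalent and perhaps more readable way to do this (a sketch using a logical filter instead of row names; it keeps every row except the 8-cylinder cars with the minimum Mmpg):

mtcars2 <- mtcars[!(mtcars$cyl == 8 & mtcars$Mmpg == min(mtcars$Mmpg[mtcars$cyl == 8])), ]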
At this stage of transforming the data and removing outliers you should use and abuse plots to help you through the process.
Now let’s look back at our bivariate linear regression model, fitted to this new dataset:
model <- lm(Mmpg ~ cyl, mtcars2)
summary(model)

Call:
lm(formula = Mmpg ~ cyl, data = mtcars2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.19859 -0.08576 -0.01887  0.05354  0.26143 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.77183    0.08328  45.292  < 2e-16 ***
cyl         -0.12746    0.01319  -9.664 2.04e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1264 on 28 degrees of freedom
Multiple R-squared:  0.7693,	Adjusted R-squared:  0.7611 
F-statistic: 93.39 on 1 and 28 DF,  p-value: 2.036e-10

plot(model)
Here is the plot for the model:
Again we have a highly significant intercept and slope; the model explains 76% of the variance in log(mpg) and is significant overall. Now, we biologists are trained to love and worship the ANOVA table; in R there are several ways to get one (as always, an easy and straightforward way and another with more possibilities for tuning):
anova(model)
Analysis of Variance Table

Response: Mmpg
          Df  Sum Sq Mean Sq F value    Pr(>F)    
cyl        1 1.49252 1.49252  93.393 2.036e-10 ***
Residuals 28 0.44747 0.01598                      

library(car)
Loading required package: MASS
Loading required package: nnet

Anova(model)
Anova Table (Type II tests)

Response: Mmpg
           Sum Sq Df F value    Pr(>F)    
cyl       1.49252  1  93.393 2.036e-10 ***
Residuals 0.44747 28                      
The second function, Anova(), allows you to define which type of sum of squares you want to calculate (here is a nice explanation of their different assumptions) and also to correct for variance heterogeneity:
Anova(model, white.adjust = TRUE)
Analysis of Deviance Table (Type II tests)

Response: Mmpg
          Df      F    Pr(>F)    
cyl        1 69.328 4.649e-09 ***
Residuals 28                     
You will have noticed that the p-value is a bit higher. This function is very useful for unbalanced datasets (which is our case), but you need to take care when formulating the model, especially when there is more than one predictor variable, since the type II sum of squares assumes that there is no interaction between the predictors.
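To make that caveat concrete, here is a sketch with a second predictor (wt, added purely for illustration): type II sums of squares are fine for the additive model, while a model including an interaction is better tested with type III (or by comparing models):

m2 <- lm(Mmpg ~ cyl + wt, data = mtcars2)   # additive model: type II is appropriate
Anova(m2, type = "II")
m3 <- lm(Mmpg ~ cyl * wt, data = mtcars2)   # model including the cyl:wt interaction
Anova(m3, type = "III")                     # each term tested after all the others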
To sum up, correlation is a nice first step in data exploration before going into more serious analysis, and a way to select variables that might be of interest (it also produces sexy, easy-to-interpret graphs that will make your supervisor happy). The next step is to model the relationship between the variables; the most basic models are bivariate linear regressions, which put the relation between the response variable and the predictor variable into an equation and test it using the summary() and anova() functions. Since linear regression makes several assumptions about the data, before interpreting the results of the model you should use the plot() function to check that the residuals are normally distributed and the variance is homogeneous (no pattern in the residuals~fitted values plot), and remove outliers when necessary.
Next step will be using multiple predictors and looking at generalized linear models.
Great post, very easy to understand.
There is one thing that bothered me a little, though.
It's always good to check the assumptions and make sure everything checks out, but it's not that crucial in some cases.
You only really need homoscedasticity and normality when you want a model for inference. See, sometimes the normality issue is not even a big deal thanks to the 'pretty good' robustness of the Wald t statistic (the homoscedasticity is a lot more important). You can also even just bootstrap confidence intervals and standard errors.
For predictive purposes you can use a linear model that doesn't meet any of the assumptions and still get pretty good results with it. Of course a lot of classical results won't work.
Thanks for your comment.
In my area (experimental ecology), models are always used for inference purposes. I guess that for predictive purposes you would only care about the parameters, so the distributional assumptions are not so important. The linearity assumption will, however, be very important: fitting a straight line to a quadratic curve would make little sense, so I guess you would have to check the predictive ability of your linear regression. I would actually argue that in most cases we would not expect linear relations but rather more complex functions, leading to the use of more complex model families.
I think what cosmicmessenger is trying to point out is the fact that a linear model is very often a sufficient or quite good approximation of dependencies.
Imagine e.g. a quadratic curve x^2, but with data only gathered in the interval x = [2, 4]. A linear approximation to the parabola works quite well there; although there will be a systematic bias, it is possibly very small compared to the "real" error. In fact it turns out to be a question of how many "turns" your function makes in the area you want to model.
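(A quick sketch of this point in R, assuming evenly spaced x values: a straight line fitted to x^2 on [2, 4] leaves only a small systematic residual.)

xs <- seq(2, 4, length.out = 50)
ys <- xs^2
fit <- lm(ys ~ xs)
summary(fit)$r.squared    # very close to 1 on this short interval
max(abs(residuals(fit)))  # the largest systematic deviation from the straight line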
Also: Good thing you removed the phrase "causal". There is a directional idea behind regression (which can easily be seen checking how the residuals are calculated), but direct causality is only one possible explanation for suspected dependencies.
"Regression is different from correlation because it try to put
variables into equation and thus explain causal relationship between them, ..."
Sorry, but this is clearly wrong. First, a simple regression (Y = a + b*X) yields basically the same parameters as a correlation between the two variables as both procedures model a linear relationship. Second, what if you change your variables, i.e. X = a + b*Y. Do you magically switch the direction of causality?
Thanks for your critical comment.
Parameters from a linear regression only make sense if they can be linked to some hypotheses/ideas about how the variables are linked. Switching the response and explanatory variables will not switch the direction of causality: the mechanisms that generated your data only work in one way. It is then at the discretion of the analyst to find what makes sense. R will fit all possible models, but only a few will be relevant.
Correlation is different from linear regression in that there is no directionality in correlation coefficients, while linear regression basically says: "X causes changes in Y". Even if the parameters yielded are very similar (a strong coefficient of correlation will certainly yield large regression coefficients), regression clearly puts a directionality into the relationship between the two variables.
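(A quick sketch of this point using the simulated df from the top of the post: the two regressions give different slopes, both tied to the same correlation coefficient.)

coef(lm(y ~ x, data = df))["x"]   # slope of y on x, equal to r * sd(y)/sd(x)
coef(lm(x ~ y, data = df))["y"]   # slope of x on y, equal to r * sd(x)/sd(y)
cor(df$x, df$y)^2                 # the product of the two slopes equals r squared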
Thank you for your reply, Lionel.
You are right, saying that the analyst has to find what makes sense. Nevertheless, I think it is dangerous, especially for students who stumble upon your post, to read a claim like "This statistical method implies causal relationships between your variables."
The statistical method has nothing to do with causality as it is mere interpretation by the analyst or scientist.
Let me further elaborate my claim regarding the similarity of correlation and a simple linear regression (only one predictor): every correlation between two variables can be modeled as a linear regression. The decision on one of the two variables as the "predictor" and the other as the dependent variable is purely theoretical. And the fit of your model cannot say whether your assumption is true or false.
You are right, I do get your concerns that people might be over-confident in their interpretation of simple regression models. I removed the word "causal" from the post.
In the end I do think that regression analysis enables us to get a glimpse of the processes (causes) generating our data. I am, however, aware that the word "causal" is a very strong word with some deep philosophical issues bound to it (i.e. is our data ever going to be able to reflect the processes that we want to study?).