## Introduction to linear regression analysis

http://people.duke.edu/~rnau/regintro.htm
Linear regression analysis is the most widely used of all statistical techniques: it is the study of linear, additive relationships between variables. Let Y denote the "dependent" variable whose values you wish to predict, and let X_1, …, X_k denote the "independent" variables from which you wish to predict it. Then the predicted value of Y is a linear function of the X's:

Ŷ = b_0 + b_1X_1 + b_2X_2 + … + b_kX_k

This formula has the property that the prediction for Y is a straight-line function of each of the X variables, holding the others fixed, and the contributions of different X variables to the predictions are additive. The slopes of their individual straight-line relationships with Y are the constants b_1, b_2, …, b_k, the so-called coefficients of the variables. That is, b_i is the change in the predicted value of Y per unit of change in X_i, other things being equal. The additional constant b_0, the so-called intercept, is the prediction that the model would make if all the X's were zero (if that is possible). The coefficients and intercept are estimated by least squares, i.e., setting them equal to the unique values that minimize the sum of squared errors within the sample of data to which the model is fitted. And the model's prediction errors are typically assumed to be independently and identically normally distributed.

The first thing you ought to know about linear regression is how the strange term regression came to be applied to models like these. They were first studied in depth by Sir Francis Galton, a 19th-century scientist. Galton was a pioneer in the application of statistical methods to measurements in many branches of science, and in studying data on relative sizes of parents and their offspring in various species of plants and animals, he observed the following phenomenon: a larger-than-average parent tends to produce a larger-than-average child, but the child is likely to be less large than the parent in relative terms, i.e., closer to the average. (On the chart in the original article, which is not reproduced here, the symbol R, whose value is 0.33, denotes the slope coefficient, not the correlation, although the two are the same if both populations have the same standard deviation, as will be shown below.)
Galton termed this phenomenon a "regression towards mediocrity," and the name has stuck. The effect is statistical rather than biological: unless the child is expected to be exactly the same size as the parent in relative terms (i.e., unless the correlation is exactly equal to 1), the predictions must regress to the mean regardless of biology if mean squared error is to be minimized.

Regression to the mean is an inescapable fact of life. Your children can be expected to be less exceptional than you are, for better or worse. We have already seen a suggestion of regression-to-the-mean in some of the time series forecasting models we have studied: plots of forecasts tend to be smoother, i.e., less variable, than plots of the original data. The intuitive explanation for the regression effect is simple: the thing we are trying to predict usually consists of a predictable component ("signal") and a statistically independent unpredictable component ("noise"). Another way to think of the regression effect is in terms of that noise: an extreme observation typically contains an extreme dose of noise, and since the noise does not repeat itself, the best prediction for the next observation lies closer to the mean. A nice discussion of regression to the mean in the broader context of social science research can be found here.
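As a concrete illustration of least-squares estimation, here is a minimal sketch in Python using NumPy. The data are synthetic and the "true" coefficients (1.0, 2.0, −0.5) are assumptions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Assumed "true" model for illustration: Y = 1.0 + 2.0*X1 - 0.5*X2 + noise
Y = 1.0 + 2.0 * X1 - 0.5 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix: a column of ones (for the intercept b0) plus the X's
A = np.column_stack([np.ones(n), X1, X2])
# Least squares: the unique coefficients minimizing the sum of squared errors
coefs, *_ = np.linalg.lstsq(A, Y, rcond=None)
b0, b1, b2 = coefs
```

With this much data and little noise, the estimates land close to the assumed true values, which is exactly what the least-squares criterion is designed to achieve.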
Why should we assume that relationships between variables are linear?

- Because linear relationships are the *simplest non-trivial relationships* that can be imagined (hence the easiest to work with), and...
- Because the "true" relationships between our variables are often at least *approximately* linear over the range of values that are of interest to us, and...
- Even if they're not, we can often *transform* the variables in such a way as to linearize the relationships.
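As an illustration of the third point, taking logarithms turns a multiplicative power-law relationship y = a·x^b into the straight line log y = log a + b·log x, which ordinary linear regression can fit. A minimal sketch, assuming NumPy and synthetic data (the power law y = 3x² is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 100)
# Assumed "true" relationship: a power law y = 3 * x**2, with small
# multiplicative noise (so the noise is additive on the log scale)
y = 3.0 * x**2 * np.exp(rng.normal(scale=0.01, size=x.size))

# On log-log axes the relationship is linear: fit a straight line
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
```

The fitted slope recovers the exponent b ≈ 2 and exp(intercept) recovers the multiplier a ≈ 3, so a nonlinear relationship has been handled with purely linear machinery.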
This is a strong assumption, and the first step in regression modeling should be to look at scatterplots of the variables (and in the case of time series data, plots of the variables vs. time), to make sure it is reasonable a priori. And after fitting a model, plots of the errors should be studied to see if there are unexplained nonlinear patterns. This is especially important when the goal is to make predictions for scenarios outside the range of the historical data, where departures from perfect linearity are likely to have the biggest effect. If you see evidence of nonlinear relationships, it is possible (though not guaranteed) that transformations of variables will straighten them out in a way that will yield useful inferences and predictions via linear regression.
And why should we assume that the effects of different independent variables on the expected value of the dependent variable are additive? This is a very strong assumption, stronger than most people realize. It implies that the marginal effect of one independent variable (i.e., its slope coefficient) does not depend on the current values of other independent variables. But… why shouldn’t it? It’s conceivable that one independent variable could amplify the effect of another, or that its effect might vary systematically over time. In a multiple regression model, the estimated coefficient of a given independent variable supposedly measures its effect while "controlling" for the presence of the others. However, the way in which controlling is performed is extremely simplistic: multiples of other variables are merely added or subtracted. Many users just throw a lot of independent variables into the model without thinking carefully about this issue, as if their software will automatically figure out exactly how they are related. It won’t! Even automatic model-selection methods (e.g., stepwise regression) require you to have a good understanding of your own data and to use a guiding hand in the analysis. They work only with the variables they are given, in the form that they are given, and then they look only for linear, additive patterns among them in the context of each other.
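To see how badly a purely additive model can fail when one variable amplifies another, here is a hedged sketch (NumPy, synthetic data chosen for illustration) in which Y depends on the *product* of X1 and X2. The additive model explains almost none of the variance; adding the interaction term explains almost all of it:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Assumed "true" model: pure interaction, nothing additive
Y = X1 * X2 + rng.normal(scale=0.1, size=n)

def r_squared(design, y):
    """Fraction of variance explained by a least-squares fit."""
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    return 1.0 - resid.var() / y.var()

ones = np.ones(n)
r2_additive = r_squared(np.column_stack([ones, X1, X2]), Y)
r2_interact = r_squared(np.column_stack([ones, X1, X2, X1 * X2]), Y)
```

The moral matches the text: the software will not discover the interaction on its own; you must supply the product term (or a suitable transformation) yourself.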
And why should we assume the errors of linear models are independently and identically normally distributed?

1. This assumption is often justified by appeal to the Central Limit Theorem of statistics, which says that a sum of many independent random variables is approximately normally distributed regardless of their individual distributions; an error term that is the net effect of many small omitted influences is therefore plausibly normal.
2. It is (again) mathematically convenient: it implies that the optimal coefficient estimates for a linear model are those that minimize the sum of squared errors, which are easy to compute, and it justifies the usual formulas for standard errors, confidence intervals, and significance tests.
3. Even if the "true" error process is not normal in terms of the original units of the data, it may be possible to transform the data so that your model's prediction errors are approximately normal.

But here too caution must be exercised. Even if the unexplained variations in the dependent variable are approximately normally distributed, it is not guaranteed that they will also be identically distributed for all values of the independent variables: their variance may, for example, grow with the level of the variables (heteroscedasticity). A very important special case is that of time series data, in which the errors in adjacent periods may be correlated with one another rather than independent.

Because the assumptions of linear regression (linear, additive relationships with i.i.d. normally distributed errors) are so strong, it is very important to test their validity when fitting models, a topic discussed in more detail on the testing-model-assumptions page, and to be alert to the possibility that you may need more or better data to accomplish your objectives. You can’t get something from nothing. All too often, naïve users of regression analysis view it as a black box that can automatically predict any given variable from any other variables that are fed into it, when in fact a regression model is a very special and very transparent kind of prediction box. Its output contains no more information than is provided by its inputs, and its inner mechanism needs to be compared with reality in each situation where it is applied.
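Two of the simplest residual checks (do the errors average to zero, and is their spread roughly constant across the range of X?) can be sketched as follows, again with NumPy and synthetic data; the "true" line and noise level are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 400)
# Assumed "true" model: y = 2 + 0.5x with constant-variance noise
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line by least squares and form the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Check 1: least-squares residuals (with an intercept) average to zero
mean_resid = residuals.mean()
# Check 2: compare residual spread in the lower vs. upper half of x;
# a ratio far from 1 would suggest heteroscedasticity
lo = residuals[: x.size // 2].std()
hi = residuals[x.size // 2 :].std()
spread_ratio = hi / lo
```

In practice one would also plot the residuals against the fitted values and against time, but these two numbers already catch gross violations of the i.i.d. assumption.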
A convenient starting point for simple regression is the notion of variance and correlation. In particular, when fitting a model for predicting one variable Y from a single variable X, a measure of the absolute amount of variability in a variable is (naturally) its variance, and its square root, the standard deviation, measures variability in the variable's own units. Our task in predicting Y might be described as that of explaining some or all of its variance, i.e., finding a prediction rule whose errors have a smaller variance than Y itself.

The correlation coefficient is most easily computed if we first standardize the variables, converting each to units of standard-deviations-from-the-mean:

X* = (X − X̄)/σ_X,  Y* = (Y − Ȳ)/σ_Y

...where X̄ and Ȳ are the means and σ_X and σ_Y are the population standard deviations. Now, the correlation between X and Y is simply the average product of their standardized values:

r_XY = AVERAGE(X*·Y*)

Thus, for example, if X and Y are stored in columns on a spreadsheet, you can use the AVERAGE and STDEV.P functions to compute their averages and population standard deviations, then you can create two new columns in which the values of X and Y are standardized, and then average the products of the standardized values. If the two variables tend to vary on the same sides of their respective means at the same time, the products are mostly positive and so is the correlation; if they tend to vary on opposite sides, the correlation is negative.

The correlation coefficient can be said to measure the strength of the linear relationship between X and Y. Thus, if X is observed to be 1 standard deviation above its own mean, then we should predict that Y will be r_XY standard deviations above its own mean. In graphical terms, this means that, on a plot of Y* versus X*, the line for predicting Y* from X* so as to minimize mean squared error is the line that passes through the origin and has slope r_XY. This fact is not supposed to be obvious, but it is easily proved by elementary differential calculus.

If we want to obtain the linear regression equation for predicting Y from X in unstandardized units, we substitute the definitions of X* and Y* into the prediction equation Ŷ* = r_XY·X*. By rearranging this equation and collecting constant terms, we obtain Ŷ = b_0 + b_1X, where:

b_1 = r_XY·(σ_Y/σ_X)

is the estimated slope of the regression line, and

b_0 = Ȳ − b_1·X̄

is the estimated Y-intercept of the line. Notice that, as we claimed earlier, the coefficients in the linear equation for predicting Y from X depend only on the means and standard deviations of X and Y and on their coefficient of correlation.
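The spreadsheet recipe above translates directly into code. A minimal sketch (NumPy, synthetic data; the "true" slope 0.6 is an assumption for illustration) that standardizes X and Y, averages the products, and recovers the slope and intercept formulas:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=1000)
y = 0.6 * x + rng.normal(scale=0.8, size=x.size)

# Standardize: subtract the mean, divide by the *population* standard
# deviation (NumPy's default ddof=0 matches the spreadsheet STDEV.P)
x_star = (x - x.mean()) / x.std()
y_star = (y - y.mean()) / y.std()

# Correlation = average product of the standardized values
r_xy = (x_star * y_star).mean()

# Slope and intercept for predicting Y from X in original units
b1 = r_xy * (y.std() / x.std())
b0 = y.mean() - b1 * x.mean()
```

The values of b1 and b0 computed this way agree with what a least-squares line fit produces, confirming that the coefficients depend only on the two means, the two standard deviations, and the correlation.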
In general we find less-than-perfect correlation, which is to say, we find that r_XY is less than 1 in absolute value. So, the technical explanation of the regression-to-the-mean effect hinges on two mathematical facts: (i) the correlation coefficient, calculated in the manner described above, happens to be the coefficient that minimizes the squared error in predicting Y* from X*; and (ii) because it is less than 1 in magnitude, the predicted value of Y* is always closer to zero (the mean) than the observed value of X*.

The term "regression" has stuck and has even mutated from an intransitive verb into a transitive one since Galton's time. We don't merely say that the predictions for Y "regress to the mean"--we now say that we are "regressing Y on X" when we estimate a linear equation for predicting Y from X, and we refer to X as a "regressor" in this case. When we have fitted a linear regression model, we can compute the variance of its errors and compare this to the variance of the dependent variable (the latter being the error variance of an intercept-only model). The relative amount by which the regression model's error variance is less than the variance of the dependent variable is referred to as R-squared. It turns out that in a simple regression model, R-squared is simply the square of the correlation coefficient between X and Y.

## Blog: Regression Analysis Tutorial and Examples

By Jim
http://blog.minitab.com/blog/adventures-in-statistics/regression-analysis-tutorial-and-examples

### Why Choose Regression and the Hallmarks of a Good Regression Analysis

Before we begin the regression analysis tutorial, there are several important questions to answer. Why should we choose regression at all? What are the common mistakes that even experts make when it comes to regression analysis? And, how do you distinguish a good regression analysis from a less rigorous regression analysis? Read these posts to find out:

- Tribute to Regression Analysis: See why regression is my favorite! Sure, regression generates an equation that describes the relationship between one or more predictor variables and the response variable. But, there’s much more to it than just that.
- Four Tips on How to Perform a Regression Analysis that Avoids Common Problems: Keep these tips in mind throughout all stages of this tutorial to ensure a top-quality regression analysis.
- Sample Size Guidelines: These guidelines help ensure that you have sufficient power to detect a relationship and provide a reasonably precise estimate of the strength of that relationship.

### Tutorial: How to Choose the Correct Type of Regression Analysis

Minitab statistical software provides a number of different types of regression analysis. Choosing the correct type depends on the characteristics of your data, as the following posts explain.

- Giving Thanks for the Regression Menu: Patrick Runkel goes through the regression choices using a yummy Thanksgiving context!
- Linear or Nonlinear Regression: How to determine when you should use one or the other.
- What is the Difference between Linear and Nonlinear Equations: Both types of equations can model curvature, so what is the difference between them?

### Tutorial: How to Specify Your Regression Model

Choosing the correct type of regression analysis is just the first step in this regression tutorial. Next, you need to specify the model. Model specification consists of determining which predictor variables to include in the model and whether you need to model curvature and interactions between predictor variables. Specifying a regression model is an iterative process. The interpretation and assumption verification sections of this regression tutorial show you how to confirm that you’ve specified the model correctly and how to adjust your model based on the results.

- How to Choose the Best Regression Model: I review some common statistical methods, complications you may face, and provide some practical advice.
- Stepwise and Best Subsets Regression: Minitab provides two automatic tools that help identify useful predictors during the exploratory stages of model building.
- Curve Fitting with Linear and Nonlinear Regression: Sometimes your data just don’t follow a straight line and you need to fit a curved relationship.
- Interaction effects: Michelle Paret explains interactions using Ketchup and Soy Sauce.
- Proxy variables: Important variables can be difficult or impossible to measure, but omitting them from the regression model can produce invalid results. A proxy variable is an easily measurable variable that is used in place of a difficult variable.
- Overfitting the model: Overly complex models can produce misleading results. Learn about overfit models and how to detect and avoid them.
- Hierarchical models: I review reasons to fit, or not fit, a hierarchical model. A hierarchical model contains all lower-order terms that comprise the higher-order terms that also appear in the model.
- Standardizing the variables: In certain cases, standardizing the variables in your regression model can reveal statistically significant findings that you might otherwise miss.
- Five reasons why your R-squared can be too high: If you specify the wrong regression model, or use the wrong model fitting process, the R-squared can be too high.

### Tutorial: How to Interpret your Regression Results

So, you’ve chosen the correct type of regression and specified the model. Now, you want to interpret the results. The following topics in the regression tutorial show you how to interpret the results and effectively present them:

- Regression coefficients and p-values
- Regression Constant (Y intercept)
- How to statistically test the difference between regression slopes and constants
- R-squared and the goodness-of-fit
- How high should R-squared be?
- How to interpret a model with a low R-squared
- Adjusted R-squared and Predicted R-squared
- S, the standard error of the regression
- F-test of overall significance
- How to Compare Regression Slopes
- How to Present Your Regression Results to Avoid Costly Mistakes: Research shows that presentation affects the number of interpretation mistakes.
### Tutorial: How to Use Regression to Make Predictions

In addition to determining how the response variable changes when you change the values of the predictor variables, the other key benefit of regression is the ability to make predictions. In this part of the regression tutorial, I cover how to do just this.

- How to Predict with Minitab: A prediction guide that uses BMI to predict body fat percentage.
- Predicted R-squared: This statistic indicates how well a regression model predicts responses for new observations rather than just the original data set.
- Prediction intervals: See how presenting prediction intervals is better than presenting only the regression equation and predicted values.
- Prediction intervals versus other intervals: I compare prediction intervals to confidence and tolerance intervals so you’ll know when to use each type of interval.

### Tutorial: How to Check the Regression Assumptions and Fix Problems

Like any statistical test, regression analysis has assumptions that you should satisfy, or the results can be invalid. In regression analysis, the main way to check the assumptions is to assess the residual plots. The following posts in the tutorial show you how to do this and offer suggestions for how to fix problems.

- Residual plots: What they should look like and reasons why they might not!
- How important are normal residuals: If you have a large enough sample, nonnormal residuals may not be a problem.
- Multicollinearity: Highly correlated predictors can be a problem, but not always!
- Heteroscedasticity: You want the residuals to have a constant variance (homoscedasticity), but what if they don’t?
- Box-Cox transformation: If you can’t resolve the underlying problem, Cody Steele shows how easy it can be to transform the problem away!

### Examples of Different Types of Regression Analyses

The final part of the regression tutorial contains examples of the different types of regression analysis that Minitab can perform.
Many of these regression examples include the data sets so you can try it yourself!

- New Linear Model Features in Minitab 17
- Binary Logistic Regression: Predicts the winner of the 2012 U.S. Presidential election.
- Multiple regression with response optimization: Highlights new features added to the Assistant for Minitab 17.
- Linear Regression: Great Presidents by Patrick Runkel and my follow-up, Great Presidents Revisited.
- Linear regression with a double-log transformation: Examines the relationship between the size of mammals and their metabolic rate with a fitted line plot.
- Nonlinear regression: Kevin Rudy uses nonlinear regression to predict winning basketball teams.
- Orthogonal regression: Carly Barry shows how orthogonal regression (a.k.a. Deming Regression) can test the equivalence of different instruments.
- Partial least squares (PLS) regression: Cody Steele uses PLS to successfully analyze a very small and highly multicollinear data set.

You Might Also Like:

- Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?
- Linear or Nonlinear Regression? That Is the Question.
- Multiple Regression Analysis: Use Adjusted R-Squared and Predicted R-Squared to Include the Correct Number of Variables
- What Is the Difference between Linear and Nonlinear Equations in Regression Analysis?
