For each of these predictor examples, the researcher just observes the values as they occur for the people in the random sample. Multicollinearity happens more often than not in such observational studies.
And, unfortunately, regression analyses most often apply to data obtained from observational studies. If you aren't convinced, consider the example data sets for this course: most of them were obtained from observational studies, not experiments. It is for this reason that we need to fully understand the impact of multicollinearity on our regression analyses. In the case of structural multicollinearity, by contrast, the multicollinearity is induced by something you have done, such as creating a new predictor from an existing one (squaring a predictor, for example).
Data-based multicollinearity is the more troublesome of the two types of multicollinearity.
While the two predictors Vocab and Abstract are highly correlated with each other, the correlations among the remaining pairs of predictors do not appear to be particularly strong. Let's see what havoc this high correlation wreaks on our regression analysis. Yikes: the variance inflation factors for Vocab and Abstract are very large. What should we do about this?
We could opt to remove one of the two predictors from the model. Alternatively, if we have a good scientific reason for needing both of the predictors to remain in the model, we could go out and collect more data. Let's try this second approach here.
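Although the variance inflation factors above come from Minitab, the same check is easy to reproduce elsewhere. Here is a minimal Python sketch using statsmodels; the file name and the idea that Vocab and Abstract are the only predictor columns are illustrative assumptions, not the lesson's actual setup.

```python
# Minimal sketch: compute variance inflation factors for the predictors.
# File name and column names are hypothetical placeholders; add any other
# predictor columns your model actually uses.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = pd.read_csv("allentest.csv")
X = sm.add_constant(data[["Vocab", "Abstract"]])

# Skip the constant in column 0; report the VIF for each predictor.
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))
```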
For the sake of this example, let's imagine that we went out and collected more data, and in so doing, obtained the actual data collected on all 69 patients enrolled in the Allen Cognitive Level (ACL) Study. Consider a matrix plot of the resulting Allen Test data set and, again, focus only on the relationship between the two predictors Vocab and Abstract. The round data points in blue represent the 23 data points in the original data set, while the square red data points represent the 46 newly collected data points.
As you can see from the plot, collecting the additional data has expanded the "base" over which the "best fitting plane" will sit. The existence of this larger base allows less room for the plane to tilt from sample to sample, and thereby reduces the variance of the estimated slope coefficients. Let's see if the addition of the new data helps to reduce the multicollinearity here. Indeed, with the larger data set the variance inflation factors for Vocab and Abstract drop dramatically, so the researchers could now feel comfortable proceeding with drawing conclusions about the effects of the vocabulary and abstraction scores on the level of psychopathology.
One thing to keep in mind: in order to reduce the multicollinearity that exists, it is not sufficient to go out and just collect any ol' data. The data have to be collected in such a way as to ensure that the correlations among the violating predictors are actually reduced. That is, collecting more of the same kind of data won't help to reduce the multicollinearity.
The data have to be collected to ensure that the "base" is sufficiently enlarged. Doing so, of course, changes the characteristics of the studied population, and therefore should be reported accordingly.

Structural multicollinearity, on the other hand, typically arises when we create new predictors from existing ones, as in polynomial regression, where a predictor and its square appear in the same model. Because of this, at the same time that we learn here about reducing structural multicollinearity, we learn more about polynomial regression models.

What is the impact of exercise on the human immune system? In order to answer this very global and general research question, one has to first quantify what "exercise" means and what "immunity" means.
Of course, there are several ways of doing so. For example, we might quantify one's level of exercise by measuring his or her "maximal oxygen uptake." Because some researchers were interested in answering the above research question, they collected an Exercise and Immunity data set on a sample of 30 individuals.
In order to allow for the apparent curvature in the trend, rather than formulating a simple linear regression function, the researchers formulated a quadratic polynomial regression function.
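In generic notation, such a quadratic polynomial regression function takes the form

\[
y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i,
\]

where \(y_i\) is the immunity measure and \(x_i\) is the maximal oxygen uptake for individual \(i\). The symbols here are generic; the original lesson's notation may differ slightly.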
But now what do the estimated coefficients tell us? The interpretation of the regression coefficients is mostly geometric in nature; that is, the coefficients tell us a little bit about what the fitted curve looks like. So far, however, we have kept our head a little bit in the sand: the predictor oxygen and its square are themselves highly correlated. Is this surprising to you? If you think about it, we've created a correlation by taking the predictor oxygen and squaring it to obtain oxygensq.
That is, just by the nature of our model, we have created a "structural multicollinearity." The neat thing here is that we can reduce the multicollinearity in our data by doing what is known as "centering the predictors."
For example, Minitab reports the mean of the oxygen values in our data set. Therefore, in order to center the predictor oxygen, we merely subtract this mean from every oxygen value. Doing so, we obtain the centered predictor, oxcent, say. Now, in order to include the squared oxygen term in our regression model, to allow for curvature in the trend, we square the centered predictor oxcent to obtain oxcentsq. Having centered the predictor oxygen, we must reformulate our quadratic polynomial regression model accordingly.
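The same centering step is easy to carry out in other software. Here is a minimal Python sketch using statsmodels; the file name and the column names oxygen and igg are assumptions made for illustration, not the lesson's actual identifiers.

```python
# Minimal sketch: center a predictor, square it, and fit the quadratic model.
# The file name and the column names "oxygen" (predictor) and "igg" (response)
# are hypothetical placeholders for the Exercise and Immunity data.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("exercise_immunity.csv")
data["oxcent"] = data["oxygen"] - data["oxygen"].mean()  # centered predictor
data["oxcentsq"] = data["oxcent"] ** 2                   # its square

fit = smf.ols("igg ~ oxcent + oxcentsq", data=data).fit()
print(fit.summary())
```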
That is, we now formulate our model in terms of the centered predictor, as sketched below. Note that we add asterisks to each of the parameters in order to make it clear that the parameters differ from the parameters in the original model we formulated.
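In generic notation, the reformulated model is

\[
y_i = \beta_0^{*} + \beta_1^{*} x_i^{*} + \beta_{11}^{*} \left(x_i^{*}\right)^2 + \varepsilon_i,
\qquad x_i^{*} = x_i - \bar{x},
\]

where \(\bar{x}\) is the sample mean of the oxygen values; again, the lesson's own symbols may differ.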
Let's see how we did by centering the predictors and reformulating our model. Recall that, based on our original model, the variance inflation factors for oxygen and oxygensq were very large; with the centered predictors oxcent and oxcentsq, they are close to 1. Because we reformulated our model based on the centered predictors, the meaning of the parameters changes accordingly, so the estimated coefficients are now interpreted in terms of the centered predictor rather than the original one. We shouldn't be surprised to see that the estimates of the coefficients in our reformulated polynomial regression model are quite similar to the estimates of the coefficients for the simple linear regression model. The similarities in the estimates, of course, arise from the fact that the centered predictors are nearly uncorrelated, and therefore the estimates of the coefficients don't change all that much from model to model.
Now, you might be getting the sense that we're "mucking around with the data" in order to get an answer to our research questions. One way to convince you that we're not is to show you that the two estimated models are algebraically equivalent. That is, given one form of the estimated model, say the estimated model with the centered predictors, we can always recover the other form. In fact, it can be shown algebraically that the estimated coefficients of the original model can be written directly in terms of the estimated coefficients of the centered model, as sketched below.
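A sketch of that algebra, writing \(\bar{x}\) for the sample mean of oxygen and \(b_0^{*}, b_1^{*}, b_{11}^{*}\) for the estimated coefficients of the centered model: expanding the centered fit

\[
\hat{y} = b_0^{*} + b_1^{*}(x - \bar{x}) + b_{11}^{*}(x - \bar{x})^2
\]

and collecting terms in \(x\) gives

\[
b_0 = b_0^{*} - b_1^{*}\bar{x} + b_{11}^{*}\bar{x}^{2},
\qquad
b_1 = b_1^{*} - 2\, b_{11}^{*}\bar{x},
\qquad
b_{11} = b_{11}^{*}.
\]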
For example, applying these conversions to the estimated regression function for our reformulated model with centered predictors yields the estimated regression function for our quadratic polynomial model with the original uncentered predictors. Given the equivalence of the two estimated models, you might ask why we bother to center the predictors.
The main reason for centering to correct structural multicollinearity is that low levels of multicollinearity can be helpful in avoiding computational inaccuracies. The least squares estimates are computed by inverting a matrix built from the predictor values, and that inversion involves dividing by the matrix's determinant. Severe multicollinearity has the effect of making this determinant come close to zero.
Thus, under severe multicollinearity, the regression coefficients may be subject to large roundoff errors.
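To see where the determinant enters, recall the matrix form of the least squares estimates (standard notation, not specific to this lesson's output):

\[
\mathbf{b} = \left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}\mathbf{X}^{\prime}\mathbf{y},
\qquad
\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}
= \frac{1}{\det\!\left(\mathbf{X}^{\prime}\mathbf{X}\right)}\,
\operatorname{adj}\!\left(\mathbf{X}^{\prime}\mathbf{X}\right).
\]

When \(\det(\mathbf{X}^{\prime}\mathbf{X})\) is nearly zero, dividing by it amplifies small rounding errors in the computed coefficients.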
Of course, before we use our model to answer a research question, we should always evaluate it first to make sure it meets all of the necessary conditions. Here, the residuals versus fits plot suggests that the quadratic function adequately captures the trend; it also suggests that the variances of the error terms are equal. And the normal probability plot gives no serious reason to doubt that the error terms are normally distributed. Suppose we then ask Minitab to predict the immunity measure for an individual with a maximal oxygen uptake of 90. When asking Minitab to make this prediction, you have to remember that we have centered the predictors. Why does Minitab report that "XX denotes a row with very extreme X values"? A maximal oxygen uptake of 90 is way outside the scope of the model, that is, outside the range of oxygen values used to fit it, so Minitab provides such a warning.
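Continuing the earlier Python sketch (same assumed column names), the prediction would have to be expressed in terms of the centered predictor first:

```python
# Predict at oxygen = 90, remembering that the model was fit to the centered
# predictor: subtract the sample mean of oxygen before plugging in.
new_oxcent = 90 - data["oxygen"].mean()
new = pd.DataFrame({"oxcent": [new_oxcent], "oxcentsq": [new_oxcent ** 2]})
print(fit.get_prediction(new).summary_frame(alpha=0.05))

# If 90 lies well outside the range of oxygen values in the data, this is an
# extrapolation, which is what Minitab's "very extreme X values" flag warns about.
```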
A word of warning: be careful, because of changes in direction of the curve, there is an even greater danger in extrapolation when modeling data with a polynomial function.

Just one closing comment, since we've been discussing polynomial regression, to remind you about the "hierarchical approach" to model fitting. For example, suppose we formulate a cubic polynomial regression function. Then, to see if the simpler first-order model, a "line," is adequate in describing the trend in the data, we could test the null hypothesis that the second- and third-order coefficients are both zero, as sketched below.
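In generic notation, the cubic polynomial regression function and the corresponding null hypothesis can be written as

\[
y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \beta_{111} x_i^3 + \varepsilon_i,
\qquad
H_0\colon\ \beta_{11} = \beta_{111} = 0;
\]

the subscripting convention here is ours, and the lesson's own symbols may differ.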
But then, if a polynomial term of a given order is retained, all related lower-order terms are also retained. That is, if a quadratic term is deemed significant, then it is standard practice to keep the linear term in the regression function as well, rather than fitting a model containing only the squared term.

In this example, the observations are the 50 states of the United States (Poverty data; note that the data for the District of Columbia have been removed).
The two x-variables are correlated with each other, so we have multicollinearity. A plot of the two x-variables is given below. Both x-variables are linear predictors of the poverty percentage. Minitab results for the two possible simple regressions and the multiple regression are given below.
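For readers who want to reproduce this kind of comparison outside Minitab, here is a rough Python sketch; the file name and column names (poverty for the response, x1 and x2 for the two correlated predictors) are placeholders, not the data set's actual names.

```python
# Compare the two simple regressions with the multiple regression to see how
# coefficients and standard errors change when correlated predictors are
# included together. File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

states = pd.read_csv("poverty.csv")
print(states[["x1", "x2"]].corr())        # correlation between the two x-variables

for formula in ["poverty ~ x1", "poverty ~ x2", "poverty ~ x1 + x2"]:
    result = smf.ols(formula, data=states).fit()
    print(formula)
    print(result.params)                  # estimated coefficients
    print(result.bse)                     # their standard errors
```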
In general, it is dangerous to extrapolate beyond the scope of the model. The following example illustrates why this is not a good thing to do. The scope of the model, that is, the range of the x values used to fit it, was 0 to 5.
The researchers obtained an estimated regression equation from these data. Using it, they predicted the number of colonies at an x value well beyond this range. But when the researchers actually ran the experiment at that value, the observed number of colonies was nowhere near the predicted number. The moral of the story is that the trend in the data, as summarized by the estimated regression equation, does not necessarily hold outside the scope of the model.

Excessive nonconstant variance can create technical difficulties with a multiple linear regression model.
For example, if the residual variance increases with the fitted values, then prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values. Remedies for a model exhibiting excessive nonconstant variance include transforming the response variable or fitting the model by a method, such as weighted least squares, that allows the error variance to differ across observations.
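As a quick illustration of the weighted least squares idea (a sketch with synthetic data, not the lesson's own analysis), each observation is weighted by the reciprocal of its assumed error variance:

```python
# Weighted least squares as one remedy for nonconstant variance.
# Synthetic data whose error spread grows with x illustrate the idea.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=100)
y = 2 + 3 * x + rng.normal(scale=0.5 * x)     # error SD proportional to x

df = pd.DataFrame({"x": x, "y": y})
ols_fit = smf.ols("y ~ x", data=df).fit()
# Assume Var(error_i) is proportional to x_i**2, so weight by 1 / x_i**2.
wls_fit = smf.wls("y ~ x", data=df, weights=1.0 / df["x"] ** 2).fit()
print(ols_fit.bse)                            # OLS standard errors
print(wls_fit.bse)                            # WLS standard errors
```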
One common way for the "independence" condition in a multiple linear regression model to fail is when the sample data have been collected over time and the regression model fails to effectively capture any time trends.
In such a circumstance, the random errors in the model are often positively correlated over time, so that each random error is more likely to be similar to the previous random error than it would be if the random errors were independent of one another. This phenomenon is known as autocorrelation or serial correlation and can sometimes be detected by plotting the model residuals versus time.
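One quick way to look for this pattern when the observations are in time order is to plot the residuals against time and compute the Durbin-Watson statistic; here is a small sketch (the fitted-results object is whatever model you have already fit with statsmodels):

```python
# Check residuals for serial correlation: plot them in time order and compute
# the Durbin-Watson statistic (values well below 2 suggest positive
# autocorrelation). "results" is any fitted statsmodels OLS results object
# whose rows are in time order.
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

def check_autocorrelation(results):
    resid = results.resid
    plt.plot(resid.values, marker="o")
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("time order")
    plt.ylabel("residual")
    plt.show()
    print("Durbin-Watson statistic:", durbin_watson(resid))
```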
We'll explore this further in a later lesson.

When building a regression model, we don't want to include unimportant or irrelevant predictors, whose presence can overcomplicate the model and increase our uncertainty about the magnitudes of the effects for the important predictors, particularly if some of those predictors are highly collinear.
Such "overfitting" can occur the more complicated a model becomes and the more predictor variables, transformations, and interactions are added to a model. It is always prudent to apply a sanity check to any model being used to make decisions. Models should always make sense, preferably grounded in some kind of background theory or sensible expectation about the types of associations allowed between variables.
Predictions from the model should also be reasonable; over-complicated models can give quirky results that may not reflect reality. However, there is potentially greater risk from excluding important predictors than from including unimportant ones. The linear association between two variables, ignoring other relevant variables, can differ both in magnitude and direction from the association that controls for other relevant variables. Whereas the potential cost of including unimportant predictors might be increased difficulty with interpretation and reduced prediction accuracy, the potential cost of excluding important predictors can be a completely meaningless model containing misleading associations.
Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. Multicollinearity is a problem because it inflates the standard errors of the affected coefficients and thereby undermines the statistical significance of an independent variable. Other things being equal, the larger the standard error of a regression coefficient, the less likely it is that this coefficient will be statistically significant.
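One standard way to quantify this inflation is the variance inflation factor. Writing \(R_k^2\) for the \(R^2\) obtained by regressing the \(k\)-th predictor on all of the other predictors (this notation is ours, not necessarily the text's),

\[
\mathrm{VIF}_k = \frac{1}{1 - R_k^{2}},
\]

and the sampling variance of the \(k\)-th estimated coefficient is larger by exactly this factor than it would be if that predictor were uncorrelated with the rest.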