Multicollinearity

Modified on Fri, 13 Nov, 2020 at 8:13 AM

What is multicollinearity?

Multicollinearity between your explanatory variables means that they are strongly dependent on each other. In other words, the value of one explanatory variable largely determines the value of at least one other explanatory variable in the model.

Why is multicollinearity a problem?

Multicollinearity can lead to unstable, unreliable coefficient estimates, or can make the model impossible to compute. This is because the model cannot determine which of the variables involved in the multicollinearity matters more for predicting the outcome variable. For example, if one wants to predict the event "cancer" from age in years and age in months, the model cannot tell whether the unit "month" or the unit "year" is better. It therefore cannot decide which variable should receive the larger coefficient and which should receive a coefficient of 0.
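The age example can be sketched in a few lines of Python with NumPy. The data below are hypothetical and the variable names are chosen only for illustration. The sketch shows that a design matrix containing age in years and age in months has fewer independent columns than it appears to, and that two very different sets of coefficients give identical predictions.

```python
import numpy as np

# Hypothetical data: the same ages expressed in years and in months.
age_years = np.array([30.0, 45.0, 52.0, 61.0, 70.0])
age_months = age_years * 12.0

# Design matrix: intercept column plus the two age variables.
X = np.column_stack([np.ones_like(age_years), age_years, age_months])

# Three columns, but only two of them carry independent information.
n_columns = X.shape[1]
rank = np.linalg.matrix_rank(X)
print("columns:", n_columns, "rank:", rank)

# Two different coefficient vectors (intercept, years, months)
# that produce the same predictions.
coef_years_only = np.array([0.0, 1.0, 0.0])
coef_months_only = np.array([0.0, 0.0, 1.0 / 12.0])
same_predictions = np.allclose(X @ coef_years_only, X @ coef_months_only)
print("same predictions:", same_predictions)
```

Because both coefficient vectors fit the data equally well, nothing in the data favours one over the other.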

What can I do to avoid multicollinearity?

To avoid multicollinearity:

Don't include the same variable twice in your model (e.g. weight in kilograms and weight in pounds).
Don't include variables that you know are strongly correlated (e.g. weight and height): remove one of them or combine them into a new variable (e.g. the Body Mass Index).
Don't include variables that measure the same outcome on different dates (e.g. pain on one date and pain on another date): here too, you can combine them into a new variable (e.g. the difference in pain between the two dates).
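As a minimal sketch of the second recommendation, the Python snippet below checks how strongly two variables are correlated and then combines them into one. The measurements are hypothetical, and this check is only an illustration, not the procedure used by EMS.

```python
import numpy as np

# Hypothetical measurements for five patients.
height_m = np.array([1.55, 1.62, 1.70, 1.78, 1.85])
weight_kg = np.array([52.0, 60.0, 68.0, 77.0, 88.0])

# Pearson correlation between the two candidate explanatory variables.
r = np.corrcoef(weight_kg, height_m)[0, 1]
print("correlation between weight and height:", round(r, 3))

# Combine both variables into a single one: the Body Mass Index (kg / m^2).
bmi = weight_kg / height_m ** 2
print("BMI:", np.round(bmi, 1))
```

The model would then use the combined variable in place of the two original ones.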

On EMS, multicollinearity is assessed with the Belsley-Kuh-Welsch test. If multicollinearity is detected, the interface will advise you to remove the variable(s) that appear to cause it.
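For readers curious about what this kind of diagnostic looks at, here is a rough sketch of the condition indices used in the Belsley-Kuh-Welsch approach, written in Python with NumPy. It is not the EMS implementation. The function name `condition_indices`, the simulated data, and the cutoff of 30 (a commonly cited rule of thumb) are assumptions made for illustration. The full diagnostic also examines variance-decomposition proportions, which are omitted here.

```python
import numpy as np

def condition_indices(X):
    """Condition indices of a design matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Scale each column to unit length so that units do not drive the result.
    scaled = X / np.linalg.norm(X, axis=0)
    # Singular values are returned sorted from largest to smallest.
    singular_values = np.linalg.svd(scaled, compute_uv=False)
    return singular_values[0] / singular_values

# Simulated data: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
x3 = rng.normal(size=100)
intercept = np.ones(100)

with_collinearity = np.column_stack([intercept, x1, x2, x3])
without_collinearity = np.column_stack([intercept, x1, x3])

max_with = condition_indices(with_collinearity).max()
max_without = condition_indices(without_collinearity).max()

# Assumed rule of thumb: a condition index above about 30 is a warning sign.
print("largest index with x2:", max_with > 30)
print("largest index without x2:", max_without > 30)
```

Removing the near-duplicate variable brings the largest condition index back down, which matches the advice to remove the variable(s) involved.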