Linearity assessment in multivariate analysis

Modified on Mon, 23 Nov 2020 at 10:47 AM

Linear regression assumes a linear relationship (linearity) between the variable to explain and each numeric explanatory variable. Logistic regression assumes a linear relationship (called log-linearity) between the log-odds of the variable to explain and each numeric explanatory variable.
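Written as formulas, the two assumptions look as follows. This is a standard textbook formulation; the symbols (Y for the variable to explain, x_1 ... x_p for the numeric explanatory variables, beta for the coefficients) are notation chosen here and do not come from the article.

```latex
% Linear regression: the expected outcome is linear in each numeric predictor
\mathbb{E}[Y \mid x_1, \dots, x_p] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p

% Logistic regression: the log-odds of the outcome are linear in each numeric predictor
\log \frac{P(Y = 1 \mid x_1, \dots, x_p)}{1 - P(Y = 1 \mid x_1, \dots, x_p)}
  = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
```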

If these conditions are not met, the computed coefficients may be unreliable. It is therefore important to assess the linearity of a linear model, or the log-linearity of a logistic model, before interpreting the results.


Is the red line generally within the green zone?

Linear regression


Take the following model as an example: the variable to explain is "Numbers of asthma attacks in the last 12 months", and the explanatory variables are "Age", "Sex" (female/male), "Pack year", "Disease-modifying treatments" (yes/no), "Years since diagnosis", "Number of asthma diagnosis in family", "Therapeutic education sessions in the past 6 months" and "Alcohol per day (g/day)".



YES, it is clearly within the green zone

Even if the red line occasionally leaves the green zone, the relationship is generally linear.


            


NO, it is not within the green zone

The variables "Age", "Quality of life score" and "Number of rehabilitation sessions" are clearly non-linear: the red line is almost never within the green zone.
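The article does not state how the red line and the green zone are computed. As a rough sketch only, assume the red line is a smoothed trend of the data and the green zone is a tolerance band around the fitted straight line. Under that assumption, the visual check can be imitated numerically by counting how often a crude smoother (bin means) stays close to the line. The function names, the bin count, the band width and the simulated data below are all hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def binned_means(xs, ys, n_bins=10):
    """Crude smoother: (mean x, mean y) in equal-count bins of sorted x."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    means = []
    for i in range(n_bins):
        # The last bin absorbs any leftover points.
        chunk = pairs[i * size:] if i == n_bins - 1 else pairs[i * size:(i + 1) * size]
        means.append((sum(x for x, _ in chunk) / len(chunk),
                      sum(y for _, y in chunk) / len(chunk)))
    return means

def share_within_band(xs, ys, n_bins=10, width=2.0):
    """Fraction of bin means within `width` standard errors of the fitted line."""
    a, b = fit_line(xs, ys)
    n = len(xs)
    resid_sd = (sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)) ** 0.5
    # Approximate standard error of a bin mean, times the chosen width.
    half_band = width * resid_sd / (n / n_bins) ** 0.5
    means = binned_means(xs, ys, n_bins)
    inside = sum(abs(my - (a + b * mx)) <= half_band for mx, my in means)
    return inside / len(means)

# Simulated data (not from the article): one linear and one curved relationship.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(500)]
y_linear = [2 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
y_curved = [(xi - 5) ** 2 + rng.gauss(0, 1) for xi in x]

share_linear = share_within_band(x, y_linear)
share_curved = share_within_band(x, y_curved)
print(share_linear, share_curved)
```

A share close to 1 corresponds to the "YES" case, and a share close to 0 to the "NO" case. The plots remain the reference: this number is only a summary of the same idea.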




YES, it is within the green zone BUT maybe non-linear ...


For the variables "Years since diagnosis" and "Number of asthma diagnosis in family", the red line is generally within the green zone, so the user can consider them linear. However, in specific cases such variables could still be treated as non-linear*.


* For more experienced researchers

There are two special cases: "Years since diagnosis" and "Number of asthma diagnosis in family" are not strictly linear, even though the red line is generally within the green zone. The point cloud shows that another mathematical relationship could better describe the link between the variable to explain and the explanatory variable, such as a curvilinear relationship for "Years since diagnosis", or a polygonal (piecewise-linear) or spline relationship for "Number of asthma diagnosis in family".
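One way to explore such alternatives is to fit the straight line, a curvilinear (quadratic) form and a polygonal form with a single breakpoint, then compare how much residual error each leaves. The sketch below does this on simulated data. The helper names, the breakpoint value and the data are hypothetical, and this is not necessarily how the software itself handles non-linear variables.

```python
import random

def solve(a, b):
    """Solve a small linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_rss(design, ys):
    """Least-squares fit via the normal equations; returns the residual sum of squares."""
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, y in zip(design, ys))

# Simulated predictor with a gently curved effect (not data from the article).
rng = random.Random(1)
x = [rng.uniform(0, 30) for _ in range(300)]
y = [1 + 0.3 * xi - 0.008 * xi ** 2 + rng.gauss(0, 0.5) for xi in x]

knot = 15.0  # hypothetical breakpoint for the polygonal (piecewise-linear) form

rss_linear = fit_rss([[1.0, xi] for xi in x], y)
rss_quadratic = fit_rss([[1.0, xi, xi ** 2] for xi in x], y)
rss_hinge = fit_rss([[1.0, xi, max(0.0, xi - knot)] for xi in x], y)
print(rss_linear, rss_quadratic, rss_hinge)
```

Adding a term never increases the residual sum of squares, so a lower value alone does not prove non-linearity. The reduction should be judged with a formal comparison (for example an F-test or an information criterion) and against the plots.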



Logistic regression

The same principle applies. The graphics are very similar; the only difference is that they contain no scatter plot.
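With a yes/no outcome, individual points carry little visual information, so log-linearity is usually examined on grouped data. A minimal sketch, assuming a binary outcome coded 0/1: split the explanatory variable into equal-count bins and compute the empirical log-odds in each bin. If the relationship is log-linear, these values should fall roughly on a straight line when plotted against the bin means. The function name, the number of bins and the simulated data are hypothetical.

```python
import math
import random

def empirical_log_odds(xs, ys, n_bins=5):
    """(mean x, empirical log-odds of y == 1) in equal-count bins of sorted x."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    out = []
    for i in range(n_bins):
        # The last bin absorbs any leftover points.
        chunk = pairs[i * size:] if i == n_bins - 1 else pairs[i * size:(i + 1) * size]
        events = sum(y for _, y in chunk)
        non_events = len(chunk) - events
        # Adding 0.5 to both counts avoids log(0) in bins with no events or only events.
        log_odds = math.log((events + 0.5) / (non_events + 0.5))
        out.append((sum(x for x, _ in chunk) / len(chunk), log_odds))
    return out

# Simulated data (not from the article): the true log-odds are -1 + 0.8 * x.
rng = random.Random(2)
x = [rng.uniform(-3, 3) for _ in range(5000)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(-1 + 0.8 * xi))) else 0 for xi in x]

points = empirical_log_odds(x, y)
for mean_x, log_odds in points:
    print(round(mean_x, 2), round(log_odds, 2))
```

A clear bend or plateau in these values would correspond to the "NO" case.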


YES, it is clearly within the green zone


NO, it is not within the green zone




For more details, see this book:

James G., Witten D., Hastie T. and Tibshirani R. An Introduction to Statistical Learning with Applications in R. Springer. 2013.
https://doi.org/10.1007/978-1-4614-7138-7

