Requirements

  • Linear Regression needs at least two variables: one response and at least one predictor
  • As a rule of thumb, aim for at least 20 observations per predictor to have confidence in the model
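
A quick way to check the sample-size rule of thumb is to divide the number of rows by the number of predictors. This is a minimal sketch using the built-in cars data and the formula dist ~ speed; the names n_obs, n_predictors and obs_per_predictor are only illustrative.

```r
# Count observations and predictors for the model dist ~ speed
n_obs <- nrow(cars)
n_predictors <- length(attr(terms(dist ~ speed), "term.labels"))

# Observations available per predictor (rule of thumb: at least 20)
obs_per_predictor <- n_obs / n_predictors
obs_per_predictor >= 20
```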

1. Linear Relationship

  • The relationship between the dependent variable (response) and the independent variable(s) (predictors) needs to be linear
  • Check for outliers, as Linear Regression is sensitive to them - inspect with scatter plots

  • R methods: scatter plots

plot(cars)
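
To make the scatter plot easier to judge, it can help to overlay the fitted line and a smoother, and to flag possibly influential points. The following is a sketch in base R; the name fit_lin and the 4/n cutoff for Cook's distance are assumptions (the cutoff is a common rule of thumb, not a formal test).

```r
# Fit the simple model and plot the data with the fitted line
fit_lin <- lm(dist ~ speed, data = cars)
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(fit_lin, col = "blue")

# A lowess smoother that stays close to the straight line suggests linearity
lines(lowess(cars$speed, cars$dist), col = "red", lty = 2)

# Cook's distance as a rough screen for influential observations
cooks <- cooks.distance(fit_lin)
influential <- which(cooks > 4 / nrow(cars))  # assumed rule-of-thumb cutoff
influential
```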

2. Multivariate Normality

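  • The variables should be approximately normal (or Gaussian), that is, symmetric and bell-shaped around a central mean. Strictly, the assumption that matters for inference is that the model residuals are approximately normal, so checks on the individual variables are only a first look.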

  • R methods: Histograms, Shapiro-Wilk Test, Q-Q plot, Kolmogorov-Smirnov test

hist(cars$speed)

hist(cars$dist)

shapiro.test(cars$speed)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars$speed
## W = 0.97765, p-value = 0.4576
shapiro.test(cars$dist)
## 
##  Shapiro-Wilk normality test
## 
## data:  cars$dist
## W = 0.95144, p-value = 0.0391
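# Q-Q plot of each variable against the normal distribution
# (qqplot(x, y) would compare the two variables with each other instead)
qqnorm(cars$speed); qqline(cars$speed)
qqnorm(cars$dist); qqline(cars$dist)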

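# One-sample test of each variable against a normal distribution with the
# sample mean and sd (ks.test(x, y) would test whether x and y share a distribution).
# The p-value is approximate because the parameters are estimated from the data,
# and ties in the data produce a warning.
ks.test(cars$speed, "pnorm", mean(cars$speed), sd(cars$speed))
ks.test(cars$dist, "pnorm", mean(cars$dist), sd(cars$dist))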
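
The checks above look at the raw variables. Since the normality assumption concerns the errors of the model, the same tools can be applied to the residuals. This is a sketch in base R; fit_norm, res and res_test are illustrative names, and no output is shown here.

```r
# Fit the model and extract its residuals
fit_norm <- lm(dist ~ speed, data = cars)
res <- resid(fit_norm)

# Histogram and normal Q-Q plot of the residuals
hist(res)
qqnorm(res); qqline(res)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
res_test <- shapiro.test(res)
res_test
```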

3. No or little multicollinearity

  • Multicollinearity occurs when the independent variables are strongly correlated with each other - one predictor can be predicted from the others
  • A second important independence assumption is that the error term has to be independent from the independent variables.

  • R methods: correlation, tolerance, variance inflation factor (VIF)

cor(cars)
##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000
# example of perfectly correlated variables: speed2 is a copy of speed
cars2 <- cars
cars2$speed2 <- cars2$speed
cor(cars2) # a 1.0 anywhere off the diagonal means two variables are perfectly correlated
##            speed      dist    speed2
## speed  1.0000000 0.8068949 1.0000000
## dist   0.8068949 1.0000000 0.8068949
## speed2 1.0000000 0.8068949 1.0000000
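
Tolerance and VIF are listed above but not computed. For predictor j, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing that predictor on the other predictors, and tolerance is 1 / VIF_j. The cars data has only one predictor, so this sketch assumes the built-in mtcars data with wt, hp and disp as predictors; the names predictors, vif_manual and tolerance are illustrative. For a fitted model, car::vif() reports the same quantity.

```r
# Assumed example predictors from the built-in mtcars data
predictors <- mtcars[, c("wt", "hp", "disp")]

# Regress each predictor on the others and convert R^2 to a VIF
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

# Tolerance is the reciprocal of the VIF
tolerance <- 1 / vif_manual

vif_manual
tolerance
```

Larger VIF values (equivalently, tolerance values near 0) point to a predictor that is largely explained by the others.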

4. No autocorrelation

  • Autocorrelation occurs when the residuals are not independent from each other
  • It is the correlation of a signal with itself at different points in time - common in time series analysis

  • R methods: Durbin-Watson Test

library(car)

fit <- lm(dist ~ speed, data=cars)
summary(fit)
## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
durbinWatsonTest(fit)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.1604322      1.676225   0.166
##  Alternative hypothesis: rho != 0
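
The D-W statistic and lag-1 autocorrelation reported above can be reproduced by hand from the residuals, which shows what the test measures. A minimal sketch in base R; fit_dw, e, dw_stat and rho1 are illustrative names. A statistic near 2 indicates little lag-1 autocorrelation; values toward 0 or 4 indicate positive or negative autocorrelation.

```r
# Residuals of the same model, in the row order of the data
fit_dw <- lm(dist ~ speed, data = cars)
e <- resid(fit_dw)

# Durbin-Watson statistic: squared successive differences over the residual sum of squares
dw_stat <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals, using the same denominator
rho1 <- sum(e[-1] * e[-length(e)]) / sum(e^2)

dw_stat
rho1
```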

5. Homoscedasticity

  • The variance of the error terms is the same along the regression line (homoscedastic) rather than changing with the fitted values (heteroscedastic).

  • R methods: Residual plot, Goldfeld-Quandt Test

library(car)
residualPlot(fit)

# or in base R
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2) # reference line at zero
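
The Goldfeld-Quandt Test is listed above without an example. Its idea is to order the data by a predictor, fit the model separately on the low and high parts, and compare the two residual variances with an F test. What follows is a hand-rolled sketch in base R that splits the data into two halves and drops no central observations; all names are illustrative. The lmtest package provides a packaged version as gqtest().

```r
# Order observations by the predictor and split into two halves
ordered <- cars[order(cars$speed), ]
n <- nrow(ordered)
half <- floor(n / 2)
low <- ordered[1:half, ]
high <- ordered[(n - half + 1):n, ]

# Fit the same model on each half
fit_low <- lm(dist ~ speed, data = low)
fit_high <- lm(dist ~ speed, data = high)

# Ratio of residual variances (deviance() of an lm is its residual sum of squares)
gq_stat <- (deviance(fit_high) / df.residual(fit_high)) /
  (deviance(fit_low) / df.residual(fit_low))

# One-sided p-value: is the variance larger in the high-speed half?
gq_p <- pf(gq_stat, df.residual(fit_high), df.residual(fit_low), lower.tail = FALSE)

gq_stat
gq_p
```

A small p-value would suggest that the residual variance grows with speed, that is, heteroscedasticity.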
