Polynomial Regression 

Supervised Learning

[Figure: simple linear fit vs polynomial fit]

A regression problem is one where we predict a continuous value from previously known inputs. The input values are called predictors and the output is called the response. Here we predict or estimate an actual value, not a class as in classification.

Regression measures the contribution of the independent variables to the variability of the dependent variable. Simple linear regression uses a formula with predictors raised to the power of 1 (a linear combination of the predictors), and a straight line is fitted to depict the relationship.

Simple linear regression is a good fit when the relationship is close to linear; in many cases, however, a curve explains the relationship better. Polynomial regression captures such relationships by extending the linear regression formula: it adds predictors raised to the power of 2, 3, 4 and so on, until adding higher-degree terms no longer significantly explains additional variability in the dependent variable.

Names of polynomials by degree:

  • Degree 0 – constant
  • Degree 1 – linear
  • Degree 2 – quadratic
  • Degree 3 – cubic
  • Degree 4 – quartic (or, if all terms have even degree, biquadratic)
  • Degree 5 – quintic
    and so on. See Polynomials on Wikipedia.

How far do we go?

In simple linear regression we face the problem of under-fitting, whereas in polynomial regression, if we keep increasing the degree of the polynomial, we risk over-fitting: the model starts explaining the noise along with the signal, making it unsuitable for prediction. So in most practical scenarios we do not fit beyond cubic polynomials.
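As an illustrative sketch (an addition to the original text), we can watch the adjusted r-squared level off as the degree grows; this uses the cars data analysed later in this post:

# sketch: compare adjusted r-squared across polynomial degrees
# using the built-in cars data (dist ~ speed)
for (d in 1:5) {
  fit = lm(dist ~ poly(speed, d, raw = TRUE), data = cars)
  cat("degree", d, "adjusted r-squared:",
      round(summary(fit)$adj.r.squared, 4), "\n")
}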

# note: inside an R formula, ^ means interaction crossing, so squared
# and cubed terms must be wrapped in I() (or generated with poly())
Linear_model = lm(y ~ x)

Quadratic_model = lm(y ~ x + I(x^2))

Cubic_model = lm(y ~ x + I(x^2) + I(x^3))

[Figures: model formulas for the linear, quadratic and cubic regressions]

  linear:    y = b0 + b1*x
  quadratic: y = b0 + b1*x + b2*x^2
  cubic:     y = b0 + b1*x + b2*x^2 + b3*x^3

Our regression problem

In this example we fit polynomial regression models to measure the variability in the dependent variable dist explained by the independent variable speed in the cars data.

dist is the response and speed is the predictor.

## see the car data description

?cars

# Description
#
# The data give the speed of cars and the distances taken to stop.
# Note that the data were recorded in the 1920s.
#
# A data frame with 50 observations on 2 variables.
#
# [,1]  speed  numeric  Speed (mph)
# [,2]  dist   numeric  Stopping distance (ft)

# see the dimensions of cars
dim(cars)

# [1] 50  2

# see the first six rows
head(cars)

#   speed dist
# 1     4    2
# 2     4   10
# 3     7    4
# 4     7   22
# 5     8   16
# 6     9   10
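For a quick numeric overview (a small addition to the original walkthrough), summary() reports the range, quartiles and mean of both variables:

# five-number summary plus mean for speed and dist (output omitted)
summary(cars)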

Programming Logic

Steps to fit the simple linear and polynomial regression models and compare the influence of the predictor on the response variable.

Step 1:
Find the correlation between dependent variable dist and independent variable speed

Step 2:
Scatter plot dependent vs independent variables to see if there is any pattern in the distribution

Step 3:
Fit the linear regression model, note the significance and multiple r-squared value

Step 4:
Fit the quadratic and cubic polynomial regression models and note the significance and multiple r-squared value

Step 5:
Plot the lines for predicted values of response using the linear, quadratic and cubic regression models

Step 6:
Do the analysis of variance for the linear, quadratic and cubic models to decide which is the best fit for prediction.

 

Correlation between variables


cor(cars$dist, cars$speed)

#[1] 0.8068949
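As a quick check (not in the original post), cor.test() also reports whether this correlation is statistically significant:

# tests H0: true correlation is zero; prints the estimate and p-value
cor.test(cars$dist, cars$speed)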

Plot to visualize the correlation

[Figure: scatter plot of dist vs speed]

# scatter plot dist ~ speed
# pch = 19 is a solid circle

plot(cars$dist ~ cars$speed, pch = 19,
     xlab = "Car Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Car Speed and Stopping Distance",
     las = 1)

Fit the simple linear regression model


# first we fit a simple linear model
fitlm = lm(dist ~ speed, data=cars)

# analysis of variance
anova(fitlm)

# Analysis of Variance Table
#
# Response: dist
#           Df Sum Sq Mean Sq F value   Pr(>F)
# speed      1  21186 21185.5  89.567 1.49e-12 ***
# Residuals 48  11354   236.5
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#summary to get the r-squared value
summary(fitlm)

# Call:
# lm(formula = dist ~ speed, data = cars)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -29.069  -9.525  -2.272   9.215  43.201
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept) -17.5791     6.7584  -2.601   0.0123 *
# speed         3.9324     0.4155   9.464 1.49e-12 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 15.38 on 48 degrees of freedom
# Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
# F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
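From the coefficients above, the fitted line is dist = -17.5791 + 3.9324 * speed; as a small addition, coef() extracts these estimates directly:

# extract the estimated intercept (about -17.58) and slope (about 3.93)
coef(fitlm)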

 

Fit the quadratic polynomial regression model


# now fit quadratic polynomial model
fitQ = lm(dist~poly(speed,2,raw=TRUE), data=cars)

#analysis of variance
anova(fitQ)

# Analysis of Variance Table
#
# Response: dist
#                            Df Sum Sq Mean Sq F value    Pr(>F)
# poly(speed, 2, raw = TRUE)  2  21714 10857.1  47.141 5.852e-12 ***
# Residuals                  47  10825   230.3
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

#summary to get the r-squared value
summary(fitQ)

# Call:
# lm(formula = dist ~ poly(speed, 2, raw = TRUE), data = cars)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -28.720  -9.184  -3.188   4.628  45.152
#
# Coefficients:
#                             Estimate Std. Error t value Pr(>|t|)
# (Intercept)                  2.47014   14.81716   0.167    0.868
# poly(speed, 2, raw = TRUE)1  0.91329    2.03422   0.449    0.656
# poly(speed, 2, raw = TRUE)2  0.09996    0.06597   1.515    0.136
#
# Residual standard error: 15.18 on 47 degrees of freedom
# Multiple R-squared: 0.6673, Adjusted R-squared: 0.6532
# F-statistic: 47.14 on 2 and 47 DF, p-value: 5.852e-12

# R-squared is 0.67, i.e. 67 percent of the variability in dist
# is explained by the predictors; residual standard error is 15.18
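As a usage sketch (the speed of 21 mph is just an illustrative choice, not from the original post), the fitted quadratic model can predict the stopping distance at a new speed:

# predicted stopping distance with a 95% prediction interval
# for a hypothetical car travelling at 21 mph
predict(fitQ, newdata = data.frame(speed = 21), interval = "prediction")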

Fit the cubic polynomial regression model


# now fit the cubic polynomial model
fitC = lm(dist~poly(speed,3,raw=TRUE), data=cars)

#analysis of variance
anova(fitC)

# Analysis of Variance Table
#
# Response: dist
#                            Df Sum Sq Mean Sq F value    Pr(>F)
# poly(speed, 3, raw = TRUE)  3  21905  7301.5  31.584 3.074e-11 ***
# Residuals                  46  10634   231.2
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# summary to get the r-squared value
summary(fitC)

# Call:
# lm(formula = dist ~ poly(speed, 3, raw = TRUE), data = cars)
#
# Residuals:
#     Min      1Q  Median      3Q     Max
# -26.670  -9.601  -2.231   7.075  44.691
#
# Coefficients:
#                              Estimate Std. Error t value Pr(>|t|)
# (Intercept)                 -19.50505   28.40530  -0.687    0.496
# poly(speed, 3, raw = TRUE)1   6.80111    6.80113   1.000    0.323
# poly(speed, 3, raw = TRUE)2  -0.34966    0.49988  -0.699    0.488
# poly(speed, 3, raw = TRUE)3   0.01025    0.01130   0.907    0.369
#
# Residual standard error: 15.2 on 46 degrees of freedom
# Multiple R-squared: 0.6732, Adjusted R-squared: 0.6519
# F-statistic: 31.58 on 3 and 46 DF, p-value: 3.074e-11

Plot the predicted values using the three regression models


# plot the prediction using the linear model
plot(cars$dist ~ cars$speed, pch = 19,
     xlab = "Car Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Linear Fit",
     las = 1)

# draw the linear regression fit line
lines(cars$speed, predict(fitlm), col = "blue", lwd = 2)

 

# plot the prediction using the quadratic model
plot(cars$dist ~ cars$speed, pch = 19,
     xlab = "Car Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Quadratic Fit",
     las = 1)

# draw the quadratic regression fit curve
# (cars$speed is already sorted, so lines() traces the curve correctly)
lines(cars$speed, predict(fitQ), col = "green", lwd = 2)

 

# plot the prediction using the cubic model
plot(cars$dist ~ cars$speed, pch = 19,
     xlab = "Car Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Cubic Fit",
     las = 1)

# draw the cubic regression fit curve
lines(cars$speed, predict(fitC), col = "red", lwd = 2)

Compare the three regression models:

[Figures: Linear Fit, Quadratic Fit and Cubic Fit side by side]

Linear Vs Quadratic Vs Cubic
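A minimal sketch to reproduce the combined comparison plot, reusing the colours from the individual plots above:

# overlay all three fitted curves on one scatter plot
plot(cars$dist ~ cars$speed, pch = 19,
     xlab = "Car Speed (mph)",
     ylab = "Stopping Distance (ft)",
     main = "Linear Vs Quadratic Vs Cubic",
     las = 1)
lines(cars$speed, predict(fitlm), col = "blue", lwd = 2)
lines(cars$speed, predict(fitQ), col = "green", lwd = 2)
lines(cars$speed, predict(fitC), col = "red", lwd = 2)
legend("topleft", legend = c("Linear", "Quadratic", "Cubic"),
       col = c("blue", "green", "red"), lwd = 2)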


Which regression model do we select?

Compared to the linear model, the quadratic model explains the variability more significantly, and the curve is a better fit than the straight line.

Adding the cubic term does not significantly improve the fit over the quadratic term of the predictor speed, so there is no added advantage in using it.

So we prefer the quadratic model over the cubic and linear models.
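Step 6 above calls for an analysis of variance across the three models; a minimal sketch of that nested-model comparison (output omitted) is:

# nested-model F-tests: each row tests whether the extra polynomial
# term significantly reduces the residual sum of squares
anova(fitlm, fitQ, fitC)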

 

Cross Examine the models

Using Residual Pattern

We check whether the residuals from the linear and quadratic fits form any pattern: if the residuals form a pattern, the model is under-fitted and may not be suitable for prediction.

Since the fitted linear regression line is straight, we can expect its residuals to form a pattern, and indeed we see a curve. In the quadratic fit, by contrast, the residuals show no clear curvature or pattern.

So we can conclude that the quadratic fit is the best for prediction, based on its r-squared value and the absence of any curvature or pattern in its residuals.


# residuals vs fitted plot for the linear model
plot(fitlm, which = 1)

# residuals vs fitted plot for the quadratic model
plot(fitQ, which = 1)

 

[Figures: residuals vs fitted plots for the linear and quadratic fits]

Normality of Residuals

# normal Q-Q plot of residuals for the linear model
plot(fitlm, which = 2)

# normal Q-Q plot of residuals for the quadratic model
plot(fitQ, which = 2)

 

[Figures: normal Q-Q plots of residuals for the linear and quadratic fits]
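As a supplementary check (an addition, not part of the original post), the Shapiro-Wilk test provides a formal test of residual normality to complement the Q-Q plots:

# H0: the residuals are normally distributed; a large p-value
# means no evidence against normality
shapiro.test(residuals(fitQ))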