Skip to main content

Chapter 37 Linear Regression

One common task when having a dataset is to build a model of the data and a common model is a linear one. Consider the cars dataset in the RDatasets package. This dataset is a simple two column set with the speed a car is travelling and the distance it takes to reach a full stop. A scatter plot of the data is:
(for accessibility)
Figure 37.1.
This plot was created by loading in RDatasets, CairoMakie, activating CairoMakie and then
cars=RDatasets.dataset("datasets","cars")
fig = Figure()
ax = Axis(fig[1,1], xlabel="Speed of Car", ylabel = "Distance to Stop")
scatter!(ax, cars.Speed, cars.Dist)
fig
The plot reveals that the faster the car is traveling, the longer it takes to stop. This probably isn’t a surprise, but perhaps we’d like to know the relationship between then. That is, can we predict the stopping distance if we know the speed of the car. The simplest model to use for this is a linear one.

Section 37.1 Basics of Linear Regression

Before finding the linear model for the above example, let’s look at a simpler dataset. Consider
data = DataFrame(x=[1,3,4,6,7,9, 10], y = [10, 9, 7, 6, 5, 4, 2])
and has the following scatter plot:
(for accessibility)
Figure 37.2.
We seek a line of the form \(y=ax+b\) which we plot below and for each data value, we define \(\epsilon_k = y_k - (a x_k+b)\text{,}\) which is the vertical signed distance between the line and the data as shown below:
(for accessibility)
Figure 37.3.
We then find the total square distance of all of the errors as
\begin{equation} S(a,b) = \sum_{k=1}^n (y_k - (ax_k+b))^2\tag{37.1} \end{equation}
which is explicitly stated as a function of \(a\) and \(b\text{,}\) the slope and \(y\)-intercept of the line. The best fine line then is the line that minimizes this function. To find the values of \(a\) and \(b\text{,}\) there are many techniques to accomplish this. We find the derivatives \(\partial S/\partial a\) and \(\partial S/\partial b\text{,}\)
\begin{align} \frac{\partial S}{\partial a}\amp = 2 \sum_{k=1}^n(y_k -(ax_k+b))(-x_k) \notag\\ \amp = -2 \sum_{k=1}^n x_k y_k + 2 a \sum_{k=1}^n x_k^2 + 2 b \sum_{k=1}^n x_k \notag\\ \amp = -2 S_{xy} + 2 a S_{xx} + 2b S_x \tag{37.2}\\ \frac{\partial S}{\partial b} \amp = 2 \sum_{k=1}^n(y_k - (ax_k+b))(-1) \notag\\ \amp = - 2 \sum_{k=1}^n y_k + 2a \sum_{k=1}^n x_k +2b \sum_{k=1}^n 1 \notag\\ \amp = - 2 S_y + 2a S_x + 2nb \tag{37.3} \end{align}
where
\begin{align} S_x \amp = \sum x_i, \amp \quad S_y \amp= \sum y_i, \amp S_{xx} \amp = \sum x_i^2, \amp \quad S_{xy} \amp = \sum x_i y_i,\amp \quad S_{yy} \amp = \sum y_i^2 \tag{37.4} \end{align}
Setting both (37.2) and (37.3) to 0 and solving for \(a\) and \(b\) leads to
\begin{align} a \amp = \frac{n S_{xy} -S_x S_y }{n S_{xx} - S_x^2}\amp b \amp = \frac{1}{n} \left( S_y - a S_x \right) \tag{37.5} \end{align}
To find the best-fit line using Julia, we will first define the sums from (37.5) with
Sx = sum(data.x)
Sy = sum(data.y)
Sxx = sum(x -> x^2, data.x)
Sxy = sum(data.x .* data.y)
Syy = sum(y -> y^2, data.y)
where in the case of the 3rd and 5th line above, we have used the option for sum that takes a function and applies the function to the array element and then sums. With these given terms, we can find \(a\) and \(b\) with
a = (length(data.x)*Sxy - Sx*Sy)/(length(data.x)*Sxx-Sx^2)
b = (Sy-a*Sx)/length(data.x)
a,b
and this returns (-0.8468468468468469, 10.981981981981983) for a and b respectivley. The best fit line is then \(y=10.98198 - 0.8468 x\text{.}\) We can use this to plot the data and the best-fit line as follows:
fig, ax = scatter(data.x,data.y)
lines!(ax, 0..10, x -> a*x+b, color = :darkgreen)
fig
which produces the following plot:
(for accessibility)
Figure 37.4.
This looks good visually, but how do we know it is a good fit. There are a number of parameters that we can use to calculate the fit and the next sections goes briefly into this.

Section 37.2 Measures of Regression

As we saw in the previous section, visually the fit of the example above looks good, but it would be nice to have a measure (numerical value) of the fit. We will explore this in this section.
First, we introduce the notion of the residuals for a regression, in short this is
\begin{equation*} \epsilon_i = y_i - (ax_i+b) \end{equation*}
where \(a\) and \(b\) are the best fit slope and intercept. In terms of a plot, each \(\epsilon_i\) is the vertical distance between a datapoint and the regression line.
There are two important measures here, the sum of squares regression or
\begin{equation*} \text{SSR} = \sum \epsilon_i^2 \end{equation*}
Again, as we mentioned in the previous section, the goal of regression is to find the regression coefficients (in this case the slope and intercept) that minimize this value.
For the example above, we can find the SSR in Julia with
SSR = sum([(data.y[i] - a*data.x[i]-b)^2 for i=1:nrow(data) ])
and this returns 1.3693693693693667. The sum of squares error or
\begin{equation*} \text{SSE} = \sum (y_i - \bar{y})^2 \end{equation*}
The SSE can be written as:
\begin{equation*} \text{SSE} = S_{yy} - \frac{1}{n} S_y^2 \end{equation*}
is a sum of squares of the vertical distance between the \(y\)-value of each datapoint and the mean. Using the example above, we have already calculated \(S_{yy}\) and \(S_y\) so
SSE = Syy -Sy^2/nrow(data)
resulting in 46.85714285714283. A nice measure of the fit of the regression line is by using the following:
\begin{equation*} r^2 = 1 - \frac{\text{SSR}}{\text{SSE}} \end{equation*}
The value of \(r^2\) satisfies \(0 \leq r^2 \leq 1\) and a proof is left to the reader. The closer the value is to 1, the better the fit and in fact when \(r^2=1\text{,}\) then the data perfects lies on a line.
For the example above
rsq = 1-SSR/SSE
resulting in 0.970775653702483, which is a good fit, where 97% of the variability in the data is explained with the linear model.

Section 37.3 A Matrix Approach to Linear Regression

The above formulas for linear regression works only for a best-fit line through a set of points \((x_k, y_k)\text{.}\) If we want to either fit other functions, one needs to find another way. A more general way is to define
\begin{align*} X \amp = \begin{bmatrix} 1 \amp x_1 \\ 1 \amp x_2 \\ \vdots \\ 1 \amp x_n \end{bmatrix} \amp \amp \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \amp \vec{c} \amp = \begin{bmatrix} a \\ b \end{bmatrix} \end{align*}
and then (37.1) can be written as
\begin{equation} S(a,b) = (X \vec{c} - \vec{y})^T (X \vec{c}-\vec{y})\tag{37.6} \end{equation}
and differentiating with \(\partial_{\vec{x}} [(\vec{F}(\vec{x}))^{\intercal} \vec{F}(\vec{x})] = 2 \vec{F}(\vec{x})^{\intercal} \vec{F}(\vec{x})\) leads to
\begin{align*} \frac{\partial S}{\partial \vec{c}} \amp = 2 X^{\intercal} (X \vec{c} - \vec{y})\\ \amp = 2 X^{\intercal} X \vec{c} - 2 X^{\intercal} \vec{y} \end{align*}
and setting this to \(\vec{0}\) and solving for \(\vec{c}\) results in
\begin{equation*} \vec{c} = (X^{\intercal} X)^{-1} X^{\intercal} \vec{y} \end{equation*}
We now apply this to the above example. First, we will create the X matrix with the following:
X = [data.x[i]^k for i=1:nrow(data), k=0:1]
which generates the matrix with ones in the first column and data.x in the second. And then
c = inv(X'*X)*X'*data.y
where X' is the way to find the transpose of X in julia. The result is the vector [ 10.982, -0.8468], which is the vector of \(y\)-intercept and slope. So the best-fit line for this is \(y=-0.8468x+10.982\text{.}\)

Section 37.4 Fitting Other Functions to Data

As mentioned above, we may want to fit other functions to a set of data. For example, if we want to fit the following data to a quadratic.

Subsection 37.4.1 Fitting a Quadratic to Data

data2 = DataFrame(x =-4:4, y = [15, 10, 5, 2, 1, 0, 3,8, 17])
which has the following scatter plot
(for accessibility)
Figure 37.5.
It appears that the data might fit a quadratic curve of the form \(q(x) = ax^2+bx+c\text{.}\) Similar to above, we desire to find the constants \(a, b\) and \(c\) such that
\begin{equation*} S(a,b,c) = \sum_{k=1}^n (y_k - (a_2 x_k^2 + a_1 x_k + a_0))^2 \end{equation*}
is minimized. We will write in the same way using matrix as
\begin{equation} S(a,b,c) = (X \vec{c})^{\intercal} X \vec{c}\tag{37.7} \end{equation}
with
\begin{align*} X \amp = \begin{bmatrix} 1 \amp x_1 \amp x_1^2 \\ 1 \amp x_2 \amp x_2^2 \\ \vdots \\ 1 \amp x_n \amp x_n^2 \end{bmatrix} \amp \amp \vec{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \amp \vec{c} \amp \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} \end{align*}
and since (37.7) has the same form as (37.6), to find the values \(\vec{c}\) that minimizes (37.7) is the same or
\begin{equation*} \vec{c} = (X^{\intercal} X)^{-1} X \vec{y} \end{equation*}
For the example at the top of this section, we define
X = [data2.x[k]^i for k=1:nrow(data), i=0:2]
Then we can find the coefficients with
c = inv(X'*X)*X'*data2.y
The result of this is [0.255, -0.0666, 0.978]. Therefore the best-fit quadratic is
\begin{equation*} q(x) = 0.978 x^2 -0.0666 x + 0.255 \end{equation*}
A plot of the data and the function is given by
(for accessibility)
Figure 37.6.

Subsection 37.4.2 Multilinear Regression

Another common function used to fit to data is that of a linear function in higher dimensions. As an example, consider the following dataset:
data3 = DataFrame(
  x = [1, 1, 2, 3, 3, 4, 5, 5, 6, 2, 4, 8], 
  y = [1, 6, 8, 2, 2, 3, 7, 3, 1, 4, 9, 6], 
  z = [14, 7, 4, 14, 13, 12, 7,  13, 16, 11, 4, 10])
which has the plot
(for accessibility)
Figure 37.7.
Although this is difficult to tell, these points roughly fall on a plane, so we can fit a plane of the form \(z = a_0 + a_1 x + a_2 y\) to the data. This can be done with the matrix form by setting the first column to 1, the second column to \(x\) and the third to \(y\) or
X = ones(Int,size(data3,1),3)
X[:,2] = data3.x
X[:,3] = data3.y
X
and again using the same formula as above to find the best-fit coefficients:
c = inv(X'*X)*X'*data3.z
resulting in [15.13, 0.4085, -1.433] which is the plane \(z = 15.13+0.4085x-1.433y\text{.}\) A plot of the plane and the points is found with:
xpts = ypts = LinRange(0,10,100)
fig, ax = surface(xpts, ypts, [c[1].+c[2].*x .+ c[3] .* y for x in xpts, y in ypts])
scatter!(ax, data3.x, data3.y, data3.z)
fig
resulting in the following plot. Note: to get the interactive version of this plot, use GLMakie and set Makie.inline!(false) to ensure a separate window which allows the plot to rotate interactively.
(for accessibility)
Figure 37.8.

Section 37.5 Linear Regression with GLM package

There is a robust packge for regression called GLM short for Generalized Linear Models, which allows many different kinds of regression performed on a dataset. Make sure that you have added this package and loaded it with using GLM.
To start with a simple model, we can produce a best fit line with the formula y ~ x, meaning that we want y, which is a column in our data Dataframe as the dependent variable to be linear in x the column that will be the independent variable. After creating a formula with the @formula macro, we fit the data with the lm short for linear model method as in:
fm = @formula(y ~ x)
model = lm(fm, data)
The result is
y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  10.982      0.4244      25.88    <1e-05    9.89103  12.0729
x            -0.846847   0.0657102  -12.89    <1e-04   -1.01576  -0.677933
──────────────────────────────────────────────────────────────────────────
There are a number of things to note here:
  • The top line is the formula y ~ 1 + x which is the same as y ~ x. The 1 denotes that you want to include a constant term in your formula. Note: you can do y ~ 0 + x to force no constant term.
  • The cofficients table has information about all of the coefficients, in this case 2, the constant term and the coefficient of the x term. The first column is which coefficient and the second is the best-fit value of this coefficient. Notice that this is the same as we found above using the derived formulas.
  • The remainder of the table indicate statistics to determine the significance of each of the coefficients. The Pr(> |t|) is the \(p\)-value for each of the coefficients differs from 0. In this case, each are highly significant due to the small values. The last two columns give a confidence interval. To 95% confidence, each coefficient is between the last two columns.
Additionally, we can determine the overall fit of the model with an \(r^2\) value. This is accomplished with r2(model) which returns 0.970775653702483. A standard interpretation of this is that the linear model explains 97% of the variation in the data. This represents a very good linear fit.

Subsection 37.5.1 Finding the best-fit quadratic with GLM

You may be surprised to learn that the quadratic fit is actually a linear model. But wait, you say, I learned in Precaclulus that a quadratic function is not linear and that is true. But recall that the linearity is in the unknown coefficents and those are linear.
To find the best-fit quadratic using GLM, we first build the quadratic formula with
fm2 = @formula y ~ 1 + x + x^2
and although you don’t need to include the 1, I would recommend putting it in for any non linear function. This will correspond to fitting to a function of the form \(y = a_0 + a_1 x + a_2 x^2\text{.}\)
To find the coefficients, we’ll use
model2 = lm(fm2, data2)
which returns
y ~ 1 + x + :(x ^ 2)

Coefficients:
──────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)   0.255411     0.600744   0.43    0.6855  -1.21456    1.72538
x            -0.0666667    0.153459  -0.43    0.6792  -0.442168   0.308835
x ^ 2         0.978355     0.067732  14.44    <1e-05   0.812621   1.14409
──────────────────────────────────────────────────────────────────────────
and we’ll find the \(r^2\) value with
r2(model2)
which returns 0.972071266946816, which is surprisingly close to the previous example, but just shows that the overall model is a good fit. However, looking at the table above, only the \(x^2\) term is significantly different from 0, with a \(p\)-value much less that 0.05, the standard cutoff level.

Subsection 37.5.2 Multilinear Regression with GLM

The example using the dataset data above was fit with two predictor variables, x and y. To build the formula and then the model is done with
fm3 = @formula z ~ x + y
model3 = lm(fm3, data3)
and this returns
z ~ 1 + x + y

Coefficients:
──────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────
(Intercept)  15.134      0.32197     47.00    <1e-11  14.4056    15.8623
x             0.408533   0.0623703    6.55    0.0001   0.267442   0.549625
y            -1.4343     0.0472865  -30.33    <1e-09  -1.54126   -1.32733
──────────────────────────────────────────────────────────────────────────
and finding the \(r^2\) value with f2(model3) returns 0.9905202860004992, a very good fit and examining the Pr(>|t|) column all of the coefficients are significant.