
Section 6.2 Discrete Least Squares

Data is ubiquitous in today’s society, and the ability to process and understand it is a necessity. A manageable example is understanding the population growth of a state, the United States, or the entire world. Consider the population of the state of Massachusetts from 1900 through 2000, which is given in the following table and plotted below:
Year   Population (in millions)
1900   2.805
1910   3.366
1920   3.852
1930   4.249
1940   4.316
1950   4.690
1960   5.148
1970   5.689
1980   5.737
1990   6.016
2000   6.349
Figure 6.2.1.
The data is only available on the decades because the U.S. census collects it only in those years. It is desirable to know the population in years between the known points, and in light of this, we may consider using an interpolating polynomial. We skip the details from Chapter 4 and plot the data along with the interpolating polynomial. The result is the following plot:
Figure 6.2.2.
In the middle of the plotting domain, the function fits the data and the expected trend reasonably well. However, for \(x \gt 90\) and \(x \lt 10\text{,}\) the function oscillates more than expected.
The problem in this example is that a high-order polynomial (here of degree 10) is probably not the best functional form to represent the data; a lower-order polynomial (perhaps 2nd or 3rd degree) or another function such as an exponential may be better.
To explain how we can fit data, let’s look at a simpler example. Consider the data \(\{(1,1.5)\text{,}\) \((3,2)\text{,}\) \((5,3)\text{,}\) \((6,4)\}\text{,}\) which is plotted below:
Figure 6.2.3.
It is fairly clear that no line can pass through all four of these points, so we seek a line that is “close”. One way to fit a line to the data would be to find \(a\) and \(b\) for the function \(L(x)=ax+b\) such that
\begin{equation*} S(a,b)=\sum_{i=1}^{n} |L(x_{i})-y_{i}|, \end{equation*}
is minimized. This can be written as the \(\ell_{1}\) norm,
\begin{equation*} S(a,b)= ||L(\mathbf{x}) -\mathbf{y}||_{1}. \end{equation*}
Geometrically, this sum is the total vertical distance from the line \(L(x)\) to the data and we desire to find the minimum value of \(S\text{.}\)
Typically when finding a minimum, we would seek the values of \(a\) and \(b\) such that
\begin{equation*} \begin{aligned} \frac{\partial S}{\partial a}\amp= 0,\amp\frac{\partial S}{\partial b}\amp= 0, \end{aligned} \end{equation*}
however, the absolute values in the definition of \(S\) make \(S\) nondifferentiable at a number of points.
There are a number of techniques for finding such a minimum, but we won’t go into the details here. Applying one of them to this problem results in \(a=0.3749\) and \(b=1.1255\text{.}\) A plot of the line and the data is:
Figure 6.2.4.
and as can be seen, the line appears to pass through the first and third points.
Because \(S\) is not differentiable, this is a difficult minimization problem, and in practice the problem is rarely solved as stated above.
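As a sanity check, the reported minimizer can be probed numerically. The following sketch (plain Python; the helper name `S` simply mirrors the objective defined above) evaluates \(S\) at the reported values and at nearby points:

```python
# Sanity check of the least-absolute-deviations fit reported in the text:
# (a, b) = (0.3749, 1.1255) for the line L(x) = a*x + b.
data = [(1, 1.5), (3, 2), (5, 3), (6, 4)]

def S(a, b):
    """Total vertical distance from the line to the data (the ell-1 objective)."""
    return sum(abs(a * x + b - y) for x, y in data)

best = S(0.3749, 1.1255)
# Perturbing either coefficient increases S, consistent with a minimum.
nearby = [S(0.3749 + da, 1.1255 + db)
          for da, db in [(0.01, 0), (-0.01, 0), (0, 0.01), (0, -0.01)]]
```

Note that the objective is flat in places and has kinks, which is exactly why derivative-based minimization fails here.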
An alternative is to define
\begin{equation} S(a,b)= \sum_{i=1}^{n} (L(x_{i})-y_{i})^{2} = \sum_{i=1}^{n} \bigl( ax_{i}+b-y_{i}\bigr)^{2},\tag{6.2.1} \end{equation}
which can be written using the \(\ell_{2}\) norm,
\begin{equation*} S(a,b) = ||L(\mathbf{x})-\mathbf{y}||_{2}^{2}. \end{equation*}
The function \(S(a,b)\) is differentiable, and the system \(\partial S/\partial a=0\text{,}\) \(\partial S/\partial b=0\) can be shown to have a unique solution.
To find the minimum, as is typically done, we take the derivatives with respect to \(a\) and \(b\) and set them to zero:
\begin{equation*} \begin{aligned} \frac{\partial S}{\partial a}\amp= \sum_{i=1}^{n} 2 \bigl(ax_{i} + b-y_{i}\bigr) x_{i} = 0,\amp\frac{\partial S}{\partial b}\amp= \sum_{i=1}^{n} 2 \bigl(ax_{i} + b-y_{i}\bigr) = 0. \end{aligned} \end{equation*} Dividing by 2 and distributing the sums gives \begin{equation*} \begin{aligned} a \sum_{i=1}^{n} x_{i}^{2} + b \sum_{i=1}^{n} x_{i} - \sum_{i=1}^{n} x_{i} y_{i} \amp= 0,\amp a \sum_{i=1}^{n} x_{i} + b n - \sum_{i=1}^{n} y_{i} \amp= 0. \end{aligned} \end{equation*}
These two equations are linear in \(a\) and \(b\) and can easily be solved as:
\begin{equation} \begin{aligned}a\amp= \frac{\displaystyle n \sum_{i=1}^{n} x_{i} y_{i} - \biggl(\sum_{i=1}^{n} x_{i} \biggr)\biggl(\sum_{i=1}^{n} y_{i} \biggr)}{\displaystyle n \sum_{i=1}^{n} x_{i}^{2} - \biggl(\sum_{i=1}^{n} x_{i} \biggr)^{2} },\amp b\amp= \frac{1}{n}\biggl( \sum_{i=1}^{n} y_{i} - a\sum_{i=1}^{n} x_{i} \biggr) . \end{aligned}\tag{6.2.2} \end{equation}

Example 6.2.5.

Find the least squares regression line to fit the data \(\{(1,1.5),(3,2),(5,3),(6,4)\}\text{.}\)
Solution.
For this problem, using the formulas in (6.2.2), the results are \(a=0.48305\) and \(b=0.81355\text{.}\) A plot of this line and the data is:
Figure 6.2.6.
The line in this plot differs from the previous one. The line above does not pass through any of the points, whereas the line in Figure 6.2.4 passes through two of them. In short, there are two different best-fit lines because each minimizes a different norm.
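The formulas in (6.2.2) are easy to check in code. The following sketch (plain Python; variable names such as `sxy` are illustrative) evaluates them for the data in this example:

```python
# Least-squares line via the closed-form sums in (6.2.2).
data = [(1, 1.5), (3, 2), (5, 3), (6, 4)]
n = len(data)
sx  = sum(x for x, _ in data)      # sum of x_i
sy  = sum(y for _, y in data)      # sum of y_i
sxy = sum(x * y for x, y in data)  # sum of x_i * y_i
sxx = sum(x * x for x, _ in data)  # sum of x_i^2

a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # slope
b = (sy - a * sx) / n                          # intercept
```

The computed values match the slope and intercept quoted in the solution above.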

Subsection 6.2.1 Matrix Form of the Least Squares Regression

Although the formulas in (6.2.2) are not difficult, we will generalize to other functional forms, and writing the problem in a more convenient matrix form makes the results easier to follow.
Let
\begin{equation*} \begin{aligned}\mathbf{X}\amp= \begin{bmatrix}1\amp x_{1} \\ 1\amp x_{2} \\ \vdots\amp\vdots\\ 1\amp x_{n}\end{bmatrix},\amp\mathbf{y}\amp= \begin{bmatrix}y_{1} \\ y_{2} \\ \vdots \\ y_{n}\end{bmatrix}.\end{aligned} \end{equation*}
Then (6.2.1) can be written as
\begin{equation} S =\bigl(\mathbf{X}\mathbf{a}- \mathbf{y}\bigr)^{T} \bigl(\mathbf{X}\mathbf{a}- \mathbf{y}\bigr) \qquad \text{where $\displaystyle \mathbf{a}= \begin{bmatrix}b \\ a\end{bmatrix}$}.\tag{6.2.3} \end{equation} Note the ordering of the components: since the first column of \(\mathbf{X}\) is all ones, the first component of \(\mathbf{a}\) is the intercept \(b\text{.}\)
The sum \(S\) can be written as \(||\mathbf{X}\mathbf{a}-\mathbf{y}||_{2}^{2}\text{.}\) Using the fact that \(D_{\mathbf{x}}\, \mathbf{F(x)}^{T} \mathbf{F(x)}= 2\mathbf{F'(x)}^{T}\mathbf{F(x)}\text{,}\) the derivative of \(S\) with respect to \(\mathbf{a}\) is
\begin{equation*} \begin{aligned}S_{\mathbf{a}}\amp= 2\mathbf{X}^{T} \bigl(\mathbf{X}\mathbf{a}- \mathbf{y}\bigr).\end{aligned} \end{equation*}
Setting \(S_{\mathbf{a}}=\mathbf{0}\) and solving for \(\mathbf{a}\) results in:
\begin{equation} \mathbf{a}= \begin{bmatrix}b \\ a\end{bmatrix}=(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{y}\tag{6.2.4} \end{equation}
where \(a\) and \(b\) agree with the values in (6.2.2).
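A minimal sketch of (6.2.4) in code, assuming NumPy is available. Rather than forming the inverse explicitly, it solves the normal equations \(\mathbf{X}^{T}\mathbf{X}\,\mathbf{a} = \mathbf{X}^{T}\mathbf{y}\) directly, which is the numerically preferred route:

```python
import numpy as np

# Data from the running example.
x = np.array([1.0, 3.0, 5.0, 6.0])
y = np.array([1.5, 2.0, 3.0, 4.0])

# The matrix X: a column of ones and a column of x-values.
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X a = X^T y.
# a[0] is the intercept, a[1] is the slope (matching the column order of X).
a = np.linalg.solve(X.T @ X, X.T @ y)
```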
The next example shows how to use the matrix form of the least-squares line to find the best-fit line from Example 6.2.5.

Example 6.2.7.

Find the least-squares line that best fits the data in Example 6.2.5 using the matrix form in (6.2.4).
Solution.
In this case,
\begin{equation*} \begin{aligned}\mathbf{X}\amp= \begin{bmatrix}1 \amp 1 \\ 1 \amp 3 \\ 1 \amp 5 \\ 1 \amp 6 \\\end{bmatrix},\amp\mathbf{y}\amp= \begin{bmatrix}1.5 \\ 2 \\ 3 \\ 4\end{bmatrix},\end{aligned} \end{equation*}
and \(\mathbf{X}^{T}\mathbf{X}\) and its inverse are
\begin{equation*} \begin{aligned}\mathbf{X}^{T}\mathbf{X}\amp= \begin{bmatrix}4\amp15 \\ 15\amp71\end{bmatrix},\amp(\mathbf{X}^{T}\mathbf{X})^{-1}\amp= \frac{1}{59}\begin{bmatrix}71\amp-15 \\ -15\amp4\end{bmatrix},\end{aligned} \end{equation*}
and finally using (6.2.4),
\begin{equation*} \begin{aligned} \mathbf{a}\amp= (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{y}\\\amp= \frac{1}{59}\begin{bmatrix}71\amp-15 \\ -15\amp4\end{bmatrix} \begin{bmatrix}1 \amp 1 \amp 1 \amp 1 \\ 1 \amp 3 \amp 5 \amp 6\end{bmatrix} \begin{bmatrix}1.5 \\ 2 \\ 3 \\ 4\end{bmatrix}, \\\amp= \begin{bmatrix}48/59 \\ 57/118\end{bmatrix} \approx \begin{bmatrix}0.813559 \\ 0.483051\end{bmatrix}. \end{aligned} \end{equation*}
The vector \(\mathbf{a}\) represents the \(y\)-intercept (first component) and slope (second component) of the line, so the least squares line for this problem is:
\begin{equation*} \begin{aligned}L(x)\amp= 0.813559+ 0.483051 x\end{aligned} \end{equation*}
and this is the same result as in Example 6.2.5.
The previous example put the least-squares line in a different context. More importantly, this formulation allows us to extend from a least-squares line to any polynomial, as we will see in the next subsection.

Subsection 6.2.2 Least-Squares Polynomials

Now that the matrix form of the least-squares line has been derived, we can show that the same formula holds for higher-order polynomials. Suppose the data is given as \(\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{n},y_{n})\}\text{,}\) and let
\begin{equation*} \begin{aligned}\mathbf{X}\amp= \begin{bmatrix}1\amp x_{1}\amp x_{1}^{2}\amp\cdots\amp x_{1}^{m} \\ 1\amp x_{2}\amp x_{2}^{2}\amp\cdots\amp x_{2}^{m} \\ 1\amp x_{3}\amp x_{3}^{2}\amp\cdots\amp x_{3}^{m} \\ \vdots\amp\amp\amp\ddots\amp\vdots \\ 1\amp x_{n}\amp x_{n}^{2}\amp\cdots\amp x_{n}^{m} \\\end{bmatrix},\amp\mathbf{y}\amp= \begin{bmatrix}y_{1} \\ y_{2} \\ \vdots \\ y_{n}\end{bmatrix}.\end{aligned} \end{equation*}
We now desire to find the \(m\)th degree polynomial
\begin{equation*} P(x) = a_{0} + a_{1} x + a_{2} x^{2} + \cdots + a_{m} x^{m}, \end{equation*}
that best fits the given data in the least squares sense. More precisely, we seek the coefficients \(a_{0}, a_{1}, a_{2}, \ldots, a_{m}\) that minimize
\begin{equation} S= ||P(\mathbf{x})-\mathbf{y}||_{2}^{2} = \sum_{i=1}^{n} (P(x_{i})-y_{i})^{2} = ||\mathbf{X}\mathbf{a}-\mathbf{y}||_{2}^{2}\tag{6.2.5} \end{equation}
where \(\mathbf{x}\) is the column vector with the \(x\)-coordinates of the data and \(P(\mathbf{x})\) is the column vector consisting of the polynomial evaluated at each \(x\) value. Also,
\begin{equation} \mathbf{a} = \begin{bmatrix}a_{0} \\ a_{1} \\ a_{2} \\ \vdots \\ a_{m}\end{bmatrix}.\tag{6.2.6} \end{equation}
Since (6.2.5) is the same as that in (6.2.3), solving \(S_{\mathbf{a}}=0\) results in
\begin{equation} \mathbf{a} = ( \mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{y},\tag{6.2.7} \end{equation}
the same as (6.2.4).
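A sketch of the general formula (6.2.7), assuming NumPy is available; the data and degree below are made up purely for illustration. It also cross-checks the normal-equations solution against NumPy's own least-squares routine:

```python
import numpy as np

# Illustrative data and degree (not from the text).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.2, 2.9, 4.3, 4.9])
m = 2  # degree of the fitting polynomial

# Columns 1, x, x^2, ..., x^m -- the matrix X above.
X = np.vander(x, m + 1, increasing=True)

# Coefficients a_0, ..., a_m via the normal equations, as in (6.2.7) ...
a_normal = np.linalg.solve(X.T @ X, X.T @ y)
# ... and via NumPy's least-squares solver; the two should agree.
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice a QR- or SVD-based routine such as `lstsq` is preferred over forming \(\mathbf{X}^{T}\mathbf{X}\text{,}\) which can be badly conditioned.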
The next example uses this analysis to find the cubic polynomial that best fits the population data in the least-squares sense.

Example 6.2.8.

Find the cubic polynomial that best fits the population data at the beginning of this section.
Solution.
In this case, \(m=3\text{,}\) and
\begin{equation*} \begin{aligned}\mathbf{X}\amp= \begin{bmatrix}1\amp0\amp0^{2}\amp0^{3} \\ 1\amp10\amp10^{2}\amp10^{3} \\ 1\amp20\amp20^{2}\amp20^{3} \\ \vdots\amp\amp\amp\vdots \\ 1\amp100\amp100^{2}\amp100^{3}\end{bmatrix},\amp\mathbf{y}\amp= \begin{bmatrix}2.805 \\ 3.366 \\ \vdots \\ 6.349\end{bmatrix}.\end{aligned} \end{equation*}
The matrix \(\mathbf{X}^{T} \mathbf{X}\) and its inverse are:
\begin{equation*} \begin{aligned}\mathbf{X}^{T} \mathbf{X}\amp= \begin{bmatrix}11\amp550\amp38500\amp3025000 \\ 550\amp38500\amp3025000\amp253330000 \\ 38500\amp3025000\amp253330000\amp22082500000 \\ 3025000\amp253330000\amp22082500000\amp1978405000000\end{bmatrix}, \\ (\mathbf{X}^{T} \mathbf{X})^{-1}\amp= \frac{1}{6177600000}\begin{bmatrix}4881600000\amp-342000000\amp6480000\amp-36000 \\ -342000000\amp40480000\amp-930000\amp5720 \\ 6480000\amp-930000\amp23220\amp-150 \\ -36000\amp5720\amp-150\amp1\end{bmatrix}.\end{aligned} \end{equation*}
And finally the vector \(\mathbf{a}\) is
\begin{equation*} \begin{aligned}\mathbf{a}\amp= ( \mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{y}, \\\amp= \begin{bmatrix}2.87025174825175 \\ 0.0478714452214452 \\ -0.000211491841491841 \\ 8.12354312354316 \times 10^{-7}\end{bmatrix}.\end{aligned} \end{equation*}
This means that the best fit cubic in the least-squares sense is:
\begin{equation*} \begin{aligned}P(x)\amp= 2.87025 + 0.0478714 x -0.00021149 x^{2}+ 8.12354 \times 10^{-7}x^{3} .\end{aligned} \end{equation*}
To visualize this cubic, we plot it as well as the data:
Figure 6.2.9.
As one can tell, the cubic approximates the data quite well.
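This example can be reproduced with a short script, assuming NumPy is available:

```python
import numpy as np

# Massachusetts population data: x = years since 1900, y = population (millions).
years = np.arange(0.0, 101.0, 10.0)
pop = np.array([2.805, 3.366, 3.852, 4.249, 4.316, 4.690,
                5.148, 5.689, 5.737, 6.016, 6.349])

# Columns 1, x, x^2, x^3 -- the matrix X for the cubic (m = 3).
X = np.vander(years, 4, increasing=True)

# Coefficients a_0, ..., a_3 via the normal equations (6.2.7).
a = np.linalg.solve(X.T @ X, X.T @ pop)
```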

Subsection 6.2.3 Least-Squares and Interpolation

The idea of least-squares polynomial fitting is that the number of data points exceeds the number of coefficients of the polynomial (the degree plus one). In interpolation, the degree of the polynomial is chosen to be one fewer than the number of data points.
We can write the interpolation problem using the matrix form of the least-squares problem in the section above.

Example 6.2.10.

Find the quadratic polynomial that interpolates the data \(\{(0,0),(1,1),(3,2)\}\) using the matrix least-squares formulation in (6.2.7) and (6.2.6).
Solution.
In this case,
\begin{equation*} \begin{aligned}\mathbf{X}\amp= \begin{bmatrix}1 \amp 0 \amp 0 \\ 1 \amp 1 \amp 1 \\ 1 \amp 3 \amp 9\end{bmatrix}\amp\mathbf{y}\amp= \begin{bmatrix}0 \\ 1 \\ 2\end{bmatrix}\end{aligned} \end{equation*}
and the matrix \(\mathbf{X}^{T} \mathbf{X}\) and its inverse are:
\begin{equation*} \begin{aligned}\mathbf{X}^{T} \mathbf{X}\amp= \begin{bmatrix}3\amp4\amp10\\ 4\amp10\amp28 \\ 10\amp28\amp82\end{bmatrix}\amp(\mathbf{X}^{T} \mathbf{X})^{-1}\amp= \frac{1}{18}\begin{bmatrix}18\amp-24\amp6\\ -24\amp73\amp-22 \\ 6\amp-22\amp7\end{bmatrix}\end{aligned} \end{equation*}
and finally using (6.2.7),
\begin{equation*} \begin{aligned}\mathbf{a}\amp= (\mathbf{X}^{T} \mathbf{X})^{-1}\mathbf{X}^{T} \mathbf{y}\\\amp= \begin{bmatrix}0 \\ 7/6 \\ -1/6\end{bmatrix}\end{aligned} \end{equation*}
so the quadratic is
\begin{equation*} \begin{aligned}P(x)\amp= \frac{7}{6}x - \frac{1}{6}x^{2}\end{aligned} \end{equation*}
which is the same result as in Example 4.1.2 and Example 4.2.2.
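A sketch of this interpolation-as-least-squares computation, assuming NumPy is available. With three points and a quadratic, \(\mathbf{X}\) is square and invertible, so the least-squares "fit" passes through every point:

```python
import numpy as np

# The three interpolation points from the example.
x = np.array([0.0, 1.0, 3.0])
y = np.array([0.0, 1.0, 2.0])

# Columns 1, x, x^2 -- here X is 3x3 and invertible.
X = np.vander(x, 3, increasing=True)

# The least-squares formula (6.2.7); for square invertible X this
# coincides with solving X a = y directly.
a = np.linalg.solve(X.T @ X, X.T @ y)
```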