Data is ubiquitous throughout today’s society, and our ability to process and understand it is a necessity. A manageable example is understanding the population growth of a state, the United States, or the entire world. Consider the population of the state of Massachusetts from 1900 through 2000, which is given in the following table and plotted below:
The data is available only on the decades because the U.S. census is conducted in those years. It is desirable to know the population in the years between the known points, and in light of this we may consider using an interpolating polynomial. Skipping the details from Chapter 4, we plot the data together with the interpolating polynomial. The result is the following plot
In the middle of the plotting domain, the function fits the data and the expected trend reasonably well. However, for \(x \gt 90\) and \(x \lt 10\text{,}\) the function oscillates more than expected.
The problem in this example is that a high-order polynomial (here of degree 10) is probably not the best functional form to represent the data. Either a lower-order polynomial (perhaps 2nd or 3rd order) or another function, such as an exponential, may fit better.
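This effect can be seen without the census table itself. The sketch below (an assumption, not the book's code) uses synthetic data — a smooth growth trend plus noise, not the Massachusetts values — and `numpy` to contrast a degree-10 interpolating polynomial, which is forced through every point including the noise, with a low-order fit that only follows the trend.

```python
import numpy as np

# Synthetic stand-in for the census table: 11 equally spaced "years"
# (decades 0, 1, ..., 10) with a smooth growth trend plus noise.
# These are NOT the Massachusetts population values.
rng = np.random.default_rng(0)
x = np.arange(11.0)                              # decades since 1900
y = 2.8 + 0.35 * x + rng.normal(scale=0.1, size=11)

# Degree-10 polynomial: 11 coefficients for 11 points, so it
# interpolates every data value (noise included) exactly ...
p10 = np.polynomial.Polynomial.fit(x, y, deg=10)

# ... while a quadratic least-squares fit smooths the noise
# instead of reproducing it.
p2 = np.polynomial.Polynomial.fit(x, y, deg=2)

print(np.max(np.abs(p10(x) - y)))   # essentially zero at the nodes
print(np.max(np.abs(p2(x) - y)))    # small but nonzero residual
```

Between the nodes, especially near the ends of the interval, the degree-10 curve is free to swing, which is the oscillation seen in the plot above.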
To explain how we can fit data, let’s look at a simpler example. Consider the data \(\{(1,1.5)\text{,}\)\((3,2)\text{,}\)\((5,3)\text{,}\)\((6,4)\}\text{,}\) which is plotted below:
It is fairly clear that no line can pass through all four of these points, so we seek a line that is “close”. One way to fit a line to the data would be to find \(a\) and \(b\) for the function \(L(x)=ax+b\) such that
There are a number of techniques for finding such a minimum, but we won’t go into the details here. A minimization technique for this problem results in \(a=0.3749\) and \(b=1.1255\text{.}\) A plot of the line and the data is:
The function \(S(a,b)\) is differentiable, and the system obtained by setting \(\partial S/\partial a=0\) and \(\partial S/\partial b=0\) can be shown to have a unique solution.
The line in this plot differs from the previous one. First, the line above does not pass through any of the points, whereas the earlier line passes through two of them. In short, there are two different best-fit lines because each minimizes a different norm.
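Assuming \(S(a,b)\) is the sum of squared residuals \(\sum_i (a x_i + b - y_i)^2\text{,}\) the equations \(\partial S/\partial a=0\) and \(\partial S/\partial b=0\) reduce to a \(2\times 2\) linear system, and this sketch (using `numpy`, not the book's code) solves it for the four data points:

```python
import numpy as np

# The four data points from the example.
x = np.array([1.0, 3.0, 5.0, 6.0])
y = np.array([1.5, 2.0, 3.0, 4.0])

# Setting dS/da = 0 and dS/db = 0 for
# S(a,b) = sum((a*x_i + b - y_i)^2) gives the normal equations:
#   [ sum(x^2)  sum(x) ] [a]   [ sum(x*y) ]
#   [ sum(x)    n      ] [b] = [ sum(y)   ]
n = len(x)
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    n        ]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)

print(a, b)   # approximately 0.4831 and 0.8136
```

These coefficients differ from the pair \(a=0.3749\text{,}\) \(b=1.1255\) quoted earlier, consistent with the two lines minimizing different norms.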
Subsection 6.2.1 Matrix Form of the Least Squares Regression
Although the formulas in (6.2.2) are not that difficult, we will generalize to other functional forms, and writing the problem in a more convenient form makes the results easier to follow.
The sum \(S\) can be written as \(||\mathbf{X}\mathbf{a}-\mathbf{y}||_{2}^{2}\text{.}\) Using the fact that \(D_{\mathbf{x}}\, \mathbf{F}(\mathbf{x})^{T} \mathbf{F}(\mathbf{x})= 2\mathbf{F}'(\mathbf{x})^{T}\mathbf{F}(\mathbf{x})\text{,}\) the derivative of \(S\) with respect to \(\mathbf{a}\) is
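The differentiation rule can be checked numerically. For \(\mathbf{F}(\mathbf{a})=\mathbf{X}\mathbf{a}-\mathbf{y}\) the Jacobian is \(\mathbf{F}'=\mathbf{X}\text{,}\) so the gradient of \(S\) should be \(2\mathbf{X}^{T}(\mathbf{X}\mathbf{a}-\mathbf{y})\text{.}\) This sketch (an illustrative check, assuming `numpy`) compares that formula with a finite-difference approximation at an arbitrary point:

```python
import numpy as np

# Data from the running example; columns of X are [1, x].
x = np.array([1.0, 3.0, 5.0, 6.0])
y = np.array([1.5, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])

a = np.array([0.7, 0.9])               # arbitrary test point (b, slope)
grad = 2 * X.T @ (X @ a - y)           # closed-form gradient of S

# Centered finite-difference approximation of the gradient of
# S(a) = ||X a - y||_2^2, one coordinate direction at a time.
h = 1e-6
S = lambda a: np.sum((X @ a - y) ** 2)
fd = np.array([(S(a + h * e) - S(a - h * e)) / (2 * h)
               for e in np.eye(2)])

print(np.max(np.abs(grad - fd)))       # agreement to rounding error
```

Since \(S\) is quadratic in \(\mathbf{a}\text{,}\) the centered difference is exact up to rounding, so the two gradients agree closely.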
The vector \(\mathbf{a}\) represents the \(y\)-intercept (first component) and slope (second component) of the line, so the least squares line for this problem is:
The previous example put the least-squares line in a different context. However, this formulation allows us to extend from a least-squares line to any polynomial as we will see in the next section.
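Setting the derivative above to zero gives the normal equations \(\mathbf{X}^{T}\mathbf{X}\,\mathbf{a}=\mathbf{X}^{T}\mathbf{y}\text{.}\) A minimal sketch of the matrix form (using `numpy`; not the book's code) for the four-point example:

```python
import numpy as np

# Least-squares line in matrix form: solve the normal equations
# (X^T X) a = X^T y, where the columns of X are [1, x].
x = np.array([1.0, 3.0, 5.0, 6.0])
y = np.array([1.5, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])

a = np.linalg.solve(X.T @ X, X.T @ y)
print(a)   # [intercept, slope], approximately [0.8136, 0.4831]
```

In practice `np.linalg.lstsq(X, y, rcond=None)` solves the same problem and is numerically preferable to forming \(\mathbf{X}^{T}\mathbf{X}\) explicitly, but the normal equations mirror the derivation above.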
Now that the matrix form of the least-squares line has been derived, we can show that the same formula holds for higher-order polynomials. Suppose the data is given as \(\{(x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{n},y_{n})\}\) and let
that best fits the given data in the least squares sense. More precisely, we seek the coefficients \(a_{0}, a_{1}, a_{2}, \ldots, a_{m}\) that minimize
where \(\mathbf{x}\) is the column vector with the \(x\)-coordinates of the data and \(P(\mathbf{x})\) is the column vector consisting of the polynomial evaluated at each \(x\) value. Also,
The idea of least-squares polynomial fitting is that the number of data points exceeds the number of coefficients of the polynomial (the degree of the polynomial plus one). In interpolation, by contrast, the degree of the polynomial is chosen to be one fewer than the number of data points.
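As a concrete sketch (assuming `numpy`; the specific degree is just an illustration), a quadratic has three coefficients, so fitting it to the four points of the earlier example is a least-squares problem rather than interpolation. The matrix \(\mathbf{X}\) now has columns \(1, x, x^2\text{,}\) and the same normal equations apply:

```python
import numpy as np

# Least-squares quadratic: more data points (4) than coefficients (3).
x = np.array([1.0, 3.0, 5.0, 6.0])
y = np.array([1.5, 2.0, 3.0, 4.0])

# Vandermonde-style matrix with columns 1, x, x^2, then the
# normal equations (X^T X) a = X^T y as in the linear case.
X = np.column_stack([x**0, x**1, x**2])
coeffs = np.linalg.solve(X.T @ X, X.T @ y)   # [a0, a1, a2]

# The same least-squares fit via numpy's polynomial routines
# (coefficients returned from lowest degree to highest).
coeffs_np = np.polynomial.polynomial.polyfit(x, y, deg=2)

print(coeffs)
```

With four points a degree-3 polynomial would interpolate exactly; keeping the degree lower is what makes this a least-squares approximation.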