Suppose we want to learn a function (e.g., a line) that describes the relationship between an independent and dependent variable. We saw the method of least squares as a way of finding parameters that minimize the sum of the squared error:
when $f(x) = mx + b$, the SSE is a function of two variables, the slope $m$ and the intercept $b$. Let's illustrate this with an example.
Suppose we want to predict the rent for an apartment here in the city based on its street number. We'll use numpy to define arrays that contain prices (our $y_i$'s) and street numbers (our $x_i$'s). You can add your own street numbers and prices based on what you think is reasonable or what you've paid in the past. There seems to be a trend wherein you pay more to live in midtown and less to live further uptown in the Bronx or far out in Brooklyn.
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
prices = np.array([2000, 3500, 1500, 1200, 1800, 5500, 1750,10000])
streets = np.array([180, 127, 220, 190, 215, 86, 170, 23])
Now let's plot the data:
plt.scatter(streets,prices)
plt.xlabel('Street Number')
plt.ylabel('Monthly Rent')
plt.title('Rent vs Street')
Least squares a method to fit a line through a set of points. We know that the line is parameterized by its slope and its intercept. We want to define a function that minimizes the distance between itself and each point. Why did we square that distance?
Let's substitute $z = (y_i - (mx_i + b))$ and plot the function $f(z) = z^2$:
t = np.arange(-5., 5., 0.01)
plt.plot(t, t**2, '-')
plt.show()
This parabola or quadratic function has a nice property. We say that a function is convex if:
\begin{equation} \forall z_1, z_2 \in Z \qquad \forall t \in [0,1]: \qquad f(tz_1 + (1-t)z_2) \leq tf(z_1) + (1-t)f(z_2) \end{equation}This is a fancy way of saying if we connect any two points on the function, the line connecting them will be above the function. We also say that the chord is above the graph. This function has a unique minimum when the derivative with respect to $z$ is zero.
Now we have an idea. Define the line through the set of points as the one which minimizes the sum of squared error (SSE) or equivalently the mean squared error (MSE). Note that our SSE is a function of two variables, $m$ and $b$ (you can't change the street numbers or the prices, only the line that you fit through them) and so we'll need to plot the function of 2 variables in 3D.
We have two equations and two unknowns. In order to find the values of $m$ and $b$ that minimize the SSE we can rewrite the above as the following system of equations:
One way to solve this system would be to use Gaussian elimination. We can express the result using matrices:
\begin{equation} \left( \begin{array}{cc} (\sum\limits_{i=1}^{n}x_i)^2 & \sum\limits_{i=1}^{n}x_i \\ \sum\limits_{i=1}^{n}x_i & n \end{array} \right) \left( \begin{array}{c} m \\ b \end{array} \right) = \left( \begin{array}{c} \sum\limits_{i=1}^{n} y_ix_i \\ \sum\limits_{i=1}^{n}y_i \end{array} \right) \end{equation}Can you form the augmented matrix $[A | b]$ using the arrays defined above and solve using Gaussian elimination?
Another way to solve is by inverting the matrix $A$:
\begin{equation} \left( \begin{array}{c} m \\ b \end{array} \right) = \left( \begin{array}{cc} (\sum\limits_{i=1}^{n}x_i)^2 & \sum\limits_{i=1}^{n}x_i \\ \sum\limits_{i=1}^{n}x_i & n \end{array} \right)^{-1} \left( \begin{array}{c} \sum\limits_{i=1}^{n} y_ix_i \\ \sum\limits_{i=1}^{n}y_i \end{array} \right) \end{equation}What if, instead of a straight line, we wanted to fit a higher order polynomial? We would have to take derivatives with respect to each of the coefficients of the polynomial. Can you work out a three by three system of equations if we decided to fit the function $f(x) = ax^2 + bx + c$? Write the SSE, take parials with respect to $a, b, c$, set to zero and solve.
This is quickly becoming tedious. We want to derive a general formula. Let's use bold characters to denote vectors and capital letters to denote matrices.
FOIL terms using the fact that $(AB)^T = B^TA^T$:
Why can we combine the second and third terms? It's worthwhile to write out shapes representing the dimension of these objects as row or column vectors and matrices.
Now take the gradient, using the fact that differentiating matrix and vector expressions is similar to differentiating scalars:
Once again we get a system that we can solve using Gaussian elimination or by inverting the matrix $A = X^TX$. The former is known as the Normal Equation:
\begin{equation} \beta = (X^TX)^{-1}X^T\mathbf{y} \end{equation}To get you started, here is a template below (you may use the pseudo inverse function pinv rather than inv):
from numpy import linalg