In this article we study the solution to a regression with squared error loss. We start with the theoretical formulation before tackling the problem in practice.
Theoretical formulation
Let be a random input vector and a random output value with joint distribution .
In a typical machine-learning setup, we have i.i.d. observations of and i.i.d. observations of .
In regression, we seek a function to predict the output value given the input vector .
To quantify the errors, we choose the squared error loss:
Which leads to our criterion for choosing :
By conditioning on , we can rewrite the expectation as:
It is enough to minimize it pointwise:
The solution is:
The k-nearest neighbor method
The nearest neighbor method attempts to directly implement this formula using the training dataset. To do so, expectation is relaxed into sample mean.
For each point , we take the sample mean of the output values such that . Since there is usually only one such output value, we settle for a neighborhood of the input vectors closest to around :
This approach yields good results for large datasets of lower dimensions. But as the dimension increases, so does the computational cost of the metric used to define the neighborhood.
k-nearest neighbor assumes the regression function is well approximated by locally constant function. But we can assume otherwise.
The model-based approach
Another approach is to constraint the regression function to take a particular form.
For instance, when is supposed to be a linear function:
The regression is called a linear regression. And a linear regression with squared error is called linear least squares.
Plugging this linear function into the formula and solving, we find the theoretical solution:
Relaxing the expectation into sample mean, we get the usual linear least square solution (provided it exists):
Where is the matrix whose -th row is the input vector :