Conditional expectations and regression with squared error loss

In this article we study the solution to a regression with squared error loss. We start with the theoretical formulation before tackling the problem in practice.

Theoretical formulation

Let $X \in \mathbb{R}^p$ be a random input vector and $Y \in \mathbb{R}$ a random output value with joint distribution $\Pr(X, Y)$.

In a typical machine-learning setup, we have i.i.d. observations $x_1, \ldots, x_N$ of $X$ and i.i.d. observations $y_1, \ldots, y_N$ of $Y$.

In regression, we seek a function $f(X)$ to predict the output value $Y$ given the input vector $X$.

To quantify the errors, we choose the squared error loss:

$$L(Y, f(X)) = (Y - f(X))^2$$

This leads to our criterion for choosing $f$, the expected prediction error:

$$\mathrm{EPE}(f) = \mathbb{E}\big[(Y - f(X))^2\big]$$

By conditioning on $X$, we can rewrite the expectation as:

$$\mathrm{EPE}(f) = \mathbb{E}_X \, \mathbb{E}_{Y \mid X}\big[(Y - f(X))^2 \mid X\big]$$

It is then enough to minimize it pointwise:

$$f(x) = \operatorname*{arg\,min}_c \, \mathbb{E}_{Y \mid X}\big[(Y - c)^2 \mid X = x\big]$$

The solution is the conditional expectation, also known as the regression function:

$$f(x) = \mathbb{E}[Y \mid X = x]$$

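As a quick numerical sanity check (a sketch; the conditional distribution below is made up for illustration), we can simulate draws of $Y$ given $X = x$ and verify that the sample mean minimizes the average squared error among constant predictions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up conditional distribution: Y | X = x is Normal(2x + 1, 1).
x = 0.5
y = 2 * x + 1 + rng.standard_normal(100_000)

# Estimate E[(Y - c)^2 | X = x] for a grid of candidate constants c.
candidates = np.linspace(0.0, 4.0, 401)
errors = [np.mean((y - c) ** 2) for c in candidates]

best = candidates[np.argmin(errors)]
print(best)        # close to E[Y | X = x] = 2 * 0.5 + 1 = 2.0
print(np.mean(y))  # the sample mean, also close to 2.0
```

The minimizer of the estimated squared error lands on the grid point closest to the sample mean, as the theory predicts.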
The k-nearest neighbor method

The nearest neighbor method attempts to implement this formula directly using the training dataset. To do so, the expectation is relaxed into a sample mean.

For each point $x$, we take the sample mean of the output values $y_i$ such that $x_i = x$. Since there is usually at most one such output value, we settle for a neighborhood $N_k(x)$ of the $k$ input vectors closest to $x$:

$$\hat{f}(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$$

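A minimal NumPy sketch of this estimator (the toy dataset, the Euclidean metric, and the choice of $k$ are illustrative assumptions):

```python
import numpy as np

def knn_regress(x0, X, y, k=5):
    """Predict f(x0) as the average output over the k nearest inputs.

    X: (N, p) array of input vectors; y: (N,) array of output values.
    """
    distances = np.linalg.norm(X - x0, axis=1)  # Euclidean metric
    neighbors = np.argsort(distances)[:k]       # indices of N_k(x0)
    return y[neighbors].mean()                  # Ave(y_i | x_i in N_k(x0))

# Toy data: the output is the sum of the inputs plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
y = X.sum(axis=1) + 0.1 * rng.standard_normal(500)

print(knn_regress(np.array([0.2, 0.3]), X, y, k=10))  # roughly 0.5
```

With $k = 1$ the estimator simply returns the output of the single closest training point, which is why larger neighborhoods are needed to average out the noise.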
This approach yields good results for large datasets in low dimensions, but as the dimension increases, so does the computational cost of the metric used to define the neighborhood.

The k-nearest neighbor method assumes the regression function is well approximated by a locally constant function. But we can assume otherwise.

The model-based approach

Another approach is to constrain the regression function to take a particular form.

For instance, when $f$ is supposed to be a linear function:

$$f(X) \approx X^T \beta$$

The regression is then called a linear regression, and a linear regression with squared error loss is called linear least squares.

Plugging this linear function into the expected prediction error and solving, we find the theoretical solution:

$$\beta = \big[\mathbb{E}(X X^T)\big]^{-1} \mathbb{E}(X Y)$$

Relaxing the expectations into sample means, we get the usual linear least squares solution (provided it exists):

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

where $\mathbf{X}$ is the $N \times p$ matrix whose $i$-th row is the input vector $x_i^T$, and $\mathbf{y}$ is the vector of output values.
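As a sketch of this solution in NumPy (the data-generating coefficients and noise level are made up for illustration), we can recover the coefficients of a known linear model via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from a known linear model.
N, p = 200, 3
true_beta = np.array([1.5, -2.0, 0.5])
X = rng.standard_normal((N, p))  # i-th row is the input vector x_i
y = X @ true_beta + 0.1 * rng.standard_normal(N)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y,
# solved as a linear system rather than by forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1.5, -2.0, 0.5]
```

Solving the linear system with `np.linalg.solve` (or `np.linalg.lstsq`) is preferred in practice over computing the matrix inverse, which is more expensive and less numerically stable.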