# Conditional expectations and regression with squared error loss

In this article we study the solution of the regression problem under squared error loss. We start with the theoretical formulation before turning to how the problem is tackled in practice.

## Theoretical formulation

Let $\vec{X} \in \realvset{\inputdim}$ be a random input vector and $Y \in \realset$ a random output value with joint distribution $\prob(\vec{X}, Y)$.

In a typical machine-learning setup, we have $\ndataset$ i.i.d. observation pairs $\paren{\inputvec_1, y_1}, ..., \paren{\inputvec_\ndataset, y_\ndataset}$ drawn from the joint distribution of $\paren{\vec{X}, Y}$.

In regression, we seek a function $\model\paren{\vec{X}}$ to predict the output value $Y$ given the input vector $\vec{X}$.

To quantify the errors, we choose the squared error loss:

$$L\paren{Y, \model\paren{\vec{X}}} = \paren{Y - \model\paren{\vec{X}}}^2$$

This choice leads to the expected prediction error as our criterion for choosing $\model$:

$$\mathrm{EPE}\paren{\model} = \mathbb{E}\left[ \paren{Y - \model\paren{\vec{X}}}^2 \right]$$

By conditioning on $\vec{X}$, we can rewrite the expectation as:

$$\mathrm{EPE}\paren{\model} = \mathbb{E}_{\vec{X}}\, \mathbb{E}_{Y \mid \vec{X}}\left[ \paren{Y - \model\paren{\vec{X}}}^2 \;\middle|\; \vec{X} \right]$$

It then suffices to minimize the inner expectation pointwise:

$$\model\paren{\inputvec} = \underset{c}{\arg\min}\; \mathbb{E}\left[ \paren{Y - c}^2 \;\middle|\; \vec{X} = \inputvec \right]$$

The solution is the conditional expectation, also known as the regression function:

$$\model\paren{\inputvec} = \mathbb{E}\left[ Y \mid \vec{X} = \inputvec \right]$$
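As a sanity check of this fact, a minimal NumPy simulation can compare the conditional mean against other predictors; the linear model $Y = 2X + \varepsilon$ and the competing predictors below are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a joint distribution in which E[Y | X = x] = 2x.
x = rng.uniform(-1.0, 1.0, size=100_000)
y = 2.0 * x + rng.normal(scale=0.5, size=x.size)

def mse(predict):
    """Monte Carlo estimate of E[(Y - f(X))^2] for a predictor f."""
    return np.mean((y - predict(x)) ** 2)

# The conditional mean should achieve the lowest error of the three.
errors = {
    "conditional mean 2x": mse(lambda x: 2.0 * x),
    "biased slope 1.5x": mse(lambda x: 1.5 * x),
    "constant 0": mse(lambda x: np.zeros_like(x)),
}
```

The error of the conditional mean approaches the irreducible noise variance ($0.5^2 = 0.25$), while every other predictor pays an additional penalty.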

## The k-nearest neighbor method

The nearest neighbor method attempts to implement this formula directly on the training dataset: the expectation is relaxed into a sample mean.

For each point $\inputvec$, we take the sample mean of the output values $\ioutputval{\idataset}$ such that $\ninputvec{\idataset} = \inputvec$. Since there is usually at most one such output value, we settle for a neighborhood $\gaussian_k(\inputvec)$ containing the $k$ input vectors closest to $\inputvec$:

$$\hat{\model}\paren{\inputvec} = \frac{1}{k} \sum_{\ninputvec{\idataset} \in \gaussian_k\paren{\inputvec}} \ioutputval{\idataset}$$

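A minimal sketch of this estimator in NumPy; the function name, the toy one-dimensional dataset, and the choice of Euclidean distance are ours for illustration:

```python
import numpy as np

def knn_predict(x_query, X_train, y_train, k):
    """Average the outputs of the k training inputs closest to x_query."""
    # Euclidean distance from the query to every training input.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest neighbors: the neighborhood N_k(x).
    nearest = np.argsort(dists)[:k]
    # The sample mean relaxes the conditional expectation.
    return y_train[nearest].mean()

# Toy 1-D dataset where the true regression function is the identity.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
pred = knn_predict(np.array([1.4]), X_train, y_train, k=2)
```

With $k = 2$, the query at $1.4$ averages the outputs of its two nearest inputs, $1$ and $2$, giving a prediction of $1.5$.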
This approach yields good results for large datasets in low dimensions. But as the dimension increases, the nearest neighbors are no longer close to $\inputvec$, so the estimate degrades (the curse of dimensionality), and evaluating the metric against every training point also becomes more expensive.

The k-nearest neighbor method assumes the regression function $\model$ is well approximated by a locally constant function. But we can make different assumptions.

## The model-based approach

Another approach is to constrain the regression function $\model$ to take a particular form.

For instance, when $\model = \linmodel{\linparamv}$ is assumed to be a linear function of the inputs:

$$\linmodel{\linparamv}\paren{\vec{X}} = \vec{X}^\top \linparamv$$

The regression is then called linear regression, and linear regression with squared error loss is called linear least squares.

Plugging this linear function into the formula and solving, we find the theoretical solution:

$$\linparamv = \mathbb{E}\left[ \vec{X} \vec{X}^\top \right]^{-1} \mathbb{E}\left[ \vec{X} Y \right]$$

Relaxing the expectations into sample means, we get the usual linear least squares solution (provided $M^\top M$ is invertible):

$$\hat{\linparamv} = \paren{M^\top M}^{-1} M^\top \vec{y}$$

where $M$ is the matrix whose $\idataset$-th row is the input vector $\ninputvec{\idataset}^\top$, and $\vec{y} = \paren{\ioutputval{1}, ..., \ioutputval{\ndataset}}^\top$ is the vector of observed outputs:

$$M = \begin{pmatrix} \ninputvec{1}^\top \\ \vdots \\ \ninputvec{\ndataset}^\top \end{pmatrix}$$
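The sample solution is easy to check numerically. A minimal NumPy sketch, where the synthetic data, the true coefficient vector, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a noisy linear model y_n = x_n^T beta + eps_n.
beta_true = np.array([1.0, -2.0, 0.5])
M = rng.normal(size=(200, 3))              # row n is the input vector x_n^T
y = M @ beta_true + rng.normal(scale=0.1, size=200)

# Normal-equations solution: beta_hat = (M^T M)^{-1} M^T y.
# Solving the linear system is more stable than forming the inverse.
beta_hat = np.linalg.solve(M.T @ M, M.T @ y)
```

With 200 observations and little noise, `beta_hat` recovers `beta_true` to within a few hundredths. In practice one would call `np.linalg.lstsq`, which also handles rank-deficient $M$.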