The Moore-Penrose inverse of a matrix is used to approximatively solve a degenerate system of linear equations.

## In short

Let be a matrix and a vector. Consider the following equation:

When the matrix is invertible, the equation admits a unique solution:

But what if is not invertible? The next best thing to a null vector is a vector of minimal length. So let’s try and solve the following equation:

Whose solution is given by the Moore-Penrose pseudo-inverse of (when it exists):

What is this pseudo-inverse matrix?

## Setup

Let be a matrix and a vector. The following equation:

has a solution only when , which means that is in the linear subspace spanned by the columns of .

When it is not the case, we might want an approximate solution by minimizing the norm of the residual vector:

When this norm is the euclidean 2-norm, we know from geometry that the solution to this equation is the orthogonal projection of onto , but let’s derive it analytically.

## Finding a solution

Recall that the 2-norm of a vector is:

And we can minimize its square instead.

The directional derivative along a vector is (proof):

And (since the 2-norm squared is strictly convex) the minimum is reached when the directional derivatives are 0 in all direction , hence:

which is the matrix form of the normal equations.

## The Moore-Penrose inverse

Rearranging the equation above, we have:

Where the matrix is called the Moore-Penrose inverse of .

## Uniqueness of the solution

The matrix is a square matrix of dimension . It is called the Gram matrix. The Moore-Penrose inverse exists when the Gram matrix is invertible.

The Gram matrix is invertible when the matrix has full column rank, i.e. when .

When (which necessarily happens when ), we are in the case of an underdetermined system and the Gram matrix is not invertible. This means that there exist several solutions. Since it is a minimization problem, we can use a numerical solver such as gradient descent or stochastic gradient descent to find one, even though in practical applications we can often reduce by removing some columns from .

is always at most the column rank (up to ) and at most the row rank (up to ), so it can never happen that the system is overdetermined. In other words, there is always at least one solution.

## Numerical issues

When some columns of are nearly collinear, the matrix is ill-conditioned, leading to numerical issues when solving the linear system. So it is better to use the QR factorization to solve the system.