The Moore-Penrose inverse of a matrix is used to approximately solve a degenerate system of linear equations.
In short
Let $X$ be a matrix and $\vec{y}$ a vector. Consider the following equation:
$$\vec{y} - X\vec{w} = \vec{0}$$
When the matrix $X$ is invertible, the equation admits a unique solution:
$$\vec{w} = X^{-1}\vec{y}$$
But what if $X$ is not invertible? If the residual cannot be the null vector, the next best thing is a residual of minimal length. So let's try to solve the following problem instead:
$$\operatorname*{argmin}_{\vec{w}} \|\vec{y} - X\vec{w}\|$$
whose solution is given by the Moore-Penrose pseudo-inverse of $X$ (when it exists):
$$\vec{w} = X^\dagger \vec{y}$$
What is this pseudo-inverse matrix?
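As a quick numerical illustration (a minimal sketch assuming NumPy; the matrix and vector are made up for the example), `np.linalg.pinv` computes $X^\dagger$ directly:

```python
import numpy as np

# A 3x2 system: more equations than unknowns, no exact solution in general.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 0.0])

# np.linalg.pinv computes the Moore-Penrose pseudo-inverse (via the SVD).
w = np.linalg.pinv(X) @ y

print(np.linalg.norm(y - X @ w))  # the minimal residual norm
```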
Setup
Let $X \in \mathbb{R}^{N \times D}$ be a matrix and $\vec{y} \in \mathbb{R}^N$ a vector. The following equation:
$$\vec{y} - X\vec{w} = \vec{0}$$
has a solution only when $\vec{y} \in \operatorname{span}(X)$, which means that $\vec{y}$ is in the linear subspace spanned by the columns of $X$.
When that is not the case, we can look for an approximate solution by minimizing the norm of the residual vector:
$$\operatorname*{argmin}_{\vec{w}} \|\vec{y} - X\vec{w}\|$$
When this norm is the Euclidean 2-norm, we know from geometry that the solution is the orthogonal projection of $\vec{y}$ onto $\operatorname{span}(X)$, but let's derive it analytically.
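We can check the projection claim numerically: the least-squares residual should be orthogonal to every column of $X$. A small sketch, with random data for illustration and assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # y will generally not lie in span(X)
y = rng.normal(size=5)

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||y - Xw||_2
residual = y - X @ w

# The residual is orthogonal to span(X): X^T r is ~0 up to rounding.
print(X.T @ residual)
```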
Finding a solution
Recall that the 2-norm of a vector $\vec{v} \in \mathbb{R}^D$ is:
$$\|\vec{v}\|_2 = \sqrt{\sum_{i=1}^{D} v_i^2}$$
and we can minimize its square instead.
The directional derivative of the squared norm $\|\vec{y} - X\vec{w}\|_2^2$ along a vector $\partial\vec{u}$ is (proof):
$$-2(\vec{y} - X\vec{w}) \cdot (X\,\partial\vec{u}) = -2\left[X^\top(\vec{y} - X\vec{w})\right] \cdot \partial\vec{u}$$
And since $\|\vec{y} - X\vec{w}\|_2^2$ is convex in $\vec{w}$, the minimum is reached when the directional derivative is 0 in every direction $\partial\vec{u}$, hence:
$$X^\top(\vec{y} - X\vec{w}) = \vec{0}$$
which is the matrix form of the normal equations.
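In code, the normal equations can be solved directly. A sketch with random, full-column-rank data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))  # full column rank with probability 1
y = rng.normal(size=6)

# Solve the normal equations X^T X w = X^T y for w.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Check the optimality condition X^T (y - X w) = 0, up to rounding.
print(X.T @ (y - X @ w))
```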
The Moore-Penrose inverse
Rearranging the equation above (and assuming for now that $X^\top X$ is invertible, more on this below), we have:
$$\vec{w} = (X^\top X)^{-1} X^\top \vec{y}$$
where the matrix $X^\dagger = (X^\top X)^{-1} X^\top$ is called the Moore-Penrose inverse of $X$.
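For illustration, a sketch (assuming NumPy, with a random full-column-rank matrix) comparing this closed form to `np.linalg.pinv`, which computes the pseudo-inverse via the SVD and also handles the rank-deficient case:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 2))  # full column rank with probability 1

# Explicit formula, valid when X^T X is invertible.
X_dagger = np.linalg.inv(X.T @ X) @ X.T

# np.linalg.pinv agrees with the closed form in this case.
print(np.allclose(X_dagger, np.linalg.pinv(X)))
```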
Uniqueness of the solution
The matrix $X^\top X$ is a square matrix of size $D \times D$, called the Gram matrix. The Moore-Penrose inverse, in this closed form, exists when the Gram matrix is invertible.
The Gram matrix $X^\top X$ is invertible exactly when the matrix $X$ has full column rank, i.e. when $\operatorname{rank}(X) = D$.
When $\operatorname{rank}(X) < D$ (which necessarily happens when $N < D$), the system is underdetermined and the Gram matrix is not invertible. This means that there exist infinitely many solutions. Since it is a minimization problem, we can use a numerical solver such as gradient descent or stochastic gradient descent to find one (see the sketch below), even though in practical applications we can often reduce $D$ by removing some columns from $X$.
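A minimal gradient-descent sketch on such an underdetermined problem (random data; the learning rate and iteration count are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 5))  # N < D: underdetermined, X^T X is singular
y = rng.normal(size=3)

# Gradient descent on ||y - X w||^2; the gradient is -2 X^T (y - X w).
w = np.zeros(5)
lr = 0.01
for _ in range(5000):
    w += lr * 2 * X.T @ (y - X @ w)

# The residual is ~0: w is one of the infinitely many minimizers.
# (Started at 0, gradient descent converges to the minimum-norm one.)
print(np.linalg.norm(y - X @ w))
```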
Note that $\operatorname{rank}(X)$ is always at most the number of columns $D$ and at most the number of rows $N$, so the minimization problem itself is never overdetermined: it always has at least one solution, the orthogonal projection of $\vec{y}$ onto $\operatorname{span}(X)$.
Numerical issues
When some columns of $X$ are nearly collinear, the Gram matrix $X^\top X$ is ill-conditioned (its condition number is the square of that of $X$), leading to numerical issues when solving the normal equations. So it is better to solve the system through a QR factorization of $X$, which avoids forming $X^\top X$ altogether.
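A sketch of the QR approach (random data, assuming NumPy and SciPy): with the reduced factorization $X = QR$, we have $X^\top X = R^\top R$, so the normal equations reduce to $R\vec{w} = Q^\top\vec{y}$ after cancelling the invertible $R^\top$.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
y = rng.normal(size=6)

# Reduced QR: Q is 6x3 with orthonormal columns, R is 3x3 upper triangular.
Q, R = np.linalg.qr(X)

# Solve R w = Q^T y by back-substitution; X^T X is never formed.
w = solve_triangular(R, Q.T @ y)

print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))
```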