This article shows geometrically where the best estimates for the mean and variance of a normally distributed random vector can be found. We start with a simple question and derive both the geometrical meaning and parameter estimation method from scratch.
Goal
If you’re impatient to know where we’re headed, here are the geometrical insights we will develop in this article:
1) Given 2 observations $y_1$ and $y_2$ independently generated at random by the distribution $N(\mu, \sigma^2)$, our best estimators for $\mu$ and $\sigma$ are $\hat\mu$ and $\hat\sigma$ such that:
$$\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \hat\mu \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \frac{\hat\sigma}{\sqrt{2}} \begin{pmatrix} -1 \\ 1 \end{pmatrix}$$

2) More generally, given $n$ observations $\vec{y}_n = (y_1, \dots, y_n)$ independently generated at random by the same distribution $N(\mu, \sigma^2)$, our best estimators are:
$$\vec{y}_n = \hat\mu \left(\sqrt{n}\, U_n\right) + \hat\sigma \left(\sqrt{n-1}\; U_n^\perp\right)$$

where $\sqrt{n}$ and $\sqrt{n-1}$ are correction factors, needed because distances increase as the dimension increases. The meaning of $U_n$ and $U_n^\perp$ is illustrated in the picture below:
Introduction
We can think of a probability distribution as an engine able to generate values at random. A random vector is a vector whose components have been generated by such an engine.
Conceptually, it is useful to see the density function for a random vector as a cloud in $\mathbb{R}^n$ that indicates the plausible end points for the random vector: the vector is more likely to end in a region where the cloud is dense than in one where it is sparse.
Figure. Density cloud for a vector with a $N(\vec{0}, \sigma^2)$ distribution on the left and a $N(\vec{\mu}_Y, \sigma^2)$ distribution on the right.
For instance, the following image shows the “density cloud” of a normally distributed random vector. The components of the vector are generated by a normal distribution, and the visualization shows how this translates to 2D geometry.
Here, the shape of the “density cloud” for the random vector is determined by the parameters of each component: the shape of a normal distribution is controlled by its mean $\mu$ (= location of the center) and its variance $\sigma^2$ (= size of the cloud). When the variance for every component is the same, the cloud is a circle.
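If you would like to reproduce this kind of picture yourself, here is a minimal sketch using NumPy and Matplotlib; the values of $\mu$ and $\sigma$ are arbitrary, chosen only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0                         # illustrative values, not from the article

# Each row is a 2-component random vector whose components are drawn
# independently from the same N(mu, sigma^2) distribution.
points = rng.normal(mu, sigma, size=(5000, 2))

plt.scatter(points[:, 0], points[:, 1], s=2, alpha=0.3)
plt.gca().set_aspect("equal")                # equal variances => a circular cloud
plt.title("Density cloud of a normally distributed random vector")
plt.show()
```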
The shape of the normal distribution (or Gaussian distribution) is particularly interesting because it models measurement errors. We can think of it as a cloud that generates a target value $\mu$ with some measurement noise. The variance parameter $\sigma^2$ controls the amount of noise that is added. Among all its desirable features, the distribution is symmetric: across a very large number of measurements, we expect the errors to cancel each other out, so that the mean of the sample approximates the real value $\mu$. We will see later that it has a nice geometrical feature too.
To learn more about the normal distribution, check out this article: A probability distribution to model measurement errors.
Statistics is all about finding the location of the cloud when we have a few observations but we don’t know the parameters μ and σ. As we will see, the normal distribution has a nice property that allows us to visualize geometrically the process of estimating those parameters.
Finding the cloud
Suppose for instance that we have two observations $y_1$ and $y_2$ independently generated at random by the same normal distribution $N(\mu, \sigma^2)$.
We would like to estimate the most likely values for the parameters of the cloud: its center $\mu$ and its standard deviation $\sigma$. Basically, this means that we will try to find the “best guess” for those values based on the location of our observations.
Our best guess for the center is to place it where it has the highest probability to generate our observations. In statistical terms, we are looking for the maximum likelihood value of μ.
To find this maximum likelihood location, we need to study the formula for the cloud’s density more closely.
$$f(Y = y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$

Since both observations are generated independently, the joint density (= the density for both) is the product $f(Y_1 = y_1 \mid \mu, \sigma)\, f(Y_2 = y_2 \mid \mu, \sigma)$. Taking the negative logarithm of this product leaves, up to terms that do not depend on $\mu$, exactly the sum of squared deviations, so the product is maximal when $\mu$ minimizes this sum:
$$\hat\mu = \operatorname*{argmin}_{\mu}\; (y_1 - \mu)^2 + (y_2 - \mu)^2$$

You might recognize the ordinary least squares equation. This sum has a nice geometrical interpretation because it is exactly the expanded formula for the squared norm (= squared length) of the vector $\vec{y} - (\mu, \mu)$:
$$\hat\mu = \operatorname*{argmin}_{\mu}\; \left\| \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} - \mu \begin{pmatrix} 1 \\ 1 \end{pmatrix} \right\|^2$$

Geometry
So we are looking for the point on the line of direction $(1, 1)$ that is closest to $\vec{y} = (y_1, y_2)$.
This point is the orthogonal projection of $\vec{y}$ onto the unit vector $\vec{u}$ directed along the line. We can use the dot product $\cdot$ to find the projection coefficient, and multiply by the unit vector to get the projection:
$$\hat\mu \begin{pmatrix} 1 \\ 1 \end{pmatrix} = (\vec{y} \cdot \vec{u})\, \vec{u} \qquad \text{(equation 1)}$$

Yay! We found our best estimate for the center of the cloud!
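Here is a quick numerical check of equation 1; the two observations are made-up values used only for illustration:

```python
import numpy as np

y = np.array([3.1, 4.7])                    # two made-up observations
u = np.array([1.0, 1.0]) / np.sqrt(2)       # unit vector along the (1, 1) direction

projection = (y @ u) * u                    # orthogonal projection of y onto the line
print(projection)                           # [3.9 3.9]
print(((y[0] + y[1]) / 2) * np.array([1.0, 1.0]))  # mu_hat * (1, 1): the same vector
```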
To ease the transition to higher dimensions, let’s shorten our notation for the vector $(1, 1)$ to $I_2$, where the number 2 stands for the number of components in the vector. So that, for instance, $I_4 = (1, 1, 1, 1)$. Geometrically, we can see this vector as the diagonal of the $n$-dimensional (hyper)cube of side 1. The norm of this vector is simply the length of that diagonal: $\|I_2\| = \sqrt{2}$, or more generally, $\|I_n\| = \sqrt{n}$. This will prove useful later.
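A quick check of this fact (the value $n = 4$ is arbitrary):

```python
import numpy as np

n = 4                                        # arbitrary number of components
I_n = np.ones(n)                             # the diagonal of the unit hypercube
print(np.linalg.norm(I_n), np.sqrt(n))       # both print 2.0
```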
Likewise, let’s adopt a more flexible notation for our unit vector $\vec{u}$ and use $U_2$ for the unit vector directed along $I_2$. In math notation, this means that $I_n = \sqrt{n}\, U_n$ for all values of $n$. Since $U_n$ is a unit vector, we can use it to express the orthogonal projection of $\vec{y}$. With these notations, equation 1 becomes:
$$\hat\mu\, I_2 = (\vec{y} \cdot U_2)\, U_2$$

This is a vector equation. Let’s find the exact value of our estimate:
$$\begin{aligned}
\hat\mu\, I_2 = (\vec{y} \cdot U_2)\, U_2 &\iff \hat\mu\, \sqrt{2}\, U_2 = (\vec{y} \cdot U_2)\, U_2 \\
&\iff \hat\mu\, \sqrt{2} = \vec{y} \cdot U_2 \\
&\iff \hat\mu = (\vec{y} \cdot U_2)\, \frac{1}{\sqrt{2}}
\end{aligned}$$

We can rewrite this slightly to get the ordinary least squares solution $\hat\mu = \bar{y}$:
$$\begin{aligned}
\hat\mu = (\vec{y} \cdot U_2)\, \frac{1}{\sqrt{2}} &\iff \hat\mu = \left(\vec{y} \cdot \frac{I_2}{\sqrt{2}}\right) \frac{1}{\sqrt{2}} \\
&\iff \hat\mu = (\vec{y} \cdot I_2)\, \frac{1}{2} \\
&\iff \hat\mu = \frac{y_1 + y_2}{2}
\end{aligned}$$

This result generalizes easily to a higher number of observations. Suppose for instance that $n$ stands for a positive integer and that we have $n$ observations $y_1, \dots, y_n$ independently generated at random by a normal distribution $N(\mu, \sigma^2)$. If we write $\vec{y}_n = (y_1, \dots, y_n)$ for the random vector associated with our observations, we can find our best guess for the center of the cloud by projecting $\vec{y}_n$ onto $U_n$. This yields the following best guess for $\mu$:
$$\hat\mu = (\vec{y}_n \cdot U_n)\, \frac{1}{\sqrt{n}} = \frac{y_1 + \dots + y_n}{n}$$

From now on, I will use the general notation with $n$ to make clear that our results hold in higher dimensions. While reading, feel free to consider that $n = 2$ or $n = 3$ to visualize the geometry.
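The same computation in code, for an arbitrary $n$; the observations below are made-up values:

```python
import numpy as np

y = np.array([3.1, 4.7, 2.8, 5.2, 4.1])    # made-up observations, n = 5
n = len(y)
U_n = np.ones(n) / np.sqrt(n)               # unit vector along the diagonal I_n

mu_hat = (y @ U_n) / np.sqrt(n)             # projection coefficient, rescaled by 1/sqrt(n)
print(mu_hat, y.mean())                     # both print 3.98: the sample mean
```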
Before we estimate the second parameter $\sigma$, let’s write $\vec{y}_n$ as the sum of the cloud’s center and a deviation vector $\vec{\epsilon}_n$ from that center. This deviation vector can be treated as if its components were independently generated at random by a $N(0, \sigma^2)$ distribution.
$$\vec{y}_n = \hat\mu \left(\sqrt{n}\, U_n\right) + \vec{\epsilon}_n$$

Now, I will do a few tricks to show that $\vec{\epsilon}_n$ can be used to estimate the standard deviation parameter $\sigma$. Once done, we will come back to the geometrical interpretation.
Take $n - 1$ additional unit vectors, orthogonal to $U_n$ and to each other, to form an orthonormal basis $(U_n, u_1, \dots, u_{n-1})$ of space. This means we take a set of $n$ axes for space where $U_n$ is the first of them. Write $Y_n$ for the general random vector that has been realized as $\vec{y}_n$. Along each of those new unit vectors, the projection $Y_n \cdot u_i$ has mean 0 and is distributed according to a normal distribution $N(0, \sigma^2)$. We will show that the squared projection of $\vec{y}_n$ onto each of those directions yields an unbiased estimator of the variance $\sigma^2$. Indeed:
$$\sigma^2 = \operatorname{var}(Y_n \cdot u_i) = E\left[(Y_n \cdot u_i)^2\right] - E\left[Y_n \cdot u_i\right]^2 = E\left[(Y_n \cdot u_i)^2\right] - 0 = E\left[(Y_n \cdot u_i)^2\right]$$

We can pool these to get the best estimate for $\sigma^2$:
$$\hat\sigma^2 = \frac{\sum_{i=1}^{n-1} (\vec{y}_n \cdot u_i)^2}{n - 1} = \frac{\|\vec{\epsilon}_n\|^2}{n - 1}$$

If we take a unit vector $U_n^\perp$ directed along $\vec{\epsilon}_n$, we know that it is orthogonal to $U_n$ (hence the notation), and we have $\|\vec{\epsilon}_n\| = \vec{y}_n \cdot U_n^\perp$. Hence, our best estimate for the standard deviation $\sigma$ is:
$$\hat\sigma = \frac{\|\vec{\epsilon}_n\|}{\sqrt{n - 1}} = \frac{\vec{y}_n \cdot U_n^\perp}{\sqrt{n - 1}}$$

In words, the standard deviation is the length of the deviation vector $\vec{\epsilon}_n$ corrected for the dimension $n$. As I will explain later, this is because lengths are dilated in higher dimensions.
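Numerically, this estimator coincides with the usual sample standard deviation with the $n - 1$ correction (NumPy’s `ddof=1`); a minimal sketch with made-up observations:

```python
import numpy as np

y = np.array([3.1, 4.7, 2.8, 5.2, 4.1])    # made-up observations
n = len(y)

eps = y - y.mean()                          # deviation vector epsilon_n
sigma_hat = np.linalg.norm(eps) / np.sqrt(n - 1)

print(sigma_hat)                            # ~1.02
print(np.std(y, ddof=1))                    # same value: sample std with the n-1 correction
```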
We can replace $\vec{\epsilon}_n$ by the above expression in the formula for the observation vector $\vec{y}_n$:
$$\vec{y}_n = \hat\mu \left(\sqrt{n}\, U_n\right) + \hat\sigma \left(\sqrt{n - 1}\; U_n^\perp\right)$$

The values $\sqrt{n}$ and $\sqrt{n - 1}$ are scale factors due to the dimension of space. Indeed, the length of the diagonal of a square with side $s$ is $\sqrt{2}\, s$; for a cube it is $\sqrt{3}\, s$; and more generally, for an $n$-dimensional hypercube it is $\sqrt{n}\, s$. This explains the $\sqrt{n}$ factor associated with $U_n$, which is precisely the direction of that diagonal.
Another way to say this is simply that $\sqrt{n}$ is the norm of $I_n$ in $n$-dimensional space.
Likewise, $\sqrt{n - 1}$ is the norm of $I_n^\perp$ in the $(n - 1)$-dimensional subspace orthogonal to $\operatorname{span}(I_n)$. We lose one dimension because $\vec{\epsilon}_n$ can’t have any component collinear to $I_n$, by definition.
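This orthogonality is easy to check numerically (made-up observations again):

```python
import numpy as np

y = np.array([3.1, 4.7, 2.8, 5.2, 4.1])    # made-up observations
n = len(y)
U_n = np.ones(n) / np.sqrt(n)               # unit vector along the diagonal I_n

eps = y - y.mean()                          # deviation vector epsilon_n
print(eps @ U_n)                            # ~0 (up to floating point): no component along I_n
```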
In the special case when the components of $\vec{\epsilon}_n$ are perfect estimators (i.e. when $\vec{\epsilon}_n = (\sigma, \dots, \sigma)$ in the basis $(u_1, \dots, u_{n-1})$), the picture reduces to a true $(n - 1)$-dimensional hypercube and $\sqrt{n - 1}\, \sigma$ is its diagonal.
When $n = 1$, we have only one observation ($y_1$) and the formula says that our best estimate for the parameter $\mu$ is:
$$y_1 = \hat\mu\, \sqrt{1} + \hat\sigma\, \sqrt{0} = \hat\mu$$

This means that with only one value, our best guess is to center the distribution on that value. We don’t have enough observations to estimate $\sigma$, so it is automatically ruled out of the formula.
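As a final check, here is a short sketch that rebuilds the observation vector $\vec{y}_n$ from $\hat\mu$, $\hat\sigma$ and the two unit directions, following the decomposition above; the observations are made up, and $U_n^\perp$ is obtained by normalizing the deviation vector:

```python
import numpy as np

y = np.array([3.1, 4.7, 2.8, 5.2, 4.1])    # made-up observations, n = 5
n = len(y)

U_n = np.ones(n) / np.sqrt(n)               # unit vector along the diagonal I_n
mu_hat = (y @ U_n) / np.sqrt(n)             # equals y.mean()

eps = y - mu_hat * np.sqrt(n) * U_n         # deviation vector epsilon_n
sigma_hat = np.linalg.norm(eps) / np.sqrt(n - 1)
U_perp = eps / np.linalg.norm(eps)          # unit vector along the deviation

rebuilt = mu_hat * np.sqrt(n) * U_n + sigma_hat * np.sqrt(n - 1) * U_perp
print(np.allclose(rebuilt, y))              # True: the decomposition recovers y exactly
```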