We will show that the loss function used by ordinary least-squares (OLS) stems from the statistical theory of maximum likelihood estimation applied to the normal distribution. This fact is crucial to the model’s performance: when the data is generated by a normal distribution, the MSE-loss provides an optimal estimator. Foreword If you are looking for an elementary introduction to OLS regression, check out our article: OLS regressions in simple terms. This article is a sequel to our previous article: Linear regressions from the probabilistic viewpoint. There, we outline the general framework for linear regressions using loss functions. In this article, we will tackle the same goal using a different approach based on the MLE instead of a loss function. Setup Let We assume the following relationship: Where By the properties of the normal distribution, this is equivalent to saying that Given an observation Before we can do so, we need to estimate the value of the parameter Maximum likelihood estimation Let We will use the MLE estimator for estimator is defined as the value Let’s compute the solution to this equation. Since the pairs are drawn independently, the probability to observe our whole sample is the product of the probabilities to observe each pair: The logarithm is an increasing function, so maximizing the likelihood is the same as maximizing the log of the likelihood. Taking the logarithm is convenient since it transforms a product into a sum. Thus, we want to maximize (using the function composition notation): The probability to observe the pair Where Use the definition of a gaussian generator to compute Removing constant factors yields: Maximizing this expression is the same as minimizing its negative: Which is the exact same formula that defines the solution to OLS! Hence, the solution to an OLS regression is the MLE Conclusion OLS will provide us with the maximum likelihood estimator for the parameter Normality: Linearity: Homoskedasticity: The output values When either of these conditions is not met, OLS will produce a suboptimal estimator of In practice, we have the sample Read next: How to graphically check whether OLS is appropriate for the dataset at hand.