What is logistic regression?

Logistic regression is a generalized linear model tailored to classification. In this article, we introduce it and explain where it comes from.

Setup

The dataset under study consists of $n$ pairs of an input vector $x_i \in \mathbb{R}^d$ and an output value $y_i \in \{0, 1\}$:

$$(x_1, y_1), \dots, (x_n, y_n)$$

We suppose that there exists an approximate deterministic relationship $f$ between the inputs and the outputs:

$$y_i \approx f(x_i)$$

The goal of a logistic regression is to learn this relationship using a subset of the dataset.

Generalized linear models

We will approximate this relationship by composing a fixed deterministic function $g$ with a linear model. That is, we look for the parameter vector $\theta$ whose linear model $x \mapsto \theta^\top x$ best satisfies:

$$y_i \approx g(\theta^\top x_i)$$

The deterministic function that we will use is the logistic function $\sigma$ (we will explain why later):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Figure: graph of the logistic function.
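Since the plot did not survive here, a few numeric values can stand in for it. The snippet below is a minimal sketch of the logistic function in plain Python; the name `logistic` and the two-branch form (used only to avoid overflow for large negative inputs) are choices made for this illustration, not something the article prescribes.

```python
import math


def logistic(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For z < 0, exp(z) < 1, so this equivalent form cannot overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Print a handful of points along the S-shaped curve.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"logistic({z:+.1f}) = {logistic(z):.4f}")
```

The values stay strictly between 0 and 1 for moderate inputs, equal 0.5 at zero, and satisfy logistic(z) + logistic(-z) = 1.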

The logistic loss

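To measure progress during learning, we use the logistic loss. For an example $(x, y)$ with $y \in \{0, 1\}$ and a parameter vector $\theta$, it reads:

$$\ell(\theta; x, y) = \log\left(1 + e^{\theta^\top x}\right) - y\,\theta^\top x$$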

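Learning the best model then amounts to minimizing the training objective, that is, the sum of the logistic losses $\ell$ over the training examples:

$$J(\theta) = \sum_{i=1}^{n} \ell(\theta; x_i, y_i)$$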
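To make the loss and the objective concrete, here is a small sketch in plain Python. It assumes labels in {0, 1} and writes the logistic loss as log(1 + exp(z)) - y*z with z = theta . x; the function names and the toy numbers are hypothetical, chosen only for illustration.

```python
import math


def logistic_loss(theta, x, y):
    """Logistic loss of one example, assuming y is 0 or 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    # log(1 + exp(z)) computed stably as max(z, 0) + log1p(exp(-|z|)).
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return softplus - y * z


def training_objective(theta, xs, ys):
    """Sum of the logistic losses over a set of examples."""
    return sum(logistic_loss(theta, x, y) for x, y in zip(xs, ys))


# Hypothetical toy data; the leading 1.0 in each input plays the role of a bias term.
xs = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
ys = [0, 1, 1]

print(training_objective([0.0, 0.0], xs, ys))   # uninformative model: 3 * log(2)
print(training_objective([-2.0, 1.5], xs, ys))  # a model that fits the toy data better
```

With all parameters at zero, every example is given probability 0.5, so each one contributes log(2) to the objective.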

Origin of the logistic loss

As discussed in a separate article, standard regression models are ill-suited to classification.

Logistic regression, however, is tailored to classification problems: instead of directly attempting to predict the label $y$, we predict the probability that $x$ is in class $1$. That way, we turn a discrete classification problem into a continuous regression problem.

Since the values predicted by a linear regression model lie in the range $(-\infty, +\infty)$, it only remains to find a way to continuously squash this range into $(0, 1)$. This can be done using the logistic function $\sigma$, which is particularly convenient because most of the values it takes cluster around $0$ and $1$:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Figure: graph of the logistic function.

Using the logistic function, we get the following expression for the probability that $x$ is in class $1$:

$$P(y = 1 \mid x; \theta) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}$$

And the probability that $x$ is in class $0$:

$$P(y = 0 \mid x; \theta) = 1 - \sigma(\theta^\top x)$$

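To predict the labels, we compare those probabilities to a threshold $t$ (for instance $t = 0.5$):

$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid x; \theta) \geq t \\ 0 & \text{otherwise} \end{cases}$$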
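The prediction rule can be sketched in a few lines of plain Python. The parameter vector below is hypothetical, and the default threshold of 0.5 is an assumption made for this example rather than a value taken from the text.

```python
import math


def predict_proba(theta, x):
    """Probability that x is in class 1 under the logistic model."""
    z = sum(t * xi for t, xi in zip(theta, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(theta, x, threshold=0.5):
    """Predict class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0


theta = [-3.0, 1.0]  # hypothetical parameters: bias -3, slope 1

print(predict_label(theta, [1.0, 4.0]))  # theta . x = 1, probability above 0.5
print(predict_label(theta, [1.0, 2.0]))  # theta . x = -1, probability below 0.5
```

Raising the threshold makes the classifier more conservative about predicting class 1, which can matter when the two kinds of errors have different costs.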

Learning the model’s parameters

So, our model predicts the probability that $x$ falls within class $1$. How do we learn its parameter vector $\theta$?

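We will maximize the likelihood of observing our data. Assuming that each training example $(x_i, y_i)$ was drawn independently from a distribution $p(x, y; \theta)$, the joint likelihood is:

$$L(\theta) = \prod_{i=1}^{n} p(x_i, y_i; \theta)$$

Since the logarithm is increasing, the likelihood is maximal exactly when the log-likelihood is maximal:

$$(\log \circ L)(\theta) = \sum_{i=1}^{n} \log p(x_i, y_i; \theta)$$

Here, $\circ$ denotes function composition.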

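For each $i$, we have:

$$p(x_i, y_i; \theta) = p(y_i \mid x_i; \theta)\, p(x_i)$$

Since $y_i \in \{0, 1\}$ and the model gives $p(y_i = 1 \mid x_i; \theta) = \sigma(\theta^\top x_i)$, where $\sigma$ is the logistic function:

$$p(y_i \mid x_i; \theta) = \sigma(\theta^\top x_i)^{y_i} \left(1 - \sigma(\theta^\top x_i)\right)^{1 - y_i}$$

We find that:

$$\log p(x_i, y_i; \theta) = y_i \log \sigma(\theta^\top x_i) + (1 - y_i) \log\left(1 - \sigma(\theta^\top x_i)\right) + \log p(x_i)$$

Since $\log p(x_i)$ is a constant (the inputs do not depend on $\theta$), we find that the value to maximize is:

$$\sum_{i=1}^{n} \left[ y_i \log \sigma(\theta^\top x_i) + (1 - y_i) \log\left(1 - \sigma(\theta^\top x_i)\right) \right]$$

Replacing $\sigma$ by its definition and simplifying, using $\log \sigma(z) = z - \log(1 + e^{z})$ and $\log(1 - \sigma(z)) = -\log(1 + e^{z})$, we find the expression to maximize:

$$\sum_{i=1}^{n} \left[ y_i\, \theta^\top x_i - \log\left(1 + e^{\theta^\top x_i}\right) \right]$$

This is precisely the negative of the training objective, that is, of the sum of the logistic losses over the training set. Maximizing the likelihood is therefore the same as minimizing the logistic loss, hence its use as a loss function.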
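The derivation says what to optimize but not how. One common option, which is an assumption of this sketch and not something the article specifies, is plain gradient descent on the negative log-likelihood, whose gradient is the sum over examples of (sigmoid(theta . x_i) - y_i) x_i. The toy data, learning rate, and step count below are hypothetical.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))


def neg_log_likelihood(theta, xs, ys):
    """Sum over examples of log(1 + exp(z)) - y*z, with labels y in {0, 1}."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = dot(theta, x)
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z
    return total


def gradient(theta, xs, ys):
    """Gradient of the negative log-likelihood with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = sigmoid(dot(theta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def fit(xs, ys, lr=0.05, steps=2000):
    """Fixed-step gradient descent starting from theta = 0."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        g = gradient(theta, xs, ys)
        theta = [t - lr * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical toy data (leading 1.0 is the bias term). The classes overlap,
# so the maximum-likelihood estimate is finite.
xs = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
ys = [0, 0, 1, 0, 1, 1]

theta = fit(xs, ys)
print("theta:", theta)
print("objective at 0:    ", neg_log_likelihood([0.0, 0.0], xs, ys))
print("objective at theta:", neg_log_likelihood(theta, xs, ys))
```

Because the objective is convex, a sufficiently small fixed step is enough here; in practice, second-order methods such as Newton's method are often preferred for this problem.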

But…

The choice of $\sigma$ might seem arbitrary. Is it a good idea to predict the probability $P(y = 1 \mid x)$ this way? What if we used another function to squash $(-\infty, +\infty)$ into $(0, 1)$?

The theoretical soundness of logistic regression is explained in the following section.

Logistic regression is a generalized linear model

Logistic regression is a generalized linear model whose inverse link function is the logistic function:

$$g^{-1}(\eta) = \sigma(\eta) = \frac{1}{1 + e^{-\eta}}$$

The output of the linear model is an estimate of the natural parameter $\eta$ of a Bernoulli($p$) distribution:

$$\eta = \theta^\top x = \log\frac{p}{1 - p}$$
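The relationship between the natural parameter and the success probability can be checked numerically. The sketch below assumes the standard facts that the natural parameter of a Bernoulli(p) distribution is the log-odds log(p / (1 - p)) and that the logistic function inverts it; the function names are chosen for this example.

```python
import math


def sigmoid(eta):
    """Inverse link: maps a natural parameter to a success probability."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def logit(p):
    """Link: maps a success probability in (0, 1) to the natural parameter."""
    return math.log(p / (1.0 - p))


# Going from p to the natural parameter and back recovers p.
for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    print(f"p = {p:.1f}  ->  eta = {eta:+.4f}  ->  sigmoid(eta) = {sigmoid(eta):.4f}")
```

This round trip is why predicting the natural parameter with a linear model and then applying the logistic function yields a valid probability.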

Underlying probabilistic model

The underlying probabilistic model when using a logistic regression is thus:

Let $X$ be a random vector whose first component is always $1$ (this is our bias term). Let $\theta$ be a vector and let $Y$ be a Bernoulli random variable with probability of success $\sigma(\theta^\top X)$, where $\sigma$ is the logistic function.

Our dataset is made of $n$ i.i.d. samples from the random vector $(X, Y)$:

$$(x_1, y_1), \dots, (x_n, y_n)$$

Hence, for each $i$, we know that $y_i$ is an observation drawn from a Bernoulli($\sigma(\theta^\top x_i)$) distribution.

In this setup, logistic regression predicts the value of the natural parameter $\eta_i = \theta^\top x_i$ so as to maximize the likelihood of the observed data.
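The probabilistic model above can be simulated directly, which is a useful way to generate test data for a fitting routine. In the sketch below, the Gaussian distribution of the non-bias inputs, the value of `theta_true`, and the sample size are all assumptions made for illustration; the model itself only requires that the first component of each input is 1 and that each label is Bernoulli with success probability sigmoid(theta . x).

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sample_dataset(theta, n, seed=0):
    """Draw n i.i.d. pairs (x, y) with y ~ Bernoulli(sigmoid(theta . x))."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        # First component is always 1 (bias term); the rest are standard
        # normal draws, which is an arbitrary choice for this sketch.
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(len(theta) - 1)]
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        y = 1 if rng.random() < p else 0
        xs.append(x)
        ys.append(y)
    return xs, ys


theta_true = [-0.5, 2.0]  # hypothetical "true" parameters
xs, ys = sample_dataset(theta_true, n=5000)

# The fraction of positive labels should be close to the average success probability.
mean_p = sum(
    sigmoid(sum(t * xi for t, xi in zip(theta_true, x))) for x in xs
) / len(xs)
print("fraction of ones:", sum(ys) / len(ys))
print("mean probability:", mean_p)
```

Fitting a logistic regression to such a sample by maximum likelihood should recover parameters close to `theta_true` as the sample size grows.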