What is logistic regression?

Oct 26, 2018

Logistic regression is a generalized linear model tailored to classification. In this article, we introduce this model and explain its origin.

Setup

The dataset $S$ under study consists of $N$ pairs of input vector $x_n$ and output value $y_n$:

$$S = \{ (x_n, y_n) \mid n \leq N \}$$

We suppose that there exists an approximate deterministic relationship $f_{\text{true}}$ between the inputs $x_n$ and the outputs $y_n$:

$$\forall n \leq N, \quad y_n \approx f_{\text{true}}(x_n)$$

The goal of a logistic regression is to learn this relationship using a subset $S_{\text{train}} \subseteq S$ of the dataset.

Generalized linear models

We will approximate the relationship $f_{\text{true}}$ using the composition of a deterministic function $\sigma$ and a linear model. This means that we want to find the best model $f$ in the class $F_{\text{lin}}$ of linear models such that:

$$\sigma \circ f(x_n) = \sigma(f(x_n)) \approx f_{\text{true}}(x_n)$$

The deterministic function $\sigma$ that we will use is the logistic function (we will explain why later):

$$\sigma(x) = \frac{e^x}{1 + e^x}$$

Here is a graph of this function:

[Figure: graph of the logistic function]
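As a minimal sketch (function name is mine, not from a library), the logistic function can be computed in a numerically stable way by splitting on the sign of $x$:

```python
import math

def sigmoid(x):
    """Logistic function sigma(x) = e^x / (1 + e^x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # stable for large positive x
    return math.exp(x) / (1.0 + math.exp(x))  # stable for very negative x

# The values cluster near 0 and 1 away from the origin:
print(sigmoid(-5))  # ~0.0067
print(sigmoid(0))   # 0.5
print(sigmoid(5))   # ~0.9933
```

The two branches compute the same value; they only differ in which exponential they evaluate, so neither branch can overflow.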

The logistic loss

To measure progress during learning, we use the logistic loss:

$$L_\sigma(f_w, S) = \sum_{n=1}^{N} \ln\left(1 + e^{f_w(x_n)}\right) - y_n f_w(x_n)$$

Learning the best model then amounts to minimizing this loss on the training set $S_{\text{train}}$.
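A minimal sketch of this loss for a linear model $f_w(x) = w \cdot x$ (helper name is mine; the bias is assumed folded into $w$ via a constant input component):

```python
import math

def logistic_loss(w, xs, ys):
    """L_sigma(f_w, S) = sum_n ln(1 + e^{f_w(x_n)}) - y_n * f_w(x_n)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x))  # f_w(x) = w . x
        # ln(1 + e^z) computed stably on both sides of z = 0
        log1pexp = z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))
        total += log1pexp - y * z
    return total

# With w = 0, every prediction is sigma(0) = 0.5, and each example costs ln 2:
print(logistic_loss([0.0], [[1.0]], [1]))  # ln 2 ~ 0.693
```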

Origin of the logistic loss

As discussed in this article, usual regression models are ill-adapted to classification.

Logistic regression, however, is tailored to classification problems: instead of directly attempting to predict the label $y \in \{0, 1\}$, it predicts the probability $p(1 \mid x)$ that $x$ is in class 1. That way, we turn a discrete classification problem into a continuous regression problem.

Since the values predicted by a regression model range over $\left]-\infty; +\infty\right[$, it only remains to find a way to continuously map this range onto $[0; 1]$. This can be done using the logistic function $\sigma$, which is particularly interesting because most of the values it takes cluster around 0 and 1:

$$\sigma(x) = \frac{e^x}{1 + e^x}$$


Using the logistic function, we get the following expression for the probability $p(1 \mid x)$ that $x$ is in class 1:

$$p(1 \mid x) = \sigma \circ f_w(x)$$

And the probability $p(0 \mid x)$ that $x$ is in class 0:

$$p(0 \mid x) = 1 - \sigma \circ f_w(x)$$

To predict the labels, we compare those probabilities to a threshold (0.5):

$$\hat{y}_n = 1 \iff p(1 \mid x_n) > 0.5$$
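The thresholding step can be sketched as follows (function names are mine, not from a library; the bias is again assumed folded into $w$):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_label(w, x, threshold=0.5):
    """Return y_hat = 1 iff p(1|x) = sigma(w . x) exceeds the threshold."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 1 if p1 > threshold else 0

# With w = (0, 1): p(1|x) > 0.5 exactly when the second feature is positive.
print(predict_label([0.0, 1.0], [1.0, 2.3]))   # 1
print(predict_label([0.0, 1.0], [1.0, -0.7]))  # 0
```

Since $\sigma$ is increasing and $\sigma(0) = 0.5$, thresholding the probability at 0.5 is equivalent to thresholding $f_w(x)$ at 0.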

Learning the model’s parameters

So, our model predicts the probability $p(1 \mid x)$ that $x$ falls within class 1. How do we learn its parameter vector $w$?

We will maximize the likelihood of observing our data. Assuming that each training example $(x_n, y_n)$ was drawn independently from the distribution $p$, the joint likelihood is:

$$p(y, X \mid w) = \prod_{n=1}^{N} p(y_n, x_n \mid w)$$

Which is maximal exactly when the log-likelihood is maximal:

$$\ln p(y, X \mid w) = \sum_{n=1}^{N} \ln p(y_n, x_n \mid w)$$

In what follows, I use the notation $f \circ g(x) = f(g(x))$ for function composition.

For each $n \leq N$, we have:

$$p(y_n, x_n \mid w) = p(y_n \mid x_n, w)\, p(x_n \mid w)$$

Since:

$$y_n = 1 \implies p(y_n \mid x_n, w) = \sigma \circ f_w(x_n)$$

$$y_n = 0 \implies p(y_n \mid x_n, w) = 1 - \sigma \circ f_w(x_n)$$

We find that:

$$p(y_n \mid x_n, w) = y_n\, \sigma \circ f_w(x_n) + (1 - y_n)\left(1 - \sigma \circ f_w(x_n)\right)$$

Since $p(x_n \mid w)$ is a constant (the inputs do not depend on $w$), we find that the value to maximize is:

$$\sum_{n=1}^{N} y_n \ln \sigma \circ f_w(x_n) + (1 - y_n) \ln\left(1 - \sigma \circ f_w(x_n)\right)$$

Replacing $\sigma$ by its definition and simplifying, we find that maximizing this expression amounts to minimizing:

$$\sum_{n=1}^{N} \ln\left(1 + e^{f_w(x_n)}\right) - y_n f_w(x_n)$$

Which is precisely $L_\sigma(f_w, S)$. Hence its use as a loss function.
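The simplification can be spot-checked numerically: the negative log-likelihood and the logistic loss agree term by term (the pairs below are arbitrary values of $z_n = f_w(x_n)$ and labels $y_n$, chosen for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary (z_n, y_n) pairs, where z_n plays the role of f_w(x_n).
pairs = [(-2.0, 0), (-0.5, 1), (0.0, 1), (1.5, 0), (3.0, 1)]

# Negative log-likelihood: -sum_n [ y_n ln sigma(z_n) + (1 - y_n) ln(1 - sigma(z_n)) ]
neg_log_lik = -sum(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z))
                   for z, y in pairs)

# Logistic loss: sum_n [ ln(1 + e^{z_n}) - y_n z_n ]
logistic = sum(math.log(1 + math.exp(z)) - y * z for z, y in pairs)

assert abs(neg_log_lik - logistic) < 1e-12
print(neg_log_lik, logistic)  # identical up to floating-point error
```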

But…

The choice of $\sigma$ might seem arbitrary. Is it a good idea to predict the probability $p(1 \mid x)$? What if we used another function to map $\left]-\infty; +\infty\right[$ onto $[0; 1]$?

The theoretical soundness of the logistic regression is explained in the following section.

Logistic regression is a generalized linear model

Logistic regression is a generalized linear model with inverse link function:

$$\sigma(x) = \frac{e^x}{1 + e^x}$$

The output $f_w(x)$ of the linear model is an estimate of the natural parameter $\eta$ of a $\text{Bernoulli}(p)$ distribution:

$$\eta = \sigma^{-1}(p) = \ln\left(\frac{p}{1 - p}\right)$$
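The link function $\sigma^{-1}$ (the logit) and the logistic function invert each other, which a quick sketch confirms (function names are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Natural parameter eta = sigma^{-1}(p) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# logit and sigmoid are inverses of each other:
print(logit(sigmoid(1.7)))   # ~1.7
print(sigmoid(logit(0.25)))  # ~0.25
```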

Underlying probabilistic model

The underlying probabilistic model when using a logistic regression is thus:

Let $X$ be a random vector whose first component is always 1 (this is our bias term). Let $w_{\text{true}}$ be a vector and let $Y \sim \text{Bern}(p(X))$ be a Bernoulli random variable with probability of success $p(X) = \sigma(w_{\text{true}} \cdot X)$.

Our dataset $S$ is made of $N$ i.i.d. samples of the random vector $(X, Y)$:

$$S = \left\{ (x_n, y_n) \overset{\text{i.i.d.}}{\sim} (X, Y) \mid n \leq N \right\}$$

Hence, for each $n \leq N$, we know that $y_n$ is an observation drawn from a $\text{Bern}(\sigma(w_{\text{true}} \cdot x_n))$ distribution.

In this setup, the logistic regression predicts the value of the natural parameter $\eta = \sigma^{-1}(p(x_n))$ so as to maximize the likelihood of the observed data.
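This generative model can be sketched by sampling synthetic data from it (the particular $w_{\text{true}}$, the uniform input distribution, and the variable names are all assumptions made for illustration):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
w_true = [-1.0, 2.0]  # first coordinate multiplies the constant bias component

# Draw N i.i.d. samples (x_n, y_n) with x_n = (1, u_n) and y_n ~ Bern(sigma(w_true . x_n)).
N = 10_000
samples = []
for _ in range(N):
    x = [1.0, random.uniform(-3, 3)]
    p = sigmoid(sum(wi * xi for wi, xi in zip(w_true, x)))
    y = 1 if random.random() < p else 0
    samples.append((x, y))

# Sanity check: near u = 0.5, p(1|x) = sigma(-1 + 2 * 0.5) = sigma(0) = 0.5,
# so the empirical frequency of y = 1 there should be close to 0.5.
near = [y for (x, y) in samples if abs(x[1] - 0.5) < 0.25]
rate = sum(near) / len(near)
print(rate)  # roughly 0.5
```

Fitting a logistic regression to such samples would recover an estimate of $w_{\text{true}}$ by the maximum-likelihood procedure described above.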