What is logistic regression?

Oct 26, 2018

Logistic regression is a generalized linear model tailored to classification. In this article, we introduce this model and explain its origin.

Setup

The dataset $S$ under study consists of $N$ pairs of input vector $x_n$ and output value $y_n$:

$$S = \{ (x_n, y_n) \mid n \leq N \}$$

We suppose that there exists an approximate deterministic relationship $f_{\text{true}}$ between the inputs $x_n$ and the outputs $y_n$:

$$\forall n \leq N, \quad y_n \approx f_{\text{true}}(x_n)$$

The goal of a logistic regression is to learn this relationship using a subset $S_{\text{train}} \subseteq S$ of the dataset.

Generalized linear models

We will approximate the relationship $f_{\text{true}}$ using the composition of a deterministic function $\sigma$ and a linear model. This means that we want to find the best model $f$ in the class $F_{\text{lin}}$ of linear models such that:

$$\sigma \circ f(x_n) = \sigma(f(x_n)) \approx f_{\text{true}}(x_n)$$

The deterministic function $\sigma$ that we will use is the logistic function (we will explain why later):

$$\sigma(x) = \frac{e^x}{1 + e^x}$$

Here is a graph of this function:

[Figure: graph of the logistic function]
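As a minimal sketch (function name is mine, not from a library), the logistic function can be computed in a numerically stable way by splitting on the sign of $x$:

```python
import math

def sigmoid(x):
    """Logistic function sigma(x) = e^x / (1 + e^x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))   # stable for large positive x
    return math.exp(x) / (1.0 + math.exp(x))  # stable for very negative x

# The values cluster near 0 and 1 away from the origin:
print(sigmoid(-5))  # ~0.0067
print(sigmoid(0))   # 0.5
print(sigmoid(5))   # ~0.9933
```

The two branches compute the same value; they only differ in which exponential they evaluate, so neither branch can overflow.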

The logistic loss

To measure progress during learning, we use the logistic loss:

$$L_\sigma(f_w, S) = \sum_{n=1}^{N} \ln\left(1 + e^{f_w(x_n)}\right) - y_n f_w(x_n)$$

Learning the best model then amounts to minimizing this loss on the training set $S_{\text{train}}$.
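A minimal sketch of this loss for a linear model $f_w(x) = w \cdot x$ (helper name is mine; the bias is assumed folded into $w$ via a constant input component):

```python
import math

def logistic_loss(w, xs, ys):
    """L_sigma(f_w, S) = sum_n ln(1 + e^{f_w(x_n)}) - y_n * f_w(x_n)."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x))  # f_w(x) = w . x
        # ln(1 + e^z) computed stably on both sides of z = 0
        log1pexp = z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))
        total += log1pexp - y * z
    return total

# With w = 0, every prediction is sigma(0) = 0.5, and each example costs ln 2:
print(logistic_loss([0.0], [[1.0]], [1]))  # ln 2 ~ 0.693
```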

Origin of the logistic loss

As discussed in this article, usual regression models are ill-adapted to classification.

Logistic regression, however, is tailored to classification problems: instead of directly attempting to predict the label $y \in \{0, 1\}$, it predicts the probability $p(1 \mid x)$ that $x$ is in class 1. That way, we turn a discrete classification problem into a continuous regression problem.

Since the values predicted by a regression model range over $\left]-\infty; +\infty\right[$, it only remains to find a way to continuously map this range onto $[0; 1]$. This can be done using the logistic function $\sigma$, which is particularly interesting because most of the values it takes cluster around 0 and 1:

$$\sigma(x) = \frac{e^x}{1 + e^x}$$


Using the logistic function, we get the following expression for the probability $p(1 \mid x)$ that $x$ is in class 1:

$$p(1 \mid x) = \sigma \circ f_w(x)$$

And the probability $p(0 \mid x)$ that $x$ is in class 0:

$$p(0 \mid x) = 1 - \sigma \circ f_w(x)$$

To predict the labels, we compare those probabilities to a threshold (0.5):

$$\hat{y}_n = 1 \iff p(1 \mid x_n) > 0.5$$
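The thresholding step can be sketched as follows (function names are mine, not from a library; the bias is again assumed folded into $w$):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_label(w, x, threshold=0.5):
    """Return y_hat = 1 iff p(1|x) = sigma(w . x) exceeds the threshold."""
    p1 = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 1 if p1 > threshold else 0

# With w = (0, 1): p(1|x) > 0.5 exactly when the second feature is positive.
print(predict_label([0.0, 1.0], [1.0, 2.3]))   # 1
print(predict_label([0.0, 1.0], [1.0, -0.7]))  # 0
```

Since $\sigma$ is increasing and $\sigma(0) = 0.5$, thresholding the probability at 0.5 is equivalent to thresholding $f_w(x)$ at 0.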

Learning the model’s parameters

So, our model predicts the probability $p(1 \mid x)$ that $x$ falls within class 1. How do we learn its parameter vector $w$?

We will maximize the likelihood of observing our data. Assuming that each training example $(x_n, y_n)$ was drawn independently from the distribution $p$, the joint likelihood is:

$$p(y, X \mid w) = \prod_{n=1}^{N} p(y_n, x_n \mid w)$$

Which is maximal exactly when the log-likelihood is maximal:

$$\ln p(y, X \mid w) = \sum_{n=1}^{N} \ln p(y_n, x_n \mid w)$$

In what follows, I use the notation $f \circ g(x) = f(g(x))$ for function composition.

For each $n \leq N$, we have:

$$p(y_n, x_n \mid w) = p(y_n \mid x_n, w)\, p(x_n \mid w)$$

Since:

$$y_n = 1 \implies p(y_n \mid x_n, w) = \sigma \circ f_w(x_n)$$

$$y_n = 0 \implies p(y_n \mid x_n, w) = 1 - \sigma \circ f_w(x_n)$$

We find that:

$$p(y_n \mid x_n, w) = y_n\, \sigma \circ f_w(x_n) + (1 - y_n)\left(1 - \sigma \circ f_w(x_n)\right)$$

Since $p(x_n \mid w)$ is a constant (the inputs do not depend on $w$), we find that the value to maximize is:

$$\sum_{n=1}^{N} y_n \ln \sigma \circ f_w(x_n) + (1 - y_n) \ln\left(1 - \sigma \circ f_w(x_n)\right)$$

Replacing $\sigma$ by its definition and simplifying, we find that maximizing this expression amounts to minimizing:

$$\sum_{n=1}^{N} \ln\left(1 + e^{f_w(x_n)}\right) - y_n f_w(x_n)$$

Which is precisely $L_\sigma(f_w, S)$. Hence its use as a loss function.
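The simplification can be spot-checked numerically: the negative log-likelihood and the logistic loss agree term by term (the pairs below are arbitrary values of $z_n = f_w(x_n)$ and labels $y_n$, chosen for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary (z_n, y_n) pairs, where z_n plays the role of f_w(x_n).
pairs = [(-2.0, 0), (-0.5, 1), (0.0, 1), (1.5, 0), (3.0, 1)]

# Negative log-likelihood: -sum_n [ y_n ln sigma(z_n) + (1 - y_n) ln(1 - sigma(z_n)) ]
neg_log_lik = -sum(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z))
                   for z, y in pairs)

# Logistic loss: sum_n [ ln(1 + e^{z_n}) - y_n z_n ]
logistic = sum(math.log(1 + math.exp(z)) - y * z for z, y in pairs)

assert abs(neg_log_lik - logistic) < 1e-12
print(neg_log_lik, logistic)  # identical up to floating-point error
```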

But…

The choice of $\sigma$ might seem arbitrary. Is it a good idea to predict the probability $p(1 \mid x)$? What if we used another function to map $\left]-\infty; +\infty\right[$ onto $[0; 1]$?

The theoretical soundness of the logistic regression is explained in the following section.

Logistic regression is a generalized linear model

Logistic regression is a generalized linear model with inverse link function:

$$\sigma(x) = \frac{e^x}{1 + e^x}$$

The output $f_w(x)$ of the linear model is an estimate of the natural parameter $\eta$ of a $\text{Bernoulli}(p)$ distribution:

$$\eta = \sigma^{-1}(p) = \ln\left(\frac{p}{1 - p}\right)$$
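The link function $\sigma^{-1}$ (the logit) and the logistic function invert each other, which a quick sketch confirms (function names are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Natural parameter eta = sigma^{-1}(p) = ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# logit and sigmoid are inverses of each other:
print(logit(sigmoid(1.7)))   # ~1.7
print(sigmoid(logit(0.25)))  # ~0.25
```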

Underlying probabilistic model

The underlying probabilistic model when using a logistic regression is thus:

Let $X$ be a random vector whose first component is always 1 (this is our bias term). Let $w_{\text{true}}$ be a vector and let $Y \sim \text{Bern}(p(X))$ be a Bernoulli random variable with probability of success $p(X) = \sigma(w_{\text{true}} \cdot X)$.

Our dataset $S$ is made of $N$ i.i.d. samples of the random vector $(X, Y)$:

$$S = \left\{ (x_n, y_n) \overset{\text{i.i.d.}}{\sim} (X, Y) \mid n \leq N \right\}$$

Hence, for each $n \leq N$, we know that $y_n$ is an observation drawn from a $\text{Bern}(\sigma(w_{\text{true}} \cdot x_n))$ distribution.

In this setup, the logistic regression predicts the value of the natural parameter $\eta = \sigma^{-1}(p(x_n))$ so as to maximize the likelihood of the observed data.
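This generative model can be sketched by sampling synthetic data from it (the particular $w_{\text{true}}$, the uniform input distribution, and the variable names are all assumptions made for illustration):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
w_true = [-1.0, 2.0]  # first coordinate multiplies the constant bias component

# Draw N i.i.d. samples (x_n, y_n) with x_n = (1, u_n) and y_n ~ Bern(sigma(w_true . x_n)).
N = 10_000
samples = []
for _ in range(N):
    x = [1.0, random.uniform(-3, 3)]
    p = sigmoid(sum(wi * xi for wi, xi in zip(w_true, x)))
    y = 1 if random.random() < p else 0
    samples.append((x, y))

# Sanity check: near u = 0.5, p(1|x) = sigma(-1 + 2 * 0.5) = sigma(0) = 0.5,
# so the empirical frequency of y = 1 there should be close to 0.5.
near = [y for (x, y) in samples if abs(x[1] - 0.5) < 0.25]
rate = sum(near) / len(near)
print(rate)  # roughly 0.5
```

Fitting a logistic regression to such samples would recover an estimate of $w_{\text{true}}$ by the maximum-likelihood procedure described above.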