What is a generalized linear model?

$\def\sa{a} \def\sb{b} \def\sc{c} \def\sd{d} \def\se{e} \def\sf{f} \def\sg{g} \def\sh{h} \def\si{i} \def\sj{j} \def\sk{k} \def\sl{l} \def\sm{m} \def\sn{n} \def\so{o} \def\sp{p} \def\sq{q} \def\sr{r} \def\ss{s} \def\st{t} \def\su{u} \def\sv{v} \def\sw{w} \def\sx{x} \def\sy{y} \def\sz{z} \def\va{\vec{a}} \def\vb{\vec{b}} \def\vc{\vec{c}} \def\vd{\vec{d}} \def\ve{\vec{e}} \def\vf{\vec{f}} \def\vg{\vec{g}} \def\vh{\vec{h}} \def\vi{\vec{i}} \def\vj{\vec{j}} \def\vk{\vec{k}} \def\vl{\vec{l}} \def\vm{\vec{m}} \def\vn{\vec{n}} \def\vo{\vec{o}} \def\vp{\vec{p}} \def\vq{\vec{q}} \def\vr{\vec{r}} \def\vs{\vec{s}} \def\vt{\vec{t}} \def\vu{\vec{u}} \def\vv{\vec{v}} \def\vw{\vec{w}} \def\vx{\vec{x}} \def\vy{\vec{y}} \def\vz{\vec{z}} \def\ga{\mathfrak{A}} \def\gb{\mathfrak{B}} \def\gc{\mathfrak{C}} \def\gd{\mathfrak{D}} \def\ge{\mathfrak{E}} \def\gf{\mathfrak{F}} \def\gg{\mathfrak{G}} \def\gh{\mathfrak{H}} \def\gi{\mathfrak{I}} \def\gj{\mathfrak{J}} \def\gk{\mathfrak{K}} \def\gl{\mathfrak{L}} \def\gm{\mathfrak{M}} \def\gn{\mathfrak{N}} \def\go{\mathfrak{O}} \def\gp{\mathfrak{P}} \def\gq{\mathfrak{Q}} \def\gr{\mathfrak{R}} \def\gs{\mathfrak{S}} \def\gt{\mathfrak{T}} \def\gu{\mathfrak{U}} \def\gv{\mathfrak{V}} \def\gw{\mathfrak{W}} \def\gx{\mathfrak{X}} \def\gy{\mathfrak{Y}} \def\gz{\mathfrak{Z}} \def\ra{A} \def\rb{B} \def\rc{C} \def\rd{D} \def\re{E} \def\rf{F} \def\rg{G} \def\rh{H} \def\ri{I} \def\rj{J} \def\rk{K} \def\rl{L} \def\rm{M} \def\rn{N} \def\ro{O} \def\rp{P} \def\rq{Q} \def\rr{R} \def\rs{S} \def\rt{T} \def\ru{U} \def\rv{V} \def\rw{W} \def\rx{X} \def\ry{Y} \def\rz{Z} \def\rva{\vec{A}} \def\rvb{\vec{B}} \def\rvc{\vec{C}} \def\rvd{\vec{D}} \def\rve{\vec{E}} \def\rvf{\vec{F}} \def\rvg{\vec{G}} \def\rvh{\vec{H}} \def\rvi{\vec{I}} \def\rvj{\vec{J}} \def\rvk{\vec{K}} \def\rvl{\vec{L}} \def\rvm{\vec{M}} \def\rvn{\vec{N}} \def\rvo{\vec{O}} \def\rvp{\vec{P}} \def\rvq{\vec{Q}} \def\rvr{\vec{R}} \def\rvs{\vec{S}} \def\rvt{\vec{T}} \def\rvu{\vec{U}} \def\rvv{\vec{V}} 
\def\rvw{\vec{W}} \def\rvx{\vec{X}} \def\rvy{\vec{Y}} \def\rvz{\vec{Z}} \def\seta{A} \def\setb{B} \def\setc{C} \def\setd{D} \def\sete{E} \def\setf{F} \def\setg{G} \def\seth{H} \def\seti{I} \def\setj{J} \def\setk{K} \def\setl{L} \def\setm{M} \def\setn{N} \def\seto{O} \def\setp{P} \def\setq{Q} \def\setr{R} \def\sets{S} \def\sett{T} \def\setu{U} \def\setv{V} \def\setw{W} \def\setx{X} \def\sety{Y} \def\setz{Z} \def\fa{a} \def\fb{b} \def\fc{c} \def\fd{d} \def\fe{e} \def\ff{f} \def\fg{g} \def\fh{h} \def\fi{i} \def\fj{j} \def\fk{k} \def\fl{l} \def\fm{m} \def\fn{n} \def\fo{o} \def\fp{p} \def\fq{q} \def\fr{r} \def\fs{s} \def\ft{t} \def\fu{u} \def\fv{v} \def\fw{w} \def\fx{x} \def\fy{y} \def\fz{z} \def\fA{A} \def\fB{B} \def\fC{C} \def\fD{D} \def\fE{E} \def\fF{F} \def\fG{G} \def\fH{H} \def\fI{I} \def\fJ{J} \def\fK{K} \def\fL{L} \def\fM{M} \def\fN{N} \def\fO{O} \def\fP{P} \def\fQ{Q} \def\fR{R} \def\fS{S} \def\fT{T} \def\fU{U} \def\fV{V} \def\fW{W} \def\fX{X} \def\fY{Y} \def\fZ{Z} \def\ma{A} \def\mb{B} \def\mc{C} \def\md{D} \def\me{E} \def\mf{F} \def\mg{G} \def\mh{H} \def\mi{I} \def\mj{J} \def\mk{K} \def\ml{L} \def\mm{M} \def\mn{N} \def\mo{O} \def\mp{P} \def\mq{Q} \def\mr{R} \def\ms{S} \def\mt{T} \def\matu{U} \def\mv{V} \def\mw{W} \def\mx{X} \def\my{Y} \def\mz{Z} \def\loss{\mathcal{L}} \newcommand{\dkl}[2]{D_{\text{KL}}\mathopen{}\paren{#1\,||\,#2}} \newcommand{\dataset}{S} \newcommand{\ndataset}{N} \newcommand{\idataset}{n} \newcommand{\inputRV}{\mathcal{X}} \newcommand{\inputvec}{\vec{x}} \newcommand{\ninputvec}[1]{\vec{x}_{#1}} \newcommand{\iinputvec}[1]{x_{#1}} \newcommand{\niinputvec}[2]{x_{#1, #2}} \newcommand{\icpnt}{i} \newcommand{\inputmatrix}{X} \newcommand{\inputdim}{D} \newcommand{\outputval}{y} \newcommand{\ioutputval}[1]{y_{#1}} \newcommand{\outputvec}{\vec{y}} \newcommand{\trainset}{S_{\text{train}}} \newcommand{\testset}{S_{\text{test}}} \newcommand{\truemodel}{f_{\text{true}}} \newcommand{\trainedmodel}{f_{\trainset}} \newcommand{\linmodel}[1]{f_{#1}} 
\newcommand{\bestmodel}{f^{*}} \newcommand{\model}{f} \newcommand{\hyperparam}{\lambda} \newcommand{\linparamv}{\vec{w}} \newcommand{\ilinparam}[1]{w_{#1}} \newcommand{\indivloss}{l} \newcommand{\modelclass}{\mathcal{F}} \newcommand{\linclass}{\modelclass_{\text{lin}}} \newcommand{\g}{\mathcal{G}} \newcommand{\gmse}{\g_{\text{MSE}}} \newcommand{\glasso}{\g_{\text{lasso}}} \newcommand{\gridge}{\g_{\text{ridge}}} \newcommand{\glogit}{\g_{\logit}} \newcommand{\l}{\mathcal{L}} \newcommand{\lmse}{\l_{\text{MSE}}} \newcommand{\lmae}{\l_{\text{MAE}}} \newcommand{\llasso}{\l_{\text{lasso}}} \newcommand{\lridge}{\l_{\text{ridge}}} \newcommand{\llogit}{\l_{\logit}} \newcommand{\logit}{\sigma} \newcommand{\reg}{\mathcal{R}} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\mean}{mean} \DeclareMathOperator*{\avg}{avg} \DeclareMathOperator*{\span}{span} \DeclareMathOperator*{\var}{var} \DeclareMathOperator*{\bias}{bias} \newcommand{\expectation}{\mathbb{E}} \newcommand{\brak}[1]{\left[#1\right]} \newcommand{\paren}[1]{\left(#1\right)} \newcommand{\realset}{\mathbb{R}} \newcommand{\realvset}[1]{\realset^{#1}} \newcommand{\prob}{\mathbb{P}} \newcommand{\gaussian}{\mathcal{N}} \newcommand{\iid}{\stackrel{\text{i.i.d.}}{\sim}} \newcommand{\abs}[1]{\left\lvert#1\right\rvert} \newcommand{\norm}[1]{\left\lVert#1\right\rVert} \newcommand{\normtwo}[1]{\norm{#1}_{2}} \newcommand{\normone}[1]{\norm{#1}_{1}} \newcommand{\card}[1]{\left\lvert#1\right\rvert} \newcommand{\grad}{\nabla} \newcommand{\dconv}{\stackrel{d}{\to}} \newcommand{\pconv}{\stackrel{p}{\to}} \newcommand{\rva}[1]{#1} \newcommand{\rve}[1]{\vec{#1}} \newcommand{\obs}[1]{#1} \newcommand{\vobs}[1]{\vec{#1}} \newcommand{\distrib}[1]{#1} \newcommand{\distribof}[2]{#1_{#2}} \newcommand{\density}[1]{#1} \newcommand{\densityof}[2]{#1_{#2}} \newcommand{\distributed}{\sim} \newcommand{\const}[1]{#1} \newcommand{\fun}[1]{#1}$

To understand what a generalized linear model does, let’s look back at linear models.

Typical linear model setup

In the typical setup for a linear model, we have a random input vector $\rve{X}$ and a random output variable $\rva{Y}$ whose conditional mean $\mu = \expectation\brak{\rva{Y} \mid \rve{X}}$ is a linear function of $\rve{X}$.

For instance, a linear least squares regression amounts to:

$\rva{Y} \distributed \gaussian(\linparamv\cdot\rve{X}, \sigma^2)$

Similarly, a linear regression with MAE loss amounts to:

$\rva{Y} \distributed \distrib{Laplace}(\linparamv\cdot\rve{X}, b)$
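
To see why these two distributional assumptions match the two regressions, one can write out the negative log-density of a single observation $\outputval$ given $\inputvec$. This is a short sketch using the standard Gaussian and Laplace densities, which are not spelled out in the text above:

$-\ln p_{\text{Gaussian}}(\outputval \mid \inputvec) = \frac{(\outputval - \linparamv\cdot\inputvec)^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)$

$-\ln p_{\text{Laplace}}(\outputval \mid \inputvec) = \frac{\abs{\outputval - \linparamv\cdot\inputvec}}{b} + \ln(2b)$

With $\sigma$ and $b$ held fixed, maximizing the likelihood over $\linparamv$ is the same as minimizing the sum of squared errors in the first case and the sum of absolute errors in the second.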

Hence, with traditional linear models, we attempt to predict the mean $\mu$ of $\rva{Y}$. When this mean depends non-linearly on $\rve{X}$, we can instead predict another parameter $\eta$ as a linear function of $\rve{X}$, and then use a non-linear function to transform this parameter into the mean.

Exponential family

A family of distributions for which this approach works particularly well is the exponential family. This is good news, because many of the most familiar distributions belong to this family (normal, exponential, Bernoulli, Poisson, geometric, etc.).

A density belongs to the exponential family if it can be written as:

$p(y \mid \vec{\eta}) = h(y)\,\exp\brak{\vec{\eta} \cdot \Phi(y) - A(\vec{\eta})}$

The parameter $\vec{\eta}$ is called the natural parameter of the density, and $A(\vec{\eta})$ is called the log-partition function (or cumulant function).

This family of distributions is all the more interesting because $\Phi(y)$ is a sufficient statistic for $\vec{\eta}$!

The aim of a generalized linear model is to estimate the natural parameter $\vec{\eta}$ from the dataset. Before we dive into more details, let’s convince ourselves that exponential families are THE thing.

The link function

Estimating $\vec{\eta}$ using the sufficient statistic $\Phi(y)$ is made possible through the link function $g$ such that:

$\vec{\eta} = g(\mu) \quad;\quad \mu = g^{-1}(\vec{\eta})$

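where $\mu = \expectation\brak{\Phi(y)}$.

For a minimal exponential family, this link function always exists: $\mu = \grad A(\vec{\eta})$ and $A$ is strictly convex, so the map from $\vec{\eta}$ to $\mu$ is invertible.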
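
The relation between $\vec{\eta}$ and $\mu$ can be made explicit by differentiating the normalization condition of the density with respect to $\vec{\eta}$. The following is only a sketch: it assumes that differentiation and integration can be exchanged, and for discrete distributions the integral becomes a sum.

$\begin{aligned} 1 &= \int h(y)\,\exp\brak{\vec{\eta}\cdot\Phi(y) - A(\vec{\eta})}\,dy \\ 0 &= \int \paren{\Phi(y) - \grad A(\vec{\eta})}\,h(y)\,\exp\brak{\vec{\eta}\cdot\Phi(y) - A(\vec{\eta})}\,dy = \expectation\brak{\Phi(y)} - \grad A(\vec{\eta}) \end{aligned}$

Hence $\expectation\brak{\Phi(y)} = \grad A(\vec{\eta})$, which means that $g^{-1} = \grad A$ for the link function defined above.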

Usual distributions are members of the exponential family

The $Bernoulli(\mu)$ distribution is a member of the exponential family with parameters:

$\begin{cases} \Phi(y) &= y \\ \vec{\eta} &= \ln \frac{\mu}{1 - \mu} \\ A(\vec{\eta}) &= \ln(1 + e^{\vec{\eta}}) \\ h(y) &= 1 \end{cases}$

A generalized linear model for this distribution is called a logistic regression.
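
As a sanity check, the Bernoulli parameters above can be plugged into the exponential-family form and compared with the usual pmf $\mu^y(1-\mu)^{1-y}$. The sketch below uses only the standard library; the function names are illustrative choices, not part of any library.

```python
import math


def bernoulli_pmf(y, mu):
    """Standard Bernoulli pmf: mu^y * (1 - mu)^(1 - y) for y in {0, 1}."""
    return mu if y == 1 else 1.0 - mu


def logit(mu):
    """Link function g: maps the mean to the natural parameter."""
    return math.log(mu / (1.0 - mu))


def sigmoid(eta):
    """Inverse link g^{-1}: maps the natural parameter back to the mean."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_exp_family(y, eta):
    """Exponential-family form h(y) * exp(eta * Phi(y) - A(eta))."""
    log_partition = math.log(1.0 + math.exp(eta))  # A(eta)
    return 1.0 * math.exp(eta * y - log_partition)  # h(y) = 1, Phi(y) = y


mu = 0.3
eta = logit(mu)
for y in (0, 1):
    # Both parameterizations should give the same probability.
    assert math.isclose(bernoulli_exp_family(y, eta), bernoulli_pmf(y, mu))
# The inverse link recovers the mean from the natural parameter.
assert math.isclose(sigmoid(eta), mu)
```

Here the link function is the logit and its inverse is the sigmoid, which is where the name logistic regression comes from.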

The $Poisson(\mu)$ distribution is a member of the exponential family with parameters:

$\begin{cases} \Phi(y) &= y \\ \vec{\eta} &= \ln \mu \\ A(\vec{\eta}) &= e^{\vec{\eta}} = \mu \\ h(y) &= \frac{1}{y!} \end{cases}$
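
A quick numerical check of the Poisson parameterization may help here. The sketch below takes the log-partition function to be $A(\eta) = e^{\eta} = \mu$, which is what one obtains by rewriting the pmf $\mu^y e^{-\mu}/y!$ as $\frac{1}{y!}\exp(y\ln\mu - \mu)$. It then checks that the exponential-family form matches the usual pmf and sums to one. The function names are illustrative.

```python
import math


def poisson_pmf(y, mu):
    """Standard Poisson pmf: mu^y * exp(-mu) / y!."""
    return mu ** y * math.exp(-mu) / math.factorial(y)


def poisson_exp_family(y, eta):
    """Exponential-family form h(y) * exp(eta * Phi(y) - A(eta))."""
    log_partition = math.exp(eta)            # A(eta) = e^eta = mu
    base_measure = 1.0 / math.factorial(y)   # h(y) = 1 / y!
    return base_measure * math.exp(eta * y - log_partition)  # Phi(y) = y


mu = 2.5
eta = math.log(mu)  # natural parameter
for y in range(10):
    assert math.isclose(poisson_exp_family(y, eta), poisson_pmf(y, mu))

# The probabilities should sum to one (the tail beyond 100 is negligible here).
total = sum(poisson_exp_family(y, eta) for y in range(100))
assert math.isclose(total, 1.0)
```

With a log-partition function equal to zero the probabilities would not sum to one, so this check is a convenient way to catch a wrong $A(\eta)$.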

The normal distribution $\gaussian(\mu, \sigma^2)$ is a member of the exponential family with parameters:

$\begin{cases} \Phi(y) &= \begin{pmatrix}y \\ y^2\end{pmatrix} \\ \vec{\eta} &= \begin{pmatrix}\frac{\mu}{\sigma^2} \\ \frac{-1}{2\sigma^2}\end{pmatrix} \\ A(\vec{\eta}) &= \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2) \\ h(y) &= 1 \end{cases}$
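
These parameters can be checked by expanding the square in the Gaussian density and grouping the terms in $y$ and $y^2$:

$\frac{1}{\sqrt{2\pi\sigma^2}}\exp\brak{-\frac{(y-\mu)^2}{2\sigma^2}} = \exp\brak{\frac{\mu}{\sigma^2}\,y - \frac{1}{2\sigma^2}\,y^2 - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)}$

Writing $\eta_1 = \frac{\mu}{\sigma^2}$ and $\eta_2 = \frac{-1}{2\sigma^2}$ for the two components of the natural parameter, the log-partition function can also be expressed in terms of $\vec{\eta}$ alone:

$A(\vec{\eta}) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\ln(-2\eta_2) + \frac{1}{2}\ln(2\pi)$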

Generalized linear models

Setup

Let $\rve{X}$ be a random vector and $\rva{Y}$ a random variable. Assume our dataset $S$ is made of $\ndataset$ i.i.d. samples from $(\rve{X}, \rva{Y})$:

$S = \{(\ninputvec{\idataset}, \ioutputval{\idataset}) \iid (\rve{X}, \rva{Y}) \mid \idataset \leq \ndataset\}$

We suppose the distribution of $\rva{Y}$ given $\rve{X}$ is a member of the exponential family:

$p(\rva{Y} = \outputval \mid \rve{X} = \inputvec) = h(\outputval)\,\exp\brak{\eta\,\Phi(\outputval) - A(\eta)}$

where the natural parameter $\eta$ depends linearly on $\inputvec$:

$\eta = \linmodel{\linparamv}(\inputvec) = \linparamv\cdot\inputvec$
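
To make this setup concrete, here is a minimal sketch of how a prediction is formed: the linear part produces the natural parameter, and the inverse link turns it into a mean. It assumes the linear model is the dot product $\linparamv\cdot\inputvec$ and uses the inverse links of the Bernoulli and Poisson cases discussed earlier. The names `linear_predictor`, `predict_mean` and `INVERSE_LINKS`, as well as the weights and input, are hypothetical.

```python
import math


def linear_predictor(w, x):
    """Natural parameter eta = w . x for a single input vector."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Inverse links g^{-1} for two of the distributions discussed earlier.
INVERSE_LINKS = {
    "bernoulli": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # sigmoid
    "poisson": math.exp,                                     # mu = e^eta
}


def predict_mean(w, x, family):
    """Mean of Y given x under the GLM: mu = g^{-1}(w . x)."""
    return INVERSE_LINKS[family](linear_predictor(w, x))


w = [0.5, -1.0]  # hypothetical weights
x = [2.0, 1.0]   # hypothetical input, so that w . x = 0.0
print(predict_mean(w, x, "bernoulli"))  # sigmoid(0.0) = 0.5
print(predict_mean(w, x, "poisson"))    # exp(0.0) = 1.0
```

The same linear predictor is shared by both families; only the function that maps it to a mean changes.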

Loss function

We will use maximum likelihood estimation to fit $\linmodel{\linparamv}$, and hence $\eta$. Our goal is thus to maximize the likelihood:

$\linmodel{\linparamv}^* = \argmax_{\linmodel{\linparamv}} \prod_{\idataset = 1}^{\ndataset} p(\ioutputval{\idataset} \mid \ninputvec{\idataset}, \linmodel{\linparamv})$

This amounts to minimizing the negative log-likelihood. In other words, our loss function is:

$\l(\linmodel{\linparamv}, S) = - \sum_{\idataset = 1}^{\ndataset}\ln p(\ioutputval{\idataset} \mid \ninputvec{\idataset}, \linmodel{\linparamv})$

This loss function is convex in $\linparamv$.
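
To make the convexity claim more concrete, one can substitute the exponential-family density into the loss. This is a sketch that assumes the linear model is $\linmodel{\linparamv}(\inputvec) = \linparamv\cdot\inputvec$ and that the likelihood is that of $\outputval$ given $\inputvec$:

$\l(\linmodel{\linparamv}, S) = \sum_{\idataset=1}^{\ndataset}\brak{A(\linparamv\cdot\ninputvec{\idataset}) - (\linparamv\cdot\ninputvec{\idataset})\,\Phi(\ioutputval{\idataset}) - \ln h(\ioutputval{\idataset})}$

The last term does not depend on $\linparamv$, the middle term is linear in $\linparamv$, and $A$ is convex (its second derivative is the variance of $\Phi(y)$), so the whole sum is convex in $\linparamv$.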

Let’s denote by $\inputmatrix \in \realset^{\ndataset\times \inputdim}$ the matrix whose $\idataset$-th row is the vector $\ninputvec{\idataset}^{\top}$:

$X = \begin{bmatrix} \longleftarrow & \inputvec^{\top}_1 & \longrightarrow \\ & \vdots & \\ \longleftarrow & \inputvec^{\top}_\ndataset & \longrightarrow \\ \end{bmatrix}$

The gradient with respect to $\linparamv$ is, with $g^{-1}$ and $\Phi$ applied element-wise:

$\grad\l(\linmodel{\linparamv}, S) = \inputmatrix^{\top}\brak{g^{-1}(\inputmatrix\linparamv) - \Phi(\outputvec)}$
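
One way to gain confidence in this gradient formula is to compare it with finite differences of the loss. The sketch below does so for the Poisson case, where $g^{-1} = \exp$, $\Phi(y) = y$ and $A(\eta) = e^{\eta}$. The toy data and all function names are hypothetical, and plain Python lists stand in for the matrix $X$.

```python
import math


def poisson_loss(w, X, y):
    """Negative log-likelihood, dropping the ln h(y) term (constant in w)."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        eta = sum(w_i * x_i for w_i, x_i in zip(w, x_n))
        total += math.exp(eta) - eta * y_n  # A(eta) - eta * Phi(y)
    return total


def poisson_gradient(w, X, y):
    """X^T [g^{-1}(X w) - Phi(y)], accumulated one row of X at a time."""
    grad = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        eta = sum(w_i * x_i for w_i, x_i in zip(w, x_n))
        residual = math.exp(eta) - y_n  # g^{-1}(eta) - Phi(y)
        for i, x_i in enumerate(x_n):
            grad[i] += x_i * residual
    return grad


def numerical_gradient(loss, w, X, y, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((loss(w_plus, X, y) - loss(w_minus, X, y)) / (2 * eps))
    return grad


# Hypothetical toy data: 3 samples, 2 features (the first one is an intercept).
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
y = [2, 0, 5]
w = [0.1, 0.3]

analytic = poisson_gradient(w, X, y)
numeric = numerical_gradient(poisson_loss, w, X, y)
for a, n in zip(analytic, numeric):
    assert math.isclose(a, n, rel_tol=1e-5, abs_tol=1e-6)
```

The same check works for other members of the family by swapping in the corresponding $A$ and $g^{-1}$.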

This loss function can then be minimized using gradient descent to find $\bestmodel$.
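
As an illustration, here is a minimal gradient descent sketch for the Bernoulli case (logistic regression), where $g^{-1}$ is the sigmoid and $\Phi(y) = y$. The fixed learning rate, the number of steps, the toy data and the function names are all assumptions made for this example, not a recommended training setup.

```python
import math


def sigmoid(eta):
    """Inverse link g^{-1} for the Bernoulli case."""
    return 1.0 / (1.0 + math.exp(-eta))


def gradient(w, X, y):
    """X^T [sigmoid(X w) - y], accumulated one row of X at a time."""
    grad = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        eta = sum(w_i * x_i for w_i, x_i in zip(w, x_n))
        residual = sigmoid(eta) - y_n
        for i, x_i in enumerate(x_n):
            grad[i] += x_i * residual
    return grad


def gradient_descent(X, y, learning_rate=0.1, n_steps=2000):
    """Plain gradient descent with a fixed step size, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(n_steps):
        grad = gradient(w, X, y)
        w = [w_i - learning_rate * g_i for w_i, g_i in zip(w, grad)]
    return w


# Hypothetical toy data. The first column is a constant 1 (intercept), and
# the labels are deliberately not linearly separable so that the maximum
# likelihood estimate is finite.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 0.5]]
y = [0, 0, 1, 0, 1, 1]

w_star = gradient_descent(X, y)
print(w_star)                  # fitted weights
print(gradient(w_star, X, y))  # close to zero at the optimum
```

Because the loss is convex, a gradient that is close to zero indicates that the weights are close to a global minimizer. In practice, second-order methods are often preferred for this kind of problem, but the gradient-based loop above is the most direct translation of the formula.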