What is stochastic gradient descent (SGD)?

Oct 16, 2018

Stochastic gradient descent is an algorithm that tries to find the minimum of a function expressed as a sum of component functions. It does so by repeatedly choosing a component function at random and taking a small step against its gradient.

Sum functions

Suppose f is expressed as the mean of N component functions f_n:

f(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)
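
For instance, in a least-squares problem each component function can be the squared error on one training example. Here is a toy sketch (the dataset and the quadratic loss are just for illustration):

# Toy least-squares example: each component function is the
# squared error on a single (input, target) pair.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # assumed toy dataset

def make_component(x_n, y_n):
    # f_n(w) = (w * x_n - y_n)^2
    return lambda w: (w * x_n - y_n) ** 2

# f is the list of component functions, as in the code later in this post
f = [make_component(x_n, y_n) for x_n, y_n in data]

# The full objective is the mean of the components
def objective(w):
    return sum(f_n(w) for f_n in f) / len(f)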

The gradient

The gradient ∇f(x) of f at x is a vector pointing in the direction of steepest ascent at x.

[Figure: the gradient]

We can try to find the minimum of f by taking iterative steps in the direction of −∇f. This is what the gradient descent algorithm does.
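
For comparison with what follows, here is a minimal sketch of plain gradient descent; it assumes the same gradient(of=..., at=...) helper used by the SGD code below:

def gradient_descent(f, x_0, max_steps, step_size):
    """
    f is the full objective function.
    Assumes a gradient(of=..., at=...) helper is available.
    """
    x = x_0
    for step in range(max_steps):
        # Step against the full gradient of f
        x = x - step_size * gradient(of=f, at=x)
    return x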

Sometimes, computing the gradient of f is too computationally expensive. Stochastic gradient descent reduces the computational cost by using the gradient ∇f_n of a single component function instead.

Stochastic gradient descent algorithm

The stochastic gradient descent algorithm aims at finding the minimum of f = (1/N) ∑_{n=1}^{N} f_n by taking successive steps directed along −∇f_n, where at each step n is chosen at random.

import random

def stochastic_gradient_descent(f, x_0, max_steps, step_size):
    """
    f is a list such that f[n-1]
    is the n-th component function.
    Assumes a gradient(of=..., at=...) helper that
    returns the gradient of a function at a point.
    """
    x = x_0
    for step in range(max_steps):
        # Choose the component function at random
        n = random.randint(0, len(f)-1)
        # Update step: move against the gradient of the chosen component
        x = x - step_size * gradient(of=f[n], at=x)
    return x
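
The gradient helper is left abstract above. For a one-dimensional problem it could be approximated with a central finite difference; the helper below and the call after it are just a sketch, reusing the toy component list f from the least-squares example above:

# Hypothetical numerical gradient for a scalar parameter (central difference)
def gradient(of, at, h=1e-6):
    return (of(at + h) - of(at - h)) / (2 * h)

# Run SGD on the toy least-squares components, starting from w = 0.0
w = stochastic_gradient_descent(f, x_0=0.0, max_steps=1000, step_size=0.01)
print(w)  # hovers around the least-squares slope of the toy data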

Comparison with gradient descent

The update rule for gradient descent at step t is:

x_{t+1} = x_t - \gamma \, \nabla f(x_t)

While the update rule for stochastic gradient descent is:

x_{t+1} = x_t - \gamma \, \nabla f_n(x_t), \quad \text{where } n \text{ is chosen at random}

Is it really faster than gradient descent?

For I iterations and N training examples, gradient descent computes I·N component gradients while stochastic gradient descent computes only I. When N ≫ I, the saving is dramatic: SGD finishes its whole run before gradient descent has even completed a single iteration (which alone costs N component gradients).

In practice SGD is worth a shot when we have a very large number of training examples. In machine learning, it often happens that we have hundreds of thousands or millions of training examples (i.e. N > 10^6) and that SGD still converges after only a few hundred iterations (i.e. I < 1000).
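
To make the numbers concrete, take N = 10^6 training examples and I = 1000 iterations purely as an illustration:

I \cdot N = 1000 \times 10^6 = 10^9 \ \text{component gradients for gradient descent,} \quad \text{versus} \quad I = 1000 \ \text{for SGD}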

Why it works

This algorithm works because ∇f_n(x) is an unbiased estimate of ∇f(x) that is much cheaper to compute.

Computational cost

The gradient of a sum (or mean) is the sum (or mean) of the gradients, so:

\nabla f = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n

Evaluating ∇f at a point therefore requires N component-gradient evaluations, while evaluating a single ∇f_n requires only one, which explains why ∇f_n is computationally cheaper than ∇f.

Correctness

Furthermore, ∇f_n is an unbiased estimate of ∇f. Indeed:

\mathbb{E}_n[\nabla f_n(x)] = \nabla f(x)

where \mathbb{E}_n is the expectation taken over the distribution of n (which, remember, is chosen at random).
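
To spell out the one-line computation (assuming, as in the code above, that n is drawn uniformly from {1, …, N}):

\mathbb{E}_n[\nabla f_n(x)] = \sum_{n=1}^{N} \frac{1}{N} \, \nabla f_n(x) = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(x) = \nabla f(x)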

Mini-batch SGD

There is a variant of stochastic gradient descent called mini-batch, where instead of using only one component function, several component functions are used. These component functions are chosen at random at each step.

In this version, instead of simply choosing one index n at random, a subset B ⊆ {1, …, N} is chosen at random. The update step of the algorithm is the same, with ∇f_n replaced by (1/|B|) ∑_{n∈B} ∇f_n.

Comparison with SGD

The update rule for SGD at step t is:

x_{t+1} = x_t - \gamma \left[ \nabla f_n(x_t) \right], \quad \text{where } n \text{ is chosen at random}

While the update rule for mini-batch SGD at step t is:

x_{t+1} = x_t - \gamma \left[ \frac{1}{|B|} \sum_{n \in B} \nabla f_n(x_t) \right], \quad \text{where } B \text{ is chosen at random}

The mini-batch algorithm

def minibatch_SGD(f, x_0, max_steps, step_size):
    """
    f is a list such that f[n-1]
    is the n-th component function.
    Assumes the same gradient(of=..., at=...) helper as above.
    """
    x = x_0
    for step in range(max_steps):
        # Choose a contiguous batch B = [a; b] at random
        a = random.randint(0, len(f)-1)
        b = random.randint(a, len(f)-1)
        # Compute the sum of the component gradients over the batch
        sigma = sum(gradient(of=f[n], at=x) for n in range(a, b+1))
        # Update step: move against the average batch gradient
        x = x - step_size * sigma / (b - a + 1)
    return x
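
The sketch above draws a contiguous block of indices, so the batch size varies from step to step. A variant that samples a fixed-size subset uniformly at random, closer to the description of B as a random subset of {1, …, N}, could look like this (batch_size is an extra, assumed parameter):

def minibatch_SGD_sampled(f, x_0, max_steps, step_size, batch_size):
    """
    Variant sketch: B is a uniformly sampled subset of fixed size.
    Assumes the same gradient(of=..., at=...) helper as above.
    """
    x = x_0
    for step in range(max_steps):
        # Choose batch_size distinct indices uniformly at random
        B = random.sample(range(len(f)), batch_size)
        # Average the component gradients over the batch
        sigma = sum(gradient(of=f[n], at=x) for n in B)
        x = x - step_size * sigma / batch_size
    return x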