What is stochastic gradient descent (SGD)?

Oct 16, 2018

Stochastic gradient descent is an algorithm that tries to find the minimum of a function expressed as a sum of component functions. It does so by repeatedly choosing a component function at random and taking a small step against its gradient.

Sum functions

Suppose f is expressed as the mean of N component functions f_n:

f(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x)
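
For instance, in a least-squares problem each component function can be the squared error on one training example. Here is a toy sketch (the dataset and the quadratic loss are just for illustration):

# Toy least-squares example: each component function is the
# squared error on a single (input, target) pair.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]   # assumed toy dataset

def make_component(x_n, y_n):
    # f_n(w) = (w * x_n - y_n)^2
    return lambda w: (w * x_n - y_n) ** 2

# f is the list of component functions, as in the code later in this post
f = [make_component(x_n, y_n) for x_n, y_n in data]

# The full objective is the mean of the components
def objective(w):
    return sum(f_n(w) for f_n in f) / len(f)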

The gradient

The gradient ∇f(x) of f at x is a vector pointing in the direction of steepest ascent at x.

[Figure: the gradient]

We can try to find the minimum of f by taking iterative steps in the direction of −∇f. This is what the gradient descent algorithm does.
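
For comparison with what follows, here is a minimal sketch of plain gradient descent; it assumes the same gradient(of=..., at=...) helper used by the SGD code below:

def gradient_descent(f, x_0, max_steps, step_size):
    """
    f is the full objective function.
    Assumes a gradient(of=..., at=...) helper is available.
    """
    x = x_0
    for step in range(max_steps):
        # Step against the full gradient of f
        x = x - step_size * gradient(of=f, at=x)
    return x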

Sometimes, computing the gradient of f is too computationally expensive. Stochastic gradient descent reduces the computational cost by using the gradient ∇f_n of a single component function instead.

Stochastic gradient descent algorithm

The stochastic gradient descent algorithm aims at finding the minimum of f = (1/N) ∑_{n=1}^{N} f_n by taking successive steps directed along −∇f_n, where at each step n is chosen at random.

import random

def stochastic_gradient_descent(f, x_0, max_steps, step_size):
    """
    f is a list such that f[n-1]
    is the n-th component function.
    Assumes a gradient(of=..., at=...) helper that
    returns the gradient of a function at a point.
    """
    x = x_0
    for step in range(max_steps):
        # Choose the component function at random
        n = random.randint(0, len(f)-1)
        # Update step: move against the gradient of the chosen component
        x = x - step_size * gradient(of=f[n], at=x)
    return x
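
The gradient helper is left abstract above. For a one-dimensional problem it could be approximated with a central finite difference; the helper below and the call after it are just a sketch, reusing the toy component list f from the least-squares example above:

# Hypothetical numerical gradient for a scalar parameter (central difference)
def gradient(of, at, h=1e-6):
    return (of(at + h) - of(at - h)) / (2 * h)

# Run SGD on the toy least-squares components, starting from w = 0.0
w = stochastic_gradient_descent(f, x_0=0.0, max_steps=1000, step_size=0.01)
print(w)  # hovers around the least-squares slope of the toy data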

Comparison with gradient descent

The update rule for gradient descent at step t is:

x_{t+1} = x_t - \gamma \, \nabla f(x_t)

While the update rule for stochastic gradient descent is:

x_{t+1} = x_t - \gamma \, \nabla f_n(x_t), \quad \text{where } n \text{ is chosen at random}

Is it really faster than gradient descent?

For I iterations and N training examples, gradient descent computes I·N component gradients while stochastic gradient descent computes only I. When N ≫ I, the saving is dramatic: SGD finishes its whole run before gradient descent has even completed a single iteration (which alone costs N component gradients).

In practice SGD is worth a shot when we have a very large number of training examples. In machine learning, it often happens that we have hundreds of thousands or millions of training examples (i.e. N > 10^6) and that SGD still converges after only a few hundred iterations (i.e. I < 1000).
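
To make the numbers concrete, take N = 10^6 training examples and I = 1000 iterations purely as an illustration:

I \cdot N = 1000 \times 10^6 = 10^9 \ \text{component gradients for gradient descent,} \quad \text{versus} \quad I = 1000 \ \text{for SGD}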

Why it works

This algorithm works because ∇f_n(x) is an unbiased estimate of ∇f(x) that is much cheaper to compute.

Computational cost

The gradient of a sum (or mean) is the sum (or mean) of the gradients, so:

\nabla f = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n

Evaluating ∇f at a point therefore requires N component-gradient evaluations, while evaluating a single ∇f_n requires only one, which explains why ∇f_n is computationally cheaper than ∇f.

Correctness

Furthermore, ∇f_n is an unbiased estimate of ∇f. Indeed:

\mathbb{E}_n[\nabla f_n(x)] = \nabla f(x)

where \mathbb{E}_n is the expectation taken over the distribution of n (which, remember, is chosen at random).
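
To spell out the one-line computation (assuming, as in the code above, that n is drawn uniformly from {1, …, N}):

\mathbb{E}_n[\nabla f_n(x)] = \sum_{n=1}^{N} \frac{1}{N} \, \nabla f_n(x) = \frac{1}{N} \sum_{n=1}^{N} \nabla f_n(x) = \nabla f(x)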

Mini-batch SGD

There is a variant of stochastic gradient descent called mini-batch, where instead of using only one component function, several component functions are used. These component functions are chosen at random at each step.

In this version, instead of simply choosing one index n at random, a subset B ⊆ {1, …, N} is chosen at random. The update step of the algorithm is the same, with ∇f_n replaced by (1/|B|) ∑_{n∈B} ∇f_n.

Comparison with SGD

The update rule for SGD at step t is:

x_{t+1} = x_t - \gamma \left[ \nabla f_n(x_t) \right], \quad \text{where } n \text{ is chosen at random}

While the update rule for mini-batch SGD at step t is:

x_{t+1} = x_t - \gamma \left[ \frac{1}{|B|} \sum_{n \in B} \nabla f_n(x_t) \right], \quad \text{where } B \text{ is chosen at random}

The mini-batch algorithm

def minibatch_SGD(f, x_0, max_steps, step_size):
    """
    f is a list such that f[n-1]
    is the n-th component function.
    Assumes the same gradient(of=..., at=...) helper as above.
    """
    x = x_0
    for step in range(max_steps):
        # Choose a contiguous batch B = [a; b] at random
        a = random.randint(0, len(f)-1)
        b = random.randint(a, len(f)-1)
        # Compute the sum of the component gradients over the batch
        sigma = sum(gradient(of=f[n], at=x) for n in range(a, b+1))
        # Update step: move against the average batch gradient
        x = x - step_size * sigma / (b - a + 1)
    return x
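
The sketch above draws a contiguous block of indices, so the batch size varies from step to step. A variant that samples a fixed-size subset uniformly at random, closer to the description of B as a random subset of {1, …, N}, could look like this (batch_size is an extra, assumed parameter):

def minibatch_SGD_sampled(f, x_0, max_steps, step_size, batch_size):
    """
    Variant sketch: B is a uniformly sampled subset of fixed size.
    Assumes the same gradient(of=..., at=...) helper as above.
    """
    x = x_0
    for step in range(max_steps):
        # Choose batch_size distinct indices uniformly at random
        B = random.sample(range(len(f)), batch_size)
        # Average the component gradients over the batch
        sigma = sum(gradient(of=f[n], at=x) for n in B)
        x = x - step_size * sigma / batch_size
    return x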