stochastic_gradient_descent_brothers

# SGD

For more on the theory, see this reading note; the code snippets here are mainly meant as a refresher on the concepts.

SGD uses only first-order derivatives; for optimization methods that make use of second-order derivatives, see this reading note.

## vanilla SGD

$$ x_{t+1} = x_t - \alpha \nabla f(x_t) $$

for t in range(num_steps):
    dw = compute_gradient(w)
    w -= learning_rate * dw
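
As a runnable counterpart to the pseudocode, here is a minimal sketch with an assumed toy objective $f(w) = \|w\|^2$, whose gradient is simply $2w$; compute_gradient stands in for whatever backprop would return.

import numpy as np

def compute_gradient(w):
    return 2.0 * w                      # gradient of f(w) = sum(w**2)

w = np.array([3.0, -2.0])
learning_rate, num_steps = 0.1, 50

for t in range(num_steps):
    dw = compute_gradient(w)
    w -= learning_rate * dw

print(w)                                # close to the minimizer [0, 0]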

## SGD + Momentum

But what exactly is momentum? See this short article.

$$ v_{t+1} = \rho v_t + \nabla f(x_t) \\ x_{t+1} = x_t - \alpha v_{t+1} $$

v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v + dw
    w -= learning_rate * v

or the equivalent formulation, with the learning rate absorbed into the velocity:

v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v - learning_rate * dw
    w += v
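
A quick numerical check of the claimed equivalence, using the same assumed toy gradient: the velocity in the second loop is just the first loop's velocity scaled by -learning_rate, so the weight iterates coincide.

def compute_gradient(w):
    return 2.0 * w                      # toy gradient, as in the vanilla SGD sketch

rho, learning_rate = 0.9, 0.1
w1, v1 = 3.0, 0.0                       # first formulation
w2, v2 = 3.0, 0.0                       # second formulation

for t in range(20):
    v1 = rho * v1 + compute_gradient(w1)
    w1 -= learning_rate * v1

    v2 = rho * v2 - learning_rate * compute_gradient(w2)
    w2 += v2

    assert abs(w1 - w2) < 1e-9 and abs(v2 + learning_rate * v1) < 1e-9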

## Nesterov Momentum

  • Momentum update: Combine gradient at current point with velocity to get step used to update weights
  • Nesterov Momentum: “Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction

$$ \begin{eqnarray} v_{t+1} &=& \rho v_t - \alpha \nabla f(x_t + \rho v_t) \\ x_{t+1} &=& x_t + v_{t+1} \end{eqnarray} $$

However, we would rather express the update in terms of $x_t, \nabla f(x_t)$ (instead of $\nabla f(x_t + \rho v_t)$), so the usual trick is a change of variables $\tilde{x}_t = x_t + \rho v_t$:

$$ \begin{eqnarray} v_{t+1} &=& \rho v_t - \alpha \nabla f(\tilde{x}_t) \\ \tilde{x}_{t+1} &=& \tilde{x}_t - \rho v_t + (1 + \rho) v_{t+1} \\ &=& \tilde{x}_t + v_{t+1} + \rho (v_{t+1} - v_t) \end{eqnarray} $$

v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    prev_v = v
    v = rho * v - learning_rate * dw
    # i.e. x_{t+1} = x_t - rho * v_t + (1 + rho) * v_{t+1}
    w -= rho * prev_v - (1 + rho) * v
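
To sanity-check the change of variables, the sketch below (same assumed toy gradient) runs the look-ahead form in $x$ and the rewritten form in $\tilde{x}$ side by side and confirms that $\tilde{x}_t - \rho v_t$ recovers $x_t$ at every step.

def grad(x):
    return 2.0 * x                           # toy objective f(x) = x**2

rho, lr = 0.9, 0.1
x, v = 3.0, 0.0                              # look-ahead form in x
xt, vt = 3.0, 0.0                            # change-of-variables form in x_tilde

for t in range(20):
    v = rho * v - lr * grad(x + rho * v)     # gradient at the look-ahead point
    x = x + v

    vt_new = rho * vt - lr * grad(xt)        # gradient at x_tilde directly
    xt = xt - rho * vt + (1 + rho) * vt_new
    vt = vt_new

    assert abs((xt - rho * vt) - x) < 1e-9   # x_tilde - rho * v recovers x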

## RMSProp: “Leaky AdaGrad”

  • AdaGrad
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared += dw * dw
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
  • RMSProp (a numerical comparison with AdaGrad follows this list)
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    # decay the running average instead of accumulating it forever
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
  • Adam: RMSProp + Momentum
moment1 = 0
moment2 = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    w -= learning_rate * moment1 / (moment2.sqrt() + 1e-7)
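
The "leaky" behaviour is easiest to see numerically. The sketch below assumes a constant gradient of 1.0: AdaGrad's accumulator grows without bound, so its effective step keeps shrinking, while RMSProp's decayed accumulator saturates and the step levels off near learning_rate.

g, learning_rate, decay_rate, eps = 1.0, 1e-2, 0.9, 1e-7

adagrad_acc, rmsprop_acc = 0.0, 0.0
for t in range(1, 1001):
    adagrad_acc += g * g
    rmsprop_acc = decay_rate * rmsprop_acc + (1 - decay_rate) * g * g
    if t in (1, 10, 100, 1000):
        # AdaGrad's effective step decays like 1/sqrt(t); RMSProp's approaches learning_rate
        print(t, learning_rate * g / (adagrad_acc ** 0.5 + eps),
                 learning_rate * g / (rmsprop_acc ** 0.5 + eps))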

In this version of Adam, what happens at t = 0 with beta2 = 0.999? The second moment starts at zero, so dividing by its square root behaves almost like dividing by zero, and the very first iteration takes a disproportionately large step. That is usually undesirable and needs to be corrected.
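
To make this concrete, here is a worked first step with assumed values dw = 1.0, beta1 = 0.9, beta2 = 0.999, learning_rate = 1e-3 (the corrected variant anticipates the bias correction introduced next):

dw, beta1, beta2, learning_rate, eps = 1.0, 0.9, 0.999, 1e-3, 1e-7

moment1 = (1 - beta1) * dw           # 0.1   -- shrunk 10x by the zero initialization
moment2 = (1 - beta2) * dw * dw      # 0.001 -- shrunk 1000x by the zero initialization

raw_step = learning_rate * moment1 / (moment2 ** 0.5 + eps)
corrected = learning_rate * (moment1 / (1 - beta1)) / ((moment2 / (1 - beta2)) ** 0.5 + eps)
print(raw_step)   # ~3.2e-3: roughly 3x larger than intended
print(corrected)  # ~1.0e-3: about learning_rate, as expected when |dw| = 1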

  • Bias correction: correct for the fact that the first and second moment estimates start at zero
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):   # start t at 1 so that 1 - beta ** t is nonzero
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    moment1_unbias = moment1 / (1 - beta1 ** t)
    moment2_unbias = moment2 / (1 - beta2 ** t)
    w -= learning_rate * moment1_unbias / (moment2_unbias.sqrt() + 1e-7)
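
In practice one usually reaches for a library implementation rather than the loop above; here is a minimal usage sketch, assuming PyTorch is available (the parameter, target, and loss are made up for illustration).

import torch

w = torch.zeros(10, requires_grad=True)               # toy parameter vector
target = torch.ones(10)                               # hypothetical regression target
opt = torch.optim.Adam([w], lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

for step in range(100):
    loss = ((w - target) ** 2).mean()                 # simple quadratic loss
    opt.zero_grad()
    loss.backward()                                   # fills in w.grad
    opt.step()                                        # bias-corrected Adam update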