stochastic_gradient_descent_brothers
# SGD
For more on the theory, see this reading note; the code snippets here are mainly for reviewing the concepts.
SGD only uses first-order derivatives; for optimization methods that use second-order derivatives, see this reading note.
## vanilla SGD
$$ x_{t+1} = x_t - \alpha \nabla f(x_t) $$
```python
for t in range(num_steps):
    dw = compute_gradient(w)
    w -= learning_rate * dw
```
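To see the update in action, here is a minimal runnable sketch (the quadratic objective, its gradient, and the hyperparameters below are illustrative choices, not part of the original notes):

```python
import numpy as np

# Assumed toy objective: f(w) = 0.5 * w^T A w, whose gradient is A w.
A = np.diag([1.0, 10.0])

def compute_gradient(w):
    return A @ w

w = np.array([1.0, 1.0])
learning_rate = 0.05
num_steps = 100

for t in range(num_steps):
    dw = compute_gradient(w)
    w -= learning_rate * dw

print(w)  # should end up close to the minimizer [0, 0]
```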
## SGD + Momentum
But what is momentum, exactly? See this short article.
$$ v_{t+1} = \rho v_t + \nabla f(x_t) \\ x_{t+1} = x_t - \alpha v_{t+1} $$
```python
v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v + dw
    w -= learning_rate * v
```
or the equivalent:
```python
v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v - learning_rate * dw
    w += v
```
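As a quick sanity check that the two formulations really match, here is a small sketch (the quadratic objective and hyperparameters are assumed for illustration). With a constant learning rate, the second form's velocity is just the first form's velocity scaled by `-learning_rate`, so the iterates coincide:

```python
import numpy as np

A = np.diag([1.0, 10.0])              # assumed toy quadratic: f(w) = 0.5 * w^T A w
compute_gradient = lambda w: A @ w

learning_rate, rho, num_steps = 0.05, 0.9, 50

w1, v1 = np.array([1.0, 1.0]), np.zeros(2)   # form 1: v accumulates raw gradients
w2, v2 = np.array([1.0, 1.0]), np.zeros(2)   # form 2: v accumulates the scaled step

for t in range(num_steps):
    dw1 = compute_gradient(w1)
    v1 = rho * v1 + dw1
    w1 -= learning_rate * v1

    dw2 = compute_gradient(w2)
    v2 = rho * v2 - learning_rate * dw2
    w2 += v2

print(np.allclose(w1, w2))  # True: identical trajectories when the learning rate is constant
```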
## Nesterov Momentum
- Momentum update: Combine gradient at current point with velocity to get step used to update weights
- Nesterov Momentum: “Look ahead” to the point where updating using velocity would take us; compute gradient there and mix it with velocity to get actual update direction
$$ \begin{eqnarray} v_{t+1} &=& \rho v_t - \alpha \nabla f(x_t + \rho v_t) \\ x_{t+1} &=& x_t + v_{t+1} \end{eqnarray} $$ However, we would rather write the update in terms of $x_t$ and $\nabla f(x_t)$ (rather than $\nabla f(x_t + \rho v_t)$), so the usual trick is to introduce the substitute variable $\tilde{x}_t = x_t + \rho v_t$ and rearrange:
$$ \begin{eqnarray} v_{t+1} &=& \rho v_t - \alpha \nabla f(\tilde{x}_t) \\ \tilde{x}_{t+1} &=& \tilde{x}_t - \rho v_t + (1 + \rho) v_{t+1} \\ &=& \tilde{x}_t + v_{t+1} + \rho (v_{t+1} - v_t) \end{eqnarray} $$
```python
v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    prev_v = v
    v = rho * v - learning_rate * dw
    w -= rho * prev_v - (1 + rho) * v
```
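To check that the change of variables preserves the trajectory, the sketch below (assumed toy quadratic and hyperparameters) runs the lookahead form on $x_t$ and the rewritten form on $\tilde{x}_t$, then verifies that $\tilde{x}_t = x_t + \rho v_t$ at the end:

```python
import numpy as np

A = np.diag([1.0, 10.0])              # assumed toy quadratic: f(x) = 0.5 * x^T A x
grad = lambda x: A @ x

learning_rate, rho, num_steps = 0.05, 0.9, 50

x, v = np.array([1.0, 1.0]), np.zeros(2)    # lookahead form: gradient at x + rho * v
w, v2 = np.array([1.0, 1.0]), np.zeros(2)   # rewritten form: w tracks x_tilde (v_0 = 0, so x_tilde_0 = x_0)

for t in range(num_steps):
    # lookahead update
    v = rho * v - learning_rate * grad(x + rho * v)
    x = x + v

    # change-of-variables update (same as the snippet above)
    dw = grad(w)
    prev_v = v2
    v2 = rho * v2 - learning_rate * dw
    w -= rho * prev_v - (1 + rho) * v2

print(np.allclose(w, x + rho * v))  # True: x_tilde_t == x_t + rho * v_t
```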
## RMSProp: “Leaky AdaGrad”
- AdaGrad
```python
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared += dw * dw
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
```
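As a runnable illustration (the toy objective and hyperparameters are assumed here), AdaGrad divides each coordinate by its own gradient history, so a steep coordinate and a shallow one make similar progress:

```python
import numpy as np

# Assumed toy objective: f(w) = 0.5 * (w[0]**2 + 100 * w[1]**2)
grad = lambda w: np.array([w[0], 100.0 * w[1]])

w = np.array([1.0, 1.0])
grad_squared = np.zeros(2)
learning_rate = 0.5

for t in range(100):
    dw = grad(w)
    grad_squared += dw * dw
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)

print(w)  # both coordinates shrink at essentially the same rate despite the 100x curvature gap
```

For comparison, vanilla SGD with the same learning rate would diverge along the steep coordinate (0.5 * 100 > 2).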
- RMSProp
```python
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw  # leaky running average, not a raw sum
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
```
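The only change from AdaGrad is the leaky accumulator, and a tiny sketch (with an assumed stream of constant-magnitude gradients) shows why it matters: AdaGrad's sum grows without bound, so its effective step size decays toward zero, while RMSProp's running average tracks the recent gradient scale:

```python
import numpy as np

# Assumed: 1000 steps with gradient magnitude 1.0 (e.g. still far from a minimum).
num_steps, decay_rate, learning_rate = 1000, 0.99, 0.01

adagrad_sum, rmsprop_avg = 0.0, 0.0
for t in range(num_steps):
    dw = 1.0
    adagrad_sum += dw * dw
    rmsprop_avg = decay_rate * rmsprop_avg + (1 - decay_rate) * dw * dw

# Effective multiplier applied to the gradient at the last step:
print(learning_rate / (np.sqrt(adagrad_sum) + 1e-7))   # ~0.0003: keeps shrinking as the sum grows
print(learning_rate / (np.sqrt(rmsprop_avg) + 1e-7))   # ~0.01: stabilizes near the base learning rate
```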
- Adam: RMSProp + Momentum
```python
moment1 = 0
moment2 = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    w -= learning_rate * moment1 / (moment2.sqrt() + 1e-7)
```
What happens in this version of Adam on the first iteration, with t = 0 and beta2 = 0.999? moment2 is still very close to zero, so the division acts almost like 1/0 and the very first step can be huge. That is usually bad and needs to be corrected.
- Bias correction: correct for the fact that the first and second moment estimates start at zero
```python
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # start t at 1 so that 1 - beta ** t is nonzero
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    moment1_unbias = moment1 / (1 - beta1 ** t)
    moment2_unbias = moment2 / (1 - beta2 ** t)
    w -= learning_rate * moment1_unbias / (moment2_unbias.sqrt() + 1e-7)
```
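Putting it all together, here is a self-contained sketch of Adam with bias correction on an assumed toy quadratic (the hyperparameters are the commonly used defaults, chosen here purely for illustration):

```python
import numpy as np

A = np.diag([1.0, 10.0])              # assumed toy quadratic: f(w) = 0.5 * w^T A w
compute_gradient = lambda w: A @ w

w = np.array([1.0, 1.0])
moment1 = np.zeros(2)
moment2 = np.zeros(2)
learning_rate, beta1, beta2 = 1e-2, 0.9, 0.999
num_steps = 2000

for t in range(1, num_steps + 1):
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw        # running average of gradients (momentum)
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw   # running average of squared gradients (RMSProp)
    moment1_unbias = moment1 / (1 - beta1 ** t)         # undo the bias from zero initialization
    moment2_unbias = moment2 / (1 - beta2 ** t)
    w -= learning_rate * moment1_unbias / (np.sqrt(moment2_unbias) + 1e-7)

print(w)  # should settle near the minimizer [0, 0] (within roughly the step size)
```

Thanks to the bias correction, the very first update has a sensible magnitude instead of being inflated by the near-zero moment2.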