## 8.6 Approximate Second-Order Methods
For simplicity of exposition, the only objective function we examine is the empirical risk: $$ \tag{8.25} J(\theta) = \mathbb{E}_{\mathbf{x}, \mathbf{y} \sim \hat{p}_{\text{data}}} [L(f(x; \theta), y)] = \frac{1}{m} \sum^m_{i=1} L(f(x^{(i)}; \theta), y^{(i)}). $$ In contrast to first-order methods, second-order methods make use of the second derivatives to improve optimization. The most widely used second-order method is Newton’s method.
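To make equation 8.25 concrete, the following sketch computes the empirical risk as the average per-example loss over a dataset. The linear model, squared-error loss, and toy data here are illustrative assumptions, not anything specified in the text.

```python
import numpy as np

def empirical_risk(f, L, theta, X, y):
    """Average per-example loss (1/m) * sum_i L(f(x_i; theta), y_i), as in equation 8.25."""
    losses = [L(f(x_i, theta), y_i) for x_i, y_i in zip(X, y)]
    return np.mean(losses)

# Hypothetical example: a linear model with squared-error loss.
f = lambda x, theta: x @ theta                # prediction f(x; theta)
L = lambda y_hat, y: 0.5 * (y_hat - y) ** 2   # per-example loss
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))             # m = 100 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])            # targets from a fixed reference parameter
theta = np.zeros(3)
print(empirical_risk(f, L, theta, X, y))
```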
### 8.6.1 Newton’s Method
Newton’s method is an optimization scheme based on using a second-order Taylor series expansion to approximate $J(\theta)$ near some point $\theta_0$, ignoring derivatives of higher order: $$ \tag{8.26} J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^{\mathsf{T}} \nabla_{\theta} J(\theta_0) + \frac{1}{2} (\theta - \theta_0)^{\mathsf{T}} H (\theta - \theta_0), $$ where $H$ is the Hessian of $J$ with respect to $\theta$ evaluated at $\theta_0$. If we then solve for the critical point of this function, we obtain the Newton parameter update rule: $$ \tag{8.27} \theta^* = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0). $$ Thus for a locally quadratic function (with positive definite $H$), by rescaling the gradient by $H^{-1}$, Newton’s method jumps directly to the minimum (here $\theta^*$ denotes the value of $\theta$ that minimizes $J(\theta)$). If the objective function is convex but not quadratic (there are higher-order terms), this update can be iterated, yielding the training algorithm associated with Newton’s method, given in algorithm 8.8.
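As a minimal sketch of the iterated update in equation 8.27, the code below applies Newton steps to a convex but non-quadratic objective. The quartic test objective, the iteration count, and the convergence tolerance are illustrative assumptions rather than part of the algorithm as stated in the text.

```python
import numpy as np

def newton_method(grad, hess, theta0, num_iters=20, tol=1e-10):
    """Iterate theta <- theta - H^{-1} grad J(theta), as in equation 8.27."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        g = grad(theta)
        H = hess(theta)
        step = np.linalg.solve(H, g)    # solve H * step = g rather than forming H^{-1}
        theta = theta - step
        if np.linalg.norm(step) < tol:  # stop once the Newton step is negligible
            break
    return theta

# Illustrative convex, non-quadratic objective:
# J(theta) = 0.5 * theta^T A theta + 0.25 * (theta^T theta)^2, minimized at theta = 0.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda th: A @ th + (th @ th) * th                              # gradient of J
hess = lambda th: A + 2.0 * np.outer(th, th) + (th @ th) * np.eye(2)   # Hessian of J
print(newton_method(grad, hess, theta0=[1.0, -1.0]))  # converges to approximately [0, 0]
```

Note that the sketch solves the linear system $H\,\Delta\theta = \nabla_{\theta} J(\theta)$ rather than explicitly forming $H^{-1}$, which is both cheaper and more numerically stable than inverting the Hessian.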