7.13. Gradient descent¶

Batch GD & Mini-batch GD
GD with momentum
Adam

7.13.1. Batch GD & Mini-batch GD¶

The vanilla GD updates the parameters for entire dataset:

\[ \theta = \theta - \eta\nabla_\theta J(\theta) \]

SGD updates the parameters for each training examples.

\[ \theta = \theta - \eta\nabla_\theta J(\theta;x^{(i)};y^{(i)}) \]

Mini-batch GD performs updates for every mini-batch.

\[ \theta = \theta - \eta\nabla_\theta J(\theta;x^{(i:i+n)};y^{(i:i+n)}) \]

The Batch GD converges to a local minimum for non-convex problems and slow when the size of data is large.
The SGD updates frequently with a high variance, causing the lost function fluctuate heavily.
Mini-batch GD takes the advantage of both. Note that in training neural network models, term SGD is usually employed.
Challenges: Saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.

7.13.2. GD with momentum¶

The momentum term increase for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions.
“Ball runing down a surface”
We we gain faster convergence. $$ v_t = \gamma v_{t-1}+ \eta\nabla_\theta J(\theta) \\ \theta = \theta - v_t $$

7.13.3. Adam¶

Adagrad: adapts the learning rate
Adadelta: extension of Adagrad, more conservative
Adam: Adaptive Moment Estimation Previously, we updates for $\theta$ using the same learning rate $\eta$.
The adaptive learning rate is changing with the second-order moment of gradient.
“Ball with friction” $$ m_t = \beta_1 m_{t-1}+(1-\beta_1)g_t\\ v_t = \beta_2{t-1}+(1-\beta_2)g_t^2 $$

The $m_t,v_t$ serve as estimates for the gradient $g_t$. With some bias correction, the updates is performed as follow:

\[ \theta = \theta - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon}\hat{m}_t \]

Introduction to Data Science

7.13. Gradient descent¶

7.13.1. Batch GD & Mini-batch GD¶

7.13.2. GD with momentum¶

7.13.3. Adam¶