11.3 Gradient Descent Variants

Watch different optimizers navigate a loss landscape. Compare how SGD, Momentum, RMSProp, and Adam handle hills, valleys, and saddle points.

[Interactive demo: each selected optimizer is animated on a 2D loss landscape (x, y ∈ [-5, 5], shaded from low to high loss) while a Loss Over Time chart tracks its progress. Controls include the learning rate (default 0.05), playback speed, and landscape selection (e.g. Simple Bowl); a summary panel reports total steps and how many optimizers have converged.]

How It Works

Update Rules

SGD: w = w - lr * g
Momentum: v = beta*v + lr*g; w = w - v
RMSProp: s = rho*s + (1-rho)*g^2; w = w - lr*g/sqrt(s + eps)
Adam: m = b1*m + (1-b1)*g; v = b2*v + (1-b2)*g^2; m_hat = m/(1-b1^t); v_hat = v/(1-b2^t); w = w - lr*m_hat/(sqrt(v_hat) + eps)
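The update rules above can be sketched as a single runnable comparison. This is a minimal sketch, not the demo's actual code: it assumes a quadratic "narrow valley" landscape f(x, y) = x² + 10y² (the demo's landscapes are not specified beyond "Simple Bowl"), and reuses the demo's defaults of start point (3.5, 3.5) and lr = 0.05.

```python
import math

# Assumed landscape: f(x, y) = x^2 + 10*y^2, so the valley is 10x steeper in y.
def loss(x, y):
    return x**2 + 10 * y**2

def grad(x, y):
    return 2 * x, 20 * y

LR = 0.05  # demo default learning rate

def sgd(x, y, gx, gy, state, t):
    return x - LR * gx, y - LR * gy

def momentum(x, y, gx, gy, state, t, beta=0.9):
    vx, vy = state.get("v", (0.0, 0.0))
    vx, vy = beta * vx + LR * gx, beta * vy + LR * gy
    state["v"] = (vx, vy)
    return x - vx, y - vy

def rmsprop(x, y, gx, gy, state, t, rho=0.9, eps=1e-8):
    sx, sy = state.get("s", (0.0, 0.0))
    sx, sy = rho * sx + (1 - rho) * gx**2, rho * sy + (1 - rho) * gy**2
    state["s"] = (sx, sy)
    return (x - LR * gx / math.sqrt(sx + eps),
            y - LR * gy / math.sqrt(sy + eps))

def adam(x, y, gx, gy, state, t, b1=0.9, b2=0.999, eps=1e-8):
    mx, my = state.get("m", (0.0, 0.0))
    vx, vy = state.get("v", (0.0, 0.0))
    mx, my = b1 * mx + (1 - b1) * gx, b1 * my + (1 - b1) * gy
    vx, vy = b2 * vx + (1 - b2) * gx**2, b2 * vy + (1 - b2) * gy**2
    state["m"], state["v"] = (mx, my), (vx, vy)
    # Bias-corrected moment estimates.
    mhx, mhy = mx / (1 - b1**t), my / (1 - b1**t)
    vhx, vhy = vx / (1 - b2**t), vy / (1 - b2**t)
    return (x - LR * mhx / (math.sqrt(vhx) + eps),
            y - LR * mhy / (math.sqrt(vhy) + eps))

def run(update, steps=200):
    x, y, state = 3.5, 3.5, {}  # demo default start position
    history = [loss(x, y)]
    for t in range(1, steps + 1):
        gx, gy = grad(x, y)
        x, y = update(x, y, gx, gy, state, t)
        history.append(loss(x, y))
    return history

results = {name: run(fn) for name, fn in
           [("SGD", sgd), ("Momentum", momentum),
            ("RMSProp", rmsprop), ("Adam", adam)]}
for name, hist in results.items():
    print(f"{name:9s} start={hist[0]:.2f} final={hist[-1]:.6f}")
```

Note the per-run `state` dict: each optimizer's auxiliary quantities (velocity, squared-gradient average, moment estimates) live there, which is exactly what distinguishes the stateful variants from plain SGD.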

Start Position: x = 3.5, y = 3.5

SGD

Stochastic Gradient Descent steps each parameter along the negative gradient, scaled by the learning rate. Simple, but it can oscillate back and forth across narrow valleys and stall near saddle points, where the gradient is close to zero.
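The oscillation can be reproduced in a few lines. In this sketch (an assumed valley f(x, y) = x² + 10y², with lr = 0.095 chosen deliberately large), the y-update multiplies y by (1 − 0.095·20) = −0.9, so every step overshoots the valley floor and lands on the opposite wall:

```python
# Assumed narrow valley: f(x, y) = x^2 + 10*y^2 (curvature 10x steeper in y).
lr = 0.095  # deliberately large relative to the y-curvature
x, y = 3.5, 3.5
ys = []
for _ in range(6):
    gx, gy = 2 * x, 20 * y
    x, y = x - lr * gx, y - lr * gy
    ys.append(y)
print([round(v, 3) for v in ys])  # signs alternate: the path zigzags across the valley
```

With a smaller learning rate the zigzag disappears, but then progress along the shallow x-direction becomes very slow; that tension is what the stateful optimizers below try to resolve.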

Momentum

Adds a velocity term that accumulates past gradients. This smooths oscillations and helps accelerate through flat regions and narrow valleys.
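The acceleration effect is easy to see on a constant slope, a stand-in for a flat region with a small, consistent gradient. In this toy sketch (assumed gradient g = 1.0, beta = 0.9), the velocity compounds toward lr·g/(1 − beta) per step, about 10× the plain SGD step:

```python
# Constant gradient g = 1.0: SGD moves lr per step, while momentum's
# velocity v grows toward lr * g / (1 - beta), i.e. 10x with beta = 0.9.
lr, beta, g = 0.05, 0.9, 1.0
x_sgd = x_mom = 0.0
v = 0.0
for _ in range(100):
    x_sgd -= lr * g
    v = beta * v + lr * g  # velocity accumulates past gradients
    x_mom -= v
print(abs(x_sgd), abs(x_mom))  # momentum covers far more ground
```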

RMSProp

Adapts the learning rate for each parameter using a running average of squared gradients. Parameters with large gradients get smaller effective learning rates.

Adam

Combines momentum (first moment) and RMSProp (second moment) with bias correction. The most widely used optimizer in modern deep learning.