Watch different optimizers navigate a loss landscape. Compare how SGD, Momentum, RMSProp, and Adam handle hills, valleys, and saddle points.
Stochastic Gradient Descent (SGD) updates each parameter in direct proportion to the negative gradient. Simple, but it can oscillate across narrow valleys and stall at saddle points.
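A minimal NumPy sketch of the plain SGD update; the learning rate and the toy valley-shaped objective are illustrative choices, not values from the demo:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    # Plain SGD: move directly against the gradient, scaled by the learning rate.
    return theta - lr * grad

# One step on f(x, y) = x^2 + 10*y^2, a narrow valley along x.
theta = np.array([1.0, 1.0])
grad = np.array([2 * theta[0], 20 * theta[1]])  # analytic gradient of f
theta = sgd_step(theta, grad)
# The steep y direction overshoots past the minimum (y goes from 1.0 to -1.0),
# which is exactly the oscillation described above.
```

Because the step is strictly proportional to the gradient, the steep coordinate of a narrow valley overshoots while the shallow coordinate barely moves.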
Momentum adds a velocity term that accumulates past gradients. This smooths oscillations and helps accelerate through flat regions and narrow valleys.
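A sketch of the classic heavy-ball formulation, assuming the common decay-and-accumulate update (variants of momentum differ in details; the constants here are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    # Velocity is a decaying accumulation of past gradients.
    v = beta * v + grad
    return theta - lr * v, v

# Two steps on f(x) = x^2: velocity builds up, so the second step is larger.
theta, v = np.array([1.0]), np.zeros(1)
theta, v = momentum_step(theta, v, 2 * theta)  # grad of x^2 is 2x
step1 = 1.0 - theta[0]
prev = theta[0]
theta, v = momentum_step(theta, v, 2 * theta)
step2 = prev - theta[0]
```

When consecutive gradients point the same way the velocity grows, which is what accelerates progress along flat or gently sloped directions; when they alternate sign, the accumulation cancels, which is what damps oscillation.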
RMSProp adapts the learning rate for each parameter using a running average of squared gradients. Parameters with large gradients get smaller effective learning rates.
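A sketch of the per-parameter scaling, assuming the standard exponential moving average of squared gradients (the decay rate, epsilon, and test objective are illustrative):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.1, beta=0.9, eps=1e-8):
    # Running average of squared gradients; dividing by its square root
    # shrinks the step for coordinates with consistently large gradients.
    s = beta * s + (1 - beta) * grad**2
    return theta - lr * grad / (np.sqrt(s) + eps), s

# On f(x, y) = x^2 + 10*y^2 the raw gradients differ by 10x,
# but the per-coordinate scaling nearly equalizes the step sizes.
theta, s = np.array([1.0, 1.0]), np.zeros(2)
grad = np.array([2.0, 20.0])
theta, s = rmsprop_step(theta, s, grad)
```

This is why RMSProp handles elongated valleys better than plain SGD: the steep direction is automatically reined in while the shallow one keeps a usable step size.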
Adam combines momentum (a first-moment estimate) with RMSProp-style scaling (a second-moment estimate), plus bias correction for the zero-initialized moments. It is the most widely used optimizer in modern deep learning.
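A sketch of one Adam step using the standard published defaults (lr=0.001, betas 0.9/0.999); the test gradient is an illustrative choice:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # First moment: momentum-style average of gradients.
    m = b1 * m + (1 - b1) * grad
    # Second moment: RMSProp-style average of squared gradients.
    v = b2 * v + (1 - b2) * grad**2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# On the first step the update magnitude is ~lr per coordinate,
# regardless of gradient scale, thanks to the bias correction.
theta, m, v = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, m, v, np.array([2.0, 200.0]), t=1)
```

Both moments work together: the first smooths the direction like momentum, the second scales the step per parameter like RMSProp, and bias correction keeps early steps from being artificially small.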