11.3 Gradient Descent Variants

Watch different optimizers navigate a loss landscape. Compare how SGD, Momentum, RMSProp, and Adam handle hills, valleys, and saddle points.

[Interactive demo: each selected optimizer is animated on a 2D loss landscape (x, y ∈ [-5, 5], shaded from low to high loss) while a Loss Over Time chart tracks its progress. Controls include the learning rate (default 0.05), playback speed, and landscape selection (e.g. Simple Bowl); a summary panel reports total steps and how many optimizers have converged.]

How It Works

Update Rules

SGD: w = w - lr * g
Momentum: v = beta*v + lr*g; w = w - v
RMSProp: s = rho*s + (1-rho)*g^2; w = w - lr*g/sqrt(s + eps)
Adam: m = b1*m + (1-b1)*g; v = b2*v + (1-b2)*g^2; m_hat = m/(1-b1^t); v_hat = v/(1-b2^t); w = w - lr*m_hat/(sqrt(v_hat) + eps)
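The update rules above can be sketched as a single runnable comparison. This is a minimal sketch, not the demo's actual code: it assumes a quadratic "narrow valley" landscape f(x, y) = x² + 10y² (the demo's landscapes are not specified beyond "Simple Bowl"), and reuses the demo's defaults of start point (3.5, 3.5) and lr = 0.05.

```python
import math

# Assumed landscape: f(x, y) = x^2 + 10*y^2, so the valley is 10x steeper in y.
def loss(x, y):
    return x**2 + 10 * y**2

def grad(x, y):
    return 2 * x, 20 * y

LR = 0.05  # demo default learning rate

def sgd(x, y, gx, gy, state, t):
    return x - LR * gx, y - LR * gy

def momentum(x, y, gx, gy, state, t, beta=0.9):
    vx, vy = state.get("v", (0.0, 0.0))
    vx, vy = beta * vx + LR * gx, beta * vy + LR * gy
    state["v"] = (vx, vy)
    return x - vx, y - vy

def rmsprop(x, y, gx, gy, state, t, rho=0.9, eps=1e-8):
    sx, sy = state.get("s", (0.0, 0.0))
    sx, sy = rho * sx + (1 - rho) * gx**2, rho * sy + (1 - rho) * gy**2
    state["s"] = (sx, sy)
    return (x - LR * gx / math.sqrt(sx + eps),
            y - LR * gy / math.sqrt(sy + eps))

def adam(x, y, gx, gy, state, t, b1=0.9, b2=0.999, eps=1e-8):
    mx, my = state.get("m", (0.0, 0.0))
    vx, vy = state.get("v", (0.0, 0.0))
    mx, my = b1 * mx + (1 - b1) * gx, b1 * my + (1 - b1) * gy
    vx, vy = b2 * vx + (1 - b2) * gx**2, b2 * vy + (1 - b2) * gy**2
    state["m"], state["v"] = (mx, my), (vx, vy)
    # Bias-corrected moment estimates.
    mhx, mhy = mx / (1 - b1**t), my / (1 - b1**t)
    vhx, vhy = vx / (1 - b2**t), vy / (1 - b2**t)
    return (x - LR * mhx / (math.sqrt(vhx) + eps),
            y - LR * mhy / (math.sqrt(vhy) + eps))

def run(update, steps=200):
    x, y, state = 3.5, 3.5, {}  # demo default start position
    history = [loss(x, y)]
    for t in range(1, steps + 1):
        gx, gy = grad(x, y)
        x, y = update(x, y, gx, gy, state, t)
        history.append(loss(x, y))
    return history

results = {name: run(fn) for name, fn in
           [("SGD", sgd), ("Momentum", momentum),
            ("RMSProp", rmsprop), ("Adam", adam)]}
for name, hist in results.items():
    print(f"{name:9s} start={hist[0]:.2f} final={hist[-1]:.6f}")
```

Note the per-run `state` dict: each optimizer's auxiliary quantities (velocity, squared-gradient average, moment estimates) live there, which is exactly what distinguishes the stateful variants from plain SGD.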

Start Position: x = 3.5, y = 3.5

SGD

Stochastic Gradient Descent steps each parameter along the negative gradient, scaled by the learning rate. Simple, but it can oscillate back and forth across narrow valleys and stall near saddle points, where the gradient is close to zero.
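The oscillation can be reproduced in a few lines. In this sketch (an assumed valley f(x, y) = x² + 10y², with lr = 0.095 chosen deliberately large), the y-update multiplies y by (1 − 0.095·20) = −0.9, so every step overshoots the valley floor and lands on the opposite wall:

```python
# Assumed narrow valley: f(x, y) = x^2 + 10*y^2 (curvature 10x steeper in y).
lr = 0.095  # deliberately large relative to the y-curvature
x, y = 3.5, 3.5
ys = []
for _ in range(6):
    gx, gy = 2 * x, 20 * y
    x, y = x - lr * gx, y - lr * gy
    ys.append(y)
print([round(v, 3) for v in ys])  # signs alternate: the path zigzags across the valley
```

With a smaller learning rate the zigzag disappears, but then progress along the shallow x-direction becomes very slow; that tension is what the stateful optimizers below try to resolve.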

Momentum

Adds a velocity term that accumulates past gradients. This smooths oscillations and helps accelerate through flat regions and narrow valleys.
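The acceleration effect is easy to see on a constant slope, a stand-in for a flat region with a small, consistent gradient. In this toy sketch (assumed gradient g = 1.0, beta = 0.9), the velocity compounds toward lr·g/(1 − beta) per step, about 10× the plain SGD step:

```python
# Constant gradient g = 1.0: SGD moves lr per step, while momentum's
# velocity v grows toward lr * g / (1 - beta), i.e. 10x with beta = 0.9.
lr, beta, g = 0.05, 0.9, 1.0
x_sgd = x_mom = 0.0
v = 0.0
for _ in range(100):
    x_sgd -= lr * g
    v = beta * v + lr * g  # velocity accumulates past gradients
    x_mom -= v
print(abs(x_sgd), abs(x_mom))  # momentum covers far more ground
```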

RMSProp

Adapts the learning rate for each parameter using a running average of squared gradients. Parameters with large gradients get smaller effective learning rates.

Adam

Combines momentum (first moment) and RMSProp (second moment) with bias correction. The most widely used optimizer in modern deep learning.