11.7 Transformer Architecture

Step through a transformer encoder block. Watch tokens get embedded, attend to each other via multi-head attention, and pass through feed-forward layers.


Encoder Block Architecture

Input → Embed + Pos Enc → Multi-Head Attention (+ residual) → Add & Norm → Feed Forward (+ residual) → Add & Norm

Input Tokens

The[0] cat[1] sat[2] on[3] the[4] mat[5]


Model Metrics

Sequence Length: 6
Embedding Dim (d_model): 8
Key/Query Dim (d_k): 4
Attention Heads: 2
FFN Hidden Dim: 16
Total Params (approx): 480
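The parameter count can be roughly reproduced from the dimensions above. A sketch assuming no bias terms; the demo does not say exactly which projections its ~480 figure includes, so a full count comes out slightly higher:

```python
# Rough weight count for one encoder block, assuming no bias terms.
d_model, d_k, heads, d_ff = 8, 4, 2, 16

qkv = heads * 3 * d_model * d_k        # Q, K, V projections per head: 192
out_proj = heads * d_k * d_model       # concat -> d_model projection: 64
ffn = d_model * d_ff + d_ff * d_model  # two feed-forward layers: 256
total = qkv + out_proj + ffn           # 512; the demo's ~480 may omit some terms
```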


Embeddings

Each token is mapped to a dense vector. Positional encoding adds sin/cos signals so the model knows token order. These combine to form the input representation.
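This step can be sketched in NumPy. The random matrix stands in for a learned embedding table; only the sinusoidal positional-encoding formula is standard:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                            # dimensions from the demo
embeddings = rng.normal(size=(seq_len, d_model))   # stand-in for learned token embeddings
x = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because every position gets a distinct sin/cos pattern, two identical tokens (like the two "the"s) still receive different input representations.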

Self-Attention

Each token computes Query, Key, and Value vectors. Attention scores (QK^T, scaled by sqrt(d_k)) determine how much each token attends to the others. Softmax normalizes these scores into weights that sum to 1.
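A minimal single-head sketch, using random projection weights in place of trained ones:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # QK^T, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4                  # dimensions from the demo
x = rng.normal(size=(seq_len, d_model))          # token representations
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
```

Row i of `weights` says how much token i attends to every other token; the output for each position is the corresponding weighted sum of Value vectors.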

Multi-Head

Multiple attention heads run in parallel, each learning different relationship patterns. Their outputs are concatenated and projected back to the model dimension.
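The concatenate-and-project step can be sketched as follows, again with random weights standing in for trained ones:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    # Wq/Wk/Wv: (heads, d_model, d_k); Wo: (heads * d_k, d_model)
    head_outputs = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        Q, K, V = x @ wq, x @ wk, x @ wv
        weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
        head_outputs.append(weights @ V)            # each head: (seq, d_k)
    concat = np.concatenate(head_outputs, axis=-1)  # (seq, heads * d_k)
    return concat @ Wo                              # project back to (seq, d_model)

rng = np.random.default_rng(0)
seq_len, d_model, d_k, heads = 6, 8, 4, 2           # dimensions from the demo
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(heads, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(heads * d_k, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, Wo)
```

With d_k = d_model / heads, the concatenated width (heads × d_k = 8) matches d_model, so the output projection keeps the shape needed for the residual connection.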

FFN & Residuals

A two-layer feed-forward network processes each position independently. Residual connections and layer normalization around each sub-layer enable deep stacking.
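The post-norm arrangement shown in the diagram (Add & Norm after the sub-layer) can be sketched like this; the learnable scale/shift of layer norm is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)   # learnable gamma/beta omitted

def feed_forward(x, W1, b1, W2, b2):
    # Applied to each position independently: (seq, d_model) -> (seq, d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear layers

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 8, 16                # dimensions from the demo
x = rng.normal(size=(seq_len, d_model))          # sub-layer input, e.g. attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Post-norm: add the residual, then normalize
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```

The residual path means each sub-layer only has to learn a correction to its input, which is what makes stacking many such blocks trainable.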