Step through a transformer encoder block. Watch tokens get embedded, attend to each other via multi-head attention, and pass through feed-forward layers.
Each token is mapped to a dense vector. Positional encoding adds sin/cos signals so the model knows token order. These are summed to form the input representation.
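The embedding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the lookup table is random, and the vocabulary size, model dimension, and token ids are made up for the example. The sinusoidal encoding follows the standard sin/cos scheme at geometrically decreasing frequencies.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Even feature dims get sin, odd dims get cos, so each position
    # has a unique, smoothly varying signature.
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy embedding table (random, untrained) for a 100-word vocabulary.
rng = np.random.default_rng(0)
vocab, d_model = 100, 16
embed_table = rng.normal(size=(vocab, d_model))

# Input representation = token embedding + positional encoding.
token_ids = np.array([4, 17, 42])
x = embed_table[token_ids] + positional_encoding(len(token_ids), d_model)
```

Note that the two signals are simply added elementwise, so both live in the same `d_model`-dimensional space.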
Each token computes Query, Key, and Value vectors. Attention scores (QK^T, scaled by √d_k) determine how much each token attends to others. Softmax normalizes these into weights, which then take a weighted sum of the Value vectors.
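Scaled dot-product attention can be written compactly in NumPy. A sketch with random Q/K/V matrices standing in for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq, seq) raw attention scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # weighted sum of Value vectors

# Random stand-ins for the projected Q, K, V of a 4-token sequence.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
```

The √d_k divisor keeps the dot products from growing with dimension, which would otherwise push softmax into a near-one-hot regime with vanishing gradients.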
Multiple attention heads run in parallel, each learning different relationship patterns. Their outputs are concatenated and projected back to the model dimension.
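One common way to implement the multi-head step is to project once to the full model dimension and let each head attend over its own slice; a sketch under that assumption, with random untrained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Each head attends independently over its slice of Q/K/V.
    heads = [attention(Q[:, h*d_head:(h+1)*d_head],
                       K[:, h*d_head:(h+1)*d_head],
                       V[:, h*d_head:(h+1)*d_head])
             for h in range(num_heads)]
    # Concatenate head outputs, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
```

Because each head sees only a `d_model / num_heads` slice, the total cost stays comparable to a single full-width head while the heads are free to specialize.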
A two-layer feed-forward network processes each position independently. Residual connections and layer normalization around each sub-layer enable deep stacking.
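The feed-forward sub-layer with its residual connection and layer normalization can be sketched the same way. This assumes the post-norm arrangement, LayerNorm(x + Sublayer(x)), with a ReLU between the two linear layers; weights are random and the learnable scale/shift of layer norm is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer MLP applied to every token.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU in between

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 16, 64
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual + layer norm around the sub-layer: LayerNorm(x + FFN(x)).
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```

The residual path lets each sub-layer learn a correction to its input rather than a full transformation, which is what makes stacking many such blocks trainable.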