Step through a transformer encoder block. Watch tokens get embedded, attend to each other via multi-head attention, and pass through feed-forward layers.
Each token is mapped to a dense vector. Positional encoding adds sin/cos signals so the model knows token order. These are summed to form the input representation.
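The embedding step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model: the lookup table is random, and the vocabulary size, model dimension, and token ids are made up for the example. The sinusoidal encoding follows the standard sin/cos scheme at geometrically decreasing frequencies.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Even feature dims get sin, odd dims get cos, so each position
    # has a unique, smoothly varying signature.
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy embedding table (random, untrained) for a 100-word vocabulary.
rng = np.random.default_rng(0)
vocab, d_model = 100, 16
embed_table = rng.normal(size=(vocab, d_model))

# Input representation = token embedding + positional encoding.
token_ids = np.array([4, 17, 42])
x = embed_table[token_ids] + positional_encoding(len(token_ids), d_model)
```

Note that the two signals are simply added elementwise, so both live in the same `d_model`-dimensional space.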
Each token computes Query, Key, and Value vectors. Attention scores (QK^T, scaled by √d_k) determine how much each token attends to others. Softmax normalizes these into weights, which then take a weighted sum of the Value vectors.
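Scaled dot-product attention can be written compactly in NumPy. A sketch with random Q/K/V matrices standing in for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq, seq) raw attention scores
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V, weights           # weighted sum of Value vectors

# Random stand-ins for the projected Q, K, V of a 4-token sequence.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
```

The √d_k divisor keeps the dot products from growing with dimension, which would otherwise push softmax into a near-one-hot regime with vanishing gradients.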
Multiple attention heads run in parallel, each learning different relationship patterns. Their outputs are concatenated and projected back to the model dimension.
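One common way to implement the multi-head step is to project once to the full model dimension and let each head attend over its own slice; a sketch under that assumption, with random untrained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Each head attends independently over its slice of Q/K/V.
    heads = [attention(Q[:, h*d_head:(h+1)*d_head],
                       K[:, h*d_head:(h+1)*d_head],
                       V[:, h*d_head:(h+1)*d_head])
             for h in range(num_heads)]
    # Concatenate head outputs, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 16, 4
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
```

Because each head sees only a `d_model / num_heads` slice, the total cost stays comparable to a single full-width head while the heads are free to specialize.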
A two-layer feed-forward network processes each position independently. Residual connections and layer normalization around each sub-layer enable deep stacking.
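The feed-forward sub-layer with its residual connection and layer normalization can be sketched the same way. This assumes the post-norm arrangement, LayerNorm(x + Sublayer(x)), with a ReLU between the two linear layers; weights are random and the learnable scale/shift of layer norm is omitted for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's features to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer MLP applied to every token.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2  # ReLU in between

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 16, 64
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

# Residual + layer norm around the sub-layer: LayerNorm(x + FFN(x)).
out = layer_norm(x + feed_forward(x, W1, b1, W2, b2))
```

The residual path lets each sub-layer learn a correction to its input rather than a full transformation, which is what makes stacking many such blocks trainable.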