2.10 CPU Architecture

SIMD / Vector Processing Visualizer

Compare scalar (one element at a time) vs SIMD (multiple elements in one instruction). See the throughput advantage of SSE, AVX, and AVX-512 vector widths.

[Interactive visualizer: controls select the operation, mode (scalar vs. SIMD), vector width, and animation speed; readouts track cycles used, elements done (out of 16), and speedup. The "SIMD Processing (4-wide)" panel shows two 16-element input arrays being combined four lanes at a time by ADD operations into a result array.]

Throughput Comparison

Scalar: 16 cycles
SSE (4-wide): 4 cycles (4x speedup)
AVX (8-wide): 2 cycles (8x speedup)
AVX-512 (16-wide): 1 cycle (16x speedup)

Scalar: Processes one element per clock cycle. Simple but slow for data-parallel operations.

SIMD: Single Instruction, Multiple Data — processes 4/8/16 elements simultaneously using wide vector registers.

Real-world: SSE (128-bit, 4 floats), AVX2 (256-bit, 8 floats), AVX-512 (512-bit, 16 floats). Used in image processing, scientific computing, and ML inference.