10.1 GPU vs CPU Architecture

This section compares how CPUs and GPUs process workloads. A CPU has a few powerful cores optimized for low-latency sequential execution; a GPU has thousands of simple cores optimized for massive parallelism and high throughput.

[Interactive simulation: an 8-core CPU (each core with its own ALU, control logic, and L1/L2 caches, plus a large shared L3 cache) races a 256-core GPU (streaming multiprocessors backed by global DRAM) through a queue of 1024 tasks. The CPU dispatches 8 tasks at a time; the GPU dispatches 256. Progress, cycle count, and pending/running/done queues update live.]
Latency vs Throughput (simulation readout)

  Metric                         CPU    GPU
  Latency (cycles per task)      2      1
  Throughput (tasks per cycle)   4.0    256.0

In this simulation the GPU's per-task latency also happens to be lower (1 cycle vs 2), but the defining difference is throughput: 8 cores finishing a task every 2 cycles yield 4.0 tasks/cycle, while 256 cores finishing a task every cycle yield 256.0 tasks/cycle, a 64.0x advantage.
[Live statistics panel: speedup factor estimated from current progress, core counts (CPU: 8, GPU: 256), tasks completed out of 1024, utilization per device, and the workload's parallel fraction (100% for this workload).]
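The parallel fraction shown above drives the speedup estimate. For a workload whose fraction p can run in parallel across N processing lanes, Amdahl's law (a standard result, not specific to this simulator) bounds the achievable speedup:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
```

With p = 1 (fully parallel, as here), S(N) = N: speedup is limited only by the number of lanes and per-lane throughput. Any sequential fraction, however small, caps the speedup; at p = 0.9, even infinite cores give at most 10x.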
Key Concepts
CPU Architecture
  • Few cores (4-16) with complex control logic
  • Large caches (L1/L2/L3) for low memory latency
  • Out-of-order execution, branch prediction
  • Optimized for single-thread performance
  • Best for sequential, branching workloads
GPU Architecture
  • Thousands of simple cores (CUDA cores, grouped into SMs)
  • Small caches, rely on massive parallelism to hide latency
  • SIMT: Single Instruction, Multiple Threads
  • Optimized for throughput over latency
  • Best for data-parallel, regular workloads
Vector Add (1024 elements)

This workload is embarrassingly parallel: each element is computed independently, with no communication between tasks, so the GPU massively outperforms the CPU.
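A minimal CUDA sketch of this workload, assuming a CUDA-capable device and the standard runtime API (kernel and variable names are illustrative, not from the simulator):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

// SIMT: every thread executes the same instruction stream,
// each on a different element index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // each element is independent
}

int main(void) {
    float ha[N], hb[N], hc[N];
    for (int i = 0; i < N; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, N * sizeof(float));
    cudaMalloc(&db, N * sizeof(float));
    cudaMalloc(&dc, N * sizeof(float));
    cudaMemcpy(da, ha, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, N * sizeof(float), cudaMemcpyHostToDevice);

    // 4 blocks of 256 threads cover all 1024 elements in one launch,
    // matching the simulation's "256 at a time" GPU dispatch.
    vecAdd<<<(N + 255) / 256, 256>>>(da, db, dc, N);
    cudaMemcpy(hc, dc, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[1023] = %.1f\n", hc[1023]);  // 1023 + 2046 = 3069.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Because no element depends on another, there is no synchronization inside the kernel; the equivalent CPU version is a plain sequential loop over all 1024 elements, which is exactly the 64x throughput gap the simulation demonstrates.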