10.1 GPU vs CPU Architecture

This section compares how CPUs and GPUs process workloads. A CPU has a few powerful cores optimized for low-latency sequential execution; a GPU has thousands of simple cores optimized for massive parallelism and high throughput.

[Interactive simulation: an 8-core CPU (each core with its own ALU, control logic, and L1/L2 caches, plus a large shared L3 cache) races a 256-core GPU (streaming multiprocessors backed by global DRAM) through a queue of 1024 tasks. The CPU dispatches 8 tasks at a time; the GPU dispatches 256. Progress, cycle count, and pending/running/done queues update live.]
Latency vs Throughput (simulation readout)

  Metric                         CPU    GPU
  Latency (cycles per task)      2      1
  Throughput (tasks per cycle)   4.0    256.0

In this simulation the GPU's per-task latency also happens to be lower (1 cycle vs 2), but the defining difference is throughput: 8 cores finishing a task every 2 cycles yield 4.0 tasks/cycle, while 256 cores finishing a task every cycle yield 256.0 tasks/cycle, a 64.0x advantage.
[Live statistics panel: speedup factor estimated from current progress, core counts (CPU: 8, GPU: 256), tasks completed out of 1024, utilization per device, and the workload's parallel fraction (100% for this workload).]
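The parallel fraction shown above drives the speedup estimate. For a workload whose fraction p can run in parallel across N processing lanes, Amdahl's law (a standard result, not specific to this simulator) bounds the achievable speedup:

```latex
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
```

With p = 1 (fully parallel, as here), S(N) = N: speedup is limited only by the number of lanes and per-lane throughput. Any sequential fraction, however small, caps the speedup; at p = 0.9, even infinite cores give at most 10x.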
Key Concepts
CPU Architecture
  • Few cores (4-16) with complex control logic
  • Large caches (L1/L2/L3) for low memory latency
  • Out-of-order execution, branch prediction
  • Optimized for single-thread performance
  • Best for sequential, branching workloads
GPU Architecture
  • Thousands of simple cores (CUDA cores, grouped into SMs)
  • Small caches, rely on massive parallelism to hide latency
  • SIMT: Single Instruction, Multiple Threads
  • Optimized for throughput over latency
  • Best for data-parallel, regular workloads
Vector Add (1024 elements)

This workload is embarrassingly parallel: each element is computed independently, with no communication between tasks, so the GPU massively outperforms the CPU.
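A minimal CUDA sketch of this workload, assuming a CUDA-capable device and the standard runtime API (kernel and variable names are illustrative, not from the simulator):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

// SIMT: every thread executes the same instruction stream,
// each on a different element index.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // each element is independent
}

int main(void) {
    float ha[N], hb[N], hc[N];
    for (int i = 0; i < N; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, N * sizeof(float));
    cudaMalloc(&db, N * sizeof(float));
    cudaMalloc(&dc, N * sizeof(float));
    cudaMemcpy(da, ha, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, N * sizeof(float), cudaMemcpyHostToDevice);

    // 4 blocks of 256 threads cover all 1024 elements in one launch,
    // matching the simulation's "256 at a time" GPU dispatch.
    vecAdd<<<(N + 255) / 256, 256>>>(da, db, dc, N);
    cudaMemcpy(hc, dc, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[1023] = %.1f\n", hc[1023]);  // 1023 + 2046 = 3069.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```

Because no element depends on another, there is no synchronization inside the kernel; the equivalent CPU version is a plain sequential loop over all 1024 elements, which is exactly the 64x throughput gap the simulation demonstrates.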