10.3 Warp Execution

Visualize how GPU warps of 32 threads execute instructions in lockstep (SIMT), handle branch divergence with active masks, and get scheduled by the warp scheduler.
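To make the lockstep model concrete, here is a minimal Python sketch (purely illustrative, not tied to any real GPU API): one warp shares a single program counter, and each instruction applies only to the lanes enabled in the active mask.

```python
# Minimal SIMT model: one warp = 32 lanes sharing one program counter.
# An instruction applies to every lane whose entry in the active mask is set.
WARP_SIZE = 32

def execute(regs, mask, op):
    """Apply op to each lane's register value, but only for active lanes."""
    return [op(v) if mask[i] else v for i, v in enumerate(regs)]

x = list(range(WARP_SIZE))       # per-lane register: x = tid
full_mask = [True] * WARP_SIZE   # all 32 lanes active (no divergence)
x = execute(x, full_mask, lambda v: v * 2.0)   # one MUL, issued once, for all lanes
print(x[0], x[31])               # 0.0 62.0
```

One instruction issue drives all 32 lanes; that single-issue, many-lane structure is what "lockstep" means here.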

[Interactive visualization: Warp 0 lane grid — 32 lanes (0–31), 32/32 active; active mask 11111111111111111111111111111111; legend: Active / Masked / Idle]
Instruction Stream (Warp 0) — active lanes per instruction:

  0  LOAD  x[tid]    32/32
  1  MUL   x, 2.0    32/32
  2  ADD   x, bias   32/32
  3  STORE y[tid]    32/32
  4  LOAD  z[tid]    32/32
  5  ADD   y, z      32/32
  6  STORE out[tid]  32/32
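The seven-instruction stream above computes y = 2*x + bias followed by out = y + z, one lane per thread. A per-lane Python sketch (the arrays and the bias value are illustrative stand-ins for device memory):

```python
# Per-lane data for one warp; x, z, and bias are illustrative values.
WARP_SIZE = 32
bias = 1.5
x = [float(tid) for tid in range(WARP_SIZE)]   # 0: LOAD x[tid]
x = [v * 2.0 for v in x]                       # 1: MUL x, 2.0
x = [v + bias for v in x]                      # 2: ADD x, bias
y = list(x)                                    # 3: STORE y[tid]
z = [10.0] * WARP_SIZE                         # 4: LOAD z[tid]
y = [a + b for a, b in zip(y, z)]              # 5: ADD y, z
out = list(y)                                  # 6: STORE out[tid]
print(out[3])   # 2*3 + 1.5 + 10 = 17.5
```

Each list comprehension stands in for one instruction issued once and executed by all 32 lanes.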
[Interactive visualization: execution timeline and warp scheduler at cycle 0 — Warp 0 active, 0/7 instructions executed, 0 warp switches, 0 divergences; SM occupancy: 1/4 warps loaded, 25% (low — stalls cannot be hidden); 32/32 lanes active, 100% lane utilization]
Key Concepts
SIMT Execution
  • 32 threads in a warp execute in lockstep
  • All active lanes run the same instruction
  • Maximum efficiency when all 32 lanes are active
Branch Divergence
  • If/else causes some lanes to be masked off
  • Both paths executed serially, not in parallel
  • Reconverge after the branch completes
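A simplified model of that serialization (this sketch ignores the hardware's actual reconvergence-stack mechanics): the branch condition splits the active mask, the two sides run one after the other, and the full mask is restored afterward.

```python
# Simplified divergence model: an if/else over tid splits the active mask.
WARP_SIZE = 32
x = [0] * WARP_SIZE
cond = [tid < 16 for tid in range(WARP_SIZE)]   # if (tid < 16)

then_mask = cond
else_mask = [not c for c in cond]

# The "then" path issues first, with half the lanes masked off...
x = [1 if then_mask[i] else v for i, v in enumerate(x)]
# ...then the "else" path issues, also at half utilization.
x = [2 if else_mask[i] else v for i, v in enumerate(x)]
# Reconverged: all 32 lanes are active again from here on.
print(x[:2], x[-2:])   # lanes 0-15 hold 1, lanes 16-31 hold 2
```

Both paths consume issue slots even though each runs at 16/32 lane utilization, which is why divergent branches cost roughly the sum of both sides.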
Warp Scheduling
  • SM schedules warps to hide memory latency
  • Stalled warps yield to eligible ones
  • Higher occupancy = better latency hiding
No Divergence

All 32 lanes execute the same path. Maximum SIMT efficiency with full lane utilization.
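Lane utilization in this scenario can be computed directly: active-lane slots divided by total issued slots. The helper below is a hypothetical illustration, not part of any profiler API.

```python
# Lane utilization = sum of active lanes per issued instruction,
# divided by total lane slots (instructions * 32).
WARP_SIZE = 32

def utilization(active_counts):
    """active_counts: number of active lanes for each issued instruction."""
    return sum(active_counts) / (len(active_counts) * WARP_SIZE)

print(utilization([32] * 7))   # no divergence: 1.0 (100%)
print(utilization([16, 16]))   # 16/16 if-else split: 0.5 (50%)
```

The no-divergence stream above scores 100%; a fully divergent 16/16 branch, which must issue both paths, drops to 50%.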