In the code, I noticed there are restrictions on head_dim and vhead_dim. What is the reason behind these constraints?
HARDWARE ARCHITECTURE ALIGNMENT (HOPPER SPECIFICS)
Tensor Core Math: Hopper’s FP16/BF16 tensor cores reach peak throughput when operands decompose into whole MMA tiles. The constraints (576 = 4*128 + 64, 512 = 4*128) let each head dimension split across warps into full 128- and 64-wide tiles, with no padding waste.
Shared Memory Banks: Head dimensions must avoid bank conflicts during K/V cache loads. Both 576 and 512 are multiples of 32, matching Hopper’s 32-bank shared memory layout, which is critical for parallel atomic updates to the paged KV cache (block size = 64).
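For illustration, here is a minimal compile-time sketch of the alignment argument above. The tile widths and constant names are assumptions made for this example, not values taken from the actual kernel source:

```cuda
// Illustrative only: assumed tile widths, not constants from the repo.
constexpr int kHeadDim   = 576;  // K head dim (assumed MLA split: 512 latent + 64 RoPE)
constexpr int kVHeadDim  = 512;  // V head dim
constexpr int kMmaTileN  = 128;  // assumed tensor-core tile width
constexpr int kMmaTileK  = 64;   // assumed minimum MMA fragment width
constexpr int kSmemBanks = 32;   // shared-memory banks per SM on Hopper

// 576 = 4*128 + 64 and 512 = 4*128: both split into whole tiles, no padding.
static_assert((kHeadDim - kMmaTileK) % kMmaTileN == 0, "head_dim does not tile cleanly");
static_assert(kVHeadDim % kMmaTileN == 0, "vhead_dim does not tile cleanly");

// Multiples of 32 keep per-row strides aligned to the 32-bank layout.
static_assert(kHeadDim  % kSmemBanks == 0, "head_dim not bank-aligned");
static_assert(kVHeadDim % kSmemBanks == 0, "vhead_dim not bank-aligned");

int main() { return 0; }
```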
MEMORY COHERENCE FOR PAGED KVCACHE
Block Size = 64: Each KV cache block holds 64 tokens. With head_dim=576:
576 elements per token → 64 tokens/block → 36,864 elements/block
36,864 elements * 2 bytes (bf16) = 73,728 bytes (72 KiB) → fits comfortably in a Hopper SM’s on-chip shared memory (up to 228 KB), with room left for block metadata.
Deviating from 576/512 would fragment blocks across cache lines, thrashing the L2 and TLB during page-table walks.
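A tiny host-side check that reproduces the block-size arithmetic above (the variable names are illustrative; the block size and dtype follow the numbers in this thread):

```cuda
#include <cstdio>

int main() {
    const long block_tokens   = 64;   // tokens per paged-KV block
    const long head_dim       = 576;  // elements per token for one KV head
    const long bytes_per_elem = 2;    // bf16

    const long elems = block_tokens * head_dim;   // 64 * 576 = 36,864
    const long bytes = elems * bytes_per_elem;    // 73,728 bytes = 72 KiB

    std::printf("elements/block = %ld, bytes/block = %ld (%.1f KiB)\n",
                elems, bytes, bytes / 1024.0);
    return 0;
}
```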
WARP-LEVEL OPTIMIZATIONS: Each warp (32 threads) handles 16 query vectors (h_q // h_kv = 16 is a common GQA ratio). For head_dim = 576: 576 elements/vector ÷ 32 threads = 18 elements/thread → no remainder → coalesced loads via PTX ldmatrix.sync.aligned.
Arbitrary dimensions would force partial warps or wasted cycles on padding elements.
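As a sketch of that per-warp split (not the actual kernel; the indexing and function name here are assumptions), each lane covers exactly 576 / 32 = 18 elements with lane-contiguous addressing, so every access by the 32 lanes is coalesced:

```cuda
#include <cuda_bf16.h>

// Illustrative kernel: one warp cooperatively reduces a 576-element K vector.
// Lane i touches elements i, i+32, i+64, ..., so consecutive lanes hit
// consecutive addresses (coalesced) and each lane handles exactly 18 elements.
__global__ void warp_reduce_head_vector(const __nv_bfloat16* __restrict__ k_vec,
                                        float* __restrict__ out) {
    constexpr int kHeadDim      = 576;
    constexpr int kWarpSize     = 32;
    constexpr int kElemsPerLane = kHeadDim / kWarpSize;  // 18, no remainder

    const int lane = threadIdx.x % kWarpSize;

    float acc = 0.f;
    #pragma unroll
    for (int i = 0; i < kElemsPerLane; ++i) {
        // Lane-contiguous index: lanes 0..31 read consecutive addresses.
        acc += __bfloat162float(k_vec[i * kWarpSize + lane]);
    }

    // Warp-level sum as a stand-in for the real per-lane math.
    for (int offset = kWarpSize / 2; offset > 0; offset /= 2) {
        acc += __shfl_down_sync(0xffffffff, acc, offset);
    }
    if (lane == 0) {
        out[0] = acc;
    }
}
```

An odd head_dim (say 600) would leave 600 % 32 = 24 trailing elements, forcing either a tail loop or padding, which is exactly the partial-warp / wasted-cycle cost described above.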