ThunderKittens - Fading Coder

ThunderKittens: A Minimal CUDA DSL for 30% H100 Performance Gain Over FlashAttention-2

AI’s rapid advancement brings massive computational demands, making it important to shrink AI’s compute footprint and use existing hardware more efficiently. Stanford researchers addressed this challenge by developing ThunderKittens, a compact CUDA-embedded DSL for writing high-performance deep learning...

Copyright © fadingcoder.top