
benchmarks: add MFU% column to benchmark output #2377

Merged
tridao merged 1 commit into Dao-AILab:main from Johnsonms:add-mfu-benchmark on Mar 22, 2026

Conversation

Collaborator

@Johnsonms Johnsonms commented Mar 20, 2026

Adds get_peak_flops() for known NVIDIA GPUs (B300, B200, H200, H100, A100, etc.) and shows ms/TFLOPS/MFU% per cell in the benchmark table.
H100 Before:
[screenshot]

H100 After:
[screenshot]

B200 Before:
[screenshot]

B200 After:
[screenshot]

B300 Before:
[screenshot]

B300 After:
[screenshot]

- Add MFU% column to benchmark output
- Add dtype parameter to get_peak_flops to correctly scale peak FLOPS
  for FP8 (2x), FP32 (0.5x), and FP16/BF16 (1x, identical throughput)
- Fix H200 (989 TFLOPS) and H20 (148 TFLOPS) values
- Add H100 NVL (835 TFLOPS), L40S (362 TFLOPS), B300 (3.5 PFLOPS),
  GB200/GB300 (2.5 PFLOPS) entries
- Add source URLs and sparsity notes from NVIDIA datasheets
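A minimal sketch of the approach described above: a `get_peak_flops()` lookup keyed on the GPU name, a dtype multiplier (FP8 2x, FP32 0.5x, FP16/BF16 1x), and the MFU% formula used per table cell. The function signatures, the dict layout, and the A100/B200 entries are illustrative assumptions, not the exact code merged in this PR; the H200, H20, and B300 numbers are the dense (non-sparsity) values stated in the PR description.

```python
# Dense (non-sparsity) BF16/FP16 tensor-core peak, in FLOPS.
# H200/H20/B300 values are from the PR text; A100/B200 are assumed here.
# Note: longer names are listed before their prefixes ("H200" before "H20")
# so that substring matching below picks the right entry.
_PEAK_FLOPS_BF16 = {
    "H100": 989e12,
    "H200": 989e12,   # fixed in this PR
    "H20": 148e12,    # fixed in this PR
    "A100": 312e12,
    "B200": 2.25e15,
    "B300": 3.5e15,   # added in this PR
}

def get_peak_flops(device_name: str, dtype: str = "bf16") -> float:
    """Peak dense FLOPS for a known GPU, scaled by dtype throughput."""
    base = next(v for k, v in _PEAK_FLOPS_BF16.items() if k in device_name)
    if dtype in ("fp16", "bf16"):
        return base          # 1x: FP16 and BF16 have identical throughput
    if dtype == "fp8":
        return base * 2.0    # FP8 runs at 2x the BF16 rate
    if dtype == "fp32":
        return base * 0.5    # FP32 runs at 0.5x the BF16 rate
    raise ValueError(f"unsupported dtype: {dtype}")

def mfu_percent(flops: float, time_ms: float,
                device_name: str, dtype: str = "bf16") -> float:
    """MFU% = achieved FLOPS / peak FLOPS * 100."""
    achieved = flops / (time_ms * 1e-3)  # convert ms to seconds
    return 100.0 * achieved / get_peak_flops(device_name, dtype)
```

For example, a kernel doing 1e12 FLOPs in 2 ms on an H100 achieves 500 TFLOPS, which this sketch reports as roughly 50.6% MFU against the 989 TFLOPS BF16 peak.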
@Johnsonms Johnsonms marked this pull request as ready for review March 21, 2026 05:17
@tridao tridao merged commit 3cafddf into Dao-AILab:main Mar 22, 2026
@Johnsonms Johnsonms deleted the add-mfu-benchmark branch March 22, 2026 14:41
zhuochenKIDD pushed a commit to zhuochenKIDD/flash-attention that referenced this pull request Mar 25, 2026