More realistic bench for POD Attn #2013
Conversation
Walkthrough
The benchmark file adds new persistent BatchAttention and sequential two-kernel benchmark paths with dedicated timing measurements. Randomized test fixture generation is replaced with fixed deterministic configuration sets. Benchmark outputs are extended to report timing and bandwidth calculations for the new paths. Configuration parameters are updated: num_kv_heads increased from 4 to 8 and num_qo_heads from 28 to 32.
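To make the new fixed configurations concrete, here is a minimal sketch of what such a deterministic set could look like. Only the head counts (num_qo_heads = 32, num_kv_heads = 8) come from this PR; head_dim, the sequence-length tuples, and the variable names are illustrative assumptions.

```python
# Illustrative sketch only -- not the exact lists in bench_mixed_attention.py.
num_qo_heads = 32   # updated from 28 in this PR
num_kv_heads = 8    # updated from 4 in this PR
head_dim = 128      # assumed value

# Assumed shape: (prefill_seq_len, decode_batch_size, decode_kv_len) per case.
example_configs = [
    (2048, 64, 2048),
    (4096, 128, 4096),
    (8192, 128, 8192),
]
```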
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Summary of Changes
Hello @Edenzzzz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the realism and scope of the attention-mechanism benchmarks. It updates the test configurations to reflect more practical head sizes and sequence lengths and, crucially, adds new performance comparisons against a persistent BatchAttention kernel and a sequential prefill-plus-decode baseline.
Code Review
This pull request updates the mixed attention benchmark to use more realistic head sizes and sequence lengths, and adds comparisons with a persistent BatchAttention kernel and a sequential prefill-decode implementation. The changes are a good step towards more representative benchmarking. I have a couple of suggestions to improve consistency and maintainability in the benchmark code.
```python
    )
    o_persistent, _ = wrapper_persistent.run(q, kv_data)
    measurements_persistent = bench_gpu_time(lambda: wrapper_persistent.run(q, kv_data))
    ms_persistent = np.mean(measurements_persistent)
```
For consistency with the other measurements in this benchmark, it's better to use np.median instead of np.mean. np.median is more robust to outliers, which can be common in performance measurements.
```diff
-    ms_persistent = np.mean(measurements_persistent)
+    ms_persistent = np.median(measurements_persistent)
```
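A quick numeric illustration of the robustness argument (made-up timings, not measurements from this benchmark):

```python
import numpy as np

# Five hypothetical timing samples in ms; one outlier from clock ramp-up or contention.
measurements = [1.02, 1.01, 1.03, 1.02, 5.40]
print(np.mean(measurements))    # ~1.90 ms, pulled up by the single outlier
print(np.median(measurements))  # 1.02 ms, matches the typical iteration
```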
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
benchmarks/bench_mixed_attention.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/bench_mixed_attention.py (3)
flashinfer/attention.py (1): BatchAttention (42-198)
flashinfer/decode.py (9): plan (810-1102), plan (1603-1726), run (1132-1145), run (1148-1161), run (1163-1374), run (1728-1852), BatchDecodeWithPagedKVCacheWrapper (581-1410), use_tensor_cores (779-780), use_tensor_cores (1576-1577)
flashinfer/prefill.py (11): plan (1523-1919), plan (2489-2777), run (1950-1962), run (1965-1977), run (1979-2206), run (2807-2817), run (2820-2830), run (2832-2978), single_prefill_with_kv_cache (911-932), single_prefill_with_kv_cache (936-957), single_prefill_with_kv_cache (960-1195)
🪛 Ruff (0.14.2)
benchmarks/bench_mixed_attention.py
90-90: Unpacked variable o_persistent is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
```python
    o_persistent, _ = wrapper_persistent.run(q, kv_data)
    measurements_persistent = bench_gpu_time(lambda: wrapper_persistent.run(q, kv_data))
    ms_persistent = np.mean(measurements_persistent)
```
Drop unused persistent output.
Line 90 binds o_persistent, but the value is never read and Ruff emits RUF059. Please discard the binding (for example, call wrapper_persistent.run(q, kv_data) without assignment or bind to _) so the warm-up still happens without leaving an unused variable.
```diff
-    o_persistent, _ = wrapper_persistent.run(q, kv_data)
+    wrapper_persistent.run(q, kv_data)
```
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    wrapper_persistent.run(q, kv_data)
    measurements_persistent = bench_gpu_time(lambda: wrapper_persistent.run(q, kv_data))
    ms_persistent = np.mean(measurements_persistent)
```
🤖 Prompt for AI Agents
In benchmarks/bench_mixed_attention.py around lines 90 to 92, the first call
assigns o_persistent which is never used (RUF059); remove the unused variable by
calling wrapper_persistent.run(q, kv_data) without assignment or assign the
result to _ so the warm-up call still executes but no unused binding remains.
```python
    # Sequential two kernels: single prefill + batch decode (tensor cores)
    # Prefill using single_prefill_with_kv_cache
    def _run_single_prefill():
        return flashinfer.prefill.single_prefill_with_kv_cache(
            q_p,
            k_p,
            v_p,
            causal=causal,
            pos_encoding_mode="NONE",
            backend="fa2",
        )

    measurements_prefill = bench_gpu_time(lambda: _run_single_prefill())
    ms_prefill = np.median(measurements_prefill)

    # Batch decode using tensor cores
    wrapper_decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
        workspace_buffer, kv_layout=kv_layout, use_tensor_cores=True
    )
    wrapper_decode.plan(
        d_kv_indptr.to(device),
        kv_indices_d.to(device),
        last_page_len_d,
        num_qo_heads,
        num_kv_heads,
        head_dim,
        page_block_size,
        data_type=torch.bfloat16,
        q_data_type=torch.bfloat16,
    )
    measurements_decode = bench_gpu_time(lambda: wrapper_decode.run(q_d, kv_d))
    ms_decode = np.median(measurements_decode)
    ms_seq_two_kernels = ms_prefill + ms_decode
```
Measure sequential path in one benchmarked call.
Lines 158-177 derive ms_seq_two_kernels by summing medians from two completely separate benchmark runs. Because bench_gpu_time synchronizes around each callable, that sum omits the synchronization gap between kernels and hides any stream/data dependency penalties when prefill hands off to decode. As a result, the reported “Sequential two kernels” latency is optimistic and not directly comparable to the single-call POD/persistent timings. Benchmark the sequential path inside a single callable and use that median instead so the printed number reflects the real pipeline cost.
```diff
     measurements_prefill = bench_gpu_time(lambda: _run_single_prefill())
     ms_prefill = np.median(measurements_prefill)
     # Batch decode using tensor cores
     wrapper_decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
         workspace_buffer, kv_layout=kv_layout, use_tensor_cores=True
     )
@@
     )
-    measurements_decode = bench_gpu_time(lambda: wrapper_decode.run(q_d, kv_d))
-    ms_decode = np.median(measurements_decode)
-    ms_seq_two_kernels = ms_prefill + ms_decode
+    measurements_decode = bench_gpu_time(lambda: wrapper_decode.run(q_d, kv_d))
+    ms_decode = np.median(measurements_decode)
+
+    def _run_prefill_and_decode():
+        _run_single_prefill()
+        return wrapper_decode.run(q_d, kv_d)
+
+    measurements_seq = bench_gpu_time(_run_prefill_and_decode)
+    ms_seq_two_kernels = np.median(measurements_seq)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    # Sequential two kernels: single prefill + batch decode (tensor cores)
    # Prefill using single_prefill_with_kv_cache
    def _run_single_prefill():
        return flashinfer.prefill.single_prefill_with_kv_cache(
            q_p,
            k_p,
            v_p,
            causal=causal,
            pos_encoding_mode="NONE",
            backend="fa2",
        )

    measurements_prefill = bench_gpu_time(lambda: _run_single_prefill())
    ms_prefill = np.median(measurements_prefill)

    # Batch decode using tensor cores
    wrapper_decode = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
        workspace_buffer, kv_layout=kv_layout, use_tensor_cores=True
    )
    wrapper_decode.plan(
        d_kv_indptr.to(device),
        kv_indices_d.to(device),
        last_page_len_d,
        num_qo_heads,
        num_kv_heads,
        head_dim,
        page_block_size,
        data_type=torch.bfloat16,
        q_data_type=torch.bfloat16,
    )
    measurements_decode = bench_gpu_time(lambda: wrapper_decode.run(q_d, kv_d))
    ms_decode = np.median(measurements_decode)

    def _run_prefill_and_decode():
        _run_single_prefill()
        return wrapper_decode.run(q_d, kv_d)

    measurements_seq = bench_gpu_time(_run_prefill_and_decode)
    ms_seq_two_kernels = np.median(measurements_seq)
```
🤖 Prompt for AI Agents
In benchmarks/bench_mixed_attention.py around lines 145 to 178, the sequential
two-kernel latency is computed by summing medians from two separate
bench_gpu_time runs (prefill and decode), which omits inter-kernel
synchronization and handoff cost; instead, wrap the whole sequential sequence
(call single_prefill_with_kv_cache followed immediately by wrapper_decode.run)
in a single callable passed to bench_gpu_time so the synchronization overhead
between kernels is measured, take the median of that single measurement as
ms_seq_two_kernels, and use that value wherever the combined sequential latency
is reported.
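For intuition about what a single combined measurement captures, here is a generic CUDA-event timing sketch in plain PyTorch. It is not the actual bench_gpu_time implementation, just an assumed stand-in showing how timing one callable keeps the prefill-to-decode hand-off inside the measured window:

```python
import torch

def time_gpu_ms(fn, iters=30, warmup=5):
    """Return per-iteration GPU times (ms) for a callable, measured with CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    timings = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()  # a combined prefill+decode callable includes the inter-kernel gap here
        end.record()
        torch.cuda.synchronize()
        timings.append(start.elapsed_time(end))
    return timings
```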
yzh119
left a comment
I suppose the benefit of POD is mainly coming from overlapping?
LGTM overall, we will revamp the OSS attention code in the coming release and let's check the performance later.
@yzh119 ncu profiling results show POD has less branching and higher memory throughput.
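For anyone reproducing numbers like these, branch and DRAM-throughput metrics are typically collected with Nsight Compute, e.g. something along the lines of `ncu --set full python benchmarks/bench_mixed_attention.py`; the exact kernel filters and metric sets are left to the reader, and the command is an illustrative example rather than one taken from this PR.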






📌 Description
Use real head sizes, seq lens and add comparison with sequential prefill + decode.

Results on H100 (without overlap, which only adds ~150 GB/s for persistent).
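For readers reconstructing GB/s figures like these, here is a hedged sketch of the kind of bandwidth estimate such benchmarks report; the byte accounting and names are assumptions, not the exact formula in bench_mixed_attention.py:

```python
# Achieved bandwidth = bytes moved per iteration / measured kernel time.
def achieved_bandwidth_gb_s(total_bytes_moved: float, ms: float) -> float:
    return total_bytes_moved / (ms * 1e-3) / 1e9

# Example: ~16 GB of KV-cache traffic in 10 ms -> 1600 GB/s.
print(achieved_bandwidth_gb_s(16e9, 10.0))
```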
cc @yzh119
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
I have installed pre-commit by running `pip install pre-commit` (or used your preferred method).I have installed the hooks with `pre-commit install`.I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.🧪 Tests
Tests have been added or updated as needed, and all tests are passing (unittest, etc.).Reviewer Notes
Summary by CodeRabbit
Release Notes
New Features
Tests