fix: Fix memory bandwidth calculation in MLA benchmarks #2479
yzh119 merged 6 commits into flashinfer-ai:main from
Conversation
📝 Walkthrough

Replaces CUDA-graph timing with bench_gpu_time and CUPTI-style timing parameters; updates benchmark calls and the attention routine to compute explicit q/kv/output memory bytes, use the actual accessed KV tokens for FLOPs/bandwidth, and print updated throughput metrics.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello @bkryu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the accuracy of performance metrics in the MLA benchmarks by correcting the memory bandwidth and FLOPs calculations so that they reflect actual data access patterns and computational work, providing a more realistic assessment of performance. Additionally, the benchmarking process in one script is standardized through the adoption of a unified GPU timing utility, enhancing consistency across benchmarks.
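The "FLOPs from actual accessed KV tokens" idea mentioned in the summary can be sketched as follows. This is an illustrative sketch only, not the benchmark's actual code; the function and parameter names here (`decode_attn_flops`, `seq_lens`, `head_dim_qk`, `head_dim_vo`) are hypothetical:

```python
# Hypothetical sketch: FLOPs for one decode attention step, counted over
# the KV tokens each request actually attends to (sum of per-request
# sequence lengths), not the allocated max length.
def decode_attn_flops(num_heads, head_dim_qk, head_dim_vo, seq_lens):
    flops = 0
    for kv_len in seq_lens:                              # one query token per request
        flops += 2 * num_heads * head_dim_qk * kv_len    # Q @ K^T
        flops += 2 * num_heads * kv_len * head_dim_vo    # softmax(scores) @ V
    return flops
```

Counting over `sum(seq_lens)` rather than `batch * max_seq_len` is what makes the reported TFLOPs/s reflect work actually performed.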
Code Review
This pull request correctly fixes the memory bandwidth calculation in MLA benchmarks by using the actual number of accessed bytes instead of the full tensor allocations. It also standardizes the benchmark timing by switching to the unified bench_gpu_time utility. The changes in benchmarks/routines/attention.py are accurate. However, I've found a small issue in the updated bandwidth calculation in benchmarks/bench_trtllm_gen_mla.py which I've commented on.
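The "actual bytes accessed" accounting the review describes can be sketched roughly as below. All names and shapes are illustrative assumptions, not FlashInfer's actual benchmark code:

```python
# Hypothetical sketch: count only the bytes a decode step actually
# touches, rather than the full tensor allocations.
def mla_mem_bytes(batch, num_heads, head_dim_qk, head_dim_vo,
                  seq_lens, kv_dim, dtype_bytes=2):
    q_bytes = batch * num_heads * head_dim_qk * dtype_bytes   # queries read in full
    o_bytes = batch * num_heads * head_dim_vo * dtype_bytes   # outputs written in full
    # Only the KV tokens within each request's sequence length are read,
    # even though the paged cache may be allocated for max_seq_len.
    kv_bytes = sum(seq_lens) * kv_dim * dtype_bytes
    return q_bytes + kv_bytes + o_bytes
```

Using `sum(seq_lens)` instead of the cache's allocated capacity is the difference between the old (inflated) and the fixed bandwidth numbers.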
benchmarks/bench_trtllm_gen_mla.py (Outdated)
```python
print(f"execution time: {ms} ms")
print(f"memory bandwidth: {io / ms / 1024 / 1024:.2f} GB/s")
print(f"FLOPs: {flops * 1e-9 / ms:.2f} TFLOPs/s")
print(f"memory bandwidth: {total_mem_bytes / ms / 1e12:.2f} TB/s")
```
The memory bandwidth calculation appears to have an incorrect divisor. To convert from bytes per millisecond to terabytes per second, the divisor should be 1e9, not 1e12.
The conversion is: bytes / (ms * 1e-3 s/ms) / (1e12 B/TB) = bytes / (ms * 1e9) TB/s.
Suggested change:

```diff
-    print(f"memory bandwidth: {total_mem_bytes / ms / 1e12:.2f} TB/s")
+    print(f"memory bandwidth: {total_mem_bytes / ms / 1e9:.2f} TB/s")
```
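As a quick sanity check of the reviewer's unit math (the byte and time values here are made up for illustration):

```python
# Verify the bytes-per-millisecond -> TB/s conversion with made-up numbers.
total_mem_bytes = 2e12     # pretend the kernel moved 2 TB
ms = 1000.0                # pretend it took 1000 ms (= 1 s), so expect 2 TB/s

canonical = total_mem_bytes / (ms * 1e-3) / 1e12   # bytes / seconds / (bytes per TB)
shortcut = total_mem_bytes / ms / 1e9              # equivalent single-divisor form
```

Both expressions give 2.0 TB/s here, confirming that dividing by `1e9` (not `1e12`) is the correct shortcut when time is in milliseconds.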
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_mla.py`:
- Around line 132-133: The memory-bandwidth print uses the wrong unit divisor. Change the bandwidth computation (the line using total_mem_bytes and ms) to divide by 1e9 instead of 1e12 (or equivalently compute total_mem_bytes / (ms * 1e-3) / 1e12) so that it matches the logic used in attention.py. Update the print statement that references total_mem_bytes / ms / 1e12 to use total_mem_bytes / ms / 1e9 (keeping the same f-string and formatting), leaving the FLOPs print (flops / ms / 1e9) untouched.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_mla.py (1)
85-105: ⚠️ Potential issue | 🟡 Minor — Comment contradicts the actual configuration.

Line 85 states "benchmark using CUPTI" but `enable_cupti=False` on line 102. With the current settings, this uses CUDA graph timing, not CUPTI.

Proposed fix:

```diff
-    # benchmark using CUPTI
+    # benchmark using CUDA graphs
     measurements = bench_gpu_time(
```
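The dry-run/repeat pattern that a timing utility like bench_gpu_time follows can be sketched generically. A CPU timer stands in for CUDA events/CUPTI here; this is an illustration of the pattern, not the utility's real implementation, and the `bench` name is hypothetical:

```python
import time

def bench(fn, dry_run_iters=5, repeat_iters=20):
    # Warm-up phase: populate caches, trigger lazy initialization/autotuning.
    for _ in range(dry_run_iters):
        fn()
    # Measurement phase: one timing sample per iteration, in milliseconds.
    measurements = []
    for _ in range(repeat_iters):
        start = time.perf_counter()
        fn()
        measurements.append((time.perf_counter() - start) * 1e3)
    return measurements

samples = bench(lambda: sum(range(1000)))
```

Returning the full list of per-iteration samples (rather than a single mean) lets the caller compute medians or percentiles, which is what makes unifying on one such utility useful for consistent reporting.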
📌 Description
Summary
- Fixed incorrect memory bandwidth calculation in `testBatchMLAPagedAttentionWrapper` that was using full tensor allocations instead of actual bytes accessed based on sequence lengths
- Updated `bench_trtllm_gen_mla.py` to use the unified `bench_gpu_time()` utility with CUPTI for consistent timing with the benchmark framework

cc @hypdeb
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit

- Chores
  - Improved benchmarking: switched to CUDA/CUPTI-based timing with refined iteration controls (dry-run and repeat by iterations) and optional CUDA graph support.
  - Updated performance reporting to use explicit memory accounting from actual token usage (query, KV, output), and adjusted bandwidth and FLOPs printouts for clearer, more accurate throughput metrics.