pick fa2 for BatchDecodeWithPagedKVCacheWrapper auto backend #2530
Conversation
Summary of Changes

Hello @saltyminty, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refines the backend selection mechanism for attention operations, specifically ensuring that the fa2 backend is selected for non-FP8 decode workloads.
📝 Walkthrough

Adds an "auto" backend token across benchmark wrappers with runtime resolution and reporting, normalizes deprecated backend names, appends a resolved_backend field to perf outputs, and updates decode backend selection to route auto based on FP8 vs non-FP8 q/kv data types.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Bench as Benchmark Orchestration
    participant Resolver as Backend Resolver
    participant Decode as Decode Logic
    participant Wrapper as Wrapper Impl
    participant Output as Perf Recorder
    Bench->>Resolver: Run routine with backend="auto"
    Resolver->>Resolver: Check wrapper supports auto
    Resolver->>Decode: Query q/kv data types
    Decode-->>Resolver: Return resolved_backend (FP8 -> determine_attention_backend, else -> fa2)
    Resolver->>Wrapper: Execute using resolved_backend
    Wrapper-->>Resolver: Return metrics
    Resolver->>Output: Record metrics with backend="auto(resolved_backend)" and resolved_backend field
    Output-->>Bench: Present results
```
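The recording step at the end of the diagram can be sketched as below; `record_result` and its field names are illustrative stand-ins for the benchmark's actual perf-recording code, not the real API:

```python
def record_result(requested: str, resolved: str, metrics: dict) -> dict:
    # Label follows the diagram's "auto(resolved_backend)" convention, and a
    # separate resolved_backend field is appended for downstream tooling.
    label = f"{requested}({resolved})" if requested == "auto" else requested
    return {"backend": label, "resolved_backend": resolved, **metrics}
```

This keeps the human-readable label and the machine-readable resolved backend in separate fields, so tools that group by backend don't have to parse the `auto(...)` string.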
Estimated Code Review Effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
bot run
Code Review
This pull request addresses a performance regression in BatchDecodeWithPagedKVCacheWrapper for non-FP8 workloads by hardcoding the fa2 backend when auto is selected. It also introduces auto backend support in the benchmark suite for both BatchDecodeWithPagedKVCacheWrapper and BatchPrefillWithPagedKVCacheWrapper. The changes are logical and well-aligned with the stated objectives. I've included a couple of minor suggestions to enhance code maintainability by reducing hardcoded values.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@flashinfer/decode.py`:
- Around line 1042-1053: The auto-backend logic incorrectly defaults to "fa2"
when only kv_data_type is FP8; update the branch in decode.py where
self._backend is set (the block using q_data_type, kv_data_type,
PosEncodingMode[pos_encoding_mode], and determine_attention_backend) so that it
calls determine_attention_backend whenever either q_data_type or kv_data_type is
an FP8 type (e.g., torch.float8_e4m3fn or torch.float8_e5m2) instead of only
when q_data_type is FP8, ensuring FP8 KV-only configurations select a valid
backend.
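A minimal sketch of the corrected condition the agent prompt describes, using dtype names as plain strings to keep it dependency-free; the function name and the string return values are illustrative, not the actual decode.py code:

```python
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

def resolve_auto_backend(q_dtype: str, kv_dtype: str) -> str:
    # Corrected check: either q OR kv being FP8 should trigger full backend
    # determination; the buggy version only inspected q_dtype.
    if q_dtype in FP8_DTYPES or kv_dtype in FP8_DTYPES:
        return "determine_attention_backend"  # stand-in for the real call
    return "fa2"
```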
bkryu left a comment
Thanks @saltyminty . Left a comment on the benchmark code.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/routines/attention.py (1)
480-513: ⚠️ Potential issue | 🟡 Minor — "auto" backend bypasses `head_grp_size == 5` constraint check for FA2.

The constraint check at lines 302–313 guards against `head_grp_size == 5` for FA2, but only when "fa2" is explicitly in the backends list. When `backend="auto"`, this check is skipped. Later, during `.plan()` (flashinfer/decode.py lines 1042–1056), "auto" resolves to "fa2" by default when non-FP8 dtypes are used, silently bypassing the constraint validation. Either add a post-resolution check after the wrapper resolves the backend, or document this limitation with the --backends argument.
🧹 Nitpick comments (1)
benchmarks/routines/attention.py (1)
511-513: Accessing private `_backend` attribute is fragile.

`backend_wrappers[backend]._backend` relies on a private/internal attribute of the wrapper to introspect the resolved backend. This works for benchmarking purposes, but could break silently if the library renames or removes this attribute. Consider checking if there's an official public API (e.g., a `backend` property) to query the resolved backend. If not, a defensive `getattr(wrapper, '_backend', backend)` fallback would prevent crashes.

Defensive fallback

```diff
- resolved_backends[backend] = backend_wrappers[backend]._backend
+ resolved_backends[backend] = getattr(backend_wrappers[backend], '_backend', backend)
```
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/routines/attention.py (1)
102-114: ⚠️ Potential issue | 🟡 Minor — "auto" backend lacks runtime validation for unsupported routines.

The help text correctly states that `auto` is only supported for `BatchDecodeWithPagedKVCacheWrapper` and `BatchPrefillWithPagedKVCacheWrapper`, but there's no runtime guard. If a user passes `--backends auto` with `BatchPrefillWithRaggedKVCacheWrapper` or `BatchMLAPagedAttentionWrapper`, it will crash with an unhelpful `AttributeError` (calling `.detach()` on `None` returned from the `run_backend_wrapper` fallback). Consider adding an early validation in those routines to strip `auto` with a clear message, or validate in `parse_attention_args` against the routine.
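One way the suggested early validation could look, sketched here with a hypothetical `validate_backends` helper (the routine names come from the comment above; the function itself is not part of the benchmark code):

```python
AUTO_SUPPORTED_ROUTINES = {
    "BatchDecodeWithPagedKVCacheWrapper",
    "BatchPrefillWithPagedKVCacheWrapper",
}

def validate_backends(routine: str, backends: list) -> list:
    # Fail fast with a clear message instead of a late AttributeError
    # when "auto" reaches a routine that cannot resolve it.
    if "auto" in backends and routine not in AUTO_SUPPORTED_ROUTINES:
        raise ValueError(
            f"--backends auto is not supported for {routine}; "
            f"supported routines: {sorted(AUTO_SUPPORTED_ROUTINES)}"
        )
    return backends
```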
🧹 Nitpick comments (2)
benchmarks/routines/attention.py (2)
1866-1890: Output schema inconsistency: `resolved_backend` field missing in Ragged and MLA routines.

The decode and paged-prefill routines now emit a `resolved_backend` field in their output records, but `testBatchPrefillWithRaggedKVCacheWrapper` (and `testBatchMLAPagedAttentionWrapper`) do not. If downstream tooling consumes these records with a uniform schema, this will cause `KeyError` or missing-column issues. Consider adding `cur_res["resolved_backend"] = backend` in these routines for schema consistency, even if `auto` isn't supported there.

Proposed fix for Ragged (similar for MLA)

```diff
  cur_res["backend"] = backend
+ cur_res["resolved_backend"] = backend
  cur_res["page_size"] = 0  # No page size for ragged
```
471-504: `_backend` accesses an internal attribute and may break if FlashInfer internals change.

The code accesses `backend_wrappers[backend]._backend` (lines 502, 1075, 1105) to retrieve the resolved backend from the wrapper object. The underscore prefix indicates this is a private/internal attribute rather than part of the public API. While this is acceptable for benchmark tooling, the pattern is fragile and could break if FlashInfer updates its wrapper implementation. For benchmark purposes, this is a minor concern since the code will fail loudly at runtime if the attribute disappears. A more robust approach would be to check if FlashInfer exposes a public accessor method, but the current implementation is pragmatic for non-production benchmarking code.
/bot run |
[CANCELING] Pipeline #43647951: canceled
/bot run |
[FAILED] Pipeline #43649627: 4/20 passed
Force-pushed: 52841e5 to 8f15557
/bot run |
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@flashinfer/decode.py`:
- Around line 1042-1056: The code mutates self._backend from "auto" to a
concrete value, preventing re-evaluation on subsequent plan() calls; change to
preserve the user request (e.g., add a new attribute like
self._requested_backend) and always resolve the effective backend from that on
each plan() invocation: keep self._requested_backend (set from constructor/args)
and in plan() check if self._requested_backend == "auto" then call
determine_attention_backend(...) with PosEncodingMode[pos_encoding_mode].value,
q_data_type, kv_data_type, etc., otherwise use the explicitly requested backend,
and assign the resolved value to a temporary/effective variable (not overwrite
self._requested_backend) or to self._backend only as a cached effective_backend
that can be recomputed next call.
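A toy model of the fix the prompt describes, with illustrative names: `_requested_backend` preserves the user's choice while `plan()` recomputes the effective backend on every call. The "fa3" string merely stands in for whatever `determine_attention_backend` would return; this is a sketch, not the real wrapper:

```python
FP8_DTYPES = {"float8_e4m3fn", "float8_e5m2"}

class AutoBackendWrapper:
    """Sketch: the user's request survives plan(); only the cached
    effective backend is updated per call."""

    def __init__(self, backend: str):
        self._requested_backend = backend  # never overwritten
        self._backend = backend            # cached effective backend

    def plan(self, q_dtype: str, kv_dtype: str) -> str:
        if self._requested_backend == "auto":
            # "fa3" stands in for the determine_attention_backend result.
            fp8 = q_dtype in FP8_DTYPES or kv_dtype in FP8_DTYPES
            resolved = "fa3" if fp8 else "fa2"
        else:
            resolved = self._requested_backend
        self._backend = resolved
        return resolved
```

Because `_requested_backend` is never mutated, a second `plan()` call with different dtypes re-resolves correctly instead of being stuck on the first resolution.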
🧹 Nitpick comments (1)
benchmarks/README.md (1)
187-187: Consider adding "auto" to the backend legend for completeness.

The documentation correctly adds "auto" to the available backends list and specifies its current support scope. However, the backend legend at lines 447-460 does not include an entry for "auto", which could leave users unclear about what this backend does (e.g., automatically selects the optimal backend based on data types, as mentioned in the PR objectives).

📚 Suggested documentation enhancement

Add an entry for "auto" in the Backend Legend section (after line 460):

```diff
 - cuda: FlashInfer CUDA kernels
 - cute-dsl: FlashInfer CuTe-DSL kernels (Blackwell SM10.0+)
 - moe_a2a: MoE All-to-All communication (requires mpirun, Blackwell SM10.0+ with MNNVL)
+- auto: Automatically selects the optimal backend based on workload characteristics (e.g., FP8 vs non-FP8 data types). Currently supported for BatchDecodeWithPagedKVCacheWrapper and BatchPrefillWithPagedKVCacheWrapper.
```
/bot run |
[FAILED] Pipeline #43918433: 11/20 passed
📌 Description
BatchDecodeWithPagedKVCacheWrapper used to be hardcoded to select fa2, but now selects fa3 on Hopper due to the refactoring to use determine_attention_backend, which is slower than fa2. Revert to hardcode to fa2 for non-fp8 workloads.
Add auto backend support to flashinfer_benchmark.py for BatchDecodeWithPagedKVCacheWrapper and BatchPrefillWithPagedKVCacheWrapper.
🔍 Related Issues
#2400
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
🧪 Tests
Tests have been added or updated as needed (unittest, etc.).

Reviewer Notes