Fix IMA with flashinfer + spec + topk & Add radix attention test cases for eagle #13740
Conversation
Code Review
This pull request is for debugging an Out-Of-Memory issue related to radix attention, as indicated by the title and description. The changes isolate the problem by removing numerous tests and adding a specific test_radix_attention test. While these changes are significant, they are appropriate for a temporary debugging branch that is not intended to be merged. I have one minor suggestion to align the new test code with standard unittest practices.
/tag-and-rerun-ci
By adding the radix attention test in `test_spec_infer_b.py`, there is a decode OOM issue.

Reproduction

The out-of-memory error has been fixed by #14939.
The flashinfer IMA bugs were reported in #14624.

Why is there an IMA issue?

The custom mask is generated inside the Eagle worker with `build_tree_kernel_efficient`. But the `mask_indptr` is generated inside flashinfer's `begin_forward`, and it is calculated from the padded `qo_indptr` and padded `kv_indptr` when applying CUDA graph padding. The shape mismatch between `custom_mask` and `mask_indptr` causes flashinfer to access illegal memory.

Note that the current fix is not perfect; we want the padding logic to fit naturally with the attention init logic.
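To make the shape mismatch concrete, here is a minimal sketch of the failure mode. The helper `build_mask_indptr` below is hypothetical (it only mimics how per-request mask offsets are derived from `qo_indptr`/`kv_indptr`; it is not the actual flashinfer or SGLang code), and the batch sizes are made up for illustration:

```python
# Hypothetical sketch of the IMA: the per-request mask offset for request i is
# qo_len_i * kv_len_i (a flattened 2D mask). If qo_indptr/kv_indptr are padded
# for the CUDA graph batch while custom_mask was built only for the real
# requests, the derived mask_indptr points past the end of custom_mask.

def build_mask_indptr(qo_indptr, kv_indptr):
    """Mimic deriving mask offsets from qo/kv indptrs (illustrative only)."""
    mask_indptr = [0]
    for i in range(len(qo_indptr) - 1):
        qo_len = qo_indptr[i + 1] - qo_indptr[i]
        kv_len = kv_indptr[i + 1] - kv_indptr[i]
        mask_indptr.append(mask_indptr[-1] + qo_len * kv_len)
    return mask_indptr

# Real batch: one request with 4 draft tokens attending to 8 kv slots,
# so the Eagle worker builds a custom mask with 4 * 8 = 32 entries.
custom_mask_len = 4 * 8

# CUDA-graph padding appends a dummy request, inflating both indptrs.
qo_indptr_padded = [0, 4, 8]
kv_indptr_padded = [0, 8, 16]

mask_indptr = build_mask_indptr(qo_indptr_padded, kv_indptr_padded)
print(mask_indptr)                         # [0, 32, 64]
print(mask_indptr[-1] > custom_mask_len)   # True -> out-of-bounds mask read
```

The last offset (64) exceeds the 32 entries actually present in the custom mask, so indexing the mask with these offsets reads illegal memory for the padded request.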