[Bugfix] Align block table for TRTLLM MLA edge-case #39324
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Code Review
This pull request introduces alignment logic for max_num_blocks in vllm/v1/worker/block_table.py and vllm/v1/worker/gpu/block_table.py to meet the requirements of specific attention backends. Feedback was provided regarding the use of magic numbers and code duplication across the two files. Additionally, the reviewer suggested simplifying redundant conditional checks and noted potential issues with the alignment formula if non-power-of-2 block sizes are used.
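To illustrate the reviewer's point about non-power-of-2 values, here is a minimal sketch of two ways to round a block count up to a multiple. The PR's actual formula is not shown in this conversation, so both helper names and the sample values are hypothetical; the sketch only shows how a bitmask-style round-up can silently fail when the alignment is not a power of two, while a ceiling-division form works for any positive alignment.

```python
def align_up_pow2(n: int, alignment: int) -> int:
    """Round n up with a bitmask. Only correct when alignment is a power of two."""
    return (n + alignment - 1) & ~(alignment - 1)


def align_up(n: int, alignment: int) -> int:
    """Round n up to the next multiple of alignment, for any positive alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    # -(-n // alignment) is ceiling division using only integer arithmetic.
    return -(-n // alignment) * alignment


# Hypothetical values: both helpers agree when the alignment is a power of two.
assert align_up_pow2(295, 4) == align_up(295, 4) == 296

# With a non-power-of-2 alignment the bitmask form returns a wrong result.
assert align_up(10, 6) == 12
assert align_up_pow2(10, 6) % 6 != 0
```

If the alignment can ever be derived from a configurable block size, the ceiling-division form is the safer default.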
Do you think this is something we should add as an attribute to the attention backend? Not sure if this is clearly good to enforce across the board.
Yeah, I'm not sure how to handle this. It seems like a pretty uncommon edge case in the first place, so it might not be worth making a big change to support it. But I'm not sure what the end-result impact of padding the block table by a couple of blocks would be.
Unless there's significant demand for supporting arbitrary |
NickLucche
left a comment
@benchislett @LucasWilkinson I guess we're happy with padding the block_table here?
MatthewBonanni
left a comment
LGTM, happy to land once #39324 (comment) is addressed
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Libin Tang <libin.tang@intel.com>
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Purpose
When running with an unusual max-model-len, such as 9416, TRTLLM MLA crashes. It seems like a straightforward fix to pad the block table's max_num_blocks by a block or two to avoid this case.
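The padding described above can be sketched roughly as follows. This is not the code from the PR: the function name `padded_max_num_blocks`, the block size of 32, and the alignment of 4 blocks are all assumptions chosen for illustration, since neither the backend's actual requirement nor the block size in use is stated here.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for positive integers."""
    return -(-a // b)


def padded_max_num_blocks(max_model_len: int, block_size: int, alignment: int) -> int:
    """Blocks needed to cover max_model_len, rounded up to a multiple of alignment.

    `alignment` is a hypothetical per-backend requirement, expressed in blocks.
    """
    num_blocks = cdiv(max_model_len, block_size)
    return cdiv(num_blocks, alignment) * alignment


# Assumed values: block_size=32 and alignment=4 are illustrative only.
unpadded = cdiv(9416, 32)                       # 295 blocks, not a multiple of 4
padded = padded_max_num_blocks(9416, 32, 4)     # 296 blocks after padding
assert unpadded == 295
assert padded == 296
```

Under these assumed values the padding costs a single extra block-table column, which matches the "block or two" estimate in the description.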
Testing
Ran Kimi K2.5 with this max-model-len (9416) and it works.