Automatically increased max_num_batched_tokens under Mamba align mode #36734
flutist wants to merge 6 commits into
…lign mode block_size Signed-off-by: xjx <493337577@qq.com>
Code Review
This pull request introduces a helpful change to automatically increase max_num_batched_tokens to match block_size when Mamba's align cache mode is used, preventing a potential assertion failure. The implementation is sound, but I've found a minor issue in the warning log message where the old and new token values are swapped, which could be misleading. I've provided a suggestion to correct the order of arguments in the log message.
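To illustrate the argument-order issue raised in this review, here is a small sketch. It is not the code from this PR: the helper name `format_bump_warning` and the message text are hypothetical. It only shows that positional placeholders are filled in order, so the old value must come first.

```python
def format_bump_warning(old_value: int, new_value: int) -> str:
    """Hypothetical helper: build the warning text with old value first, new value second."""
    # Placeholders are filled positionally, so the tuple order must match
    # the order in which the message mentions the values.
    template = (
        "Increasing max_num_batched_tokens from %d to %d "
        "to match block_size in Mamba cache align mode."
    )
    return template % (old_value, new_value)


message = format_bump_warning(2048, 2096)
```

Passing the arguments in the opposite order would produce a message claiming the value went from 2096 to 2048, which is the kind of misleading log the review describes.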
@DarkLight1337 PTAL
@hmellor @mgoin @Isotr0py @noooop @NickLucche @tdoublep @ywang96 @ProExpertProg Sorry to bother you, but could you please help merge this PR? It fixes the problem. If there is anything else I should do, I'm happy to follow up. I look forward to your response.
This pull request has merge conflicts that must be resolved before it can be |
Automatically increased max_num_batched_tokens to accommodate Mamba align mode block_size
Fixes #36697.
When the Mamba cache is used in align mode, block_size may exceed max_num_batched_tokens, which triggers an assertion failure. This PR automatically raises max_num_batched_tokens to match block_size when this condition is detected, and emits a warning to notify the user of the change.
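The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual vLLM change: the function name `align_max_num_batched_tokens` and its signature are hypothetical, and only the rule itself (raise `max_num_batched_tokens` to `block_size` and warn) is taken from the description.

```python
import logging

logger = logging.getLogger(__name__)


def align_max_num_batched_tokens(max_num_batched_tokens: int, block_size: int) -> int:
    """Hypothetical helper: return a token budget that is at least block_size."""
    if block_size > max_num_batched_tokens:
        # Warn with the old value first and the new value second,
        # so the log reflects the direction of the change.
        logger.warning(
            "Mamba cache align mode: block_size (%d) exceeds "
            "max_num_batched_tokens (%d); increasing max_num_batched_tokens to %d.",
            block_size,
            max_num_batched_tokens,
            block_size,
        )
        return block_size
    # Already large enough: leave the user's setting untouched.
    return max_num_batched_tokens
```

With the values from the reported error, `align_max_num_batched_tokens(2048, 2096)` would return the block size, while a budget that already covers the block size is returned unchanged.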
Purpose
Fixes the following assertion failure:
Assertion failed, In Mamba cache align mode, block_size (2096) must be <= max_num_batched_tokens (2048). [type=assertion_error, input_value=ArgsKwargs((), {'model_co...transfer_config': None}), input_type=ArgsKwargs]
Test Plan
Before the fix, the following command fails with the assertion error above:
vllm serve Qwen/Qwen3.5-122B-A10B-GPTQ-Int4 --served-model-name Qwen3.5-27B-AWQ-4bit --gpu-memory-utilization 0.9 --port 8848 -tp 2 --max-model-len 131072 --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 --max-num-seqs 16 --enable-prefix-caching
Test Result
After the fix, the same command starts the server successfully, and completion requests work.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.