fix mtp launch error in vllm-0.17.1-rc, about cuda graph during memory profile #36634
flutist wants to merge 12 commits into vllm-project:main from
Conversation
…spec-decode before graph capture Signed-off-by: xjx <493337577@qq.com>
@mgoin @benchislett @LucasWilkinson @NickLucche PTAL, thanks
@Isotr0py I changed the implementation, PTAL, thanks
Code Review
This pull request addresses a crash that occurs during CUDA graph capture when using MTP speculative decoding. The root cause was that certain Triton kernels required for speculative decoding were not being JIT-compiled during the warmup phase. The fix correctly populates the necessary draft token buffers during the dummy run, ensuring these kernels are compiled before graph capture begins. The change is well-targeted and effectively resolves the issue.
Note: Security Review did not run due to the size of the PR.
#30515 is not included in v0.17.0, so it could not have caused this
Anyway, could you please take a look at this PR and see if it can solve the issue? Thanks.
It still happens in the v0.17.1-rc version.
We have fixed the warmup issue in #36599.
It works for me.
Found an assert error after some time.
@vadiklyutiy didn't we fix this?
That assertion failure looks like a known issue; I thought we merged a fix.
Could you help merge the PR?
@benchislett Sorry to bother you, but could you please help merge this PR? It solved the problem. If there's anything else I can do, I'll continue working on it. I'm very happy to hear your response.
@DarkLight1337 @Isotr0py Could you help take a look? Thanks.
self.input_batch.num_accepted_tokens_cpu[:num_reqs] = max_query_len
self.num_decode_draft_tokens.np[:num_reqs] = max_query_len - 1
self.num_decode_draft_tokens.np[num_reqs:].fill(-1)
Can you explain the choice of the values in each of these cases? Is the .fill(-1) the correct convention?
num_accepted_tokens_cpu[:num_reqs] = max_query_len
Total tokens per request in spec decode = 1 original + (max_query_len - 1) draft. Setting it to max_query_len means "all accepted": a dummy value whose only purpose is to trigger spec-decode Triton JIT warmup.
num_decode_draft_tokens.np[:num_reqs] = max_query_len - 1
Draft count = total - 1. This lets the GDN attention builder see IS_SPEC_DECODING=True.
self.num_decode_draft_tokens.np[num_reqs:].fill(-1)
-1 means "not a decode request". The consumer uses >= 0 as the mask:
spec_sequence_masks_cpu = num_decode_draft_tokens_cpu >= 0  # True=decode, False=prefill/unused
0 would be wrong: it is indistinguishable from "decode with 0 drafts". Unused padding slots must be -1.
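The padding convention above can be illustrated with a minimal NumPy sketch (the buffer name follows the PR; the array sizes and `num_reqs` value are illustrative, not taken from vLLM):

```python
import numpy as np

max_num_reqs = 8   # illustrative padded batch size
num_reqs = 3       # active requests in the dummy run
max_query_len = 3  # 1 original token + 2 draft tokens (num_speculative_tokens=2)

# Active slots get the per-request draft count; padding slots get -1.
num_decode_draft_tokens = np.empty(max_num_reqs, dtype=np.int32)
num_decode_draft_tokens[:num_reqs] = max_query_len - 1  # 2 drafts per request
num_decode_draft_tokens[num_reqs:].fill(-1)             # -1 marks unused slots

# Consumer-side mask: >= 0 means "decode request" (including 0 drafts).
spec_sequence_masks = num_decode_draft_tokens >= 0
print(spec_sequence_masks.tolist())
# -> [True, True, True, False, False, False, False, False]
# Padding with 0 instead of -1 would make unused slots look like
# "decode with 0 drafts" and wrongly enter the spec-decode path.
```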
@benchislett Sorry to bother you despite your busy schedule. Could you please take a look and let me know if there's anything else I need to modify? If you feel everything is okay, could you help approve this PR when you have a moment?
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: xjx <30485581+flutist@users.noreply.github.com>
This has been fixed in #34871
Populate buffers so that GDN attention triggers JIT compilation of spec-decode before graph capture; I thought the bug was introduced by #30515.
Purpose
When launching vllm serve Qwen/Qwen3.5-0.8B --speculative_config '{"method": "mtp", "num_speculative_tokens": 2}'
in vLLM 0.17.0, the console shows an error.
the root cause is the _dummy_run warmup never initialized the spec-decode draft-token buffers, so the GDN attention builder always saw num_decode_draft_tokens == 0 and skipped the spec-decode code path, leaving the IS_SPEC_DECODING=True Triton kernel variants uncompiled until CUDA graph capture — where JIT compilation is forbidden.
So by populating num_decode_draft_tokens > 0 in the dummy-run buffers before calling _build_attention_metadata, the GDN builder produces a non-None spec_sequence_masks, which causes the model forward pass to take the spec-decode code path and JIT-compile the IS_SPEC_DECODING=True Triton kernel variants during warmup (outside CUDA graph capture), so they are already compiled when graph capture begins.
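The ordering requirement behind the fix (spec-decode kernel variants must be JIT-compiled during warmup, never during graph capture) can be sketched as follows. This is a hypothetical toy model, not vLLM code: the function names, the `compiled_variants` set, and the `capturing` flag are simplified stand-ins for the actual GDN builder and Triton JIT machinery.

```python
# Toy model of the warmup/capture ordering; all names are illustrative.
compiled_variants = set()

def build_attention_metadata(num_decode_draft_tokens):
    # Stand-in for the GDN builder: spec_sequence_masks is non-None only
    # if at least one slot actually has draft tokens.
    if any(n > 0 for n in num_decode_draft_tokens):
        return {"spec_sequence_masks": [n >= 0 for n in num_decode_draft_tokens]}
    return {"spec_sequence_masks": None}

def forward(metadata, capturing=False):
    # Each kernel variant is "JIT-compiled" the first time it is needed;
    # compilation during capture is forbidden, mirroring the real crash.
    variant = ("IS_SPEC_DECODING=True"
               if metadata["spec_sequence_masks"] is not None
               else "IS_SPEC_DECODING=False")
    if variant not in compiled_variants:
        if capturing:
            raise RuntimeError(f"JIT compile of {variant} during CUDA graph capture")
        compiled_variants.add(variant)  # compiled in warmup, outside capture

# Before the fix: the dummy run left draft counts at 0, so warmup only
# ever compiled the non-spec variant.
forward(build_attention_metadata([0, 0]))
# The fix: populate draft counts > 0 in the dummy-run buffers, so warmup
# also compiles the spec-decode variant.
forward(build_attention_metadata([2, 2]))
# Graph capture then finds both variants already compiled and no longer crashes.
forward(build_attention_metadata([2, 2]), capturing=True)
print(sorted(compiled_variants))
# -> ['IS_SPEC_DECODING=False', 'IS_SPEC_DECODING=True']
```

Without the second warmup call, the `capturing=True` call would raise, which is the toy-model analogue of the launch error this PR fixes.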
Test Plan
Test Result
Everything works after deploying the revised code.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.