[Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131072 #27114
Merged
yewentao256 merged 5 commits into vllm-project:main on Oct 17, 2025
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Contributor
Code Review
This pull request fixes a bug by using PIECEWISE cudagraphs on the Blackwell architecture when max_model_len exceeds 131072. The changes modify the VllmConfig class to check for this condition and override cudagraph_mode accordingly, and add warning messages to the logger.
Signed-off-by: mgoin <mgoin64@gmail.com>
tlrmchlsmth
approved these changes
Oct 17, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
yewentao256
approved these changes
Oct 17, 2025
Member
yewentao256
left a comment
LGTM, thanks for the work!
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request on Oct 20, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com>
adabeyta pushed a commit to adabeyta/vllm that referenced this pull request on Oct 20, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com>
Contributor
Added this issue to FlashInfer to track the long-term fix for FULL CG support of the TRTLLM backend.
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request on Oct 24, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request on Oct 26, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request on Nov 7, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request on Nov 10, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request on Nov 29, 2025
…072 (vllm-project#27114) Signed-off-by: mgoin <mgoin64@gmail.com>
Purpose
FIX #27057
The original issue was found because Qwen3-VL models completely lost accuracy (1% vs 86% on GSM8K) on B200 GPUs when using the default FULL_AND_PIECEWISE cudagraph_mode. The issue did not occur on Hopper, with PIECEWISE mode only, with the FlashAttention backend, or when explicitly disabling TRTLLM attention.
The root cause: TRTLLM attention is selected dynamically based on runtime conditions (num_tokens, max_seq_len, kv_cache_dtype). During FULL CG capture, max_seq_len is used, which when greater than 128K results in FlashInfer being selected; but during actual inference, without using the full context length, the same conditions triggered TRTLLM selection. This created a graph/runtime mismatch where captured graphs referenced FlashInfer kernels but the runtime attempted to execute TRTLLM kernels, producing incorrect results. I was able to reproduce this behavior on any model with a default max_model_len > 128K.

By enforcing PIECEWISE mode in this PR to disable cudagraph capture of attention, we avoid this dynamism. In the future we should see if we can make TRTLLM support larger context lengths to support FULL graphs.
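The gating described above can be sketched as a minimal standalone check. This is a hedged illustration only: the enum values, the `resolve_cudagraph_mode` helper, and the way the platform flag is passed are assumptions for this sketch, not the exact vLLM API; only the 131072 threshold and the FULL_AND_PIECEWISE-to-PIECEWISE downgrade come from the PR description.

```python
from enum import Enum

class CUDAGraphMode(Enum):
    # Illustrative enum; vLLM's real definition may differ.
    FULL_AND_PIECEWISE = "full_and_piecewise"
    PIECEWISE = "piecewise"

# Beyond 128K (131072) tokens, TRTLLM attention is not selected at
# capture time, so FULL graph capture and runtime kernel selection
# can disagree (FlashInfer captured, TRTLLM executed).
TRTLLM_MAX_SEQ_LEN = 131072

def resolve_cudagraph_mode(mode: CUDAGraphMode,
                           is_blackwell: bool,
                           max_model_len: int) -> CUDAGraphMode:
    """Downgrade to PIECEWISE on Blackwell when max_model_len exceeds
    the TRTLLM limit, keeping attention out of the captured graph."""
    if (is_blackwell
            and mode == CUDAGraphMode.FULL_AND_PIECEWISE
            and max_model_len > TRTLLM_MAX_SEQ_LEN):
        return CUDAGraphMode.PIECEWISE
    return mode
```

With PIECEWISE, attention runs eagerly outside the graph, so the dynamic FlashInfer/TRTLLM backend choice can no longer diverge from what was captured.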
Test Plan
Test Result
Reproduction on B200 on main:
Running on this PR:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.