Breakable Cuda Graph Support for bs > 1 #24662
Merged
Capture only the inner transformer stack (layer_model) instead of the outer *ForCausalLM.forward. logits_processor / pooler now runs eagerly after replay with the live forward_batch, so captured segments are bs-invariant: their kernel launches only depend on num_tokens, not on batch_size. Drops the bs=1 reject in can_run; multi-req prefill stays on graph instead of falling back to eager.

Mechanics:
- Resolve self.layer_model the same way PCG does (patch_model boundary).
- replay() monkey-patches layer_model.forward with a closure that replays the captured CUDAGraph and returns the captured hidden_states; the outer model.forward then runs logits_processor/pooler eagerly.
- _run_forward calls layer_model.forward directly during capture, so it must re-apply @torch.no_grad() (the outer *ForCausalLM.forward carried it). Without that, MoE @torch.compile kernels using torch.sum(out=...) fail dynamo with "out= doesn't support autograd", and mamba state ops spuriously track grad and hang capture.
- Drops the static_seq_lens / static_extend_* / static_req_pool_indices / static_orig_seq_lens machinery; they only existed to give the in-graph logits_processor stable bs=1 addresses.
- can_run gains an is_target_verify reject (matches PCG); per-reject counter + periodic [BCG] replays/rejects log line to verify the fix actually keeps prefill on graph under load.

Validated on g294 H100s, mgsm_en 200q, fa3, --mem-fraction-static 0.85, threads=1 and threads=32:

| Config | t32 score | t32 tput | t32 replays | t32 rejects |
|---|---|---|---|---|
| qwen3_8b_tp1 | 0.82 | 3448 | 300 | 0 |
| qwen3_30b_a3b_tp2 | 0.965 | 2711 | 300 | 0 |
| nemotronh_8b_tp2 | 0.285 | 3508 | 300 | 0 |

rejects=0 across all three confirms multi-req prefill stays on graph (vs the prior bs>1 eager-fallback). Scores match prior BCG baselines within mgsm_en noise (~0.03); throughput on-par or slightly better (qwen3_30b_a3b_tp2 +4.3% vs prior BCG notes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
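The replay() mechanics described above can be illustrated with a small, torch-free sketch. Everything here is hypothetical and not taken from the PR's code: `FakeCapturedGraph` stands in for a captured CUDAGraph, `LayerModel` / `CausalLM` stand in for the inner stack and the outer *ForCausalLM, and `patched_forward` is an illustrative helper name. The sketch also assumes the original forward is restored after replay, which the commit message does not state.

```python
from contextlib import contextmanager


class FakeCapturedGraph:
    """Hypothetical stand-in for a captured CUDAGraph and its static output buffer."""

    def __init__(self, static_output):
        self.static_output = static_output
        self.replays = 0

    def replay(self):
        # A real graph would re-launch the captured kernels here.
        self.replays += 1


class LayerModel:
    """Stand-in for the inner transformer stack (layer_model)."""

    def forward(self, tokens):
        return [t * 2 for t in tokens]  # eager path


class CausalLM:
    """Stand-in for the outer *ForCausalLM wrapper."""

    def __init__(self):
        self.model = LayerModel()

    def forward(self, tokens):
        hidden = self.model.forward(tokens)
        return sum(hidden)  # stand-in for the eager logits_processor / pooler


@contextmanager
def patched_forward(layer_model, graph):
    """Temporarily route layer_model.forward through graph replay."""

    def replay_forward(*args, **kwargs):
        graph.replay()
        return graph.static_output  # captured hidden_states

    original = layer_model.forward
    layer_model.forward = replay_forward
    try:
        yield
    finally:
        layer_model.forward = original  # assumption: restore after replay


lm = CausalLM()
graph = FakeCapturedGraph(static_output=[10, 20])
with patched_forward(lm.model, graph):
    result = lm.forward([1, 2])  # inner stack replayed, outer head runs eagerly
```

Because only the inner forward is swapped, the outer forward still sees the live inputs for everything it does after the inner call, which is the property the commit relies on to keep captured segments independent of batch size.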
The replay/can_run_reject counter and periodic [BCG] log line were instrumentation to verify multi-req prefill actually stays on graph during validation. Validation passed (rejects=0 across qwen3_8b_tp1, qwen3_30b_a3b_tp2, nemotronh_8b_tp2 at threads=32) — drop the instrumentation now that the fix is confirmed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
/tag-run-ci-label
Collaborator
Author
/rerun-failed-ci
ispobock
approved these changes
May 11, 2026
merrymercy
reviewed
May 11, 2026
    if hasattr(language_model, "model")
    and hasattr(language_model.model, "layers")
    else language_model
)
Contributor
This is very hacky. It is based on string name match.
At least you should raise a warning if the string match failed.
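One possible way to act on this review comment is sketched below. It is a minimal illustration, not the code that was merged: the function name `resolve_layer_model` and the warning text are hypothetical, and the attribute checks simply mirror the `hasattr` conditions quoted in the snippet under review.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def resolve_layer_model(language_model):
    """Return the inner stack if it looks like one, else fall back with a warning.

    Hypothetical helper: mirrors the hasattr(...) checks from the reviewed
    snippet, but makes the fallback path visible in the logs.
    """
    inner = getattr(language_model, "model", None)
    if inner is not None and hasattr(inner, "layers"):
        return inner
    logger.warning(
        "Could not find a `.model.layers` inner stack on %s; "
        "falling back to the outer module.",
        type(language_model).__name__,
    )
    return language_model


# Usage with stand-in objects (not real model classes):
inner_stack = SimpleNamespace(layers=[object(), object()])
wrapped = SimpleNamespace(model=inner_stack)
bare = SimpleNamespace()

resolved_wrapped = resolve_layer_model(wrapped)  # returns inner_stack
resolved_bare = resolve_layer_model(bare)        # logs a warning, returns bare
```

The lookup is still attribute-name based, so this only addresses the second half of the comment (surfacing a failed match), not the underlying reliance on naming conventions.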
LucQueen
pushed a commit
to LucQueen/sglang
that referenced
this pull request
May 12, 2026
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xjpang
pushed a commit
to xjpang/sglang
that referenced
this pull request
May 13, 2026
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>