Fix flashinfer autotune oom glm51 #24195

Merged
Fridge003 merged 7 commits into main from kp/fix-flashinfer-autotune-OOM-GLM51
Jun 3, 2026
@kpham-sgl kpham-sgl commented May 1, 2026

Collaborator

Brought over from PR #23796 to trigger Nightly CI


CI States

Latest PR Test (Base): 🚫 Run #26854148428
Latest PR Test (Extra): ❌ Run #26854148322

Kangyan-Zhou and others added 2 commits May 1, 2026 01:07
The autotune cache only needs attention/MoE/GEMM kernel timings, but
_dummy_run currently goes all the way through LogitsProcessor, where
the [batch * dp_size, vocab] tensor-parallel all-gather buffer can OOM
under DP attention on tight memory budgets (e.g. GLM-5.1-FP8 TP8+DP8
with --mem-fraction-static=0.9, which has been failing every B200
nightly run since the test was added).
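
As a rough illustration of why this buffer matters, the sketch below computes the size of a [batch * dp_size, vocab] tensor. The batch size, vocabulary size, and element width are hypothetical values chosen for the example; none of them come from this PR or from the GLM-5.1 configuration.

```python
def logits_gather_bytes(batch: int, dp_size: int, vocab: int, bytes_per_elem: int) -> int:
    """Size in bytes of a dense [batch * dp_size, vocab] buffer."""
    rows = batch * dp_size
    return rows * vocab * bytes_per_elem


# Hypothetical numbers for illustration only (not taken from the PR):
# 4096 dummy tokens per rank, DP size 8, a 150,000-entry vocabulary,
# and 4-byte (float32) logits.
size = logits_gather_bytes(batch=4096, dp_size=8, vocab=150_000, bytes_per_elem=4)
print(f"{size / 2**30:.1f} GiB")
```

Under these assumed values the buffer alone is on the order of tens of GiB, which is easy to exceed when --mem-fraction-static leaves only about 10% of device memory for everything outside the static pool.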

Add a module-level autotune_dummy_run_mode() context manager (mirroring
cuda_graph_runner's model_capture_mode pattern), wrap the autotune
dummy run with it, and have LogitsProcessor.forward short-circuit to
LogitsProcessorOutput(next_token_logits=None) when the flag is set.
The return value is discarded by the autotune call site, so the stub
output is safe.

Mirrors vLLM's split between _dummy_run (no lm_head, used by autotune)
and _dummy_sampler_run (lm_head, profile-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Comment-analyzer review flagged three near-duplicate comment blocks
explaining the same OOM mechanism. Keep the canonical explanation on
_in_autotune_dummy_run, shrink the call-site comment to a one-liner
that notes the dispatch ordering, and drop the redundant model_runner
comment (the context-manager name is self-descriptive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@kpham-sgl

Collaborator Author

/tag-run-ci-label

@github-actions github-actions Bot added the run-ci label May 1, 2026
@kpham-sgl kpham-sgl changed the title from "Kp/fix flashinfer autotune oom glm51" to "Fix flashinfer autotune oom glm51" May 1, 2026
@kpham-sgl

Collaborator Author

…otune-OOM-GLM51

# Conflicts:
#	python/sglang/srt/model_executor/model_runner.py
@kpham-sgl

Collaborator Author

Comment thread python/sglang/srt/layers/logits_processor.py
@Fridge003 Fridge003 merged commit b5560ff into main Jun 3, 2026
152 of 197 checks passed
@Fridge003 Fridge003 deleted the kp/fix-flashinfer-autotune-OOM-GLM51 branch June 3, 2026 06:29
stellaxcpeng pushed a commit to stellaxcpeng/sglang that referenced this pull request Jun 4, 2026
alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026
jeynmann pushed a commit to jeynmann/sglang that referenced this pull request Jun 4, 2026
edwingao28 pushed a commit to edwingao28/sglang that referenced this pull request Jun 7, 2026
3 participants