Fix flashinfer autotune oom glm51 by kpham-sgl · Pull Request #24195 · sgl-project/sglang

kpham-sgl · 2026-05-01T01:09:40Z

Brought over from PR #23796 to trigger Nightly CI

CI States

Latest PR Test (Base): 🚫 Run #26854148428
Latest PR Test (Extra): ❌ Run #26854148322

The autotune cache only needs attention/MoE/GEMM kernel timings, but _dummy_run currently goes all the way through LogitsProcessor, where the [batch * dp_size, vocab] tensor-parallel all-gather buffer can OOM under DP attention on tight memory budgets (e.g. GLM-5.1-FP8 TP8+DP8 with --mem-fraction-static=0.9, which has been failing every B200 nightly run since the test was added).

Add a module-level autotune_dummy_run_mode() context manager (mirroring cuda_graph_runner's model_capture_mode pattern), wrap the autotune dummy run with it, and have LogitsProcessor.forward short-circuit to LogitsProcessorOutput(next_token_logits=None) when the flag is set. The return value is discarded by the autotune call site, so the stub output is safe.

Mirrors vLLM's split between _dummy_run (no lm_head, used by autotune) and _dummy_sampler_run (lm_head, profile-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
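The pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: only the names autotune_dummy_run_mode, _in_autotune_dummy_run, LogitsProcessor.forward and LogitsProcessorOutput(next_token_logits=None) come from the description. The class bodies, the list-based "logits", and the restore-previous-value behavior are assumptions made for the example.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import List, Optional

# Module-level flag (name taken from the PR text; everything else is a sketch).
_in_autotune_dummy_run = False


@contextmanager
def autotune_dummy_run_mode():
    """Mark the enclosed forward pass as an autotune-only dummy run."""
    global _in_autotune_dummy_run
    previous = _in_autotune_dummy_run
    _in_autotune_dummy_run = True
    try:
        yield
    finally:
        # Restore the earlier value even if the dummy run raises.
        _in_autotune_dummy_run = previous


@dataclass
class LogitsProcessorOutput:
    # Simplified stand-in for illustration, not sglang's actual class.
    next_token_logits: Optional[List[float]]


class LogitsProcessor:
    # Simplified stand-in for illustration, not sglang's actual class.
    def forward(self, hidden_states: List[List[float]]) -> LogitsProcessorOutput:
        if _in_autotune_dummy_run:
            # Short-circuit: skip the lm_head projection and the large
            # all-gather buffer; the autotune caller discards the result.
            return LogitsProcessorOutput(next_token_logits=None)
        # Placeholder for the real logits computation.
        return LogitsProcessorOutput(
            next_token_logits=[sum(row) for row in hidden_states]
        )
```

A call site would wrap only the autotune dummy run in `with autotune_dummy_run_mode(): ...`, so every other forward pass still computes logits as usual.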

Comment-analyzer review flagged three near-duplicate comment blocks explaining the same OOM mechanism. Keep the canonical explanation on _in_autotune_dummy_run, shrink the call-site comment to a one-liner that notes the dispatch ordering, and drop the redundant model_runner comment (the context-manager name is self-descriptive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-05-01T01:09:44Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

kpham-sgl · 2026-05-01T01:10:09Z

/tag-run-ci-label

kpham-sgl · 2026-05-01T02:11:16Z

One example of an OOM failure in CI (before the fix): https://github.com/sgl-project/sglang/actions/runs/25192084660/job/73864201790
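For a rough sense of scale, the size of a [batch * dp_size, vocab] buffer like the one named in the PR description can be estimated with simple arithmetic. The numbers below are hypothetical placeholders, not measurements from the failing GLM-5.1 run, and the 4-byte element size (float32 logits) is also an assumption.

```python
def logits_buffer_bytes(batch_per_rank: int, dp_size: int,
                        vocab_size: int, bytes_per_elem: int = 4) -> int:
    """Bytes needed for a dense [batch_per_rank * dp_size, vocab_size] buffer."""
    return batch_per_rank * dp_size * vocab_size * bytes_per_elem


# Hypothetical values chosen only to illustrate the scaling with dp_size.
size = logits_buffer_bytes(batch_per_rank=2048, dp_size=8, vocab_size=150_000)
print(f"{size / 2**30:.2f} GiB")
```

Because the row count is multiplied by dp_size, the buffer grows linearly with the DP degree, which is why a tight --mem-fraction-static setting leaves little headroom for it.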

kpham-sgl · 2026-05-05T02:46:02Z

Nightly run https://github.com/sgl-project/sglang/actions/runs/25355092765

Kangyan-Zhou and others added 2 commits May 1, 2026 01:07

kpham-sgl requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, hnyls2002, ispobock and merrymercy as code owners May 1, 2026 01:09

github-actions Bot added the run-ci label May 1, 2026

kpham-sgl changed the title ~~Kp/fix flashinfer autotune oom glm51~~ Fix flashinfer autotune oom glm51 May 1, 2026

Merge remote-tracking branch 'origin/main' into kp/fix-flashinfer-autotune-OOM-GLM51 (conflicts: python/sglang/srt/model_executor/model_runner.py)

f90f4f9

Merge branch 'main' into kp/fix-flashinfer-autotune-OOM-GLM51

d571ee6

Fridge003 reviewed May 25, 2026

Comment thread python/sglang/srt/layers/logits_processor.py

Fridge003 approved these changes May 27, 2026

Fridge003 added 3 commits May 27, 2026 22:48

Merge branch 'main' into kp/fix-flashinfer-autotune-OOM-GLM51

5d61b01

Merge branch 'main' into kp/fix-flashinfer-autotune-OOM-GLM51

34d925c

Fix lint formatting

76e2a90

sglang-bot mentioned this pull request Jun 3, 2026

CUDA Coredump Tracker #26340

Open

Fridge003 merged commit b5560ff into main Jun 3, 2026
152 of 197 checks passed

Fridge003 deleted the kp/fix-flashinfer-autotune-OOM-GLM51 branch June 3, 2026 06:29

stellaxcpeng pushed a commit to stellaxcpeng/sglang that referenced this pull request Jun 4, 2026

Fix flashinfer autotune oom glm51 (sgl-project#24195)

a1c42e5

alphabetc1 pushed a commit to alphabetc1/sglang that referenced this pull request Jun 4, 2026

Fix flashinfer autotune oom glm51 (sgl-project#24195)

e574095

jeynmann pushed a commit to jeynmann/sglang that referenced this pull request Jun 4, 2026

Fix flashinfer autotune oom glm51 (sgl-project#24195)

b6f03ab

edwingao28 pushed a commit to edwingao28/sglang that referenced this pull request Jun 7, 2026

Fix flashinfer autotune oom glm51 (sgl-project#24195)

89ee69b

Fridge003 merged 7 commits into main from kp/fix-flashinfer-autotune-OOM-GLM51
