Skip LM head during FlashInfer autotune dummy run by Kangyan-Zhou · Pull Request #23796 · sgl-project/sglang

Kangyan-Zhou · 2026-04-27T03:47:33Z

Summary

Make LogitsProcessor.forward short-circuit during the FlashInfer autotune dummy run so the LM head + tensor-parallel all-gather are skipped.
Mirrors vLLM's design where _dummy_run returns hidden states without calling compute_logits.
Fixes the persistent CUDA OOM in the GLM-5.1-FP8 nightly test on B200.

The bug

test/registered/8-gpu-models/test_glm_51_fp8.py::TestGlm51Fp8::test_glm51_fp8 (TP8+DP8 variant) has failed in every B200 nightly run since the test was added on 2026-04-09 (PR #22399), with an identical signature each time:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.38 GiB.
GPU 0 has a total capacity of 178.35 GiB of which 2.98 GiB is free.
... 170.11 GiB is allocated by PyTorch ...
  File ".../model_executor/model_runner.py", line 2244, in _flashinfer_autotune
  File ".../model_executor/model_runner.py", line 2522, in _dummy_run
  ...
  File ".../layers/logits_processor.py", line 865, in _get_logits
    logits = tensor_model_parallel_all_gather(logits)
  File ".../distributed/parallel_state.py", line 880, in all_gather
    output_tensor = output_tensor.reshape(...)

Latest reproducer (failing run, job 72975439421 on 2026-04-25).

Why TP8+DP8 specifically OOMs

The autotune dummy run uses batch_size = req_to_token_pool.size. With DP attention enabled (--tp 8 --dp 8 --enable-dp-attention), _get_logits first calls _gather_dp_attn_hidden_states, which multiplies the token count by dp_size. The LM head and the subsequent all-gather then produce a [batch × dp_size, vocab] buffer.

Variant	`max_running_requests`	tokens after DP gather	logits all-gather buffer	Result
TP8 (no DP)	2048	2048 (no DP gather)	~0.59 GiB	PASS
TP8+DP8	2048	2048 × 8 = 16,384	~4.74 GiB + input copy → 6.38 GiB	OOM
TP8+DP8+MTP	6 (EAGLE caps to 48 ÷ dp_size=8)	192	~57 MiB	PASS
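
The buffer sizes in the table follow from rows × vocab × bytes per element. Below is a rough sketch of that arithmetic. The vocabulary size (about 154,880) and the 2-byte bf16 logits are assumptions made for illustration; neither is stated in this PR, and the helper name logits_buffer_gib is hypothetical.

```python
# Rough size of the [tokens, vocab] logits buffer produced by the LM head all-gather.
# ASSUMPTIONS (not stated in the PR): vocab size ~154,880 and 2-byte (bf16) logits.
ASSUMED_VOCAB_SIZE = 154_880
ASSUMED_BYTES_PER_LOGIT = 2
GIB = 1024 ** 3


def logits_buffer_gib(batch_size: int, dp_size: int = 1) -> float:
    """Return the approximate logits buffer size in GiB after the DP gather."""
    # The DP attention gather multiplies the number of rows by dp_size.
    tokens = batch_size * dp_size
    return tokens * ASSUMED_VOCAB_SIZE * ASSUMED_BYTES_PER_LOGIT / GIB


if __name__ == "__main__":
    for label, batch, dp in [("TP8 (no DP)", 2048, 1), ("TP8+DP8", 2048, 8)]:
        print(f"{label}: {logits_buffer_gib(batch, dp):.2f} GiB")
```

Under these assumptions the two rows land close to the table's ~0.59 GiB and ~4.74 GiB. The 8× factor from the DP gather is what pushes the TP8+DP8 variant past the roughly 3 GiB that remain free.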

GLM-5.1-FP8 sits at the unfortunate intersection: flashinfer_trtllm MoE backend (autotune fires), DP attention enabled, large auto-resolved max_running_requests=2048, and aggressive --mem-fraction-static=0.9 leaving only ~3 GiB free after weights+KV. Comparable tests like test_qwen35.py use --mem-fraction-static=0.8, which gives ~17 GiB extra activation headroom and absorbs the 4.7 GiB buffer.

What vLLM does

vllm/v1/worker/gpu_model_runner.py:5615-5619: _dummy_run returns hidden states directly. compute_logits (lm_head) runs only in a separate _dummy_sampler_run, which is called from profile_run and never from flashinfer_autotune. So vLLM never allocates a [*, vocab] tensor during autotune. Even in production, lines 4086-4087 gather hidden_states[logits_indices] before compute_logits, so the lm_head only sees [num_reqs, hidden] (one row per sequence), never [num_tokens, hidden].
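
To make the row-selection point concrete, here is a minimal pure-Python sketch of keeping one hidden-state row per request before the LM head. The helper select_logit_rows and the toy shapes are hypothetical; vLLM does this with tensor indexing rather than lists, so this only illustrates the idea.

```python
from typing import List

Row = List[float]


def select_logit_rows(hidden_states: List[Row], logits_indices: List[int]) -> List[Row]:
    """Keep only the rows whose logits are needed (here, each request's last token)."""
    return [hidden_states[i] for i in logits_indices]


# Toy batch: 3 requests with 4, 2, and 3 tokens -> 9 token rows, hidden size 2.
num_tokens_per_req = [4, 2, 3]
hidden_states = [[float(t), float(t) + 0.5] for t in range(sum(num_tokens_per_req))]

# Index of the last token of each request in the flattened batch.
logits_indices: List[int] = []
offset = 0
for n in num_tokens_per_req:
    offset += n
    logits_indices.append(offset - 1)

selected = select_logit_rows(hidden_states, logits_indices)
print(len(hidden_states), "rows in,", len(selected), "rows passed to the LM head")
```

The LM head cost scales with the number of rows it receives, so selecting first keeps the [*, vocab] output at one row per request.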

What this PR changes

python/sglang/srt/layers/logits_processor.py: add module-level _in_autotune_dummy_run flag and a @contextmanager autotune_dummy_run_mode() (mirrors cuda_graph_runner.is_capture_mode / model_capture_mode). At the top of LogitsProcessor.forward, return LogitsProcessorOutput(next_token_logits=None) when the flag is set. The short-circuit sits before the MIS / DLLM / common dispatch, so all three LM-head paths are covered.
python/sglang/srt/model_executor/model_runner.py: wrap the _dummy_run call in _flashinfer_autotune with autotune_dummy_run_mode(). The autotune call site discards the return value (run_once() is called without consuming its result), so the stub output is safe.

_dummy_run has only one caller (_flashinfer_autotune), so the bypass cannot leak into cuda graph capture, profiling, or production forward.

Test plan

Reproduce the failure on B200 with the exact failing variant:

python -m sglang.launch_server \
  --model-path zai-org/GLM-5.1-FP8 \
  --trust-remote-code --tp 8 --dp 8 --enable-dp-attention \
  --reasoning-parser glm45 --tool-call-parser glm47 \
  --mem-fraction-static 0.9 --enable-metrics

Expect server start to succeed (no OOM in _flashinfer_autotune) after this fix.

Run the registered test to validate all three variants:

python3 test/registered/8-gpu-models/test_glm_51_fp8.py

Spot-check that gsm8k accuracy on the TP8+DP8+MTP variant is unchanged (the only DP-attention variant that already passes), to confirm that skipping the LM head during autotune doesn't regress kernel selection.
Smoke test a non-DP run on a different model to confirm no regression in production lm_head behavior (only autotune path changed).

Notes

Adjacent observation worth flagging separately: EAGLE speculative decoding hardcodes max_running_requests=48 in server_args.py:3370, then _resolve_max_num_reqs in model_runner_kv_cache_mixin.py:707 divides by dp_size. With dp_size=8 this becomes 6 per worker. Whether 48 is meant as system-wide or per-worker is ambiguous in the warning text — out of scope for this PR but worth a follow-up.

The autotune cache only needs attention/MoE/GEMM kernel timings, but _dummy_run currently goes all the way through LogitsProcessor, where the [batch * dp_size, vocab] tensor-parallel all-gather buffer can OOM under DP attention on tight memory budgets (e.g. GLM-5.1-FP8 TP8+DP8 with --mem-fraction-static=0.9, which has been failing every B200 nightly run since the test was added).

Add a module-level autotune_dummy_run_mode() context manager (mirroring cuda_graph_runner's model_capture_mode pattern), wrap the autotune dummy run with it, and have LogitsProcessor.forward short-circuit to LogitsProcessorOutput(next_token_logits=None) when the flag is set. The return value is discarded by the autotune call site, so the stub output is safe.

Mirrors vLLM's split between _dummy_run (no lm_head, used by autotune) and _dummy_sampler_run (lm_head, profile-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Comment-analyzer review flagged three near-duplicate comment blocks explaining the same OOM mechanism. Keep the canonical explanation on _in_autotune_dummy_run, shrink the call-site comment to a one-liner that notes the dispatch ordering, and drop the redundant model_runner comment (the context-manager name is self-descriptive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist · 2026-04-27T03:47:37Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

kpham-sgl

I think this is correct but cannot reproduce locally :'(. Move to another branch to trigger nightly CI

Kangyan-Zhou and others added 2 commits April 26, 2026 19:54

kpham-sgl self-assigned this Apr 27, 2026

kpham-sgl marked this pull request as ready for review May 1, 2026 00:03

kpham-sgl requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, hnyls2002, ispobock and merrymercy as code owners May 1, 2026 00:03

kpham-sgl reviewed May 1, 2026

kpham-sgl closed this May 1, 2026

kpham-sgl mentioned this pull request May 1, 2026

Fix flashinfer autotune oom glm51 #24195

Merged

Kangyan-Zhou wants to merge 2 commits into sgl-project:main from Kangyan-Zhou:fix_glm51_fp8