Skip LM head during FlashInfer autotune dummy run#23796

Closed
Kangyan-Zhou wants to merge 2 commits into
sgl-project:mainfrom
Kangyan-Zhou:fix_glm51_fp8

Conversation

@Kangyan-Zhou

Collaborator

Summary

  • Make LogitsProcessor.forward short-circuit during the FlashInfer autotune dummy run so the LM head + tensor-parallel all-gather are skipped.
  • Mirrors vLLM's design where _dummy_run returns hidden states without calling compute_logits.
  • Fixes the persistent CUDA OOM in the GLM-5.1-FP8 nightly test on B200.

The bug

test/registered/8-gpu-models/test_glm_51_fp8.py::TestGlm51Fp8::test_glm51_fp8 (TP8+DP8 variant) has been failing every B200 nightly run since the test was added on 2026-04-09 (PR #22399). Identical signature in every run:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.38 GiB.
GPU 0 has a total capacity of 178.35 GiB of which 2.98 GiB is free.
... 170.11 GiB is allocated by PyTorch ...
  File ".../model_executor/model_runner.py", line 2244, in _flashinfer_autotune
  File ".../model_executor/model_runner.py", line 2522, in _dummy_run
  ...
  File ".../layers/logits_processor.py", line 865, in _get_logits
    logits = tensor_model_parallel_all_gather(logits)
  File ".../distributed/parallel_state.py", line 880, in all_gather
    output_tensor = output_tensor.reshape(...)

Latest reproducer (failing run, job 72975439421 on 2026-04-25).

Why TP8+DP8 specifically OOMs

The autotune dummy run uses batch_size = req_to_token_pool.size. With DP attention enabled (--tp 8 --dp 8 --enable-dp-attention), _get_logits first calls _gather_dp_attn_hidden_states, which multiplies the token count by dp_size. Then the LM head + all-gather produces a [batch × dp_size, vocab] buffer.

| Variant | max_running_requests | tokens after DP gather | logits all-gather buffer | Result |
|---|---|---|---|---|
| TP8 (no DP) | 2048 | 2048 (no DP gather) | ~0.59 GiB | PASS |
| TP8+DP8 | 2048 | 2048 × 8 = 16,384 | ~4.74 GiB + input copy → 6.38 GiB | OOM |
| TP8+DP8+MTP | 6 (EAGLE caps to 48 ÷ dp_size=8) | 192 | ~57 MiB | PASS |
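As a rough sanity check on the buffer sizes above, the arithmetic can be sketched as follows. The vocabulary size and element width used here are illustrative assumptions (they are not read from the model config and are not stated in this PR); they were picked only because they land close to the table's figures. The helper name `logits_buffer_bytes` is hypothetical.

```python
GIB = 1024 ** 3


def logits_buffer_bytes(batch_size, dp_size, vocab_size, bytes_per_element):
    """Size in bytes of a [batch_size * dp_size, vocab_size] logits buffer."""
    rows = batch_size * dp_size  # the DP gather multiplies the token count by dp_size
    return rows * vocab_size * bytes_per_element


# Assumptions for illustration only, not taken from the model config.
VOCAB_SIZE = 154_880
BYTES_PER_ELEMENT = 2  # e.g. a 16-bit dtype such as bf16

no_dp = logits_buffer_bytes(2048, 1, VOCAB_SIZE, BYTES_PER_ELEMENT)
with_dp = logits_buffer_bytes(2048, 8, VOCAB_SIZE, BYTES_PER_ELEMENT)

print(f"TP8 (no DP): {no_dp / GIB:.2f} GiB")
print(f"TP8+DP8:     {with_dp / GIB:.2f} GiB")
```

Whatever the exact vocabulary size, the buffer scales linearly with dp_size, so the DP8 variant needs eight times the no-DP allocation for the same max_running_requests.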

GLM-5.1-FP8 sits at the unfortunate intersection: flashinfer_trtllm MoE backend (autotune fires), DP attention enabled, large auto-resolved max_running_requests=2048, and aggressive --mem-fraction-static=0.9 leaving only ~3 GiB free after weights+KV. Comparable tests like test_qwen35.py use --mem-fraction-static=0.8, which gives ~17 GiB extra activation headroom and absorbs the 4.7 GiB buffer.

What vLLM does

vllm/v1/worker/gpu_model_runner.py:5615-5619: _dummy_run returns hidden states directly. compute_logits (lm_head) only runs in a separate _dummy_sampler_run, which is called from profile_run and never from flashinfer_autotune. So vLLM never allocates a [*, vocab] tensor during autotune. Even in production, lines 4086-4087 gather hidden_states[logits_indices] before compute_logits, so the lm_head only sees [num_reqs, hidden] (one row per sequence), never [num_tokens, hidden].
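The row-selection idea described above can be illustrated with a toy, pure-Python stand-in. This is not vLLM code: `select_logits_rows` is a hypothetical helper, `compute_logits` here is a plain matrix product over nested lists, and the shapes are tiny so the effect is easy to see.

```python
def select_logits_rows(hidden_states, logits_indices):
    """Keep only the rows that need logits (e.g. the last token of each request)."""
    return [hidden_states[i] for i in logits_indices]


def compute_logits(rows, lm_head):
    """Toy LM head: lm_head has shape [vocab][hidden]; returns [len(rows)][vocab]."""
    return [
        [sum(h * w for h, w in zip(row, weights)) for weights in lm_head]
        for row in rows
    ]


# Two requests with 3 and 2 tokens: 5 token rows, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
logits_indices = [2, 4]  # index of the last token of each request
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab size 3

selected = select_logits_rows(hidden_states, logits_indices)
logits = compute_logits(selected, lm_head)
print(len(hidden_states), "token rows ->", len(logits), "logit rows")
```

Selecting rows first means the vocab-sized output has one row per request rather than one per token, which is the property the paragraph above attributes to vLLM's production path.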

What this PR changes

  • python/sglang/srt/layers/logits_processor.py: add module-level _in_autotune_dummy_run flag and a @contextmanager autotune_dummy_run_mode() (mirrors cuda_graph_runner.is_capture_mode / model_capture_mode). At the top of LogitsProcessor.forward, return LogitsProcessorOutput(next_token_logits=None) when the flag is set. The short-circuit sits before the MIS / DLLM / common dispatch, so all three LM-head paths are covered.
  • python/sglang/srt/model_executor/model_runner.py: wrap the _dummy_run call in _flashinfer_autotune with autotune_dummy_run_mode(). The autotune call site discards the return value (run_once() is called without consuming its result), so the stub output is safe.

_dummy_run has only one caller (_flashinfer_autotune), so the bypass cannot leak into cuda graph capture, profiling, or production forward.

Test plan

  • Reproduce the failure on B200 with the exact failing variant:
    python -m sglang.launch_server \
      --model-path zai-org/GLM-5.1-FP8 \
      --trust-remote-code --tp 8 --dp 8 --enable-dp-attention \
      --reasoning-parser glm45 --tool-call-parser glm47 \
      --mem-fraction-static 0.9 --enable-metrics
    
    Expect server start to succeed (no OOM in _flashinfer_autotune) after this fix.
  • Run the registered test to validate all three variants:
    python3 test/registered/8-gpu-models/test_glm_51_fp8.py
    
  • Spot-check that gsm8k accuracy on the TP8+DP8+MTP variant is unchanged (the only variant that already passes), to confirm autotune skipping the LM head doesn't regress kernel selection.
  • Smoke test a non-DP run on a different model to confirm no regression in production lm_head behavior (only autotune path changed).

Notes

  • Adjacent observation worth flagging separately: EAGLE speculative decoding hardcodes max_running_requests=48 in server_args.py:3370, then _resolve_max_num_reqs in model_runner_kv_cache_mixin.py:707 divides by dp_size. With dp_size=8 this becomes 6 per worker. Whether 48 is meant as system-wide or per-worker is ambiguous in the warning text — out of scope for this PR but worth a follow-up.

Kangyan-Zhou and others added 2 commits April 26, 2026 19:54
The autotune cache only needs attention/MoE/GEMM kernel timings, but
_dummy_run currently goes all the way through LogitsProcessor, where
the [batch * dp_size, vocab] tensor-parallel all-gather buffer can OOM
under DP attention on tight memory budgets (e.g. GLM-5.1-FP8 TP8+DP8
with --mem-fraction-static=0.9, which has been failing every B200
nightly run since the test was added).

Add a module-level autotune_dummy_run_mode() context manager (mirroring
cuda_graph_runner's model_capture_mode pattern), wrap the autotune
dummy run with it, and have LogitsProcessor.forward short-circuit to
LogitsProcessorOutput(next_token_logits=None) when the flag is set.
The return value is discarded by the autotune call site, so the stub
output is safe.

Mirrors vLLM's split between _dummy_run (no lm_head, used by autotune)
and _dummy_sampler_run (lm_head, profile-only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Comment-analyzer review flagged three near-duplicate comment blocks
explaining the same OOM mechanism. Keep the canonical explanation on
_in_autotune_dummy_run, shrink the call-site comment to a one-liner
that notes the dispatch ordering, and drop the redundant model_runner
comment (the context-manager name is self-descriptive).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@kpham-sgl kpham-sgl self-assigned this Apr 27, 2026
@kpham-sgl kpham-sgl marked this pull request as ready for review May 1, 2026 00:03

@kpham-sgl kpham-sgl left a comment

Collaborator


I think this is correct, but I cannot reproduce it locally :'(. Moving this to another branch to trigger nightly CI.
