ci: combine H200 8-GPU warmup steps and surface server log on every path by alisonshao · Pull Request #24253 · sgl-project/sglang

alisonshao · 2026-05-02T01:31:00Z

Summary

Two related changes to the per-commit H200 8-GPU CI warmup pipeline:

1. Remove the heavyweight `Warmup Server CUDA Graphs` step; expand the lightweight DeepGEMM step to cover all per-commit H200 models

warmup_server.py was launching the full sglang server for V3-0324:8 and Ring-2.5-1T:8 to do a full model load + DeepGEMM JIT pre-compile + CUDA graph capture. The graph-capture portion is wasted work — CUDA graphs don't persist across processes, so subsequent test-side server launches re-capture from scratch. Only the DeepGEMM JIT cache (/root/.cache/deep_gemm, persisted via bind mount on the host) carries any benefit forward.

Even worse, the warmup only covered V3-0324 and Ring, leaving every other per-commit H200 model (V3.2, GLM-5-FP8, MiniMax-M2.5, MiMo, Nemotron-3-Super, Step-3.5, Qwen3-Next) to pay its own JIT-compile cost during the test step.

Failure example showing this exact bottleneck: https://github.com/sgl-project/sglang/actions/runs/25205614862/job/73974157058

In that run, step 8 "Warmup Server CUDA Graphs" finished in 1 second (fast-path skip via marker), step 7 "Warmup DeepGEMM JIT Compilation" in 6 seconds (7 canned shapes, all cache hits). But step 9 "Run test" then chewed ~14m 52s of in-test DeepGEMM JIT compile time out of its 20m 15s budget — most of it on four expensive shapes (2:38, 2:38, 2:53, 2:53) the warmup steps never primed. The test file (test_dsa_models_mtp.py) hit the per-file 1200s timeout with a 25th JIT session still running. Same root cause hit test_minimax_m25_basic.py earlier.

This PR collapses both warmup steps into one expanded warmup_deep_gemm.py invocation that covers every per-commit H200 8-GPU model. CUDA graph capture is no longer attempted; only the JIT cache (which actually persists) is populated.

Ring-2.5-1T is dropped from the warmup since it only runs in nightly-8-gpu-common, not per-commit.

Per-commit model list (the canonical source is sglang-ops/ci-machines-overview.md "H200 Pre-cached Models" — comment in the workflow points to it):

deepseek-ai/DeepSeek-V3-0324:8
deepseek-ai/DeepSeek-V3.2:8
zai-org/GLM-5-FP8:8
XiaomiMiMo/MiMo-V2-Flash:8
XiaomiMiMo/MiMo-V2.5:8
MiniMaxAI/MiniMax-M2.5:8
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16:8
stepfun-ai/Step-3.5-Flash:8
Qwen/Qwen3-Next-80B-A3B-Instruct:2

Timeout bumped from 25 → 60 min. Cold-cache cumulative across ~7 unique architectures is up to ~45 min (each goes through the sglang.compile_deep_gemm fallback, which loads weights and JITs all shapes). Warm-cache runs finish in <5 min.

2. Surface the warmup-server log tail on every code path (file remains used by other workflows)

scripts/ci/cuda/warmup_server.py is no longer called by the H200 8-GPU PR test, but is still used by other CUDA workflows (4-gpu-h100, 8-gpu-h200-deepep, B200) and by manual debugging. The script redirects launch_server stdout/stderr to a tempfile to dodge the 64KB pipe deadlock during CUDA graph capture, and previously only dumped that log to its own stdout on the failure path — the success path silently unlinked it.

That's why the recent `Validation failed for inclusionAI/Ring-2.5-1T: Missing shards in model-of-00160.safetensors: [160]` error (separately fixed in #24237) only appeared in /tmp/warmup_server_*.log on the runner and required SSHing in to find. This change moves the tail dump into the finally block so it runs uniformly on success, failure, and exception. The duplicate dump on the failure path is removed.

Test plan

On a freshly-recreated H200 runner (cold /root/.cache/deep_gemm): the merged warmup step takes <60 min and the per-test JIT compile sessions seen in test_dsa_models_mtp.py / test_minimax_m25_basic.py are gone.
On a warm runner: the merged step completes in single-digit minutes (DeepSeek-family hits the lightweight no-load path; everything else hits warm fallback cache).
The next 4-gpu-h100 / 8-gpu-h200-deepep / B200 PR test job that does still run warmup_server.py shows a --- server log tail (N lines, last 30) --- block in the GH Actions log regardless of success/failure.
No regression for nightly-8-gpu-common (Ring-2.5-1T's nightly path is unaffected; warmup wasn't relied on there).

`warmup_server.py` redirects launch_server stdout/stderr to a tempfile to dodge the 64KB pipe deadlock during CUDA graph capture. Previously that log was only dumped to stdout when the server failed to start; the success path silently unlinked it. Anything the server logged (validation messages, deprecation warnings, slow-shard-load progress, NCCL init lines) was invisible to CI consumers. Move the dump into the finally block so the tail (last 30 lines) is always emitted. This removes the duplicate dump in the failure-path branch and gives CI visibility into noteworthy server-side events without anyone having to SSH onto the runner to read /tmp/warmup_server_*.log.
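The "dump the tail in the finally block" pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in warmup_server.py: the helper names `dump_log_tail` and `run_with_log` are hypothetical, and only the header format and the 30-line tail are taken from the PR text.

```python
import os
import tempfile

TAIL_LINES = 30  # the PR text says the last 30 lines are shown


def dump_log_tail(log_path, tail_lines=TAIL_LINES):
    """Print the last `tail_lines` lines of the log and return them."""
    try:
        with open(log_path, "r", errors="replace") as f:
            lines = f.read().splitlines()
    except OSError:
        return []  # log missing or unreadable; nothing to show
    tail = lines[-tail_lines:]
    print(f"--- server log tail ({len(lines)} lines, last {tail_lines}) ---")
    for line in tail:
        print(line)
    return tail


def run_with_log(work):
    """Run `work(log_file)`; always surface the log tail, then delete the file.

    `work` is a hypothetical stand-in for launching the server with its
    stdout/stderr redirected into `log_file`.
    """
    fd, log_path = tempfile.mkstemp(prefix="warmup_server_", suffix=".log")
    try:
        with os.fdopen(fd, "w") as log_file:
            return work(log_file)
    finally:
        # Runs on success, failure, and exception alike.
        dump_log_tail(log_path)
        os.unlink(log_path)
```

Because the dump lives in `finally`, a separate dump on the failure branch would print the same lines twice, which is why the PR removes it.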

gemini-code-assist · 2026-05-02T01:31:04Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Removes the "Warmup Server CUDA Graphs" step (full launch_server + CUDA graph capture) from stage-c-test-8-gpu-h200 and folds its purpose into "Warmup DeepGEMM JIT Compilation" by passing every per-commit H200 8-GPU model to warmup_deep_gemm.py. CUDA graphs don't persist across processes anyway — only the DeepGEMM JIT cache does — so the heavyweight server-launch path was paying graph-capture cost for zero cross-process benefit. Why expand the model list: warmup_server.py was warming only V3-0324 and Ring-2.5-1T; tests for V3.2, GLM-5-FP8, MiniMax-M2.5, MiMo, Nemotron-3-Super, Step-3.5, and Qwen3-Next paid full JIT-compile cost during their own startup, often eating the per-test 1200s budget. Ring is removed entirely from per-commit warmup since it's nightly-only (suite=nightly-8-gpu-common). Per-commit H200 8-GPU model list is now duplicated between the workflow and sglang-ops/ci-machines-overview.md "H200 Pre-cached Models". Comment in the workflow points to the doc as the canonical keep-in-sync source. Timeout bumped 25→60 min: cold-cache run loads ~7 unique architectures through `sglang.compile_deep_gemm` fallback (~5–10 min each); warm runs finish in <5 min.

alisonshao · 2026-05-02T05:57:21Z

/rerun-stage stage-c-8-gpu-h200

github-actions · 2026-05-02T05:57:42Z

❌ Stage stage-c-8-gpu-h200 doesn't support isolated runs yet.

NVIDIA stages:

stage-a-test-1-gpu-small
stage-a-test-cpu
stage-b-test-1-gpu-small
stage-b-test-1-gpu-large
stage-b-test-2-gpu-large
stage-b-test-4-gpu-b200
stage-c-test-4-gpu-h100
stage-c-test-8-gpu-h200
stage-c-test-8-gpu-h20
stage-c-test-4-gpu-b200
stage-c-test-4-gpu-gb200
stage-c-test-deepep-4-gpu-h100
stage-c-test-deepep-8-gpu-h200
multimodal-gen-test-1-gpu
multimodal-gen-test-2-gpu
multimodal-gen-component-accuracy
multimodal-gen-component-accuracy-1-gpu
multimodal-gen-component-accuracy-2-gpu
multimodal-gen-test-1-b200

AMD stages:

sgl-kernel-unit-test-amd
sgl-kernel-unit-test-2-gpu-amd
stage-a-test-1-gpu-small-amd
stage-b-test-1-gpu-small-amd
stage-b-test-1-gpu-small-amd-nondeterministic
stage-b-test-1-gpu-small-amd-mi35x
stage-b-test-1-gpu-large-amd
stage-b-test-2-gpu-large-amd
multimodal-gen-test-1-gpu-amd
multimodal-gen-test-2-gpu-amd
stage-c-test-large-8-gpu-amd
stage-c-test-large-8-gpu-amd-mi35x

Other stages will be added soon. For now, use /rerun-failed-ci for those stages.

alisonshao · 2026-05-02T05:58:36Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-02T05:59:06Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

When fallback_compile_deep_gemm spawns sglang.compile_deep_gemm and one TP rank crashes (e.g. the IndexError currently triggered by XiaomiMiMo/MiMo-V2-Flash QKV weight loading), the surviving ranks deadlock on NCCL collectives waiting for the dead one. The parent compile_deep_gemm also hangs waiting on its children. Previously, warmup_deep_gemm.py used subprocess.run() with no timeout and would sit there until the GH Actions step cap killed the whole job. Real example: dispatch run on PR #24253 chewed 57 min before being terminated.

This change wraps the fallback in a Popen + os.setsid process group with a 15-min per-model wait. On timeout, SIGTERM the whole group, escalate to SIGKILL after 10s if needed, log a warning, and continue to the next model. One bad model can no longer eat the entire warmup budget.

Failure example (the 57-min hang) addressed by this change: https://github.com/sgl-project/sglang/actions/runs/25245303024/job/74028467044
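A rough sketch of the process-group timeout described in this commit message, under stated assumptions: the function name `run_with_group_timeout` is hypothetical, and `start_new_session=True` is used here as the standard-library way to have the child call setsid(), where the commit message mentions os.setsid directly.

```python
import os
import signal
import subprocess


def run_with_group_timeout(cmd, timeout_s=900.0, grace_s=10.0):
    """Run cmd in its own process group; return its exit code, or None on timeout."""
    # The child calls setsid(), so it leads a new process group whose id
    # equals its pid. Grandchildren that stay in that group die with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass

    # SIGTERM the whole group first, then escalate to SIGKILL after grace_s.
    for sig, wait_s in ((signal.SIGTERM, grace_s), (signal.SIGKILL, None)):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # group already gone
        try:
            proc.wait(timeout=wait_s)
            break
        except subprocess.TimeoutExpired:
            continue
    print(f"WARNING: {cmd[0]} timed out after {timeout_s}s; continuing")
    return None
```

A caller would loop over models and treat a `None` return as a non-fatal failure for that model. Note that children which start their own session escape the group kill, which is the gap a later commit in this PR addresses for warmup_server.py.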

Kangyan-Zhou · 2026-05-02T18:50:56Z

The MiMo model does not seem to be working.

alisonshao · 2026-05-03T01:25:59Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-03T01:26:25Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

Kangyan-Zhou · 2026-05-03T05:23:09Z

https://github.com/sgl-project/sglang/actions/runs/25266653392/job/74082200496

Is this expected to have a 40-minute runtime?

alisonshao · 2026-05-04T09:29:51Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-04T09:30:25Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

The fallback path in warmup_deep_gemm.py was launching every non-DeepSeek model with just `--tp N`, but per-rank DeepGEMM shapes depend on tp/dp/ep. Two of the configured models crashed deterministically at tp=8: MiMo-V2-Flash on a QKV narrow that overruns at TP=8 (test runs tp=4+dp=2+dp_attn) and MiniMax-M2.5 on the FP8 block_n divisibility check (test runs tp=8+ep=8). Each crash then sat in a 900s parent-side poll loop, costing ~30 min per shard for no cache benefit.

Add a per-model FALLBACK_ARGS table so each fallback subprocess passes the same dp/ep/dp-attention flags the test uses, populating cache shapes the test will actually request on a fresh runner.

Watch subprocess output for crash markers ("Scheduler hit an exception" / "Received sigquit from a child") and kill the process group as soon as one is seen, instead of waiting out the timeout. Outer timeout dropped 900s -> 600s for any wedge that doesn't emit a marker.

Workflow argv changes that pair with the new dispatch:

- MiMo-V2-Flash :8 -> :4 (matches test_mimo_models.py)
- Qwen3-Next-80B :2 -> :4 (matches test_disaggregation_hybrid_attention.py)
- Drop Nemotron-3-Super-BF16 (BF16 model, doesn't use DeepGEMM kernels)
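The per-model argument table and the crash-marker scan might look roughly like the sketch below. The two marker strings and the MiMo-V2-Flash flags are quoted from this PR; the table contents, the helper names, and the `--model` spelling on the compile_deep_gemm command line are assumptions for illustration, and the real script may differ.

```python
# Hypothetical table: only flags quoted in this PR's commit messages appear
# here; the real FALLBACK_ARGS in warmup_deep_gemm.py may hold more entries.
FALLBACK_ARGS = {
    "XiaomiMiMo/MiMo-V2-Flash": ["--dp", "2", "--enable-dp-attention"],
}

# Marker strings quoted in the commit message.
CRASH_MARKERS = (
    "Scheduler hit an exception",
    "Received sigquit from a child",
)


def build_fallback_cmd(model, tp):
    """Assemble a compile_deep_gemm command line for one model.

    Assumption: the module accepts --model and --tp like the server launcher.
    """
    cmd = ["python3", "-m", "sglang.compile_deep_gemm",
           "--model", model, "--tp", str(tp)]
    return cmd + FALLBACK_ARGS.get(model, [])


def first_crash_marker(lines):
    """Return the first crash marker found in an iterable of output lines."""
    for line in lines:
        for marker in CRASH_MARKERS:
            if marker in line:
                return marker
    return None
```

In use, the parent would feed the subprocess's output lines through `first_crash_marker` as they arrive and kill the process group on the first hit, keeping the 600s outer timeout only for hangs that emit no marker.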

alisonshao · 2026-05-05T02:10:58Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-05T02:11:24Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

MiniMax-M2.5, Step-3.5-Flash, and Qwen3-Next emit zero "Try DeepGEMM JIT Compiling" events on rank 0 in either the warmup subprocess or the actual test step (verified on run 25354167665), even though they're FP8/MoE models. Their MoE paths take moe_a2a_backend='none' which bypasses DeepGEMM (fp8.py:813-827); their linear paths fall to cutlass/triton instead of the block-FP8 DeepGEMM wrapper. Warming them up populates no cache and just wastes 5-10 min/shard on weight load and server bringup. Step-3.5-Flash specifically hit the 600s timeout on slow-disk runners. Drop them from the workflow warmup list and from the FALLBACK_ARGS dispatch in the script. The remaining list (V3-0324, V3.2, GLM-5-FP8, MiMo-V2-Flash, MiMo-V2.5) covers every per-commit model that actually uses DeepGEMM at runtime. If a future change makes any of the dropped models block-FP8 with auto runner backend, "Try DeepGEMM JIT Compiling" lines will start appearing in their test step and they should be re-added.
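The re-add criterion above (whether "Try DeepGEMM JIT Compiling" lines appear in a model's log) can be checked mechanically. A minimal sketch, assuming the log lines have already been collected per model; the function names and the input shape are hypothetical.

```python
# Log substring quoted in the commit message.
JIT_EVENT = "Try DeepGEMM JIT Compiling"


def count_jit_events(log_lines):
    """Count log lines that report a DeepGEMM JIT compile attempt."""
    return sum(1 for line in log_lines if JIT_EVENT in line)


def models_worth_warming(logs_by_model):
    """Return the models whose logs show at least one JIT event, sorted by name."""
    return sorted(
        model
        for model, lines in logs_by_model.items()
        if count_jit_events(lines) > 0
    )
```

A model with zero events gains nothing from the warmup step, so it only adds weight-load time.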

alisonshao · 2026-05-05T23:33:06Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-05T23:33:38Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

MiMo-V2.5 warmup hangs at the /generate readiness probe with current flags --dp 2 --enable-dp-attention. py-spy from run 25408125444 showed DP0 ranks stuck inside the vision encoder (vision.py:435, seq_lens.max().item()) while DP1 ranks were in forward_idle at the MoE gate — classic cross-DP desync. JIT events emit before the hang, so the cache is populated, but each shard burns the full 600s outer timeout on the dead probe afterwards. The actual test_mimo_models.py TestMiMoV2 launches MiMo-V2.5 with --mm-enable-dp-encoder + --attention-backend fa3 + --mm-attention-backend fa3 on top of the dp/dp-attention flags. --mm-enable-dp-encoder switches the vision encoder to data-parallel mode so both DP groups participate symmetrically, removing the desync. Add the same flags to MiMo-V2.5's FALLBACK_ARGS, and add --attention-backend fa3 to MiMo-V2-Flash so its warmup also matches its test config.

alisonshao · 2026-05-06T10:18:09Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-06T10:18:39Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

alisonshao · 2026-05-06T22:15:58Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-06T22:16:31Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

Today V3.2 dedups to V3-0324 in the lightweight path because they share the same architecture key (hidden_size / num_attention_heads / kv_lora_rank / etc.). The lightweight path computes attention shapes assuming TP-only sharding (num_local_heads = num_attention_heads // tp = 16). But V3.2 is launched in tests with --tp 8 --dp 8 --enable-dp-attention, which keeps all 128 attention heads per rank. So the shapes the dedup populates are correct for the no-DP variant (TestDeepseekV32TP) but wrong for the DP variant (TestDeepseekV32 / mtp / hisparse). V3.2's entry in FALLBACK_ARGS was dead code — the dedup short-circuited before fallback ran. Skip both the dedup check and the lightweight path when the model has an explicit FALLBACK_ARGS entry, so V3.2 goes through compile_deep_gemm with its real launch flags. V3-0324 (no FALLBACK_ARGS) still takes the fast lightweight path; the disk cache it populates also covers V3.2's no-DP test variant since both compute the same N/K. Only effect on the model list today: V3.2 now adds ~3 min of fallback weight load + warmup batch (matching GLM-5-FP8's fallback profile) in exchange for warming the DP-attention attention shapes that the bulk of V3.2 tests actually use.
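The head-count mismatch can be made concrete with a small calculation. This is a sketch under one stated assumption: with DP attention enabled, attention is sharded over `tp // dp` ranks rather than `tp`. The function name is hypothetical; the numbers (128 heads, tp=8, dp=8) come from the commit message.

```python
def local_attention_heads(num_attention_heads, tp, dp=1, dp_attention=False):
    """Attention heads held by each rank.

    Assumption: with DP attention, the attention TP size is tp // dp.
    """
    attn_tp = tp // dp if dp_attention else tp
    return num_attention_heads // attn_tp


# TP-only sharding, as the lightweight path assumes.
print(local_attention_heads(128, tp=8))  # 16

# --tp 8 --dp 8 --enable-dp-attention, as most V3.2 tests launch.
print(local_attention_heads(128, tp=8, dp=8, dp_attention=True))  # 128
```

The two launch modes therefore request different attention GEMM shapes, so a cache warmed for 16 heads per rank does not cover the 128-head case.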

alisonshao · 2026-05-06T22:48:16Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-06T22:48:47Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

alisonshao · 2026-05-12T23:35:40Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-12T23:36:10Z

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

Multiprocessing-spawned scheduler_TP* and detokenizer subprocesses run under their own session/process group, so they survive killpg on launch_server. When wait_for_server returns false (readiness timeout or unclean exit), those orphans stay alive holding ~120 GB / GPU each, turning a "non-fatal" warmup failure into a downstream OOM in the very next CI step. Real example: job 75642889628 step 9 OOMed in DeepSeek-V3.2 create_weights because step 8's Ring-2.5-1T server-startup timeout left all eight scheduler_TP workers parked on the GPUs. The current stage-c-test-8-gpu-h200 path no longer calls warmup_server.py after the earlier commits in this PR, but 4-gpu-h100, 8-gpu-h200-deepep, B200 and manual debugging still do, so the kill path needs to be self-healing. After killpg on launch_server, SIGKILL any survivors matching sglang::scheduler or sglang::detokenizer by name and sleep 2s so the driver releases device memory before the next iteration loads weights.
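One way to sketch the "SIGKILL survivors by name" step is to scan /proc for matching command lines. This is an illustration only: the helper names are hypothetical, it is Linux-specific, and the actual script may locate the processes differently. Unlike the commit message, this sketch sleeps only when something was killed.

```python
import os
import signal
import time

# Process-name fragments quoted in the commit message.
SURVIVOR_PATTERNS = ("sglang::scheduler", "sglang::detokenizer")


def find_survivors(patterns=SURVIVOR_PATTERNS, proc_root="/proc"):
    """Return pids whose command line contains any pattern (Linux /proc layout)."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        if any(p in cmdline for p in patterns):
            pids.append(int(entry))
    return pids


def kill_survivors(settle_s=2.0):
    """SIGKILL leftover workers, then pause so the driver can free GPU memory."""
    pids = find_survivors()
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or not ours to kill
    if pids:
        time.sleep(settle_s)
    return pids
```

This would run right after the killpg on launch_server, before the next model starts loading weights.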

Two stale model references left over from earlier iterations:

1. .github/workflows/pr-test.yml stage-c-test-deepep-8-gpu-h200 warmup passed `DeepSeek-V3.2-Exp:8` to warmup_deep_gemm.py. The two baseline tests that actually run in that suite (test_deepseek_v32_cp_single_node and test_deepep_large) both use `DeepSeek-V3.2`, so V3.2-Exp warmed the wrong arch and V3.2 itself hit cache misses during the test. Replace with V3.2 to match the suite's actual model.

2. The usage examples in warmup_deep_gemm.py and warmup_server.py docstrings still showed `V3.2-Exp` and `Ring-2.5-1T`. Ring-2.5-1T moved to nightly-8-gpu-common in #24725 and is no longer in any per-commit warmup list; V3.2-Exp has been superseded by V3.2. Update docstring examples so they reflect current per-commit models and don't mislead readers into thinking older models are still in scope.

The per-commit warmup model list in pr-test.yml stage-c-test-8-gpu-h200 (V3-0324, V3.2, GLM-5-FP8, MiMo-V2-Flash, MiMo-V2.5) already matches the current baseline 8-GPU H200 tests on the #24725 tag-routing branch (test_deepseek_v3_mtp, test_dsa_models_mtp, test_mimo_models). No change needed there. MiniMax-M2.5 stays out of the warmup list because test_minimax_m25_basic.py emits zero DeepGEMM JIT events (verified in the earlier "drop non-DeepGEMM models" commit).

alisonshao · 2026-05-13T20:34:31Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-13T20:35:05Z

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

alisonshao · 2026-05-13T21:15:16Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-13T21:15:46Z

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

The fallback path in warmup_deep_gemm.py launches `sglang.compile_deep_gemm` for any model in FALLBACK_ARGS so the populated DeepGEMM cache matches the test's dp/ep/dp-attention launch flags. The compile step itself is fast when the cache is warm, but the subprocess unconditionally loads model weights first (45-170s per model, dominating the work). On a warm CI host this re-runs every commit even though the cache is already valid.

Real observation: PR #24253 per-commit warmup took 13 min on a long-running ion-5 runner where every DeepGEMM kernel was already JIT-compiled. Of that, ~12 min went to weight-load on V3.2 / GLM-5-FP8 / MiMo-V2-Flash / MiMo-V2.5 inside the fallback subprocesses.

After this commit, a successful fallback drops a marker file:

~/.cache/sglang/warmup_markers/deepgemm_fallback_<model>_tp<tp>_<argshash>_<verkey>.done

On the next run, if the marker exists for (model, tp, extra_args), the fallback is skipped entirely. The marker auto-invalidates when:

- Python / Triton / PyTorch versions change (version_key)
- FALLBACK_ARGS for that model changes (argshash)

Same mechanism warmup_server.py already uses for its server warmup markers, so MARKER_DIR (~/.cache/sglang/warmup_markers/) is shared.

Expected per-commit savings: 10-12 min on warm machines.

Caveat: if /root/.cache/deep_gemm is manually wiped without also clearing /root/.cache/sglang/warmup_markers/deepgemm_fallback_*, the marker will falsely claim warm and the in-test JIT compile cost reappears. Comment in the source documents this; cleanup scripts should clear both.
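A minimal sketch of how such a marker path could be derived and checked. The file-name layout and MARKER_DIR come from the commit message; the hashing scheme, the 8-character digests, the `/` replacement in the model name, and all function names are assumptions for illustration.

```python
import hashlib
import os
import platform

MARKER_DIR = os.path.expanduser("~/.cache/sglang/warmup_markers")


def version_key(extra_versions=()):
    """Short hash of the Python version plus caller-supplied library versions.

    Assumption: the caller passes e.g. the Triton and PyTorch version strings.
    """
    parts = [platform.python_version(), *extra_versions]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:8]


def marker_path(model, tp, extra_args, verkey, marker_dir=MARKER_DIR):
    """Build the marker file path for one (model, tp, extra_args) combination."""
    argshash = hashlib.sha256(" ".join(extra_args).encode()).hexdigest()[:8]
    safe_model = model.replace("/", "--")  # keep the name a single path component
    name = f"deepgemm_fallback_{safe_model}_tp{tp}_{argshash}_{verkey}.done"
    return os.path.join(marker_dir, name)


def should_skip_fallback(path):
    """True when a previous successful fallback left its marker behind."""
    return os.path.exists(path)


def write_marker(path):
    """Record a successful fallback so later runs can skip the weight load."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write("ok\n")
```

Because the argument hash and version key are part of the file name, changing either one simply points at a marker that does not exist yet, so no explicit invalidation step is needed. The marker says nothing about the DeepGEMM cache directory itself, which is the source of the caveat above.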

alisonshao · 2026-05-14T00:16:43Z

/rerun-stage stage-c-test-8-gpu-h200

github-actions · 2026-05-14T00:17:14Z

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

Trim historical exposition and implementation-detail comments accumulated across the PR's commits. Keep only the WHYs that aren't obvious from the code: the MARKER_DIR / deep_gemm cache co-invalidation rule, the --mm-enable-dp-encoder MiMo-V2.5 deadlock requirement, and the CRASH_MARKERS rationale. No behavior change.

alisonshao requested review from Fridge003, Kangyan-Zhou, bingxche, ispobock and merrymercy as code owners May 2, 2026 02:10

alisonshao changed the title ~~ci: surface warmup server log tail on every code path~~ ci: combine H200 8-GPU warmup steps and surface server log on every path May 2, 2026

Merge branch 'main' into alison/warmup-server-surface-log

fef2c36

Merge branch 'main' into alison/warmup-server-surface-log

d2bf385

alisonshao and others added 2 commits May 6, 2026 15:25

Merge branch 'main' into alison/warmup-server-surface-log

690a40a

Merge branch 'main' into alison/warmup-server-surface-log

5920238

alisonshao added 2 commits May 12, 2026 17:27

alisonshao and others added 2 commits May 13, 2026 17:11

Merge branch 'main' into alison/warmup-server-surface-log

4a05156

Kangyan-Zhou merged commit b71d746 into main May 14, 2026
70 of 74 checks passed

Kangyan-Zhou deleted the alison/warmup-server-surface-log branch May 14, 2026 03:09

hnyls2002 added a commit that referenced this pull request May 14, 2026

port #24253 warmup updates to caller stubs

5df32a9

hnyls2002 mentioned this pull request May 14, 2026

ci: extract cuda stage actions + runner_config mapping #25138

Merged
