ci: combine H200 8-GPU warmup steps and surface server log on every path #24253
`warmup_server.py` redirects launch_server stdout/stderr to a tempfile to dodge the 64KB pipe deadlock during CUDA graph capture. Previously that log was only dumped to stdout when the server failed to start; the success path silently unlinked it, so anything the server logged (validation messages, deprecation warnings, slow-shard-load progress, NCCL init lines) was invisible to CI consumers. Move the dump into the `finally` block so the tail (last 30 lines) is always emitted. This removes the duplicate failure-path branch and gives CI visibility into noteworthy server-side events without anyone having to SSH onto the runner to read /tmp/warmup_server_*.log.
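The always-dump-the-tail pattern can be sketched as follows. This is a minimal illustration, not the code in `warmup_server.py`: the function name `run_and_dump_tail` is hypothetical, and the header format is borrowed from the test plan in this PR.

```python
import os
import subprocess
import sys
import tempfile

TAIL_LINES = 30  # the "last 30 lines" mentioned in the commit message


def run_and_dump_tail(cmd, tail_lines=TAIL_LINES):
    """Run cmd with output sent to a temp file; print the log tail on every path."""
    fd, log_path = tempfile.mkstemp(prefix="warmup_server_", suffix=".log")
    try:
        with os.fdopen(fd, "wb") as log_file:
            # Writing to a file instead of a pipe means a chatty child can never
            # block on a full pipe buffer while the parent is not reading.
            return subprocess.run(
                cmd, stdout=log_file, stderr=subprocess.STDOUT
            ).returncode
    finally:
        # Runs on success, on non-zero exit, and when subprocess.run raises.
        with open(log_path, errors="replace") as f:
            tail = f.read().splitlines()[-tail_lines:]
        print(f"--- server log tail ({len(tail)} lines, last {tail_lines}) ---")
        for line in tail:
            print(line)
        os.unlink(log_path)


if __name__ == "__main__":
    run_and_dump_tail([sys.executable, "-c", "print('server ready')"])
```

Because the dump lives in `finally`, there is a single code path that prints and unlinks the log, so a separate failure-only branch is unnecessary.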
Removes the "Warmup Server CUDA Graphs" step (full launch_server + CUDA graph capture) from stage-c-test-8-gpu-h200 and folds its purpose into "Warmup DeepGEMM JIT Compilation" by passing every per-commit H200 8-GPU model to warmup_deep_gemm.py. CUDA graphs don't persist across processes anyway — only the DeepGEMM JIT cache does — so the heavyweight server-launch path was paying graph-capture cost for zero cross-process benefit. Why expand the model list: warmup_server.py was warming only V3-0324 and Ring-2.5-1T; tests for V3.2, GLM-5-FP8, MiniMax-M2.5, MiMo, Nemotron-3-Super, Step-3.5, and Qwen3-Next paid full JIT-compile cost during their own startup, often eating the per-test 1200s budget. Ring is removed entirely from per-commit warmup since it's nightly-only (suite=nightly-8-gpu-common). Per-commit H200 8-GPU model list is now duplicated between the workflow and sglang-ops/ci-machines-overview.md "H200 Pre-cached Models". Comment in the workflow points to the doc as the canonical keep-in-sync source. Timeout bumped 25→60 min: cold-cache run loads ~7 unique architectures through `sglang.compile_deep_gemm` fallback (~5–10 min each); warm runs finish in <5 min.
|
/rerun-stage stage-c-8-gpu-h200 |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
When fallback_compile_deep_gemm spawns sglang.compile_deep_gemm and one TP rank crashes (e.g. the IndexError currently triggered by XiaomiMiMo/MiMo-V2-Flash QKV weight loading), the surviving ranks deadlock on NCCL collectives waiting for the dead one. The parent compile_deep_gemm also hangs waiting on its children. Previously, warmup_deep_gemm.py used subprocess.run() with no timeout and would sit there until the GH Actions step cap killed the whole job. Real example: dispatch run on PR #24253 chewed 57 min before being terminated. This change wraps the fallback in a Popen + os.setsid process group with a 15-min per-model wait. On timeout, SIGTERM the whole group, escalate to SIGKILL after 10s if needed, log a warning, and continue to the next model. One bad model can no longer eat the entire warmup budget. Failure example (the 57-min hang) addressed by this change: https://github.com/sgl-project/sglang/actions/runs/25245303024/job/74028467044
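A minimal sketch of the process-group timeout described in this commit message, assuming a POSIX host. The helper name `run_with_group_timeout` is hypothetical. It uses `start_new_session=True`, which makes the child call `setsid()` much like `preexec_fn=os.setsid` would, and it simplifies the escalation by always sending a final SIGKILL sweep to the group rather than only when SIGTERM was not enough.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout_s=900, grace_s=10):
    """Return the exit code, or None if the process group had to be killed."""
    # The child becomes a session and process-group leader, so its own
    # children (e.g. TP ranks) can be signalled together via killpg.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"WARNING: {cmd[0]} exceeded {timeout_s}s; killing its process group",
              file=sys.stderr)

    os.killpg(proc.pid, signal.SIGTERM)  # pgid == pid of the group leader
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass
    try:
        os.killpg(proc.pid, signal.SIGKILL)  # sweep anything that ignored SIGTERM
    except ProcessLookupError:
        pass  # the whole group is already gone
    proc.wait()  # reap the leader so it does not linger as a zombie
    return None


if __name__ == "__main__":
    for model_cmd in ([sys.executable, "-c", "print('compiled')"],):
        if run_with_group_timeout(model_cmd, timeout_s=900) is None:
            continue  # one bad model must not consume the whole warmup budget
```

Returning `None` on timeout instead of raising lets the caller log the failure and move on to the next model.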
|
The MiMo model does not seem to be working |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
https://github.com/sgl-project/sglang/actions/runs/25266653392/job/74082200496 Is this run expected to take 40 minutes? |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
The fallback path in warmup_deep_gemm.py was launching every non-DeepSeek
model with just `--tp N`, but per-rank DeepGEMM shapes depend on
tp/dp/ep. Two of the configured models crashed deterministically at
tp=8 — MiMo-V2-Flash on a QKV narrow that overruns at TP=8 (test runs
tp=4+dp=2+dp_attn) and MiniMax-M2.5 on the FP8 block_n divisibility check
(test runs tp=8+ep=8). Each crash then sat in a 900s parent-side poll
loop, costing ~30 min per shard for no cache benefit.
Add a per-model FALLBACK_ARGS table so each fallback subprocess passes
the same dp/ep/dp-attention flags the test uses, populating cache
shapes the test will actually request on a fresh runner. Watch
subprocess output for crash markers ("Scheduler hit an exception" /
"Received sigquit from a child") and kill the process group as soon as
one is seen, instead of waiting out the timeout. Outer timeout dropped
900s -> 600s for any wedge that doesn't emit a marker.
Workflow argv changes that pair with the new dispatch:
- MiMo-V2-Flash :8 -> :4 (matches test_mimo_models.py)
- Qwen3-Next-80B :2 -> :4 (matches test_disaggregation_hybrid_attention.py)
- Drop Nemotron-3-Super-BF16 (BF16 model, doesn't use DeepGEMM kernels)
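The crash-marker watch could look roughly like the sketch below (POSIX only). The two marker strings are the ones quoted in the commit message; the function name `run_until_crash_marker`, the status strings, and the use of a `threading.Timer` for the outer timeout are assumptions made for illustration.

```python
import os
import signal
import subprocess
import sys
import threading

CRASH_MARKERS = ("Scheduler hit an exception", "Received sigquit from a child")


def run_until_crash_marker(cmd, timeout_s=600):
    """Return 'ok', 'failed', 'crash-marker' or 'timeout'."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # own process group, so killpg reaches all ranks
    )
    outcome = {"status": None}

    def kill_group(reason):
        if outcome["status"] is None:
            outcome["status"] = reason
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # group already exited

    # Outer timeout for wedges that never print a marker.
    timer = threading.Timer(timeout_s, kill_group, args=("timeout",))
    timer.start()
    try:
        for line in proc.stdout:  # ends at EOF, i.e. once the group is dead
            sys.stdout.write(line)
            if any(marker in line for marker in CRASH_MARKERS):
                kill_group("crash-marker")  # no point waiting out the timeout
                break
        proc.stdout.close()
        returncode = proc.wait()
    finally:
        timer.cancel()

    if outcome["status"] is not None:
        return outcome["status"]
    return "ok" if returncode == 0 else "failed"
```

Reading the pipe line by line also keeps it drained, so the child cannot stall on a full pipe buffer while the parent waits.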
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
MiniMax-M2.5, Step-3.5-Flash, and Qwen3-Next emit zero "Try DeepGEMM JIT Compiling" events on rank 0 in either the warmup subprocess or the actual test step (verified on run 25354167665), even though they're FP8/MoE models. Their MoE paths take moe_a2a_backend='none' which bypasses DeepGEMM (fp8.py:813-827); their linear paths fall to cutlass/triton instead of the block-FP8 DeepGEMM wrapper. Warming them up populates no cache and just wastes 5-10 min/shard on weight load and server bringup. Step-3.5-Flash specifically hit the 600s timeout on slow-disk runners. Drop them from the workflow warmup list and from the FALLBACK_ARGS dispatch in the script. The remaining list (V3-0324, V3.2, GLM-5-FP8, MiMo-V2-Flash, MiMo-V2.5) covers every per-commit model that actually uses DeepGEMM at runtime. If a future change makes any of the dropped models block-FP8 with auto runner backend, "Try DeepGEMM JIT Compiling" lines will start appearing in their test step and they should be re-added.
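The re-add criterion above amounts to counting JIT events per model in the captured logs. A small sketch follows; the event string is the one quoted in the commit message, while the helper names and the sample log lines in the usage example are made up for illustration.

```python
JIT_EVENT = "Try DeepGEMM JIT Compiling"


def count_jit_events(log_lines):
    """Number of log lines that report a DeepGEMM JIT compile attempt."""
    return sum(1 for line in log_lines if JIT_EVENT in line)


def models_worth_warming(logs_by_model):
    """Models whose logs contain at least one JIT event, sorted by name."""
    return sorted(
        model for model, lines in logs_by_model.items() if count_jit_events(lines) > 0
    )


if __name__ == "__main__":
    # Hypothetical log excerpts, not copied from a real run.
    sample = {
        "GLM-5-FP8": ["loading weights", "Try DeepGEMM JIT Compiling for shape A"],
        "Step-3.5-Flash": ["loading weights", "server ready"],
    }
    print(models_worth_warming(sample))
```

A model that reports zero events gains nothing from warmup, which is the reason given above for dropping it from the list.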
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
MiMo-V2.5 warmup hangs at the /generate readiness probe with current flags --dp 2 --enable-dp-attention. py-spy from run 25408125444 showed DP0 ranks stuck inside the vision encoder (vision.py:435, seq_lens.max().item()) while DP1 ranks were in forward_idle at the MoE gate — classic cross-DP desync. JIT events emit before the hang, so the cache is populated, but each shard burns the full 600s outer timeout on the dead probe afterwards. The actual test_mimo_models.py TestMiMoV2 launches MiMo-V2.5 with --mm-enable-dp-encoder + --attention-backend fa3 + --mm-attention-backend fa3 on top of the dp/dp-attention flags. --mm-enable-dp-encoder switches the vision encoder to data-parallel mode so both DP groups participate symmetrically, removing the desync. Add the same flags to MiMo-V2.5's FALLBACK_ARGS, and add --attention-backend fa3 to MiMo-V2-Flash so its warmup also matches its test config.
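For illustration, the resulting entries might look like the table below. The flags are the ones named in this commit message; the dictionary keys, the `tp` values passed by the caller, and the `build_fallback_cmd` helper are assumptions, and the actual table in `warmup_deep_gemm.py` may be shaped differently.

```python
import sys

# Extra launch flags per model, mirroring what the tests pass.
# "MiMo-V2.5" is a placeholder key; the full repo id is not given in this PR text.
FALLBACK_ARGS = {
    "XiaomiMiMo/MiMo-V2-Flash": [
        "--dp", "2",
        "--enable-dp-attention",
        "--attention-backend", "fa3",
    ],
    "MiMo-V2.5": [
        "--dp", "2",
        "--enable-dp-attention",
        "--mm-enable-dp-encoder",  # keeps both DP groups in the vision encoder
        "--attention-backend", "fa3",
        "--mm-attention-backend", "fa3",
    ],
}


def build_fallback_cmd(model, tp):
    """Assemble the compile_deep_gemm command line for one model (sketch)."""
    return [
        sys.executable, "-m", "sglang.compile_deep_gemm",
        "--model", model,
        "--tp", str(tp),
        *FALLBACK_ARGS.get(model, []),
    ]


if __name__ == "__main__":
    print(" ".join(build_fallback_cmd("XiaomiMiMo/MiMo-V2-Flash", 4)))
```

Keeping the flags in one table makes it easier to compare the warmup configuration against the test configuration when either changes.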
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
Today V3.2 dedups to V3-0324 in the lightweight path because they share the same architecture key (hidden_size / num_attention_heads / kv_lora_rank / etc.). The lightweight path computes attention shapes assuming TP-only sharding (num_local_heads = num_attention_heads // tp = 16). But V3.2 is launched in tests with --tp 8 --dp 8 --enable-dp-attention, which keeps all 128 attention heads per rank. So the shapes the dedup populates are correct for the no-DP variant (TestDeepseekV32TP) but wrong for the DP variant (TestDeepseekV32 / mtp / hisparse). V3.2's entry in FALLBACK_ARGS was dead code — the dedup short-circuited before fallback ran. Skip both the dedup check and the lightweight path when the model has an explicit FALLBACK_ARGS entry, so V3.2 goes through compile_deep_gemm with its real launch flags. V3-0324 (no FALLBACK_ARGS) still takes the fast lightweight path; the disk cache it populates also covers V3.2's no-DP test variant since both compute the same N/K. Only effect on the model list today: V3.2 now adds ~3 min of fallback weight load + warmup batch (matching GLM-5-FP8's fallback profile) in exchange for warming the DP-attention attention shapes that the bulk of V3.2 tests actually use.
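The dispatch change and the head-count arithmetic can be sketched as below. The function names, the `arch_key` strings, and the rule that DP attention shards attention over `tp // dp` ranks are assumptions for illustration; the 128-head and 16-head figures come from the commit message.

```python
# Only membership matters for dispatch here; flags as described for V3.2 in tests.
FALLBACK_ARGS = {"DeepSeek-V3.2": ["--dp", "8", "--enable-dp-attention"]}


def plan_warmup(models):
    """models: list of (name, arch_key). Returns a list of (name, path) decisions."""
    seen_arch = set()
    plan = []
    for name, arch_key in models:
        if name in FALLBACK_ARGS:
            # Explicit launch flags: skip both the dedup check and the lightweight path.
            plan.append((name, "fallback"))
        elif arch_key in seen_arch:
            plan.append((name, "dedup-skip"))
        else:
            seen_arch.add(arch_key)
            plan.append((name, "lightweight"))
    return plan


def local_attention_heads(num_attention_heads, tp, dp=1, dp_attention=False):
    """Heads per rank, assuming DP attention shards attention over tp // dp ranks."""
    attn_tp = tp // dp if dp_attention else tp
    return num_attention_heads // attn_tp


if __name__ == "__main__":
    print(plan_warmup([("DeepSeek-V3-0324", "dsv3"), ("DeepSeek-V3.2", "dsv3")]))
    print(local_attention_heads(128, tp=8))                           # TP-only shapes
    print(local_attention_heads(128, tp=8, dp=8, dp_attention=True))  # DP-attention shapes
```

The two head counts differ by a factor of eight, which is why shapes warmed under the TP-only assumption do not cover the DP-attention test variants.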
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
✅ Triggered |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
🚀 Triggered |
Multiprocessing-spawned scheduler_TP* and detokenizer subprocesses run under their own session/process group, so they survive killpg on launch_server. When wait_for_server returns false (readiness timeout or unclean exit), those orphans stay alive holding ~120 GB / GPU each, turning a "non-fatal" warmup failure into a downstream OOM in the very next CI step. Real example: job 75642889628 step 9 OOMed in DeepSeek-V3.2 create_weights because step 8's Ring-2.5-1T server-startup timeout left all eight scheduler_TP workers parked on the GPUs. The current stage-c-test-8-gpu-h200 path no longer calls warmup_server.py after the earlier commits in this PR, but 4-gpu-h100, 8-gpu-h200-deepep, B200 and manual debugging still do, so the kill path needs to be self-healing. After killpg on launch_server, SIGKILL any survivors matching sglang::scheduler or sglang::detokenizer by name and sleep 2s so the driver releases device memory before the next iteration loads weights.
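One way to sketch the survivor sweep on Linux is to scan `/proc` for command lines containing the process titles named above. This assumes the workers expose those titles in `/proc/<pid>/cmdline`; the helper names and the choice to sleep only when something was actually killed are illustrative, not taken from `warmup_server.py`.

```python
import os
import signal
import time

SURVIVOR_NAMES = ("sglang::scheduler", "sglang::detokenizer")


def find_survivors(proc_root="/proc"):
    """Return pids whose command line contains one of SURVIVOR_NAMES."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited between listdir and open
        if any(name in cmdline for name in SURVIVOR_NAMES):
            pids.append(int(entry))
    return sorted(pids)


def kill_survivors(pids, settle_s=2.0):
    """SIGKILL each pid, then pause so the driver can release device memory."""
    killed = 0
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
            killed += 1
        except (ProcessLookupError, PermissionError):
            pass  # already gone, or not ours to kill
    if killed:
        time.sleep(settle_s)
    return killed


if __name__ == "__main__":
    print(f"killed {kill_survivors(find_survivors())} orphaned worker(s)")
```

This sweep is meant to run after `killpg` on the launch_server group, since the orphans live in their own sessions and are not reached by that signal.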
Two stale model references left over from earlier iterations:
1. .github/workflows/pr-test.yml stage-c-test-deepep-8-gpu-h200 warmup passed `DeepSeek-V3.2-Exp:8` to warmup_deep_gemm.py. The two baseline tests that actually run in that suite (test_deepseek_v32_cp_single_node and test_deepep_large) both use `DeepSeek-V3.2`, so V3.2-Exp warmed the wrong arch and V3.2 itself hit cache misses during the test. Replace with V3.2 to match the suite's actual model.
2. The usage examples in warmup_deep_gemm.py and warmup_server.py docstrings still showed `V3.2-Exp` and `Ring-2.5-1T`. Ring-2.5-1T moved to nightly-8-gpu-common in #24725 and is no longer in any per-commit warmup list; V3.2-Exp has been superseded by V3.2. Update docstring examples so they reflect current per-commit models and don't mislead readers into thinking older models are still in scope.
The per-commit warmup model list in pr-test.yml stage-c-test-8-gpu-h200 (V3-0324, V3.2, GLM-5-FP8, MiMo-V2-Flash, MiMo-V2.5) already matches the current baseline 8-GPU H200 tests on the #24725 tag-routing branch (test_deepseek_v3_mtp, test_dsa_models_mtp, test_mimo_models). No change needed there.
MiniMax-M2.5 stays out of the warmup list because test_minimax_m25_basic.py emits zero DeepGEMM JIT events (verified in the earlier "drop non-DeepGEMM models" commit).
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
🚀 Triggered |
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
🚀 Triggered |
The fallback path in warmup_deep_gemm.py launches `sglang.compile_deep_gemm` for any model in FALLBACK_ARGS so the populated DeepGEMM cache matches the test's dp/ep/dp-attention launch flags. The compile step itself is fast when the cache is warm, but the subprocess unconditionally loads model weights first (45-170s per model, dominating the work). On a warm CI host this re-runs every commit even though the cache is already valid.
Real observation: PR #24253 per-commit warmup took 13 min on a long-running ion-5 runner where every DeepGEMM kernel was already JIT-compiled. Of that, ~12 min went to weight-load on V3.2 / GLM-5-FP8 / MiMo-V2-Flash / MiMo-V2.5 inside the fallback subprocesses.
After this commit, a successful fallback drops a marker file:
~/.cache/sglang/warmup_markers/deepgemm_fallback_<model>_tp<tp>_<argshash>_<verkey>.done
On the next run, if the marker exists for (model, tp, extra_args), the fallback is skipped entirely. The marker auto-invalidates when:
- Python / Triton / PyTorch versions change (version_key)
- FALLBACK_ARGS for that model changes (argshash)
Same mechanism warmup_server.py already uses for its server warmup markers, so MARKER_DIR (~/.cache/sglang/warmup_markers/) is shared.
Expected per-commit savings: 10-12 min on warm machines.
Caveat: if /root/.cache/deep_gemm is manually wiped without also clearing /root/.cache/sglang/warmup_markers/deepgemm_fallback_*, the marker will falsely claim warm and the in-test JIT compile cost reappears. Comment in the source documents this; cleanup scripts should clear both.
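A sketch of how such a marker could be derived and checked. The file-name layout follows the pattern in the commit message; the hash function, the hash length, the way `version_key` is supplied, and the helper names are assumptions.

```python
import hashlib
import pathlib

MARKER_DIR = pathlib.Path.home() / ".cache" / "sglang" / "warmup_markers"


def marker_path(model, tp, extra_args, version_key, marker_dir=MARKER_DIR):
    """Path of the .done marker for one (model, tp, extra_args, versions) tuple."""
    # Any change to the launch flags changes the hash and so invalidates the marker.
    args_hash = hashlib.sha256(" ".join(extra_args).encode()).hexdigest()[:12]
    safe_model = model.replace("/", "_")  # repo ids contain a slash
    name = f"deepgemm_fallback_{safe_model}_tp{tp}_{args_hash}_{version_key}.done"
    return pathlib.Path(marker_dir) / name


def fallback_needed(model, tp, extra_args, version_key, marker_dir=MARKER_DIR):
    return not marker_path(model, tp, extra_args, version_key, marker_dir).exists()


def mark_done(model, tp, extra_args, version_key, marker_dir=MARKER_DIR):
    path = marker_path(model, tp, extra_args, version_key, marker_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
    return path


if __name__ == "__main__":
    args = ["--dp", "8", "--enable-dp-attention"]
    # "py3.12_torch2.x" is a made-up version key for the demo.
    print(marker_path("deepseek-ai/DeepSeek-V3.2", 8, args, "py3.12_torch2.x"))
```

The marker only records that a fallback succeeded once. It knows nothing about the state of the DeepGEMM cache directory, which is why the caveat above asks cleanup scripts to clear both.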
|
/rerun-stage stage-c-test-8-gpu-h200 |
|
🚀 Triggered |
Trim historical exposition and implementation-detail comments accumulated across the PR's commits. Keep only the WHYs that aren't obvious from the code: the MARKER_DIR / deep_gemm cache co-invalidation rule, the --mm-enable-dp-encoder MiMo-V2.5 deadlock requirement, and the CRASH_MARKERS rationale. No behavior change.
Summary
Two related changes to the per-commit H200 8-GPU CI warmup pipeline:
1. Remove the heavyweight `Warmup Server CUDA Graphs` step; expand the lightweight DeepGEMM step to cover all per-commit H200 models

`warmup_server.py` was launching the full sglang server for `V3-0324:8` and `Ring-2.5-1T:8` to do a full model load + DeepGEMM JIT pre-compile + CUDA graph capture. The graph-capture portion is wasted work: CUDA graphs don't persist across processes, so subsequent test-side server launches re-capture from scratch. Only the DeepGEMM JIT cache (`/root/.cache/deep_gemm`, persisted via bind mount on the host) carries any benefit forward.

Even worse, the warmup only covered V3-0324 and Ring, leaving every other per-commit H200 model (V3.2, GLM-5-FP8, MiniMax-M2.5, MiMo, Nemotron-3-Super, Step-3.5, Qwen3-Next) to pay its own JIT-compile cost during the test step.

Failure example showing this exact bottleneck: https://github.com/sgl-project/sglang/actions/runs/25205614862/job/73974157058

In that run, step 8 "Warmup Server CUDA Graphs" finished in 1 second (fast-path skip via marker), step 7 "Warmup DeepGEMM JIT Compilation" in 6 seconds (7 canned shapes, all cache hits). But step 9 "Run test" then chewed ~14m 52s of in-test DeepGEMM JIT compile time out of its 20m 15s budget, most of it on four expensive shapes (2:38, 2:38, 2:53, 2:53) the warmup steps never primed. The test file (`test_dsa_models_mtp.py`) hit the per-file 1200s timeout with a 25th JIT session still running. Same root cause hit `test_minimax_m25_basic.py` earlier.

This PR collapses both warmup steps into one expanded `warmup_deep_gemm.py` invocation that covers every per-commit H200 8-GPU model. CUDA graph capture is no longer attempted; only the JIT cache (which actually persists) is populated.

Ring-2.5-1T is dropped from the warmup since it only runs in `nightly-8-gpu-common`, not per-commit.

Per-commit model list (the canonical source is `sglang-ops/ci-machines-overview.md` "H200 Pre-cached Models"; a comment in the workflow points to it):

Timeout bumped from 25 → 60 min. Cold-cache cumulative across ~7 unique architectures is up to ~45 min (each goes through the `sglang.compile_deep_gemm` fallback, which loads weights and JITs all shapes). Warm-cache runs finish in <5 min.

2. Surface the warmup-server log tail on every code path (file remains used by other workflows)

`scripts/ci/cuda/warmup_server.py` is no longer called by the H200 8-GPU PR test, but is still used by other CUDA workflows (4-gpu-h100, 8-gpu-h200-deepep, B200) and by manual debugging. The script redirects launch_server stdout/stderr to a tempfile to dodge the 64KB pipe deadlock during CUDA graph capture, and previously only dumped that log to its own stdout on the failure path; the success path silently unlinked it.

That's why the recent `Validation failed for inclusionAI/Ring-2.5-1T: Missing shards in model-of-00160.safetensors: [160]` (separately fixed in #24237) only appeared in `/tmp/warmup_server_*.log` on the runner and required SSHing in to find. This change moves the tail dump into the `finally` block so it runs uniformly on success/failure/exception. Failure-path duplicate is removed.

Test plan
- /root/.cache/deep_gemm): the merged warmup step takes <60 min and the per-test JIT compile sessions seen in `test_dsa_models_mtp.py` / `test_minimax_m25_basic.py` are gone.
- `warmup_server.py` shows a `--- server log tail (N lines, last 30) ---` block in the GH Actions log regardless of success/failure.
- `nightly-8-gpu-common` (Ring-2.5-1T's nightly path is unaffected; warmup wasn't relied on there).