ci: combine H200 8-GPU warmup steps and surface server log on every path #24253

Merged
Kangyan-Zhou merged 16 commits into main from alison/warmup-server-surface-log on May 14, 2026

Conversation

@alisonshao
Collaborator

@alisonshao alisonshao commented May 2, 2026

Summary

Two related changes to the per-commit H200 8-GPU CI warmup pipeline:

1. Remove the heavyweight Warmup Server CUDA Graphs step; expand the lightweight DeepGEMM step to cover all per-commit H200 models

warmup_server.py was launching the full sglang server for V3-0324:8 and Ring-2.5-1T:8 to do a full model load + DeepGEMM JIT pre-compile + CUDA graph capture. The graph-capture portion is wasted work — CUDA graphs don't persist across processes, so subsequent test-side server launches re-capture from scratch. Only the DeepGEMM JIT cache (/root/.cache/deep_gemm, persisted via bind mount on the host) carries any benefit forward.

Even worse, the warmup only covered V3-0324 and Ring, leaving every other per-commit H200 model (V3.2, GLM-5-FP8, MiniMax-M2.5, MiMo, Nemotron-3-Super, Step-3.5, Qwen3-Next) to pay its own JIT-compile cost during the test step.

Failure example showing this exact bottleneck: https://github.com/sgl-project/sglang/actions/runs/25205614862/job/73974157058

In that run, step 8 "Warmup Server CUDA Graphs" finished in 1 second (fast-path skip via marker), step 7 "Warmup DeepGEMM JIT Compilation" in 6 seconds (7 canned shapes, all cache hits). But step 9 "Run test" then chewed ~14m 52s of in-test DeepGEMM JIT compile time out of its 20m 15s budget — most of it on four expensive shapes (2:38, 2:38, 2:53, 2:53) the warmup steps never primed. The test file (test_dsa_models_mtp.py) hit the per-file 1200s timeout with a 25th JIT session still running. Same root cause hit test_minimax_m25_basic.py earlier.

This PR collapses both warmup steps into one expanded warmup_deep_gemm.py invocation that covers every per-commit H200 8-GPU model. CUDA graph capture is no longer attempted; only the JIT cache (which actually persists) is populated.

Ring-2.5-1T is dropped from the warmup since it only runs in nightly-8-gpu-common, not per-commit.

Per-commit model list (the canonical source is sglang-ops/ci-machines-overview.md "H200 Pre-cached Models" — comment in the workflow points to it):

deepseek-ai/DeepSeek-V3-0324:8
deepseek-ai/DeepSeek-V3.2:8
zai-org/GLM-5-FP8:8
XiaomiMiMo/MiMo-V2-Flash:8
XiaomiMiMo/MiMo-V2.5:8
MiniMaxAI/MiniMax-M2.5:8
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16:8
stepfun-ai/Step-3.5-Flash:8
Qwen/Qwen3-Next-80B-A3B-Instruct:2

Timeout bumped from 25 to 60 min. A cold-cache run across ~7 unique architectures can take up to ~45 min in total (each architecture goes through the sglang.compile_deep_gemm fallback, which loads weights and JITs all shapes). Warm-cache runs finish in <5 min.

2. Surface the warmup-server log tail on every code path (file remains used by other workflows)

scripts/ci/cuda/warmup_server.py is no longer called by the H200 8-GPU PR test, but is still used by other CUDA workflows (4-gpu-h100, 8-gpu-h200-deepep, B200) and by manual debugging. The script redirects launch_server stdout/stderr to a tempfile to dodge the 64KB pipe deadlock during CUDA graph capture, and previously only dumped that log to its own stdout on the failure path — the success path silently unlinked it.

That's why the recent error "Validation failed for inclusionAI/Ring-2.5-1T: Missing shards in model-of-00160.safetensors: [160]" (separately fixed in #24237) only appeared in /tmp/warmup_server_*.log on the runner and required SSHing in to find. This change moves the tail dump into the finally block so it runs uniformly on success, failure, and exception. The failure-path duplicate is removed.
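A minimal sketch of the "always dump the tail" pattern described above. It is not the actual warmup_server.py code: the function names are hypothetical, and only the tempfile redirect, the finally-block dump, and the header format come from this description.

```python
import os
import subprocess
import sys
import tempfile

TAIL_LINES = 30  # the "last 30" mentioned in the test plan


def tail_lines(lines, n=TAIL_LINES):
    """Return the last n entries of a list of log lines."""
    return lines[-n:]


def run_with_log_tail(cmd):
    """Run cmd with output redirected to a temp file; always print the tail.

    Hypothetical helper: writing to a file instead of a pipe means the
    child can never block on a full pipe buffer.
    """
    fd, log_path = tempfile.mkstemp(prefix="warmup_server_", suffix=".log")
    try:
        with os.fdopen(fd, "w") as log_file:
            result = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
        return result.returncode
    finally:
        # Runs on success, failure, and exception alike.
        with open(log_path, errors="replace") as f:
            lines = f.read().splitlines()
        print(f"--- server log tail ({len(lines)} lines, last {TAIL_LINES}) ---")
        print("\n".join(tail_lines(lines)))
        os.unlink(log_path)


if __name__ == "__main__":
    run_with_log_tail([sys.executable, "-c", "print('server ready')"])
```

Because the dump lives in the finally block, there is a single code path to maintain, and the temp file is removed only after its tail has been printed.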

Test plan

  • On a freshly-recreated H200 runner (cold /root/.cache/deep_gemm): the merged warmup step takes <60 min and the per-test JIT compile sessions seen in test_dsa_models_mtp.py / test_minimax_m25_basic.py are gone.
  • On a warm runner: the merged step completes in single-digit minutes (DeepSeek-family hits the lightweight no-load path; everything else hits warm fallback cache).
  • The next 4-gpu-h100 / 8-gpu-h200-deepep / B200 PR test job that does still run warmup_server.py shows a --- server log tail (N lines, last 30) --- block in the GH Actions log regardless of success/failure.
  • No regression for nightly-8-gpu-common (Ring-2.5-1T's nightly path is unaffected; warmup wasn't relied on there).

`warmup_server.py` redirects launch_server stdout/stderr to a tempfile
to dodge the 64KB pipe deadlock during CUDA graph capture. Previously
that log was only dumped to stdout when the server failed to start;
the success path silently unlinked it. Anything the server logged
(validation messages, deprecation warnings, slow-shard-load progress,
NCCL init lines) was invisible to CI consumers.

Move the dump into the finally block so the tail (last 30 lines) is
always emitted. This removes the duplicate dump in the failure-path
branch and gives CI visibility into noteworthy server-side events
without anyone having to SSH onto the runner to read
/tmp/warmup_server_*.log.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Removes the "Warmup Server CUDA Graphs" step (full launch_server +
CUDA graph capture) from stage-c-test-8-gpu-h200 and folds its purpose
into "Warmup DeepGEMM JIT Compilation" by passing every per-commit
H200 8-GPU model to warmup_deep_gemm.py. CUDA graphs don't persist
across processes anyway — only the DeepGEMM JIT cache does — so the
heavyweight server-launch path was paying graph-capture cost for zero
cross-process benefit.

Why expand the model list: warmup_server.py was warming only V3-0324
and Ring-2.5-1T; tests for V3.2, GLM-5-FP8, MiniMax-M2.5, MiMo,
Nemotron-3-Super, Step-3.5, and Qwen3-Next paid full JIT-compile cost
during their own startup, often eating the per-test 1200s budget.
Ring is removed entirely from per-commit warmup since it's nightly-only
(suite=nightly-8-gpu-common).

Per-commit H200 8-GPU model list is now duplicated between the
workflow and sglang-ops/ci-machines-overview.md "H200 Pre-cached
Models". Comment in the workflow points to the doc as the canonical
keep-in-sync source.

Timeout bumped 25→60 min: cold-cache run loads ~7 unique architectures
through `sglang.compile_deep_gemm` fallback (~5–10 min each); warm
runs finish in <5 min.
@alisonshao alisonshao changed the title from "ci: surface warmup server log tail on every code path" to "ci: combine H200 8-GPU warmup steps and surface server log on every path" May 2, 2026
@alisonshao
Collaborator Author

/rerun-stage stage-c-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 2, 2026

❌ Stage stage-c-8-gpu-h200 doesn't support isolated runs yet.

NVIDIA stages:

  • stage-a-test-1-gpu-small
  • stage-a-test-cpu
  • stage-b-test-1-gpu-small
  • stage-b-test-1-gpu-large
  • stage-b-test-2-gpu-large
  • stage-b-test-4-gpu-b200
  • stage-c-test-4-gpu-h100
  • stage-c-test-8-gpu-h200
  • stage-c-test-8-gpu-h20
  • stage-c-test-4-gpu-b200
  • stage-c-test-4-gpu-gb200
  • stage-c-test-deepep-4-gpu-h100
  • stage-c-test-deepep-8-gpu-h200
  • multimodal-gen-test-1-gpu
  • multimodal-gen-test-2-gpu
  • multimodal-gen-component-accuracy
  • multimodal-gen-component-accuracy-1-gpu
  • multimodal-gen-component-accuracy-2-gpu
  • multimodal-gen-test-1-b200

AMD stages:

  • sgl-kernel-unit-test-amd
  • sgl-kernel-unit-test-2-gpu-amd
  • stage-a-test-1-gpu-small-amd
  • stage-b-test-1-gpu-small-amd
  • stage-b-test-1-gpu-small-amd-nondeterministic
  • stage-b-test-1-gpu-small-amd-mi35x
  • stage-b-test-1-gpu-large-amd
  • stage-b-test-2-gpu-large-amd
  • multimodal-gen-test-1-gpu-amd
  • multimodal-gen-test-2-gpu-amd
  • stage-c-test-large-8-gpu-amd
  • stage-c-test-large-8-gpu-amd-mi35x

Other stages will be added soon. For now, use /rerun-failed-ci for those stages.

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

alisonshao added a commit that referenced this pull request May 2, 2026
When fallback_compile_deep_gemm spawns sglang.compile_deep_gemm and one
TP rank crashes (e.g. the IndexError currently triggered by
XiaomiMiMo/MiMo-V2-Flash QKV weight loading), the surviving ranks
deadlock on NCCL collectives waiting for the dead one. The parent
compile_deep_gemm also hangs waiting on its children. Previously,
warmup_deep_gemm.py used subprocess.run() with no timeout and would
sit there until the GH Actions step cap killed the whole job. Real
example: dispatch run on PR #24253 chewed 57 min before being terminated.

This change wraps the fallback in a Popen + os.setsid process group
with a 15-min per-model wait. On timeout, SIGTERM the whole group,
escalate to SIGKILL after 10s if needed, log a warning, and continue
to the next model. One bad model can no longer eat the entire warmup
budget.

Failure example (the 57-min hang) addressed by this change:
https://github.com/sgl-project/sglang/actions/runs/25245303024/job/74028467044
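The process-group timeout described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code in warmup_deep_gemm.py: the function name and return convention are hypothetical, and `start_new_session=True` is used here as the standard-library equivalent of calling os.setsid in the child. POSIX only.

```python
import os
import signal
import subprocess
import sys


def run_with_group_timeout(cmd, timeout_s=900.0, grace_s=10.0):
    """Run cmd in its own session; kill the whole process group on timeout.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True runs setsid() in the child, so the child leads
    # a fresh process group whose id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass
    # SIGTERM first, then escalate to SIGKILL if the leader is still alive.
    for sig, wait_s in ((signal.SIGTERM, grace_s), (signal.SIGKILL, None)):
        try:
            os.killpg(proc.pid, sig)
        except ProcessLookupError:
            break  # the group is already gone
        try:
            proc.wait(timeout=wait_s)
            break
        except subprocess.TimeoutExpired:
            continue
    print(f"WARNING: {cmd[0]} timed out after {timeout_s}s; skipping", file=sys.stderr)
    return None


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_group_timeout(sleeper, timeout_s=1.0, grace_s=5.0))
```

Returning None instead of raising lets the caller log the failure and move on to the next model. Note that killpg only reaches processes that stay in the group; descendants that start their own session are not covered by this sketch.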
@Kangyan-Zhou
Collaborator

The MiMo model doesn't seem to be working.

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 3, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 4, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

The fallback path in warmup_deep_gemm.py was launching every non-DeepSeek
model with just `--tp N`, but per-rank DeepGEMM shapes depend on
tp/dp/ep. Two of the configured models crashed deterministically at
tp=8 — MiMo-V2-Flash on a QKV narrow that overruns at TP=8 (test runs
tp=4+dp=2+dp_attn) and MiniMax-M2.5 on the FP8 block_n divisibility check
(test runs tp=8+ep=8). Each crash then sat in a 900s parent-side poll
loop, costing ~30 min per shard for no cache benefit.

Add a per-model FALLBACK_ARGS table so each fallback subprocess passes
the same dp/ep/dp-attention flags the test uses, populating cache
shapes the test will actually request on a fresh runner. Watch
subprocess output for crash markers ("Scheduler hit an exception" /
"Received sigquit from a child") and kill the process group as soon as
one is seen, instead of waiting out the timeout. Outer timeout dropped
900s -> 600s for any wedge that doesn't emit a marker.

Workflow argv changes that pair with the new dispatch:
- MiMo-V2-Flash :8 -> :4 (matches test_mimo_models.py)
- Qwen3-Next-80B :2 -> :4 (matches test_disaggregation_hybrid_attention.py)
- Drop Nemotron-3-Super-BF16 (BF16 model, doesn't use DeepGEMM kernels)
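As a rough illustration of the dispatch described above, the sketch below pairs a per-model argument table with a crash-marker check. The two marker strings, the `model:tp` spec format, and the `--tp`, `--dp`, and `--enable-dp-attention` flags appear in this PR; the table contents, the helper names, and the `--model-path` flag on sglang.compile_deep_gemm are assumptions made for the example.

```python
import sys

# Marker strings quoted from the commit message above.
CRASH_MARKERS = ("Scheduler hit an exception", "Received sigquit from a child")

# Illustrative entry only; the real table lives in warmup_deep_gemm.py.
FALLBACK_ARGS = {
    "XiaomiMiMo/MiMo-V2-Flash": ["--dp", "2", "--enable-dp-attention"],
}


def parse_model_spec(spec):
    """Split a workflow argument such as 'org/model:4' into (model, tp)."""
    model, _, tp = spec.rpartition(":")
    return model, int(tp)


def build_fallback_cmd(model, tp):
    """Build the fallback command line (flag names are assumptions)."""
    cmd = [sys.executable, "-m", "sglang.compile_deep_gemm",
           "--model-path", model, "--tp", str(tp)]
    return cmd + FALLBACK_ARGS.get(model, [])


def find_crash_marker(line):
    """Return the first crash marker contained in a log line, else None."""
    for marker in CRASH_MARKERS:
        if marker in line:
            return marker
    return None


if __name__ == "__main__":
    model, tp = parse_model_spec("XiaomiMiMo/MiMo-V2-Flash:4")
    print(" ".join(build_fallback_cmd(model, tp)))
```

A caller would read the subprocess output line by line, pass each line to `find_crash_marker`, and kill the process group on the first hit rather than waiting for the outer timeout.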
@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 5, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

MiniMax-M2.5, Step-3.5-Flash, and Qwen3-Next emit zero "Try DeepGEMM
JIT Compiling" events on rank 0 in either the warmup subprocess or the
actual test step (verified on run 25354167665), even though they're
FP8/MoE models. Their MoE paths take moe_a2a_backend='none' which
bypasses DeepGEMM (fp8.py:813-827); their linear paths fall to
cutlass/triton instead of the block-FP8 DeepGEMM wrapper. Warming them
up populates no cache and just wastes 5-10 min/shard on weight load and
server bringup. Step-3.5-Flash specifically hit the 600s timeout on
slow-disk runners.

Drop them from the workflow warmup list and from the FALLBACK_ARGS
dispatch in the script. The remaining list (V3-0324, V3.2, GLM-5-FP8,
MiMo-V2-Flash, MiMo-V2.5) covers every per-commit model that actually
uses DeepGEMM at runtime.

If a future change makes any of the dropped models block-FP8 with auto
runner backend, "Try DeepGEMM JIT Compiling" lines will start appearing
in their test step and they should be re-added.
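The re-add criterion above amounts to counting one log string. A small sketch, assuming the test-step log is available as a list of lines (the helper names and the sample log lines are made up; only the event string comes from this PR):

```python
JIT_EVENT = "Try DeepGEMM JIT Compiling"  # log string quoted in the commit message


def count_jit_events(log_lines):
    """Count log lines that report a DeepGEMM JIT compile attempt."""
    return sum(1 for line in log_lines if JIT_EVENT in line)


def benefits_from_warmup(log_lines):
    """A model is worth warming only if its run emits at least one JIT event."""
    return count_jit_events(log_lines) > 0


if __name__ == "__main__":
    sample = [
        "Loading weights...",
        "Try DeepGEMM JIT Compiling for shape A",
        "Server ready",
    ]
    print(count_jit_events(sample), benefits_from_warmup(sample))
```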
@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 5, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

MiMo-V2.5 warmup hangs at the /generate readiness probe with current
flags --dp 2 --enable-dp-attention. py-spy from run 25408125444 showed
DP0 ranks stuck inside the vision encoder (vision.py:435,
seq_lens.max().item()) while DP1 ranks were in forward_idle at the
MoE gate — classic cross-DP desync. JIT events emit before the hang,
so the cache is populated, but each shard burns the full 600s outer
timeout on the dead probe afterwards.

The actual test_mimo_models.py TestMiMoV2 launches MiMo-V2.5 with
--mm-enable-dp-encoder + --attention-backend fa3 + --mm-attention-backend
fa3 on top of the dp/dp-attention flags. --mm-enable-dp-encoder switches
the vision encoder to data-parallel mode so both DP groups participate
symmetrically, removing the desync. Add the same flags to MiMo-V2.5's
FALLBACK_ARGS, and add --attention-backend fa3 to MiMo-V2-Flash so its
warmup also matches its test config.
@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

alisonshao and others added 2 commits May 6, 2026 15:25
Today V3.2 dedups to V3-0324 in the lightweight path because they share
the same architecture key (hidden_size / num_attention_heads / kv_lora_rank
/ etc.). The lightweight path computes attention shapes assuming TP-only
sharding (num_local_heads = num_attention_heads // tp = 16). But V3.2 is
launched in tests with --tp 8 --dp 8 --enable-dp-attention, which keeps
all 128 attention heads per rank. So the shapes the dedup populates are
correct for the no-DP variant (TestDeepseekV32TP) but wrong for the DP
variant (TestDeepseekV32 / mtp / hisparse). V3.2's entry in FALLBACK_ARGS
was dead code — the dedup short-circuited before fallback ran.

Skip both the dedup check and the lightweight path when the model has an
explicit FALLBACK_ARGS entry, so V3.2 goes through compile_deep_gemm with
its real launch flags. V3-0324 (no FALLBACK_ARGS) still takes the fast
lightweight path; the disk cache it populates also covers V3.2's no-DP
test variant since both compute the same N/K. Only effect on the model
list today: V3.2 now adds ~3 min of fallback weight load + warmup batch
(matching GLM-5-FP8's fallback profile) in exchange for warming the
DP-attention shapes that the bulk of V3.2 tests actually use.
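The head-count mismatch and the new routing rule can be illustrated with a short sketch. Both functions are hypothetical. The first assumes that with DP attention the attention layers are sharded over tp // dp ranks, which reproduces the 16 and 128 figures quoted above; the second is a simplified view of the path selection that considers only models the lightweight path supports.

```python
def attention_heads_per_rank(num_heads, tp, dp=1, dp_attention=False):
    """Heads held by one rank (assumes attention TP size is tp // dp under DP attention)."""
    attn_tp = tp // dp if dp_attention else tp
    return num_heads // attn_tp


def choose_path(model, arch_key, fallback_args, seen_arch_keys):
    """Pick a warmup path; an explicit FALLBACK_ARGS entry wins over dedup."""
    if model in fallback_args:
        return "fallback"
    if arch_key in seen_arch_keys:
        return "skip"  # architecture already warmed by an earlier model
    return "lightweight"


if __name__ == "__main__":
    print(attention_heads_per_rank(128, tp=8))                             # TP-only
    print(attention_heads_per_rank(128, tp=8, dp=8, dp_attention=True))    # DP attention
```

Checking FALLBACK_ARGS before the dedup test is what keeps a model with real launch flags from being short-circuited by an architecture twin.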
@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

Multiprocessing-spawned scheduler_TP* and detokenizer subprocesses run
under their own session/process group, so they survive killpg on
launch_server. When wait_for_server returns false (readiness timeout or
unclean exit), those orphans stay alive holding ~120 GB / GPU each,
turning a "non-fatal" warmup failure into a downstream OOM in the very
next CI step.

Real example: job 75642889628 step 9 OOMed in DeepSeek-V3.2
create_weights because step 8's Ring-2.5-1T server-startup timeout left
all eight scheduler_TP workers parked on the GPUs. The current
stage-c-test-8-gpu-h200 path no longer calls warmup_server.py after the
earlier commits in this PR, but 4-gpu-h100, 8-gpu-h200-deepep, B200 and
manual debugging still do, so the kill path needs to be self-healing.

After killpg on launch_server, SIGKILL any survivors matching
sglang::scheduler or sglang::detokenizer by name and sleep 2s so the
driver releases device memory before the next iteration loads weights.
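One way to sketch the by-name cleanup on Linux is to scan /proc for matching command lines. This is an assumption-laden illustration rather than the code in warmup_server.py: the helper names are hypothetical, and it assumes the worker process titles (sglang::scheduler..., sglang::detokenizer...) are visible in /proc/<pid>/cmdline.

```python
import os
import signal
import time

# Name fragments quoted from the commit message above.
ORPHAN_PATTERNS = ("sglang::scheduler", "sglang::detokenizer")


def matches_orphan(cmdline):
    """True if a process command line looks like a leftover sglang worker."""
    return any(pattern in cmdline for pattern in ORPHAN_PATTERNS)


def kill_orphans(settle_s=2.0):
    """SIGKILL matching processes, then pause so GPU memory can be released."""
    killed = []
    if not os.path.isdir("/proc"):
        return killed  # not Linux; nothing to scan
    for entry in os.listdir("/proc"):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue  # skip non-process entries and this script itself
        try:
            with open(f"/proc/{entry}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if matches_orphan(cmdline):
                os.kill(int(entry), signal.SIGKILL)
                killed.append(int(entry))
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited or is not ours to inspect
    if killed:
        time.sleep(settle_s)
    return killed
```

Matching by name rather than by process group is the point here: the workers run in their own session, so a group-wide signal sent to launch_server never reaches them.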

Two stale model references were left over from earlier iterations:

1. .github/workflows/pr-test.yml stage-c-test-deepep-8-gpu-h200 warmup
   passed `DeepSeek-V3.2-Exp:8` to warmup_deep_gemm.py. The two baseline
   tests that actually run in that suite (test_deepseek_v32_cp_single_node
   and test_deepep_large) both use `DeepSeek-V3.2`, so V3.2-Exp warmed
   the wrong arch and V3.2 itself hit cache misses during the test.
   Replace with V3.2 to match the suite's actual model.

2. The usage examples in warmup_deep_gemm.py and warmup_server.py
   docstrings still showed `V3.2-Exp` and `Ring-2.5-1T`. Ring-2.5-1T
   moved to nightly-8-gpu-common in #24725 and is no longer in any
   per-commit warmup list; V3.2-Exp has been superseded by V3.2.
   Update docstring examples so they reflect current per-commit models
   and don't mislead readers into thinking older models are still in
   scope.

The per-commit warmup model list in pr-test.yml stage-c-test-8-gpu-h200
(V3-0324, V3.2, GLM-5-FP8, MiMo-V2-Flash, MiMo-V2.5) already matches
the current baseline 8-GPU H200 tests on the #24725 tag-routing branch
(test_deepseek_v3_mtp, test_dsa_models_mtp, test_mimo_models). No
change needed there. MiniMax-M2.5 stays out of the warmup list because
test_minimax_m25_basic.py emits zero DeepGEMM JIT events (verified in
the earlier "drop non-DeepGEMM models" commit).
@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

alisonshao and others added 2 commits May 13, 2026 17:11
The fallback path in warmup_deep_gemm.py launches `sglang.compile_deep_gemm`
for any model in FALLBACK_ARGS so the populated DeepGEMM cache matches the
test's dp/ep/dp-attention launch flags. The compile step itself is fast
when the cache is warm, but the subprocess unconditionally loads model
weights first (45-170s per model, dominating the work). On a warm CI host
this re-runs every commit even though the cache is already valid.

Real observation: PR #24253 per-commit warmup took 13 min on a long-running
ion-5 runner where every DeepGEMM kernel was already JIT-compiled. Of that,
~12 min went to weight-load on V3.2 / GLM-5-FP8 / MiMo-V2-Flash / MiMo-V2.5
inside the fallback subprocesses.

After this commit, a successful fallback drops a marker file:
  ~/.cache/sglang/warmup_markers/deepgemm_fallback_<model>_tp<tp>_<argshash>_<verkey>.done
On the next run, if the marker exists for (model, tp, extra_args), the
fallback is skipped entirely. The marker auto-invalidates when:
- Python / Triton / PyTorch versions change (version_key)
- FALLBACK_ARGS for that model changes (argshash)

Same mechanism warmup_server.py already uses for its server warmup markers,
so MARKER_DIR (~/.cache/sglang/warmup_markers/) is shared.

Expected per-commit savings: 10-12 min on warm machines.

Caveat: if /root/.cache/deep_gemm is manually wiped without also clearing
/root/.cache/sglang/warmup_markers/deepgemm_fallback_*, the marker will
falsely claim warm and the in-test JIT compile cost reappears. Comment in
the source documents this; cleanup scripts should clear both.
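The marker mechanism can be sketched as follows. The directory and file-name pattern follow the commit message; the hashing scheme, the helper names, and passing the version key in as a plain string are assumptions for illustration (the real key is derived from the Python / Triton / PyTorch versions).

```python
import hashlib
from pathlib import Path

MARKER_DIR = Path.home() / ".cache" / "sglang" / "warmup_markers"


def marker_path(model, tp, extra_args, version_key, marker_dir=MARKER_DIR):
    """Build the marker file path for one (model, tp, extra_args) combination."""
    safe_model = model.replace("/", "_")
    # Any change to the fallback args yields a different hash, hence a new marker.
    args_hash = hashlib.sha256(" ".join(extra_args).encode()).hexdigest()[:8]
    name = f"deepgemm_fallback_{safe_model}_tp{tp}_{args_hash}_{version_key}.done"
    return Path(marker_dir) / name


def is_warm(path):
    """True if a previous successful fallback left its marker behind."""
    return path.exists()


def mark_warm(path):
    """Record a successful fallback run."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.touch()
```

A caller would check `is_warm(...)` before spawning the fallback and call `mark_warm(...)` only after it exits successfully. The marker cannot detect a wiped /root/.cache/deep_gemm, which is why both directories need to be cleared together.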
@alisonshao
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@github-actions
Contributor

🚀 Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies). View workflow run

Trim historical exposition and implementation-detail comments accumulated
across the PR's commits. Keep only the WHYs that aren't obvious from the
code: the MARKER_DIR / deep_gemm cache co-invalidation rule, the
--mm-enable-dp-encoder MiMo-V2.5 deadlock requirement, and the CRASH_MARKERS
rationale. No behavior change.
@Kangyan-Zhou Kangyan-Zhou merged commit b71d746 into main May 14, 2026
70 of 74 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the alison/warmup-server-surface-log branch May 14, 2026 03:09
hnyls2002 added a commit that referenced this pull request May 14, 2026