[CI] Lower mem-fraction-static for GLM-5.1 FP8 8-GPU test to 0.85 #25453

Merged: Kangyan-Zhou merged 1 commit into sgl-project:main from Jiminator:fix/glm5-fp8-tp8dp8-mem-fraction on May 16, 2026
@Jiminator (Collaborator) commented May 16, 2026

Motivation

test/registered/8-gpu-models/test_glm_51_fp8.py::test_glm51_fp8 has been failing on every B200 nightly since the test was added in #22399 (2026-04-09, ~37 days, 0 passes on B200). The TP8+DP8 variant OOMs at scheduler init; TP8 and TP8+DP8+MTP are fine. The same test passes on H200 in the same scheduled runs.

Example failure: run 25835354140 / job 75909128349.

--mem-fraction-static=0.9 on B200 (178 GiB/GPU) leaves only ~3 GiB free after the DP-attention workspaces and cuda-graph capture intermediates land; the next 6.38 GiB cuda-graph alloc OOMs. This is not a code regression: the B200 runner has ~5 GiB higher baseline residency than H200 for this model.
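As a rough check on these numbers, the sketch below estimates how much memory lowering the flag gives back on a 178 GiB GPU. It assumes the static pool scales linearly with total device memory, which is a simplification rather than a description of how SGLang sizes the pool, and the function names are illustrative only. The ~3.3 GiB free figure is the one measured in the local reproduction.

```python
TOTAL_GIB = 178.0        # B200 memory per GPU, from this PR
FREE_AT_OOM_GIB = 3.3    # free memory when the OOM fired (local repro)
REQUEST_GIB = 6.38       # size of the failing cuda-graph allocation


def static_budget_gib(total_gib: float, fraction: float) -> float:
    """Assumed static pool size: a linear share of total GPU memory."""
    return total_gib * fraction


def shortfall_gib(request_gib: float, free_gib: float) -> float:
    """Extra memory needed for the request to fit (0 if it already fits)."""
    return max(0.0, request_gib - free_gib)


# Memory handed back to non-static use when the fraction drops 0.9 -> 0.85.
released = static_budget_gib(TOTAL_GIB, 0.9) - static_budget_gib(TOTAL_GIB, 0.85)
# Memory that was missing when the 6.38 GiB request failed at 0.9.
needed = shortfall_gib(REQUEST_GIB, FREE_AT_OOM_GIB)

print(f"released by 0.9 -> 0.85: {released:.2f} GiB")
print(f"shortfall at 0.9:        {needed:.2f} GiB")
print("fits at 0.85 (linear assumption):", released >= needed)
```

Under this assumption the change releases about 8.9 GiB against a shortfall of about 3.08 GiB, which is consistent with the TP8+DP8 variant passing at 0.85.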

Modifications

-    "--mem-fraction-static=0.9",
+    "--mem-fraction-static=0.85",

0.85 is already used by test_minimax_m25.py for the same 8-GPU + DP-attention shape; every other 8-GPU + DP-attention test in this dir uses 0.85 or lower. GB300 tests keep 0.9 (different runner, 288 GiB/GPU).
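The reason the same fraction is safe on GB300 but not on B200 can be illustrated with a small calculation. This is a sketch under the assumption that whatever is not claimed by the static pool is a linear share of total GPU memory; the helper name is hypothetical and the per-GPU sizes are the ones quoted in this PR.

```python
def non_static_headroom_gib(total_gib: float, fraction: float) -> float:
    """Memory left outside the static pool, assuming a linear split."""
    return total_gib * (1.0 - fraction)


# Per-GPU memory for each runner, as stated in this PR.
RUNNERS_GIB = {"B200": 178.0, "GB300": 288.0}

# Headroom each runner gets at the old fraction of 0.9.
headroom_at_090 = {
    name: non_static_headroom_gib(total, 0.9)
    for name, total in RUNNERS_GIB.items()
}
# Headroom B200 gets after this PR lowers the fraction to 0.85.
b200_at_085 = non_static_headroom_gib(RUNNERS_GIB["B200"], 0.85)

for name, gib in headroom_at_090.items():
    print(f"{name} at 0.90: {gib:.1f} GiB outside the static pool")
print(f"B200 at 0.85: {b200_at_085:.1f} GiB outside the static pool")
```

With these inputs, B200 at 0.85 ends up with roughly the same absolute headroom (about 26.7 GiB) that GB300 already has at 0.9 (about 28.8 GiB), compared with about 17.8 GiB for B200 at 0.9.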

Verification (8x B200, local repro)

| Variant | 0.9 (current) | 0.85 (this PR) |
|---|---|---|
| TP8 | PASS, gsm8k 0.966 | PASS, gsm8k 0.967 |
| TP8+DP8 | FAIL: 6.38 GiB OOM, scheduler died | PASS, gsm8k 0.967 |
| TP8+DP8+MTP | PASS | PASS, gsm8k 0.965 |
| Overall | FAILED | ALL TESTS PASSED |

Local OOM signature at 0.9 matches CI exactly (6.38 GiB request, ~3.3 GiB free, 170.07 GiB resident).
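One way to make the "same signature" comparison explicit is to compare the three numbers with a tolerance. The sketch below is illustrative only: the `OomSignature` type, the helper, and the tolerances are assumptions, not part of the test suite, and the CI values are the rounded figures quoted in this PR rather than values parsed from the CI log.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OomSignature:
    requested_gib: float  # size of the failing allocation
    free_gib: float       # free GPU memory when the allocation failed
    resident_gib: float   # memory already resident on the GPU


def same_signature(a: OomSignature, b: OomSignature,
                   free_tol: float = 0.5, resident_tol: float = 0.5) -> bool:
    """True if both OOMs request the same size and see similar memory state."""
    return (
        math.isclose(a.requested_gib, b.requested_gib, abs_tol=0.01)
        and abs(a.free_gib - b.free_gib) <= free_tol
        and abs(a.resident_gib - b.resident_gib) <= resident_tol
    )


ci = OomSignature(requested_gib=6.38, free_gib=3.0, resident_gib=170.0)      # rounded, from the PR text
local = OomSignature(requested_gib=6.38, free_gib=3.3, resident_gib=170.07)  # local repro

print("signatures match:", same_signature(ci, local))
```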

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

The TP8+DP8 variant has been OOMing at scheduler init on the B200
nightly runner since the test was added (sgl-project#22399, 2026-04-09), failing
every nightly partition for ~37 days. At mem-fraction-static=0.9 the
B200's actual peak residency during DP-attention workspace allocation
hits ~170 GiB of the 178 GiB total, leaving only ~3 GiB free; a
subsequent 6.38 GiB cuda-graph workspace allocation then OOMs. The
same test passes on H200 because its baseline residency leaves more
headroom.

Reproduced locally on 8x B200 (b200-novita-1 class):
  0.90 -> Model 2 (TP8+DP8) crashes with the exact CI OOM signature
          (6.38 GiB requested, ~3.3 GiB free, 170 GiB resident)
  0.85 -> all three variants (TP8, TP8+DP8, TP8+DP8+MTP) pass

Note: 0.85 is already the value used by test_minimax_m25.py for the
same 8-GPU + DP-attention shape; every other 8-GPU + DP-attention
test in this directory uses 0.85 or lower. GLM-5.1 FP8 at 0.9 was
the outlier.

The GB300 variants (test_glm5_fp8.py, test_glm5_nvfp4.py) keep 0.9 -
that runner has 288 GiB/GPU and the same fraction yields more
absolute headroom.

@Jiminator Jiminator marked this pull request as ready for review May 16, 2026 02:29

@Jiminator (Collaborator, Author) commented

/tag-and-rerun-ci

@Jiminator Jiminator self-assigned this May 16, 2026
@Jiminator Jiminator requested a review from Kangyan-Zhou May 16, 2026 02:29
@Jiminator Jiminator requested review from Fridge003 and b8zhong May 16, 2026 02:30
@Kangyan-Zhou Kangyan-Zhou merged commit a741d0c into sgl-project:main May 16, 2026
142 of 184 checks passed
3 participants