[CI] Lower mem-fraction-static for GLM-5.1 FP8 8-GPU test to 0.85 #25453
Merged
Kangyan-Zhou merged 1 commit into May 16, 2026

Conversation
The TP8+DP8 variant has been OOMing at scheduler init on the B200 nightly runner since the test was added (sgl-project#22399, 2026-04-09), failing every nightly partition for ~37 days. At mem-fraction-static=0.9, the B200's actual peak residency during DP-attention workspace allocation hits ~170 GiB of the 178 GiB total, leaving only ~3 GiB free; a subsequent 6.38 GiB cuda-graph workspace allocation then OOMs. The same test passes on H200 because its baseline residency leaves more headroom.

Reproduced locally on 8x B200 (b200-novita-1 class):
0.90 -> Model 2 (TP8+DP8) crashes with the exact CI OOM signature (6.38 GiB requested, ~3.3 GiB free, 170 GiB resident)
0.85 -> all three variants (TP8, TP8+DP8, TP8+DP8+MTP) pass

Note: 0.85 is already the value used by test_minimax_m25.py for the same 8-GPU + DP-attention shape; every other 8-GPU + DP-attention test in this directory uses 0.85 or lower. GLM-5.1 FP8 at 0.9 was the outlier. The GB300 variants (test_glm5_fp8.py, test_glm5_nvfp4.py) keep 0.9; that runner has 288 GiB/GPU, so the same fraction yields more absolute headroom.
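The headroom argument above can be sketched with some simple arithmetic. This is illustrative only: mem-fraction-static reserves a fraction of total GPU memory for the static pool (roughly weights plus KV cache), so cuda-graph workspaces and DP-attention buffers must fit in the remainder; actual residency also includes fragmentation and allocator overhead not modeled here. The per-GPU totals are the figures from this PR.

```python
# Illustrative sketch, not SGLang's actual accounting: memory left outside
# the static pool is total * (1 - mem_fraction_static).
def free_outside_static_pool(total_gib: float, mem_fraction_static: float) -> float:
    """GiB left outside the static pool at a given fraction (rounded to 0.1)."""
    return round(total_gib * (1.0 - mem_fraction_static), 1)

B200_GIB = 178.0   # per-GPU total on the B200 runner (from the PR)
GB300_GIB = 288.0  # per-GPU total on the GB300 runner (from the PR)

print(free_outside_static_pool(B200_GIB, 0.90))   # 17.8 GiB -- B200 at 0.90
print(free_outside_static_pool(B200_GIB, 0.85))   # 26.7 GiB -- B200 at 0.85 (this PR)
print(free_outside_static_pool(GB300_GIB, 0.90))  # 28.8 GiB -- GB300 keeps 0.90
```

Even at 0.85, the B200 has less absolute non-static headroom (26.7 GiB) than the GB300 keeps at 0.90 (28.8 GiB), which is consistent with leaving the GB300 variants untouched.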
Collaborator (Author)
/tag-and-rerun-ci
b8zhong approved these changes on May 16, 2026
Fridge003 pushed a commit that referenced this pull request on May 16, 2026
Motivation
test/registered/8-gpu-models/test_glm_51_fp8.py::test_glm51_fp8 has been failing on every B200 nightly since the test was added in #22399 (2026-04-09, ~37 days, 0 passes on B200). The TP8+DP8 variant OOMs at scheduler init; TP8 and TP8+DP8+MTP are fine. The same test passes on H200 in the same scheduled runs. Example failure: run 25835354140 / job 75909128349.
--mem-fraction-static=0.9 on B200 (178 GiB/GPU) leaves only ~3 GiB free after the DP-attention workspaces and cuda-graph capture intermediates land; the next 6.38 GiB cuda-graph allocation OOMs. Not a code regression: the B200 runner has ~5 GiB higher baseline residency than H200 for this model.

Modifications
0.85 is already used by test_minimax_m25.py for the same 8-GPU + DP-attention shape; every other 8-GPU + DP-attention test in this dir uses 0.85 or lower. GB300 tests keep 0.9 (different runner, 288 GiB/GPU).

Verification (8x B200, local repro)
All three variants pass at 0.85:
TP8
TP8+DP8
TP8+DP8+MTP

Local OOM signature at 0.9 matches CI exactly (6.38 GiB request, ~3.3 GiB free, 170.07 GiB resident).
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci