[CI] Increase stage-c-test-4-gpu-b200 partitions from 4 to 5#22395

Merged
hnyls2002 merged 1 commit into main from
ci/b200-4gpu-partitions-5
Apr 8, 2026

Conversation

@alisonshao
Collaborator

Summary

  • Increase partition count for stage-c-test-4-gpu-b200 from 4 to 5 to fix step timeouts

The suite currently has 14 tests totaling 7010s (116.8 min) of est_time. With 4 partitions, the average is 29.2 min/partition, which leaves less than 1 minute of headroom against the 30-minute step timeout. That is not enough to cover the ~2 min of setup overhead (dep install, validation).

This caused partition 2 to time out mid-test (test_update_weights_from_disk_mxfp8.py was interrupted).

3 lora tests were recently added to this suite, contributing 620s (10.3 min):

| Test | est_time | PR |
| --- | --- | --- |
| test_lora_qwen3_30b_a3b_instruct_2507_logprob_diff.py | 160s | #21466 |
| test_lora_qwen3_vl_30b_a3b_instruct_logprob_diff.py | 160s | #21469 |
| test_lora_gpt_oss_20b_logprob_diff.py | 300s | #21570 |

With 5 partitions the average drops to 23.4 min/partition, providing ~6 min of headroom against the 30-minute step timeout.
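The arithmetic behind these numbers can be checked with a short sketch. It assumes a perfectly even split of est_time across partitions, which the real scheduler may not achieve, and it treats the ~2 min setup overhead as exactly 120s. The function names are hypothetical and are not part of the CI tooling.

```python
import math

# Figures quoted in this PR description.
TOTAL_EST_TIME_S = 7010     # sum of est_time across the 14 tests
STEP_TIMEOUT_S = 30 * 60    # 30-minute step timeout
SETUP_OVERHEAD_S = 2 * 60   # ~2 min of setup; treated as exactly 120s (assumption)


def average_minutes(partitions: int) -> float:
    """Average est_time per partition in minutes, assuming an even split."""
    return TOTAL_EST_TIME_S / partitions / 60


def headroom_minutes(partitions: int, include_setup: bool = False) -> float:
    """Minutes left under the step timeout, optionally after setup overhead."""
    budget_s = STEP_TIMEOUT_S - (SETUP_OVERHEAD_S if include_setup else 0)
    return budget_s / 60 - average_minutes(partitions)


def min_partitions() -> int:
    """Smallest partition count whose average fits in the budget after setup."""
    return math.ceil(TOTAL_EST_TIME_S / (STEP_TIMEOUT_S - SETUP_OVERHEAD_S))


if __name__ == "__main__":
    for n in (4, 5):
        print(
            f"{n} partitions: avg {average_minutes(n):.1f} min, "
            f"headroom {headroom_minutes(n):.1f} min, "
            f"after setup {headroom_minutes(n, include_setup=True):.1f} min"
        )
    print("minimum partitions:", min_partitions())
```

Under these assumptions, 4 partitions have negative headroom once setup is included, and 5 is the smallest count that fits. Because est_time values are estimates and tests cannot be split across partitions, individual partitions may still run longer than the average.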

Test plan

  • Verify all 5 partitions run and complete within the 30 min step timeout

The total est_time across 14 tests in this suite is 7010s (116.8 min).
With 4 partitions that averages 29.2 min/partition, leaving <1 min of
headroom against the 30 min step timeout — not enough to cover the ~2
min of setup overhead (dep install, validation). This caused partition 2
to time out mid-test.

3 lora tests (620s total) were recently added to this suite:
- test_lora_qwen3_30b_a3b_instruct_2507_logprob_diff.py (160s, #21466)
- test_lora_qwen3_vl_30b_a3b_instruct_logprob_diff.py (160s, #21469)
- test_lora_gpt_oss_20b_logprob_diff.py (300s, #21570)

With 5 partitions the average drops to 23.4 min/partition, providing
comfortable margin.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@hnyls2002 hnyls2002 merged commit cf27b11 into main Apr 8, 2026
40 of 42 checks passed
@hnyls2002 hnyls2002 deleted the ci/b200-4gpu-partitions-5 branch April 8, 2026 23:36