[CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions #24002
yichiche wants to merge 1 commit into
Increase total_partitions from 4 to 7 for the AMD 1-GPU diffusion tests. With 3 standalone files consuming 3 partitions, only 1 partition remained for all 21 parametrized test cases, causing shard 0 to exceed the 90-minute timeout. Now 4 partitions are available for parametrized cases, with LPT distributing them to ~30 min each on AMD.
@amd-bot ci-status
CI Status for PR #24002
PR: [CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions
The PR only changes the matrix partition count for AMD: 5 failures (0 likely related) | Others: 0 failures
AMD CI Failures
(Also failed/cancelled but downstream-only: pr-test-amd-finish, wait-for-stage-b-amd, stage-b-test-1-gpu-small-amd-mi35x; these are aggregator/cancellation propagations of the failures above.)
Details
The PR diff only edits
Both changes are scoped to the
Verdict: safe to ignore the failures with respect to this PR's changes. They look like infrastructure / pre-existing AMD CI flakes. Recommend a CI re-run; if they persist, they should be triaged independently (separate from this PR).
Generated by amd-bot using Claude Code CLI
Motivation
The multimodal-gen-test-1-gpu-amd CI job (shard 0) times out at 90 minutes. The root cause is a partitioning imbalance: with total_partitions=4 and 3 standalone test files (test_generate_t2i_perf.py, test_update_weights_from_disk.py, test_tracing.py), only 1 partition remains for all 21 parametrized test cases. The LPT load-balancing algorithm is effectively a no-op when distributing across a single partition.
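To illustrate why a single remaining partition defeats load balancing, here is a minimal sketch of a greedy scheduler, assuming "LPT" refers to the usual longest-processing-time-first heuristic. The function name `lpt_partition`, the case names, and the per-case durations are hypothetical placeholders, not the repository's actual implementation or its real time estimates.

```python
import heapq


def lpt_partition(durations, num_partitions):
    """Greedy LPT: place the longest remaining case on the least-loaded partition."""
    # Min-heap of (current_load, partition_index); ties fall back to the index.
    heap = [(0.0, idx) for idx in range(num_partitions)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_partitions)]
    # Sort by descending duration, then by name for a deterministic order.
    ordered = sorted(durations.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, seconds in ordered:
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    loads = [sum(durations[name] for name in bucket) for bucket in buckets]
    return buckets, loads


# Hypothetical estimates in seconds for 21 parametrized cases (placeholder values).
estimates = {f"case_{i:02d}": 60 + 10 * i for i in range(21)}

one_buckets, one_loads = lpt_partition(estimates, 1)
four_buckets, four_loads = lpt_partition(estimates, 4)

print("1 partition load (s):", one_loads)
print("4 partition loads (s):", four_loads)
```

With one partition the scheduler has no choice to make, so the whole workload lands in a single shard regardless of ordering. With four partitions the same greedy rule keeps the heaviest shard within one case's duration of the average load.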
The H100-based time estimates for those 21 cases sum to ~50 min, but actual AMD MI300 runtimes are ~2-2.5x longer due to:
wan2_2_ti2v_5b: 437s actual vs 142s estimated)
Modifications
Increase total_partitions from 4 to 7 in the AMD 1-GPU diffusion test job (pr-test-amd.yml). Two lines changed:
matrix.part: [0, 1, 2, 3] → [0, 1, 2, 3, 4, 5, 6]
--total-partitions 4 → --total-partitions 7
No changes to test logic, baselines, or the partitioning algorithm. max-parallel: 1 is preserved (required for aiter JIT resource management).
Accuracy Tests
N/A — CI-only change; no model or test logic modifications.
Speed Tests and Profiling
N/A — no runtime code changes. Expected per-partition execution time after fix: ~30 min each (vs ~90+ min before for the single overloaded partition).
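The expected timings can be checked with a short back-of-the-envelope calculation. The sketch below uses only the figures quoted in this description (about 50 min of H100-based estimates, a 2-2.5x AMD slowdown, 3 partitions taken by standalone files, a 90-minute timeout) and assumes the parametrized cases are split perfectly evenly, which real scheduling will only approximate. The constant and function names are illustrative.

```python
ESTIMATED_TOTAL_MIN = 50      # H100-based estimates for the 21 cases (approximate)
SLOWDOWN_RANGE = (2.0, 2.5)   # observed AMD MI300 slowdown range
TIMEOUT_MIN = 90              # CI job timeout
STANDALONE_PARTITIONS = 3     # partitions consumed by standalone test files


def per_partition_minutes(total_partitions, slowdown):
    """Ideal per-partition runtime, assuming a perfectly even split."""
    available = total_partitions - STANDALONE_PARTITIONS
    return ESTIMATED_TOTAL_MIN * slowdown / available


before = tuple(per_partition_minutes(4, s) for s in SLOWDOWN_RANGE)
after = tuple(per_partition_minutes(7, s) for s in SLOWDOWN_RANGE)

print(f"before (4 partitions): {before[0]:.1f}-{before[1]:.1f} min")
print(f"after  (7 partitions): {after[0]:.1f}-{after[1]:.1f} min")
print("before exceeds timeout:", all(m > TIMEOUT_MIN for m in before))
print("after within timeout:", all(m < TIMEOUT_MIN for m in after))
```

Under these assumptions the single overloaded partition needs roughly 100 to 125 minutes, which is over the timeout, while four partitions bring the ideal figure to roughly 25 to 31 minutes, in line with the ~30 min estimate above.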