[CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions#24002

Closed
yichiche wants to merge 1 commit into
sgl-project:mainfrom
yichiche:yichiche/fix-ci-multimodal-gen-test-1-gpu

Conversation

@yichiche
Collaborator

Motivation

The multimodal-gen-test-1-gpu-amd CI job (shard 0) times out at 90 minutes. The root cause is a partitioning imbalance: with total_partitions=4 and 3 standalone test files (test_generate_t2i_perf.py, test_update_weights_from_disk.py, test_tracing.py), each of which takes a partition of its own, only 1 partition remains for all 21 parametrized test cases. The LPT (longest-processing-time-first) load-balancing algorithm has nothing to balance when there is only a single partition to assign cases to.
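To illustrate why LPT cannot help with a single partition, here is a minimal sketch of greedy LPT scheduling. It is not the repository's partitioning code: the function name `lpt_partition`, the case names, and the durations are all hypothetical, chosen only to show the behaviour.

```python
import heapq


def lpt_partition(durations, num_partitions):
    """Greedy LPT: place each case, longest first, on the least-loaded partition.

    `durations` maps a (hypothetical) case name to its estimated seconds.
    Returns (assignment, loads), both indexed by partition.
    """
    # Min-heap of (current_load, partition_index).
    heap = [(0, i) for i in range(num_partitions)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_partitions)]

    for name, secs in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        assignment[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))

    loads = [sum(durations[n] for n in part) for part in assignment]
    return assignment, loads


# Hypothetical estimates in seconds; not the real test cases or timings.
estimates = {
    "case_a": 600, "case_b": 500, "case_c": 400,
    "case_d": 300, "case_e": 200, "case_f": 100,
}

_, one_partition_loads = lpt_partition(estimates, 1)
_, four_partition_loads = lpt_partition(estimates, 4)

print(one_partition_loads)   # every case lands on the only partition
print(four_partition_loads)  # load is spread across four partitions
```

With one partition the sort order is irrelevant, since every case ends up in the same place; the algorithm only starts to matter once several partitions are available for the parametrized cases.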

The H100-based time estimates for those 21 cases sum to ~50 min, but actual AMD MI300 runtimes are ~2-2.5x longer due to:

  • aiter kernel JIT compilation (~120s on first model load)
  • Slower HF model downloads on the runner
  • Longer warmup for large models (e.g., wan2_2_ti2v_5b: 437s actual vs 142s estimated)
  • ROCm GPU memory cleanup between tests (15s each)

Modifications

Increase total_partitions from 4 to 7 in the AMD 1-GPU diffusion test job (pr-test-amd.yml):

  • Before: 4 partitions = 1 parametrized + 3 standalone → shard 0 gets all 21 cases (~90+ min on AMD)
  • After: 7 partitions = 4 parametrized + 3 standalone → LPT spreads the 21 cases across 4 partitions, each taking ~30 min on AMD
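The before/after figures can be checked with a rough back-of-the-envelope calculation. The sketch below uses only numbers quoted in this description (about 50 min of H100 estimates, a 2-2.5x AMD slowdown, 3 standalone files, a 90-minute timeout) and assumes perfectly even balancing across the parametrized partitions, which real LPT will only approximate. The constant and function names are illustrative.

```python
H100_ESTIMATE_MIN = 50      # sum of H100-based estimates for the 21 cases
AMD_SLOWDOWN = (2.0, 2.5)   # observed AMD MI300 slowdown range
STANDALONE_FILES = 3        # each standalone file takes one partition
TIMEOUT_MIN = 90            # job timeout


def per_partition_minutes(total_partitions):
    """Return (low, high) AMD minutes per parametrized partition.

    Assumes the parametrized cases split perfectly evenly, which is an
    idealisation of what LPT achieves.
    """
    parametrized = total_partitions - STANDALONE_FILES
    return tuple(H100_ESTIMATE_MIN * s / parametrized for s in AMD_SLOWDOWN)


before = per_partition_minutes(4)
after = per_partition_minutes(7)

print(before, before[0] > TIMEOUT_MIN)  # single partition exceeds the timeout
print(after, after[1] < TIMEOUT_MIN)    # four partitions fit comfortably
```

Under these assumptions the single parametrized partition needs roughly 100 to 125 minutes, while four partitions need roughly 25 to 31 minutes each, in line with the ~90+ min and ~30 min figures above.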

Two lines changed:

  1. matrix.part: [0, 1, 2, 3] → [0, 1, 2, 3, 4, 5, 6]
  2. --total-partitions 4 → --total-partitions 7

No changes to test logic, baselines, or the partitioning algorithm. max-parallel: 1 is preserved (required for aiter JIT resource management).

Accuracy Tests

N/A — CI-only change; no model or test logic modifications.

Speed Tests and Profiling

N/A — no runtime code changes. Expected per-partition execution time after fix: ~30 min each (vs ~90+ min before for the single overloaded partition).

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Increase total_partitions from 4 to 7 for the AMD 1-GPU diffusion tests.
With 3 standalone files consuming 3 partitions, only 1 partition remained
for all 21 parametrized test cases, causing shard 0 to exceed the
90-minute timeout. Now 4 partitions are available for parametrized cases,
with LPT distributing them to ~30 min each on AMD.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the amd label Apr 29, 2026
@yichiche yichiche changed the title [CI] Fix multimodal-gen-test-1-gpu-amd timeout by increasing partitions [CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions Apr 29, 2026
@yichiche
Collaborator Author

@amd-bot ci-status

@amd-bot

amd-bot commented Apr 30, 2026

@yichiche

CI Status for PR #24002

PR: [CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions
Changed files: .github/workflows/pr-test-amd.yml (+2/-2)

The PR only changes the matrix partition count for multimodal-gen-test-1-gpu-amd from 4 → 7 in pr-test-amd.yml. None of the failing jobs are the multimodal-gen-* jobs (those were skipped for this PR).

AMD: 5 failures (0 likely related) | Others: 0 failures

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd (mi300, 3) | test/registered/mla/test_mla.py | setUpClass (TestMLA) | Scheduler hit an exception; popen_launch_server failed (exit 255) | 🟢 Unlikely | Server-launch failure on MLA test; PR only edits a workflow YAML matrix unrelated to this test path. | Log |
| stage-b-test-1-gpu-small-amd (mi300, 1) | test/registered/moe/test_torch_compile_moe.py | setUpClass (TestTorchCompileMoe) | Scheduler hit an exception; popen_launch_server failed (exit 255) | 🟢 Unlikely | Server-launch failure during torch.compile MoE setup; unrelated to multimodal-gen partition change. | Log |
| stage-b-test-large-8-gpu-35x-disaggregation-amd (mi35x-8.fabric) | test/registered/amd/disaggregation/test_disaggregation_basic.py | setUpClass (TestDisaggregationAccuracy) | Multi-rank Traceback → setUpClass error (exit 255) | 🟢 Unlikely | Disaggregation infra/server bring-up failure; unrelated to multimodal-gen YAML changes. | Log |
| stage-b-test-2-gpu-large-amd (mi300, 0) | test/registered/perf/test_bench_serving_2gpu.py | test_moe_offline_throughput_default | Scheduler hit an exception; popen_launch_server failed (exit 255) | 🟢 Unlikely | 2-GPU MoE benchmark server launch failed; unrelated to PR. | Log |
| stage-b-test-2-gpu-large-amd (mi300, 1) | test/registered/hicache/test_hicache_storage_file_backend.py | setUpClass (TestHiCacheStorageAccuracy) | _launch_server_with_hicache; popen_launch_server failed (exit 255) | 🟢 Unlikely | HiCache server launch failure; unrelated to PR. | Log |

(Also failed/cancelled but downstream-only: pr-test-amd-finish, wait-for-stage-b-amd, stage-b-test-1-gpu-small-amd-mi35x — these are aggregator/cancellation propagations of the failures above.)

Details

The PR diff only edits .github/workflows/pr-test-amd.yml:

  • part: [0, 1, 2, 3] → part: [0, 1, 2, 3, 4, 5, 6]
  • --total-partitions 4 → --total-partitions 7

Both changes are scoped to the multimodal-gen-test-1-gpu-amd job. The multimodal-gen-* jobs themselves were skipped in this run, so the new partition behavior was not exercised here. All 5 test failures are server bring-up errors (Scheduler hit an exception in popen_launch_server, or an equivalent setUpClass error) in unrelated test suites (MLA, torch-compile MoE, disaggregation, 2-GPU bench-serving, HiCache). There is no plausible code path by which a multimodal-gen partition-count change could affect any of these tests.

Verdict: safe to ignore the failures with respect to this PR's changes. They look like infrastructure / pre-existing AMD CI flakes. Recommend a CI re-run; if they persist, they should be triaged independently (separate from this PR).

Generated by amd-bot using Claude Code CLI

@yichiche yichiche closed this Apr 30, 2026

2 participants