[CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions by yichiche · Pull Request #24002 · sgl-project/sglang

yichiche · 2026-04-29T04:41:14Z

Motivation

The multimodal-gen-test-1-gpu-amd CI job (shard 0) times out at 90 minutes. The root cause is a partitioning imbalance: with total_partitions=4 and 3 standalone test files
(test_generate_t2i_perf.py, test_update_weights_from_disk.py, test_tracing.py), only 1 partition remains for all 21 parametrized test cases. The LPT load-balancing algorithm
is effectively a no-op when distributing across a single partition.
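
As a rough illustration of why LPT cannot help when only one partition is available, here is a minimal sketch of greedy longest-processing-time-first assignment. The function name `lpt_partition`, the helper `loads`, and the per-case durations are all hypothetical (the durations are chosen only so that they sum to roughly 49 minutes); this is not the repository's actual partitioning code.

```python
import heapq


def lpt_partition(durations, num_partitions):
    """Greedy LPT: place the longest remaining case on the least-loaded partition."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    # Min-heap of (current_load_seconds, partition_index).
    heap = [(0.0, idx) for idx in range(num_partitions)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(num_partitions)]
    # Sort by descending duration; break ties by name for determinism.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))
    return buckets


# Hypothetical estimates for 21 parametrized cases (seconds), NOT real baselines.
cases = {f"case_{i:02d}": 40.0 + 10.0 * i for i in range(21)}


def loads(buckets):
    """Total estimated seconds assigned to each partition."""
    return [sum(cases[name] for name in bucket) for bucket in buckets]


single = lpt_partition(cases, 1)  # every case lands in the only bucket
four = lpt_partition(cases, 4)    # cases are spread across four buckets

print("1 partition :", loads(single))
print("4 partitions:", loads(four))
```

With one partition the sort order changes nothing, since every case goes to the same bucket. With four partitions, the greedy rule keeps the gap between the heaviest and lightest bucket no larger than the longest single case.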

The H100-based time estimates for those 21 cases sum to ~50 min, but actual AMD MI300 runtimes are ~2-2.5x longer due to:

aiter kernel JIT compilation (~120s on first model load)
Slower HF model downloads on the runner
Longer warmup for large models (e.g., wan2_2_ti2v_5b: 437s actual vs 142s estimated)
ROCm GPU memory cleanup between tests (15s each)
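
The figures above can be checked with a little arithmetic. The sketch below only restates numbers quoted in this description (the ~50 min H100 estimate, the 2-2.5x slowdown, the 90-minute timeout, and the wan2_2_ti2v_5b timings); the constant and function names are illustrative.

```python
TIMEOUT_MIN = 90             # job timeout quoted above
H100_ESTIMATE_MIN = 50       # approximate sum of H100-based estimates for the 21 cases
SLOWDOWN_RANGE = (2.0, 2.5)  # observed AMD MI300 slowdown range quoted above


def projected_minutes(estimate_min, slowdown):
    """Scale an H100-based estimate by an observed slowdown factor."""
    return estimate_min * slowdown


low = projected_minutes(H100_ESTIMATE_MIN, SLOWDOWN_RANGE[0])
high = projected_minutes(H100_ESTIMATE_MIN, SLOWDOWN_RANGE[1])

# Ratio for the single case with both numbers given: 437s actual vs 142s estimated.
wan_ratio = 437 / 142

print(f"Projected AMD runtime: {low:.0f}-{high:.0f} min (timeout {TIMEOUT_MIN} min)")
print(f"wan2_2_ti2v_5b actual/estimated: {wan_ratio:.2f}x")
```

Both ends of the projected range exceed the 90-minute limit, which is consistent with the observed timeout. The wan2_2_ti2v_5b case alone comes out slightly above 3x, so the 2-2.5x range should be read as an aggregate rather than a per-case bound.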

Modifications

Increase total_partitions from 4 to 7 in the AMD 1-GPU diffusion test job (pr-test-amd.yml):

Before: 4 partitions = 1 parametrized + 3 standalone → shard 0 gets all 21 cases (~90+ min on AMD)
After: 7 partitions = 4 parametrized + 3 standalone → LPT distributes cases to ~30 min each on AMD
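
The before/after split can be sanity-checked with a small calculation. This is a sketch under two assumptions: each standalone file occupies exactly one partition (as described above), and the parametrized cases are balanced perfectly. The 100-125 min AMD total is derived from the ~50 min H100 estimate and the 2-2.5x slowdown; the function names are hypothetical.

```python
def parametrized_partitions(total_partitions, standalone_files):
    """Partitions left for parametrized cases after standalone files take one each."""
    remaining = total_partitions - standalone_files
    if remaining < 1:
        raise ValueError("no partition left for parametrized cases")
    return remaining


def ideal_minutes_per_partition(total_minutes, partitions):
    """Per-partition runtime under perfect balancing (an optimistic lower bound)."""
    return total_minutes / partitions


STANDALONE_FILES = 3
AMD_TOTAL_MIN = (100.0, 125.0)  # derived: ~50 min H100 estimate x 2-2.5 slowdown

for total in (4, 7):
    n = parametrized_partitions(total, STANDALONE_FILES)
    best, worst = (ideal_minutes_per_partition(m, n) for m in AMD_TOTAL_MIN)
    print(f"total_partitions={total}: {n} parametrized, {best:.1f}-{worst:.1f} min each")
```

With 7 total partitions this gives 4 parametrized partitions at roughly 25-31 minutes each, in line with the ~30 min figure quoted above. Real partitions will deviate somewhat because individual cases cannot be split.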

Two lines changed:

matrix.part: [0, 1, 2, 3] → [0, 1, 2, 3, 4, 5, 6]
--total-partitions 4 → --total-partitions 7

No changes to test logic, baselines, or the partitioning algorithm. max-parallel: 1 is preserved (required for aiter JIT resource management).

Accuracy Tests

N/A — CI-only change; no model or test logic modifications.

Speed Tests and Profiling

N/A — no runtime code changes. Expected per-partition execution time after fix: ~30 min each (vs ~90+ min before for the single overloaded partition).

Checklist

Format your code according to "Format code with pre-commit".
Add unit tests according to "Run and add unit tests".
Update documentation according to "Write documentations".
Provide accuracy and speed benchmark results according to "Test the accuracy" and "Benchmark the speed".
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Increase total_partitions from 4 to 7 for the AMD 1-GPU diffusion tests. With 3 standalone files consuming 3 partitions, only 1 partition remained for all 21 parametrized test cases, causing shard 0 to exceed the 90-minute timeout. Now 4 partitions are available for parametrized cases, with LPT distributing them to ~30 min each on AMD.

yichiche · 2026-04-30T02:22:11Z

@amd-bot ci-status

amd-bot · 2026-04-30T02:23:56Z

@yichiche

CI Status for PR #24002

PR: [CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions
Changed files: .github/workflows/pr-test-amd.yml (+2/-2)

The PR only changes the matrix partition count for multimodal-gen-test-1-gpu-amd from 4 → 7 in pr-test-amd.yml. None of the failing jobs are the multimodal-gen-* jobs (those were skipped for this PR).

AMD: 5 failures (0 likely related) | Others: 0 failures

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd (mi300, 3) | `test/registered/mla/test_mla.py` | `setUpClass (TestMLA)` | `Scheduler hit an exception` → `popen_launch_server` failed (exit 255) | 🟢 Unlikely | Server-launch failure on MLA test; PR only edits a workflow YAML matrix unrelated to this test path. | Log |
| stage-b-test-1-gpu-small-amd (mi300, 1) | `test/registered/moe/test_torch_compile_moe.py` | `setUpClass (TestTorchCompileMoe)` | `Scheduler hit an exception` → `popen_launch_server` failed (exit 255) | 🟢 Unlikely | Server-launch failure during torch.compile MoE setup; unrelated to multimodal-gen partition change. | Log |
| stage-b-test-large-8-gpu-35x-disaggregation-amd (mi35x-8.fabric) | `test/registered/amd/disaggregation/test_disaggregation_basic.py` | `setUpClass (TestDisaggregationAccuracy)` | Multi-rank `Traceback` → setUpClass error (exit 255) | 🟢 Unlikely | Disaggregation infra/server bring-up failure; unrelated to multimodal-gen YAML changes. | Log |
| stage-b-test-2-gpu-large-amd (mi300, 0) | `test/registered/perf/test_bench_serving_2gpu.py` | `test_moe_offline_throughput_default` | `Scheduler hit an exception` → `popen_launch_server` failed (exit 255) | 🟢 Unlikely | 2-GPU MoE benchmark server launch failed; unrelated to PR. | Log |
| stage-b-test-2-gpu-large-amd (mi300, 1) | `test/registered/hicache/test_hicache_storage_file_backend.py` | `setUpClass (TestHiCacheStorageAccuracy)` | `_launch_server_with_hicache` → `popen_launch_server` failed (exit 255) | 🟢 Unlikely | HiCache server launch failure; unrelated to PR. | Log |

(Also failed/cancelled but downstream-only: pr-test-amd-finish, wait-for-stage-b-amd, stage-b-test-1-gpu-small-amd-mi35x — these are aggregator/cancellation propagations of the failures above.)

Details

The PR diff only edits .github/workflows/pr-test-amd.yml:

part: [0, 1, 2, 3] → part: [0, 1, 2, 3, 4, 5, 6]
--total-partitions 4 → --total-partitions 7

Both changes are scoped to the multimodal-gen-test-1-gpu-amd job. The multimodal-gen-* jobs themselves were skipped in this run (so the actual partition behavior couldn't even be exercised here). All 5 test failures are server bring-up errors (Scheduler hit an exception in popen_launch_server) in unrelated test suites (MLA, torch-compile MoE, disaggregation, 2-GPU bench-serving, HiCache). There is no plausible code path by which a multimodal-gen partition-count change could affect any of these tests.

Verdict: safe to ignore the failures with respect to this PR's changes. They look like infrastructure / pre-existing AMD CI flakes. Recommend a CI re-run; if they persist, they should be triaged independently (separate from this PR).

Generated by amd-bot using Claude Code CLI

yichiche requested review from Fridge003, Kangyan-Zhou, bingxche, ispobock and merrymercy as code owners April 29, 2026 04:41

github-actions Bot added the amd label Apr 29, 2026

yichiche added the run-ci label Apr 29, 2026

yichiche mentioned this pull request Apr 29, 2026

[AMD] Fix CI RuntimeError: opentelemetry package is not installed #23940

Merged

5 tasks

yichiche changed the title ~~[CI] Fix multimodal-gen-test-1-gpu-amd timeout by increasing partitions~~ [CI] Fix CI error: multimodal-gen-test-1-gpu-amd timeout by increasing partitions Apr 29, 2026

yichiche closed this Apr 30, 2026

yichiche wants to merge 1 commit into sgl-project:main from yichiche:yichiche/fix-ci-multimodal-gen-test-1-gpu
