[AMD] Fix CI error: Increase disagg test timeout from 120s to 300s by yichiche · Pull Request #24015 · sgl-project/sglang

yichiche · 2026-04-29T06:36:55Z

Motivation

On MI300 with cold AITER JIT and MIOpen caches, the first disaggregated diffusion inference can exceed the 120s --disagg-timeout, producing:

RuntimeError: Model generation returned no output. Error from scheduler:
DiffusionServer timeout: request ... not completed within 120.0s
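For triage, it can help to pull the reported timeout out of a failing log line. The following is a minimal sketch, assuming the message keeps the exact wording quoted above; `parse_disagg_timeout` is a hypothetical helper name and is not part of sglang.

```python
import re

# Matches the scheduler timeout message quoted in the PR description.
# The request identifier is elided in the source, so any text is accepted there.
_TIMEOUT_RE = re.compile(
    r"DiffusionServer timeout: .*? not completed within (\d+(?:\.\d+)?)s"
)


def parse_disagg_timeout(message):
    """Return the timeout in seconds reported in the message, or None if absent."""
    match = _TIMEOUT_RE.search(message)
    return float(match.group(1)) if match else None
```

A log scanner could use the returned value to tell a 120s timeout from a 300s one after the change lands.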

This triggers up to 6 retries in run_suite.py (each with a full cluster restart plus another 120s wait), which accumulate past the 90-minute CI timeout. The issue was exposed by #23940, which added the tracing dependency: TestDisaggZImageTracing now runs the full disagg pipeline instead of failing fast on missing opentelemetry. This PR addresses the multimodal-gen-test-2-gpu-amd (mi300, 2) timeout.
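The retry accumulation described above can be sketched as simple arithmetic. This is only an illustration: the PR does not state how long a cluster restart takes, so `restart_s` is left as a parameter, and `worst_case_seconds` and `exceeds_ci_limit` are hypothetical names.

```python
CI_LIMIT_S = 90 * 60  # the 90-minute CI timeout mentioned in the PR


def worst_case_seconds(attempts, timeout_s, restart_s):
    """Upper bound on wall-clock time if every attempt restarts and then times out."""
    return attempts * (restart_s + timeout_s)


def exceeds_ci_limit(attempts, timeout_s, restart_s, limit_s=CI_LIMIT_S):
    """True if the worst case for one failing test alone would exceed the CI limit."""
    return worst_case_seconds(attempts, timeout_s, restart_s) > limit_s
```

Plugging in a measured restart time shows how much of the CI budget a single repeatedly timing-out test can consume.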

Modifications

- Increased --disagg-timeout from 120 to 300 in DisaggCluster._launch_server_head() (test_disagg_server.py:229).
- 300s gives each of the 3 disagg roles (encoder → denoiser → decoder) roughly 100s for JIT compilation and inference on first use, while still catching genuine hangs.
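The change and its per-role budget can be sketched as follows. Only the `--disagg-timeout` flag and the three role names come from the PR; `with_disagg_timeout`, `per_role_budget`, and the caller-supplied `base_cmd` are hypothetical, and the even split across roles is the PR's own rough estimate rather than something the server enforces.

```python
DISAGG_TIMEOUT_S = 300  # new value from this PR (previously 120)
ROLES = ("encoder", "denoiser", "decoder")


def with_disagg_timeout(base_cmd, timeout_s=DISAGG_TIMEOUT_S):
    """Return a copy of the launch command with the timeout flag appended."""
    return [*base_cmd, "--disagg-timeout", str(timeout_s)]


def per_role_budget(timeout_s=DISAGG_TIMEOUT_S, roles=ROLES):
    """Approximate per-role share of the timeout, assuming an even split."""
    return timeout_s / len(roles)
```

Keeping the value in one named constant makes later adjustments a one-line edit, which matches the scope of this PR.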

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

On MI300 with cold AITER JIT and MIOpen caches, the first disagg inference can exceed the 120s server-side timeout, causing retries that accumulate past the CI time limit.


yichiche · 2026-04-30T02:25:28Z

@amd-bot ci-status

amd-bot · 2026-04-30T02:29:08Z

@yichiche

CI Status for PR #24015

PR: [AMD] Fix CI error: Increase disagg test timeout from 120s to 300s
Changed files: python/sglang/multimodal_gen/test/server/test_disagg_server.py (+1/-1)

The PR makes a single one-line change in test_disagg_server.py::_launch_server_head, bumping --disagg-timeout from 120 to 300 seconds. This only affects how long the disagg server head waits before timing out a transfer.

AMD: 4 failures (0 likely related) | Others: 4 failures (0 likely related)

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| multimodal-gen-test-2-gpu-amd (1) | `test/server/test_server_2_gpu.py` | `test_diffusion_generation[wan2_1_t2v_14b_2gpu]` and ~10 other parametrized cases | `RuntimeError: Server exited early (code 1)` at `test_server_utils.py:449` | 🟢 Unlikely | Failure is in a different test file (`test_server_2_gpu.py`), not the file modified by this PR. The roles fail to start with `code 1` before any timeout would matter. | Log |
| multimodal-gen-test-1-gpu-amd (3) | `test/server/test_tracing.py` | `test_spans_exported`, `test_spans_without_traceparent`, `test_batch_requests` | `RuntimeError: opentelemetry package is not installed!!!` | 🟢 Unlikely | Missing pip dependency on AMD image; PR does not touch packaging or tracing. | Log |
| multimodal-gen-test-2-gpu-amd (2) | `test/server/test_disagg_server.py` | `TestDisaggZImageTracing.test_disagg_spans_share_trace_id` | `RuntimeError: opentelemetry package is not installed!!!` (encoder failed to start) | 🟢 Unlikely | Same opentelemetry-missing issue. Note the other two disagg tests in the same file (`TestDisaggZImage1Rank.test_generates_image`, `TestDisaggZImage2RankDenoiser.test_generates_image_with_sp2_denoiser`) ran in this shard and are not listed as failures, so the PR's timeout bump is taking effect for non-tracing cases. | Log |
| multimodal-gen-test-1-gpu-amd (0) | N/A | N/A | Log expired (`BlobNotFound`); `Run diffusion server tests` step has no conclusion (job ran ~2.5 h, likely runner-side cancellation) | 🟢 Unlikely | No evidence ties this to the PR; appears to be runner timeout / log retention issue. Re-run to confirm. | Log |

Other CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| call-multimodal-gen-tests / multimodal-gen-test-2-gpu (0) | `test/server/test_server_2_gpu.py` | `test_diffusion_generation[ltx_2.3_two_stage_t2v_2gpus]`, `[flux_image_t2i_2_gpus]`, `[ltx_2.3_one_stage_ti2v]` | `Failed: Diffusion testcase ... failed N check(s)` (consistency / performance validation) | 🟢 Unlikely | Numerical / quality regression in unrelated diffusion pipelines on NVIDIA. PR only touches a timeout constant. | Log |
| call-multimodal-gen-tests / multimodal-gen-test-1-gpu (0) | `test/server/test_server_1_gpu.py` | `test_diffusion_generation[layerwise_offload]`, `[ltx_2_3_hq_pipeline]` | Consistency check failed (clip/ssim/psnr below threshold) | 🟢 Unlikely | Same NVIDIA quality-regression pattern; unrelated to disagg timeout. | Log |
| call-multimodal-gen-tests / multimodal-gen-test-1-b200 | N/A | N/A | `Fast-fail: skipping — root cause job(s): multimodal-gen-test-1-gpu (0), multimodal-gen-test-2-gpu (0), diffusion-coverage-check` | 🟢 Unlikely | Cascading skip from the two NVIDIA failures above. | Log |
| call-multimodal-gen-tests / diffusion-coverage-check | N/A (coverage check, not pytest) | N/A | `❌ COVERAGE FAILURE: Missing test cases` (1-GPU missing `test_generate_t2i_perf.py`, `test_tracing.py`, `test_update_weights_from_disk.py`; 2-GPU missing `test_disagg_server.py`) | 🟢 Unlikely | The "missing" entries are a downstream effect of the NVIDIA shards above failing/being skipped (the b200 shard that would have run `test_disagg_server.py` was fast-failed). The PR does not alter the test inventory or coverage script. | Log |

(Not listed: pr-test-amd-finish / pr-test-finish / notebook-finish are rollup jobs that mirror the failures above; call-gate / pr-gate failed once because the PR was missing the run-ci label, then succeeded after the label was added — not a test failure.)

Details

No failure on this page is in code paths touched by this PR. The change is a 3-character edit to a CLI argument (120 → 300) inside test_disagg_server.py::_launch_server_head. It cannot:

- introduce `Server exited early (code 1)` in `test_server_2_gpu.py` (that test does not call `_launch_server_head`),
- cause `opentelemetry package is not installed` (a missing pip package on the runner image),
- shift diffusion quality metrics on NVIDIA (no model/kernel changes).

The most relevant signal in favour of this PR is in shard 2 (job 73528163674): the two non-tracing disagg tests in the modified file (TestDisaggZImage1Rank.test_generates_image and TestDisaggZImage2RankDenoiser.test_generates_image_with_sp2_denoiser) executed without being listed as ERRORs/FAILUREs — consistent with the timeout bump fixing the original flake.

You can ignore all of these failures for this PR. Recommended next steps: (1) trigger a re-run of `multimodal-gen-test-1-gpu-amd (0)` to clear the expired-log shard; (2) the AMD `opentelemetry` and NVIDIA consistency-check failures are pre-existing infra/flakes that should be tracked separately by CI Monitor, not by this PR.

Generated by amd-bot using Claude Code CLI

[AMD] Increase disagg test timeout from 120s to 300s

4b3c0ce

On MI300 with cold AITER JIT and MIOpen caches, the first disagg inference can exceed the 120s server-side timeout, causing retries that accumulate past the CI time limit.

yichiche requested review from mickqian, ping1jing2 and yhyang201 as code owners April 29, 2026 06:36

github-actions Bot added the diffusion SGLang Diffusion label Apr 29, 2026

yichiche added amd run-ci and removed diffusion SGLang Diffusion labels Apr 29, 2026

yichiche changed the title ~~[AMD] Increase disagg test timeout from 120s to 300s~~ [AMD] Fix CI error: Increase disagg test timeout from 120s to 300s Apr 29, 2026

yichiche mentioned this pull request Apr 29, 2026

[AMD] Fix CI RuntimeError: opentelemetry package is not installed #23940

Merged

5 tasks

yichiche closed this May 14, 2026

yichiche wants to merge 1 commit into sgl-project:main from yichiche:yichiche/fix-ci-aiter-jit-cache


2 participants
