[AMD] Fix CI error: Increase disagg test timeout from 120s to 300s#24015

Closed
yichiche wants to merge 1 commit into sgl-project:main from yichiche:yichiche/fix-ci-aiter-jit-cache

Conversation

Collaborator

@yichiche yichiche commented Apr 29, 2026

Motivation

On MI300 with cold AITER JIT and MIOpen caches, the first disaggregated diffusion inference can exceed the 120s --disagg-timeout, producing:

RuntimeError: Model generation returned no output. Error from scheduler:
DiffusionServer timeout: request ... not completed within 120.0s

This triggers up to 6 retries in run_suite.py (each with a full cluster restart plus another 120s wait), which accumulate past the 90-minute CI timeout. The issue was revealed by #23940, which added the tracing dependency: TestDisaggZImageTracing now actually runs the full disagg pipeline instead of failing fast on missing opentelemetry. This PR fixes the multimodal-gen-test-2-gpu-amd (mi300, 2) timeout.
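
To make the retry arithmetic concrete, here is a minimal sketch of the time one test can spend waiting on timeouts alone. The attempt count (6) and the timeout values (120s, 300s) come from this description. The function name and the `restart_s` parameter are hypothetical, and the real cluster restart cost is not stated in this PR, so it is left as an input. Reading "up to 6 retries" as 6 attempts is also an assumption.

```python
def worst_case_wait_s(attempts: int, timeout_s: float, restart_s: float = 0.0) -> float:
    """Upper bound on seconds spent if every attempt runs into the timeout.

    restart_s is a hypothetical per-attempt cluster restart cost; the PR
    does not state the real value.
    """
    if attempts < 0 or timeout_s < 0 or restart_s < 0:
        raise ValueError("all arguments must be non-negative")
    return attempts * (timeout_s + restart_s)


# Values from the PR description: up to 6 attempts, old 120s timeout.
old_bound = worst_case_wait_s(attempts=6, timeout_s=120)
# Same attempt count with the new 300s timeout.
new_bound = worst_case_wait_s(attempts=6, timeout_s=300)
print(f"old: {old_bound / 60:.0f} min, new: {new_bound / 60:.0f} min (restart cost excluded)")
```

The timeouts alone account for only part of the 90-minute budget; the remainder would come from cluster restarts and the other tests in the shard, which this PR does not quantify. The sketch also shows that a longer timeout raises the per-attempt bound, so the change helps only if the first attempt now finishes inside the window.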

Modifications

  • Increased --disagg-timeout from 120 to 300 in DisaggCluster._launch_server_head() (test_disagg_server.py:229)
  • 300s gives each of the 3 disagg roles (encoder → denoiser → decoder) ~100s for JIT compilation + inference on first use, while still catching genuine hangs
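
As a rough illustration of the change and of the ~100s figure, the sketch below sets the flag on an argument list and splits the total evenly across the three roles. It is not the code in `_launch_server_head()`, which this page does not show: `with_disagg_timeout`, `per_role_budget_s`, and the `<launch-cmd>` placeholder are hypothetical, and the space-separated `--disagg-timeout VALUE` form is an assumption. The timeout applies to the whole request, so the even split is only a budgeting aid, not something enforced per role.

```python
ROLES = ("encoder", "denoiser", "decoder")  # role names from the PR description


def per_role_budget_s(total_timeout_s: float, roles=ROLES) -> float:
    """Evenly split the request timeout across the disagg roles."""
    return total_timeout_s / len(roles)


def with_disagg_timeout(base_args, timeout_s: int):
    """Return a copy of base_args with --disagg-timeout set to timeout_s.

    Assumes the flag takes a space-separated value (not confirmed here).
    """
    args = list(base_args)
    flag = "--disagg-timeout"
    if flag in args:
        i = args.index(flag)
        if i + 1 < len(args):
            args[i + 1] = str(timeout_s)  # overwrite the existing value
        else:
            args.append(str(timeout_s))   # flag was last with no value
    else:
        args += [flag, str(timeout_s)]
    return args


# "<launch-cmd>" stands in for the real server command, which is not shown in this PR.
print(with_disagg_timeout(["<launch-cmd>", "--disagg-timeout", "120"], 300))
print(per_role_budget_s(300))
```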

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

On MI300 with cold AITER JIT and MIOpen caches, the first disagg
inference can exceed the 120s server-side timeout, causing retries
that accumulate past the CI time limit.
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the diffusion SGLang Diffusion label Apr 29, 2026
@yichiche yichiche added amd run-ci and removed diffusion SGLang Diffusion labels Apr 29, 2026
@yichiche yichiche changed the title from "[AMD] Increase disagg test timeout from 120s to 300s" to "[AMD] Fix CI error: Increase disagg test timeout from 120s to 300s" Apr 29, 2026
@yichiche
Collaborator Author

@amd-bot ci-status

@amd-bot

amd-bot commented Apr 30, 2026

@yichiche

CI Status for PR #24015

PR: [AMD] Fix CI error: Increase disagg test timeout from 120s to 300s
Changed files: python/sglang/multimodal_gen/test/server/test_disagg_server.py (+1/-1)

The PR makes a single one-line change in test_disagg_server.py:_launch_server_head bumping --disagg-timeout from 120 to 300 seconds. This only affects how long the disagg server head waits before timing out a transfer.

AMD: 4 failures (0 likely related) | Others: 4 failures (0 likely related)

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
|---|---|---|---|---|---|---|
| multimodal-gen-test-2-gpu-amd (1) | test/server/test_server_2_gpu.py | test_diffusion_generation[wan2_1_t2v_14b_2gpu] and ~10 other parametrized cases | RuntimeError: Server exited early (code 1) at test_server_utils.py:449 | 🟢 Unlikely | Failure is in a different test file (test_server_2_gpu.py), not the file modified by this PR. The roles fail to start with code 1 before any timeout would matter. | Log |
| multimodal-gen-test-1-gpu-amd (3) | test/server/test_tracing.py | test_spans_exported, test_spans_without_traceparent, test_batch_requests | RuntimeError: opentelemetry package is not installed!!! | 🟢 Unlikely | Missing pip dependency on AMD image; PR does not touch packaging or tracing. | Log |
| multimodal-gen-test-2-gpu-amd (2) | test/server/test_disagg_server.py | TestDisaggZImageTracing.test_disagg_spans_share_trace_id | RuntimeError: opentelemetry package is not installed!!! (encoder failed to start) | 🟢 Unlikely | Same opentelemetry-missing issue. Note the other two disagg tests in the same file (TestDisaggZImage1Rank.test_generates_image, TestDisaggZImage2RankDenoiser.test_generates_image_with_sp2_denoiser) ran in this shard and are not listed as failures, so the PR's timeout bump is taking effect for non-tracing cases. | Log |
| multimodal-gen-test-1-gpu-amd (0) | N/A | N/A | Log expired (BlobNotFound); Run diffusion server tests step has no conclusion (job ran ~2.5 h, likely runner-side cancellation) | 🟢 Unlikely | No evidence ties this to the PR; appears to be runner timeout / log retention issue. Re-run to confirm. | Log |

Other CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
|---|---|---|---|---|---|---|
| call-multimodal-gen-tests / multimodal-gen-test-2-gpu (0) | test/server/test_server_2_gpu.py | test_diffusion_generation[ltx_2.3_two_stage_t2v_2gpus], [flux_image_t2i_2_gpus], [ltx_2.3_one_stage_ti2v] | Failed: Diffusion testcase ... failed N check(s) (consistency / performance validation) | 🟢 Unlikely | Numerical / quality regression in unrelated diffusion pipelines on NVIDIA. PR only touches a timeout constant. | Log |
| call-multimodal-gen-tests / multimodal-gen-test-1-gpu (0) | test/server/test_server_1_gpu.py | test_diffusion_generation[layerwise_offload], [ltx_2_3_hq_pipeline] | Consistency check failed (clip/ssim/psnr below threshold) | 🟢 Unlikely | Same NVIDIA quality-regression pattern; unrelated to disagg timeout. | Log |
| call-multimodal-gen-tests / multimodal-gen-test-1-b200 | N/A | N/A | Fast-fail: skipping — root cause job(s): multimodal-gen-test-1-gpu (0), multimodal-gen-test-2-gpu (0), diffusion-coverage-check | 🟢 Unlikely | Cascading skip from the two NVIDIA failures above. | Log |
| call-multimodal-gen-tests / diffusion-coverage-check | N/A (coverage check, not pytest) | N/A | ❌ COVERAGE FAILURE: Missing test cases — 1-GPU missing test_generate_t2i_perf.py, test_tracing.py, test_update_weights_from_disk.py; 2-GPU missing test_disagg_server.py | 🟢 Unlikely | The "missing" entries are a downstream effect of the NVIDIA shards above failing/being skipped (the b200 shard that would have run test_disagg_server.py was fast-failed). The PR does not alter the test inventory or coverage script. | Log |

(Not listed: pr-test-amd-finish / pr-test-finish / notebook-finish are rollup jobs that mirror the failures above; call-gate / pr-gate failed once because the PR was missing the run-ci label, then succeeded after the label was added — not a test failure.)

Details

No failure on this page is in code paths touched by this PR. The change is a 3-character edit to a CLI argument (120 → 300) inside test_disagg_server.py::_launch_server_head. It cannot:

  • introduce Server exited early (code 1) in test_server_2_gpu.py (that test does not call _launch_server_head),
  • cause opentelemetry package is not installed (a missing pip package on the runner image),
  • shift diffusion quality metrics on NVIDIA (no model/kernel changes).

The most relevant signal in favour of this PR is in shard 2 (job 73528163674): the two non-tracing disagg tests in the modified file (TestDisaggZImage1Rank.test_generates_image and TestDisaggZImage2RankDenoiser.test_generates_image_with_sp2_denoiser) executed without being listed as ERRORs/FAILUREs — consistent with the timeout bump fixing the original flake.

You can ignore all of these failures for this PR. Recommended next steps: (1) trigger a re-run of multimodal-gen-test-1-gpu-amd (0) to clear the expired-log shard; (2) the AMD opentelemetry and NVIDIA consistency-check failures are pre-existing infra/flakes that should be tracked separately by CI Monitor, not by this PR.

Generated by amd-bot using Claude Code CLI

@yichiche yichiche closed this May 14, 2026