[AMD] Fix CI error: Increase disagg test timeout from 120s to 300s #24015
yichiche wants to merge 1 commit
On MI300 with cold AITER JIT and MIOpen caches, the first disagg inference can exceed the 120s server-side timeout, causing retries that accumulate past the CI time limit.
@amd-bot ci-status
CI Status for PR #24015
PR: [AMD] Fix CI error: Increase disagg test timeout from 120s to 300s
The PR makes a single one-line change in
AMD: 4 failures (0 likely related) | Others: 4 failures (0 likely related)
AMD CI Failures
Other CI Failures
(Not listed:
Details
No failure on this page is in code paths touched by this PR. The change is a 3-character edit to a CLI argument (
The most relevant signal in favour of this PR is in shard 2 (job 73528163674): the two non-tracing disagg tests in the modified file (
You can ignore all of these failures for this PR.
Recommended next steps: (1) trigger a re-run of
Motivation
On MI300 with cold AITER JIT and MIOpen caches, the first disaggregated diffusion inference can exceed the 120s --disagg-timeout, producing:

RuntimeError: Model generation returned no output. Error from scheduler:
DiffusionServer timeout: request ... not completed within 120.0s

This triggers up to 6 retries in run_suite.py (each with a full cluster restart plus another 120s wait), which accumulate past the 90-minute CI timeout. The issue was revealed by #23940, which added the tracing dependency: TestDisaggZImageTracing now actually runs the full disagg pipeline instead of failing fast on missing opentelemetry. This PR fixes the multimodal-gen-test-2-gpu-amd (mi300, 2) timeout issue.

Modifications
Increase --disagg-timeout from 120 to 300 in DisaggCluster._launch_server_head() (test_disagg_server.py:229)

Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
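
The retry arithmetic from the Motivation section can be sketched roughly as below. This is an illustrative model only, not code from run_suite.py: the function name, the 180s cluster restart cost, and the 200s cold-cache inference time are hypothetical placeholders, and the model assumes every retry is as slow as the first (caches never warm up between attempts). Only the 120s and 300s timeouts and the count of 6 attempts come from the PR description, with "up to 6 retries" read here as 6 attempts in total.

```python
def total_wall_time(inference_s: float, timeout_s: float,
                    max_attempts: int, restart_s: float) -> float:
    """Estimate wall time spent on one test under a simple retry model.

    Assumption (hypothetical): every attempt pays the same restart cost and
    the same inference time, so a request that is too slow once is too slow
    on every retry.
    """
    if inference_s <= timeout_s:
        # First attempt finishes inside the timeout: one restart, one inference.
        return restart_s + inference_s
    # Every attempt waits out the full timeout, then the cluster restarts.
    return max_attempts * (restart_s + timeout_s)


# Values from the PR description.
OLD_TIMEOUT_S = 120
NEW_TIMEOUT_S = 300
MAX_ATTEMPTS = 6

# Hypothetical placeholders, chosen only to illustrate the arithmetic.
RESTART_S = 180          # assumed cost of one full cluster restart
COLD_INFERENCE_S = 200   # assumed first-inference time with cold caches

old = total_wall_time(COLD_INFERENCE_S, OLD_TIMEOUT_S, MAX_ATTEMPTS, RESTART_S)
new = total_wall_time(COLD_INFERENCE_S, NEW_TIMEOUT_S, MAX_ATTEMPTS, RESTART_S)
print(f"120s timeout: {old:.0f}s, 300s timeout: {new:.0f}s")
```

Under these assumed numbers, a single affected test costs 1800s with the 120s timeout and 380s with the 300s timeout. A longer timeout raises the cost of each individual wait, but it removes the retries entirely when the first inference lands between the two limits.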