[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout by micah-wil · Pull Request #34922 · vllm-project/vllm

micah-wil · 2026-02-19T22:03:54Z

We've noticed some flakiness in amd-ci related to tests timing out in Entrypoints Integration Test (API Server 1). Here is one example from a recent nightly build: https://buildkite.com/vllm/amd-ci/builds/4972/summary?jid=019c6f8c-8d3b-47b7-977b-ab61ac8730ce&tab=output#019c6f8c-8d3b-47b7-977b-ab61ac8730ce/L6775

In this example, the following test times out once it hits the 240-second server init timeout defined in RemoteOpenAIServer:

pytest -v -s entrypoints/openai/test_serving_chat.py::TestGPTOSSSpeculativeChat::test_gpt_oss_speculative_reasoning_leakage[with_tool_parser-exclude_tools_when_tool_choice_none]

From local testing, the server just barely exceeds the timeout: AITER JIT compilation takes 130 seconds for one particular kernel, and the server took 250 seconds to come up overall. I loosened the default timeout by two minutes, from 240 to 360 seconds. There is some variance in JIT compile time for AITER kernels, and when testing with torch 2.10 and triton 3.6 I noticed that some triton compile times are a little longer. So this 2-minute increase is partly preparatory for when we upgrade.
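To make the numbers above concrete, here is a minimal sketch of a startup wait loop with a deadline. It is not the actual RemoteOpenAIServer code: `wait_until_ready`, `FakeClock`, and `simulate_startup` are hypothetical names used only for illustration, and the clock is simulated so the example runs instantly. It only shows why a server that needs about 250 seconds fails a 240-second limit but passes a 360-second one.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll `probe` until it returns True and return the elapsed seconds.

    Raises TimeoutError once `timeout_s` seconds have passed without success.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(f"server not ready after {timeout_s} s")
        sleep(poll_interval_s)


class FakeClock:
    """Simulated clock (hypothetical helper) so no real waiting happens."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate_startup(ready_after_s: float, timeout_s: float) -> str:
    """Pretend the server becomes healthy `ready_after_s` seconds after launch."""
    clock = FakeClock()
    try:
        elapsed = wait_until_ready(
            probe=lambda: clock.time() >= ready_after_s,
            timeout_s=timeout_s,
            poll_interval_s=1.0,
            clock=clock.time,
            sleep=clock.sleep,
        )
        return f"ready after {elapsed:.0f} s"
    except TimeoutError:
        return "timed out"


print(simulate_startup(250, 240))  # 250 s startup against the old 240 s limit
print(simulate_startup(250, 360))  # same startup against the new 360 s limit
```

With the 250-second startup reported above, the first call times out and the second succeeds, which is the margin this PR is buying.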

The TestGPTOSSSpeculativeChat test takes a particularly long time to initialize because of the combined AITER JIT and triton 3.6 compile times (about 470 seconds when I tested locally), so I added some extra timeout padding on that one as well.

I implemented these changes so that they take effect on both AMD and Nvidia CI. If it is preferred that I isolate them to AMD platforms, I can do that. Further, since the size of the timeout increases is a bit preemptive, I could limit them to the minimal necessary values for now and revisit if it becomes an issue again. I would like to hear the reviewers' preferences. Thanks!

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

gemini-code-assist

Code Review

This pull request increases the server startup timeouts to address flakiness in CI tests, particularly on ROCm platforms. The changes are reasonable, but the default timeout increase is applied globally. I've suggested making this change platform-specific to ROCm to avoid unintentionally masking performance regressions on other platforms.

gemini-code-assist · 2026-02-19T22:11:44Z

tests/utils.py


        self._start_server(model, vllm_serve_args, env_dict)
-        max_wait_seconds = max_wait_seconds or 240
+        max_wait_seconds = max_wait_seconds or 360


Since the flakiness is observed on AMD CI, it's better to increase the timeout only for ROCm platforms. This avoids masking potential performance regressions on other platforms like NVIDIA. Also, using if max_wait_seconds is None: is more robust than the or operator, as it correctly handles the case where max_wait_seconds=0 is passed intentionally.

Suggested change

-        max_wait_seconds = max_wait_seconds or 360
+        if max_wait_seconds is None:
+            max_wait_seconds = 360 if current_platform.is_rocm() else 240
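The difference between the `or` form and the `is None` check mentioned in the review comment can be seen in a small standalone sketch. The function names and the `is_rocm` flag below are hypothetical stand-ins (the suggestion itself calls `current_platform.is_rocm()`); the constants mirror the 240 and 360 second values discussed in this PR.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 240  # previous default startup timeout
ROCM_TIMEOUT_S = 360     # loosened timeout proposed in this PR


def resolve_with_or(max_wait_seconds: Optional[int]) -> int:
    # `or` falls back to the default for every falsy value, including 0.
    return max_wait_seconds or ROCM_TIMEOUT_S


def resolve_with_none_check(max_wait_seconds: Optional[int], is_rocm: bool) -> int:
    # Only a missing value (None) triggers the platform-dependent default;
    # an explicit 0 is passed through unchanged.
    if max_wait_seconds is None:
        return ROCM_TIMEOUT_S if is_rocm else DEFAULT_TIMEOUT_S
    return max_wait_seconds


print(resolve_with_or(None))                        # default applied
print(resolve_with_or(0))                           # explicit 0 is overridden
print(resolve_with_none_check(0, is_rocm=True))     # explicit 0 is kept
print(resolve_with_none_check(None, is_rocm=True))  # ROCm default
print(resolve_with_none_check(None, is_rocm=False)) # default on other platforms
```

Whether a caller would ever pass 0 on purpose is a separate question; the sketch only illustrates the behavioral difference the reviewer points out.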

loosen server startup time restriction

6122c28

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

micah-wil requested review from DarkLight1337, NickLucche, aarnphm and robertgshaw2-redhat as code owners February 19, 2026 22:03

mergify bot added the rocm Related to AMD ROCm label Feb 19, 2026

github-project-automation bot added this to AMD Feb 19, 2026

github-project-automation bot moved this to Todo in AMD Feb 19, 2026

AndreasKaratzas mentioned this pull request Feb 19, 2026

[CI Failure]: mi325_1: Entrypoints Integration Test (API Server 1) #29541

Closed

3 tasks

gemini-code-assist bot reviewed Feb 19, 2026

AndreasKaratzas mentioned this pull request Feb 19, 2026

[ROCm][CI] Added MI325 mirrors #34923

Merged

DarkLight1337 approved these changes Feb 20, 2026

DarkLight1337 enabled auto-merge (squash) February 20, 2026 03:59

github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 20, 2026

DarkLight1337 merged commit f5432e3 into vllm-project:main Feb 20, 2026
19 checks passed

github-project-automation bot moved this from Todo to Done in AMD Feb 20, 2026

AndreasKaratzas mentioned this pull request Feb 22, 2026

[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays #35052

Merged

yugong333 pushed a commit to yugong333/vllm that referenced this pull request Feb 22, 2026

[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout (vllm-project#34922)

3d38583

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026

[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout (vllm-project#34922)

f13f1fe

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026

[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout (vllm-project#34922)

aa993a0

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026

[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout (vllm-project#34922)

9b4bb09

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026

[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout (vllm-project#34922)

52e92f0

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>

Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout (vllm-project#34922)

95c5b60

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

DarkLight1337 merged 1 commit into vllm-project:main from ROCm:micah/remote-openai-timeout
