[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout#34922

Merged
DarkLight1337 merged 1 commit intovllm-project:mainfrom
ROCm:micah/remote-openai-timeout
Feb 20, 2026

Conversation

@micah-wil (Contributor) commented Feb 19, 2026

We've noticed some flakiness in amd-ci related to tests timing out in Entrypoints Integration Test (API Server 1). Here is one example from a recent nightly build: https://buildkite.com/vllm/amd-ci/builds/4972/summary?jid=019c6f8c-8d3b-47b7-977b-ab61ac8730ce&tab=output#019c6f8c-8d3b-47b7-977b-ab61ac8730ce/L6775

In this example, the following test times out after hitting the 240-second server init timeout defined in RemoteOpenAIServer:

pytest -v -s entrypoints/openai/test_serving_chat.py::TestGPTOSSSpeculativeChat::test_gpt_oss_speculative_reasoning_leakage[with_tool_parser-exclude_tools_when_tool_choice_none]

From local testing, it looks like the test is just barely exceeding the timeout because AITER JIT compilation takes about 130 seconds for one particular kernel (the server took about 250 seconds to come up overall). I loosened the overall timeout by two minutes, increasing it from 240 to 360 seconds. The reason is that there is some variance in JIT compile time for AITER kernels, and when testing with torch 2.10 and triton 3.6 I noticed some Triton compile times are a bit longer. So this two-minute increase is somewhat preparatory for when we upgrade.

The TestGPTOSSSpeculativeChat test takes a particularly long time to initialize due to the combination of AITER JIT and triton 3.6 compile times (about 470 seconds when I tested locally), so I added a bit of extra timeout padding on that one as well.

I implemented these changes such that they take effect on both AMD and Nvidia CI. If it is preferred that I isolate these changes to AMD platforms, I can do that. Further, since the magnitude of the timeout increases is a bit preemptive, I could limit them to the minimal necessary values for now and revisit in the future if this becomes an issue again. Would like to hear the reviewers' preferences. Thanks!
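The timeout being loosened here is the standard poll-until-healthy startup pattern. As a minimal self-contained sketch of that pattern (illustrative names only, not vllm's actual RemoteOpenAIServer implementation):

```python
import time


def wait_for_server(is_healthy, max_wait_seconds=360, poll_interval=0.01):
    """Poll a health check until it passes or max_wait_seconds elapses.

    is_healthy: zero-argument callable returning True once the server is up.
    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + max_wait_seconds
    while time.monotonic() < deadline:
        if is_healthy():
            return True
        time.sleep(poll_interval)
    raise TimeoutError(
        f"server did not become healthy within {max_wait_seconds}s")
```

With slow JIT compilation, the health check simply keeps failing until the kernels finish building, so the only knob is how long the caller is willing to wait before giving up.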

Signed-off-by: Micah Williamson <micah.williamson@amd.com>
@gemini-code-assist bot left a comment

Code Review

This pull request increases the server startup timeouts to address flakiness in CI tests, particularly on ROCm platforms. The changes are reasonable, but the default timeout increase is applied globally. I've suggested making this change platform-specific to ROCm to avoid unintentionally masking performance regressions on other platforms.


 self._start_server(model, vllm_serve_args, env_dict)
-max_wait_seconds = max_wait_seconds or 240
+max_wait_seconds = max_wait_seconds or 360
Severity: high

Since the flakiness is observed on AMD CI, it's better to increase the timeout only for ROCm platforms. This avoids masking potential performance regressions on other platforms like NVIDIA. Also, using `if max_wait_seconds is None:` is more robust than the `or` operator, as it correctly handles the case where `max_wait_seconds=0` is passed intentionally.

Suggested change:

-max_wait_seconds = max_wait_seconds or 360
+if max_wait_seconds is None:
+    max_wait_seconds = 360 if current_platform.is_rocm() else 240
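The `or`-versus-`is None` distinction the bot raises can be shown in isolation. A small sketch (standalone functions, not the actual vllm code) comparing the two fallback styles:

```python
def resolve_timeout_or(max_wait_seconds=None):
    # The `or` pattern: any falsy value, including an intentional 0,
    # is silently replaced by the default.
    return max_wait_seconds or 360


def resolve_timeout_is_none(max_wait_seconds=None):
    # The `is None` pattern: only an omitted argument falls back to
    # the default; an explicit 0 is preserved.
    if max_wait_seconds is None:
        max_wait_seconds = 360
    return max_wait_seconds


# An explicit zero timeout behaves differently under the two patterns:
assert resolve_timeout_or(0) == 360        # 0 is falsy, default wins
assert resolve_timeout_is_none(0) == 0     # 0 is kept as requested
assert resolve_timeout_is_none(None) == 360
```

A zero timeout can be useful in tests that want an immediate failure rather than a long wait, which is why the `is None` check is the more robust choice.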

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 20, 2026 03:59
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 20, 2026
@DarkLight1337 DarkLight1337 merged commit f5432e3 into vllm-project:main Feb 20, 2026
19 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 20, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Feb 22, 2026
jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026

Labels

ready: ONLY add when PR is ready to merge/full CI is needed
rocm: Related to AMD ROCm

Projects

Status: Done

2 participants