[ROCm][CI] Loosen RemoteOpenAIServer Startup Timeout #34922
DarkLight1337 merged 1 commit into vllm-project:main
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Code Review
This pull request increases the server startup timeouts to address flakiness in CI tests, particularly on ROCm platforms. The changes are reasonable, but the default timeout increase is applied globally. I've suggested making this change platform-specific to ROCm to avoid unintentionally masking performance regressions on other platforms.
  self._start_server(model, vllm_serve_args, env_dict)
- max_wait_seconds = max_wait_seconds or 240
+ max_wait_seconds = max_wait_seconds or 360
Since the flakiness is observed on AMD CI, it's better to increase the timeout only for ROCm platforms. This avoids masking potential performance regressions on other platforms like NVIDIA. Also, using `if max_wait_seconds is None:` is more robust than the `or` operator, as it correctly handles the case where `max_wait_seconds=0` is passed intentionally.
- max_wait_seconds = max_wait_seconds or 360
+ if max_wait_seconds is None:
+     max_wait_seconds = 360 if current_platform.is_rocm() else 240
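To make the difference between the two defaulting styles concrete, here is a minimal standalone sketch. The function names and the `is_rocm` boolean parameter are hypothetical stand-ins for illustration (the suggestion above calls `current_platform.is_rocm()` instead); this is not the code in the PR.

```python
from typing import Optional


def resolve_with_or(max_wait_seconds: Optional[float]) -> float:
    # `or` falls back to the default for ANY falsy value, so an
    # explicit 0 is silently replaced by 360.
    return max_wait_seconds or 360


def resolve_with_none_check(max_wait_seconds: Optional[float],
                            is_rocm: bool) -> float:
    # Only a missing value (None) triggers the default; an explicit 0
    # is passed through unchanged. `is_rocm` is a hypothetical stand-in
    # for a platform check.
    if max_wait_seconds is None:
        return 360 if is_rocm else 240
    return max_wait_seconds


if __name__ == "__main__":
    print(resolve_with_or(0))                      # explicit 0 is lost
    print(resolve_with_none_check(0, True))        # explicit 0 is kept
    print(resolve_with_none_check(None, True))     # ROCm default
    print(resolve_with_none_check(None, False))    # non-ROCm default
```

Whether a zero timeout is ever passed in practice is a separate question; the sketch only shows that the two forms disagree on that input.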
We've noticed some flakiness in amd-ci related to tests timing out in Entrypoints Integration Test (API Server 1). Here is one example from a recent nightly build: https://buildkite.com/vllm/amd-ci/builds/4972/summary?jid=019c6f8c-8d3b-47b7-977b-ab61ac8730ce&tab=output#019c6f8c-8d3b-47b7-977b-ab61ac8730ce/L6775
In this example, the following test times out after the 240 second server init timeout defined in RemoteOpenAIServer:
From local testing, it looks like the server is just barely exceeding the timeout: AITER JIT compilation took 130 seconds for one particular kernel, and the server took 250 seconds to come up overall. I loosened the overall timeout by two minutes, increasing it from 240 to 360 seconds. The reason is that there is some variance in JIT compile time for AITER kernels, and when testing with torch 2.10 and triton 3.6, some triton compile times were a little longer. So this 2 minute increase is somewhat preparatory for when we upgrade.
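The effect of the limit on a server that needs 250 seconds to come up can be sketched with a generic deadline-polling loop. This is an illustration under stated assumptions, not the actual wait logic in RemoteOpenAIServer: `wait_until_ready`, `FakeClock`, and `simulate` are hypothetical names, and a fake clock replaces real sleeping so the example runs instantly.

```python
import time
from typing import Callable, Optional


def wait_until_ready(is_ready: Callable[[], bool],
                     max_wait_seconds: float,
                     poll_interval: float = 0.5,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll is_ready() until it returns True or the deadline passes.

    Returns the elapsed time on success, raises TimeoutError otherwise.
    """
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        if clock() - start >= max_wait_seconds:
            raise TimeoutError(
                f"not ready after {max_wait_seconds} seconds")
        sleep(poll_interval)


class FakeClock:
    """Hypothetical test clock: sleep() advances time without waiting."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(limit: float, ready_at: float = 250.0) -> Optional[float]:
    # Model a server that becomes ready `ready_at` seconds after start.
    fake = FakeClock()
    try:
        return wait_until_ready(lambda: fake.now >= ready_at, limit,
                                poll_interval=10.0,
                                clock=fake.time, sleep=fake.sleep)
    except TimeoutError:
        return None


if __name__ == "__main__":
    print(simulate(240))  # None: the 240 second limit is hit first
    print(simulate(360))  # 250.0: ready with 110 seconds to spare
```

With the 250 second startup reported above, the old limit misses by 10 seconds while the new one leaves 110 seconds of headroom for compile-time variance.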
The TestGPTOSSSpeculativeChat test takes a particularly long time to initialize between AITER JIT and triton 3.6 compile times (it took about 470 seconds when I tested locally), so I added a bit of extra timeout padding on that one as well.
I implemented these changes such that they take effect on both AMD and Nvidia CI. If it is preferred that I isolate these changes to AMD platforms, I can do that. Further, since the timeout increases are a bit preemptive, I could limit them to the minimal necessary values for now and revisit in the future if it becomes an issue again. I would like to hear the reviewers' preferences. Thanks!