[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays #35052

Merged
DarkLight1337 merged 3 commits into vllm-project:main from ROCm:akaratza_entrypoints_api_server_i
Feb 22, 2026
AndreasKaratzas (Collaborator) commented Feb 22, 2026

test_multi_chunk_streaming and test_empty_commit_does_not_crash_engine in entrypoints/openai/test_realtime_validation.py (Entrypoints Integration Test - API Server 1) intermittently fail with TimeoutError on ROCm builds.

The root cause is that aiter modules are JIT-compiled on the first inference request. The original test timeouts (30-60s) are insufficient when this compilation happens during the test's critical path.

Fix

  • Add a warm-up step to test_multi_chunk_streaming: send a small audio chunk before the real transcription to absorb JIT compilation latency, with a generous 360s timeout.
  • Increase the first-request timeout in test_empty_commit_does_not_crash_engine from 30s to 360s since the empty commit triggers the first inference (and thus JIT compilation).
  • Wait for session.updated after session.update to avoid racing the server's session setup.
  • Preserve the non-final input_audio_buffer.commit before sending audio, as it is required by the protocol to start a transcription session.
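The sequence in the list above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `FakeRealtimeSocket` is a hypothetical in-memory stand-in for the server, the `input_audio_buffer.append` message, the `final` flag, the `transcription.done` event name, and the placeholder model name are assumptions, and `receive_event` is re-implemented here only so the sketch runs on its own. Only the 360s budget, the 30-60s range, and the ordering of steps come from the description.

```python
import asyncio
import json

JIT_WARMUP_TIMEOUT_S = 360.0  # generous first-request budget from the PR description
NORMAL_TIMEOUT_S = 60.0       # upper end of the original 30-60s range


class FakeRealtimeSocket:
    """Hypothetical stand-in: the first final commit answers slowly,
    imitating one-time JIT compilation on the first inference request."""

    def __init__(self, jit_delay_s=0.05):
        self._events = asyncio.Queue()
        self._jit_delay_s = jit_delay_s
        self._compiled = False

    async def send(self, raw):
        msg = json.loads(raw)
        if msg["type"] == "session.update":
            self._events.put_nowait(({"type": "session.updated"}, 0.0))
        elif msg["type"] == "input_audio_buffer.commit" and msg.get("final"):
            delay = 0.0 if self._compiled else self._jit_delay_s
            self._compiled = True
            self._events.put_nowait(({"type": "transcription.done"}, delay))

    async def recv(self):
        event, delay = await self._events.get()
        await asyncio.sleep(delay)  # simulated compile time before the reply
        return json.dumps(event)


async def receive_event(ws, timeout):
    """Assumed shape of the test file's helper: one event or TimeoutError."""
    try:
        raw = await asyncio.wait_for(ws.recv(), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"no event within {timeout}s") from exc
    return json.loads(raw)


async def wait_for_event(ws, event_type, timeout):
    """Skip unrelated events until one of the requested type arrives."""
    while True:
        event = await receive_event(ws, timeout=timeout)
        if event["type"] == event_type:
            return event


async def send_json(ws, payload):
    await ws.send(json.dumps(payload))


async def transcribe(ws, audio_b64, timeout):
    await send_json(ws, {"type": "input_audio_buffer.append", "audio": audio_b64})
    await send_json(ws, {"type": "input_audio_buffer.commit", "final": True})
    return await wait_for_event(ws, "transcription.done", timeout=timeout)


async def run_with_warmup(ws):
    await send_json(ws, {"type": "session.update", "model": "placeholder-model"})
    await wait_for_event(ws, "session.updated", timeout=5.0)
    # Non-final commit: per the PR description, required to start the session.
    await send_json(ws, {"type": "input_audio_buffer.commit"})
    # Warm-up request absorbs the slow first inference under the long timeout.
    warmup = await transcribe(ws, "AAAA", timeout=JIT_WARMUP_TIMEOUT_S)
    # The real request can then keep its ordinary timeout.
    real = await transcribe(ws, "AAAAAAAA", timeout=NORMAL_TIMEOUT_S)
    return [warmup["type"], real["type"]]


async def demo():
    # Build the fake inside the running loop so its queue binds to that loop.
    return await run_with_warmup(FakeRealtimeSocket())
```

Running `asyncio.run(demo())` exercises both requests against the fake. The point of the design is that only the warm-up call carries the long timeout, so a genuine hang in the real request is still caught quickly.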

… delays

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
dosubot bot commented Feb 22, 2026

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.

mergify bot added the rocm label (Related to AMD ROCm) Feb 22, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 22, 2026
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses intermittent test timeouts on ROCm builds by introducing a warm-up step and increasing timeouts to account for JIT compilation delays. The changes are logical and directly address the issue. However, I've found a recurring issue in the implementation of waiting for the session.updated event. The current try...except TimeoutError: pass pattern is duplicated in three places and can hide bugs or lead to flaky tests by silently ignoring timeouts. My review comments suggest a more robust implementation that explicitly fails the test on timeout, which aligns with the goal of preventing race conditions.
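Since the review notes that the same wait loop appears in three places, one way to address both the duplication and the silent timeout is a single shared helper. The sketch below is illustrative only: the helper name `wait_for_session_updated`, the `QueueSocket` stand-in, and the `session.created` event are hypothetical, and `receive_event` is re-implemented so the example is self-contained. It raises `AssertionError` (which pytest reports as a test failure) instead of calling `pytest.fail`, purely to avoid a third-party import here.

```python
import asyncio
import json


class QueueSocket:
    """Hypothetical stand-in that replays a fixed list of events, then goes quiet."""

    def __init__(self, events):
        self._queue = asyncio.Queue()
        for event in events:
            self._queue.put_nowait(json.dumps(event))

    async def recv(self):
        return await self._queue.get()


async def receive_event(ws, timeout):
    """Assumed shape of the test file's helper: one event or TimeoutError."""
    try:
        raw = await asyncio.wait_for(ws.recv(), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"no event within {timeout}s") from exc
    return json.loads(raw)


async def wait_for_session_updated(ws, timeout=5.0):
    """Shared replacement for the three copies of the wait loop.

    Fails loudly instead of swallowing the timeout.
    """
    try:
        while True:
            event = await receive_event(ws, timeout=timeout)
            if event["type"] == "session.updated":
                return event
    except TimeoutError:
        raise AssertionError("Timed out waiting for session.updated event.") from None


async def demo():
    # Sockets are created inside the running loop so their queues bind to it.
    ok = QueueSocket([{"type": "session.created"}, {"type": "session.updated"}])
    event = await wait_for_session_updated(ok, timeout=0.1)

    silent = QueueSocket([{"type": "session.created"}])
    try:
        await wait_for_session_updated(silent, timeout=0.1)
    except AssertionError as err:
        return event["type"], str(err)
    return event["type"], None
```

Each call site would then shrink to a single `await wait_for_session_updated(ws)` line.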

Comment on lines +89 to +95
try:
    while True:
        event = await receive_event(ws, timeout=5.0)
        if event["type"] == "session.updated":
            break
except TimeoutError:
    pass

high

The try...except TimeoutError: pass block is problematic. It will silently ignore a timeout if the session.updated event is not received within 5 seconds. This defeats the purpose of waiting for the event to avoid race conditions, as stated in the pull request description. If a timeout occurs, the test will proceed, potentially leading to flaky failures or hiding underlying issues.

To make the test more robust, the TimeoutError should be handled by explicitly failing the test. This ensures that the absence of the session.updated event is caught and reported. This pattern is repeated elsewhere in the file and should be fixed in all locations.

Suggested change
-try:
-    while True:
-        event = await receive_event(ws, timeout=5.0)
-        if event["type"] == "session.updated":
-            break
-except TimeoutError:
-    pass
+try:
+    while True:
+        event = await receive_event(ws, timeout=5.0)
+        if event["type"] == "session.updated":
+            break
+except TimeoutError:
+    pytest.fail("Timed out waiting for session.updated event.")

Comment on lines +186 to +192
try:
    while True:
        event = await receive_event(ws, timeout=5.0)
        if event["type"] == "session.updated":
            break
except TimeoutError:
    pass

high

As with the previous occurrence, this try...except TimeoutError: pass block can lead to flaky tests or hide bugs by not ensuring the session.updated event is received. The test should explicitly fail if a timeout occurs to ensure reliability.

Suggested change
-try:
-    while True:
-        event = await receive_event(ws, timeout=5.0)
-        if event["type"] == "session.updated":
-            break
-except TimeoutError:
-    pass
+try:
+    while True:
+        event = await receive_event(ws, timeout=5.0)
+        if event["type"] == "session.updated":
+            break
+except TimeoutError:
+    pytest.fail("Timed out waiting for session.updated event.")

Comment on lines +218 to +224
try:
    while True:
        event = await receive_event(ws, timeout=5.0)
        if event["type"] == "session.updated":
            break
except TimeoutError:
    pass

high

This is the third instance of the problematic try...except TimeoutError: pass pattern. To ensure test reliability and prevent race conditions, the test should fail explicitly if the session.updated event is not received within the timeout period.

Suggested change
-try:
-    while True:
-        event = await receive_event(ws, timeout=5.0)
-        if event["type"] == "session.updated":
-            break
-except TimeoutError:
-    pass
+try:
+    while True:
+        event = await receive_event(ws, timeout=5.0)
+        if event["type"] == "session.updated":
+            break
+except TimeoutError:
+    pytest.fail("Timed out waiting for session.updated event.")

Collaborator Author

Good suggestion. I would not fail the test necessarily, but I added a warning.
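
A warn-and-continue variant of the wait loop might look roughly like the sketch below. The thread does not show how the warning was actually emitted, so the use of `warnings.warn`, the message text, the boolean return value, and the `SilentSocket` stand-in are all assumptions made for illustration; `receive_event` is re-implemented so the example runs on its own.

```python
import asyncio
import json
import warnings


class SilentSocket:
    """Hypothetical stand-in whose recv() never produces an event in time."""

    async def recv(self):
        await asyncio.sleep(3600)
        return json.dumps({"type": "never.sent"})


async def receive_event(ws, timeout):
    """Assumed shape of the test file's helper: one event or TimeoutError."""
    try:
        raw = await asyncio.wait_for(ws.recv(), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise TimeoutError(f"no event within {timeout}s") from exc
    return json.loads(raw)


async def wait_for_session_updated(ws, timeout=5.0):
    """Return True if session.updated arrived, else warn and return False."""
    try:
        while True:
            event = await receive_event(ws, timeout=timeout)
            if event["type"] == "session.updated":
                return True
    except TimeoutError:
        # Visible in the test output, but does not fail the test.
        warnings.warn(
            "Timed out waiting for session.updated; continuing anyway.",
            RuntimeWarning,
            stacklevel=2,
        )
        return False


async def demo():
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        arrived = await wait_for_session_updated(SilentSocket(), timeout=0.05)
    return arrived, [str(w.message) for w in caught]
```

The trade-off is that a missing `session.updated` shows up in the test log without turning a slow session setup into a hard failure.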

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
AndreasKaratzas (Collaborator, Author) commented Feb 22, 2026

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) February 22, 2026 08:40
github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 22, 2026
@DarkLight1337 DarkLight1337 merged commit dd8c3a7 into vllm-project:main Feb 22, 2026
16 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 22, 2026
yugong333 pushed a commit to yugong333/vllm that referenced this pull request Feb 22, 2026
… delays (vllm-project#35052)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas deleted the akaratza_entrypoints_api_server_i branch February 22, 2026 21:21
jmamou pushed a commit to jmamou/vllm that referenced this pull request Feb 23, 2026
… delays (vllm-project#35052)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
llsj14 pushed a commit to llsj14/vllm that referenced this pull request Mar 1, 2026
… delays (vllm-project#35052)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Mar 4, 2026
… delays (vllm-project#35052)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
askliar pushed a commit to askliar/vllm that referenced this pull request Mar 9, 2026
… delays (vllm-project#35052)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andrii Skliar <askliar@nvidia.com>
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
… delays (vllm-project#35052)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Labels

ready (ONLY add when PR is ready to merge/full CI is needed), rocm (Related to AMD ROCm)

Projects

Status: Done

2 participants