[ROCm][CI] Fix realtime test timeouts caused by aiter JIT compilation delays#35052
Conversation
… delays Signed-off-by: Andreas Karatzas <akaratza@amd.com>
There was a problem hiding this comment.
Code Review
This pull request addresses intermittent test timeouts on ROCm builds by introducing a warm-up step and increasing timeouts to account for JIT compilation delays. The changes are logical and directly address the issue. However, I've found a recurring issue in the implementation of waiting for the session.updated event. The current try...except TimeoutError: pass pattern is duplicated in three places and can hide bugs or lead to flaky tests by silently ignoring timeouts. My review comments suggest a more robust implementation that explicitly fails the test on timeout, which aligns with the goal of preventing race conditions.
| try: | ||
| while True: | ||
| event = await receive_event(ws, timeout=5.0) | ||
| if event["type"] == "session.updated": | ||
| break | ||
| except TimeoutError: | ||
| pass |
There was a problem hiding this comment.
The try...except TimeoutError: pass block is problematic. It will silently ignore a timeout if the session.updated event is not received within 5 seconds. This defeats the purpose of waiting for the event to avoid race conditions, as stated in the pull request description. If a timeout occurs, the test will proceed, potentially leading to flaky failures or hiding underlying issues.
To make the test more robust, the TimeoutError should be handled by explicitly failing the test. This ensures that the absence of the session.updated event is caught and reported. This pattern is repeated elsewhere in the file and should be fixed in all locations.
| try: | |
| while True: | |
| event = await receive_event(ws, timeout=5.0) | |
| if event["type"] == "session.updated": | |
| break | |
| except TimeoutError: | |
| pass | |
| try: | |
| while True: | |
| event = await receive_event(ws, timeout=5.0) | |
| if event["type"] == "session.updated": | |
| break | |
| except TimeoutError: | |
| pytest.fail("Timed out waiting for session.updated event.") |
| try: | ||
| while True: | ||
| event = await receive_event(ws, timeout=5.0) | ||
| if event["type"] == "session.updated": | ||
| break | ||
| except TimeoutError: | ||
| pass |
There was a problem hiding this comment.
As with the previous occurrence, this try...except TimeoutError: pass block can lead to flaky tests or hide bugs by not ensuring the session.updated event is received. The test should explicitly fail if a timeout occurs to ensure reliability.
| try: | |
| while True: | |
| event = await receive_event(ws, timeout=5.0) | |
| if event["type"] == "session.updated": | |
| break | |
| except TimeoutError: | |
| pass | |
| try: | |
| while True: | |
| event = await receive_event(ws, timeout=5.0) | |
| if event["type"] == "session.updated": | |
| break | |
| except TimeoutError: | |
| pytest.fail("Timed out waiting for session.updated event.") |
| try: | ||
| while True: | ||
| event = await receive_event(ws, timeout=5.0) | ||
| if event["type"] == "session.updated": | ||
| break | ||
| except TimeoutError: | ||
| pass |
There was a problem hiding this comment.
This is the third instance of the problematic try...except TimeoutError: pass pattern. To ensure test reliability and prevent race conditions, the test should fail explicitly if the session.updated event is not received within the timeout period.
| try: | |
| while True: | |
| event = await receive_event(ws, timeout=5.0) | |
| if event["type"] == "session.updated": | |
| break | |
| except TimeoutError: | |
| pass | |
| try: | |
| while True: | |
| event = await receive_event(ws, timeout=5.0) | |
| if event["type"] == "session.updated": | |
| break | |
| except TimeoutError: | |
| pytest.fail("Timed out waiting for session.updated event.") |
There was a problem hiding this comment.
Good suggestion. I would not fail the test necessarily, but I added a warning.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… delays (vllm-project#35052) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… delays (vllm-project#35052) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… delays (vllm-project#35052) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… delays (vllm-project#35052) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
… delays (vllm-project#35052) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Andrii Skliar <askliar@nvidia.com>
… delays (vllm-project#35052) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
test_multi_chunk_streamingandtest_empty_commit_does_not_crash_engineinentrypoints/openai/test_realtime_validation.py(Entrypoints Integration Test - API Server 1) intermittently fail withTimeoutErroron ROCm builds.The root cause is that aiter modules are JIT-compiled on the first inference request. The original test timeouts (30-60s) are insufficient when this compilation happens during the test's critical path.
Fix
test_multi_chunk_streaming: send a small audio chunk before the real transcription to absorb JIT compilation latency, with a generous 360s timeout.test_empty_commit_does_not_crash_enginefrom 30s to 360s since the empty commit triggers the first inference (and thus JIT compilation).session.updatedaftersession.updateto avoid racing the server's session setup.input_audio_buffer.commitbefore sending audio, as it is required by the protocol to start a transcription session.