[CI] Stabilize multinode DP internal LB completion tests #36356

njhill merged 3 commits into vllm-project:main
Conversation
…ponses at temperature 1 Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request effectively addresses a flaky test issue by correctly handling empty model responses when finish_reason is 'stop'. The refactoring into _make_completion_request and _run_request_bursts helper functions significantly improves code readability and maintainability by removing duplication. The changes are well-justified and implemented correctly. I have one suggestion to further improve the robustness of the new test helper.
results = await asyncio.gather(*all_tasks)
assert len(results) == num_requests, (
    f"Burst {burst}: expected {num_requests} results, got {len(results)}"
)
assert all(completion is not None for completion in results), (
    f"Burst {burst}: some completions were None"
)
Using asyncio.gather without return_exceptions=True can lead to unhandled exceptions and resource leaks if one of the tasks fails. When a task in gather raises an exception, gather propagates that exception immediately, and other tasks might not be cancelled, potentially continuing to run in the background. This can affect the stability of subsequent tests in the suite.
By setting return_exceptions=True, gather will wait for all tasks to complete and return exceptions as results. You can then explicitly check for and handle any exceptions, ensuring a cleaner test shutdown. This improves test robustness.
results = await asyncio.gather(*all_tasks, return_exceptions=True)
assert len(results) == num_requests, (
f"Burst {burst}: expected {num_requests} results, got {len(results)}"
)
# Raise any exceptions that were caught
for result in results:
if isinstance(result, BaseException):
raise result
assert all(completion is not None for completion in results), (
f"Burst {burst}: some completions were None"
)
Using return_exceptions=True and re-raising to ensure clean task shutdown. Done :)
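A minimal standalone sketch of the behavior the review comment describes (hypothetical `ok`/`boom` coroutines, not from the PR): with `return_exceptions=True`, `asyncio.gather` waits for every task and hands back exceptions as values, so the caller can check and re-raise them explicitly instead of having the first failure abort the gather mid-flight.

```python
import asyncio

async def ok() -> str:
    return "ok"

async def boom() -> str:
    raise ValueError("task failed")

async def main() -> list:
    # With return_exceptions=True, gather does not propagate the first
    # exception; it completes all tasks and returns exceptions as results.
    return await asyncio.gather(ok(), boom(), ok(), return_exceptions=True)

results = asyncio.run(main())
# All three tasks produced a result; the failure is an entry in the list
# that the test can inspect and re-raise on its own terms.
print(results)
```

Without `return_exceptions=True`, the same `gather` would raise `ValueError` as soon as `boom()` fails, leaving the sibling tasks' outcomes unchecked.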
…t#36356) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Fixes flaky `test_api_only_multinode_dp_completion` and `test_multinode_dp_completion`, which intermittently fail with empty model responses during concurrent load balancer testing.

Motivation

These tests intentionally use `temperature=1.0` to produce diverse outputs across 200 concurrent requests for realistic load balancer distribution testing. However, at `temperature=1.0` the model can legitimately emit a stop token as its very first token, producing `text=''` with `finish_reason='stop'`. Over 400 requests (two bursts of 200), the probability of at least one empty response is high. Rather than changing to `temperature=0.0` (which would undermine the test's intent of exercising load balancing with diverse requests), the fix tolerates this valid edge case: when `finish_reason='stop'`, empty text is accepted. The non-empty text assertion is only enforced when `finish_reason='length'`.

- `_make_completion_request` helper: extracted the duplicated `make_request()` closure from both non-streaming completion tests into a shared module-level function with diagnostic assertion messages that print actual values on failure.
- `_run_request_bursts` helper: extracted the duplicated two-burst loop pattern (create tasks -> gather -> validate -> sleep) shared by both non-streaming tests.
- Streaming tests unchanged: they already use `temperature=0.0` and have adequate assertions.

cc @kenroche
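The tolerant assertion described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual `_make_completion_request` helper; the real function's signature and messages may differ.

```python
def check_completion_text(text: str, finish_reason: str) -> None:
    """Validate one completion, tolerating the empty-text edge case.

    At temperature=1.0 the model may emit a stop token as its very
    first token, yielding text='' with finish_reason='stop'. That is a
    valid response, so non-empty text is only required when the
    generation was cut off by the length limit.
    """
    assert finish_reason in ("stop", "length"), (
        f"unexpected finish_reason={finish_reason!r}"
    )
    if finish_reason == "length":
        # Truncated-by-length responses must contain generated text.
        assert text, f"empty text with finish_reason={finish_reason!r}"

check_completion_text("", "stop")        # accepted edge case
check_completion_text("hello", "length")  # normal truncated response
```

An empty response with `finish_reason='length'` would still fail, preserving the test's ability to catch genuinely broken generations.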