[CI] Stabilize multinode DP internal LB completion tests #36356

njhill merged 3 commits into vllm-project:main
Conversation
…ponses at temperature 1 Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request effectively addresses a flaky test issue by correctly handling empty model responses when finish_reason is 'stop'. The refactoring into _make_completion_request and _run_request_bursts helper functions significantly improves code readability and maintainability by removing duplication. The changes are well-justified and implemented correctly. I have one suggestion to further improve the robustness of the new test helper.
results = await asyncio.gather(*all_tasks)
assert len(results) == num_requests, (
    f"Burst {burst}: expected {num_requests} results, got {len(results)}"
)
assert all(completion is not None for completion in results), (
    f"Burst {burst}: some completions were None"
)
Using asyncio.gather without return_exceptions=True can lead to unhandled exceptions and resource leaks if one of the tasks fails. When a task in gather raises an exception, gather propagates that exception immediately, and other tasks might not be cancelled, potentially continuing to run in the background. This can affect the stability of subsequent tests in the suite.
By setting return_exceptions=True, gather will wait for all tasks to complete and return exceptions as results. You can then explicitly check for and handle any exceptions, ensuring a cleaner test shutdown. This improves test robustness.
results = await asyncio.gather(*all_tasks, return_exceptions=True)
assert len(results) == num_requests, (
f"Burst {burst}: expected {num_requests} results, got {len(results)}"
)
# Raise any exceptions that were caught
for result in results:
if isinstance(result, BaseException):
raise result
assert all(completion is not None for completion in results), (
f"Burst {burst}: some completions were None"
)
Using return_exceptions=True and re-raising to ensure clean task shutdown. Done :)
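A minimal standalone sketch of the behavior the review comment describes (hypothetical `ok`/`boom` coroutines, not from the PR): with `return_exceptions=True`, `asyncio.gather` waits for every task and hands back exceptions as values, so the caller can check and re-raise them explicitly instead of having the first failure abort the gather mid-flight.

```python
import asyncio

async def ok() -> str:
    return "ok"

async def boom() -> str:
    raise ValueError("task failed")

async def main() -> list:
    # With return_exceptions=True, gather does not propagate the first
    # exception; it completes all tasks and returns exceptions as results.
    return await asyncio.gather(ok(), boom(), ok(), return_exceptions=True)

results = asyncio.run(main())
# All three tasks produced a result; the failure is an entry in the list
# that the test can inspect and re-raise on its own terms.
print(results)
```

Without `return_exceptions=True`, the same `gather` would raise `ValueError` as soon as `boom()` fails, leaving the sibling tasks' outcomes unchecked.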
…t#36356) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Fixes flaky `test_api_only_multinode_dp_completion` and `test_multinode_dp_completion`, which intermittently fail with empty model responses during concurrent load balancer testing.

Motivation

These tests intentionally use `temperature=1.0` to produce diverse outputs across 200 concurrent requests for realistic load balancer distribution testing. However, at `temperature=1.0` the model can legitimately emit a stop token as its very first token, producing `text=''` with `finish_reason='stop'`. Over 400 requests (two bursts of 200), the probability of at least one empty response is high. Rather than changing to `temperature=0.0` (which would undermine the test's intent of exercising load balancing with diverse requests), the fix tolerates this valid edge case: when `finish_reason='stop'`, empty text is accepted. The non-empty text assertion is only enforced when `finish_reason='length'`.

- `_make_completion_request` helper: extracted the duplicated `make_request()` closure from both non-streaming completion tests into a shared module-level function with diagnostic assertion messages that print actual values on failure.
- `_run_request_bursts` helper: extracted the duplicated two-burst loop pattern (create tasks -> gather -> validate -> sleep) shared by both non-streaming tests.
- Streaming tests unchanged: they already use `temperature=0.0` and have adequate assertions.

cc @kenroche
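The tolerant assertion described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual `_make_completion_request` helper; the real function's signature and messages may differ.

```python
def check_completion_text(text: str, finish_reason: str) -> None:
    """Validate one completion, tolerating the empty-text edge case.

    At temperature=1.0 the model may emit a stop token as its very
    first token, yielding text='' with finish_reason='stop'. That is a
    valid response, so non-empty text is only required when the
    generation was cut off by the length limit.
    """
    assert finish_reason in ("stop", "length"), (
        f"unexpected finish_reason={finish_reason!r}"
    )
    if finish_reason == "length":
        # Truncated-by-length responses must contain generated text.
        assert text, f"empty text with finish_reason={finish_reason!r}"

check_completion_text("", "stop")        # accepted edge case
check_completion_text("hello", "length")  # normal truncated response
```

An empty response with `finish_reason='length'` would still fail, preserving the test's ability to catch genuinely broken generations.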