
[CI][MCP][Harmony] Heavy refactoring Harmony & MCP response tests and stabilizing with deterministic test infrastructure #33949

Merged: vllm-bot merged 19 commits into vllm-project:main from ROCm:akaratza_refactor_entrypoints_response_api on Feb 21, 2026
Conversation

@AndreasKaratzas (Collaborator) commented Feb 5, 2026:

This PR eliminates systemic test flakiness in the Harmony and MCP Responses API integration tests by addressing root causes in both the test infrastructure and source code. The core problem was that tests asserted on non-deterministic LLM output using @pytest.mark.flaky, which corrupted server fixture lifecycles and masked real failures. This PR replaces that pattern with deterministic infrastructure: pinned system prompts, API-level retries, and a clear separation between server invariants (hard assertions) and model behaviour (soft/xfail).

Motivation

The following tests were consistently flaky in CI:

  • test_mcp_tool_env_flag_enabled
  • test_mcp_tool_with_allowed_tools_star
  • test_mcp_tool_calling_streaming_types
  • test_mcp_code_interpreter_streaming
  • test_system_prompt_override
  • test_function_calling_multi_turn

Root causes identified:

  • @pytest.mark.flaky re-ran entire tests including scope="module" server fixtures, causing port conflicts and zombie processes
  • No temperature=0.0 on tool-calling tests, making model output non-deterministic
  • System prompt date changes daily (datetime.now()), altering token sequences and shifting model behaviour at temperature=0.0
  • Tests mixed server invariants with model behaviour in the same assertions
  • Duplicate test logic across files with no shared infrastructure

Source changes

vllm/envs.py

  • Registered VLLM_GPT_OSS_SYSTEM_START_DATE --- new optional env var (default None) that pins the conversation start date in the Harmony system message. When unset, production behaviour is unchanged (datetime.now() is used). This directly addresses the non-determinism comment already present in the source: # NOTE(woosuk): This brings non-determinism in vLLM. Be careful.
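The registration can be sketched roughly as follows --- a minimal sketch assuming the usual lazy-getter style of vllm/envs.py; read_env is a hypothetical helper for illustration, not the real accessor:

```python
import os

# Minimal sketch of registering an optional env var in the style of
# vllm/envs.py: a lazy getter that returns None when the variable is unset,
# so the datetime.now() fallback remains the default production behaviour.
# NOTE: read_env is a hypothetical helper for illustration only.
environment_variables = {
    "VLLM_GPT_OSS_SYSTEM_START_DATE": lambda: os.environ.get(
        "VLLM_GPT_OSS_SYSTEM_START_DATE"
    ),
}

def read_env(name: str):
    """Resolve a registered env var at call time (not at import time)."""
    return environment_variables[name]()
```

Because the getter is evaluated lazily, the test fixtures can set the variable per server process without affecting anything resolved at import time.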

vllm/entrypoints/openai/parser/harmony_utils.py

  • Consumed VLLM_GPT_OSS_SYSTEM_START_DATE in get_system_message --- checks the env var before falling back to datetime.now(), eliminating the flagged non-determinism when the var is set
  • Derived MCP_BUILTIN_TOOLS from _BUILTIN_TOOL_TO_MCP_SERVER_LABEL --- these two data structures were defined independently and could drift apart silently. MCP_BUILTIN_TOOLS is now set(_BUILTIN_TOOL_TO_MCP_SERVER_LABEL.values()), making the relationship enforced by construction
  • Collapsed duplicated commentary/analysis branches in parse_remaining_state --- two identical code blocks reduced to a single if parser.current_channel in ("commentary", "analysis") branch
  • Added logger.warning on JSON decode failure in _parse_browser_tool_call --- previously, invalid JSON was silently stuffed into data fields (query, url, pattern) with no logging, making parsing bugs invisible in production

Test infrastructure changes

tests/entrypoints/openai/responses/conftest.py

  • Added BASE_TEST_ENV --- shared dict with VLLM_GPT_OSS_SYSTEM_START_DATE pinned to "2023-09-12". All server fixtures include this to ensure identical system prompt tokens across runs
  • Added retry_for_tool_call() --- calls client.responses.create up to max_retries times, returning the first response containing the expected tool type. Replaces @pytest.mark.flaky without restarting server fixtures
  • Added retry_streaming_for() --- same pattern for streaming responses, accepting a validate_events callable to check whether the event stream meets expectations
  • Added has_output_type() --- predicate checking if a response contains an output item of a given type
  • Added events_contain_type() --- predicate checking if any event in a list contains a type substring
  • Added validate_streaming_event_stack() --- validates that streaming events are properly nested/paired using a stack-based algorithm against the pairs_of_event_types fixture
  • Added log_response_diagnostics() --- extracts reasoning, tool call attempts, MCP items, and output types from a response, logs them as structured JSON via logger.info, and returns the data dict for optional further assertion. Visible with pytest -s or --log-cli-level=INFO
  • Added pairs_of_event_types fixture --- maps done event types to their corresponding start event types for streaming validation
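The retry-and-predicate pattern can be sketched as follows --- signatures are inferred from the description above, and the real conftest.py helpers may differ:

```python
# Hedged sketch of the conftest helpers. The key property: only the API
# call is retried, never the server fixture, so ports and process state
# are untouched across attempts.

def has_output_type(response, output_type: str) -> bool:
    """True if any output item of the response has the given type."""
    return any(
        getattr(item, "type", None) == output_type for item in response.output
    )

async def retry_for_tool_call(
    client, *, model: str, expected_tool_type: str, max_retries: int = 3, **kwargs
):
    """Re-issue the API call until the expected tool type appears,
    returning the last response either way."""
    response = None
    for _ in range(max_retries):
        response = await client.responses.create(model=model, **kwargs)
        if has_output_type(response, expected_tool_type):
            break
    return response
```

A caller then asserts hard on the returned response, with all non-determinism confined to the retry loop.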

Test file changes

test_mcp_tools.py

  • Removed all @pytest.mark.flaky decorators --- replaced by retry_for_tool_call / retry_streaming_for which retry at the API level without touching server fixtures
  • Restructured into three test classes: TestMCPToolServerUnit (no server), TestMCPEnabled (MCP env flag set), TestMCPDisabled (MCP env flag unset). Each integration class owns its own scope="class" server fixture, preventing fixture lifecycle interference
  • Added test_builtin_tools_consistency --- unit test asserting MCP_BUILTIN_TOOLS == set(_BUILTIN_TOOL_TO_MCP_SERVER_LABEL.values()), catching mapping drift at test time
  • All integration tests now use temperature=0.0 --- eliminates sampling randomness as a flakiness source
  • All fixtures include BASE_TEST_ENV --- pinned system prompt date
  • Merged duplicate disabled tests --- test_mcp_tool_env_flag_disabled and test_mcp_disabled_model_does_not_attempt_tool_call were asserting the same invariant with the same setup. Consolidated into test_mcp_disabled_server_does_not_execute
  • Removed the xfail test for model not attempting tool calls --- when MCP is disabled, the tool description can still be in the prompt, so the model will consistently attempt tool calls. The old xfail(strict=False) could never XPASS by design. The actual invariant (server doesn't execute) is covered by the merged test
  • Added log_response_diagnostics calls --- every integration test now logs full model reasoning, tool call attempts, and output types for CI visibility
  • Streaming test restores per-event lifecycle assertions --- each MCP streaming event type (mcp_call.in_progress, mcp_call_arguments.delta, mcp_call_arguments.done, mcp_call.completed) is asserted individually with failure messages that dump all seen event types
  • Imports moved to file top --- json and logging imports that were previously inline in test methods
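The per-event lifecycle checks pair naturally with a stack-based nesting validator like the one added in conftest; a minimal sketch (the event names and the pairs mapping here are illustrative, not the actual fixture values):

```python
# Hedged sketch of a stack-based streaming-event nesting check: every
# closing event must match the most recent unclosed opening event, and
# everything opened must eventually close. Events that are neither
# openers nor closers (e.g. repeated delta events) are ignored.

def validate_streaming_event_stack(event_types, pairs):
    """pairs maps a closing event type to its expected opening event type."""
    stack = []
    openers = set(pairs.values())
    for event_type in event_types:
        if event_type in openers:
            stack.append(event_type)
        elif event_type in pairs:
            if not stack or stack.pop() != pairs[event_type]:
                return False
    return not stack  # everything opened must also be closed
```

On failure, a real test would dump all seen event types in the assertion message, as described above.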

test_harmony.py

  • Server fixture uses BASE_TEST_ENV --- pinned date
  • Split test_system_prompt_override into two tests:
    • test_system_prompt_override_no_duplication (hard) --- asserts exactly one system message in input_messages using raw dict access instead of Message.from_dict (which fails on the {"author": {"role": "system"}} format returned by enable_response_messages)
    • test_system_prompt_override_follows_personality (soft) --- xfail(strict=False) checking for pirate language keywords. XPASS means the model cooperated, XFAIL means it didn't --- neither blocks CI
  • Multi-turn function calling test uses pytest.xfail() runtime call --- when the model goes straight to answering instead of calling a second tool, the test calls pytest.xfail() at runtime. This reports as XFAIL (yellow) in CI instead of FAILED (red), while the happy path reports as PASSED
  • Added log_response_diagnostics calls to tool-calling tests
  • Streaming tests use retry_streaming_for instead of @pytest.mark.flaky

test_simple.py / test_parsable_context.py

  • Server fixtures include BASE_TEST_ENV --- consistency across all test files, even for non-gpt-oss models

Design principles

  1. Hard assertions for server invariants, soft assertions for model behaviour --- if the server should never do X (execute a disabled tool, duplicate a system message), assert hard. If the model might or might not do Y (use pirate language, call a second tool), use xfail or pytest.xfail()
  2. Retry at API level, not test level --- @pytest.mark.flaky restarts entire tests including fixtures. retry_for_tool_call retries only the API call, leaving the server fixture untouched
  3. Diagnostics on every run, not just failures --- log_response_diagnostics logs reasoning and tool call data on passing runs too, so behaviour changes are visible in CI logs before they cause failures
  4. Shared infrastructure in conftest --- retry helpers, event validation, diagnostics, and env pinning are defined once and imported everywhere. No copy-paste across test files
  5. Env var over monkey-patching --- the date pinning uses a registered env var consumed through the standard vllm/envs.py machinery rather than test-only hacks

Testing

All tests pass locally with the pinned date. The previously flaky tests have been run 10 times without failure: pytest -v -s tests/entrypoints/openai/responses

Expected logs (and verified on MI355X):

==================== warnings summary ====================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
  /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================== 68 passed, 1 skipped, 1 xpassed, 3 warnings in 385.92s (0:06:25) ====================

… prompt date

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@gemini-code-assist (bot) left a comment:

Code Review

This PR is a significant and well-executed refactoring to improve the stability and determinism of Harmony and MCP response tests. It addresses systemic flakiness by replacing @pytest.mark.flaky with a robust, deterministic test infrastructure, including API-level retries, pinned system prompts, and improved server lifecycle management. The changes are thorough, well-reasoned, and introduce valuable testing patterns and helpers. My review found one minor edge case in a new retry helper function. Overall, this is an excellent contribution that will greatly improve CI stability.


def get_weather(latitude, longitude):
    try:
        response = requests.get(
Contributor:

I've seen some weather flakes on CI; I wonder if we even need to call open-meteo for this test. Should we just mock the return?

Collaborator (author):

I mean, we are trying to test a tool call, so I think this function should be enough for the test; it's also stable thanks to the try/except (it will always return a value).

    stream=True,
    background=False,
)
stream = await client.responses.create(
Contributor:

nit: do we need to refactor this? Maybe someone will want to add another prompt in the future?

@AndreasKaratzas (Collaborator, author) commented Feb 8, 2026:

In the future, if anyone wants to add another prompt, this could become a list and a for loop or something, but doing that now would be a bit over-engineered, I think.

@@ -283,75 +350,41 @@ async def test_stateful_multi_turn(client: OpenAI, model_name: str):
async def test_streaming_types(
Contributor:

We could maybe remove this test; it's mostly covered by test_function_calling_with_streaming_types.

Collaborator (author):

This one does not test function calling with streaming types, so I think we should keep it.


@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_code_interpreter(client: OpenAI, model_name: str):
Contributor:

nit: can we move this back to its original location? The move doesn't seem necessary and churns the git blame.

Collaborator (author):

I can certainly do that :)

]

# Step 1: First call with the function tool
stream_response = await client.responses.create(
Contributor:

The original reason I implemented streaming was that response.input_messages wouldn't show up for non-streaming. I think that bug is fixed now; we should just do non-streaming and simplify the logic here :)

Collaborator (author):

Useful to know. Will commit that change :)

    client,
    *,
    model: str,
    expected_tool_type: str = "function_call",
Contributor:

nit: I think it's better to set this variable explicitly, so we shouldn't have a default value.

Collaborator (author):

You mean pass the default inside the function? Something like max_retries: int = 3?

Contributor:

I think it should be expected_tool_type: str, so the caller always has to set it explicitly.

Collaborator (author):

Ohh, yeah, didn't think about that. Will commit that.

sys_msg_content = sys_msg_content.with_reasoning_effort(
    REASONING_EFFORT[reasoning_effort]
)
if start_date is None:
@qandrew (Contributor) commented Feb 8, 2026:

This logic contradicts the lines below --- maybe remove it?

Collaborator (author):

Why does it contradict it? VLLM_GPT_OSS_SYSTEM_START_DATE defaults to None and is only set during CI runs, unless someone sets it manually.

Contributor:

Ah, makes sense --- it was a bit confusing to see if start_date is None: repeated twice.

can we do something like this?

if start_date is None:
    # NOTE(woosuk): This brings non-determinism in vLLM. Be careful.
    start_date = VLLM_GPT_OSS_SYSTEM_START_DATE or datetime.datetime.now().strftime("%Y-%m-%d")

Collaborator (author):

That's another thing I didn't think of doing 😅 I was aiming for a more declarative code block, but your proposal keeps it clean. Will commit the proposed change, yes :)

Contributor:

Thanks! Actually, should we rename it to VLLM_SYSTEM_START_DATE? This could probably be extended to other models in the future.

@AndreasKaratzas (Collaborator, author) commented Feb 8, 2026:

Good idea. Will CC others as well for this. Do you guys agree on that?

@robertgshaw2-redhat @DarkLight1337 @NickLucche

Collaborator (author):

I'm going to go with this idea, since there is no other similar variable; if at some point there is a need for a new tailored one, we can modify this one at that time.

@qandrew (Contributor) left a comment:

Thanks for putting this together!

@AndreasKaratzas (Collaborator, author):

> thanks for putting this together!

Thank you for your review :) Lmk if my responses resonate well with you :)

@AndreasKaratzas (Collaborator, author) commented Feb 10, 2026:

I have pushed a commit that addresses any pending review points.

harmony_utils.py:

  • Added .with_recipient("assistant") to the tool message branch in _parse_chat_format_message, aligning it with parse_chat_input_to_harmony_message behavior

serving_responses.py (_construct_input_messages_with_harmony):

  • Added guard to skip empty string input when previous_input_messages supplies the full conversation history
  • Added clarifying FIXME comment on the existing chain-of-thought removal block (currently a no-op --- slices messages out then re-appends them all unfiltered; may be intentionally deferred per the existing FIXME or redundant if the Harmony encoder handles analysis stripping at render time)

@bbrowning (Contributor):

I gave this a once-over, focusing on the correctness of Harmony format following, and the changes to harmony_utils.py and serving.py look reasonable. I attempted to run the changed tests on my own machine, but I'm hitting what look like unrelated issues with gpt-oss models on my DGX Spark. I'll trust CI to tell us that this works well on the known-good hardware.

What's the best way to keep track of how this improves the pass rate of CI tests before/after this change?

@qandrew (Contributor) left a comment:

LGTM, thanks! cc @bbrowning if you want to merge it.

@AndreasKaratzas (Collaborator, author):

> I gave this a once-over, focusing on the correctness of Harmony format following and the changes to harmony_utils.py and serving.py look reasonable. I attempted to run the changed tests on my own machine, but am hitting what looks like some likely unrelated issues with gpt-oss models on my DGX Spark. I'll trust CI to tell us that this works well on the known-good hardware.

Let's see how the test runs on CI :) I tested the entire test group 50 times in a loop on an MI355 machine without a single failure, so it should also work on NVIDIA.

> What's the best way to keep track of how this improves the pass rate of CI tests before/after this change?

Well, on ROCm it fails daily, and on NVIDIA it fails sometimes, but there is a Buildkite statistic for each test group (which I don't know how to find). I'll let @robertgshaw2-redhat and @khluu comment on that part as well :)

…rypoints_response_api

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@SageMoore (Contributor) left a comment:

This looks reasonable @AndreasKaratzas. Good find. Let's make sure these tests are green in CI and we should be good.

@AndreasKaratzas (Collaborator, author):

@SageMoore Thanks for approving this :) Yes, this PR has already had a fully green run, but with recent merges there are these failures. I'm pretty sure these failures are known and currently being triaged on the NVIDIA nightly CI, right?

@DarkLight1337 (Member) left a comment:

Stamping to fix CI

@github-project-automation github-project-automation bot moved this from To Triage to Ready in gpt-oss Issues & Enhancements Feb 21, 2026
@vllm-bot vllm-bot merged commit 991d6bf into vllm-project:main Feb 21, 2026
49 of 51 checks passed
@dosubot (bot) commented Feb 21, 2026:

Related Documentation

Checked 0 published document(s) in 1 knowledge base(s). No updates required.


@AndreasKaratzas AndreasKaratzas deleted the akaratza_refactor_entrypoints_response_api branch February 21, 2026 04:13
DarkLight1337 pushed a commit to DarkLight1337/vllm that referenced this pull request Feb 21, 2026
… stabilizing with deterministic test infrastructure (vllm-project#33949)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
markmc pushed a commit that referenced this pull request Mar 13, 2026
Recent PR #33949 changed the teardown logic of the RemoteVLLMServer test utility class to
send SIGTERM to all vllm (sub)processes at once, which breaks the clean/coordinated
shutdown logic that assumes only the top-level process will receive a signal (for example
when running in a container that's shut down).

This caused a bunch of errors and stacktraces in some test logs, even though those tests
still pass. We should still attempt a normal shutdown and only kill other procs if they are
still running after a few seconds.

Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming

Signed-off-by: Nick Hill <nickhill123@gmail.com>

Labels

frontend, gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge / full CI is needed)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants