[CI][MCP][Harmony] Heavy refactoring of Harmony & MCP response tests and stabilizing with deterministic test infrastructure #33949
Conversation
… prompt date Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This PR is a significant and well-executed refactoring to improve the stability and determinism of Harmony and MCP response tests. It addresses systemic flakiness by replacing @pytest.mark.flaky with a robust, deterministic test infrastructure, including API-level retries, pinned system prompts, and improved server lifecycle management. The changes are thorough, well-reasoned, and introduce valuable testing patterns and helpers. My review found one minor edge case in a new retry helper function. Overall, this is an excellent contribution that will greatly improve CI stability.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```python
def get_weather(latitude, longitude):
    try:
        response = requests.get(
```
I've seen some weather flakes on CI; I wonder if we even need to call open-meteo for this test. Should we just mock the return value?
I mean we are trying to test a tool call, so I think this function should be enough for the test, and it's stable thanks to the try/except (it will always return a value).
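If mocking ever becomes preferable, one option is to inject the HTTP fetch so tests can stub it out. This is a hypothetical sketch (the `fetch` parameter and the response shape are assumptions, not the actual test code):

```python
from unittest.mock import Mock

def get_weather(latitude, longitude, fetch):
    # `fetch` performs the real open-meteo HTTP call in production;
    # injecting it lets a test replace the network with a stub.
    try:
        return fetch(latitude, longitude)["current"]["temperature_2m"]
    except Exception:
        # The try/except keeps the tool stable: it always returns a value.
        return 0.0

# Deterministic call with the network mocked out:
mock_fetch = Mock(return_value={"current": {"temperature_2m": 21.5}})
print(get_weather(52.52, 13.41, fetch=mock_fetch))  # 21.5
```

The fallback branch also covers the CI flake scenario: a network failure degrades to `0.0` instead of raising.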
```python
    stream=True,
    background=False,
)
stream = await client.responses.create(
```
nit: do we need to refactor this? maybe someone will want to add another prompt in the future?
In the future, if anyone wants to add another prompt, this could become a list and a for loop, but doing it now would be a bit over-engineered I think.
```diff
@@ -283,75 +350,41 @@ async def test_stateful_multi_turn(client: OpenAI, model_name: str):
async def test_streaming_types(
```
We could maybe remove this test; it's mostly covered by test_function_calling_with_streaming_types.
This does not test function calling with streaming types, so I think we could keep that.
```python
@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_code_interpreter(client: OpenAI, model_name: str):
```
nit: can we move this back to its original location? The move doesn't seem necessary and churns git blame.
I can certainly do that :)
```python
]

# Step 1: First call with the function tool
stream_response = await client.responses.create(
```
The original reason I implemented streaming was that response.input_messages wouldn't show up for non-streaming responses. I think that bug is fixed now, so we should just use non-streaming and simplify the logic here :)
Useful to know. Will commit that change :)
```python
    client,
    *,
    model: str,
    expected_tool_type: str = "function_call",
```
nit: I think it's better to set this variable explicitly, so we shouldn't have a default value.
You mean pass the default inside the function? Something like max_retries: int = 3?
I think it should be expected_tool_type: str, so the caller always has to set it explicitly.
Ohh, yeah, didn't think about that. Will commit that.
```python
sys_msg_content = sys_msg_content.with_reasoning_effort(
    REASONING_EFFORT[reasoning_effort]
)
if start_date is None:
```
this logic contradicts the lines below, maybe remove?
Why does it contradict it? VLLM_GPT_OSS_SYSTEM_START_DATE is defaulted to None, and is set during CI runs, except if someone sets it manually.
ah makes sense, it was a bit confusing to see `if start_date is None:` repeated twice.
can we do something like this?

```python
if start_date is None:
    # NOTE(woosuk): This brings non-determinism in vLLM. Be careful.
    start_date = VLLM_GPT_OSS_SYSTEM_START_DATE or datetime.datetime.now().strftime("%Y-%m-%d")
```
That's another thing I didn't think of doing 😅 I was going for a more declarative code block with my change, but your proposal keeps it clean. Will commit the proposed change, yes :)
thanks! actually should we rename it to be VLLM_SYSTEM_START_DATE? this can probably be extended to other models in the future
Good idea. Will CC others as well for this. Do you guys agree on that?
I'm going to go with this idea since there is no other similar variable, and if there is at some point a need for a new tailored one, then we can modify this one too at that time.
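For reference, the agreed fallback pattern can be sketched as a self-contained snippet (the real code reads the env var through vllm/envs.py; `resolve_start_date` is an illustrative name, not the actual function):

```python
import datetime
import os

# Stand-in for the vllm/envs.py lookup; None unless CI pins a date.
VLLM_GPT_OSS_SYSTEM_START_DATE = os.environ.get("VLLM_GPT_OSS_SYSTEM_START_DATE")

def resolve_start_date(start_date=None):
    if start_date is None:
        # NOTE(woosuk): This brings non-determinism in vLLM. Be careful.
        start_date = VLLM_GPT_OSS_SYSTEM_START_DATE or datetime.datetime.now().strftime(
            "%Y-%m-%d"
        )
    return start_date
```

An explicitly passed date always wins; the env var only replaces the `datetime.now()` fallback, so production behaviour is unchanged when it is unset.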
qandrew left a comment
thanks for putting this together!
Thank you for your review :) Lmk if my responses resonate well with you :)
…rypoints_response_api
…sage parsing Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api
I have pushed a commit that addresses any pending review points.
I gave this a once-over, focusing on the correctness of Harmony format following, and the changes to harmony_utils.py and serving.py look reasonable. I attempted to run the changed tests on my own machine, but am hitting what look like unrelated issues with gpt-oss models on my DGX Spark. I'll trust CI to tell us that this works well on the known-good hardware. What's the best way to keep track of how this improves the pass rate of CI tests before/after this change?
qandrew left a comment
lgtm, thanks! cc @bbrowning if you wanna merge it
Let's see how the test runs on CI :) I tested the entire test group 50 times in a loop on an MI355 machine without a single failure, so it should also work on NVIDIA.
Well, on ROCm it fails daily and on NVIDIA it fails sometimes, but there is a Buildkite statistic for each test group (which I don't know how to find). I'll let @robertgshaw2-redhat and @khluu comment on that part as well :)
…rypoints_response_api Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api
…rypoints_response_api
SageMoore left a comment
This looks reasonable @AndreasKaratzas. Good find. Let's make sure these tests are green in CI and we should be good.
@SageMoore Thanks for approving this :) Yes, this PR has already had a fully green run, but with recent merges there are these failures. I'm pretty sure these failures are known and currently being triaged on the NVIDIA nightly CI, right?
…rypoints_response_api
… stabilizing with deterministic test infrastructure (vllm-project#33949) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Recent PR #33949 changed the teardown logic of the RemoteVLLMServer test utility class to send SIGTERM to all vllm (sub)processes at once, which breaks the clean/coordinated shutdown logic that assumes only the top-level process will receive a signal (for example when running in a container that's shut down). This caused a bunch of errors and stacktraces in some test logs, even though those tests still pass. We should still attempt a normal shutdown and only kill other procs if they are still running after a few seconds. Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming Signed-off-by: Nick Hill <nickhill123@gmail.com>
This PR eliminates systemic test flakiness in the Harmony and MCP Responses API integration tests by addressing root causes in both the test infrastructure and the source code. The core problem was that tests asserted on non-deterministic LLM output using `@pytest.mark.flaky`, which corrupted server fixture lifecycles and masked real failures. This PR replaces that pattern with deterministic infrastructure: pinned system prompts, API-level retries, and a clear separation between server invariants (hard assertions) and model behaviour (soft/xfail).

## Motivation
The following tests were consistently flaky in CI:
- `test_mcp_tool_env_flag_enabled`
- `test_mcp_tool_with_allowed_tools_star`
- `test_mcp_tool_calling_streaming_types`
- `test_mcp_code_interpreter_streaming`
- `test_system_prompt_override`
- `test_function_calling_multi_turn`

Root causes identified:

- `@pytest.mark.flaky` re-ran entire tests, including `scope="module"` server fixtures, causing port conflicts and zombie processes
- Missing `temperature=0.0` on tool-calling tests, making model output non-deterministic
- The system prompt embedded the current date (`datetime.now()`), altering token sequences and shifting model behaviour at `temperature=0.0`
## Source changes

`vllm/envs.py`:
- `VLLM_GPT_OSS_SYSTEM_START_DATE` --- new optional env var (default `None`) that pins the conversation start date in the Harmony system message. When unset, production behaviour is unchanged (`datetime.now()` is used). This directly addresses the non-determinism comment already present in the source: `# NOTE(woosuk): This brings non-determinism in vLLM. Be careful.`

`vllm/entrypoints/openai/parser/harmony_utils.py`:
- `VLLM_GPT_OSS_SYSTEM_START_DATE` in `get_system_message` --- checks the env var before falling back to `datetime.now()`, eliminating the flagged non-determinism when the var is set
- `MCP_BUILTIN_TOOLS` derived from `_BUILTIN_TOOL_TO_MCP_SERVER_LABEL` --- these two data structures were defined independently and could drift apart silently. `MCP_BUILTIN_TOOLS` is now `set(_BUILTIN_TOOL_TO_MCP_SERVER_LABEL.values())`, making the relationship enforced by construction
- Merged the `commentary`/`analysis` branches in `parse_remaining_state` --- two identical code blocks reduced to a single `if parser.current_channel in ("commentary", "analysis")` branch
- `logger.warning` on JSON decode failure in `_parse_browser_tool_call` --- previously, invalid JSON was silently stuffed into data fields (`query`, `url`, `pattern`) with no logging, making parsing bugs invisible in production
## Test infrastructure changes

`tests/entrypoints/openai/responses/conftest.py`:
- `BASE_TEST_ENV` --- shared dict with `VLLM_GPT_OSS_SYSTEM_START_DATE` pinned to `"2023-09-12"`. All server fixtures include this to ensure identical system prompt tokens across runs
- `retry_for_tool_call()` --- calls `client.responses.create` up to `max_retries` times, returning the first response containing the expected tool type. Replaces `@pytest.mark.flaky` without restarting server fixtures
- `retry_streaming_for()` --- same pattern for streaming responses, accepting a `validate_events` callable to check whether the event stream meets expectations
- `has_output_type()` --- predicate checking if a response contains an output item of a given type
- `events_contain_type()` --- predicate checking if any event in a list contains a type substring
- `validate_streaming_event_stack()` --- validates that streaming events are properly nested/paired using a stack-based algorithm against the `pairs_of_event_types` fixture
- `log_response_diagnostics()` --- extracts reasoning, tool call attempts, MCP items, and output types from a response, logs them as structured JSON via `logger.info`, and returns the data dict for optional further assertions. Visible with `pytest -s` or `--log-cli-level=INFO`
- `pairs_of_event_types` fixture --- maps done event types to their corresponding start event types for streaming validation
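A minimal sketch of the retry helper described above (the real signature and response shape in conftest.py may differ; the dict-shaped responses here are an assumption for illustration):

```python
import asyncio

def has_output_type(response, output_type):
    # Predicate: does any output item carry the expected type?
    return any(item.get("type") == output_type for item in response["output"])

async def retry_for_tool_call(client, *, model, expected_tool_type,
                              max_retries=3, **kwargs):
    # Retry only the API call, never the server fixture: re-issue
    # responses.create until the expected tool-call type shows up.
    response = None
    for _ in range(max_retries):
        response = await client.responses.create(model=model, **kwargs)
        if has_output_type(response, expected_tool_type):
            break
    return response  # the caller still asserts on the final attempt
```

The key property is that a miss costs one extra HTTP round-trip instead of a full server restart, which is what made `@pytest.mark.flaky` so destructive with module-scoped fixtures.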
## Test file changes

`test_mcp_tools.py`:
- Removed all `@pytest.mark.flaky` decorators --- replaced by `retry_for_tool_call`/`retry_streaming_for`, which retry at the API level without touching server fixtures
- Split into `TestMCPToolServerUnit` (no server), `TestMCPEnabled` (MCP env flag set), and `TestMCPDisabled` (MCP env flag unset). Each integration class owns its own `scope="class"` server fixture, preventing fixture lifecycle interference
- Added `test_builtin_tools_consistency` --- unit test asserting `MCP_BUILTIN_TOOLS == set(_BUILTIN_TOOL_TO_MCP_SERVER_LABEL.values())`, catching mapping drift at test time
- Set `temperature=0.0` --- eliminates sampling randomness as a flakiness source
- Applied `BASE_TEST_ENV` --- pinned system prompt date
- `test_mcp_tool_env_flag_disabled` and `test_mcp_disabled_model_does_not_attempt_tool_call` were asserting the same invariant with the same setup. Consolidated into `test_mcp_disabled_server_does_not_execute`
- Removed an `xfail(strict=False)` test that could never XPASS by design. The actual invariant (the server doesn't execute) is covered by the merged test
- Added `log_response_diagnostics` calls --- every integration test now logs full model reasoning, tool call attempts, and output types for CI visibility
- Each expected streaming event type (`mcp_call.in_progress`, `mcp_call_arguments.delta`, `mcp_call_arguments.done`, `mcp_call.completed`) is asserted individually, with failure messages that dump all seen event types
- Hoisted `json` and `logging` imports that were previously inline in test methods

`test_harmony.py`:
- Applied `BASE_TEST_ENV` --- pinned date
- Split `test_system_prompt_override` into two tests:
  - `test_system_prompt_override_no_duplication` (hard) --- asserts exactly one system message in `input_messages` using raw dict access instead of `Message.from_dict` (which fails on the `{"author": {"role": "system"}}` format returned by `enable_response_messages`)
  - `test_system_prompt_override_follows_personality` (soft) --- `xfail(strict=False)` checking for pirate language keywords. XPASS means the model cooperated, XFAIL means it didn't --- neither blocks CI
- `pytest.xfail()` runtime call --- when the model goes straight to answering instead of calling a second tool, the test calls `pytest.xfail()` at runtime. This reports as XFAIL (yellow) in CI instead of FAILED (red), while the happy path reports as PASSED
- Added `log_response_diagnostics` calls to tool-calling tests
- `retry_streaming_for` instead of `@pytest.mark.flaky`

`test_simple.py` / `test_parsable_context.py`:
- Applied `BASE_TEST_ENV` --- consistency across all test files, even for non-`gpt-oss` models
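The runtime-xfail mechanism can be sketched as follows; the helper name and test body are illustrative (only the `pytest.xfail()` call itself is the actual mechanism used):

```python
import pytest

def require_second_tool_call(made_second_call: bool):
    # Model behaviour is soft: if the model answers directly instead of
    # issuing a second tool call, report XFAIL (yellow) rather than FAILED
    # (red). The happy path falls through and the test reports PASSED.
    if not made_second_call:
        pytest.xfail("model answered directly instead of making a second tool call")
```

Unlike the decorator form, `pytest.xfail()` is raised at runtime, so the test can first perform its hard server-invariant assertions and only then downgrade the model-behaviour check.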
## Design principles

- Server invariants get hard assertions; model behaviour gets `xfail` or `pytest.xfail()`
- `@pytest.mark.flaky` restarts entire tests, including fixtures; `retry_for_tool_call` retries only the API call, leaving the server fixture untouched
- `log_response_diagnostics` logs reasoning and tool call data on passing runs too, so behaviour changes are visible in CI logs before they cause failures
- Determinism is wired through the standard `vllm/envs.py` machinery rather than test-only hacks
## Testing

All tests pass locally with the pinned date. The previously flaky tests have been run 10 times without failure:

```
pytest -v -s tests/entrypoints/openai/responses
```

Expected logs (and verified on MI355X):