[CI][MCP][Harmony] Heavy refactoring of Harmony & MCP response tests and stabilizing with deterministic test infrastructure #33949
Conversation
… prompt date Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This PR is a significant and well-executed refactoring to improve the stability and determinism of Harmony and MCP response tests. It addresses systemic flakiness by replacing @pytest.mark.flaky with a robust, deterministic test infrastructure, including API-level retries, pinned system prompts, and improved server lifecycle management. The changes are thorough, well-reasoned, and introduce valuable testing patterns and helpers. My review found one minor edge case in a new retry helper function. Overall, this is an excellent contribution that will greatly improve CI stability.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
```python
def get_weather(latitude, longitude):
    try:
        response = requests.get(
```
I've seen some weather flakes on CI; I wonder if we even need to call open-meteo for this test. Should we just mock the return value?
I mean we are trying to test a tool call, so I think this function should be enough for the test, and it's stable thanks to the try/except (it will always return a value).
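If mocking ever becomes preferable, one option is to inject the HTTP fetch so tests can stub it out. This is a hypothetical sketch (the `fetch` parameter and the response shape are assumptions, not the actual test code):

```python
from unittest.mock import Mock

def get_weather(latitude, longitude, fetch):
    # `fetch` performs the real open-meteo HTTP call in production;
    # injecting it lets a test replace the network with a stub.
    try:
        return fetch(latitude, longitude)["current"]["temperature_2m"]
    except Exception:
        # The try/except keeps the tool stable: it always returns a value.
        return 0.0

# Deterministic call with the network mocked out:
mock_fetch = Mock(return_value={"current": {"temperature_2m": 21.5}})
print(get_weather(52.52, 13.41, fetch=mock_fetch))  # 21.5
```

The fallback branch also covers the CI flake scenario: a network failure degrades to `0.0` instead of raising.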
```python
    stream=True,
    background=False,
)
stream = await client.responses.create(
```
nit: do we need to refactor this? maybe someone will want to add another prompt in the future?
In the future, if anyone wants to add another prompt, this could become a list and a for loop, but doing it now would be a bit over-engineered I think.
```diff
@@ -283,75 +350,41 @@ async def test_stateful_multi_turn(client: OpenAI, model_name: str):
async def test_streaming_types(
```
We could maybe remove this test; it's mostly covered by test_function_calling_with_streaming_types.
This does not test function calling with streaming types, so I think we could keep that.
```python
@pytest.mark.asyncio
@pytest.mark.parametrize("model_name", [MODEL_NAME])
async def test_code_interpreter(client: OpenAI, model_name: str):
```
nit: can we move this back to its original location? The move doesn't seem necessary and churns git blame.
I can certainly do that :)
```python
]

# Step 1: First call with the function tool
stream_response = await client.responses.create(
```
The original reason I implemented streaming was that response.input_messages wouldn't show up for non-streaming responses. I think that bug is fixed now, so we should just use non-streaming and simplify the logic here :)
Useful to know. Will commit that change :)
```python
    client,
    *,
    model: str,
    expected_tool_type: str = "function_call",
```
nit: I think it's better to set this variable explicitly, so we shouldn't have a default value.
You mean pass the default inside the function? Something like max_retries: int = 3?
I think it should be expected_tool_type: str, so the caller always has to set it explicitly.
Ohh, yeah, didn't think about that. Will commit that.
```python
sys_msg_content = sys_msg_content.with_reasoning_effort(
    REASONING_EFFORT[reasoning_effort]
)
if start_date is None:
```
this logic contradicts the lines below, maybe remove?
Why does it contradict it? VLLM_GPT_OSS_SYSTEM_START_DATE is defaulted to None, and is set during CI runs, except if someone sets it manually.
ah makes sense, it was a bit confusing to see `if start_date is None:` repeated twice.
can we do something like this?

```python
if start_date is None:
    # NOTE(woosuk): This brings non-determinism in vLLM. Be careful.
    start_date = VLLM_GPT_OSS_SYSTEM_START_DATE or datetime.datetime.now().strftime("%Y-%m-%d")
```
That's another thing I didn't think of doing 😅 I was going for a more declarative code block with my change, but your proposal keeps it clean. Will commit the proposed change, yes :)
thanks! actually should we rename it to be VLLM_SYSTEM_START_DATE? this can probably be extended to other models in the future
Good idea. Will CC others as well for this. Do you guys agree on that?
I'm going to go with this idea since there is no other similar variable, and if there is at some point a need for a new tailored one, then we can modify this one too at that time.
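For reference, the agreed fallback pattern can be sketched as a self-contained snippet (the real code reads the env var through vllm/envs.py; `resolve_start_date` is an illustrative name, not the actual function):

```python
import datetime
import os

# Stand-in for the vllm/envs.py lookup; None unless CI pins a date.
VLLM_GPT_OSS_SYSTEM_START_DATE = os.environ.get("VLLM_GPT_OSS_SYSTEM_START_DATE")

def resolve_start_date(start_date=None):
    if start_date is None:
        # NOTE(woosuk): This brings non-determinism in vLLM. Be careful.
        start_date = VLLM_GPT_OSS_SYSTEM_START_DATE or datetime.datetime.now().strftime(
            "%Y-%m-%d"
        )
    return start_date
```

An explicitly passed date always wins; the env var only replaces the `datetime.now()` fallback, so production behaviour is unchanged when it is unset.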
qandrew left a comment
thanks for putting this together!
Thank you for your review :) Lmk if my responses resonate well with you :)
…rypoints_response_api
…sage parsing Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api
I have pushed a commit that addresses any pending review points.
I gave this a once-over, focusing on the correctness of Harmony format following, and the changes to harmony_utils.py and serving.py look reasonable. I attempted to run the changed tests on my own machine, but am hitting what look like unrelated issues with gpt-oss models on my DGX Spark. I'll trust CI to tell us that this works well on the known-good hardware. What's the best way to keep track of how this improves the pass rate of CI tests before/after this change?
qandrew left a comment
lgtm, thanks! cc @bbrowning if you wanna merge it
Let's see how the test runs on CI :) I tested the entire test group 50 times in a loop on an MI355 machine without a single failure, so it should also work on NVIDIA.
Well, on ROCm it fails daily and on NVIDIA it fails sometimes, but there is a Buildkite statistic for each test group (which I don't know how to find). I'll let @robertgshaw2-redhat and @khluu comment on that part as well :)
…rypoints_response_api Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…rypoints_response_api
…rypoints_response_api
SageMoore left a comment
This looks reasonable @AndreasKaratzas. Good find. Let's make sure these tests are green in CI and we should be good.
@SageMoore Thanks for approving this :) Yes, this PR has already had a fully green run, but with recent merges there are these failures. I'm pretty sure these failures are known and currently being triaged on the NVIDIA nightly CI, right?
…rypoints_response_api
… stabilizing with deterministic test infrastructure (vllm-project#33949) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Recent PR #33949 changed the teardown logic of the RemoteVLLMServer test utility class to send SIGTERM to all vllm (sub)processes at once, which breaks the clean/coordinated shutdown logic that assumes only the top-level process will receive a signal (for example when running in a container that's shut down). This caused a bunch of errors and stacktraces in some test logs, even though those tests still pass. We should still attempt a normal shutdown and only kill other procs if they are still running after a few seconds. Example: tests/v1/distributed/test_external_lb_dp.py::test_external_lb_completion_streaming Signed-off-by: Nick Hill <nickhill123@gmail.com>
This PR eliminates systemic test flakiness in the Harmony and MCP Responses API integration tests by addressing root causes in both the test infrastructure and the source code. The core problem was that tests asserted on non-deterministic LLM output using `@pytest.mark.flaky`, which corrupted server fixture lifecycles and masked real failures. This PR replaces that pattern with deterministic infrastructure: pinned system prompts, API-level retries, and a clear separation between server invariants (hard assertions) and model behaviour (soft/xfail).

## Motivation
The following tests were consistently flaky in CI:
- `test_mcp_tool_env_flag_enabled`
- `test_mcp_tool_with_allowed_tools_star`
- `test_mcp_tool_calling_streaming_types`
- `test_mcp_code_interpreter_streaming`
- `test_system_prompt_override`
- `test_function_calling_multi_turn`

Root causes identified:

- `@pytest.mark.flaky` re-ran entire tests, including `scope="module"` server fixtures, causing port conflicts and zombie processes
- Missing `temperature=0.0` on tool-calling tests, making model output non-deterministic
- The system prompt embedded the current date (`datetime.now()`), altering token sequences and shifting model behaviour at `temperature=0.0`
## Source changes

`vllm/envs.py`:
- `VLLM_GPT_OSS_SYSTEM_START_DATE` --- new optional env var (default `None`) that pins the conversation start date in the Harmony system message. When unset, production behaviour is unchanged (`datetime.now()` is used). This directly addresses the non-determinism comment already present in the source: `# NOTE(woosuk): This brings non-determinism in vLLM. Be careful.`

`vllm/entrypoints/openai/parser/harmony_utils.py`:
- `VLLM_GPT_OSS_SYSTEM_START_DATE` in `get_system_message` --- checks the env var before falling back to `datetime.now()`, eliminating the flagged non-determinism when the var is set
- `MCP_BUILTIN_TOOLS` derived from `_BUILTIN_TOOL_TO_MCP_SERVER_LABEL` --- these two data structures were defined independently and could drift apart silently. `MCP_BUILTIN_TOOLS` is now `set(_BUILTIN_TOOL_TO_MCP_SERVER_LABEL.values())`, making the relationship enforced by construction
- Merged the `commentary`/`analysis` branches in `parse_remaining_state` --- two identical code blocks reduced to a single `if parser.current_channel in ("commentary", "analysis")` branch
- `logger.warning` on JSON decode failure in `_parse_browser_tool_call` --- previously, invalid JSON was silently stuffed into data fields (`query`, `url`, `pattern`) with no logging, making parsing bugs invisible in production
## Test infrastructure changes

`tests/entrypoints/openai/responses/conftest.py`:
- `BASE_TEST_ENV` --- shared dict with `VLLM_GPT_OSS_SYSTEM_START_DATE` pinned to `"2023-09-12"`. All server fixtures include this to ensure identical system prompt tokens across runs
- `retry_for_tool_call()` --- calls `client.responses.create` up to `max_retries` times, returning the first response containing the expected tool type. Replaces `@pytest.mark.flaky` without restarting server fixtures
- `retry_streaming_for()` --- same pattern for streaming responses, accepting a `validate_events` callable to check whether the event stream meets expectations
- `has_output_type()` --- predicate checking if a response contains an output item of a given type
- `events_contain_type()` --- predicate checking if any event in a list contains a type substring
- `validate_streaming_event_stack()` --- validates that streaming events are properly nested/paired using a stack-based algorithm against the `pairs_of_event_types` fixture
- `log_response_diagnostics()` --- extracts reasoning, tool call attempts, MCP items, and output types from a response, logs them as structured JSON via `logger.info`, and returns the data dict for optional further assertions. Visible with `pytest -s` or `--log-cli-level=INFO`
- `pairs_of_event_types` fixture --- maps done event types to their corresponding start event types for streaming validation
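A minimal sketch of the retry helper described above (the real signature and response shape in conftest.py may differ; the dict-shaped responses here are an assumption for illustration):

```python
import asyncio

def has_output_type(response, output_type):
    # Predicate: does any output item carry the expected type?
    return any(item.get("type") == output_type for item in response["output"])

async def retry_for_tool_call(client, *, model, expected_tool_type,
                              max_retries=3, **kwargs):
    # Retry only the API call, never the server fixture: re-issue
    # responses.create until the expected tool-call type shows up.
    response = None
    for _ in range(max_retries):
        response = await client.responses.create(model=model, **kwargs)
        if has_output_type(response, expected_tool_type):
            break
    return response  # the caller still asserts on the final attempt
```

The key property is that a miss costs one extra HTTP round-trip instead of a full server restart, which is what made `@pytest.mark.flaky` so destructive with module-scoped fixtures.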
## Test file changes

`test_mcp_tools.py`:
- Removed all `@pytest.mark.flaky` decorators --- replaced by `retry_for_tool_call`/`retry_streaming_for`, which retry at the API level without touching server fixtures
- Split into `TestMCPToolServerUnit` (no server), `TestMCPEnabled` (MCP env flag set), and `TestMCPDisabled` (MCP env flag unset). Each integration class owns its own `scope="class"` server fixture, preventing fixture lifecycle interference
- Added `test_builtin_tools_consistency` --- unit test asserting `MCP_BUILTIN_TOOLS == set(_BUILTIN_TOOL_TO_MCP_SERVER_LABEL.values())`, catching mapping drift at test time
- Set `temperature=0.0` --- eliminates sampling randomness as a flakiness source
- Applied `BASE_TEST_ENV` --- pinned system prompt date
- `test_mcp_tool_env_flag_disabled` and `test_mcp_disabled_model_does_not_attempt_tool_call` were asserting the same invariant with the same setup. Consolidated into `test_mcp_disabled_server_does_not_execute`
- Removed an `xfail(strict=False)` test that could never XPASS by design. The actual invariant (the server doesn't execute) is covered by the merged test
- Added `log_response_diagnostics` calls --- every integration test now logs full model reasoning, tool call attempts, and output types for CI visibility
- Each expected streaming event type (`mcp_call.in_progress`, `mcp_call_arguments.delta`, `mcp_call_arguments.done`, `mcp_call.completed`) is asserted individually, with failure messages that dump all seen event types
- Hoisted `json` and `logging` imports that were previously inline in test methods

`test_harmony.py`:
- Applied `BASE_TEST_ENV` --- pinned date
- Split `test_system_prompt_override` into two tests:
  - `test_system_prompt_override_no_duplication` (hard) --- asserts exactly one system message in `input_messages` using raw dict access instead of `Message.from_dict` (which fails on the `{"author": {"role": "system"}}` format returned by `enable_response_messages`)
  - `test_system_prompt_override_follows_personality` (soft) --- `xfail(strict=False)` checking for pirate language keywords. XPASS means the model cooperated, XFAIL means it didn't --- neither blocks CI
- `pytest.xfail()` runtime call --- when the model goes straight to answering instead of calling a second tool, the test calls `pytest.xfail()` at runtime. This reports as XFAIL (yellow) in CI instead of FAILED (red), while the happy path reports as PASSED
- Added `log_response_diagnostics` calls to tool-calling tests
- `retry_streaming_for` instead of `@pytest.mark.flaky`

`test_simple.py` / `test_parsable_context.py`:
- Applied `BASE_TEST_ENV` --- consistency across all test files, even for non-`gpt-oss` models
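The runtime-xfail mechanism can be sketched as follows; the helper name and test body are illustrative (only the `pytest.xfail()` call itself is the actual mechanism used):

```python
import pytest

def require_second_tool_call(made_second_call: bool):
    # Model behaviour is soft: if the model answers directly instead of
    # issuing a second tool call, report XFAIL (yellow) rather than FAILED
    # (red). The happy path falls through and the test reports PASSED.
    if not made_second_call:
        pytest.xfail("model answered directly instead of making a second tool call")
```

Unlike the decorator form, `pytest.xfail()` is raised at runtime, so the test can first perform its hard server-invariant assertions and only then downgrade the model-behaviour check.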
## Design principles

- Server invariants get hard assertions; model behaviour gets `xfail` or `pytest.xfail()`
- `@pytest.mark.flaky` restarts entire tests, including fixtures; `retry_for_tool_call` retries only the API call, leaving the server fixture untouched
- `log_response_diagnostics` logs reasoning and tool call data on passing runs too, so behaviour changes are visible in CI logs before they cause failures
- Determinism is wired through the standard `vllm/envs.py` machinery rather than test-only hacks
## Testing

All tests pass locally with the pinned date. The previously flaky tests have been run 10 times without failure:

```
pytest -v -s tests/entrypoints/openai/responses
```

Expected logs (and verified on MI355X):