[CI/Build] Add common tool call parser test suite#27599
[CI/Build] Add common tool call parser test suite#27599bbrowning wants to merge 6 commits intovllm-project:mainfrom
Conversation
bc9a4d0 to
3dfa4d8
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a valuable common test suite for tool call parsers, which will greatly improve testing consistency and help identify gaps in different parser implementations. The structure with a configuration dataclass and a test mixin is well-designed. My review focuses on strengthening some of the new common tests to make them more robust and comprehensive. Specifically, I've suggested improvements to test_various_data_types to validate parsed values and to test_streaming_reconstruction for a more complete comparison between streaming and non-streaming outputs.
|
I had tests for the mistral tool parser as part of this as well locally, but decided to wait on adding tests for that parser until #19425 lands since that PR adds an initial mistral tool parser test suite and I didn't want to cause a rebase headache there since that other PR is already quite large. |
|
I created #27661 to track the overall arc I'm working towards here for broader context as to why I'm adding a common test suite and expanding the tests across all parsers. To briefly recap, these tests serve double duty of identifying existing bugs across parsers and de-risking a future refactor of tool call parsers by ensuring we have comprehensive test coverage. |
|
The precommit failed here due to #27811 . I confirmed the ruff failures there were unrelated to this change. |
|
Looks like this CI run failed with the same flake I previously reported as #27576 . |
|
Adding a note to myself and any future reviewers that if #27747 lands before this PR, this PR needs to update the location of the new tests it's adding to align with the test reorganization in 27747. |
This adds a common test suite for tool call parsers and wires all of the existing tool call parsers that had no tests into the common suite. It doesn't yet adapt existing tool call parser tests to fit into the common suite nor augment tool call parsers that already had tests with the new set of common tests. Those tasks can come later, as this PR is already quite large. Not all of the existing tool call parsers can pass every test in the common test suite. The ones that are not passing today are marked as xfail, and those represent opportunities to identify and fix gaps in all of these tool call parsers in the future until we get down to zero expected to fail tests within the common suite for each parser. Given how many tests are here, the default_tokenizer fixture used by tests when tokenizing strings was also adjusted to be module-scoped, so we don't create a new version of that for every single test. Signed-off-by: Ben Browning <bbrownin@redhat.com>
This tightens up the data type checking in the common tool call parser test suite to ensure parsers are not only parsing various data types of function arguments, but also that they are parsed into the expected Python type. The XML-based parsers do not support parsing into any data type but string, so there's flag added to control this stricter behavior so that tool call parsers that cannot deal with parsing different data types into their non-string native types are excluded from this checking. Signed-off-by: Ben Browning <bbrownin@redhat.com>
ffdc985 to
6588078
Compare
|
Rebased this on top of latest main to fix a small conflict in conftest.py. All added tests still pass. |
This just updates a few import paths to match how the location of ToolParserManager and renaming/moving of AnyTokenizer to TokenizerLike have happened since this PR was initially opened. Signed-off-by: Ben Browning <bbrownin@redhat.com>
|
Hi @bbrowning, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: Ben Browning <bbrownin@redhat.com>
|
@sfeng33 This is an older PR I have that adds some minimal "standard" tool calling test suite, initially focused only on adding tests for the parsers that have none today. I'm happy to rethink this or pivot to another way if you have input on how you think we should best try and add some minimal standard of tool parser unit tests to verify basic functionality in expected scenarios. |
Purpose
This adds a common test suite for tool call parsers and wires all of the existing tool call parsers that had no tests into the common suite. It doesn't yet adapt existing tool call parser tests to fit into the common suite nor augment tool call parsers that already had tests with the new set of common tests. Those tasks can come later, as this PR is already quite large.
Not all of the existing tool call parsers can pass every test in the common test suite. The ones that are not passing today are marked as xfail, and those represent opportunities to identify and fix gaps in all of these tool call parsers in the future until we get down to zero expected to fail tests within the common suite for each parser.
Given how many tests are here, the default_tokenizer fixture used by tests when tokenizing strings was also adjusted to be module-scoped, so we don't create a new version of that for every single test. That keeps test execution fast, and avoids the need to instantiate a new identical tokenizer for every individual test function.
I used Claude Code to help me write the example model outputs for every added test and to help write the initial version of the common set of tests based on existing patterns in our other tool call parser tests.
Test Plan
Run all the newly added tool parser tests:
Test Result
Each of those
xfailedtests is a bug in one of the tool call parsers we'll want to track down. The expected failures are marked asstrict, so that the test will fail if one of them unexpectedly passes so that we can keep the list of expected failures accurate with the real state of things.