Fix: Prevents merging of tool arguments during preprocessing#2909
Conversation
WalkthroughA helper function was introduced within the Changes
Suggested labels
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/axolotl/prompt_strategies/chat_template.py (1)
381-394: Consider extracting the helper function for better maintainability.While the current implementation works, defining the helper function inside the method limits its reusability and impacts code organization.
Consider extracting the function to the class level or module level:
+ @staticmethod + def _remove_none_values(obj): + """Recursively remove None values from dictionaries and lists.""" + if isinstance(obj, dict): + return {k: ChatTemplateStrategy._remove_none_values(v) for k, v in obj.items() if v is not None} + if isinstance(obj, list): + return [ChatTemplateStrategy._remove_none_values(elem) for elem in obj] + return obj + def tokenize_prompt(self, prompt: dict[str, Any]): """ Public method that can handle either a single prompt or a batch of prompts. """ - def _remove_none_values(obj): - if isinstance(obj, dict): - return {k: _remove_none_values(v) for k, v in obj.items() if v is not None} - if isinstance(obj, list): - return [_remove_none_values(elem) for elem in obj] - return obj if not self.is_prompt_batched(prompt) or not self.supports_batched: return self._tokenize_single_prompt(prompt) res = defaultdict(lambda: []) feature_names = list(prompt.keys()) - prompt = _remove_none_values(prompt) + prompt = self._remove_none_values(prompt)This improves code organization and allows the function to be tested independently.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/axolotl/prompt_strategies/chat_template.py(1 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
src/axolotl/prompt_strategies/chat_template.py
[refactor] 382-385: Unnecessary "elif" after "return", remove the leading "el" from "elif"
(R1705)
🔇 Additional comments (1)
src/axolotl/prompt_strategies/chat_template.py (1)
394-394: None value removal is safe for all prompt structuresThe
_remove_none_valueshelper only drops keys whose values are literallyNone. Downstream code always usesprompt.get(key, None)or explicitis Nonechecks (e.g. in_get_tools,_get_messages, default‐value branches in__init__), so removingNoneentries simply falls back to the same default logic. No intentionalNonevalues are relied on, and no changes are required.
Codecov ReportAll modified and coverable lines are covered by tests ✅ 📢 Thoughts on this report? Let us know! |
|
@greenhestu Thanks for tracking this down. I wrote up a quick test case, but it doesn't pass. see https://gist.github.com/winglian/b1b658a37b2dc337448aef9f10080095 EDIT: found the issue and added separate tests for batched and non-batched parsing |
There was a problem hiding this comment.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tests/prompt_strategies/test_chat_template_ds_schema_unification.py (1)
67-76: Apply the same robustness improvements to batched test.The batched test has the same fragility issues as the single prompt test. Additionally, it's missing proper iteration over the batch results.
Apply this refactor to fix the batched test:
def test_schema_unification_batched( self, messages_w_tools, qwen3_prompt_strategy, qwen3_tokenizer ): rows = messages_w_tools.map(qwen3_prompt_strategy.tokenize_prompt, batched=True) - for row in rows: - decoded = qwen3_tokenizer.decode(row["input_ids"]) - tool_call = decoded.split("<tool_call>")[-1].split("</tool_call>")[0] - assert '"message": null' not in tool_call - assert '"theta": null' not in tool_call + + # Iterate through batch results properly + for i in range(len(rows["input_ids"])): + decoded = qwen3_tokenizer.decode(rows["input_ids"][i]) + + # More robust tool call extraction and validation + if "<tool_call>" in decoded and "</tool_call>" in decoded: + tool_call = decoded.split("<tool_call>")[-1].split("</tool_call>")[0] + + try: + tool_data = json.loads(tool_call) + if "arguments" in tool_data: + for key, value in tool_data["arguments"].items(): + assert value is not None, f"Found null value for argument '{key}' in batch item {i}" + except json.JSONDecodeError: + pytest.fail(f"Tool call is not valid JSON in batch item {i}: {tool_call}") + else: + pytest.fail(f"No tool call found in decoded output for batch item {i}")
🧹 Nitpick comments (2)
tests/prompt_strategies/test_chat_template_ds_schema_unification.py (2)
15-25: Consider improving test data readability and maintainability.The JSON test data is embedded as long single-line strings, making it difficult to read, debug, and maintain. Consider extracting this to separate JSON files or formatting it more readably.
Example refactor to improve readability:
- jsons = """ -{"messages":[{"role":"user","content":"move to (0, 1)"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"move","arguments":{"x":0,"y":1}}}]}],"tools":[...]} -{"messages":[{"role":"user","content":"turn 270 degree"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"turn","arguments":{"theta": 270}}}]}],"tools":[...]} - """.strip().split("\n") + test_data = [ + { + "messages": [ + {"role": "user", "content": "move to (0, 1)"}, + { + "role": "assistant", + "content": "", + "tool_calls": [{"function": {"name": "move", "arguments": {"x": 0, "y": 1}}}] + } + ], + "tools": [...], # Tool definitions + "add_generation_prompt": False + }, + # Additional test cases... + ]
52-76: Consider adding edge case tests to improve coverage.The current tests cover basic functionality but miss important edge cases that could reveal issues with the schema unification fix.
Consider adding these additional test cases:
def test_schema_unification_edge_cases( self, qwen3_prompt_strategy, qwen3_tokenizer ): """Test edge cases for schema unification""" # Test with empty tool calls empty_tool_data = { "messages": [ {"role": "user", "content": "test"}, {"role": "assistant", "content": "", "tool_calls": []} ], "tools": [], "add_generation_prompt": False } # Test with malformed tool definitions malformed_tool_data = { "messages": [ {"role": "user", "content": "test"}, {"role": "assistant", "content": "No tool calls"} ], "tools": [{"type": "function", "function": {"name": "incomplete"}}], "add_generation_prompt": False } # Add assertions for these edge cases
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/axolotl/prompt_strategies/chat_template.py(1 hunks)tests/prompt_strategies/test_chat_template_ds_schema_unification.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/axolotl/prompt_strategies/chat_template.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: greenhestu
PR: axolotl-ai-cloud/axolotl#2909
File: src/axolotl/prompt_strategies/chat_template.py:381-386
Timestamp: 2025-07-12T13:14:29.237Z
Learning: In src/axolotl/prompt_strategies/chat_template.py, the _remove_none_values function uses hasattr(obj, "items") instead of isinstance(obj, dict) for dictionary detection to support duck typing. This allows the function to work with any object that implements a dictionary-like interface, including custom classes, OrderedDict, defaultdict, etc.
🔇 Additional comments (2)
tests/prompt_strategies/test_chat_template_ds_schema_unification.py (2)
1-13: Imports and documentation look good.The file has clear documentation and appropriate imports for the testing functionality.
28-34: No action needed:download_qwen3_half_billion_modelfixture is defined
Thedownload_qwen3_half_billion_modelfixture lives intests/conftest.pyas a session-scoped, autouse fixture, so it’s available in your test without any local definition.
NanoCode012
left a comment
There was a problem hiding this comment.
Thank you for the clear issue and fix.
This brings up a second issue, whether this could also occur for tools, since it has a lot of nested elements. For example, one int field could have a min attribute. Would this also appear as null for a string field due to the potential schema merge?
Either way, your fix should catch it.
|
That’s definitely an issue—thanks for pointing it out. Then we need to use something other than
I considered using Would it be better to close this PR and continue the discussion in an issue instead? |
|
Oh, it’s already been merged. Thanks! |
|
For now, I think your fix covers that case too. Thank you for the PR! |
|
I now see you were talking about schema merging for nested elements. I had confused it with a type conflict issue, like when a field type changes from a {"messages":[{"role":"user","content":"send data1 1"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"send_data1","arguments":{"data":1}}}]}],"tools":[{"type":"function","function":{"name":"send_data1","description":"send data","parameters":{"type":"object","properties":{"data":{"type":"integer","description":"the data"}},"required":["data"]}}},{"type":"function","function":{"name":"send_data2","description":"send data","parameters":{"type":"object","properties":{"data":{"type":"string","description":"the data"}},"required":["data"]}}}],"add_generation_prompt":false}
{"messages":[{"role":"user","content":"send data2 2"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"send_data2","arguments":{"data":"2"}}}]}],"tools":[{"type":"function","function":{"name":"send_data1","description":"send data","parameters":{"type":"object","properties":{"data":{"type":"integer","description":"the data"}},"required":["data"]}}},{"type":"function","function":{"name":"send_data2","description":"send data","parameters":{"type":"object","properties":{"data":{"type":"string","description":"the data"}},"required":["data"]}}}],"add_generation_prompt":false}
This could be avoided by using different field names for different types, though. Appreciate the review! |
|
I have a pr to solve this issue. The root cause is that arguments is of type dict. The ideal solution would be to change arguments to a string type, which would resolve all related problems. #3136
|
Description
This PR modifies the data preprocessing logic for tool calls. It resolves an issue where arguments from multiple tools were being merged during preprocessing, causing unnecessary fields to be populated with null values.
This change reforms the structure of tool_calls within the preprocessed data. By removing extraneous null arguments, each tool call in the training data now strictly adheres to its own schema, ensuring it matches the format expected by standard chat templates.
Motivation and Context
Problem:
The current logic merges the function signatures of all available tools, likely due to the schema unification behavior of Hugging Face's datasets.load_dataset with JSON files. (See HF Docs on JSON loading).
Impact:
This results in training data where tool_calls contain extraneous arguments with null values.
A model fine-tuned on this data performs poorly when used for inference with standard chat templates (e.g., in vLLM), as there is a mismatch between the training format and the expected inference format.
Example:
{"messages":[{"role":"user","content":"move to (0, 1)"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"move","arguments":{"x":0,"y":1}}}]}],"tools":[{"type":"function","function":{"name":"move","description":"Move to a given location measured in meters","parameters":{"type":"object","properties":{"x":{"type":"number","description":"The x coordinate of the location, negative values are to the left, positive values are to the right"},"y":{"type":"number","description":"The y coordinate of the location, negative values are backward, positive values are forward"}},"required":["x","y"]}}},{"type":"function","function":{"name":"turn","description":"Turn the robot to a given direction","parameters":{"type":"object","properties":{"theta":{"type":"integer","description":"The angle to turn to, in degrees, positive values are counter-clockwise, negative values are clockwise"}},"required":["theta"]}}},{"type":"function","function":{"name":"invalid_prompt","description":"call when the user's prompt is invalid","parameters":{"type":"object","properties":{"message":{"type":"string","description":"why the prompt is invalid"}},"required":["message"]}}}],"add_generation_prompt":false} {"messages":[{"role":"user","content":"turn 270 degree"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"turn","arguments":{"theta": 270}}}]}],"tools":[{"type":"function","function":{"name":"move","description":"Move to a given location measured in meters","parameters":{"type":"object","properties":{"x":{"type":"number","description":"The x coordinate of the location, negative values are to the left, positive values are to the right"},"y":{"type":"number","description":"The y coordinate of the location, negative values are backward, positive values are forward"}},"required":["x","y"]}}},{"type":"function","function":{"name":"turn","description":"Turn the robot to a given direction","parameters":{"type":"object","properties":{"theta":{"type":"integer","description":"The angle to turn to, in degrees, positive values are counter-clockwise, negative values are clockwise"}},"required":["theta"]}}},{"type":"function","function":{"name":"invalid_prompt","description":"call when the user's prompt is invalid","parameters":{"type":"object","properties":{"message":{"type":"string","description":"why the prompt is invalid"}},"required":["message"]}}}],"add_generation_prompt":false} {"messages":[{"role":"user","content":"jump high"},{"role":"assistant","content":"","tool_calls":[{"function":{"name":"invalid_prompt","arguments":{"message": "jump is not a valid action"}}}]}],"tools":[{"type":"function","function":{"name":"move","description":"Move to a given location measured in meters","parameters":{"type":"object","properties":{"x":{"type":"number","description":"The x coordinate of the location, negative values are to the left, positive values are to the right"},"y":{"type":"number","description":"The y coordinate of the location, negative values are backward, positive values are forward"}},"required":["x","y"]}}},{"type":"function","function":{"name":"turn","description":"Turn the robot to a given direction","parameters":{"type":"object","properties":{"theta":{"type":"integer","description":"The angle to turn to, in degrees, positive values are counter-clockwise, negative values are clockwise"}},"required":["theta"]}}},{"type":"function","function":{"name":"invalid_prompt","description":"call when the user's prompt is invalid","parameters":{"type":"object","properties":{"message":{"type":"string","description":"why the prompt is invalid"}},"required":["message"]}}}],"add_generation_prompt":false}Applying the chat template to the data above produces the following result (first column)
axolotl preprocess --debug makes following output
How has this been tested?
This change was evaluated using the qwen3-1.7b model with vLLM for inference.
The results are as follows:
Before fine-tuning: 55.5% accuracy
After fine-tuning (before this PR): 29.8% (Performance degraded)
After fine-tuning (this PR): 90.3% (Performance improved)
Summary by CodeRabbit