Check prefix preservation at the token level#5559
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7258c00575
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if isinstance(tokenizer, ProcessorMixin): | ||
| from PIL import Image | ||
|
|
||
| dummy_image = Image.new("RGB", (8, 8)) | ||
| messages1 = prepare_multimodal_messages(messages1, images=[dummy_image]) | ||
| messages2 = prepare_multimodal_messages(messages2, images=[dummy_image]) |
There was a problem hiding this comment.
taken from #5558 ta avoid that tokenize=False -> True breaks this function for VLMs
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit af51127. Configure here.

is_chat_template_prefix_preservingpreviously compared the rendered chat template as strings (tokenize=False). But_get_tool_suffix_ids, the consumer that relies on this property, slices token ids, not text. Two templates that share a string prefix can still diverge at the token level if the final character of the prefix merges with the following character under BPE.This PR switches the check to
tokenize=True, return_dict=Falseand compares the resulting id lists directly, so the test matches what the trainer actually does.Note
Medium Risk
Touches logic that gates training-time tool formatting extraction; incorrect token prefix checks could cause subtle GRPO/tool-loop token slicing issues across different tokenizers and VLM processors.
Overview
Updates
is_chat_template_prefix_preservingto check token-level prefix preservation by callingapply_chat_template(..., tokenize=True, return_dict=False)and comparing the resulting ID prefixes, instead of comparing rendered strings.Adds VLM-specific handling by unbatching processor outputs before comparing, keeping the existing multimodal dummy-image path and the DeepSeek-V3
TypeErrorfallback while making the prefix-preservation signal align with_get_tool_suffix_ids’ token slicing.Reviewed by Cursor Bugbot for commit 723c064. Bugbot is set up for automated code reviews on this repo. Configure here.