Check prefix preservation at the token level by qgallouedec · Pull Request #5559 · huggingface/trl

qgallouedec · 2026-04-15T16:08:45Z

is_chat_template_prefix_preserving previously compared the rendered chat template as strings (tokenize=False). But _get_tool_suffix_ids, the consumer that relies on this property, slices token ids, not text. Two templates that share a string prefix can still diverge at the token level if the final character of the prefix merges with the following character under BPE.

This PR switches the check to tokenize=True, return_dict=False and compares the resulting id lists directly, so the test matches what the trainer actually does.

Note

Medium Risk
Touches logic that gates training-time tool formatting extraction; incorrect token prefix checks could cause subtle GRPO/tool-loop token slicing issues across different tokenizers and VLM processors.

Overview
Updates is_chat_template_prefix_preserving to check token-level prefix preservation by calling apply_chat_template(..., tokenize=True, return_dict=False) and comparing the resulting ID prefixes, instead of comparing rendered strings.

Adds VLM-specific handling by unbatching processor outputs before comparing, keeping the existing multimodal dummy-image path and the DeepSeek-V3 TypeError fallback while making the prefix-preservation signal align with _get_tool_suffix_ids’ token slicing.

^{Reviewed by Cursor Bugbot for commit 723c064. Bugbot is set up for automated code reviews on this repo. Configure here.}

HuggingFaceDocBuilderDev · 2026-04-15T16:11:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7258c00575

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

qgallouedec · 2026-04-15T17:40:45Z

+    if isinstance(tokenizer, ProcessorMixin):
+        from PIL import Image
+
+        dummy_image = Image.new("RGB", (8, 8))
+        messages1 = prepare_multimodal_messages(messages1, images=[dummy_image])
+        messages2 = prepare_multimodal_messages(messages2, images=[dummy_image])


taken from #5558 ta avoid that tokenize=False -> True breaks this function for VLMs

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit af51127. Configure here.}

albertvillanova

Thanks.

Check prefix preservation at the token level

7258c00

qgallouedec requested review from AmineDiro, albertvillanova and kashif April 15, 2026 16:08

chatgpt-codex-connector Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread trl/chat_template_utils.py Outdated

fix vml

a076c4a

qgallouedec commented Apr 15, 2026

View reviewed changes

cursor Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread trl/chat_template_utils.py

qgallouedec and others added 3 commits April 16, 2026 16:39

fix vlm

84e90a0

Merge branch 'main' into check-prefix-preservation-at-the-token-level

22dc962

Merge branch 'main' into check-prefix-preservation-at-the-token-level

af51127

cursor Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread trl/chat_template_utils.py Outdated

Comment thread trl/chat_template_utils.py

qgallouedec and others added 3 commits April 17, 2026 18:35

fix

5f3f91b

rm duplicate

d3f2474

Merge branch 'main' into check-prefix-preservation-at-the-token-level

723c064

albertvillanova approved these changes Apr 20, 2026

View reviewed changes

qgallouedec merged commit d5b534e into main Apr 20, 2026
13 checks passed

qgallouedec deleted the check-prefix-preservation-at-the-token-level branch April 20, 2026 18:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check prefix preservation at the token level#5559

Check prefix preservation at the token level#5559
qgallouedec merged 8 commits into
mainfrom
check-prefix-preservation-at-the-token-level

qgallouedec commented Apr 15, 2026 •

edited by cursor Bot

Loading

Uh oh!

HuggingFaceDocBuilderDev commented Apr 15, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

qgallouedec Apr 15, 2026

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

albertvillanova left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

qgallouedec commented Apr 15, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Apr 15, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

qgallouedec Apr 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

albertvillanova left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

qgallouedec commented Apr 15, 2026 •

edited by cursor Bot

Loading