76 changes: 76 additions & 0 deletions unsloth/tokenizer_utils.py
@@ -671,6 +671,82 @@ def _fix_chat_template(chat_template):
        )

        chat_template = chat_template[: where + len(chosen_end)] + after_endfor

    elif (
        re.sub(r"\{#.*?#\}", "", after_endfor, flags=re.DOTALL).strip() == ""
    ):
        # GH#4150: ChatML-style templates (Hermes, Magnum, Phi-4, etc.) that
        # end with {% endfor %} (optionally followed by only whitespace
        # and/or Jinja comments) and have no add_generation_prompt block.
        # Strip Jinja `{# ... #}` comments before any regex / substring check
        # so that neither the guard nor the separator inference can be fooled
        # by ChatML tokens or `add_generation_prompt` mentions that appear
        # only inside a comment.
        scrubbed = re.sub(r"\{#.*?#\}", "", chat_template, flags=re.DOTALL)
        if (
Member Author

[1/11 reviewers] Low - Sonnet C noted that the Jinja comment scrub regex \{#.*?#\} would also match {# ... #} sequences that appear inside a Jinja string literal (e.g. {% set x = '{# fake #}' %}), silently stripping the string's content from the scrubbed copy. No real tokenizer template stores a literal {# inside a string, and Sonnet C explicitly classified this as unreachable theoretical. Not acted on.
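For reference, a minimal sketch of the edge case described above. It reuses the scrub pattern from the diff; the `scrub_comments` helper name and the sample template are hypothetical and exist only for illustration.

```python
import re

# Same pattern the diff uses to strip Jinja comments.
JINJA_COMMENT = re.compile(r"\{#.*?#\}", flags=re.DOTALL)


def scrub_comments(template: str) -> str:
    """Remove every `{# ... #}` span, regardless of where it occurs."""
    return JINJA_COMMENT.sub("", template)


# Hypothetical template: the comment-like text sits inside a string literal,
# so it is not a real Jinja comment, yet the regex still removes it.
template = "{% set x = '{# fake #}' %}{{ x }}"
scrubbed = scrub_comments(template)
print(scrubbed)  # {% set x = '' %}{{ x }}
```

Because `scrubbed` is only used for detection and separator inference, not for the emitted template, the stripped literal would not alter the output in this sketch.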

            "<|im_start|>" in scrubbed
            and "<|im_end|>" in scrubbed
            and "add_generation_prompt" not in scrubbed
        ):
            # Infer the model-specific separator after "assistant" from the
            # template itself. Strategy:
            #   1. Prefer an explicit assistant literal
            #      ('<|im_start|>assistant<sep>') if the template contains one.
            #   2. Otherwise scan all role-concatenation separators
            #      (message['role'] + '<sep>'). If they all agree, use that
            #      single separator.
            #   3. If multiple different role separators are present (e.g. a
            #      Phi-4-mini style template that uses '\n' for system and
            #      '<|im_sep|>' for user/assistant), prefer '<|im_sep|>' when
            #      it appears in the template, otherwise fall back to the
            #      ChatML newline default.
            assistant_match = re.search(
Member Author

[1/11 reviewers] Low - Sonnet A noted that assistant_match could match a '<|im_start|>assistant<sep>' literal stored in an unused {% set _helper = '...' %} block with a separator different from the actual assistant role branch. This is only reachable on synthetic templates that define a helper variable for documentation purposes; no known real tokenizer does this. The role_seps fallback path (which Sonnet B's real-tokenizer integration run confirmed fires correctly for mixed-separator role-branched templates) already handles the realistic cases. Not acted on in this round.
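A small sketch of the synthetic case in question, assuming a made-up template with an unused `_helper` variable. Both regexes are copied from the diff; the constant names and the template are hypothetical.

```python
import re

# Patterns copied from the diff.
ASSISTANT_LITERAL = re.compile(r"""(['"])<\|im_start\|>assistant([^'"]*)\1""")
ROLE_SEPARATOR = re.compile(
    r"""message(?:\[['"]role['"]\]|\.role)\s*\+\s*(['"])([^'"]*)\1"""
)

# Hypothetical template: the helper literal uses '<|im_sep|>', while the
# loop that actually renders messages uses '\n' after the role.
template = (
    "{% set _helper = '<|im_start|>assistant<|im_sep|>' %}"
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\\n'"
    " + message['content'] + '<|im_end|>' + '\\n' }}"
    "{% endfor %}"
)

match = ASSISTANT_LITERAL.search(template)
inferred = match.group(2) if match else None
role_seps = [m.group(2) for m in ROLE_SEPARATOR.finditer(template)]

print(repr(inferred))  # '<|im_sep|>' (taken from the unused helper)
print(role_seps)       # ['\\n'] (what the role branch really uses)
```

In this sketch the explicit-literal path wins and picks the helper's separator, which is the mismatch the reviewer described.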

                r"""(['"])<\|im_start\|>assistant([^'"]*)\1""",
                scrubbed,
            )
            role_seps = [
                m.group(2)
                for m in re.finditer(
                    r"""message(?:\[['"]role['"]\]|\.role)\s*\+\s*(['"])([^'"]*)\1""",
                    scrubbed,
                )
Member Author

[2/11 reviewers] Low - Empty separator from bare '<|im_start|>assistant' literal. When a template splits the assistant prefix across concatenated string literals ({{ '<|im_start|>assistant' }}{{ '\n' }} or {{ '<|im_start|>assistant' + '\n' + ... }}), the regex captures an empty group, and the injected generation block becomes {{ "<|im_start|>assistant" }} with no trailing newline -- add_generation_prompt=True then renders a bare <|im_start|>assistant token that the model will not interpret correctly. No known real-world tokenizer uses this split-literal pattern, but the defensive guard is one line.

Suggested change
)
if assistant_match is not None and assistant_match.group(2):
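To make the empty-capture case concrete, here is a minimal sketch using the regex from the diff. The `explicit_separator` helper and both sample templates are hypothetical; returning `None` stands in for falling through to the later branches.

```python
import re
from typing import Optional

# Pattern copied from the diff.
ASSISTANT_LITERAL = re.compile(r"""(['"])<\|im_start\|>assistant([^'"]*)\1""")


def explicit_separator(template: str) -> Optional[str]:
    """Return the separator captured after 'assistant', or None if the
    literal is missing or the captured separator is empty."""
    m = ASSISTANT_LITERAL.search(template)
    if m is not None and m.group(2):
        return m.group(2)
    return None


# Split-literal form: the separator lives in a second string literal,
# so the capture group is empty and the guard rejects the match.
split_literal = "{{ '<|im_start|>assistant' }}{{ '\\n' }}"
# Single-literal form: the separator is captured directly.
joined_literal = "{{ '<|im_start|>assistant\\n' }}"

print(explicit_separator(split_literal))         # None
print(repr(explicit_separator(joined_literal)))  # '\\n'
```

With the truthiness check in place, the split-literal template no longer yields an empty separator.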

            ]
            unique_role_seps = list(dict.fromkeys(role_seps))  # preserves order
            if assistant_match is not None and assistant_match.group(2):
                # Non-empty captured separator (e.g. '<|im_start|>assistant\n').
                separator = assistant_match.group(2)
            elif len(unique_role_seps) == 1:
                separator = unique_role_seps[0]
            elif "<|im_sep|>" in scrubbed:
                separator = "<|im_sep|>"
            else:
                # Fallback for both the no-match case and the degenerate
                # `'<|im_start|>assistant'` bare-literal case, so the
                # injected Jinja block always produces a usable ChatML
                # generation prefix.
                separator = "\\n"
            # Use a double-quoted Jinja string literal so a single quote in
            # the separator (should one ever appear) cannot break the
            # generated block.
            assistant_prefix = "<|im_start|>assistant" + separator
            generation_block = (
                "{%" + dash + " if add_generation_prompt %}"
                '{{ "' + assistant_prefix.replace('"', '\\"') + '" }}'
Member Author

[1/7, contradicted] Not acted on - reviewer.py run 2 claimed this line emits invalid Jinja {{% ... %}}. This is contradicted by 6/7 reviewer.py runs + Sonnet A + Sonnet B's real transformers.AutoTokenizer.apply_chat_template round-trip, all of which successfully parse and render the generated block. A static read of the source string "{%" + dash + " if add_generation_prompt %}" shows that it concatenates to {% if add_generation_prompt %}, not {{% ... %}}. The 6/7 consensus and the live Jinja2 parser evidence are authoritative here.
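The static read can be checked with plain string concatenation. The sketch below wraps the expression from the diff in a hypothetical `build_generation_block` function; the default `dash=""` is an assumption made for illustration only.

```python
def build_generation_block(separator: str, dash: str = "") -> str:
    """Assemble the injected Jinja block the same way the diff does."""
    assistant_prefix = "<|im_start|>assistant" + separator
    return (
        "{%" + dash + " if add_generation_prompt %}"
        '{{ "' + assistant_prefix.replace('"', '\\"') + '" }}'
        "{%" + dash + " endif %}"
    )


# "\\n" is a backslash followed by 'n'; Jinja turns it into a newline
# when it evaluates the double-quoted string literal.
plain = build_generation_block("\\n")
trimmed = build_generation_block("\\n", dash="-")

print(plain)
# {% if add_generation_prompt %}{{ "<|im_start|>assistant\n" }}{% endif %}
print(trimmed)
# {%- if add_generation_prompt %}{{ "<|im_start|>assistant\n" }}{%- endif %}
```

Neither result contains the `{{%` sequence that the single dissenting run reported.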

                "{%" + dash + " endif %}"
            )
            # `after_endfor` contains only whitespace and/or Jinja comments
            # at this point (verified by the scrubbed emptiness check
            # above). Whitespace would render as stray trailing output
            # after the generation prefix when
            # `apply_chat_template(add_generation_prompt=True)` is called,
            # so drop it entirely and place the generation block directly
            # after `endfor`. Jinja comments are also dropped for the same
            # reason: they have no runtime effect but their surrounding
            # whitespace would be preserved if we kept them.
            chat_template = (
                chat_template[: where + len(chosen_end)] + generation_block
Member Author

[1/11 reviewers] Info - Hosted Gemini asked to preserve after_endfor by appending it to the generation block in the output. We are accepting the scrub-for-detection half of this suggestion but rejecting the preserve-in-output half: at this point after_endfor contains only whitespace and/or Jinja comments (verified by the scrubbed emptiness check above), and appending it would cause apply_chat_template(add_generation_prompt=True) to emit trailing whitespace after the generation prefix (e.g. ...<|im_start|>assistant\n \n\n ), which the model would see as part of the prompt. The first repair branch also does not preserve trailing whitespace, so this keeps the two branches symmetric. Keeping the output as-is.

            )
Comment on lines +746 to +748
medium

When appending the generation_block, it is recommended to include the original after_endfor content (which is verified to be only whitespace in this branch). This preserves any trailing whitespace or newlines that were present at the end of the original template, maintaining consistent file formatting and avoiding unnecessary changes to the template's suffix.

Suggested change
-            chat_template = (
-                chat_template[: where + len(chosen_end)] + generation_block
-            )
+            chat_template = (
+                chat_template[: where + len(chosen_end)] + generation_block + after_endfor
+            )

Comment on lines +737 to +748
medium

The current implementation discards after_endfor, which may contain trailing whitespace or Jinja comments (e.g., {# end of template #}). While these are not semantically functional, discarding them is inconsistent with the first repair branch (lines 669-673), which preserves after_endfor by wrapping it. To maintain the original template's structure and preserve user comments, consider appending after_endfor after the generation_block. Note that the comment on line 735 is currently misleading as it claims to match the behavior of the first branch while doing the opposite.

            # `after_endfor` contains only whitespace and/or Jinja comments
            # at this point (verified above). Append the generation block
            # and preserve the trailing content for consistency.
            chat_template = (
                chat_template[: where + len(chosen_end)] + generation_block + after_endfor
            )


    return chat_template

