From 574b19665d52adcda8baad7f89805f4de2f88660 Mon Sep 17 00:00:00 2001
From: sergiopaniego <sergiopaniegoblanco@gmail.com>
Date: Fri, 17 Apr 2026 11:40:21 +0200
Subject: [PATCH 1/8] [docs] Add chat templates page to web docs

---
 docs/source/_toctree.yml           | 2 ++
 docs/source/chat_template_utils.md | 6 ++++++
 docs/source/grpo_trainer.md        | 3 +++
 docs/source/sft_trainer.md         | 2 +-
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 72970949c5b..368fa28330a 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -7,6 +7,8 @@
     title: Quickstart
   title: Getting started
 - sections:
+  - local: chat_templates
+    title: Chat Templates
   - local: dataset_formats
     title: Dataset Formats
   - local: paper_index
diff --git a/docs/source/chat_template_utils.md b/docs/source/chat_template_utils.md
index 2608f702560..b75fc8b0d15 100644
--- a/docs/source/chat_template_utils.md
+++ b/docs/source/chat_template_utils.md
@@ -1,5 +1,11 @@
 # Chat template utilities
 
+For an overview of the chat templates bundled with TRL and the rationale behind the training patches, see [Chat Templates](chat_templates).
+
+## add_response_schema
+
+[[autodoc]] add_response_schema
+
 ## clone_chat_template
 
 [[autodoc]] clone_chat_template
diff --git a/docs/source/grpo_trainer.md b/docs/source/grpo_trainer.md
index 935ac78f1d5..673717f8bcd 100644
--- a/docs/source/grpo_trainer.md
+++ b/docs/source/grpo_trainer.md
@@ -632,6 +632,9 @@ trainer = GRPOTrainer(
 Each tool must be a standard Python function with **type-hinted arguments and return types**, along with a **Google-style docstring** describing its purpose, arguments, and return value.
 For more details, see the [Passing tools guide](https://huggingface.co/docs/transformers/en/chat_extras#passing-tools).
 
+> [!TIP]
+> The GRPO tool call loop requires the chat template to be *prefix-preserving* (appending messages must not change how earlier messages are rendered). For known model families (e.g. Qwen3, DeepSeek-V3), TRL automatically swaps in a patched training template when tools are enabled. See [Chat Templates](chat_templates#training-templates) for the full list.
+
 Example:
 
 ```python
diff --git a/docs/source/sft_trainer.md b/docs/source/sft_trainer.md
index 07ccde4f114..706335462bb 100644
--- a/docs/source/sft_trainer.md
+++ b/docs/source/sft_trainer.md
@@ -169,7 +169,7 @@ training_args = SFTConfig(assistant_only_loss=True)
 ![train_on_assistant](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/train_on_assistant.png)
 
 > [!WARNING]
-> This functionality requires the chat template to include `&#123;% generation %&#125;` and `&#123;% endgeneration %&#125;` keywords. For known model families (e.g. Qwen3), TRL automatically patches the template when `assistant_only_loss=True`. For other models, check that your chat template includes these keywords — see [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B/blob/main/chat_template.jinja#L76-L82) for an example.
+> This functionality requires the chat template to include `&#123;% generation %&#125;` and `&#123;% endgeneration %&#125;` keywords. For known model families (e.g. Qwen3), TRL automatically patches the template when `assistant_only_loss=True`. See [Chat Templates](chat_templates#training-templates) for the full list of bundled training templates. For other models, check that your chat template includes these keywords. See [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B/blob/main/chat_template.jinja#L76-L82) for an example.
 
 ### Train on completion only
 

From 5159a07395b6cb540af2e71d95ed771802250a35 Mon Sep 17 00:00:00 2001
From: sergiopaniego <sergiopaniegoblanco@gmail.com>
Date: Fri, 17 Apr 2026 11:43:14 +0200
Subject: [PATCH 2/8] [docs] Add chat templates page to web docs

---
 docs/source/chat_templates.md | 111 ++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)
 create mode 100644 docs/source/chat_templates.md

diff --git a/docs/source/chat_templates.md b/docs/source/chat_templates.md
new file mode 100644
index 00000000000..83a8a9670a0
--- /dev/null
+++ b/docs/source/chat_templates.md
@@ -0,0 +1,111 @@
+# Chat Templates
+
+TRL ships a small collection of Jinja2 chat templates under [`trl/chat_templates/`](https://github.com/huggingface/trl/tree/main/trl/chat_templates). They serve two purposes:
+
+1. **Identity comparison**: detecting which model is being used (by comparing `tokenizer.chat_template` against known templates) to add the appropriate response schema ([`add_response_schema`]) or swap in a training template ([`get_training_chat_template`]).
+2. **Training patches**: modified templates that fix training-specific issues (prefix-preservation for GRPO, `&#123;% generation %&#125;` markers for SFT assistant-only loss).
+
+**Why prefix-preserving?** The GRPO tool call loop extracts tool response formatting tokens by comparing tokenizations with and without tool messages appended (`_get_tool_suffix_ids`). This requires the chat template to be *prefix-preserving*: appending messages must not change how earlier messages are rendered.
+
+**Why generation-tagged?** SFT with `assistant_only_loss=True` requires the chat template to include `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` markers around assistant output, so `return_assistant_tokens_mask=True` can produce correct masks. Most model templates don't include these markers natively.
+
+## Original templates
+
+Used for identity comparison only.
+
+### `deepseekv3.jinja`
+
+Original DeepSeek-V3 chat template.
+
+### `glm4moe.jinja`
+
+Original GLM-4-MoE chat template.
+
+### `gptoss.jinja`
+
+Original GPT-OSS chat template.
+
+### `llama3.jinja`
+
+Original Llama 3 chat template.
+
+### `llama3_1.jinja` / `llama3_2.jinja`
+
+Original Llama 3.1 / 3.2 chat templates. Both render tool calls as a single bare JSON object using the key `parameters` (instead of `arguments`) and support at most one tool call per assistant turn.
+
+### `qwen2_5.jinja`
+
+Original Qwen2.5 chat template.
+
+### `qwen3.jinja`
+
+Original Qwen3 chat template.
+
+### `qwen3_vl.jinja`
+
+Original Qwen3-VL chat template. Unlike text-only Qwen3, this template is already prefix-preserving (no conditional thinking blocks), so no training patch is needed.
+
+### `qwen3_5_2b_and_below.jinja` / `qwen3_5_4b_and_above.jinja`
+
+Original Qwen3.5 chat templates.
+
+## Training templates
+
+Patched templates that fix training-specific issues. TRL swaps them in at init when tools are enabled (GRPO) or when `assistant_only_loss=True` (SFT).
+
+### `deepseekv3_training.jinja`
+
+Patched DeepSeek-V3 template. Diff vs `deepseekv3.jinja`:
+
+- Uses `| tojson` on `tool['function']['arguments']` so that `arguments` can be passed as a `dict` (the documented format per [transformers docs](https://huggingface.co/docs/transformers/en/chat_extras#tool-calling-example)). The original template uses raw string concatenation, which crashes on dict inputs.
+- Wraps assistant message output with `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` markers for SFT assistant-only loss.
+
+### `qwen3_training.jinja`
+
+Patched Qwen3 template. Diff vs `qwen3.jinja`:
+
+Require both `<think>` and `</think>` to be present before parsing, to avoid incorrect splitting when the model generates only one tag:
+
+```diff
+- {%- if '</think>' in content %}
++ {%- if '<think>' in content and '</think>' in content %}
+```
+
+Always include the thinking block regardless of message position. The original conditionally omits it based on `loop.last`, which changes the assistant rendering when a tool message is appended, breaking prefix-preservation:
+
+```diff
+- {%- if loop.index0 > ns.last_query_index %}
+-     {%- if loop.last or (not loop.last and reasoning_content) %}
+-         {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+-     {%- else %}
+-         {{- '<|im_start|>' + message.role + '\n' + content }}
+-     {%- endif %}
+- {%- else %}
+-     {{- '<|im_start|>' + message.role + '\n' + content }}
+- {%- endif %}
++ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
+```
+
+Wrap assistant message output with `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.
+
+### `gptoss_training.jinja`
+
+Patched GPT-OSS template. Diff vs `gptoss.jinja`:
+
+Wrap assistant message output with `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.
+
+### `llama3_training.jinja`
+
+Patched Llama 3 template. Diff vs `llama3.jinja`:
+
+Wrap assistant message output with `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.
+
+### `qwen2_5_training.jinja`
+
+Patched Qwen2.5 template. Diff vs `qwen2_5.jinja`:
+
+Wrap assistant message output with `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.
+
+## Related utilities
+
+See [Chat Template Utilities](chat_template_utils) for the helper functions (`clone_chat_template`, `is_chat_template_prefix_preserving`, `get_training_chat_template`) that operate on these templates.
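The `qwen3_training.jinja` change above, which requires both `<think>` and `</think>` before splitting, can be illustrated in plain Python. The sketch below transliterates the reasoning-splitting branch as we read it from the original Qwen3 template. The split expression is an assumption and does not appear in this patch. `split_reasoning` and its `require_both` flag are hypothetical names used only for illustration.

```python
def split_reasoning(content: str, require_both: bool) -> tuple[str, str]:
    """Return (reasoning, answer) the way the template's splitting branch would.

    require_both=False mimics the original check (`'</think>' in content`);
    require_both=True mimics the patched check (both tags must be present).
    """
    if require_both:
        has_block = "<think>" in content and "</think>" in content
    else:
        has_block = "</think>" in content
    if not has_block:
        # No recognizable thinking block: keep the content untouched.
        return "", content
    reasoning = content.split("</think>")[0].rstrip("\n").split("<think>")[-1].lstrip("\n")
    answer = content.split("</think>")[-1].lstrip("\n")
    return reasoning, answer


# A generation that contains only the closing tag.
partial_output = "The sum is 4.</think>\nDone."
print(split_reasoning(partial_output, require_both=False))  # ('The sum is 4.', 'Done.')
print(split_reasoning(partial_output, require_both=True))   # ('', 'The sum is 4.</think>\nDone.')
```

In this sketch, when only the closing tag is present, the original check moves the text before `</think>` into the reasoning slot, so the re-rendered turn no longer matches what the model generated. The patched check leaves the content as generated.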

From 1e83e037fe02de0ef0e8a1b03407f3e1ed044724 Mon Sep 17 00:00:00 2001
From: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Fri, 17 Apr 2026 14:50:20 +0200
Subject: [PATCH 3/8] Update docs/source/chat_templates.md

Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
---
 docs/source/chat_templates.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/chat_templates.md b/docs/source/chat_templates.md
index 83a8a9670a0..327b4697f89 100644
--- a/docs/source/chat_templates.md
+++ b/docs/source/chat_templates.md
@@ -2,7 +2,7 @@
 
 TRL ships a small collection of Jinja2 chat templates under [`trl/chat_templates/`](https://github.com/huggingface/trl/tree/main/trl/chat_templates). They serve two purposes:
 
-1. **Identity comparison**: detecting which model is being used (by comparing `tokenizer.chat_template` against known templates) to add the appropriate response schema ([`add_response_schema`]) or swap in a training template ([`get_training_chat_template`]).
+1. **Identity comparison**: detecting which model is being used (by comparing `processing_class.chat_template` against known templates) to add the appropriate response schema ([`add_response_schema`]) or swap in a training template ([`get_training_chat_template`]).
 2. **Training patches**: modified templates that fix training-specific issues (prefix-preservation for GRPO, `&#123;% generation %&#125;` markers for SFT assistant-only loss).
 
 **Why prefix-preserving?** The GRPO tool call loop extracts tool response formatting tokens by comparing tokenizations with and without tool messages appended (`_get_tool_suffix_ids`). This requires the chat template to be *prefix-preserving*: appending messages must not change how earlier messages are rendered.
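The second purpose listed in this hunk is `{% generation %}` markers for SFT assistant-only loss. In short, the markers tell the renderer which spans are assistant output, and those spans become the loss mask. The sketch below is a dependency-free stand-in for that mechanism. It makes three simplifying assumptions: hypothetical sentinel strings replace the real Jinja tags, a toy ChatML-style renderer replaces a model template, and masks are character-level rather than token-level. In practice, the mask comes from `apply_chat_template(..., return_assistant_tokens_mask=True)` in transformers.

```python
# Hypothetical sentinels standing in for {% generation %} / {% endgeneration %}.
GEN_START = "\x00gen\x00"
GEN_END = "\x00endgen\x00"


def render(messages: list[dict]) -> str:
    """Toy ChatML-style renderer that wraps each assistant turn in the sentinels."""
    parts = []
    for m in messages:
        if m["role"] == "assistant":
            # In this sketch the end-of-turn token sits inside the span, so it is masked in too.
            parts.append("<|im_start|>assistant\n" + GEN_START + m["content"] + "<|im_end|>" + GEN_END + "\n")
        else:
            parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    return "".join(parts)


def assistant_char_mask(rendered: str) -> tuple[str, list[int]]:
    """Strip the sentinels; mask[i] == 1 where character i of the text is assistant output."""
    text, mask = [], []
    inside, i = 0, 0
    while i < len(rendered):
        if rendered.startswith(GEN_START, i):
            inside, i = 1, i + len(GEN_START)
        elif rendered.startswith(GEN_END, i):
            inside, i = 0, i + len(GEN_END)
        else:
            text.append(rendered[i])
            mask.append(inside)
            i += 1
    return "".join(text), mask


messages = [{"role": "user", "content": "Hi!"}, {"role": "assistant", "content": "Hello!"}]
text, mask = assistant_char_mask(render(messages))
print("".join(c for c, keep in zip(text, mask) if keep))  # Hello!<|im_end|>
```

Without the markers, nothing in the rendered string distinguishes assistant text from the rest. That is why templates lacking them cannot support `assistant_only_loss=True`.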

From 81327310e15793958355437207e6a9276feeec61 Mon Sep 17 00:00:00 2001
From: sergiopaniego <sergiopaniegoblanco@gmail.com>
Date: Fri, 17 Apr 2026 16:04:11 +0200
Subject: [PATCH 4/8] remove

---
 docs/source/chat_template_utils.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/source/chat_template_utils.md b/docs/source/chat_template_utils.md
index b75fc8b0d15..53ee8fae46d 100644
--- a/docs/source/chat_template_utils.md
+++ b/docs/source/chat_template_utils.md
@@ -2,10 +2,6 @@
 
 For an overview of the chat templates bundled with TRL and the rationale behind the training patches, see [Chat Templates](chat_templates).
 
-## add_response_schema
-
-[[autodoc]] add_response_schema
-
 ## clone_chat_template
 
 [[autodoc]] clone_chat_template

From 8c08b112e1fde79ac2599b7e6993f62bb5215141 Mon Sep 17 00:00:00 2001
From: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Fri, 17 Apr 2026 18:16:29 +0200
Subject: [PATCH 5/8] Update docs/source/grpo_trainer.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
---
 docs/source/grpo_trainer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/grpo_trainer.md b/docs/source/grpo_trainer.md
index 673717f8bcd..a4c9108c703 100644
--- a/docs/source/grpo_trainer.md
+++ b/docs/source/grpo_trainer.md
@@ -633,7 +633,7 @@ Each tool must be a standard Python function with **type-hinted arguments and re
 For more details, see the [Passing tools guide](https://huggingface.co/docs/transformers/en/chat_extras#passing-tools).
 
 > [!TIP]
-> The GRPO tool call loop requires the chat template to be *prefix-preserving* (appending messages must not change how earlier messages are rendered). For known model families (e.g. Qwen3, DeepSeek-V3), TRL automatically swaps in a patched training template when tools are enabled. See [Chat Templates](chat_templates#training-templates) for the full list.
+> The GRPO tool call loop requires the chat template to be *prefix-preserving* (appending a tool message must not change how earlier messages are rendered). For known model families (e.g. Qwen3, DeepSeek-V3), TRL automatically swaps in a patched training template when tools are enabled. See [Chat Templates](chat_templates#training-templates) for the full list.
 
 Example:
 

From e0bb78abf9460c6fb8b38587e2bd0f904930a8b0 Mon Sep 17 00:00:00 2001
From: sergiopaniego <sergiopaniegoblanco@gmail.com>
Date: Mon, 20 Apr 2026 11:10:04 +0200
Subject: [PATCH 6/8] update naming

---
 docs/source/chat_templates.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/chat_templates.md b/docs/source/chat_templates.md
index 327b4697f89..4f084391867 100644
--- a/docs/source/chat_templates.md
+++ b/docs/source/chat_templates.md
@@ -2,7 +2,7 @@
 
 TRL ships a small collection of Jinja2 chat templates under [`trl/chat_templates/`](https://github.com/huggingface/trl/tree/main/trl/chat_templates). They serve two purposes:
 
-1. **Identity comparison**: detecting which model is being used (by comparing `processing_class.chat_template` against known templates) to add the appropriate response schema ([`add_response_schema`]) or swap in a training template ([`get_training_chat_template`]).
+1. **Identity comparison**: detecting which model is being used (by comparing `processing_class.chat_template` against known templates) to add the appropriate response schema (`add_response_schema`) or swap in a training template (`get_training_chat_template`).
 2. **Training patches**: modified templates that fix training-specific issues (prefix-preservation for GRPO, `&#123;% generation %&#125;` markers for SFT assistant-only loss).
 
 **Why prefix-preserving?** The GRPO tool call loop extracts tool response formatting tokens by comparing tokenizations with and without tool messages appended (`_get_tool_suffix_ids`). This requires the chat template to be *prefix-preserving*: appending messages must not change how earlier messages are rendered.
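The prefix-preservation property described above can be checked with a toy example in plain Python. `render_qwen3_like` is a hypothetical, heavily simplified renderer that mimics only the assistant branch of the original Qwen3 template. In it, tool calls are plain content and the tool turn keeps its own role. `is_prefix_preserving` is a hypothetical stand-in, not TRL's `is_chat_template_prefix_preserving`. It shares only the idea: render, append a message, and check that the earlier text is unchanged.

```python
from functools import partial


def render_qwen3_like(messages: list[dict], always_think: bool) -> str:
    """Toy renderer mimicking only the assistant branch of the Qwen3 template."""
    last_query = max(i for i, m in enumerate(messages) if m["role"] == "user")
    parts = []
    for i, m in enumerate(messages):
        content = m["content"]
        if m["role"] == "assistant":
            reasoning = m.get("reasoning_content", "")
            is_last = i == len(messages) - 1
            # always_think=False: original rule (block only after the last user query,
            # and only if the turn is last or has reasoning). always_think=True: patched rule.
            if always_think or (i > last_query and (is_last or reasoning)):
                content = f"<think>\n{reasoning}\n</think>\n\n{content}"
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>\n")
    return "".join(parts)


def is_prefix_preserving(render, messages: list[dict], new_message: dict) -> bool:
    """True if appending `new_message` leaves the rendering of `messages` untouched."""
    return render(messages + [new_message]).startswith(render(messages))


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<tool_call>add(2, 2)</tool_call>"},
]
tool_message = {"role": "tool", "content": "4"}
print(is_prefix_preserving(partial(render_qwen3_like, always_think=False), messages, tool_message))  # False
print(is_prefix_preserving(partial(render_qwen3_like, always_think=True), messages, tool_message))  # True
```

When the reasoning field is empty, the original rule emits an empty `<think>` block only while the assistant turn is last. Appending the tool message therefore rewrites that turn. The patched rule always emits the block. In this sketch, an assistant turn with non-empty reasoning preserves the prefix under both rules.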

From de7dd672b84d469d7ee41150156200cc7f638fa0 Mon Sep 17 00:00:00 2001
From: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Mon, 20 Apr 2026 11:11:08 +0200
Subject: [PATCH 7/8] Update docs/source/chat_templates.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
---
 docs/source/chat_templates.md | 59 ++++++++++-------------------------
 1 file changed, 16 insertions(+), 43 deletions(-)

diff --git a/docs/source/chat_templates.md b/docs/source/chat_templates.md
index 4f084391867..37d3409f797 100644
--- a/docs/source/chat_templates.md
+++ b/docs/source/chat_templates.md
@@ -1,53 +1,26 @@
 # Chat Templates
 
-TRL ships a small collection of Jinja2 chat templates under [`trl/chat_templates/`](https://github.com/huggingface/trl/tree/main/trl/chat_templates). They serve two purposes:
-
-1. **Identity comparison**: detecting which model is being used (by comparing `processing_class.chat_template` against known templates) to add the appropriate response schema (`add_response_schema`) or swap in a training template (`get_training_chat_template`).
-2. **Training patches**: modified templates that fix training-specific issues (prefix-preservation for GRPO, `&#123;% generation %&#125;` markers for SFT assistant-only loss).
-
-**Why prefix-preserving?** The GRPO tool call loop extracts tool response formatting tokens by comparing tokenizations with and without tool messages appended (`_get_tool_suffix_ids`). This requires the chat template to be *prefix-preserving*: appending messages must not change how earlier messages are rendered.
-
-**Why generation-tagged?** SFT with `assistant_only_loss=True` requires the chat template to include `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` markers around assistant output, so `return_assistant_tokens_mask=True` can produce correct masks. Most model templates don't include these markers natively.
-
-## Original templates
-
-Used for identity comparison only.
-
-### `deepseekv3.jinja`
-
-Original DeepSeek-V3 chat template.
-
-### `glm4moe.jinja`
-
-Original GLM-4-MoE chat template.
-
-### `gptoss.jinja`
-
-Original GPT-OSS chat template.
-
-### `llama3.jinja`
-
-Original Llama 3 chat template.
-
-### `llama3_1.jinja` / `llama3_2.jinja`
-
-Original Llama 3.1 / 3.2 chat templates. Both render tool calls as a single bare JSON object using the key `parameters` (instead of `arguments`) and support at most one tool call per assistant turn.
-
-### `qwen2_5.jinja`
-
-Original Qwen2.5 chat template.
-
-### `qwen3.jinja`
+A [chat template](https://huggingface.co/docs/transformers/en/chat_templating) is a Jinja2 snippet that renders a list of messages into the text format the model was trained on. For example:
+
+```python
+>>> from transformers import AutoTokenizer
+>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
+>>> tokenizer.chat_template
+"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
+>>> tokenizer.apply_chat_template([{"role": "user", "content": "Hi!"}], tokenize=False)
+'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi!<|im_end|>\n'
+```
 
-Original Qwen3 chat template.
+In most cases you don't need to worry about chat templates: models ship their template with the tokenizer, and TRL applies it transparently. However, some TRL recipes rely on features that most shipped templates don't include:
 
-### `qwen3_vl.jinja`
+- **SFT with `assistant_only_loss=True`** needs `&#123;% generation %&#125;` / `&#123;% endgeneration %&#125;` markers around assistant output, so the loss mask can target only assistant tokens.
+- **GRPO with tool calls** needs the template to be *prefix-preserving*: appending a tool message must not change how earlier messages are rendered.
 
-Original Qwen3-VL chat template. Unlike text-only Qwen3, this template is already prefix-preserving (no conditional thinking blocks), so no training patch is needed.
+TRL ships patched templates under [`trl/chat_templates/`](https://github.com/huggingface/trl/tree/main/trl/chat_templates) for common families (Qwen, Llama, DeepSeek-V3, GPT-OSS, ...) and swaps them in automatically for supported models. For any other model, you'll need to patch its template yourself. The rest of this page catalogs what's bundled.
 
-### `qwen3_5_2b_and_below.jinja` / `qwen3_5_4b_and_above.jinja`
+## Supported model families
 
-Original Qwen3.5 chat templates.
+TRL stores reference copies of the original templates so it can identify supported models at init and swap in a training template when needed. The following families are recognized: DeepSeek-V3, GLM-4-MoE, GPT-OSS, Llama 3 / 3.1 / 3.2, Qwen2.5, Qwen3, Qwen3-VL, Qwen3.5.
 
 ## Training templates
 

From 6099d3c48e69316263239a28cca70263fc11224f Mon Sep 17 00:00:00 2001
From: Sergio Paniego Blanco <sergiopaniegoblanco@gmail.com>
Date: Mon, 20 Apr 2026 11:11:16 +0200
Subject: [PATCH 8/8] Update docs/source/chat_templates.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
---
 docs/source/chat_templates.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/chat_templates.md b/docs/source/chat_templates.md
index 37d3409f797..75fe352db62 100644
--- a/docs/source/chat_templates.md
+++ b/docs/source/chat_templates.md
@@ -81,4 +81,4 @@ Wrap assistant message output with `&#123;% generation %&#125;` / `&#123;% endge
 
 ## Related utilities
 
-See [Chat Template Utilities](chat_template_utils) for the helper functions (`clone_chat_template`, `is_chat_template_prefix_preserving`, `get_training_chat_template`) that operate on these templates.
+See [Chat Template Utilities](chat_template_utils) for the helper functions ([`clone_chat_template`], [`is_chat_template_prefix_preserving`], [`get_training_chat_template`]) that operate on these templates.
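The identity comparison behind `get_training_chat_template` can be pictured as an exact-match lookup from a known original template to its patched counterpart. The sketch below is hypothetical. It assumes the `<name>.jinja` / `<name>_training.jinja` pairing suggested by the file names on the Chat Templates page, and it reads template contents from an in-memory dict. `build_registry` and `pick_training_template` are illustrative names, and TRL's actual mapping and loading code may differ.

```python
from typing import Optional


def build_registry(files: dict[str, str]) -> dict[str, str]:
    """Map the text of each original template to the text of its `_training` variant."""
    registry = {}
    for name, text in files.items():
        if name.endswith("_training.jinja"):
            continue
        training_name = name.removesuffix(".jinja") + "_training.jinja"
        if training_name in files:
            registry[text] = files[training_name]
    return registry


def pick_training_template(chat_template: Optional[str], registry: dict[str, str]) -> Optional[str]:
    """Exact-match lookup: any character-level difference means the template is not recognized."""
    if chat_template is None:
        return None
    return registry.get(chat_template)


# Placeholder contents; the real files hold full Jinja templates.
files = {
    "qwen3.jinja": "QWEN3_ORIGINAL",
    "qwen3_training.jinja": "QWEN3_PATCHED",
    "qwen3_vl.jinja": "QWEN3_VL_ORIGINAL",  # original with no `_training` variant: no swap in this sketch
}
registry = build_registry(files)
print(pick_training_template("QWEN3_ORIGINAL", registry))  # QWEN3_PATCHED
print(pick_training_template("QWEN3_VL_ORIGINAL", registry))  # None
```

Because the comparison is exact in this sketch, a locally edited or re-serialized template falls through to `None` and is left as-is.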