Patch MistralCommonTokenizer by juliendenize · Pull Request #41439 · huggingface/transformers

juliendenize · 2025-10-08T09:44:35Z

What does this PR do?

This PR patches the MistralCommonTokenizer:

[BUG FIX] spm now is correctly supported as previous usage would result in an error for _piece_to_id and _is_control_token.
[BUG FIX] previous implementation of _piece_to_id was incorrect for Tekkenizer.

a) the special tokens were not supported
b) the normal tokens were not shifted by adding the number of special tokens

[MAYBE BUG FIX] Changed get_vocab to a function that should better mimic what happens in Transformers: now the mapping is based on the real ids but some ids are missing due to conversion loss of some tokens in Tekken.
[FEATURE] add_generation_prompt has been added. This is to match signature of Transformers. In practice this value is ignored except if:

a) continue_final_message and add_generation_prompt are True an error is raised.
b) if add_generation_prompt is True and the last message is assistant then an error is raised as the user should have passed continue_final_message.

~~[FEATURE] Now the tokenizer is set to fast because:~~

a) ~~it is true~~(Edit: not so sure about that after discussion)
b) it removes annoying message when initializing the tokenizer.

Edit 2: this has been reverted as it doesn't bring value and is misleading.

[OPTIMIZATION] now image tensors are converted without slowness warnings from torch.
[DOCS] updated some docs to remove unused args.

Also added minimal tests to ensure SPM also works in the future.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@patrickvonplaten @ArthurZucker @itazap

github-actions · 2025-10-09T10:00:50Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

ArthurZucker

Thanks for updating! 🤗

HuggingFaceDocBuilderDev · 2025-10-09T15:23:26Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* Fix token_to_id and add add_generation_prompt * Fix spm download * Refactor spm * Try another possibly non-gated spm * Improve get_vocab * lint * Improve get_vocab * Add warn to piece_to_id * Improve from_pretrained raise and revert model spm * Revert fast

* Prepare for transformers v5 upgrade * fix hf cli * update for hf hub changes * fix tokenizer apply_chat_template args * remap include_tokens_per_second * fix tps * handle migration for warmup * use latest hf hub * Fix scan -> ls * fix import * fix for renaming of mistral common tokenizer -> backend * update for fixed tokenziation for llama * Skip phi35 tests for now * remove mistral patch fixed upstream in huggingface/transformers#41439 * use namespacing for patch * don't rely on sdist for e2e tests for now * run modal ci without waiting too * Fix dep for ci * fix imports * Fix fp8 check * fsdp2 fixes * fix version handling * update fsdp version tests for new v5 behavior * Fail multigpu tests after 3 failures * skip known v5 broken tests for now and cleanup * bump deps * unmark skipped test * re-enable test_fsdp_qlora_prequant_packed test * increase multigpu ci timeout * skip broken gemma3 test * reduce timout back to original 120min now that the hanging test is skipped * fix for un-necessary collator for pretraining with bsz=1 * fix: safe_serialization deprecated in transformers v5 rc01 (#3318) * torch_dtype deprecated * load model in float32 for consistency with tests * revert some test fixtures back * use hf cache ls instead of scan * don't strip fsdp_version more fdsp_Version fixes for v5 fix version in fsdp_config fix aliasing fix fsdp_version check check fsdp_version is 2 in both places * Transformers v5 rc2 (#3347) * bump dep * use latest fbgemm, grab model config as part of fixture, un-skip test * import AutoConfig * don't need more problematic autoconfig when specifying config.json manually * add fixtures for argilla ultrafeedback datasets * download phi4-reasoning * fix arg * update tests for phi fast tokenizer changes * use explicit model types for gemma3 --------- Co-authored-by: Wing Lian <wing@axolotl.ai> * fix: AutoModelForVision2Seq -> AutoModelForImageTextToText * chore: remove duplicate * fix: attempt fix gemma3 text mode * chore: lint * ga release of v5 * need property setter for name_or_path for mistral tokenizer * vllm not compatible with transformers v5 * setter for chat_template w mistral too --------- Co-authored-by: NanoCode012 <nano@axolotl.ai> Co-authored-by: salman <salman.mohammadi@outlook.com>

juliendenize commented Oct 8, 2025

View reviewed changes

Comment thread tests/test_tokenization_mistral_common.py Outdated

patrickvonplaten reviewed Oct 8, 2025

View reviewed changes

Comment thread src/transformers/models/auto/tokenization_auto.py

patrickvonplaten reviewed Oct 8, 2025

View reviewed changes

Comment thread src/transformers/tokenization_mistral_common.py Outdated

patrickvonplaten approved these changes Oct 8, 2025

View reviewed changes

juliendenize changed the title ~~Fix token_to_id and add add_generation_prompt~~ Patch MistralCommonTokenizer Oct 9, 2025

ArthurZucker approved these changes Oct 9, 2025

View reviewed changes

Comment thread src/transformers/models/auto/tokenization_auto.py

Comment thread src/transformers/tokenization_mistral_common.py Outdated

juliendenize added 10 commits October 10, 2025 19:50

Fix token_to_id and add add_generation_prompt

3a6d15e

Fix spm download

bb988cf

Refactor spm

2854a32

Try another possibly non-gated spm

8c35dfb

Improve get_vocab

eba2e7d

lint

fb5f272

Improve get_vocab

b477ba4

Add warn to piece_to_id

e0ffb72

Improve from_pretrained raise and revert model spm

904ecbb

Revert fast

8938039

juliendenize force-pushed the patch_mistral_tokenizer branch from bbe0c0f to 8938039 Compare October 10, 2025 17:53

ArthurZucker enabled auto-merge (squash) October 14, 2025 11:05

ArthurZucker merged commit 0566b6f into huggingface:main Oct 14, 2025
17 checks passed

Rocketknight1 mentioned this pull request Nov 18, 2025

How to use padding with Mistral? #42241

Closed

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 1, 2025

remove mistral patch fixed upstream in huggingface/transformers#41439

608e10c

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 3, 2025

remove mistral patch fixed upstream in huggingface/transformers#41439

07e2608

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 8, 2025

remove mistral patch fixed upstream in huggingface/transformers#41439

575ab81

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 17, 2025

remove mistral patch fixed upstream in huggingface/transformers#41439

5097f11

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 30, 2025

remove mistral patch fixed upstream in huggingface/transformers#41439

7739bf3

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jan 14, 2026

remove mistral patch fixed upstream in huggingface/transformers#41439

eaec0ff

winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jan 22, 2026

remove mistral patch fixed upstream in huggingface/transformers#41439

e58eb0a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patch MistralCommonTokenizer#41439

Patch MistralCommonTokenizer#41439
ArthurZucker merged 10 commits into
huggingface:mainfrom
juliendenize:patch_mistral_tokenizer

juliendenize commented Oct 8, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Oct 9, 2025

Uh oh!

ArthurZucker left a comment

Uh oh!

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Oct 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

juliendenize commented Oct 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Before submitting

Who can review?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Oct 9, 2025

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Oct 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

juliendenize commented Oct 8, 2025 •

edited

Loading