Patch MistralCommonTokenizer#41439
Merged
ArthurZucker merged 10 commits intoOct 14, 2025
Merged
Conversation
juliendenize
commented
Oct 8, 2025
patrickvonplaten
approved these changes
Oct 8, 2025
Contributor
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto |
ArthurZucker
approved these changes
Oct 9, 2025
ArthurZucker
left a comment
Collaborator
There was a problem hiding this comment.
Thanks for updating! 🤗
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
bbe0c0f to
8938039
Compare
ngazagna-qc
pushed a commit
to ngazagna-qc/transformers
that referenced
this pull request
Oct 23, 2025
* Fix token_to_id and add add_generation_prompt * Fix spm download * Refactor spm * Try another possibly non-gated spm * Improve get_vocab * lint * Improve get_vocab * Add warn to piece_to_id * Improve from_pretrained raise and revert model spm * Revert fast
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Dec 1, 2025
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Dec 3, 2025
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Dec 8, 2025
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Dec 17, 2025
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Dec 30, 2025
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Jan 14, 2026
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Jan 22, 2026
SangbumChoi
pushed a commit
to SangbumChoi/transformers
that referenced
this pull request
Jan 23, 2026
* Fix token_to_id and add add_generation_prompt * Fix spm download * Refactor spm * Try another possibly non-gated spm * Improve get_vocab * lint * Improve get_vocab * Add warn to piece_to_id * Improve from_pretrained raise and revert model spm * Revert fast
winglian
added a commit
to axolotl-ai-cloud/axolotl
that referenced
this pull request
Jan 27, 2026
* Prepare for transformers v5 upgrade * fix hf cli * update for hf hub changes * fix tokenizer apply_chat_template args * remap include_tokens_per_second * fix tps * handle migration for warmup * use latest hf hub * Fix scan -> ls * fix import * fix for renaming of mistral common tokenizer -> backend * update for fixed tokenziation for llama * Skip phi35 tests for now * remove mistral patch fixed upstream in huggingface/transformers#41439 * use namespacing for patch * don't rely on sdist for e2e tests for now * run modal ci without waiting too * Fix dep for ci * fix imports * Fix fp8 check * fsdp2 fixes * fix version handling * update fsdp version tests for new v5 behavior * Fail multigpu tests after 3 failures * skip known v5 broken tests for now and cleanup * bump deps * unmark skipped test * re-enable test_fsdp_qlora_prequant_packed test * increase multigpu ci timeout * skip broken gemma3 test * reduce timout back to original 120min now that the hanging test is skipped * fix for un-necessary collator for pretraining with bsz=1 * fix: safe_serialization deprecated in transformers v5 rc01 (#3318) * torch_dtype deprecated * load model in float32 for consistency with tests * revert some test fixtures back * use hf cache ls instead of scan * don't strip fsdp_version more fdsp_Version fixes for v5 fix version in fsdp_config fix aliasing fix fsdp_version check check fsdp_version is 2 in both places * Transformers v5 rc2 (#3347) * bump dep * use latest fbgemm, grab model config as part of fixture, un-skip test * import AutoConfig * don't need more problematic autoconfig when specifying config.json manually * add fixtures for argilla ultrafeedback datasets * download phi4-reasoning * fix arg * update tests for phi fast tokenizer changes * use explicit model types for gemma3 --------- Co-authored-by: Wing Lian <wing@axolotl.ai> * fix: AutoModelForVision2Seq -> AutoModelForImageTextToText * chore: remove duplicate * fix: attempt fix gemma3 text mode * chore: lint * ga release of v5 * need property setter for name_or_path for mistral tokenizer * vllm not compatible with transformers v5 * setter for chat_template w mistral too --------- Co-authored-by: NanoCode012 <nano@axolotl.ai> Co-authored-by: salman <salman.mohammadi@outlook.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
This PR patches the
MistralCommonTokenizer:_piece_to_idand_is_control_token._piece_to_idwas incorrect for Tekkenizer.a) the special tokens were not supported
b) the normal tokens were not shifted by adding the number of special tokens
get_vocabto a function that should better mimic what happens inTransformers: now the mapping is based on the real ids but some ids are missing due to conversion loss of some tokens in Tekken.add_generation_prompthas been added. This is to match signature of Transformers. In practice this value is ignored except if:a)
continue_final_messageandadd_generation_promptareTruean error is raised.b) if
add_generation_promptisTrueand the last message isassistantthen an error is raised as the user should have passedcontinue_final_message.[FEATURE] Now the tokenizer is set to fast because:a)
it is true(Edit: not so sure about that after discussion)b) it removes annoying message when initializing the tokenizer.
Edit 2: this has been reverted as it doesn't bring value and is misleading.
Also added minimal tests to ensure SPM also works in the future.
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
@patrickvonplaten @ArthurZucker @itazap