Skip to content

Patch MistralCommonTokenizer#41439

Merged
ArthurZucker merged 10 commits into
huggingface:mainfrom
juliendenize:patch_mistral_tokenizer
Oct 14, 2025
Merged

Patch MistralCommonTokenizer#41439
ArthurZucker merged 10 commits into
huggingface:mainfrom
juliendenize:patch_mistral_tokenizer

Conversation

@juliendenize

@juliendenize juliendenize commented Oct 8, 2025

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR patches the MistralCommonTokenizer:

  1. [BUG FIX] spm now is correctly supported as previous usage would result in an error for _piece_to_id and _is_control_token.
  2. [BUG FIX] previous implementation of _piece_to_id was incorrect for Tekkenizer.

a) the special tokens were not supported
b) the normal tokens were not shifted by adding the number of special tokens

  1. [MAYBE BUG FIX] Changed get_vocab to a function that should better mimic what happens in Transformers: now the mapping is based on the real ids but some ids are missing due to conversion loss of some tokens in Tekken.
  2. [FEATURE] add_generation_prompt has been added. This is to match signature of Transformers. In practice this value is ignored except if:

a) continue_final_message and add_generation_prompt are True an error is raised.
b) if add_generation_prompt is True and the last message is assistant then an error is raised as the user should have passed continue_final_message.

  1. [FEATURE] Now the tokenizer is set to fast because:

a) it is true(Edit: not so sure about that after discussion)
b) it removes annoying message when initializing the tokenizer.

Edit 2: this has been reverted as it doesn't bring value and is misleading.

  1. [OPTIMIZATION] now image tensors are converted without slowness warnings from torch.
  2. [DOCS] updated some docs to remove unused args.

Also added minimal tests to ensure SPM also works in the future.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@patrickvonplaten @ArthurZucker @itazap

Comment thread tests/test_tokenization_mistral_common.py Outdated
Comment thread src/transformers/models/auto/tokenization_auto.py
Comment thread src/transformers/tokenization_mistral_common.py Outdated
@juliendenize juliendenize changed the title Fix token_to_id and add add_generation_prompt Patch MistralCommonTokenizer Oct 9, 2025
@github-actions

github-actions Bot commented Oct 9, 2025

Copy link
Copy Markdown
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@ArthurZucker ArthurZucker left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating! 🤗

Comment thread src/transformers/models/auto/tokenization_auto.py
Comment thread src/transformers/tokenization_mistral_common.py Outdated
@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@juliendenize juliendenize force-pushed the patch_mistral_tokenizer branch from bbe0c0f to 8938039 Compare October 10, 2025 17:53
@ArthurZucker ArthurZucker enabled auto-merge (squash) October 14, 2025 11:05
@ArthurZucker ArthurZucker merged commit 0566b6f into huggingface:main Oct 14, 2025
17 checks passed
ngazagna-qc pushed a commit to ngazagna-qc/transformers that referenced this pull request Oct 23, 2025
* Fix token_to_id and add add_generation_prompt

* Fix spm download

* Refactor spm

* Try another possibly non-gated spm

* Improve get_vocab

* lint

* Improve get_vocab

* Add warn to piece_to_id

* Improve from_pretrained raise and revert model spm

* Revert fast
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 1, 2025
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 3, 2025
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 8, 2025
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 17, 2025
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Dec 30, 2025
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jan 14, 2026
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jan 22, 2026
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* Fix token_to_id and add add_generation_prompt

* Fix spm download

* Refactor spm

* Try another possibly non-gated spm

* Improve get_vocab

* lint

* Improve get_vocab

* Add warn to piece_to_id

* Improve from_pretrained raise and revert model spm

* Revert fast
winglian added a commit to axolotl-ai-cloud/axolotl that referenced this pull request Jan 27, 2026
* Prepare for transformers v5 upgrade

* fix hf cli

* update for hf hub changes

* fix tokenizer apply_chat_template args

* remap include_tokens_per_second

* fix tps

* handle migration for warmup

* use latest hf hub

* Fix scan -> ls

* fix import

* fix for renaming of mistral common tokenizer -> backend

* update for fixed tokenziation for llama

* Skip phi35 tests for now

* remove mistral patch fixed upstream in huggingface/transformers#41439

* use namespacing for patch

* don't rely on sdist for e2e tests for now

* run modal ci without waiting too

* Fix dep for ci

* fix imports

* Fix fp8 check

* fsdp2 fixes

* fix version handling

* update fsdp version tests for new v5 behavior

* Fail multigpu tests after 3 failures

* skip known v5 broken tests for now and cleanup

* bump deps

* unmark skipped test

* re-enable test_fsdp_qlora_prequant_packed test

* increase multigpu ci timeout

* skip broken gemma3 test

* reduce timout back to original 120min now that the hanging test is skipped

* fix for un-necessary collator for pretraining with bsz=1

* fix: safe_serialization deprecated in transformers v5 rc01 (#3318)

* torch_dtype deprecated

* load model in float32 for consistency with tests

* revert some test fixtures back

* use hf cache ls instead of scan

* don't strip fsdp_version

more fdsp_Version fixes for v5
fix version in fsdp_config
fix aliasing
fix fsdp_version check
check fsdp_version is 2 in both places

* Transformers v5 rc2 (#3347)

* bump dep

* use latest fbgemm, grab model config as part of fixture, un-skip test

* import AutoConfig

* don't need more problematic autoconfig when specifying config.json manually

* add fixtures for argilla ultrafeedback datasets

* download phi4-reasoning

* fix arg

* update tests for phi fast tokenizer changes

* use explicit model types for gemma3

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>

* fix: AutoModelForVision2Seq -> AutoModelForImageTextToText

* chore: remove duplicate

* fix: attempt fix gemma3 text mode

* chore: lint

* ga release of v5

* need property setter for name_or_path for mistral tokenizer

* vllm not compatible with transformers v5

* setter for chat_template w mistral too

---------

Co-authored-by: NanoCode012 <nano@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants