
Auto detect wrong mapping models #44298

Draft

ArthurZucker wants to merge 16 commits into main from
auto-detect-wrong-mapping-models

Conversation


ArthurZucker (Collaborator) commented Feb 26, 2026

What does this PR do?

A few issues we did not catch:

  • def pre_tokenizer(self, replacement, add_prefix_space):
    return pre_tokenizers.Split(" ", "merged_with_previous")
    was missing from GemmaTokenizer
  • SPM's precompiled charsmap is essential for T5-like models, so we are bringing it back to make sure an AutoTokenizer that maps to T5 works
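For reference, the `merged_with_previous` split behavior that the missing Gemma `pre_tokenizer` provides can be sketched in plain Python. The helper below is purely illustrative (it is not the tokenizers implementation): the matched delimiter is kept attached to the piece *before* it, so no whitespace is lost.

```python
def split_merged_with_previous(text: str, delimiter: str = " ") -> list[str]:
    """Illustrative sketch of tokenizers' Split(delimiter, "merged_with_previous")."""
    pieces = []
    current = ""
    for ch in text:
        current += ch
        if ch == delimiter:
            pieces.append(current)  # the delimiter merges with the preceding piece
            current = ""
    if current:
        pieces.append(current)
    return pieces

print(split_merged_with_previous("hello world foo"))
# ['hello ', 'world ', 'foo']
```

This is why the pre_tokenizer matters: with a plain whitespace split the spaces would be dropped, and round-tripping the text would change it.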

Equivalence check:

from transformers import AutoTokenizer, TokenizersBackend
import difflib
from rich.console import Console
from rich.text import Text

def print_unified_diff_colored(a: str, b: str, fromfile="AutoTokenizer", tofile="TokenizersBackend") -> None:
    diff = difflib.unified_diff(
        a.splitlines(),
        b.splitlines(),
        fromfile=fromfile,
        tofile=tofile,
        lineterm="",
        n=10,
    )

    console = Console()
    for line in diff:
        # Basic unified diff coloring
        if line.startswith(("---", "+++")):
            console.print(Text(line, style="bold"))
        elif line.startswith("@@"):
            console.print(Text(line, style="bold cyan"))
        elif line.startswith("+") and not line.startswith("+++"):
            console.print(Text(line, style="green"))
        elif line.startswith("-") and not line.startswith("---"):
            console.print(Text(line, style="red"))
        else:
            console.print(line)


model_id = "Buseak/md_mt5_0109_v8"
model_id = "DreamFast/gemma-3-12b-it-heretic"
model_id = "thelamapi/next-1b"
# model_id = "cheyennewing/umt5-base-cak-denoise_finetuning_fold2"
# model_id = "mlx-community/granite-34b-code-instruct-8bit"
tok = AutoTokenizer.from_pretrained(model_id) # forces remote code

def print_tokenizer(tokenizer):
    model_str = "\n".join(repr(k) for k in tokenizer.get_added_tokens_decoder().values())
    return (
        f"model:\t\t\t{model_str}\n"
        f"normalizer:\t\t{tokenizer.normalizer}\n"
        f"pre_tokenizer:\t\t{tokenizer.pre_tokenizer}\n"
        f"post_processor:\t\t{tokenizer.post_processor}\n"
        f"decoder:\t\t{tokenizer.decoder}\n"
        f"truncation:\t\t{tokenizer.truncation}\n"
    )

j1 = print_tokenizer(tok._tokenizer)

tok2 = TokenizersBackend.from_pretrained(model_id)
j2 = print_tokenizer(tok2._tokenizer)

print("AutoTokenizer:", tok.__class__.__name__)
print_unified_diff_colored(j1, j2)
# if there is a diff:
#   - maybe the tiktoken.model is not enough to get all the info. Here,
#     Split(pattern=String(" "), behavior=MergedWithPrevious, invert=False)
#     cannot be extracted from it -> map to `GemmaTokenizer` (basically a converter)
#   - maybe the mapped tokenizer is not correct:
#       - either the `XxxxxTokenizer` is missing something
#       - or the specific model was converted in a different manner (maybe from sentencepiece) ->
#         LlamaTokenizer had a universal converter, so those models only need the universal
#         TokenizersBackend and do not need to map to Llama.
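The triage described in the trailing comments boils down to one question: do the two dumps agree? A minimal stdlib sketch of that check (the helper name and return convention are illustrative, not transformers API):

```python
import difflib

def configs_match(a: str, b: str) -> bool:
    """True when the two tokenizer dumps are identical, i.e. the Auto-mapped
    tokenizer and the TokenizersBackend agree on normalizer/pre_tokenizer/etc."""
    return not any(
        line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        for line in difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm="")
    )

assert configs_match("normalizer: None\n", "normalizer: None\n")
assert not configs_match("pre_tokenizer: None\n", "pre_tokenizer: Split\n")
```

When `configs_match` returns False, the diff itself tells you which component diverged, which is the signal used above to decide between "map to a dedicated converter like GemmaTokenizer" and "fix the mapped `XxxxxTokenizer`".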

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

github-actions bot commented Mar 2, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, gemma, t5

itazap added a commit that referenced this pull request Mar 2, 2026
itazap added a commit that referenced this pull request Mar 4, 2026
ArthurZucker added a commit that referenced this pull request Mar 4, 2026
…4255)

* update deepseek v2 for tokenizers v5

* adding remote code fix

* fix deepseek name

* handle spm conversion from proto only when overriding bad_models

* add script to compare xlni and code_search_net output of 2 tokenizers

* tiktoken models support

* fix tests

* testssss

* fix gemma

* apply some feedback

* paligemma processor tests fix

* add relevant changes from #44298

* json serializable fix

* add more xlni cases

* t5 fix

* ruff check code quality

* missed file for t5 test fix

* modular failures

* other modular fixes

* tiktoken.model test

* more feedback updates!

* fixing models so AutoTokenizer == TokenizersBackend - aligning with converters

* seamless m4t

* missed the most important files

* Revert "missed the most important files"

This reverts commit 8fcc625.

* undo changes to big bird , bert, seamless

* setup and qual

* lasr

* t5

* dpr bert

* xlmroberta

* reformer

* nllb

* style and shit

* update

* fix

* extract the charsmap

* fix mbart?

* style

* nllb and test tok common read spm precompiled charsmap

* fix whisper?

* nllb

* checked on v4!

* fix repo

* fix lasr

* style

---------

Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-75.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-170-31.ec2.internal>
Co-authored-by: Arthur <arthur.zucker@gmail.com>


4 participants