
Auto detect wrong mapping models #44298

Draft

ArthurZucker wants to merge 16 commits into main from
auto-detect-wrong-mapping-models

Conversation


ArthurZucker (Collaborator) commented Feb 26, 2026

What does this PR do?

A few issues we did not catch:

  • def pre_tokenizer(self, replacement, add_prefix_space):
    return pre_tokenizers.Split(" ", "merged_with_previous")
    was missing from GemmaTokenizer
  • SPM's precompiled charsmap is essential for T5-like models, so we are bringing it back to make sure an AutoTokenizer that maps to T5 works
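For reference, the `merged_with_previous` split behavior that the missing Gemma `pre_tokenizer` provides can be sketched in plain Python. The helper below is purely illustrative (it is not the tokenizers implementation): the matched delimiter is kept attached to the piece *before* it, so no whitespace is lost.

```python
def split_merged_with_previous(text: str, delimiter: str = " ") -> list[str]:
    """Illustrative sketch of tokenizers' Split(delimiter, "merged_with_previous")."""
    pieces = []
    current = ""
    for ch in text:
        current += ch
        if ch == delimiter:
            pieces.append(current)  # the delimiter merges with the preceding piece
            current = ""
    if current:
        pieces.append(current)
    return pieces

print(split_merged_with_previous("hello world foo"))
# ['hello ', 'world ', 'foo']
```

This is why the pre_tokenizer matters: with a plain whitespace split the spaces would be dropped, and round-tripping the text would change it.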

Equivalence check:

from transformers import AutoTokenizer, TokenizersBackend
import difflib
from rich.console import Console
from rich.text import Text

def print_unified_diff_colored(a: str, b: str, fromfile="AutoTokenizer", tofile="TokenizersBackend") -> None:
    diff = difflib.unified_diff(
        a.splitlines(),
        b.splitlines(),
        fromfile=fromfile,
        tofile=tofile,
        lineterm="",
        n=10,
    )

    console = Console()
    for line in diff:
        # Basic unified diff coloring
        if line.startswith(("---", "+++")):
            console.print(Text(line, style="bold"))
        elif line.startswith("@@"):
            console.print(Text(line, style="bold cyan"))
        elif line.startswith("+") and not line.startswith("+++"):
            console.print(Text(line, style="green"))
        elif line.startswith("-") and not line.startswith("---"):
            console.print(Text(line, style="red"))
        else:
            console.print(line)


model_id = "Buseak/md_mt5_0109_v8"
model_id = "DreamFast/gemma-3-12b-it-heretic"
model_id = "thelamapi/next-1b"
# model_id = "cheyennewing/umt5-base-cak-denoise_finetuning_fold2"
# model_id = "mlx-community/granite-34b-code-instruct-8bit"
tok = AutoTokenizer.from_pretrained(model_id) # forces remote code

def print_tokenizer(tokenizer):
    model_str = "\n".join(repr(k) for k in tokenizer.get_added_tokens_decoder().values())
    return (
        f"model:\t\t\t{model_str}\n"
        f"normalizer:\t\t{tokenizer.normalizer}\n"
        f"pre_tokenizer:\t\t{tokenizer.pre_tokenizer}\n"
        f"post_processor:\t\t{tokenizer.post_processor}\n"
        f"decoder:\t\t{tokenizer.decoder}\n"
        f"truncation:\t\t{tokenizer.truncation}\n"
    )

j1 = print_tokenizer(tok._tokenizer)

tok2 = TokenizersBackend.from_pretrained(model_id)
j2 = print_tokenizer(tok2._tokenizer)

print("AutoTokenizer:", tok.__class__.__name__)
print_unified_diff_colored(j1, j2)
# if there is a diff:
#   - maybe the tiktoken.model is not enough to get all the info. Here,
#     Split(pattern=String(" "), behavior=MergedWithPrevious, invert=False)
#     cannot be extracted from it -> map to `GemmaTokenizer` (basically a converter)
#   - maybe the mapped tokenizer is not correct:
#       - either the `XxxxxTokenizer` is missing something
#       - or the specific model was converted in a different manner (maybe from sentencepiece) ->
#         LlamaTokenizer had a universal converter, so those models only need the universal
#         TokenizersBackend and do not need to map to Llama.
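The triage described in the trailing comments boils down to one question: do the two dumps agree? A minimal stdlib sketch of that check (the helper name and return convention are illustrative, not transformers API):

```python
import difflib

def configs_match(a: str, b: str) -> bool:
    """True when the two tokenizer dumps are identical, i.e. the Auto-mapped
    tokenizer and the TokenizersBackend agree on normalizer/pre_tokenizer/etc."""
    return not any(
        line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
        for line in difflib.unified_diff(a.splitlines(), b.splitlines(), lineterm="")
    )

assert configs_match("normalizer: None\n", "normalizer: None\n")
assert not configs_match("pre_tokenizer: None\n", "pre_tokenizer: Split\n")
```

When `configs_match` returns False, the diff itself tells you which component diverged, which is the signal used above to decide between "map to a dedicated converter like GemmaTokenizer" and "fix the mapped `XxxxxTokenizer`".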

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

github-actions bot commented Mar 2, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, gemma, t5

itazap added a commit that referenced this pull request Mar 2, 2026
itazap added a commit that referenced this pull request Mar 4, 2026
ArthurZucker added a commit that referenced this pull request Mar 4, 2026
…4255)

* update deepseek v2 for tokenizers v5

* adding remote code fix

* fix deepseek name

* handle spm conversion from proto only when overriding bad_models

* add script to compare xlni and code_search_net output of 2 tokenizers

* tiktoken models support

* fix tests

* testssss

* fix gemma

* apply some feedback

* paligemma processor tests fix

* add relevant changes from #44298

* json serializable fix

* add more xlni cases

* t5 fix

* ruff check code quality

* missed file for t5 test fix

* modular failures

* other modular fixes

* tiktoken.model test

* more feedback updates!

* fixing models so AutoTokenizer == TokenizersBackend - aligning with converters

* seamless m4t

* missed the most important files

* Revert "missed the most important files"

This reverts commit 8fcc625.

* undo changes to big bird , bert, seamless

* setup and qual

* lasr

* t5

* dpr bert

* xlmroberta

* reformer

* nllb

* style and shit

* update

* fix

* extract the charsmap

* fix mbart?

* style

* nllb and test tok common read spm precompiled charsmap

* fix whisper?

* nllb

* checked on v4!

* fix repo

* fix lasr

* style

---------

Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-164-75.ec2.internal>
Co-authored-by: ita.zaporozhets@huggingface.co <ita_zaporozhets@ip-26-0-170-31.ec2.internal>
Co-authored-by: Arthur <arthur.zucker@gmail.com>


4 participants