fix: Prevent AutoTokenizer type mismatch from directory name substrin… by tarekziade · Pull Request #43791 · huggingface/transformers

tarekziade · 2026-02-06T08:03:34Z

When saving a tokenizer to a local directory and reloading it, the tokenizer type could change to an incorrect class (or fall back to TokenizersBackend) if the directory name contained a model type substring.

Example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("./dumptruck")  # Contains "mpt"
new_tokenizer = AutoTokenizer.from_pretrained("./dumptruck")
type(new_tokenizer)  # TokenizersBackend (WRONG! Should be BertTokenizer)

This affected any directory name containing model type substrings like:

"dumptruck" → matched "mpt"
"gpt2-test" → matched "gpt2"
"roberta_v2" → matched "roberta"

A cascade failure involving two components:

AutoConfig substring matching: When loading a config without an explicit model_type field, AutoConfig would perform substring matching on the path to infer the model type. This caused false positives for local paths.
AutoTokenizer mismatch handling: When the (incorrectly) inferred model_type didn't match the saved tokenizer_class, AutoTokenizer would fall back to the generic TokenizersBackend instead of trusting the explicitly saved tokenizer_class.
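
The false positive in the first step can be reproduced without the library. The sketch below is illustrative only: `KNOWN_MODEL_TYPES` is a hypothetical stand-in for the keys of `CONFIG_MAPPING`, and `infer_model_type_from_name` is a made-up helper that mirrors the longest-first substring loop, not the actual transformers code.

```python
# Hypothetical stand-in for the keys of CONFIG_MAPPING; the real mapping is much larger.
KNOWN_MODEL_TYPES = ["bert", "roberta", "gpt2", "mpt"]


def infer_model_type_from_name(name_or_path):
    """Return the first known model type found as a substring, longest names first."""
    for pattern in sorted(KNOWN_MODEL_TYPES, key=len, reverse=True):
        if pattern in str(name_or_path):
            return pattern
    return None


print(infer_model_type_from_name("./dumptruck"))     # mpt
print(infer_model_type_from_name("roberta_v2"))      # roberta
print(infer_model_type_from_name("./my-tokenizer"))  # None
```

A directory that holds a BERT tokenizer but is named `dumptruck` is therefore treated as an MPT checkpoint, which is the mismatch that the second step then mishandles.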

HuggingFaceDocBuilderDev · 2026-02-06T08:17:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

tarekziade · 2026-02-06T08:30:47Z

run-slow: auto

github-actions · 2026-02-06T08:32:04Z

This comment contains run-slow, running the specified jobs:

models: ["models/auto"]
quantizations: []

github-actions · 2026-02-06T08:43:59Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	2dffdd2e	merge commit
PR	2f758b13	branch commit
main	ab3d1c22	base commit

✅ No failing test specific to this PR 🎉 👏 !

tarekziade · 2026-02-06T09:23:27Z

run-slow: auto

github-actions · 2026-02-06T09:24:35Z

This comment contains run-slow, running the specified jobs:

models: ["models/auto"]
quantizations: []

github-actions · 2026-02-06T09:39:02Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	3397eebc	merge commit
PR	6af95676	branch commit
main	0b2900dd	base commit

✅ No failing test specific to this PR 🎉 👏 !

ArthurZucker

Aria's failure just seems to be the default padding side being different, but it's probably just something we can change in the expected values.

tarekziade · 2026-02-06T10:32:59Z

run-slow: aria, auto

github-actions · 2026-02-06T10:34:18Z

This comment contains run-slow, running the specified jobs:

models: ["models/aria", "models/auto"]
quantizations: []

github-actions · 2026-02-06T10:50:19Z

CI Results

Workflow Run ⚙️

Commit Info

Context	Commit	Description
RUN	d446ae5f	merge commit
PR	0953aa3b	branch commit
main	0b2900dd	base commit

✅ No failing test specific to this PR 🎉 👏 !

Rocketknight1

Overall looks great! We can probably remove the changes in test_tokenization_auto.py though, right?

…g matching

When saving a tokenizer to a local directory and reloading it, the tokenizer type could change to an incorrect class (or fall back to TokenizersBackend) if the directory name contained a model type substring.

Example:

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("./dumptruck")  # Contains "mpt"
new_tokenizer = AutoTokenizer.from_pretrained("./dumptruck")
type(new_tokenizer)  # TokenizersBackend (WRONG! Should be BertTokenizer)
```

This affected any directory name containing model type substrings like:

- "dumptruck" → matched "mpt"
- "gpt2-test" → matched "gpt2"
- "roberta_v2" → matched "roberta"

A cascade failure involving two components:

1. **AutoConfig substring matching**: When loading a config without an explicit model_type field, AutoConfig would perform substring matching on the path to infer the model type. This caused false positives for local paths.
2. **AutoTokenizer mismatch handling**: When the (incorrectly) inferred model_type didn't match the saved tokenizer_class, AutoTokenizer would fall back to the generic TokenizersBackend instead of trusting the explicitly saved tokenizer_class.

Only apply substring matching to remote repository identifiers (containing "/"), not to local directory paths. This prevents false positives while preserving the intended behavior for remote repos like "org/model-name".

When there's a mismatch between config.model_type and tokenizer_config_class, prioritize the explicitly saved tokenizer_class (which is always saved during save_pretrained) instead of immediately falling back to TokenizersBackend.

github-actions · 2026-02-06T15:22:14Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

huggingface#43791) When saving a tokenizer to a local directory and reloading it, the tokenizer type could change to an incorrect class (or fall back to TokenizersBackend) if the directory name contained a model type substring. We're removing this fallback behavior and explicitly checking that `model_type` is provided.
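
As a rough illustration of the behavior this commit message describes, the sketch below resolves a config class from an explicit `model_type` only and never looks at the path. Everything in it is an assumption made for illustration: `CONFIG_CLASSES` is a hypothetical registry standing in for `CONFIG_MAPPING`, `resolve_config_class` is a made-up helper, and the exception type and messages are not taken from the library.

```python
# Hypothetical registry standing in for CONFIG_MAPPING (class names as plain strings).
CONFIG_CLASSES = {"bert": "BertConfig", "mpt": "MptConfig"}


def resolve_config_class(config_dict, name_or_path):
    """Pick a config class from an explicit model_type; never guess from the path."""
    model_type = config_dict.get("model_type")
    if model_type is None:
        # Assumed error type and wording, for illustration only.
        raise ValueError(
            f"No 'model_type' found in the config for {name_or_path!r}; "
            "it can no longer be inferred from the name."
        )
    if model_type not in CONFIG_CLASSES:
        raise ValueError(f"Unknown model_type {model_type!r} for {name_or_path!r}.")
    return CONFIG_CLASSES[model_type]


# The directory name contains "mpt", but only the explicit field is consulted.
print(resolve_config_class({"model_type": "bert"}, "./dumptruck"))  # BertConfig
```

With this shape, a misleading directory name cannot change the result, at the cost of rejecting configs that omit `model_type` altogether.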

tomaarsen · 2026-02-17T12:18:01Z

-        else:
-            # Fallback: use pattern matching on the string.
-            # We go from longer names to shorter names to catch roberta before bert (for instance)
-            for pattern in sorted(CONFIG_MAPPING.keys(), key=len, reverse=True):
-                if pattern in str(pretrained_model_name_or_path):
-                    return CONFIG_MAPPING[pattern].from_dict(config_dict, **unused_kwargs)


This is breaking, right? We need to mention this explicitly in https://github.com/huggingface/transformers/releases/tag/v5.2.0, cc @ArthurZucker @LysandreJik

For example, this now fails:

from transformers import AutoModel

model = AutoModel.from_pretrained("prajjwal1/bert-tiny")

Yeah, for a very small minority of models, but yes. Let's put it up front, sorry.
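
Since the fallback only ever applied to configs without a `model_type`, the models affected by this change are presumably those whose `config.json` omits that field. One way to check a locally available config file is sketched below; `has_explicit_model_type` is a hypothetical helper, and the demonstration uses throwaway files rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def has_explicit_model_type(config_path):
    """Return True if the JSON config at config_path has a non-empty "model_type"."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return bool(config.get("model_type"))


# Demonstration with throwaway files instead of a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy_config.json"
    legacy.write_text(json.dumps({"hidden_size": 128}), encoding="utf-8")
    modern = Path(tmp) / "modern_config.json"
    modern.write_text(
        json.dumps({"model_type": "bert", "hidden_size": 128}), encoding="utf-8"
    )
    legacy_ok = has_explicit_model_type(legacy)
    modern_ok = has_explicit_model_type(modern)

print(legacy_ok, modern_ok)  # False True
```

A config that fails this check would have depended on the removed name-based inference in order to load through the Auto classes.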

ArthurZucker reviewed Feb 6, 2026

Comment thread src/transformers/models/auto/configuration_auto.py Outdated

ArthurZucker approved these changes Feb 6, 2026

tarekziade force-pushed the tarekziade-fix-name-mangling branch from 6af9567 to 0953aa3 Compare February 6, 2026 10:01

tarekziade requested a review from Rocketknight1 February 6, 2026 10:35

vasqu reviewed Feb 6, 2026

Comment thread tests/models/auto/test_tokenization_auto.py Outdated

Rocketknight1 approved these changes Feb 6, 2026

Comment thread tests/models/auto/test_tokenization_auto.py Outdated

tarekziade added 14 commits February 6, 2026 16:05

re-raise if not available

1da41ae

trying the tag way

614bb2c

reduce complexity

3609666

no need to sort if early exit

fa6052f

another early exit

220ac6a

more defensive on the tag

56b8682

stick with the tests style (no docstrings)

09d3076

revert tag checking we will just remove entirely this behavior

decc562

imports at the top

cea4c9b

fix padding order

d404bd1

flip back behavior for TokenizersBackend

6f5f8a7

check all cases

6049678

revert test_tokenization_auto changes

ec5296f

tarekziade force-pushed the tarekziade-fix-name-mangling branch from 47f73d8 to ec5296f Compare February 6, 2026 15:07

tarekziade enabled auto-merge (squash) February 6, 2026 15:15

tarekziade disabled auto-merge February 6, 2026 15:15

tarekziade enabled auto-merge (squash) February 6, 2026 15:17

Rocketknight1 mentioned this pull request Feb 6, 2026

Chase TokenizersBackend issue #43735

Closed

tarekziade disabled auto-merge February 6, 2026 15:18

revert aria test change

281114e

tarekziade merged commit b9042c4 into main Feb 6, 2026
26 checks passed

tarekziade deleted the tarekziade-fix-name-mangling branch February 6, 2026 15:32

tomaarsen reviewed Feb 17, 2026

tomaarsen mentioned this pull request Feb 17, 2026

[compat] Introduce Transformers v5.2 compatibility: trainer _nested_gather moved huggingface/sentence-transformers#3664

Merged

winglian mentioned this pull request Mar 6, 2026

upgrade transformers==5.3.0 trl==0.29.0 kernels axolotl-ai-cloud/axolotl#3459

Merged

6 participants