
fix: Prevent AutoTokenizer type mismatch from directory name substrin…#43791

Merged
tarekziade merged 15 commits into main from tarekziade-fix-name-mangling
Feb 6, 2026

@tarekziade

@tarekziade tarekziade commented Feb 6, 2026

Collaborator

When saving a tokenizer to a local directory and reloading it, the tokenizer type could change to an incorrect class (or fall back to TokenizersBackend) if the directory name contained a model type substring.

Example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("./dumptruck")  # Directory name contains "mpt"
new_tokenizer = AutoTokenizer.from_pretrained("./dumptruck")
type(new_tokenizer)  # TokenizersBackend (WRONG! Should be BertTokenizer)

This affected any directory name containing model type substrings like:

  • "dumptruck" → matched "mpt"
  • "gpt2-test" → matched "gpt2"
  • "roberta_v2" → matched "roberta"
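
To see why these names trigger a match, here is a minimal sketch of plain substring containment. `KNOWN_MODEL_TYPES` and `matching_model_types` are hypothetical names for illustration, and the list is a tiny stand-in for the real registry of model types.

```python
# Hypothetical subset of model type names; the real registry is much larger.
KNOWN_MODEL_TYPES = ["bert", "gpt2", "mpt", "roberta"]


def matching_model_types(path: str) -> list[str]:
    """Return every known model type that occurs as a substring of `path`."""
    return [name for name in KNOWN_MODEL_TYPES if name in path]


for directory in ["./dumptruck", "./gpt2-test", "./roberta_v2", "./my-tokenizer"]:
    print(directory, "->", matching_model_types(directory))
# ./dumptruck -> ['mpt']
# ./gpt2-test -> ['gpt2']
# ./roberta_v2 -> ['bert', 'roberta']
# ./my-tokenizer -> []
```

Note that "roberta_v2" contains both "roberta" and "bert", so a single directory name can match more than one model type.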

The root cause was a cascade failure involving two components:

  1. AutoConfig substring matching: When loading a config without an explicit model_type field, AutoConfig would perform substring matching on the path to infer the model type. This caused false positives for local paths.

  2. AutoTokenizer mismatch handling: When the (incorrectly) inferred model_type didn't match the saved tokenizer_class, AutoTokenizer would fall back to the generic TokenizersBackend instead of trusting the explicitly saved tokenizer_class.
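
The interaction of the two steps can be pictured with a toy model. Everything below is hypothetical and heavily simplified: `TOKENIZER_FOR_MODEL_TYPE`, `infer_model_type`, `resolve_tokenizer_class`, and the `"SomeOtherTokenizer"` placeholder are illustrative names, not code from transformers.

```python
from typing import Optional

# Hypothetical mapping from model type to the tokenizer class it expects.
TOKENIZER_FOR_MODEL_TYPE = {
    "bert": "BertTokenizer",
    "mpt": "SomeOtherTokenizer",  # placeholder name, for illustration only
}


def infer_model_type(path: str, config: dict) -> Optional[str]:
    """Step 1: use the explicit model_type, otherwise guess from the path."""
    if "model_type" in config:
        return config["model_type"]
    for name in sorted(TOKENIZER_FOR_MODEL_TYPE, key=len, reverse=True):
        if name in path:
            return name
    return None


def resolve_tokenizer_class(path: str, config: dict, saved_tokenizer_class: str) -> str:
    """Step 2 (old behavior): go generic when the guess and the saved class disagree."""
    expected = TOKENIZER_FOR_MODEL_TYPE.get(infer_model_type(path, config))
    if expected is not None and expected != saved_tokenizer_class:
        return "TokenizersBackend"
    return saved_tokenizer_class


print(resolve_tokenizer_class("./dumptruck", {}, "BertTokenizer"))      # TokenizersBackend
print(resolve_tokenizer_class("./my-tokenizer", {}, "BertTokenizer"))   # BertTokenizer
```

In this model the saved class is correct in both cases; only the directory name changes the outcome.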

Comment thread src/transformers/models/auto/configuration_auto.py Outdated
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tarekziade

Collaborator Author

run-slow: auto

@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/auto"]
quantizations: []

@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
|---------|--------|-------------|
| RUN | 2dffdd2e | merge commit |
| PR | 2f758b13 | branch commit |
| main | ab3d1c22 | base commit |

✅ No failing test specific to this PR 🎉 👏 !

@tarekziade

Collaborator Author

run-slow: auto

@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/auto"]
quantizations: []

@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
|---------|--------|-------------|
| RUN | 3397eebc | merge commit |
| PR | 6af95676 | branch commit |
| main | 0b2900dd | base commit |

✅ No failing test specific to this PR 🎉 👏 !

@ArthurZucker ArthurZucker left a comment

Collaborator

Aria's failure seems to be just a different default padding side, but it's probably something we can fix by changing the expected values.

@tarekziade tarekziade force-pushed the tarekziade-fix-name-mangling branch from 6af9567 to 0953aa3 Compare February 6, 2026 10:01
@tarekziade

Collaborator Author

run-slow: aria, auto

@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

This comment contains run-slow, running the specified jobs:

models: ["models/aria", "models/auto"]
quantizations: []

@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

CI Results

Workflow Run ⚙️

Commit Info

| Context | Commit | Description |
|---------|--------|-------------|
| RUN | d446ae5f | merge commit |
| PR | 0953aa3b | branch commit |
| main | 0b2900dd | base commit |

✅ No failing test specific to this PR 🎉 👏 !

Comment thread tests/models/auto/test_tokenization_auto.py Outdated

@Rocketknight1 Rocketknight1 left a comment

Member

Overall looks great! We can probably remove the changes in test_tokenization_auto.py though, right?

Comment thread tests/models/auto/test_tokenization_auto.py Outdated
…g matching

Only apply substring matching to remote repository identifiers (containing "/"),
not to local directory paths. This prevents false positives while preserving
the intended behavior for remote repos like "org/model-name".
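
A minimal sketch of this guard follows. The names `KNOWN_MODEL_TYPES` and `guess_model_type` are hypothetical, and using `os.path.isdir` to recognize a local directory is an assumption made here, since a relative path such as "./dumptruck" also contains "/".

```python
import os
import tempfile

# Hypothetical subset of model type names, for illustration only.
KNOWN_MODEL_TYPES = ["bert", "gpt2", "mpt", "roberta"]


def guess_model_type(name_or_path: str):
    """Guess a model type from a repo id such as "org/name", never from a local directory."""
    # Assumption: an existing directory is treated as a local path and skipped.
    if os.path.isdir(name_or_path) or "/" not in name_or_path:
        return None
    # Longest names first, so "roberta" wins over "bert".
    for name in sorted(KNOWN_MODEL_TYPES, key=len, reverse=True):
        if name in name_or_path:
            return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    local_dir = os.path.join(tmp, "dumptruck")
    os.mkdir(local_dir)
    print(guess_model_type(local_dir))  # None: local directory, no guessing

print(guess_model_type("some-org/roberta-large"))  # roberta
```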

When there's a mismatch between config.model_type and tokenizer_config_class,
prioritize the explicitly saved tokenizer_class (which is always saved during
save_pretrained) instead of immediately falling back to TokenizersBackend.
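
As a rough illustration of that priority order, assuming a simplified resolver (the function `pick_tokenizer_class` and the `"SomeOtherTokenizer"` placeholder are hypothetical, not the library's code):

```python
from typing import Optional


def pick_tokenizer_class(
    expected_from_model_type: Optional[str],
    saved_tokenizer_class: Optional[str],
) -> str:
    """Prefer the explicitly saved class; use the generic backend only as a last resort."""
    if saved_tokenizer_class is not None:
        return saved_tokenizer_class
    if expected_from_model_type is not None:
        return expected_from_model_type
    return "TokenizersBackend"


# A mismatch no longer discards the saved class.
print(pick_tokenizer_class("SomeOtherTokenizer", "BertTokenizer"))  # BertTokenizer
print(pick_tokenizer_class(None, None))                             # TokenizersBackend
```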
@tarekziade tarekziade force-pushed the tarekziade-fix-name-mangling branch from 47f73d8 to ec5296f Compare February 6, 2026 15:07
@tarekziade tarekziade enabled auto-merge (squash) February 6, 2026 15:15
@tarekziade tarekziade disabled auto-merge February 6, 2026 15:15
@tarekziade tarekziade enabled auto-merge (squash) February 6, 2026 15:17
@tarekziade tarekziade disabled auto-merge February 6, 2026 15:18
@github-actions

github-actions Bot commented Feb 6, 2026

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@tarekziade tarekziade merged commit b9042c4 into main Feb 6, 2026
26 checks passed
@tarekziade tarekziade deleted the tarekziade-fix-name-mangling branch February 6, 2026 15:32
jiosephlee pushed a commit to jiosephlee/transformers_latest that referenced this pull request Feb 11, 2026
huggingface#43791)

We're removing this fallback behavior and explicitly checking that `model_type` is provided.
Comment on lines -1424 to -1429
else:
    # Fallback: use pattern matching on the string.
    # We go from longer names to shorter names to catch roberta before bert (for instance)
    for pattern in sorted(CONFIG_MAPPING.keys(), key=len, reverse=True):
        if pattern in str(pretrained_model_name_or_path):
            return CONFIG_MAPPING[pattern].from_dict(config_dict, **unused_kwargs)
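
The removed loop can be exercised in isolation with a stand-in mapping. In this sketch `CONFIG_MAPPING` is a hypothetical three-entry dictionary of strings rather than the real registry, and `match_by_pattern` is an illustrative helper.

```python
# Hypothetical stand-in for the real mapping: model type -> config class name.
CONFIG_MAPPING = {"bert": "BertConfig", "roberta": "RobertaConfig", "mpt": "MptConfig"}


def match_by_pattern(name_or_path, longest_first=True):
    """Return the first config whose key is a substring of the name, or None."""
    if longest_first:
        keys = sorted(CONFIG_MAPPING.keys(), key=len, reverse=True)
    else:
        keys = list(CONFIG_MAPPING.keys())  # insertion order: "bert" comes first
    for pattern in keys:
        if pattern in str(name_or_path):
            return CONFIG_MAPPING[pattern]
    return None


print(match_by_pattern("roberta-base"))                       # RobertaConfig
print(match_by_pattern("roberta-base", longest_first=False))  # BertConfig
print(match_by_pattern("./dumptruck"))                        # MptConfig
```

Sorting by length fixes the "roberta" versus "bert" ambiguity, but it does nothing for names like "./dumptruck" that contain a model type by coincidence.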

Member

This is breaking, right? We need to mention this explicitly in https://github.com/huggingface/transformers/releases/tag/v5.2.0, cc @ArthurZucker @LysandreJik

For example, this now fails:

from transformers import AutoModel

model = AutoModel.from_pretrained("prajjwal1/bert-tiny")
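
Since the change checks that `model_type` is provided, the affected checkpoints are presumably those whose config file lacks that entry. A minimal sketch for checking a locally available config file, where `declares_model_type` is a hypothetical helper and the `hidden_size` value is made up for the demo:

```python
import json
import tempfile
from pathlib import Path


def declares_model_type(config_path) -> bool:
    """Return True if the JSON config has a non-empty top-level "model_type" entry."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    return bool(config.get("model_type"))


with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "config.json"

    # A config without model_type, as in older checkpoints.
    config_file.write_text(json.dumps({"hidden_size": 128}), encoding="utf-8")
    print(declares_model_type(config_file))  # False

    # The same config with the field added.
    config_file.write_text(
        json.dumps({"hidden_size": 128, "model_type": "bert"}), encoding="utf-8"
    )
    print(declares_model_type(config_file))  # True
```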

Collaborator

Yeah, for a very small minority of models, but yes. Let's put it up front, sorry.

6 participants