fix: Prevent AutoTokenizer type mismatch from directory name substrin…#43791
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
run-slow: auto |
|
This comment contains models: ["models/auto"] |
|
run-slow: auto |
|
This comment contains models: ["models/auto"] |
ArthurZucker
left a comment
There was a problem hiding this comment.
Aria's failure just seems to be default padding side being different but its probably just something we can change in expected values
6af9567 to
0953aa3
Compare
|
run-slow: aria, auto |
|
This comment contains models: ["models/aria", "models/auto"] |
Rocketknight1
left a comment
There was a problem hiding this comment.
Overall looks great! We can probably remove the changes in test_tokenization_auto.py though, right?
…g matching
When saving a tokenizer to a local directory and reloading it, the tokenizer
type could change to an incorrect class (or fall back to TokenizersBackend)
if the directory name contained a model type substring.
Example:
```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("./dumptruck") # Contains "mpt"
new_tokenizer = AutoTokenizer.from_pretrained("./dumptruck")
type(new_tokenizer) # TokenizersBackend (WRONG! Should be BertTokenizer)
```
This affected any directory name containing model type substrings like:
- "dumptruck" → matched "mpt"
- "gpt2-test" → matched "gpt2"
- "roberta_v2" → matched "roberta"
A cascade failure involving two components:
1. **AutoConfig substring matching**: When loading a config without an explicit
model_type field, AutoConfig would perform substring matching on the path
to infer the model type. This caused false positives for local paths.
2. **AutoTokenizer mismatch handling**: When the (incorrectly) inferred
model_type didn't match the saved tokenizer_class, AutoTokenizer would
fall back to the generic TokenizersBackend instead of trusting the
explicitly saved tokenizer_class.
Only apply substring matching to remote repository identifiers (containing "/"),
not to local directory paths. This prevents false positives while preserving
the intended behavior for remote repos like "org/model-name".
When there's a mismatch between config.model_type and tokenizer_config_class,
prioritize the explicitly saved tokenizer_class (which is always saved during
save_pretrained) instead of immediately falling back to TokenizersBackend.
47f73d8 to
ec5296f
Compare
|
[For maintainers] Suggested jobs to run (before merge) run-slow: auto |
huggingface#43791) When saving a tokenizer to a local directory and reloading it, the tokenizer type could change to an incorrect class (or fall back to TokenizersBackend) if the directory name contained a model type substring. We're removing this fallback behavior and explicitly check that `model_type` is provided
| else: | ||
| # Fallback: use pattern matching on the string. | ||
| # We go from longer names to shorter names to catch roberta before bert (for instance) | ||
| for pattern in sorted(CONFIG_MAPPING.keys(), key=len, reverse=True): | ||
| if pattern in str(pretrained_model_name_or_path): | ||
| return CONFIG_MAPPING[pattern].from_dict(config_dict, **unused_kwargs) |
There was a problem hiding this comment.
This is breaking, right? We need to mention this explicitly in https://github.com/huggingface/transformers/releases/tag/v5.2.0, cc @ArthurZucker @LysandreJik
For example, this now fails:
from transformers import AutoModel
model = AutoModel.from_pretrained("prajjwal1/bert-tiny")There was a problem hiding this comment.
yeah, for a super small minority of models but yeah. Let's put it in front sorry

When saving a tokenizer to a local directory and reloading it, the tokenizer type could change to an incorrect class (or fall back to TokenizersBackend) if the directory name contained a model type substring.
Example:
This affected any directory name containing model type substrings like:
A cascade failure involving two components:
AutoConfig substring matching: When loading a config without an explicit model_type field, AutoConfig would perform substring matching on the path to infer the model type. This caused false positives for local paths.
AutoTokenizer mismatch handling: When the (incorrectly) inferred model_type didn't match the saved tokenizer_class, AutoTokenizer would fall back to the generic TokenizersBackend instead of trusting the explicitly saved tokenizer_class.