Disentangle auto modules from other modeling files #13023
Conversation
_import_structure["models.marian"].append("MarianTokenizer")
_import_structure["models.mbart"].append("MBartTokenizer")
_import_structure["models.mbart"].append("MBart50Tokenizer")
_import_structure["models.mbart50"].append("MBart50Tokenizer")
It is cleaner to have the mBART-50 tokenizers in their own folder.
cpm,
ctrl,
deberta,
deberta_v2,
Lots of modules were missing here.
from_config_docstring = from_config_docstring.replace("checkpoint_placeholder", checkpoint_for_example)
from_config.__doc__ = from_config_docstring
from_config = replace_list_option_in_docstrings(model_mapping, use_model_types=False)(from_config)
from_config = replace_list_option_in_docstrings(model_mapping._model_mapping, use_model_types=False)(from_config)
The internal attribute _model_mapping contains the mapping from model type to model class name. We use it to avoid importing all models when generating the docstring (which would defeat the purpose of this PR).
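As a rough illustration (hypothetical names, not the exact code in this PR), the docstring options can be generated from the string mapping alone, so no model class is imported:

```python
# Hypothetical sketch: build the "options" part of a docstring from class-name
# strings only, so no model module has to be imported at this point.
MODEL_MAPPING_NAMES = {"bert": "BertModel", "gpt2": "GPT2Model"}  # illustrative subset

def _list_options(mapping_names):
    return "\n".join(
        f"    - **{model_type}** -- :class:`~transformers.{class_name}`"
        for model_type, class_name in mapping_names.items()
    )
```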
# Some of the mappings have entries model_type -> object of another model type. In that case we try to grab the
# object at the top level.
transformers_module = importlib.import_module("transformers")
return getattribute_from_module(transformers_module, attr)
This part is mainly there to support use cases like ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast")).
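A simplified sketch of that fallback, closely following the lines shown in the diff above (not necessarily the exact implementation): if a class name is not found in its "own" model module, look it up at the top level of transformers.

```python
import importlib

def getattribute_from_module(module, attr):
    # A tuple of names (e.g. slow and fast tokenizer) is resolved element by element.
    if isinstance(attr, tuple):
        return tuple(getattribute_from_module(module, a) for a in attr)
    if hasattr(module, attr):
        return getattr(module, attr)
    # Some mappings have entries model_type -> object of another model type
    # (e.g. "ibert" -> RobertaTokenizer), so fall back to the top-level package.
    transformers_module = importlib.import_module("transformers")
    if module is not transformers_module:
        return getattribute_from_module(transformers_module, attr)
    raise ValueError(f"Could not find {attr} in transformers")
```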
# we are loading we see if we can infer it from the type of the configuration file
from .models.auto.configuration_auto import CONFIG_MAPPING  # tests_ignore
from .models.auto.tokenization_auto import TOKENIZER_MAPPING  # tests_ignore
from .models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES  # tests_ignore
The changes here use the name mappings (since we only want the names of the classes), which map model_type to tokenizer class names.
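For illustration (a made-up excerpt, not the full mapping), such a name mapping looks roughly like this: model_type keys and (slow, fast) tokenizer class names kept as plain strings.

```python
from collections import OrderedDict

# Illustrative excerpt only; each entry is (slow tokenizer name, fast tokenizer name).
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("bert", ("BertTokenizer", "BertTokenizerFast")),
        ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast")),
    ]
)
```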
)


# For tokenizers which are not directly mapped from a config
NO_CONFIG_TOKENIZER = [
The changes allow us to remove that list entirely, since we use a map from model_type to tokenizers. This way, all tokenizers can be in it, even if they don't have a config.
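A simplified sketch of the lookup that makes the list unnecessary (the real function may handle more cases): the class name is found by scanning the name mapping, and only the module it belongs to is imported.

```python
import importlib

from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES

def tokenizer_class_from_name(class_name):
    # Scan the name mapping instead of consulting a separate NO_CONFIG_TOKENIZER list.
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers:
            module = importlib.import_module(f".{module_name}", "transformers.models")
            return getattr(module, class_name)
    return None
```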
return getattribute_from_module(transformers_module, attr)


class LazyAutoMapping(OrderedDict):
(nit) maybe a short docstring like we have for _LazyModule?
and maybe also make it private _LazyAutoMapping for consistency?
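To illustrate both suggestions at once (a hypothetical, heavily simplified version, not the actual code in the PR): a private class with a short docstring that stores only strings and imports a model class the first time its entry is accessed.

```python
import importlib
from collections import OrderedDict


class _LazyAutoMapping(OrderedDict):
    """
    A mapping config class -> model class that only imports the model classes
    when a key is accessed.
    """

    def __init__(self, config_mapping_names, model_mapping_names):
        self._config_mapping = config_mapping_names  # model_type -> config class name
        self._model_mapping = model_mapping_names    # model_type -> model class name

    def __getitem__(self, key):
        # Find the model type whose config class matches the key (a config class),
        # then import the corresponding model class from its own module only now.
        model_type = next(
            mt for mt, name in self._config_mapping.items() if name == key.__name__
        )
        module = importlib.import_module(f".{model_type}", "transformers.models")
        return getattr(module, self._model_mapping[model_type])
```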
CONFIG_MAPPING = LazyConfigMapping(CONFIG_MAPPING_NAMES)


class LazyLoadAllMappings(OrderedDict):
also maybe private class + docstring?
ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = LazyLoadAllMappings(CONFIG_ARCHIVE_MAP_MAPPING_NAMES)


def _get_class_name(model_class):
Has the input type changed here from class to str? Maybe add type hinting?
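A sketch of what the suggestion could look like (hypothetical signature; the body shown here may differ from the PR):

```python
from typing import List, Union

def _get_class_name(model_class: Union[str, List[str]]) -> str:
    # The mapping now stores class names as strings (or lists of strings), not classes.
    if isinstance(model_class, (list, tuple)):
        return " or ".join(f":class:`~transformers.{c}`" for c in model_class if c is not None)
    return f":class:`~transformers.{model_class}`"
```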
patrickvonplaten
left a comment
Thanks a lot for the clean-up!
Left some nits
LysandreJik
left a comment
Great, very welcome change! Thanks for working on that behemoth of a PR, @sgugger.
warnings.warn(
    "ALL_PRETRAINED_CONFIG_ARCHIVE_MAP is deprecated and will be removed in v5 of Transformers. "
    "It does not contain all available model checkpoints, far from it. Checkout hf.co/models for that.",
    FutureWarning,
)
Great!
if class_name in tokenizers:
    break

module = importlib.import_module(f".{module_name}", "transformers.models")
This breaks for some models with a - in their name, e.g. xlm-roberta.
For example:
Traceback (most recent call last):
File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_mlm.py", line 550, in <module>
main()
File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_mlm.py", line 337, in main
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 432, in from_pretrained
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 226, in tokenizer_class_from_name
module = importlib.import_module(f".{module_name}", "transformers.models")
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers.models.xlm-roberta'
As you can see, it tries to import "transformers.models.xlm-roberta". To reproduce:
PYTHONPATH=src python examples/pytorch/language-modeling/run_mlm.py --train_file tests/fixtures/sample_text.txt --model_name_or_path hf-internal-testing/tiny-xlm-roberta --do_train --max_train_samples 4 --per_device_train_batch_size 2 --num_train_epochs 1 --fp16 --report_to none --overwrite_output_dir --output_dir output_dir --save_steps 1
# module_name, tokenizers debug print:
xlm-roberta ('XLMRobertaTokenizer', 'XLMRobertaTokenizerFast')
Should it do:
module = importlib.import_module(f".{module_name.replace('-', '_')}", "transformers.models")
Oddly enough, I don't get this problem if I run xlm-roberta-base, so this is an edge case.
We should probably include this tiny model as a test, since it triggers this bug.
I detected it because 2 DeepSpeed tests broke.
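A sketch of the regression test suggested above (hypothetical test name; assumes the hf-internal-testing/tiny-xlm-roberta checkpoint stays available):

```python
from transformers import AutoTokenizer

def test_auto_tokenizer_model_type_with_dash():
    # "xlm-roberta" contains a dash, so this exercises the module-name normalization.
    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-xlm-roberta")
    assert tokenizer is not None
```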
You can see the failure on CI:
Proposed fix: #13251
What does this PR do?
This PR cleans up the auto modules to have them rely on string mappings and dynamically import the models when they are needed, instead of having a hard dependency on every modeling file.
There are no breaking changes, as all the MAPPING classes are still present and behave like regular dictionaries, just loading the objects as needed. On the internal tooling side, this allows us to remove the script that was extracting the names of the auto-mappings (since we now have them directly) and the file that stored them.
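As a quick illustration of the behavior described above (a usage sketch; the lazy import is an implementation detail and may change), the mappings still act like regular dictionaries, but accessing an entry is what triggers the import of the corresponding model class:

```python
from transformers import BertConfig
from transformers.models.auto.modeling_auto import MODEL_MAPPING

# Importing MODEL_MAPPING does not pull in any modeling file; the BertModel class
# is only imported when its entry is looked up.
model_class = MODEL_MAPPING[BertConfig]
print(model_class.__name__)  # "BertModel"
```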