
Conversation

@sgugger (Collaborator) commented Aug 6, 2021

What does this PR do?

This PR cleans up the auto modules to have them rely on string mappings and dynamically import the models only when they are needed, instead of having a hard dependency on every modeling file.

There are no breaking changes, as all the MAPPING classes are still present and behave like regular dictionaries, just loading the objects lazily as needed. On the internal tooling side, this allows us to remove the script that extracted the names of the auto mappings (since we now have them directly) and the file that stored them.
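A minimal sketch of the lazy-mapping idea described above, using stdlib modules (`json`, `textwrap`) as stand-ins for model packages; all names here are illustrative, not the actual transformers implementation:

```python
import importlib
from collections import OrderedDict

# Stand-in name mapping: only strings, so nothing heavy is imported up front.
# In the PR this role is played by mappings like CONFIG_MAPPING_NAMES.
MAPPING_NAMES = OrderedDict([
    ("json", "JSONDecoder"),
    ("textwrap", "TextWrapper"),
])

class LazyMapping(OrderedDict):
    """Dict-like mapping that imports the target module only on first access."""

    def __init__(self, names):
        super().__init__()
        self._names = names
        self._cache = {}

    def __getitem__(self, key):
        if key not in self._cache:
            class_name = self._names[key]
            # The import is deferred until this key is actually requested.
            module = importlib.import_module(key)
            self._cache[key] = getattr(module, class_name)
        return self._cache[key]

    def keys(self):
        return self._names.keys()

decoder_cls = LazyMapping(MAPPING_NAMES)["json"]  # imports json only here
```

Behaving like a regular dictionary while resolving values on demand is what lets the MAPPING constants stay public without re-creating a hard dependency on every modeling file.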

_import_structure["models.marian"].append("MarianTokenizer")
_import_structure["models.mbart"].append("MBartTokenizer")
_import_structure["models.mbart"].append("MBart50Tokenizer")
_import_structure["models.mbart50"].append("MBart50Tokenizer")
@sgugger (Collaborator, Author):

It is cleaner to have the mBART-50 tokenizers in their own folder.

cpm,
ctrl,
deberta,
deberta_v2,
@sgugger (Collaborator, Author):

Lots of modules were missing here.

from_config_docstring = from_config_docstring.replace("checkpoint_placeholder", checkpoint_for_example)
from_config.__doc__ = from_config_docstring
from_config = replace_list_option_in_docstrings(model_mapping, use_model_types=False)(from_config)
from_config = replace_list_option_in_docstrings(model_mapping._model_mapping, use_model_types=False)(from_config)
@sgugger (Collaborator, Author):

The internal attribute _model_mapping contains the mapping from model type to model class name. We use it to avoid importing all the models when generating the docstring (which would defeat the purpose of this PR).
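A hedged sketch of the idea: documentation text can be rendered from the class-name strings alone, so no model module needs to be imported just to build a docstring. The mapping content and helper name below are illustrative:

```python
# Hypothetical name mapping: model_type -> model class name (strings only).
MODEL_MAPPING_NAMES = {
    "bert": "BertModel",
    "gpt2": "GPT2Model",
}

def list_options(name_mapping):
    """Render a docstring option list from name strings; no model import happens."""
    lines = [
        f"- **{model_type}** -- :class:`~transformers.{cls}`"
        for model_type, cls in sorted(name_mapping.items())
    ]
    return "\n".join(lines)

doc_listing = list_options(MODEL_MAPPING_NAMES)
```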

Comment on lines +459 to +462
# Some of the mappings have entries model_type -> object of another model type. In that case we try to grab the
# object at the top level.
transformers_module = importlib.import_module("transformers")
return getattribute_from_module(transformers_module, attr)
@sgugger (Collaborator, Author):

This part is mainly there to support use cases like `("ibert", ("RobertaTokenizer", "RobertaTokenizerFast"))`, where a model type points to classes defined under another model's module.
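A rough, self-contained sketch of that fallback behavior, demonstrated with stdlib modules (`collections.abc` as the submodule, `collections` as the package root); the real transformers function resolves against the top-level `transformers` module instead:

```python
import importlib

def getattribute_from_module(module, attr):
    """Fetch ``attr`` from ``module``, falling back to the package root when
    the attribute is defined in another submodule (sketch, assumed names)."""
    if attr is None:
        return None
    if isinstance(attr, tuple):
        return tuple(getattribute_from_module(module, a) for a in attr)
    if hasattr(module, attr):
        return getattr(module, attr)
    # The "ibert" -> RobertaTokenizer case: the class lives under another
    # model's module, so look it up on the top-level package instead.
    root = importlib.import_module(module.__name__.split(".")[0])
    return getattr(root, attr)

abc_module = importlib.import_module("collections.abc")
# "OrderedDict" is not defined in collections.abc, so the fallback resolves
# it from the top-level collections package.
cls = getattribute_from_module(abc_module, "OrderedDict")
```

The tuple branch mirrors the `(slow, fast)` tokenizer pairs that the auto mappings store.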

# we are loading we see if we can infer it from the type of the configuration file
from .models.auto.configuration_auto import CONFIG_MAPPING # tests_ignore
from .models.auto.tokenization_auto import TOKENIZER_MAPPING # tests_ignore
from .models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES # tests_ignore
@sgugger (Collaborator, Author):

The changes here use the name mappings (since we only want the class names), which map model_type to tokenizer class names.

)

# For tokenizers which are not directly mapped from a config
NO_CONFIG_TOKENIZER = [
@sgugger (Collaborator, Author):

These changes let us remove that list entirely, since we now use a map from model_type to tokenizers. This way, all tokenizers can be in it, even if they don't have a config.
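A hypothetical sketch of why the list becomes unnecessary: keying the map by model_type rather than by config class means config-less tokenizers fit in the same structure. The entries and helper below are illustrative, not the actual mapping contents:

```python
# Sketch: model_type -> (slow, fast) pair of tokenizer class names.
TOKENIZER_MAPPING_NAMES = {
    "bert": ("BertTokenizer", "BertTokenizerFast"),
    "bertweet": ("BertweetTokenizer", None),  # slow-only, no dedicated config
}

def tokenizer_class_names(model_type):
    """Return the available tokenizer class names for a model type."""
    slow, fast = TOKENIZER_MAPPING_NAMES[model_type]
    return [name for name in (slow, fast) if name is not None]
```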

return getattribute_from_module(transformers_module, attr)


class LazyAutoMapping(OrderedDict):
Contributor:

(nit) maybe a short docstring like we have for _LazyModule?

Contributor:

and maybe also make it private _LazyAutoMapping for consistency?

CONFIG_MAPPING = LazyConfigMapping(CONFIG_MAPPING_NAMES)


class LazyLoadAllMappings(OrderedDict):
Contributor:

also maybe private class + docstring?

ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = LazyLoadAllMappings(CONFIG_ARCHIVE_MAP_MAPPING_NAMES)


def _get_class_name(model_class):
Contributor:

Has the input type changed here from class to str? Maybe add type hinting?

@patrickvonplaten (Contributor) left a comment:

Thanks a lot for the clean-up!

Left some nits

@LysandreJik (Member) left a comment:

Great, very welcome change! Thanks for working on that behemoth of a PR, @sgugger.

Comment on lines +293 to +297
warnings.warn(
"ALL_PRETRAINED_CONFIG_ARCHIVE_MAP is deprecated and will be removed in v5 of Transformers. "
"It does not contain all available model checkpoints, far from it. Checkout hf.co/models for that.",
FutureWarning,
)
Member:

Great!

@sgugger sgugger merged commit 9870093 into master Aug 6, 2021
@sgugger sgugger deleted the disentangle_auto branch August 6, 2021 11:12
@patrickvonplaten patrickvonplaten changed the title [WIP] Disentangle auto modules from other modeling files Disentangle auto modules from other modeling files Aug 6, 2021
if class_name in tokenizers:
break

module = importlib.import_module(f".{module_name}", "transformers.models")
@stas00 (Contributor) commented Aug 25, 2021:

This breaks for some models with `-` in their name, e.g. xlm-roberta.

For example:


Traceback (most recent call last):
  File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_mlm.py", line 550, in <module>
    main()
  File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_mlm.py", line 337, in main
    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 432, in from_pretrained
    tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 226, in tokenizer_class_from_name
    module = importlib.import_module(f".{module_name}", "transformers.models")
  File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers.models.xlm-roberta'

As you can see, it tries to import `transformers.models.xlm-roberta`. To reproduce:

PYTHONPATH=src python examples/pytorch/language-modeling/run_mlm.py --train_file tests/fixtures/sample_text.txt --model_name_or_path hf-internal-testing/tiny-xlm-roberta --do_train --max_train_samples 4 --per_device_train_batch_size 2 --num_train_epochs 1 --fp16 --report_to none --overwrite_output_dir --output_dir output_dir --save_steps 1
# module_name, tokenizers debug print:
xlm-roberta ('XLMRobertaTokenizer', 'XLMRobertaTokenizerFast')

Should it do:

 module = importlib.import_module(f".{module_name.replace('-', '_')}", "transformers.models")

Oddly enough I don't get this problem if I run xlm-roberta-base, so this is an edge case.

We should probably include this tiny model in the tests, since it triggers this bug.

I detected it because two DeepSpeed tests broke.

@LysandreJik, @sgugger
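The proposed `replace('-', '_')` fix can be sketched as a small normalization helper; the function name below is an assumption for illustration, not the name used in the eventual fix:

```python
def model_type_to_module_name(model_type):
    """Sketch of the proposed fix: model types such as "xlm-roberta" live in
    package directories named with underscores ("xlm_roberta")."""
    return model_type.replace("-", "_")

# A hyphen is invalid in a dotted import path, so any hyphenated model_type
# key must be normalized before it is handed to importlib.import_module.
module_name = model_type_to_module_name("xlm-roberta")
```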


Contributor:

Proposed fix: #13251
