Disentangle auto modules from other modeling files #13023
Conversation
_import_structure["models.marian"].append("MarianTokenizer")
_import_structure["models.mbart"].append("MBartTokenizer")
_import_structure["models.mbart"].append("MBart50Tokenizer")
_import_structure["models.mbart50"].append("MBart50Tokenizer")
It is cleaner to have the mBART-50 tokenizers in their own folder.
cpm,
ctrl,
deberta,
deberta_v2,
Lots of modules were missing here.
from_config_docstring = from_config_docstring.replace("checkpoint_placeholder", checkpoint_for_example)
from_config.__doc__ = from_config_docstring
from_config = replace_list_option_in_docstrings(model_mapping, use_model_types=False)(from_config)
from_config = replace_list_option_in_docstrings(model_mapping._model_mapping, use_model_types=False)(from_config)
The internal attribute _model_mapping contains the mapping from model type to model class name. We use it to avoid importing all models when generating the docstring (which would defeat the purpose of this PR).
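As a rough illustration (hypothetical names, not the exact code in this PR), the docstring options can be generated from the string mapping alone, so no model class is imported:

```python
# Hypothetical sketch: build the "options" part of a docstring from class-name
# strings only, so no model module has to be imported at this point.
MODEL_MAPPING_NAMES = {"bert": "BertModel", "gpt2": "GPT2Model"}  # illustrative subset

def _list_options(mapping_names):
    return "\n".join(
        f"    - **{model_type}** -- :class:`~transformers.{class_name}`"
        for model_type, class_name in mapping_names.items()
    )
```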
# Some of the mappings have entries model_type -> object of another model type. In that case we try to grab the
# object at the top level.
transformers_module = importlib.import_module("transformers")
return getattribute_from_module(transformers_module, attr)
This part is mainly there to support use cases like ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast")).
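A simplified sketch of that fallback, closely following the lines shown in the diff above (not necessarily the exact implementation): if a class name is not found in its "own" model module, look it up at the top level of transformers.

```python
import importlib

def getattribute_from_module(module, attr):
    # A tuple of names (e.g. slow and fast tokenizer) is resolved element by element.
    if isinstance(attr, tuple):
        return tuple(getattribute_from_module(module, a) for a in attr)
    if hasattr(module, attr):
        return getattr(module, attr)
    # Some mappings have entries model_type -> object of another model type
    # (e.g. "ibert" -> RobertaTokenizer), so fall back to the top-level package.
    transformers_module = importlib.import_module("transformers")
    if module is not transformers_module:
        return getattribute_from_module(transformers_module, attr)
    raise ValueError(f"Could not find {attr} in transformers")
```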
# we are loading we see if we can infer it from the type of the configuration file
from .models.auto.configuration_auto import CONFIG_MAPPING  # tests_ignore
from .models.auto.tokenization_auto import TOKENIZER_MAPPING  # tests_ignore
from .models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES  # tests_ignore
The changes here use the name mappings (since we only want the names of the classes), which map model_type to tokenizer class names.
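For illustration (a made-up excerpt, not the full mapping), such a name mapping looks roughly like this: model_type keys and (slow, fast) tokenizer class names kept as plain strings.

```python
from collections import OrderedDict

# Illustrative excerpt only; each entry is (slow tokenizer name, fast tokenizer name).
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        ("bert", ("BertTokenizer", "BertTokenizerFast")),
        ("ibert", ("RobertaTokenizer", "RobertaTokenizerFast")),
    ]
)
```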
)


# For tokenizers which are not directly mapped from a config
NO_CONFIG_TOKENIZER = [
The changes allow us to remove that list entirely, since we use a map from model_type to tokenizers. This way, all tokenizers can be in it, even if they don't have a config.
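A simplified sketch of the lookup that makes the list unnecessary (the real function may handle more cases): the class name is found by scanning the name mapping, and only the module it belongs to is imported.

```python
import importlib

from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES

def tokenizer_class_from_name(class_name):
    # Scan the name mapping instead of consulting a separate NO_CONFIG_TOKENIZER list.
    for module_name, tokenizers in TOKENIZER_MAPPING_NAMES.items():
        if class_name in tokenizers:
            module = importlib.import_module(f".{module_name}", "transformers.models")
            return getattr(module, class_name)
    return None
```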
return getattribute_from_module(transformers_module, attr)


class LazyAutoMapping(OrderedDict):
(nit) maybe a short docstring like we have for _LazyModule?
and maybe also make it private _LazyAutoMapping for consistency?
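To illustrate both suggestions at once (a hypothetical, heavily simplified version, not the actual code in the PR): a private class with a short docstring that stores only strings and imports a model class the first time its entry is accessed.

```python
import importlib
from collections import OrderedDict


class _LazyAutoMapping(OrderedDict):
    """
    A mapping config class -> model class that only imports the model classes
    when a key is accessed.
    """

    def __init__(self, config_mapping_names, model_mapping_names):
        self._config_mapping = config_mapping_names  # model_type -> config class name
        self._model_mapping = model_mapping_names    # model_type -> model class name

    def __getitem__(self, key):
        # Find the model type whose config class matches the key (a config class),
        # then import the corresponding model class from its own module only now.
        model_type = next(
            mt for mt, name in self._config_mapping.items() if name == key.__name__
        )
        module = importlib.import_module(f".{model_type}", "transformers.models")
        return getattr(module, self._model_mapping[model_type])
```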
CONFIG_MAPPING = LazyConfigMapping(CONFIG_MAPPING_NAMES)


class LazyLoadAllMappings(OrderedDict):
also maybe private class + docstring?
ALL_PRETRAINED_CONFIG_ARCHIVE_MAP = LazyLoadAllMappings(CONFIG_ARCHIVE_MAP_MAPPING_NAMES)


def _get_class_name(model_class):
Has the input type changed here from class to str? Maybe add type hinting?
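A sketch of what the suggestion could look like (hypothetical signature; the body shown here may differ from the PR):

```python
from typing import List, Union

def _get_class_name(model_class: Union[str, List[str]]) -> str:
    # The mapping now stores class names as strings (or lists of strings), not classes.
    if isinstance(model_class, (list, tuple)):
        return " or ".join(f":class:`~transformers.{c}`" for c in model_class if c is not None)
    return f":class:`~transformers.{model_class}`"
```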
patrickvonplaten
left a comment
Thanks a lot for the clean-up!
Left some nits
LysandreJik
left a comment
Great, very welcome change! Thanks for working on that behemoth of a PR, @sgugger.
warnings.warn(
    "ALL_PRETRAINED_CONFIG_ARCHIVE_MAP is deprecated and will be removed in v5 of Transformers. "
    "It does not contain all available model checkpoints, far from it. Checkout hf.co/models for that.",
    FutureWarning,
)
Great!
if class_name in tokenizers:
    break

module = importlib.import_module(f".{module_name}", "transformers.models")
This breaks for some models with a - in their name, e.g. xlm-roberta.
For example:
Traceback (most recent call last):
File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_mlm.py", line 550, in <module>
main()
File "/mnt/nvme1/code/huggingface/transformers-master/examples/pytorch/language-modeling/run_mlm.py", line 337, in main
tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 432, in from_pretrained
tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/auto/tokenization_auto.py", line 226, in tokenizer_class_from_name
module = importlib.import_module(f".{module_name}", "transformers.models")
File "/home/stas/anaconda3/envs/py38-pt19/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'transformers.models.xlm-roberta'
As you can see, it tries to import "transformers.models.xlm-roberta". To reproduce:
PYTHONPATH=src python examples/pytorch/language-modeling/run_mlm.py --train_file tests/fixtures/sample_text.txt --model_name_or_path hf-internal-testing/tiny-xlm-roberta --do_train --max_train_samples 4 --per_device_train_batch_size 2 --num_train_epochs 1 --fp16 --report_to none --overwrite_output_dir --output_dir output_dir --save_steps 1
# module_name, tokenizers debug print:
xlm-roberta ('XLMRobertaTokenizer', 'XLMRobertaTokenizerFast')
Should it do:
module = importlib.import_module(f".{module_name.replace('-', '_')}", "transformers.models")
Oddly enough, I don't get this problem if I run xlm-roberta-base, so this is an edge case.
We should probably include this tiny model as a test, since it triggers this bug.
I detected it because 2 DeepSpeed tests broke.
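A sketch of the regression test suggested above (hypothetical test name; assumes the hf-internal-testing/tiny-xlm-roberta checkpoint stays available):

```python
from transformers import AutoTokenizer

def test_auto_tokenizer_model_type_with_dash():
    # "xlm-roberta" contains a dash, so this exercises the module-name normalization.
    tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/tiny-xlm-roberta")
    assert tokenizer is not None
```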
You can see the failure on CI:
Proposed fix: #13251
What does this PR do?
This PR cleans up the auto modules to have them rely on string mappings and dynamically import the models when they are needed, instead of having a hard dependency on every modeling file.
There are no breaking changes, as all the MAPPING classes are still present and behave like regular dictionaries, just loading the objects as needed. On the internal tooling side, this allows us to remove the script that was extracting the names of the auto-mappings (since we now have them directly) and the file that stored them.
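As a quick illustration of the behavior described above (a usage sketch; the lazy import is an implementation detail and may change), the mappings still act like regular dictionaries, but accessing an entry is what triggers the import of the corresponding model class:

```python
from transformers import BertConfig
from transformers.models.auto.modeling_auto import MODEL_MAPPING

# Importing MODEL_MAPPING does not pull in any modeling file; the BertModel class
# is only imported when its entry is looked up.
model_class = MODEL_MAPPING[BertConfig]
print(model_class.__name__)  # "BertModel"
```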