Conversation

@sgugger (Collaborator) commented Oct 27, 2020

What does this PR do?

This PR adds an example of fine-tuning a causal language model (or training one from scratch) using the 🤗 Datasets library. It supports loading a dataset either by its name (from the hub) or from local files. A test that trains on a small text file is added.
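
For illustration, an invocation along these lines should work (gpt2 and wikitext are placeholder choices here; the flags follow the usual example-script arguments):

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --output_dir /tmp/test-clm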

@thomwolf (Member) left a comment:

Very nice!

Just a few comments and suggestions.

from transformers.trainer_utils import is_main_process


logger = logging.getLogger(__name__)

Member:

Should we use the transformers library's logging? (cc @LysandreJik)

Collaborator Author:

No, for the script, we should use the regular one. @LysandreJik had a very long explanation of why, which I don't remember.

Member:

Here it is.

The gist of it is that, IMO, the transformers logging utility should only be used to control the logging of the transformers module, not of the user's script directly, as it is not made for that and would lead to very weird behavior.

In my opinion, the logging setup in a user script should include both:

import logging
from transformers import logging as hf_logging

hf_logging.set_verbosity_xxx()
logger = logging.getLogger(__name__)

# then do stuff with the logger without worrying about the HF logging which has already been managed before
logger.warning("xxx")
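
To make that concrete, here is a minimal self-contained sketch (the basicConfig call and the message strings are illustrative, not from the PR):

import logging

from transformers import logging as hf_logging

# Script-side logging: the standard library handles the script's own messages.
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    level=logging.INFO,
)
logger = logging.getLogger(__name__)

# Library-side logging: the transformers utility only controls messages
# emitted by the transformers module itself.
hf_logging.set_verbosity_info()

logger.info("script-level message, handled by the stdlib logger")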

def tokenize_function(examples):
    return tokenizer(examples[text_column_name])

tokenized_datasets = datasets.map(

Member:

In the two calls to map (here and below) it could be nice to add a reference to multi-processing with num_proc
(and maybe a link to the doc: https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map)
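
For example (a sketch; num_proc=4 and remove_columns are illustrative choices, not from the PR):

tokenized_datasets = datasets.map(
    tokenize_function,
    batched=True,
    num_proc=4,  # illustrative: tokenize with 4 worker processes in parallel
    remove_columns=[text_column_name],
)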

Collaborator Author:

We could do the same thing in the run_glue script too, in passing.

@LysandreJik (Member) left a comment:

Great, LGTM!

Comment on lines 2 to 3
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.

Member:

Do we need the copyright to Google AI and NVIDIA? Are there some snippets taken from their codebases?

Collaborator Author:

No, it's a bad copy-paste.

Comment on lines 16 to 18
"""
Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL) on a text file or a dataset.
"""

Member:

From experience, users will understand that only GPT, GPT-2 and CTRL are supported by that script. I would put (GPT, GPT-2, CTRL, ...) instead, and provide a link:

Suggested change:
- """
- Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL) on a text file or a dataset.
- """
+ """
+ Fine-tuning the library models for causal language modeling (GPT, GPT-2, CTRL, ...) on a text file or a dataset.
+ Find the full list of model architectures that can be fine-tuned by this script on the documentation:
+ https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModelWithLMHead
+ """

But that might be a bit too much. Maybe adding a README would be simpler.

Collaborator Author:

Not AutoModelWithLMHead, just AutoModelForCausalLM, but I can add that.

Member:

Maybe the link is overkill; I just had an issue with (GPT, GPT-2, CTRL), which seems to imply that only those three models are supported.

Member:

This works too, but it shows checkpoints, whereas this script can also train from scratch, so showing architectures would probably be better.

Collaborator Author:

No, this link shows all kinds of LMs. The script will only work with a model that can be loaded with AutoModelForCausalLM (since it uses that class).
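
For illustration, the two loading paths (gpt2 is just a placeholder name):

from transformers import AutoConfig, AutoModelForCausalLM

# Fine-tuning: load pretrained weights from a checkpoint.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Training from scratch: build the architecture from a config, with freshly
# initialized weights.
config = AutoConfig.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_config(config)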

Member:

(the other one, AutoModelWithLMHead, is the deprecated one; will remove soon)

Collaborator Author:

Great!


Comment on lines +157 to +164
logger.warning(
    f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
    + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
    transformers.utils.logging.set_verbosity_info()
logger.info("Training/evaluation parameters %s", training_args)

Member:

That's exactly what I'm talking about :)

if data_args.block_size <= 0:
    block_size = tokenizer.max_len
else:
    block_size = min(data_args.block_size, tokenizer.max_len)

Member:

Should we print a warning here to tell the user their block_size isn't going to be used if it's larger than the tokenizer's max length?

Collaborator Author:

I can add that.
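
Something along these lines, perhaps (a sketch; the warning text is mine):

if data_args.block_size <= 0:
    block_size = tokenizer.max_len
else:
    if data_args.block_size > tokenizer.max_len:
        logger.warning(
            f"The block_size passed ({data_args.block_size}) is larger than the maximum length "
            f"supported by the model ({tokenizer.max_len}). Using block_size={tokenizer.max_len}."
        )
    block_size = min(data_args.block_size, tokenizer.max_len)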

@LysandreJik (Member) left a comment:

LGTM

@sgugger sgugger merged commit 47dfa65 into master Oct 28, 2020
@sgugger sgugger deleted the run_clm_script branch October 28, 2020 14:39