Conversation

@patrickvonplaten
Contributor

Given the discussion in PR: #3433, we want to make the serialized model config more readable.

Problem:

E.g. bert-base-cased has the following config on S3:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

(which is readable imo). But once it is saved, all the default params are saved as well (which is unnecessary), so it then looks like this:

{
  "_num_labels": 2,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "min_length": 0,
  "model_type": "bert",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 28996
} 

Solution:

We should only save the difference between the actual config and either v1) the model class's config or v2) the PretrainedConfig() defaults (which contain most of the unnecessary default params).

This PR implements either v1) or v2) - up for discussion!

v1) for bert-base-cased would look like this:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "vocab_size": 28996
}

v2) for bert-base-cased would look like this:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
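
For reference, here is a minimal sketch of how the v2 diff could be computed. The helper config_diff below is illustrative only (it is not the code in this PR); it simply compares a config's serialized attributes against a plain PretrainedConfig():

from transformers import BertConfig, PretrainedConfig


def config_diff(config):
    """Keep only the attributes that differ from the PretrainedConfig() defaults (the v2 idea)."""
    full = config.to_dict()
    defaults = PretrainedConfig().to_dict()
    return {key: value for key, value in full.items()
            if key not in defaults or value != defaults[key]}


config = BertConfig.from_pretrained("bert-base-cased")
print(config_diff(config))  # roughly the v2 JSON shown above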

@patrickvonplaten
Contributor Author

I would prefer v2) because the parameters saved in each of the model configuration files are important for the model's behavior, and it would be nice to see them in the config (also to compare them to other models' configs).

@julien-c
Member

Need to look into it more, but in principle this is very nice. (If Configs were dataclass-backed it would be even cleaner to implement – might be too big of a change though)

I agree that v2 is probably better, but will think about it some more.

For the configs hosted on our S3, what should we do? Update the official ones but not the user-uploaded ones? Or just do all of them? :)

@patrickvonplaten
Contributor Author

I would update all of them by downloading them, calling save_pretrained(), and uploading. I already have a very similar script that I would only need to adapt a tiny bit.

@thomwolf thomwolf (Member) left a comment

Ok that's really cool, thanks a lot @patrickvonplaten!
I'm all in for V2!

self.label2id = dict((key, int(value)) for key, value in self.label2id.items())

- def save_pretrained(self, save_directory):
+ def save_pretrained(self, save_directory, use_diff=True):
Member

Need to document the additional argument in the docstring.

@LysandreJik we have a script testing that the code of the docstring examples run fine, right?
Could we maybe one day make a similar script that tests that all the arguments in method signatures with docstrings are mentioned in the associated docstring? I feel like I keep asking people to document their new arguments in the docstring; it would be nice to just have a test that fails if it's not done (cc @julien-c @sshleifer)

(I can't think of a good reason you wouldn't want to document an argument in the docstring)
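
(As a rough sketch of the kind of check being asked for here, and purely illustrative rather than part of this PR: one could compare the parameters in a method's signature against its docstring with the standard inspect module and flag any that are never mentioned.)

import inspect


def undocumented_args(func):
    """Return the signature parameters that are never mentioned in func's docstring."""
    doc = inspect.getdoc(func) or ""
    return [name for name in inspect.signature(func).parameters
            if name not in ("self", "cls", "args", "kwargs") and name not in doc]


# A test could then fail whenever a new argument (e.g. use_diff) is added
# to save_pretrained without updating its docstring:
# assert undocumented_args(PretrainedConfig.save_pretrained) == []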

Member

Indeed, usually the IDE does that, but we could definitely look into enforcing it. I can't think of a reason we would want an argument to not be documented either.

Contributor

I think https://pypi.org/project/flake8-docstrings/ might be what we're looking for; I haven't used it, however.

Member

I feel like this should be the only option (i.e., remove the argument completely). Is there really a case where we want the full JSON serialization of the object?

Or at least, remove it from save_pretrained, keep it only on to_json_string() (and then it can default to True there, maybe it's more correct)

Member

(the user of save_pretrained does not really need to know which exact serialization method we use underneath, IMO)

Member

I don't know either, but if we can keep a backward-compatible option with this flag, why not?

It's kinda hard to think of all the use cases people may have (e.g. loading the configuration in an older pytorch-transformers version, which might now break for them).

return output

- def to_json_string(self):
+ def to_json_string(self, use_diff=False):
Member

Same here, you should document the new argument

@LysandreJik LysandreJik (Member) left a comment

Elegant way of doing that! I'm all for V2 as well.
I agree with Thom that arguments should be documented, but other than that it looks great to me!

@sshleifer
Contributor

Does this impact the process for changing the default config?

@julien-c julien-c (Member) left a comment

Other than my comment, looks great!

@patrickvonplaten
Contributor Author

Sounds good!

@patrickvonplaten
Contributor Author

Ok, v2 is now implemented. I agree with @julien-c that the save_pretrained() method should be kept as clean as possible and I think we can keep backward compatibility (for very special edge cases) by allowing a boolean argument to the to_json_file() method.
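
(For illustration, a toy, self-contained sketch of the serialization chain described above; the class and method bodies are simplified stand-ins and may differ from the exact code that was merged.)

import json
import os


class ConfigSketch:
    """Toy stand-in for PretrainedConfig, illustrating the use_diff plumbing."""

    def __init__(self, vocab_size=30522, hidden_size=768, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        for key, value in kwargs.items():
            setattr(self, key, value)

    def to_dict(self):
        return dict(self.__dict__)

    def to_diff_dict(self):
        # v2: keep only the attributes that differ from a default-constructed config
        defaults = type(self)().to_dict()
        return {key: value for key, value in self.to_dict().items()
                if key not in defaults or defaults[key] != value}

    def to_json_string(self, use_diff=True):
        config_dict = self.to_diff_dict() if use_diff else self.to_dict()
        return json.dumps(config_dict, indent=2, sort_keys=True) + "\n"

    def to_json_file(self, json_file_path, use_diff=True):
        # use_diff=False is the backward-compatible escape hatch (full serialization)
        with open(json_file_path, "w", encoding="utf-8") as writer:
            writer.write(self.to_json_string(use_diff=use_diff))

    def save_pretrained(self, save_directory):
        # save_pretrained stays clean: no extra flag, it always writes the diff
        os.makedirs(save_directory, exist_ok=True)
        self.to_json_file(os.path.join(save_directory, "config.json"))


ConfigSketch(vocab_size=28996).save_pretrained("./bert-cased-sketch")
# ./bert-cased-sketch/config.json now contains only {"vocab_size": 28996}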

@julien-c julien-c merged commit e9d0bc0 into huggingface:master Apr 18, 2020
@julien-c
Member

Awesome, merging this
