Conversation

@patrickvonplaten
Contributor

Given the discussion in PR: #3433, we want to make the serialized model config more readable.

Problem:

E.g. bert-base-cased has the following config on S3:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

(which is readable imo). But once it is saved, all the default params are saved as well (which is unnecessary), so it then looks like this:

{
  "_num_labels": 2,
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "early_stopping": false,
  "eos_token_id": null,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "is_encoder_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "min_length": 0,
  "model_type": "bert",
  "no_repeat_ngram_size": 0,
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 12,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 28996
} 

Solution:

We should only save the difference between the actual config and either v1) the model class's config or v2) the PretrainedConfig() defaults (which contain most of the unnecessary default params).

This PR implements either v1) or v2) - up for discussion!

v1) for bert-base-cased would look like this:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "vocab_size": 28996
}

v2) for bert-base-cased would look like this:

{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 28996
}
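
For reference, here is a minimal sketch of how the v2 diff could be computed. The helper config_diff below is illustrative only (it is not the code in this PR); it simply compares a config's serialized attributes against a plain PretrainedConfig():

from transformers import BertConfig, PretrainedConfig


def config_diff(config):
    """Keep only the attributes that differ from the PretrainedConfig() defaults (the v2 idea)."""
    full = config.to_dict()
    defaults = PretrainedConfig().to_dict()
    return {key: value for key, value in full.items()
            if key not in defaults or value != defaults[key]}


config = BertConfig.from_pretrained("bert-base-cased")
print(config_diff(config))  # roughly the v2 JSON shown above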

@patrickvonplaten
Contributor Author

I would prefer v2) because the parameters saved in each of the model configuration files are important for the model's behavior, and it would be nice to see them in the config (also to compare them to other models' configs).

@julien-c
Member

Need to look into it more, but in principle this is very nice. (If Configs were dataclass-backed it would be even cleaner to implement – might be too big of a change though)

I agree that v2 is probably better, but will think about it some more.

For the configs hosted on our S3, what should we do? Update the official ones but not the user-uploaded ones? Or just do all of them? :)

@patrickvonplaten
Contributor Author

I would update all of them by downloading them, calling save_pretrained(), and uploading. I already have a very similar script that I would only need to adapt a tiny bit.

@thomwolf thomwolf (Member) left a comment

Ok that's really cool, thanks a lot @patrickvonplaten!
I'm all in for V2!

self.label2id = dict((key, int(value)) for key, value in self.label2id.items())

- def save_pretrained(self, save_directory):
+ def save_pretrained(self, save_directory, use_diff=True):
Member

Need to document the additional argument in the docstring.

@LysandreJik we have a script testing that the code of the docstring examples run fine, right?
Could we maybe one day make a similar script that tests that all the arguments in method signatures with docstrings are mentioned in the associated docstring? I feel like I keep asking people to document their new arguments in the docstring; it would be nice to just have a test that fails if it's not done (cc @julien-c @sshleifer)

(I can't think of a good reason you wouldn't want to document an argument in the docstring)
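
(As a rough sketch of the kind of check being asked for here, and purely illustrative rather than part of this PR: one could compare the parameters in a method's signature against its docstring with the standard inspect module and flag any that are never mentioned.)

import inspect


def undocumented_args(func):
    """Return the signature parameters that are never mentioned in func's docstring."""
    doc = inspect.getdoc(func) or ""
    return [name for name in inspect.signature(func).parameters
            if name not in ("self", "cls", "args", "kwargs") and name not in doc]


# A test could then fail whenever a new argument (e.g. use_diff) is added
# to save_pretrained without updating its docstring:
# assert undocumented_args(PretrainedConfig.save_pretrained) == []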

Member

Indeed, usually the IDE does that, but we could definitely look into enforcing it. I can't think of a reason we would want an argument to not be documented either.

Contributor

I think https://pypi.org/project/flake8-docstrings/ might be what we're looking for; I haven't used it, however.

Member

I feel like this should be the only option (i.e., remove the argument completely). Is there really a case where we want the full JSON serialization of the object?

Or at least, remove it from save_pretrained, keep it only on to_json_string() (and then it can default to True there, maybe it's more correct)

Member

(the user of save_pretrained does not really need to know which exact serialization method we use underneath, IMO)

Member

I don't know either, but if we can keep a backward-compatible option with this flag, why not?

It's kinda hard to think of all the use cases people may have (e.g. loading the configuration in an older pytorch-transformers version, which might now break for them).

return output

- def to_json_string(self):
+ def to_json_string(self, use_diff=False):
Member

Same here, you should document the new argument

@LysandreJik LysandreJik (Member) left a comment

Elegant way of doing that! I'm all for V2 as well.
I agree with Thom that arguments should be documented, but other than that it looks great to me!

@sshleifer
Contributor

Does this impact the process for changing the default config?

@julien-c julien-c (Member) left a comment

Other than my comment, looks great!

@patrickvonplaten
Contributor Author

Sounds good!

@patrickvonplaten
Contributor Author

Ok, v2 is now implemented. I agree with @julien-c that the save_pretrained() method should be kept as clean as possible and I think we can keep backward compatibility (for very special edge cases) by allowing a boolean argument to the to_json_file() method.
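
(For illustration, a toy, self-contained sketch of the serialization chain described above; the class and method bodies are simplified stand-ins and may differ from the exact code that was merged.)

import json
import os


class ConfigSketch:
    """Toy stand-in for PretrainedConfig, illustrating the use_diff plumbing."""

    def __init__(self, vocab_size=30522, hidden_size=768, **kwargs):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        for key, value in kwargs.items():
            setattr(self, key, value)

    def to_dict(self):
        return dict(self.__dict__)

    def to_diff_dict(self):
        # v2: keep only the attributes that differ from a default-constructed config
        defaults = type(self)().to_dict()
        return {key: value for key, value in self.to_dict().items()
                if key not in defaults or defaults[key] != value}

    def to_json_string(self, use_diff=True):
        config_dict = self.to_diff_dict() if use_diff else self.to_dict()
        return json.dumps(config_dict, indent=2, sort_keys=True) + "\n"

    def to_json_file(self, json_file_path, use_diff=True):
        # use_diff=False is the backward-compatible escape hatch (full serialization)
        with open(json_file_path, "w", encoding="utf-8") as writer:
            writer.write(self.to_json_string(use_diff=use_diff))

    def save_pretrained(self, save_directory):
        # save_pretrained stays clean: no extra flag, it always writes the diff
        os.makedirs(save_directory, exist_ok=True)
        self.to_json_file(os.path.join(save_directory, "config.json"))


ConfigSketch(vocab_size=28996).save_pretrained("./bert-cased-sketch")
# ./bert-cased-sketch/config.json now contains only {"vocab_size": 28996}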

@julien-c julien-c merged commit e9d0bc0 into huggingface:master Apr 18, 2020
@julien-c
Member

Awesome, merging this
