[PretrainedConfig] Fix save pretrained config for edge case #7943
Conversation
Any reason not to look at just the config class? At a first glance, I'd say we want to compare the defaults to the class we instantiated, not to the superclass.
Back then this was my initial idea as well - but then the configs could be more or less empty if all parameters are the same (see the sketch below). This has a couple of disadvantages.
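As a hypothetical illustration of that point (not the actual implementation): if the diff were computed only against the concrete class's own defaults, any config created without overrides would serialize to an essentially empty dict.

```python
from transformers import BertConfig

config = BertConfig()  # every value equals BertConfig's own defaults

# Diff taken only against the class's own defaults: nothing survives.
class_defaults = BertConfig().to_dict()
diff = {k: v for k, v in config.to_dict().items() if v != class_defaults.get(k)}
print(diff)  # {} -- the saved file would say nothing about the model's architecture
```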
configuration_fsmt.py, FSMTConfig.__init__ (signature change under review):

```diff
-        langs,
-        src_vocab_size,
-        tgt_vocab_size,
+        langs=["en", "de"],
```
Mainly for consistency with other configs in the library, it should be possible to instantiate every config (which is not a "composition" config) without providing any parameters: `config = self.config_cls()`.

I added these init params from: https://huggingface.co/facebook/wmt19-en-de (cc @stas00)
I'd say, in such a case let's use langs=["xx", "yy"] so it's clear it's nonsense. Using meaningful data here is misleading at best.
But it's the actual configuration_fsmt.py file, no? All other configuration files use the actual params of one of the main models as defaults, e.g. BertConfig() uses the params of bert-base-cased. I don't really see why this is misleading... @sgugger @LysandreJik @sshleifer what is your opinion here?
```python
def check_config_can_be_init_without_params(self):
    if self.config_class.is_composition:
        return
    config = self.config_class()
```
This makes sure that every config can be instantiated without providing any parameters.
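To make the interaction with the new `is_composition` class attribute concrete, here is a minimal, hypothetical sketch (the `PairConfig` class and the standalone check function are made up for illustration; only the `is_composition` attribute and the early-return behavior come from this PR, and the snippet assumes a transformers version that includes that attribute):

```python
from transformers import PretrainedConfig

class PairConfig(PretrainedConfig):
    # A "composition" config is built from other configs, so it cannot be
    # instantiated without arguments; the flag lets the common test skip it.
    is_composition = True

    def __init__(self, encoder_config: dict, decoder_config: dict, **kwargs):
        super().__init__(**kwargs)
        self.encoder = PretrainedConfig(**encoder_config)
        self.decoder = PretrainedConfig(**decoder_config)

def check_config_can_be_init_without_params(config_class):
    # Same idea as the test above, written as a free function for illustration.
    if config_class.is_composition:
        return  # composition configs are skipped
    config_class()  # must not raise

check_config_can_be_init_without_params(PairConfig)        # skipped
check_config_can_be_init_without_params(PretrainedConfig)  # instantiates fine
```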
UPDATE: I had to add a class attribute to the config to make this feature work (see description above) - @julien-c @sgugger @thomwolf @LysandreJik - could you check if this is fine for you guys?
sgugger
left a comment
Understood, this LGTM then! Thanks!
LGTM
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Diff context for the defaults under discussion (configuration_fsmt.py):

```diff
-        tgt_vocab_size,
+        langs=["en", "de"],
+        src_vocab_size=42024,
+        tgt_vocab_size=42024,
```
The suggested vocab size defaults make no sense without the corresponding vocab. We might just as well set them to 1000 if defaults are now required. I remember some common tests fail if the vocab size is less than 1000, so that is probably a good default.
ok great! Thanks for the tip :-)
Sorry, actually I don't really see why 42024 makes no sense - could you explain a bit?
Is this a set of defaults for the sake of defaults, or are we configuring a very specific model as a default that is actually a working model?
If you're using some real model as a default, then yes, you want the actual numbers. If these are just random numbers, then why would you want to set it to 42024?
I think @patrickvonplaten has the right intention, following what is established across the repo. In every configuration file (e.g., BERT, ALBERT, CTRL), we have defaults for configurations so that initializing one without arguments yields a sensible architecture based on existing pre-trained weights. If that is not the case for a model, then it slipped through review and should be updated.
This means that doing BertModel(BertConfig()) yields a model similar to the original BERT model. This makes it easier to work with, as we don't have a myriad of (usually different) arguments to supply to each configuration when doing tests.
Also, this is the convention we have adopted until now and I see no strong argument against it that would lead to changing this convention. I also see no argument on why FSMT would need to differ in that regard.
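For concreteness, this is the pattern being described (a quick sketch; the exact default values depend on the library version):

```python
from transformers import BertConfig, BertModel

# No arguments: the config falls back to the library defaults,
# which mirror a base-sized BERT architecture.
config = BertConfig()
print(config.num_hidden_layers, config.hidden_size)  # 12 768

# Randomly initialized weights, but architecturally a standard base BERT.
model = BertModel(config)
```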
Because the configuration process for FSMT was complex, using defaults originally masked problems, so by not having defaults we immediately detected when the right config wasn't being used. I'm concerned that by adding defaults that look about right, all kinds of unexpected problems may arise.
I would say that if some problems were masked by using defaults, then some additional tests should be added to ensure these problems do not resurface, either for this model or for others.
> this is the convention we have adopted until now and I see no strong argument against it that would lead to changing this convention
If you're saying that this convention practically works for this project, then it is.
> some additional tests should be added
Yes, I have just done that here #7860
@LysandreJik The convention doesn't make as much sense for situations where you have many checkpoints trained from scratch for different datasets, like FSMT/Marian. They all have different vocab sizes and none is meaningfully the "base" model in a bert-base-cased way.
Similarly, there are two completely different blenderbot checkpoints and I arbitrarily chose the smaller to be the config defaults.
As I mentioned twice earlier, this is the way of the project, and so it is.
So I will just address @patrickvonplaten's comment, since I think I know why we are missing each other here:
> I guess for me it's quite natural that if you do `config = BertConfig()` you would expect to get the most commonly used / standard BERT config, being bert-base-cased, which is useful information IMO.
I totally agree! It works for Bert and other similar models. It doesn't work for translation models, IMHO.
What would a user do with a default German-to-English translation model whose 40K vocab doesn't actually exist? The model the user ends up with will not be functional - yes, they have to configure it themselves if they don't use a pretrained model. There is no magical way around it.
The key here is that certain models do not have sensible defaults, and when that is the case, giving a default that looks very much like a correct one just for consistency's sake is questionable engineering-wise.
It would work for automatic tests, as in ThomWolf's recent PR, as long as you don't try to do any qualitative checks, but it won't do anything useful for end users.
LysandreJik
left a comment
LGTM!
What does this PR do?
There is an edge case for which the "diff" save method of `PretrainedConfig` fails. We decided a while ago in PR #3797 that we wanted more readable configs and thus tweaked the `save_pretrained()` method so that only parameters that differ from the default `PretrainedConfig` class are serialized. There was an edge case we did not consider:
If a parameter like `add_cross_attention` defaults to `True` in `ProphetNetConfig` but is by default `False` in `PretrainedConfig`, a problem can arise when a user wants to save `add_cross_attention=False` in his `ProphetNetConfig`. Because `add_cross_attention=False` corresponds to the `PretrainedConfig` default case, this parameter will not be serialized, and when reloading the config it falls back to the `ProphetNetConfig` default, which is `True` - an error.
This PR fixes this behavior by simply making sure that a parameter is only left out of the saved file if it is equal to both the `PretrainedConfig` and the `ProphetNetConfig` default.
This feature requires configs to be instantiable without providing any parameters. This is currently not possible for `EncoderDecoderModelConfig` and `RagConfig`, because those configs are composed of multiple sub-configs which have to be provided. => A new class attribute `is_composition` is added to correctly handle these classes.
Two tests are added.
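A rough sketch of the serialization logic after the fix (illustrative only - the names follow the discussion above, but the actual implementation in the PR may differ in its details):

```python
from transformers import PretrainedConfig

def to_diff_dict(config):
    """Serialize only the parameters that differ from the defaults."""
    config_dict = config.to_dict()
    # Defaults of the base class ...
    base_defaults = PretrainedConfig().to_dict()
    # ... and of the concrete class, unless it is a composition config
    # that cannot be instantiated without arguments.
    class_defaults = {} if config.is_composition else type(config)().to_dict()

    diff = {}
    for key, value in config_dict.items():
        # Keep a value if it differs from the base default OR from the class
        # default: this is what prevents e.g. add_cross_attention=False from
        # being silently dropped for ProphetNetConfig.
        if (
            key not in base_defaults
            or value != base_defaults[key]
            or (key in class_defaults and value != class_defaults[key])
        ):
            diff[key] = value
    return diff
```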
Also cc @stas00 for the FSMT config.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?