
Conversation

@patrickvonplaten
Contributor

@patrickvonplaten patrickvonplaten commented Oct 21, 2020

What does this PR do?

There is an edge case for which the "diff" save method for PretrainedConfig fails. We decided a while ago, in #3797, that we wanted more readable configs and therefore tweaked the save_pretrained() method so that only parameters that differ from the defaults of the PretrainedConfig class are serialized.

There was an edge case we did not consider:

If a parameter, like add_cross_attention, defaults to True in ProphetNetConfig but defaults to False in PretrainedConfig, a problem arises when a user wants to save add_cross_attention=False in their ProphetNetConfig. Because add_cross_attention=False matches the PretrainedConfig default, this parameter is not serialized, so when the config is reloaded the parameter falls back to the ProphetNetConfig default of True, which is wrong.

This PR fixes this behavior by making sure that a parameter is only omitted from serialization if it is equal to the defaults of both PretrainedConfig and the concrete config class (here ProphetNetConfig).
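For illustration, here is a minimal sketch of the new rule as a standalone helper (config_to_diff_dict is a hypothetical name; the real logic lives inside PretrainedConfig's own serialization code):

from transformers import PretrainedConfig

def config_to_diff_dict(config):
    """Sketch: drop a value from the serialized dict only if it equals BOTH the
    PretrainedConfig default and the default of the concrete config class."""
    config_dict = config.to_dict()
    base_defaults = PretrainedConfig().to_dict()
    # Composition configs (see below) cannot be built without arguments,
    # so for them only the PretrainedConfig defaults are compared.
    class_defaults = (
        {} if getattr(config.__class__, "is_composition", False) else config.__class__().to_dict()
    )

    serializable = {}
    for key, value in config_dict.items():
        # Keep the value unless it matches both sets of defaults.
        if (
            key not in base_defaults
            or value != base_defaults[key]
            or value != class_defaults.get(key, value)
        ):
            serializable[key] = value
    return serializable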

This feature requires that configs can be instantiated without providing any parameters. This is currently not possible for the EncoderDecoderModelConfig and RagConfig because those configs are composed of multiple sub-configs which have to be provided. => A new class attribute is_composition is added to handle these classes correctly.
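Roughly, such a config just declares the new flag (SomeCompositionConfig is a made-up name standing in for the encoder-decoder and RAG configs):

from transformers import PretrainedConfig

class SomeCompositionConfig(PretrainedConfig):
    # Built from several sub-configs that must be passed in, so it cannot be
    # instantiated without arguments; the diff logic above then compares only
    # against the PretrainedConfig defaults.
    is_composition = True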

Two tests are added.

Also cc @stas00 for the FSMT config.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@sgugger
Collaborator

sgugger commented Oct 21, 2020

Any reason not to look at just the config class? At first glance, I'd say we want to compare the defaults to the class we instantiated, not to the superclass PretrainedConfig.

@patrickvonplaten
Contributor Author

Any reason not to look at just the config class? At first glance, I'd say we want to compare the defaults to the class we instantiated, not to the superclass PretrainedConfig.

Back then this was my initial idea as well - but then the configs could end up more or less empty if all parameters match the defaults. This has a couple of disadvantages:

  • When looking at the config online, people cannot see any parameters and would have to look into the code, which might be annoying
  • This would make the configs much more prone to breaking if the init values of the respective classes are changed.

langs,
src_vocab_size,
tgt_vocab_size,
langs=["en", "de"],
Contributor Author

@patrickvonplaten patrickvonplaten Oct 21, 2020


Mainly for consistency with other configs in the library, it should be possible to instantiate every config (which is not a "composition" config) without providing any parameters:

config = self.config_cls()

I added these init params from: https://huggingface.co/facebook/wmt19-en-de (cc @stas00)
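A quick sanity check of what this enables (a sketch; it assumes the FSMT config class is exported as FSMTConfig):

from transformers import FSMTConfig

config = FSMTConfig()  # should now work without any arguments, using the defaults added above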

Contributor

@stas00 stas00 Oct 21, 2020


I'd say, in such a case, let's use langs=["xx", "yy"] so it's clear it's nonsense. Using meaningful data here is misleading at best.

Contributor Author

@patrickvonplaten patrickvonplaten Oct 21, 2020


But it's the actual configuration_fsmt.py file, no? All other configuration files use the actual params of one of the main models as defaults. E.g. BertConfig() uses the params of bert-base-cased as defaults. I don't really see why this is misleading... @sgugger @LysandreJik @sshleifer what is your opinion here?

def check_config_can_be_init_without_params(self):
    if self.config_class.is_composition:
        return
    config = self.config_class()
Contributor Author


This makes sure that every config can be instantiated without providing any parameters.

@patrickvonplaten
Contributor Author

patrickvonplaten commented Oct 21, 2020

UPDATE: I had to add a class attribute to the config to make this feature work (see description above) - @julien-c @sgugger @thomwolf @LysandreJik - could you check if this is fine for you guys?

Collaborator

@sgugger sgugger left a comment


Understood, this LGTM then! Thanks!

@julien-c
Member

LGTM

tgt_vocab_size,
langs=["en", "de"],
src_vocab_size=42024,
tgt_vocab_size=42024,
Contributor

@stas00 stas00 Oct 21, 2020


The suggested vocab size defaults make no sense w/o the corresponding vocab. We may just as well set them to 1000 if defaults are now required. I remember some common tests fail if the vocab size is less than 1000, so that is probably a good default.

Contributor Author


ok great! Thanks for the tip :-)

Contributor Author


Sorry, actually I don't really see why 42024 makes no sense - could you explain a bit?

Contributor

@stas00 stas00 Oct 21, 2020


Is this a set of defaults for the sake of defaults, or are we configuring a very specific model as a default that is actually a working model?

Contributor


If you're using some real model as a default, then yes, you want the actual numbers. If these are just random numbers, then why would you want to set it to 42024?

Member


I think @patrickvonplaten has the right intentions, following what is established across the repo. In every configuration file (e.g., BERT, ALBERT, CTRL), we have defaults for configurations so that initializing one without arguments yields a sensible architecture based on existing pre-trained weights. If that is not the case for a model, then it slipped through review and should be updated.

This means that doing BertModel(BertConfig()) yields a model similar to the original BERT model. This makes it easier to work with, as we don't have a myriad of (usually different) arguments to supply to each configuration when doing tests.
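For example, as a rough sketch of that convention (not a claim about the exact default values):

from transformers import BertConfig, BertModel

config = BertConfig()      # no arguments: defaults mirror one of the main pretrained BERT checkpoints
model = BertModel(config)  # same architecture shape as that checkpoint, with randomly initialized weights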

Also, this is the convention we have adopted until now and I see no strong argument against it that would lead to changing this convention. I also see no argument on why FSMT would need to differ in that regard.

@stas00:

Because the configuration process for fsmt was complex, using some defaults originally masked problems, so by not having defaults we detected problems of not getting the right config immediately. I'm concerned that if we add defaults that look about right, all kinds of unexpected problems may arise.

I would say that if some problems were masked by using some defaults, then some additional tests should be added to ensure that those problems do not resurface, either for this model or for others.

Contributor

@stas00 stas00 Oct 22, 2020


this is the convention we have adopted until now and I see no strong argument against it that would lead to changing this convention

If you're saying that this convention works in practice for this project, then so be it.

some additional tests should be added

Yes, I have just done that here #7860

Contributor

@sshleifer sshleifer Oct 22, 2020


@LysandreJik The convention doesn't make as much sense for situations where you have many checkpoints trained from scratch for different datasets, like FSMT/Marian. They all have different vocab sizes and none is meaningfully the "base" model in a bert-base-cased way.

Contributor


Similarly, there are two completely different blenderbot checkpoints and I arbitrarily chose the smaller to be the config defaults.

Contributor

@stas00 stas00 Oct 22, 2020


As I mentioned twice earlier, this is the way of the project, and so it is.

So I will just address @patrickvonplaten's comment, since I think I know why we are talking past each other here:

I guess for me it's quite natural that if you do:

config = BertConfig()

you would expect to get the most commonly used / standard BERT config being bert-base-cased, which is useful information IMO.

I totally agree! It works for Bert and other similar models. It doesn't work for translation models, IMHO.

What would a user do with a default German-to-English translation model whose 40K-entry vocab doesn't correspond to any existing vocabulary? The model the user ends up with will not be functional. Yes, they have to configure it if they don't use a pretrained model; there is no magical way around it.

The key here is that certain models do not have sensible defaults, and when that is the case, providing a default that looks very much like a correct one just for consistency's sake is questionable from an engineering standpoint.

It'd work for automatic tests, as in @thomwolf's recent PR, as long as you don't try to do any qualitative tests, but it won't do anything useful for end users.

Member

@LysandreJik LysandreJik left a comment


LGTM!

@patrickvonplaten patrickvonplaten merged commit f34372a into huggingface:master Oct 22, 2020