
Conversation

@BigBird01
Contributor

Add the DeBERTa model to HF Transformers. DeBERTa applies two techniques to improve on RoBERTa: disentangled attention and an enhanced mask decoder. With 80GB of training data, DeBERTa outperforms RoBERTa on a majority of NLU tasks, e.g. SQuAD, MNLI and RACE. Paper link: https://arxiv.org/abs/2006.03654
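For readers unfamiliar with the paper, a minimal conceptual sketch of the disentangled attention score (illustrative only, not the code in this PR; the shapes and names are assumptions, and the relative-position projections are treated as pre-gathered per-pair tensors for simplicity):

import torch

def disentangled_scores(q_c, k_c, q_r, k_r):
    # q_c, k_c: content projections, shape (batch, seq, dim)
    # q_r, k_r: pre-gathered relative-position projections, shape (batch, seq, seq, dim)
    c2c = torch.einsum("bid,bjd->bij", q_c, k_c)   # content-to-content
    c2p = torch.einsum("bid,bijd->bij", q_c, k_r)  # content-to-position
    p2c = torch.einsum("bijd,bjd->bij", q_r, k_c)  # position-to-content
    return (c2c + c2p + p2c) / (3 * q_c.size(-1)) ** 0.5  # scaled by sqrt(3d), as in the paper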

@julien-c julien-c added the model card label Jul 21, 2020
@codecov

codecov bot commented Jul 21, 2020

Codecov Report

Merging #5929 into master will decrease coverage by 0.30%.
The diff coverage is 73.13%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5929      +/-   ##
==========================================
- Coverage   79.35%   79.05%   -0.31%     
==========================================
  Files         181      184       +3     
  Lines       35800    36660     +860     
==========================================
+ Hits        28410    28982     +572     
- Misses       7390     7678     +288     
Impacted Files Coverage Δ
src/transformers/activations.py 76.92% <50.00%> (-6.42%) ⬇️
src/transformers/tokenization_deberta.py 69.76% <69.76%> (ø)
src/transformers/modeling_deberta.py 73.26% <73.26%> (ø)
src/transformers/__init__.py 99.39% <100.00%> (+0.01%) ⬆️
src/transformers/configuration_auto.py 96.34% <100.00%> (+0.04%) ⬆️
src/transformers/configuration_deberta.py 100.00% <100.00%> (ø)
src/transformers/modeling_auto.py 87.11% <100.00%> (+0.06%) ⬆️
src/transformers/tokenization_auto.py 92.64% <100.00%> (+0.10%) ⬆️
src/transformers/modeling_tf_lxmert.py 22.14% <0.00%> (-72.41%) ⬇️
src/transformers/tokenization_pegasus.py 46.03% <0.00%> (-49.21%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fc4de6...0a08565.

@julien-c julien-c added the model card and New model labels and removed the model card label Jul 21, 2020
@BigBird01 BigBird01 mentioned this pull request Jul 21, 2020
@BigBird01
Contributor Author

Related issue #4858

Contributor

@sshleifer sshleifer left a comment

Thanks for the contribution!
I left some comments. Most importantly, we can't add a deberta dependency.
Let me know if there's anything I can do to help :)

@mrm8488
Contributor

mrm8488 commented Jul 22, 2020

Someone is waiting to fine-tune a new model :)

@LysandreJik
Member

Very cool, looking forward to that model!!

@BigBird01
Contributor Author

Hello, may I know when the PR will be merged?

@sshleifer
Contributor

@BigBird01 it will probably take between 1 and 3 weeks to merge. 2,500 lines is a lot to review :)
I made some comments, and can take another pass on this when they're addressed.

@BigBird01
Contributor Author

> @BigBird01 it will probably take between 1 and 3 weeks to merge. 2,500 lines is a lot to review :)
> I made some comments, and can take another pass on this when they're addressed.

Thanks!

@LysandreJik
Member

Thanks for addressing the comments! Will take a look in a few days.

@LysandreJik LysandreJik self-requested a review September 3, 2020 09:09
Member

@LysandreJik LysandreJik left a comment

Great, it's nearly done! Thanks a lot for your work on it.

What's left to do is:

  • Ensure that the documentation is in the correct format
  • Enable the remaining tests

If you don't have time to work on it right now, let me know and I'll finish the implementation and merge it. Thanks!

README.md Outdated
Comment on lines 175 to 177
Member

Suggested change
-22. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
-25. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
-26. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
+25. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
+26. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
+27. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.

Member

Is this necessary?

Member

Still don't understand if this is absolutely necessary or not

Comment on lines 37 to 62
Member

Would be great if this could be in the same style as the library's docstrings. See BERT's docstrings for example:

Args:
    vocab_size (:obj:`int`, optional, defaults to 30522):
        Vocabulary size of the BERT model. Defines the different tokens that
        can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
    hidden_size (:obj:`int`, optional, defaults to 768):
        Dimensionality of the encoder layers and the pooler layer.
    num_hidden_layers (:obj:`int`, optional, defaults to 12):
        Number of hidden layers in the Transformer encoder.
    num_attention_heads (:obj:`int`, optional, defaults to 12):
        Number of attention heads for each attention layer in the Transformer encoder.
    intermediate_size (:obj:`int`, optional, defaults to 3072):
        Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
    hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
        The non-linear activation function (function or string) in the encoder and pooler.
        If string, "gelu", "relu", "swish" and "gelu_new" are supported.
    hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
    attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
        The dropout ratio for the attention probabilities.
    max_position_embeddings (:obj:`int`, optional, defaults to 512):
        The maximum sequence length that this model might ever be used with.
        Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
    type_vocab_size (:obj:`int`, optional, defaults to 2):
        The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
    initializer_range (:obj:`float`, optional, defaults to 0.02):
        The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
    layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
        The epsilon used by the layer normalization layers.
    gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
        If True, use gradient checkpointing to save memory at the expense of a slower backward pass.

Comment on lines 35 to 38
Member

We only support torch>=1.0.1 so there's no need for this assertion

Member

Should remove this check

Member

Suggested change
-There is an issue for tracing customer python torch Function, using this decorator to work around it.
+There is an issue for tracing custom python torch Function, using this decorator to work around it.
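For context, a hedged sketch of what such a work-around can look like (not the decorator from this PR; `MaskedSoftmax`, `traceable` and `masked_softmax` are made-up names): while torch.jit is tracing, route the call through built-in ops so the tracer never has to record the custom Function.

import torch

class MaskedSoftmax(torch.autograd.Function):
    # hypothetical custom Function of the kind being discussed
    @staticmethod
    def forward(ctx, scores, mask):
        # mask: bool tensor, True for positions to keep
        out = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (out,) = ctx.saved_tensors
        # standard softmax backward: out * (grad - sum(grad * out))
        return out * (grad_out - (grad_out * out).sum(dim=-1, keepdim=True)), None

def traceable(fallback):
    # decorator: use `fallback` (built-in ops only) whenever torch.jit is tracing
    def wrap(fn):
        def inner(*args, **kwargs):
            return fallback(*args, **kwargs) if torch.jit.is_tracing() else fn(*args, **kwargs)
        return inner
    return wrap

@traceable(lambda scores, mask: torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1))
def masked_softmax(scores, mask):
    return MaskedSoftmax.apply(scores, mask)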

Comment on lines 1111 to 1146
Member

Same here regarding docs

Comment on lines 1207 to 1233
Member

Need to handle the hidden states and attentions there

Member

DeBERTa?

Suggested change
-A BERT sequence has the following format:
+A DeBERTa sequence has the following format:
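For reference, the layout the docstring describes is the standard single-sequence / sequence-pair format; a minimal sketch of the expected output (assumed behavior for illustration, not the PR's tokenizer code):

def build_inputs_with_special_tokens(cls_id, sep_id, ids_a, ids_b=None):
    # single sequence:    [CLS] A [SEP]
    # pair of sequences:  [CLS] A [SEP] B [SEP]
    if ids_b is None:
        return [cls_id] + ids_a + [sep_id]
    return [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]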

Collaborator

Also please copy and adapt from the template/BERT tokenizer.

Member

Suggested change
-A BERT sequence pair mask has the following format:
+A DeBERTa sequence pair mask has the following format:
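The "sequence pair mask" here is the token_type_ids pattern: 0s for the first sequence, 1s for the second. A hedged sketch of the expected output (assumed behavior, not the PR's implementation):

def create_token_type_ids_from_sequences(ids_a, ids_b=None):
    # [CLS] A [SEP]          -> all zeros
    # [CLS] A [SEP] B [SEP]  -> zeros for the first segment, ones for the second
    if ids_b is None:
        return [0] * (len(ids_a) + 2)
    return [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)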

Collaborator

Also please copy and adapt from the template/BERT tokenizer.

Comment on lines 228 to 242
Member

We'll need to enable these for the merge

@BigBird01
Contributor Author

BigBird01 commented Sep 8, 2020

> Great, it's nearly done! Thanks a lot for your work on it.
>
> What's left to do is:
>
>   • Ensure that the documentation is in the correct format
>   • Enable the remaining tests
>
> If you don't have time to work on it right now, let me know and I'll finish the implementation and merge it. Thanks!

@LysandreJik Thanks for the comments. It would be great if you could work on the rest :) Feel free to let me know if you have any questions about the implementation.

@LysandreJik
Member

Okay @BigBird01, I have the PyTorch version ready, passing all tests and the docs cleaned up as well. Should I push directly on your branch, or do you want me to open a PR on your fork so that you can check my changes before applying them?

The value used to pad input_ids.
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether to add absolute position embeddings to the content embeddings.
pos_att_type (:obj:`str`, `optional`, defaults to "None"):
Contributor

Would prefer to directly use a list of strings here instead of p2c|c2p|...; it is a bit strange that we do string operations in the modeling file such as .split("|"). What do you think, @LysandreJik?

Contributor Author

At the beginning we explored more options here, but we finally settled on p2c|c2p. It's OK to change it to a list.
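For illustration, the two config styles under discussion look roughly like this (hypothetical snippet, not the merged code):

# string style: the modeling code has to parse the value itself
pos_att_type = "p2c|c2p"
parsed = [t.strip() for t in pos_att_type.lower().split("|") if t]
# list style: no string parsing needed inside the model
pos_att_type = ["p2c", "c2p"]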

Contributor

@patrickvonplaten patrickvonplaten left a comment

Thanks a lot for your PR @BigBird01! I'm very excited to have a new attention mechanism in the library, and your results in the paper look awesome!

A couple of things I would like to change before merging:

  • I would be happier if the layer names could be renamed from Bert... -> DeBert... for better consistency. Also, I think we should delete the MaskedLayerNorm as explained below.
  • Lastly, can we change config.pos_att_type from a string that is converted to a list into a list directly?

Collaborator

@sgugger sgugger left a comment

Thanks a lot for adding this model.
It looks like you have used an old version of the template for the docstrings though, so there are quite a few things to adapt/update. We can help with that if necessary.

More generally, for the naming, I don't really like the capitals in the model names. For instance, BERT is BertModel/BertConfig/BertTokenizer and RoBERTa is RobertaModel/RobertaConfig/RobertaTokenizer. I think we should do the same for DeBERTa and have DebertaModel/DebertaConfig/DebertaTokenizer.

Collaborator

Also please copy and adapt from the template/BERT tokenizer.

Collaborator

Also please copy and adapt from the template/BERT tokenizer.

@LysandreJik
Member

Thanks @patrickvonplaten, @sgugger for the reviews. Will implement the changes tonight.

LysandreJik and others added 3 commits September 29, 2020 17:32
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@LysandreJik LysandreJik merged commit 7a0cf0e into huggingface:master Sep 30, 2020
@LysandreJik
Member

Thanks for your work on this @BigBird01 :)

@BigBird01
Contributor Author

> Thanks for your work on this @BigBird01 :)

Thank you all for merging the code into master @LysandreJik @patrickvonplaten.
One question: why, after the merge, can't we find the documentation of the DeBERTa model at https://huggingface.co/transformers/?
Could you help check that?

@patrickvonplaten
Contributor

The documentation is online, you just have to click on "master" at the top left, right under the Hugging Face logo, as is done here: https://huggingface.co/transformers/master/. The next release will then show the DeBERTa docs by default :-)

@LysandreJik
Member

@BigBird01, two slow tests are failing with the DeBERTa models. Could you show how you implemented the integration tests so that I may investigate?

@BigBird01
Contributor Author

> @BigBird01, two slow tests are failing with the DeBERTa models. Could you show how you implemented the integration tests so that I may investigate?

@LysandreJik In the integration tests, I just feed the model fake input data and verify the model's output. It's similar to the RoBERTa tests. I may take a look at it today.
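For reference, a slow integration test of that shape typically looks roughly like the sketch below (illustrative only; the checkpoint name, token ids and checks are placeholders rather than the ones in the test suite):

import torch
from transformers import DebertaModel

def test_inference_no_head():
    # slow test: downloads the released checkpoint
    model = DebertaModel.from_pretrained("microsoft/deberta-base")
    model.eval()
    input_ids = torch.tensor([[1, 2314, 161, 2222, 56, 2]])  # arbitrary token ids
    with torch.no_grad():
        hidden_states = model(input_ids)[0]
    # shape check; the real test also compares a small slice of `hidden_states`
    # against values captured from a reference run of the original implementation
    assert hidden_states.shape == (1, input_ids.shape[1], model.config.hidden_size)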

@LysandreJik
Member

Thanks! DeBERTa may not be working correctly right now; knowing the source of the issue would be great.

@BigBird01
Contributor Author

> Thanks! DeBERTa may not be working correctly right now; knowing the source of the issue would be great.

The issue is that the model fails to load because of a parameter name mismatch. The checkpoint needs to be updated by changing the encoder name from 'bert' to 'deberta'.
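A minimal sketch of that kind of checkpoint update (an assumed approach for illustration, not the actual fix): rename the state-dict keys so the old 'bert.' prefix matches the new 'deberta.' prefix.

import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
renamed = {(k.replace("bert.", "deberta.", 1) if k.startswith("bert.") else k): v
           for k, v in state_dict.items()}
torch.save(renamed, "pytorch_model.bin")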

@BigBird01
Contributor Author

> Thanks! DeBERTa may not be working correctly right now; knowing the source of the issue would be great.

@LysandreJik I just made a fix to the test failure. #7645
Could you take a look?

@BigBird01 BigBird01 deleted the penhe/deberta branch February 4, 2021 19:14

Labels

model card, New model

8 participants