Add DeBERTa model #5929
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5929 +/- ##
==========================================
- Coverage 79.35% 79.05% -0.31%
==========================================
Files 181 184 +3
Lines 35800 36660 +860
==========================================
+ Hits 28410 28982 +572
- Misses 7390 7678 +288
Continue to review full report at Codecov.
Related issue #4858
sshleifer
left a comment
Thanks for the contribution!
I left some comments. Most importantly, we can't add a deberta dependency.
Let me know if there's anything I can do to help :)
Someone is waiting to fine-tune a new model :)
Very cool, looking forward to that model!!
Hello, may I know when the PR will be merged?
@BigBird01 it will probably take between 1 and 3 weeks to merge. 2,500 lines is a lot to review :)
Thanks!
Thanks for addressing the comments! Will take a look in a few days.
LysandreJik
left a comment
Great, it's nearly done! Thanks a lot for your work on it.
What's left to do is:
- Ensure that the documentation is in the correct format
- Enable the remaining tests
If you don't have time to work on it right now, let me know and I'll finish the implementation and merge it. Thanks!
README.md
Outdated
22. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
25. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
26. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
25. **[DeBERTa](https://huggingface.co/transformers/model_doc/deberta.html)** (from Microsoft Research) released with the paper [DeBERTa: Decoding-enhanced BERT with Disentangled Attention](https://arxiv.org/abs/2006.03654) by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen.
26. **[Other community models](https://huggingface.co/models)**, contributed by the [community](https://huggingface.co/users).
27. Want to contribute a new model? We have added a **detailed guide and templates** to guide you in the process of adding a new model. You can find them in the [`templates`](./templates) folder of the repository. Be sure to check the [contributing guidelines](./CONTRIBUTING.md) and contact the maintainers or open an issue to collect feedbacks before starting your PR.
Is this necessary?
Still don't understand if this is absolutely necessary or not
Would be great if this could be in the same style as the library's docstrings. See BERT's docstrings for example:
transformers/src/transformers/configuration_bert.py
Lines 63 to 92 in 48ff6d5
Args:
    vocab_size (:obj:`int`, optional, defaults to 30522):
        Vocabulary size of the BERT model. Defines the different tokens that
        can be represented by the `inputs_ids` passed to the forward method of :class:`~transformers.BertModel`.
    hidden_size (:obj:`int`, optional, defaults to 768):
        Dimensionality of the encoder layers and the pooler layer.
    num_hidden_layers (:obj:`int`, optional, defaults to 12):
        Number of hidden layers in the Transformer encoder.
    num_attention_heads (:obj:`int`, optional, defaults to 12):
        Number of attention heads for each attention layer in the Transformer encoder.
    intermediate_size (:obj:`int`, optional, defaults to 3072):
        Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
    hidden_act (:obj:`str` or :obj:`function`, optional, defaults to "gelu"):
        The non-linear activation function (function or string) in the encoder and pooler.
        If string, "gelu", "relu", "swish" and "gelu_new" are supported.
    hidden_dropout_prob (:obj:`float`, optional, defaults to 0.1):
        The dropout probabilitiy for all fully connected layers in the embeddings, encoder, and pooler.
    attention_probs_dropout_prob (:obj:`float`, optional, defaults to 0.1):
        The dropout ratio for the attention probabilities.
    max_position_embeddings (:obj:`int`, optional, defaults to 512):
        The maximum sequence length that this model might ever be used with.
        Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
    type_vocab_size (:obj:`int`, optional, defaults to 2):
        The vocabulary size of the `token_type_ids` passed into :class:`~transformers.BertModel`.
    initializer_range (:obj:`float`, optional, defaults to 0.02):
        The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
    layer_norm_eps (:obj:`float`, optional, defaults to 1e-12):
        The epsilon used by the layer normalization layers.
    gradient_checkpointing (:obj:`bool`, optional, defaults to :obj:`False`):
        If True, use gradient checkpointing to save memory at the expense of slower backward pass.
src/transformers/modeling_deberta.py
Outdated
We only support torch>=1.0.1 so there's no need for this assertion
Should remove this check
src/transformers/modeling_deberta.py
Outdated
There is an issue for tracing customer python torch Function, using this decorator to work around it.
There is an issue for tracing custom python torch Function, using this decorator to work around it.
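For context on the kind of workaround being described, here is a minimal hedged sketch (not the PR's actual decorator; `traceable_fallback` and the usage names are illustrative assumptions) of bypassing a custom autograd Function while torch.jit tracing is active:

```python
import torch


def traceable_fallback(plain_fn):
    """Hypothetical decorator: while torch.jit tracing is active, bypass the
    custom autograd Function (which the tracer struggles with) and call an
    equivalent implementation built from plain, traceable torch ops instead."""
    def wrap(function_cls):
        class Traced:
            @staticmethod
            def apply(*args):
                if torch.jit.is_tracing():
                    return plain_fn(*args)
                return function_cls.apply(*args)
        return Traced
    return wrap

# Usage sketch:
#   @traceable_fallback(lambda x, dim: torch.softmax(x, dim))
#   class XSoftmax(torch.autograd.Function):
#       ...
```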
src/transformers/modeling_deberta.py
Outdated
Same here regarding docs
src/transformers/modeling_deberta.py
Outdated
Need to handle the hidden states and attentions there
DeBERTa?
A BERT sequence has the following format:
A DeBERTa sequence has the following format:
Also please copy and adapt from the template/BERT tokenizer.
A BERT sequence pair mask has the following format:
A DeBERTa sequence pair mask has the following format:
Also please copy and adapt from the template/BERT tokenizer.
tests/test_modeling_deberta.py
Outdated
We'll need to enable these for the merge
@LysandreJik Thanks for the comments. It will be great if you can work on the rest :) Feel free to let me know if you have any questions on the implementation.
Okay @BigBird01, I have the PyTorch version ready, passing all tests and the docs cleaned up as well. Should I push directly on your branch, or do you want me to open a PR on your fork so that you can check my changes before applying them?
    The value used to pad input_ids.
position_biased_input (:obj:`bool`, `optional`, defaults to :obj:`True`):
    Whether to add absolute position embeddings to the content embeddings.
pos_att_type (:obj:`str`, `optional`, defaults to "None"):
Would prefer to directly use a list of strings here instead of p2c|c2p|.... It is a bit strange that we do string operations in the modeling file, such as ''.split("|"). What do you think, @LysandreJik?
At the beginning we explored more options here before finally settling on p2c|c2p. But it's OK to change it to a list.
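As an illustration, a minimal sketch (class name and handling here are assumptions, not the PR's final code) of how the config could normalize `pos_att_type` to a list once at construction time, so the modeling file never needs `''.split("|")`:

```python
class DebertaConfigSketch:
    """Hypothetical config fragment: accept either a list such as ["p2c", "c2p"]
    or the legacy "p2c|c2p" string, and store a normalized list."""

    def __init__(self, pos_att_type=None, **kwargs):
        if isinstance(pos_att_type, str):
            # Backward compatibility with the pipe-separated string form.
            pos_att_type = [t.strip() for t in pos_att_type.lower().split("|") if t.strip()]
        self.pos_att_type = pos_att_type or []


config = DebertaConfigSketch(pos_att_type="p2c|c2p")
assert config.pos_att_type == ["p2c", "c2p"]
```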
patrickvonplaten
left a comment
Thanks a lot for your PR @BigBird01! I'm very excited to have a new attention mechanism in the library, and your results in the paper look awesome!
A couple of things I would like to change before merging:
Would be happier if the layer names could be renamed from Bert... -> DeBert... for better consistency. Also I think we should delete the MaskedLayerNorm as explained below.
Lastly, can we change config.pos_att_type to be a list directly, instead of a string that gets converted to a list?
sgugger
left a comment
Thanks a lot for adding this model.
It looks like you have used an old version of the template for the docstrings though, so there are quite a few things to adapt/update. We can help with that if necessary.
More generally for the naming, I don't really like the capitals in the model names. For instance, BERT is BertModel/BertConfig/BertTokenizer, and RoBERTa is RobertaModel/RobertaConfig/RobertaTokenizer. I think we should use the same for DeBERTa and have DebertaModel/DebertaConfig/DebertaTokenizer.
Also please copy and adapt from the template/BERT tokenizer.
Also please copy and adapt from the template/BERT tokenizer.
Thanks @patrickvonplaten, @sgugger for the reviews. Will implement the changes tonight.
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Thanks for your work on this @BigBird01 :)
Thank you all for merging the code into master @LysandreJik @patrickvonplaten
The documentation is online, you just have to click on
@BigBird01, two slow tests are failing with the DeBERTa models. Could you show how you implemented the integration tests so that I may investigate?
@LysandreJik In the integration tests, I just feed the model with fake input data and verify the model's output. It's similar to the RoBERTa tests. I may take a look at it today.
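For reference, a hedged sketch of what such an integration test commonly looks like in the library; the checkpoint name, input ids, and expected shape below are placeholders rather than the actual test data:

```python
import torch
from transformers import DebertaModel


def test_inference_no_head():
    # Placeholder checkpoint name; the real test pins a specific pretrained model.
    model = DebertaModel.from_pretrained("microsoft/deberta-base")
    model.eval()
    # Fixed fake input ids, analogous to the RoBERTa integration tests.
    input_ids = torch.tensor([[0, 31414, 232, 328, 740, 1140, 12695, 69, 46078, 1588, 2]])
    with torch.no_grad():
        output = model(input_ids)[0]
    # The real test also compares a slice of the output against hard-coded values.
    assert output.shape == torch.Size((1, 11, 768))
```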
Thanks! DeBERTa may not be working correctly right now; knowing the source of the issue would be great.
The issue is that the model fails to load due to a parameter name mismatch. The checkpoint needs to be updated by changing the encoder prefix from 'bert' to 'deberta'.
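A hedged sketch of what that checkpoint update can look like (the file paths are placeholders):

```python
import torch

# Rename the encoder prefix in an existing checkpoint from "bert." to "deberta."
# so the weights match the parameter names of the new DeBERTa model class.
state_dict = torch.load("old_checkpoint/pytorch_model.bin", map_location="cpu")
renamed = {
    ("deberta." + key[len("bert."):] if key.startswith("bert.") else key): value
    for key, value in state_dict.items()
}
torch.save(renamed, "new_checkpoint/pytorch_model.bin")
```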
@LysandreJik I just made a fix to the test failure. #7645 |
Add the DeBERTa model to HF transformers. DeBERTa applies two techniques to improve on RoBERTa: disentangled attention and an enhanced mask decoder. With 80GB of training data, DeBERTa outperforms RoBERTa on a majority of NLU tasks, e.g. SQuAD, MNLI and RACE. Paper link: https://arxiv.org/abs/2006.03654
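Once merged, the model can be used through the standard transformers API; a minimal usage sketch (the checkpoint name is an assumption, substitute whichever DeBERTa checkpoint is published on the hub):

```python
import torch
from transformers import DebertaModel, DebertaTokenizer

# Assumed checkpoint name for illustration.
tokenizer = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
model = DebertaModel.from_pretrained("microsoft/deberta-base")

inputs = tokenizer("DeBERTa improves RoBERTa with disentangled attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs[0].shape)  # (batch_size, sequence_length, hidden_size)
```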