
Conversation

@deutschmn
Contributor

What does this PR do?

This PR fixes #15735. It changes the behavior of DebertaTokenizer and DebertaTokenizerFast when passing pair inputs: before, the token type IDs were all 0; now, the token_type_ids for the tokens of the second sentence are 1.

It also adds a test case for this behavior (DebertaTokenizationTest.test_token_type_ids), which failed before this change and passes now.
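For illustration (not part of the PR), a minimal sketch of the behavior after this fix; the choice of the microsoft/deberta-base checkpoint and the exact ID pattern shown in the comment are assumptions:

```python
from transformers import DebertaTokenizer, DebertaTokenizerFast

# Encode a sentence pair with both the slow and the fast tokenizer.
slow = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
fast = DebertaTokenizerFast.from_pretrained("microsoft/deberta-base")

encoded = slow("Hello world", "How are you?")

# With this fix, the tokens of the second sentence (and its trailing [SEP])
# get token type ID 1 instead of 0, i.e. something like:
# [0, 0, 0, 0, 1, 1, 1, 1, 1]
print(encoded["token_type_ids"])

# Since the PR changes both tokenizers, the slow and fast versions
# should agree on the pair encoding.
assert encoded["token_type_ids"] == fast("Hello world", "How are you?")["token_type_ids"]
```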


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik?

@HuggingFaceDocBuilderDev commented May 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik
Member

Looks good to me, important bugfix. WDYT @SaulLu?

@SaulLu
Contributor

Thank you very much for the proposed fix @deutschmn 🤗!

For more context, I have traced the history of the changes concerning DeBERTa's token_type_ids:

  1. In the first PR #5929, where DeBERTa was added, the IDs were 0 for the first sentence and 1 for the second;
  2. When the fast tokenizer was added in PR #11387, the choice at that time was to assign ID 0 to both sentences (see lines here and here);
  3. In a later bug fix, PR #10703, it seems the slow tokenizer's IDs were aligned with the fast version, i.e. all 0s (diff here; thanks @daniel-ziegler).

Looking at the history, I also think this is a bug (nothing mentions that the IDs should all have been assigned 0), and the proposed fix seems to be the right one!
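As an aside, here is a rough sketch of the BERT-style pair pattern the fix restores; this is a hypothetical standalone helper for illustration, not the PR's actual code:

```python
def pair_token_type_ids(ids_a, ids_b=None):
    """Token type IDs for "[CLS] A [SEP]" or "[CLS] A [SEP] B [SEP]".

    Hypothetical helper sketching the restored pattern: the first sequence
    (plus [CLS] and the first [SEP]) gets 0s, the second sequence plus its
    trailing [SEP] gets 1s.
    """
    first_segment = [0] * (1 + len(ids_a) + 1)  # [CLS] + A + [SEP]
    if ids_b is None:
        return first_segment
    return first_segment + [1] * (len(ids_b) + 1)  # B + [SEP]
```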

@deutschmn
Contributor Author

Great, thanks for looking into this! I also checked Microsoft's implementation and it looks like they use 1 for sentence B as well 😊

@SaulLu merged commit 870e6f2 into huggingface:main on May 4, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022

Development

Successfully merging this pull request may close these issues.

DebertaTokenizer always assigns token type ID 0
