
Conversation

@deutschmn
Contributor

What does this PR do?

This PR fixes #15735. It changes the behavior of DebertaTokenizer and DebertaTokenizerFast when passing pair inputs: before, the token type IDs were all 0; now, the token_type_ids for the tokens of the second sentence are 1.

It also adds a test case for this behavior (DebertaTokenizationTest.test_token_type_ids), which failed before this change and passes now.
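For illustration (not part of the PR), a minimal sketch of the behavior after this fix; the choice of the microsoft/deberta-base checkpoint and the exact ID pattern shown in the comment are assumptions:

```python
from transformers import DebertaTokenizer, DebertaTokenizerFast

# Encode a sentence pair with both the slow and the fast tokenizer.
slow = DebertaTokenizer.from_pretrained("microsoft/deberta-base")
fast = DebertaTokenizerFast.from_pretrained("microsoft/deberta-base")

encoded = slow("Hello world", "How are you?")

# With this fix, the tokens of the second sentence (and its trailing [SEP])
# get token type ID 1 instead of 0, i.e. something like:
# [0, 0, 0, 0, 1, 1, 1, 1, 1]
print(encoded["token_type_ids"])

# Since the PR changes both tokenizers, the slow and fast versions
# should agree on the pair encoding.
assert encoded["token_type_ids"] == fast("Hello world", "How are you?")["token_type_ids"]
```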


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik?

@HuggingFaceDocBuilderDev commented May 4, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik
Member

Looks good to me, important bugfix. WDYT @SaulLu?

@SaulLu
Contributor

Thank you very much for the proposed fix @deutschmn 🤗!

For more context, I have traced the history of the changes concerning DeBERTa's token_type_ids:

  1. In the first PR #5929, where DeBERTa was added, the IDs were 0 for the first sentence and 1 for the second;
  2. When the fast tokenizer was added in PR #11387, the choice at that time was to assign ID 0 to both sentences (see lines here and here);
  3. In a later bug fix, PR #10703, it seems the slow tokenizer's IDs were aligned with the fast version, i.e. all 0s (diff here; thanks @daniel-ziegler).

Looking at the history, I also think this is a bug (nothing mentions that the IDs should all have been assigned 0), and the proposed fix seems to be the right one!
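As an aside, here is a rough sketch of the BERT-style pair pattern the fix restores; this is a hypothetical standalone helper for illustration, not the PR's actual code:

```python
def pair_token_type_ids(ids_a, ids_b=None):
    """Token type IDs for "[CLS] A [SEP]" or "[CLS] A [SEP] B [SEP]".

    Hypothetical helper sketching the restored pattern: the first sequence
    (plus [CLS] and the first [SEP]) gets 0s, the second sequence plus its
    trailing [SEP] gets 1s.
    """
    first_segment = [0] * (1 + len(ids_a) + 1)  # [CLS] + A + [SEP]
    if ids_b is None:
        return first_segment
    return first_segment + [1] * (len(ids_b) + 1)  # B + [SEP]
```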

@deutschmn
Contributor Author

Great, thanks for looking into this! I also checked Microsoft's implementation and it looks like they use 1 for sentence B as well 😊

@SaulLu merged commit 870e6f2 into huggingface:main on May 4, 2022
elusenji pushed a commit to elusenji/transformers that referenced this pull request Jun 12, 2022

Development

Successfully merging this pull request may close these issues.

DebertaTokenizer always assigns token type ID 0
