-
Notifications
You must be signed in to change notification settings - Fork 33.5k
fix LayoutLMv3TokenizerFast subword label after 'Ġ' token #21695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -508,7 +508,6 @@ def encode_plus( | |||||
| **kwargs, | ||||||
| ) | ||||||
|
|
||||||
| # Copied from transformers.models.layoutlmv2.tokenization_layoutlmv2_fast.LayoutLMv2TokenizerFast._batch_encode_plus with LayoutLMv2->LayoutLMv3 | ||||||
| def _batch_encode_plus( | ||||||
| self, | ||||||
| batch_text_or_text_pairs: Union[ | ||||||
|
|
@@ -640,18 +639,23 @@ def _batch_encode_plus( | |||||
| else: | ||||||
| original_index = batch_index | ||||||
| labels_example = [] | ||||||
| previous_token_empty = False | ||||||
| for id, offset, word_id in zip( | ||||||
| sanitized_tokens["input_ids"][batch_index], | ||||||
| sanitized_tokens["offset_mapping"][batch_index], | ||||||
| sanitized_encodings[batch_index].word_ids, | ||||||
| ): | ||||||
| if word_id is not None: | ||||||
| if self.only_label_first_subword: | ||||||
| if offset[0] == 0: | ||||||
| if offset[0] == 0 and not previous_token_empty: | ||||||
| # Use the real label id for the first token of the word, and padding ids for the remaining tokens | ||||||
| labels_example.append(word_labels[original_index][word_id]) | ||||||
| else: | ||||||
| labels_example.append(self.pad_token_label) | ||||||
| if offset == (0, 0): | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
I'm not sure offset == (0,0) is the right way to check this, maybe this is a safer option
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can confirm this works when testing on a new model (UDOP)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes I believe it is equivalent, although slower than simply comparing a tuple to (0, 0) it is probably safer and resilient to future changes.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just checked for LayoutLMv3TokenizerFast,
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok interesting. When working on When testing out the following with offset == (0,0) from this branch: it gives me: (I used "weirdly" just to make sure we get multiple tokens). Interestingly it splits the word "a" into 2 tokens: an empty token and "a". I also printed (id, offset, word_id): and in this case the empty token has offset (0, 1), which explains why offset == (0,0) didn't work.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is probably not related to the current issue but your tokenizer produces weird (pun intended) offsets mapping. We should be able to derive word length from the |
||||||
| previous_token_empty = True | ||||||
| else: | ||||||
| previous_token_empty = False | ||||||
| else: | ||||||
| labels_example.append(word_labels[original_index][word_id]) | ||||||
| else: | ||||||
|
|
||||||
Uh oh!
There was an error while loading. Please reload this page.