fix LayoutLMv3TokenizerFast subword label after 'Ġ' token#21695
Conversation
LayoutLMv3TokenizerFast produces empty 'Ġ' token with `offset_mapping = (0, 0)`. Next token is wrongly assumed to also be beginning of word and isn't correctly assigned `pad_token_label`. Modify test with text that produce 'Ġ' token. Remove copy check from LayoutLMv2TokenizerFast for `_batch_encode_plus`. solves issue: huggingface#19978
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
|
Also cc @amyeroberts |
There was a problem hiding this comment.
Hey @thibaultdouzon thanks all for working on this! I am having a look to check if there might be a problem with the rust BPE (may just be an edge case!). In the mean time, I think that this is great.
Edit: here are the results of my investigation:
- The BPE is behaving as expected:
Let the input words be['phone', '0000000000000000', 'phone]. The pretokenization will result in[('Ġpencil', (0, 6)), ('Ġ0000000000000000', (6, 23)), ('Ġphone', (23, 29))]because the tokenizer hasadd_prefix_spaceset toTrue. Then, each of these sequences are given to the actual BPE model which produces (tokenizer.model.tokenize('Ġ0000000000000000'))Ġpencil, 'Ġ', '0000000000000000', Ġphone. Why? BecauseĠpencilis part of the vocabulary (vocab.json) from the merge ofĠpenandcil(in the merge.txt). However,Ġ0000000000000000is not part of the vocabulary, but0000000000000000` is (idx is 49393). - The post-processor is not supposed to remove the added space. It is only when decoding that we remove what was added by BPE. Note that the space is only add for beginning tokens. However there is a problem with rust here, when decoding the extra space is still here, and the output for
'0000000000000000'will be' 0000000000000000'
|
Hi @ArthurZucker, thanks for your investigations. This PR fixes the problem for LayoutLMv3 but I expect the problem to exist on other models using Fast BPE tokenization, I will take a look when I can to list all impacted models that need a fix. |
| labels_example.append(word_labels[original_index][word_id]) | ||
| else: | ||
| labels_example.append(self.pad_token_label) | ||
| if offset == (0, 0): |
There was a problem hiding this comment.
| if offset == (0, 0): | |
| if self.decode(id) == "": |
I'm not sure offset == (0,0) is the right way to check this, maybe this is a safer option
There was a problem hiding this comment.
I can confirm this works when testing on a new model (UDOP)
There was a problem hiding this comment.
Yes I believe it is equivalent, although slower than simply comparing a tuple to (0, 0) it is probably safer and resilient to future changes.
It works because special tokens are checked line 648 with word_id is not None. Thus an offset of (0, 0) cannot be a special token at this position in code.
There was a problem hiding this comment.
I just checked for LayoutLMv3TokenizerFast, tok.decode(1437) == " ", where 1437 is the id of the "Ġ" token.
There was a problem hiding this comment.
Ok interesting. When working on UdopTokenizerFast (UDOP is a new model for which I'll open a PR soon), I had to use self.decode(id) == "", cause it didn't work with offset == (0, 0). UDOP has the same vocabulary as T5, which is based on SentencePiece.
When testing out the following with offset == (0,0) from this branch:
from transformers import UdopTokenizerFast
words = ['a', 'weirdly', 'test', 'hello']
boxes = [[1,2,3,4] for _ in range(len(words))]
labels = [1, 2, 3, 4]
tokenizer = UdopTokenizerFast.from_pretrained("nielsr/udop-large")
encoding = tokenizer(words, boxes=boxes, word_labels=labels)
for id, label in zip(encoding.input_ids, encoding.labels):
print(tokenizer.decode([id]), label)
it gives me:
1
a -100
weird 2
ly -100
test 3
hello 4
</s> -100
(I used "weirdly" just to make sure we get multiple tokens). Interestingly it splits the word "a" into 2 tokens: an empty token and "a". I also printed (id, offset, word_id):
(0, 1) 0
a (0, 1) 0
weird (0, 5) 1
ly (5, 7) 1
test (0, 4) 2
hello (0, 5) 3
and in this case the empty token has offset (0, 1), which explains why offset == (0,0) didn't work.
There was a problem hiding this comment.
This is probably not related to the current issue but your tokenizer produces weird (pun intended) offsets mapping.
(0, 1)
a (0, 1)
weird (0, 5)
ly (5, 7)
We should be able to derive word length from the offset_mapping, ie "weirdly" is of length (5-0) + (7-5) = 7. But this does not hold anymore with this empty token not being assigned (0, 0) offset.
|
Thanks a lot for this fix, would you be able to take into account my comment such that we can merge it? 🙏 Thanks! Btw the same fix could then be applied to LayoutLMv2 and LayoutXLM |
|
LayoutLMv2 uses WordPiece and not BPE. From what I saw its vocabulary does not contain empty token and thus cannot produce (0, 0) offset_mapping when encoding. |
…e#21695) LayoutLMv3TokenizerFast produces empty 'Ġ' token with `offset_mapping = (0, 0)`. Next token is wrongly assumed to also be beginning of word and isn't correctly assigned `pad_token_label`. Modify test with text that produce 'Ġ' token. Remove copy check from LayoutLMv2TokenizerFast for `_batch_encode_plus`. solves issue: huggingface#19978
…e#21695) LayoutLMv3TokenizerFast produces empty 'Ġ' token with `offset_mapping = (0, 0)`. Next token is wrongly assumed to also be beginning of word and isn't correctly assigned `pad_token_label`. Modify test with text that produce 'Ġ' token. Remove copy check from LayoutLMv2TokenizerFast for `_batch_encode_plus`. solves issue: huggingface#19978
LayoutLMv3TokenizerFast produces empty 'Ġ' token with
offset_mapping = (0, 0).Next token is wrongly assumed to also be beginning of word and isn't correctly assigned
pad_token_label.This may lead to misalignment of words and token representations.
Other BPE tokenizers might be affected
Add check for previous token if it had an empty
offset_mapping(not including special tokens)Remove copy check from LayoutLMv2TokenizerFast for
_batch_encode_plusbecause it is not affected (uses WordPiece instead of BPE)Modify test with text that produce 'Ġ' token.
Fixes issue: #19978
@NielsRogge
@ArthurZucker