omit the leading space on the first token #89
Conversation
It's not only position zero that would need the removal. The sentencepiece logic is here:
Still also a bit confused why sentencepiece even needs to do this, or how that works. In the BPE world of GPT that I'm used to, there is no need for special postprocessing like this or for stripping whitespace in special cases.
It appears that BPE has tokens without a leading space for frequent words. So when decoding the first word, the transformer itself will choose the token without the space, because it was trained to do so. But models that use WordPiece do not "see" the spaces, because there is only one token for each subword, so they cannot learn that the first word is usually not preceded by a space.
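To illustrate the BPE side of this, here is a small sketch using the GPT-2 byte-level BPE via the tiktoken package (tiktoken is an assumption here, not something used in this repo):
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# "Once" at the start of text and " Once" after a space are different byte
# sequences, so byte-level BPE gives them different token sequences and the
# model learns which variant to emit; decoding is a plain concatenation.
print(enc.encode("Once"), enc.encode(" Once"))
print(enc.decode(enc.encode("Once upon a time")))  # round-trips with no space fix-up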
By default the SentencePiece implementation adds whitespace to the beginning of the text during preprocessing -- besides removing leading, trailing, and duplicate internal whitespace. See --add_dummy_prefix and --remove_extra_whitespaces here: The decoding code removes whitespace (if present) from a piece that follows BOS. I'm not certain about this choice either. The SentencePiece implementation uses whitespace to differentiate between subwords that are a continuation of a word and those that are not (at least for some languages). It seems someone found it advantageous to use the same id for a subword at the beginning of the text as for the same subword elsewhere in the text.
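As a small sketch of that last point (the pieces shown as expected values come from the LLaMA tokenizer.model used later in this thread, so re-run locally to confirm):
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

# With add_dummy_prefix, a word at the very start of the text gets the same
# "▁"-prefixed piece (and id) as when it follows a space elsewhere in the text.
print([sp.id_to_piece(t) for t in sp.encode("upon")])       # expected: ['▁upon']
print([sp.id_to_piece(t) for t in sp.encode("Once upon")])  # expected: ['▁Once', '▁upon']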
Sigh, sentencepiece 🤦. Let's not worry about this whitespace for now, it just confuses everything. Maybe we'll come back around at a later time.
It is not a bug, it is a property of tokenizers based on Unigram. If the word "Once" is tokenized as " Once" and there are no other versions of it, then we need to remove the space when outputting the first word. This applies to all word prefixes and small words. It is just done automatically by the tokenizer library in Python, so we do not see it. Look at the example at the very end here:
This shows how the tokenizer works:
$ python3
>>> import sentencepiece as spm
>>> sp = spm.SentencePieceProcessor(model_file='tokenizer.model')
>>>
>>> sp.encode("Hello world!")
[15043, 3186, 29991]
>>> sp.id_to_piece([15043])
['▁Hello']
>>> sp.id_to_piece([3186])
['▁world']
>>> sp.id_to_piece([29991])
['!']
>>> sp.decode([15043, 3186, 29991])
'Hello world!'
>>>
>>> sp.encode("Once upon a time")
[9038, 2501, 263, 931]
>>> sp.id_to_piece([9038])
['▁Once']
>>> sp.id_to_piece([2501])
['▁upon']
>>> sp.id_to_piece([263])
['▁a']
>>> sp.id_to_piece([931])
['▁time']
>>> sp.decode([9038, 2501, 263, 931])
'Once upon a time'
So we must remove the space. It can be done either:
Thanks for the example. Any idea why the preprocessing even adds these spaces?
I was able to extract the training and normalizer flags/parameters from tokenizer.model (by decoding the protobuf message).
Snippet to decode/print the above params from the model file:
import sentencepiece.sentencepiece_model_pb2
mp = sentencepiece.sentencepiece_model_pb2.ModelProto()
mp.ParseFromString(open("tokenizer.model", "rb").read())
print(mp.trainer_spec)
print(mp.normalizer_spec)
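The two flags in question can also be read off the parsed proto directly (field names per sentencepiece_model.proto; the True values are what the rest of the thread implies, so treat them as expected rather than verified here):
# normalizer_spec carries the preprocessing options discussed above
print(mp.normalizer_spec.add_dummy_prefix)          # expected: True
print(mp.normalizer_spec.remove_extra_whitespaces)  # expected: True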
!!! @atamurad super helpful. yeah in particular:
I found this string on the internet: "The vocabulary size is set to 32,000. A add_dummy_prefix option is set to True because words are not separated by whitespaces in Japanese." I don't really understand this sentence, how this option fixes Japanese, or why it exists.
I pushed a fix for this, and fixed a bug in the current PR, which would only have done it at pos=0 instead of right after BOS.
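For reference, the intended behavior can be sketched in Python against the sentencepiece API (BOS_ID = 1 is an assumption for a LLaMA-style vocab; the real fix lives in the C decoding loop):
import sentencepiece as spm

BOS_ID = 1  # assumed BOS id for a LLaMA-style tokenizer

def decode_pieces(sp, token_ids):
    # Strip the dummy-prefix space only from the piece that directly follows
    # BOS (keyed on the previous token, not on pos == 0), then map "▁" -> " ".
    out = []
    for i, tid in enumerate(token_ids):
        if tid == BOS_ID:
            continue
        piece = sp.id_to_piece(tid)
        if i > 0 and token_ids[i - 1] == BOS_ID and piece.startswith("▁"):
            piece = piece[1:]
        out.append(piece.replace("▁", " "))
    return "".join(out)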
I would extend your explanation with one more example:
>>> [sp.id_to_piece(t) for t in sp.encode("Once upon a time")]
['▁Once', '▁upon', '▁a', '▁time']
>>> sp.encode("Once upon a time ")
[9038, 2501, 263, 931, 29871]
>>> [sp.id_to_piece(t) for t in sp.encode("Once upon a time ")]
['▁Once', '▁upon', '▁a', '▁time', '▁']
>>> sp.decode(sp.encode("Once upon a time "))
'Once upon a time '
>>> sp.decode(sp.encode(" Once upon a time"))
'Once upon a time'