Debugging traced the issue to the first emitted token:

And I confirmed it by using the Python model directly:

Clearly the tokenizer model counts `▁Once`, not `Once`, as a token. Note that sentencepiece uses `▁` to represent a space.
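To make the convention concrete, here is a toy sketch (not the real sentencepiece library, which operates on learned subword units from a model file) of how the `▁` (U+2581) whitespace marker ends up prefixed to pieces like `▁Once`:

```python
WS = "\u2581"  # "▁", sentencepiece's whitespace marker

def to_word_pieces(text):
    """Toy mimic of the surface form: prepend the dummy-prefix space,
    map ' ' -> '▁', then cut at each marker. Real sentencepiece splits
    further into subword units, but the '▁' prefixing works the same way."""
    s = WS + text.replace(" ", WS)
    return [WS + w for w in s.split(WS) if w]

print(to_word_pieces("Once upon a time"))
# -> ['▁Once', '▁upon', '▁a', '▁time']
```

This is why the very first piece of the text carries a leading `▁` even though the input string has no leading space: the encoder adds a dummy prefix before tokenizing.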
Yes, the vocab itself has this space. But somehow when you ask sentencepiece to decode a token sequence, it doesn't print this leading space.
I am not familiar with common tokenizer practices, but it seems there is a patch for this scenario in decoding: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_processor.cc#L788 . There might be a good reason to do so.
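For illustration, a minimal Python mimic of what that decoding path does (a simplification of the linked C++ code, not the library itself): the pieces are concatenated, `▁` is mapped back to a space, and the leading dummy-prefix space is dropped, which is why the decoded text never shows the space even though the token carries it:

```python
WS = "\u2581"  # "▁", sentencepiece's whitespace marker

def decode_pieces(pieces):
    """Toy mimic of sentencepiece decoding: join the pieces, turn '▁'
    back into spaces, and strip the single leading space that the
    encoder's dummy prefix introduced."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text

print(decode_pieces(["\u2581Once", "\u2581upon"]))
# -> "Once upon"
```

So the vocab entry really is `▁Once`, but round-tripping through decode hides the leading space by design.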
This was fixed a while back.