Fix start and end index calculation for unicode characters. (#1171)
Summary:
Pull Request resolved: #1171
The existing GPT2BPETokenizer incorrectly calculates the start and end indices of unicode characters.
This is because, for multi-byte characters, the decoded BPE token strings do not match the original text; the byte decoder must additionally be applied to the decoded bytes to recover the original token that was encoded.
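As a rough illustration (not the actual PyText implementation), the sketch below shows the idea: GPT-2's byte-level BPE represents each byte of the UTF-8 input as a printable unicode symbol, so a token string for a multi-byte character is longer than the character itself. Mapping token strings back through the byte decoder recovers the original surface text, which is what the offset computation must be based on. `bytes_to_unicode` follows the scheme from OpenAI's GPT-2 encoder; `token_char_spans` is a hypothetical helper used only for illustration.

```python
# Sketch of why a byte decoder is needed when mapping GPT-2 BPE tokens
# back to character offsets. Not the PyText code; `byte_encoder` and
# `byte_decoder` follow OpenAI's bytes_to_unicode scheme.

def bytes_to_unicode():
    # Map every byte (0-255) to a printable unicode character, as in GPT-2's encoder.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}


def token_char_spans(text, bpe_tokens):
    """Compute (start, end) character offsets of each BPE token string in `text`.

    For multi-byte (non-ASCII) characters, the token string does not equal
    the original text, so each token must first be run through the byte
    decoder and decoded as UTF-8 to recover the original surface form.
    """
    spans = []
    pos = 0
    for tok in bpe_tokens:
        # Using len(tok) directly would count encoder symbols, not original
        # characters; decode back to the original substring first.
        surface = bytearray(byte_decoder[c] for c in tok).decode("utf-8")
        start = text.index(surface, pos)
        end = start + len(surface)
        spans.append((start, end))
        pos = end
    return spans


# "é" is 2 bytes in UTF-8, so its token string has 2 symbols, but it spans
# only 1 character in the original text.
tok = "".join(byte_encoder[b] for b in "é".encode("utf-8"))
print(token_char_spans("café", ["caf", tok]))  # -> [(0, 3), (3, 4)]
```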
Reviewed By: chenyangyu1988
Differential Revision: D18697646
fbshipit-source-id: 8f4d32a1caa40d8d06e7be31dfd4a6846692531a