-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sentence piece legacy false compatibility #6154
Conversation
… attribute Signed-off-by: arendu <[email protected]>
Signed-off-by: arendu <[email protected]>
Signed-off-by: arendu <[email protected]>
for more information, see https://pre-commit.ci
self.tokenizer.pad_token = self.tokenizer.ids_to_tokens([self.tokenizer.pad_id])[0] | ||
self.tokenizer.bos_token = self.tokenizer.ids_to_tokens([self.tokenizer.bos_id])[0] | ||
self.tokenizer.eos_token = self.tokenizer.ids_to_tokens([self.tokenizer.eos_id])[0] | ||
self.tokenizer.legacy = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't worked with SentencePiece tokenizer so it's best to check with @MaximumEntropy
One thing that catches my eye is that if it is a sentence piece tokenizer, self.tokenizer.legacy will always become True due to this line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* patch to allow using tokenizers without additional_special_tokens_ids attribute Signed-off-by: arendu <[email protected]> * make legacy=False tokenizer compatible with prompt learning Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* patch to allow using tokenizers without additional_special_tokens_ids attribute Signed-off-by: arendu <[email protected]> * make legacy=False tokenizer compatible with prompt learning Signed-off-by: arendu <[email protected]> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <[email protected]> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: hsiehjackson <[email protected]>
What does this PR do ?
Makes prompt learning compatible when sp tokenizer has legacy = False.
Collection: [NLP]
Changelog
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.
Additional Information