adding special_tokens from tokenizer config for transformer-lm model #7613
Conversation
Signed-off-by: Alexander Jipa <[email protected]>
Force-pushed from aae5ed9 to a2c3be2
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
@ericharper for visibility when you get the chance: this is the fix for using special tokens.
@ericharper, @wdykas, is there anything I can do from my end to unblock this PR?
Please let me know if there's anything I can do to get this PR approved, @wdykas @ericharper. Thanks!
This PR should be good to go, but we need to ensure we do not do this special-token passing for sentencepiece (it's fine for other tokenizer types). We know it causes different tokenization compared to a standard pre-trained sentencepiece model and can cause hidden issues. The recommended way to add special tokens to sentencepiece is to add them directly to the tokenizer by editing it, using something like: https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/add_special_tokens_to_sentencepiece.py
jenkins
@aklife97 are you recommending that we add a check in this PR? |
I think it should be good to merge; we need to think of a more general solution for this later...
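The check discussed above could be sketched as follows. This is a hypothetical illustration (the function and key names are not from the PR): skip special-token passing when the tokenizer library is sentencepiece, where injected tokens are known to change tokenization relative to a standard pre-trained model.

```python
# Hypothetical sketch of the guard discussed above: do not pass
# special_tokens through for sentencepiece tokenizers.
def resolve_special_tokens(library, tokenizer_cfg):
    if library == "sentencepiece":
        # For sentencepiece, special tokens should instead be baked into
        # the model file (e.g. via add_special_tokens_to_sentencepiece.py).
        return None
    # dict.get returns None when the key is absent.
    return tokenizer_cfg.get("special_tokens")


cfg = {"special_tokens": {"bos_token": "<s>", "eos_token": "</s>"}}
print(resolve_special_tokens("sentencepiece", cfg))  # None
print(resolve_special_tokens("megatron", cfg))
```

The guard keeps the new behavior for other tokenizer libraries while leaving sentencepiece models untouched.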
…VIDIA#7613)
* adding special_tokens from tokenizer config for transformer-lm model
Signed-off-by: Alexander Jipa <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <[email protected]>
What does this PR do?
special_tokens are currently ignored when constructing a tokenizer for transformer_lm_model using the following example:
NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml (line 51 in 5cb76a5)
Collection: NLP/language-modeling
Changelog
Use special_tokens from the config when creating the tokenizer, keeping backward compatibility with the previous behavior (when special_tokens are not provided, we proceed with None).
Usage
NeMo/examples/nlp/language_modeling/conf/bert_pretraining_from_text_config.yaml
Line 51 in 5cb76a5
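The backward-compatible behavior described in the changelog can be sketched as below. This is a minimal illustration with hypothetical names; the actual tokenizer config is the YAML file referenced above.

```python
# Sketch of the backward-compatible lookup: special_tokens is read from
# the tokenizer config when present, and None is used otherwise.
def get_special_tokens(tokenizer_cfg):
    # dict.get returns None for a missing key, matching the old default.
    return tokenizer_cfg.get("special_tokens")


with_tokens = {"library": "megatron", "special_tokens": {"mask_token": "[MASK]"}}
without_tokens = {"library": "megatron"}

print(get_special_tokens(with_tokens))
print(get_special_tokens(without_tokens))  # None
```

Configs that omit special_tokens therefore behave exactly as before the change.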
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information