
adding special_tokens from tokenizer config for transformer-lm model #7613

Merged

Conversation

@clumsy (Contributor) commented Oct 3, 2023

What does this PR do?

special_tokens are currently ignored when constructing the tokenizer for transformer_lm_model, even though the example config suggests they are supported:

special_tokens: # only necessary for adding transformer/bert-specific special tokens to tokenizer if the tokenizer does not already have these inherently.

Collection: NLP/language-modeling

Changelog

  • transformer_lm_model: use special_tokens from the config when creating the tokenizer. The change is backward compatible: when special_tokens is not provided, the tokenizer is built with None, as before (see the sketch below).
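
To make the change concrete, here is a minimal sketch of the described behavior. It assumes the model builds its tokenizer through NeMo's `get_tokenizer` helper and that the helper accepts a `special_tokens` argument; the helper name and exact call in the merged diff may differ.

```python
# Hedged sketch of the changelog item above, not the merged diff verbatim.
# `tokenizer_cfg` is assumed to be a dict or OmegaConf DictConfig.
from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer


def build_lm_tokenizer(tokenizer_cfg):
    # Backward compatible: if the config has no special_tokens section,
    # pass None, which matches the previous behavior.
    special_tokens = tokenizer_cfg.get("special_tokens", None)
    special_tokens_dict = dict(special_tokens) if special_tokens else None
    return get_tokenizer(
        tokenizer_name=tokenizer_cfg.get("tokenizer_name"),
        tokenizer_model=tokenizer_cfg.get("tokenizer_model", None),
        special_tokens=special_tokens_dict,
    )
```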

Usage
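
A hedged usage sketch, assuming a tokenizer config shaped like the comment quoted above; the library, tokenizer name, and token values are illustrative, not the exact NeMo schema:

```python
# Illustrative only: shows how special_tokens would be supplied in the
# tokenizer config so that the model picks them up after this change.
from omegaconf import OmegaConf

yaml_cfg = """
tokenizer:
  library: huggingface            # hypothetical library choice
  tokenizer_name: bert-base-cased # hypothetical pretrained tokenizer
  special_tokens:                 # read by the model after this change
    bos_token: "[CLS]"
    eos_token: "[SEP]"
    pad_token: "[PAD]"
"""
cfg = OmegaConf.create(yaml_cfg)
print(OmegaConf.to_yaml(cfg.tokenizer))
```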

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Oct 3, 2023
@clumsy clumsy force-pushed the fix/transformer_lm_model_tokenizer_special_tokens branch from aae5ed9 to a2c3be2 on October 3, 2023 at 16:27
@wdykas (Contributor) commented Oct 6, 2023

@ericharper, for visibility when you get the chance: a fix for using special tokens.

@clumsy (Contributor, Author) commented Oct 24, 2023

@ericharper, @wdykas: is there anything I can do from my end to unblock this PR?

@clumsy (Contributor, Author) commented Nov 7, 2023

Please let me know if there's anything I can do to get this PR approved, @wdykas @ericharper. Thanks!

@aklife97 (Collaborator) commented Nov 7, 2023

This PR should be good to go, but we need to ensure we do not pass special tokens this way for sentencepiece (it's fine for other tokenizer types). We know it produces different tokenization compared to a standard pre-trained sentencepiece model, which can cause hidden issues.

The recommended way to add special tokens to a sentencepiece model is to edit the tokenizer directly, using something like: https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/add_special_tokens_to_sentencepiece.py
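
A hedged sketch of the guard being discussed (not part of the merged diff): refuse to retrofit special tokens onto a sentencepiece tokenizer at construction time and point users at the offline script instead. The `library` config field and the function name here are assumptions for illustration.

```python
def resolve_special_tokens(tokenizer_cfg):
    # Illustrative guard, not the merged code: passing special_tokens to a
    # pre-trained sentencepiece model changes its tokenization, so those
    # tokens should instead be added to the .model file offline (see
    # scripts/tokenizers/add_special_tokens_to_sentencepiece.py in NeMo).
    special_tokens = tokenizer_cfg.get("special_tokens", None)
    if special_tokens and tokenizer_cfg.get("library") == "sentencepiece":
        raise ValueError(
            "Do not pass special_tokens for sentencepiece tokenizers; "
            "add them to the sentencepiece model itself instead."
        )
    return dict(special_tokens) if special_tokens else None
```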

@ericharper (Collaborator) commented:

jenkins

@ericharper (Collaborator) commented:

@aklife97 are you recommending that we add a check in this PR?

@aklife97 (Collaborator) commented Nov 8, 2023

I think it should be good to merge; we need to think of a more general solution for this...

@aklife97 aklife97 merged commit d49b73c into NVIDIA:main Nov 8, 2023
11 checks passed
@clumsy clumsy deleted the fix/transformer_lm_model_tokenizer_special_tokens branch November 8, 2023 18:59
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
adding special_tokens from tokenizer config for transformer-lm model (NVIDIA#7613)

* adding special_tokens from tokenizer config for transformer-lm model

Signed-off-by: Alexander Jipa <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024

adding special_tokens from tokenizer config for transformer-lm model (NVIDIA#7613)