
adding special_tokens from tokenizer config for transformer-lm model #7613

Merged

Conversation

@clumsy (Contributor) commented Oct 3, 2023

What does this PR do?

special_tokens are currently ignored when constructing the tokenizer for transformer_lm_model, even though the example config suggests they are supported:

special_tokens: # only necessary for adding transformer/bert-specific special tokens to tokenizer if the tokenizer does not already have these inherently.

Collection: NLP/language-modeling

Changelog

  • transformer_lm_model: use special_tokens from the config when creating the tokenizer. The change is backward compatible: when special_tokens is not provided, the tokenizer is built with None, as before (see the sketch below).
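
To make the change concrete, here is a minimal sketch of the described behavior. It assumes the model builds its tokenizer through NeMo's `get_tokenizer` helper and that the helper accepts a `special_tokens` argument; the helper name and exact call in the merged diff may differ.

```python
# Hedged sketch of the changelog item above, not the merged diff verbatim.
# `tokenizer_cfg` is assumed to be a dict or OmegaConf DictConfig.
from nemo.collections.nlp.modules.common.tokenizer_utils import get_tokenizer


def build_lm_tokenizer(tokenizer_cfg):
    # Backward compatible: if the config has no special_tokens section,
    # pass None, which matches the previous behavior.
    special_tokens = tokenizer_cfg.get("special_tokens", None)
    special_tokens_dict = dict(special_tokens) if special_tokens else None
    return get_tokenizer(
        tokenizer_name=tokenizer_cfg.get("tokenizer_name"),
        tokenizer_model=tokenizer_cfg.get("tokenizer_model", None),
        special_tokens=special_tokens_dict,
    )
```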

Usage
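
A hedged usage sketch, assuming a tokenizer config shaped like the comment quoted above; the library, tokenizer name, and token values are illustrative, not the exact NeMo schema:

```python
# Illustrative only: shows how special_tokens would be supplied in the
# tokenizer config so that the model picks them up after this change.
from omegaconf import OmegaConf

yaml_cfg = """
tokenizer:
  library: huggingface            # hypothetical library choice
  tokenizer_name: bert-base-cased # hypothetical pretrained tokenizer
  special_tokens:                 # read by the model after this change
    bos_token: "[CLS]"
    eos_token: "[SEP]"
    pad_token: "[PAD]"
"""
cfg = OmegaConf.create(yaml_cfg)
print(OmegaConf.to_yaml(cfg.tokenizer))
```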

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g. Numba, Pynini, Apex)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the NLP label Oct 3, 2023
@clumsy clumsy force-pushed the fix/transformer_lm_model_tokenizer_special_tokens branch from aae5ed9 to a2c3be2 on October 3, 2023 at 16:27
@wdykas (Contributor) commented Oct 6, 2023

@ericharper, for visibility when you get the chance: a fix for using special tokens.

@clumsy (Contributor, Author) commented Oct 24, 2023

@ericharper, @wdykas: is there anything I can do from my end to unblock this PR?

@clumsy (Contributor, Author) commented Nov 7, 2023

Please let me know if there's anything I can do to get this PR approved, @wdykas @ericharper. Thanks!

@aklife97 (Collaborator) commented Nov 7, 2023

This PR should be good to go, but we need to ensure we do not pass special tokens this way for sentencepiece (it's fine for other tokenizer types). We know it produces different tokenization compared to a standard pre-trained sentencepiece model, which can cause hidden issues.

The recommended way to add special tokens to a sentencepiece model is to edit the tokenizer directly, using something like: https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/add_special_tokens_to_sentencepiece.py
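
A hedged sketch of the guard being discussed (not part of the merged diff): refuse to retrofit special tokens onto a sentencepiece tokenizer at construction time and point users at the offline script instead. The `library` config field and the function name here are assumptions for illustration.

```python
def resolve_special_tokens(tokenizer_cfg):
    # Illustrative guard, not the merged code: passing special_tokens to a
    # pre-trained sentencepiece model changes its tokenization, so those
    # tokens should instead be added to the .model file offline (see
    # scripts/tokenizers/add_special_tokens_to_sentencepiece.py in NeMo).
    special_tokens = tokenizer_cfg.get("special_tokens", None)
    if special_tokens and tokenizer_cfg.get("library") == "sentencepiece":
        raise ValueError(
            "Do not pass special_tokens for sentencepiece tokenizers; "
            "add them to the sentencepiece model itself instead."
        )
    return dict(special_tokens) if special_tokens else None
```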

@ericharper (Collaborator) commented:

jenkins

@ericharper (Collaborator) commented:

@aklife97 are you recommending that we add a check in this PR?

@aklife97 (Collaborator) commented Nov 8, 2023

I think it should be good to merge; we need to think of a more general solution for this...

@aklife97 aklife97 merged commit d49b73c into NVIDIA:main Nov 8, 2023
11 checks passed
@clumsy clumsy deleted the fix/transformer_lm_model_tokenizer_special_tokens branch November 8, 2023 18:59
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
adding special_tokens from tokenizer config for transformer-lm model (NVIDIA#7613)

* adding special_tokens from tokenizer config for transformer-lm model

Signed-off-by: Alexander Jipa <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Piotr Żelasko <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024

adding special_tokens from tokenizer config for transformer-lm model (NVIDIA#7613)