generalized chat sft prompt #7655

yidong72 · 2023-10-06T14:29:05Z

What does this PR do ?

This PR generalizes the chat SFT dataset so that it can use customized turn start/end tokens via the chat_prompt_tokens config, e.g.:

    chat_prompt_tokens:  # special tokens for the chat prompts, a dictionary of {token_type: token}. Note that some tokenizers may merge the characters at the junction between {end_of_turn}{turn_start}, e.g. in '<im end><im start>' the '><' is sometimes merged into a single token. This is not supported; try to avoid it.
      system_turn_start: '<extra_id_0>'
      turn_start: '<extra_id_1>'
      label_start: '<extra_id_2>'
      end_of_turn: "\x0A"  # \x0A is '\n'
      end_of_name: "\x0A"  # \x0A is '\n'

After this change, the LM is no longer required to have "extra_id" special tokens in order to use the chat SFT dataset. This PR also expands the unit tests to cover more LM tokenizers.

Another added feature is overwriting the prompt_template config with the chat prompt format.
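As a rough illustration of how the chat_prompt_tokens entries delimit turns, here is a minimal sketch. The helper name `render_chat`, the "System" speaker name, and the exact ordering of fields are assumptions made for this example; they are not taken from the NeMo implementation, and label_start is left unused for simplicity.

```python
# Illustrative only: a hypothetical helper, not NeMo's actual code.
DEFAULT_TOKENS = {
    "system_turn_start": "<extra_id_0>",
    "turn_start": "<extra_id_1>",
    "label_start": "<extra_id_2>",  # not used in this simplified sketch
    "end_of_turn": "\n",
    "end_of_name": "\n",
}


def render_chat(system, turns, tokens=DEFAULT_TOKENS):
    """Join a system message and (speaker, text) turns into one prompt string."""
    parts = [
        tokens["system_turn_start"] + "System" + tokens["end_of_name"]
        + system + tokens["end_of_turn"]
    ]
    for speaker, text in turns:
        parts.append(
            tokens["turn_start"] + speaker + tokens["end_of_name"]
            + text + tokens["end_of_turn"]
        )
    return "".join(parts)


# Swapping in custom delimiters only requires a different dictionary.
# The "<|sys|>" and "<|turn|>" strings are made up for this example.
custom_tokens = dict(DEFAULT_TOKENS, system_turn_start="<|sys|>", turn_start="<|turn|>")
default_prompt = render_chat("Be concise.", [("User", "Hi")])
custom_prompt = render_chat("Be concise.", [("User", "Hi")], custom_tokens)
```

With the defaults, every turn boundary is a newline followed by an "extra_id" token; with the custom dictionary, the same conversation is rendered without relying on any "extra_id" token.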

Signed-off-by: Yi Dong <[email protected]>

Zhilin123

LGTM, minor code style issues

nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py

examples/nlp/language_modeling/tuning/conf/megatron_gpt_sft.yaml

tests/collections/nlp/test_chat_sft_dataset.py

examples/nlp/language_modeling/megatron_gpt_eval.py

Zhilin123

LGTM

nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py

aklife97 · 2023-10-09T23:05:19Z

nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_dataset.py

+                "end_of_name": "\n",
+            }
+        else:
+            self.special_tokens = special_tokens



Can we check whether the tokens in special_tokens are actual special tokens of the tokenizer? If they are not (as is the case with llama), can we just emit a warning that the text will be used as turn tokens, which might cause incorrect merging?

I have an assert in the code:

    assert torch.equal(torch.tensor(target[:header_len]), torch.tensor(header_tokens))

which will throw an exception if the token merge happens.

That is different: the token merge can still happen during multi-turn conversations.

What I mean is that if the turn tokens are not special tokens, we just warn that an error is possible.

The header_len stops at "end_of_turn", and the next token is "turn_start". If the merge happens, this assert will catch it. Multi-turn conversations work the same way: each turn ends with "end_of_turn" and the next token is "turn_start". So this one check is enough to catch it.

Also, I don't see the point of just giving a warning, which doesn't help the user at all.
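The prefix check discussed in this thread can be sketched with a toy tokenizer. Everything below is an assumption made for illustration: `greedy_tokenize` is a simple longest-match stand-in for a real subword tokenizer, the two vocabularies are invented, and plain list comparison replaces the `torch.equal` call from the PR.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer; unknown characters become single tokens."""
    max_len = max((len(v) for v in vocab), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def header_is_prefix(header, full_text, vocab):
    """True if tokenizing the header alone matches the start of the full text."""
    header_tokens = greedy_tokenize(header, vocab)
    target = greedy_tokenize(full_text, vocab)
    return target[:len(header_tokens)] == header_tokens


# Turn tokens are ordinary text and the vocabulary contains the merged "><".
MERGING_VOCAB = {"<im", " end", " start", "><"}
# Turn tokens are single vocabulary entries, so no merge can cross the boundary.
SAFE_VOCAB = {"<im end>", "<im start>"}

merge_detected = not header_is_prefix("<im end>", "<im end><im start>", MERGING_VOCAB)
safe_ok = header_is_prefix("<im end>", "<im end><im start>", SAFE_VOCAB)
```

In the merging case the header tokenizes to `["<im", " end", ">"]` on its own, but the full text starts with `["<im", " end", "><"]`, so the prefix comparison fails. This is the situation the assert in the PR is meant to surface.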

scripts/nlp_language_modeling/sft/preprocessing.py

examples/nlp/language_modeling/tuning/megatron_gpt_sft.py

scripts/nlp_language_modeling/sft/preprocessing.py

+        # for key in turn['human_labels']:
+        #     value_set = label_values.get(key, set())
+        #     value_set.add(turn['human_labels'][key]['value'])
+        #     label_values[key] = value_set


nemo/collections/nlp/data/language_modeling/megatron/gpt_sft_chat_dataset.py

scripts/nlp_language_modeling/sft/data_clean.py

aklife97

LGTM, thanks!

* fix dataset issues
* working version
* all passed
* refactor tests
* all pass
* working version
* use end name signal for labels
* all fixed
* update doc
* style fix
* remove unused imports
* make sure nccl not timing out
* style fix
* generate example template
* generic end of name token
* style fix
* add the chat prompt format into the config
* make sure sft working
* address reviewer comment
* fix non
* try openAI prompt
* remove unused imports
* remove human labels from the data
* use hf dataset to clean
* reviewer comments

---------

Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Sasha Meister <[email protected]>

* fix dataset issues
* working version
* all passed
* refactor tests
* all pass
* working version
* use end name signal for labels
* all fixed
* update doc
* style fix
* remove unused imports
* make sure nccl not timing out
* style fix
* generate example template
* generic end of name token
* style fix
* add the chat prompt format into the config
* make sure sft working
* address reviewer comment
* fix non
* try openAI prompt
* remove unused imports
* remove human labels from the data
* use hf dataset to clean
* reviewer comments

---------

Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: Elena Rastorgueva <[email protected]>

yidong72 added 14 commits October 5, 2023 20:19

fix dataset issues

c66de49

Signed-off-by: Yi Dong <[email protected]>

Merge branch 'main' into sft_mcore

3e42b74

working version

1f3d2d3

Signed-off-by: Yi Dong <[email protected]>

all passed

7fdc339

Signed-off-by: Yi Dong <[email protected]>

refactor tests

87a01bb

Signed-off-by: Yi Dong <[email protected]>

all pass

9c0ee5c

Signed-off-by: Yi Dong <[email protected]>

working version

ccaa6a0

Signed-off-by: Yi Dong <[email protected]>

use end name signal for labels

14467c4

Signed-off-by: Yi Dong <[email protected]>

all fixed

4a674e4

Signed-off-by: Yi Dong <[email protected]>

update doc

11bc6cd

Signed-off-by: Yi Dong <[email protected]>

style fix

2e0285a

Signed-off-by: Yi Dong <[email protected]>

remove unused imports

4ba2395

Signed-off-by: Yi Dong <[email protected]>

make sure nccl not timing out

d1b8328

Signed-off-by: Yi Dong <[email protected]>

style fix

5bf546e

Signed-off-by: Yi Dong <[email protected]>

yidong72 requested a review from MaximumEntropy October 6, 2023 14:29

github-actions bot added the NLP label Oct 6, 2023

yidong72 requested review from aklife97 and Zhilin123 October 6, 2023 14:30

yidong72 and others added 5 commits October 6, 2023 10:30

Merge branch 'main' into sft_mcore

f945ec6

generate example template

cd7c77a

Signed-off-by: Yi Dong <[email protected]>

generic end of name token

d734830

Signed-off-by: Yi Dong <[email protected]>

style fix

33b7910

Signed-off-by: Yi Dong <[email protected]>

Merge branch 'sft_mcore' of github.com:NVIDIA/NeMo into sft_mcore

e293336

Zhilin123 previously approved these changes Oct 6, 2023

add the chat prompt format into the config

c99b55f

Signed-off-by: Yi Dong <[email protected]>

yidong72 dismissed Zhilin123’s stale review via c99b55f October 6, 2023 19:59

github-advanced-security bot found potential problems Oct 6, 2023

tests/collections/nlp/test_chat_sft_dataset.py (fixed)

yidong72 and others added 3 commits October 6, 2023 21:10

make sure sft working

b64f0bd

Signed-off-by: Yi Dong <[email protected]>

address reviewer comment

86bb7b0

Signed-off-by: Yi Dong <[email protected]>

Merge branch 'main' into sft_mcore

019afa4

yidong72 commented Oct 6, 2023

examples/nlp/language_modeling/megatron_gpt_eval.py (resolved)

Zhilin123 previously approved these changes Oct 6, 2023

github-advanced-security bot found potential problems Oct 6, 2023

nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py (fixed)

fix non

3ddd9cd

Signed-off-by: Yi Dong <[email protected]>

yidong72 dismissed Zhilin123’s stale review via 3ddd9cd October 7, 2023 01:13

yidong72 and others added 3 commits October 7, 2023 12:18

try openAI prompt

a1789e4

Signed-off-by: Yi Dong <[email protected]>

Merge branch 'main' into sft_mcore

4db2188

remove unused imports

d36d3a9

Signed-off-by: Yi Dong <[email protected]>

aklife97 reviewed Oct 9, 2023

remove human labels from the data

162be79

Signed-off-by: Yi Dong <[email protected]>

aklife97 reviewed Oct 9, 2023

scripts/nlp_language_modeling/sft/preprocessing.py (resolved)

use hf dataset to clean

700d9f2

Signed-off-by: Yi Dong <[email protected]>

aklife97 reviewed Oct 9, 2023

examples/nlp/language_modeling/tuning/megatron_gpt_sft.py (resolved)

github-advanced-security bot found potential problems Oct 10, 2023