Filter p-tuning by example length #6182
Conversation
… attribute Signed-off-by: arendu <[email protected]>
Left some questions
LGTM - there remain a few more things to handle, but @arendu / @Zhilin123 will work on them in a separate PR.
Can you please modify build_virtual_prompt_dataset to print a message that sequences longer than the set max_seq_length will be dropped? Ideally, we'd also like to know how many were dropped. I guess for that you'll need to modify GPTPromptLearningDataset?
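For illustration, here is a minimal sketch of what such filtering with a dropped-count warning could look like. This is not the actual NeMo implementation: the `text_to_ids` tokenizer call, the `"input"` field, and the function name are assumptions for the example.

```python
import logging

def filter_by_length(examples, tokenizer, min_seq_length, max_seq_length):
    """Keep only examples whose tokenized length falls within the allowed range."""
    kept, dropped = [], 0
    for example in examples:
        # `text_to_ids` and the "input" field are assumptions for illustration.
        length = len(tokenizer.text_to_ids(example["input"]))
        if min_seq_length <= length <= max_seq_length:
            kept.append(example)
        else:
            dropped += 1
    if dropped > 0:
        # Report how many examples were filtered out, as requested above.
        logging.warning(
            f"Dropped {dropped} examples with lengths outside "
            f"[{min_seq_length}, {max_seq_length}] tokens."
        )
    return kept
```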
Saw that the requested warning is already there. LGTM, thanks!
* patch to allow using tokenizers without additional_special_tokens_ids attribute
* filter long and/or short training and validation examples

Signed-off-by: arendu <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?
Previously, only training and validation examples longer than the maximum encoder length were filtered out. This PR makes it possible to filter examples by a specified minimum and/or maximum length. This saves training time: after removing a small percentage of training and validation examples, larger batch sizes become possible.
Collection: [NLP]
Changelog
- Patch to allow using tokenizers without the additional_special_tokens_ids attribute.
- Filter out training and validation examples that are longer and/or shorter than the specified lengths.
Usage
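A hedged usage sketch: `build_virtual_prompt_dataset` and `max_seq_length` appear in the review thread above, but `min_seq_length`, the remaining argument names, and the return value are assumptions rather than the verified signature.

```python
# Sketch only: argument names other than max_seq_length are assumptions.
dataset = model.build_virtual_prompt_dataset(
    data=cfg.model.data.train_ds,
    batch_size=cfg.model.global_batch_size,
    max_seq_length=1024,  # examples longer than this are dropped, with a warning
    min_seq_length=8,     # assumed new parameter: examples shorter than this are dropped
    for_train=True,
)
```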
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs for various areas.
Additional Information