Filter p-tuning by example length #6182
Conversation
… attribute Signed-off-by: arendu <[email protected]>
Left some questions
LGTM - there remain a few more things to handle, but @arendu / @Zhilin123 will work on them in a separate PR.
Can you please modify build_virtual_prompt_dataset to print a message that sequences longer than the set max_seq_length will be dropped? Ideally, we'd also like to know how many were dropped. I guess for that you'll need to modify GPTPromptLearningDataset?
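For illustration, here is a minimal sketch of what such filtering with a dropped-count warning could look like. This is not the actual NeMo implementation: the `text_to_ids` tokenizer call, the `"input"` field, and the function name are assumptions for the example.

```python
import logging

def filter_by_length(examples, tokenizer, min_seq_length, max_seq_length):
    """Keep only examples whose tokenized length falls within the allowed range."""
    kept, dropped = [], 0
    for example in examples:
        # `text_to_ids` and the "input" field are assumptions for illustration.
        length = len(tokenizer.text_to_ids(example["input"]))
        if min_seq_length <= length <= max_seq_length:
            kept.append(example)
        else:
            dropped += 1
    if dropped > 0:
        # Report how many examples were filtered out, as requested above.
        logging.warning(
            f"Dropped {dropped} examples with lengths outside "
            f"[{min_seq_length}, {max_seq_length}] tokens."
        )
    return kept
```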
Saw that the requested warning is already there. LGTM, thanks!
* patch to allow using tokenizers without additional_special_tokens_ids attribute
* filter long and/or short training and validation examples

Signed-off-by: arendu <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?
Previously, only training and validation examples longer than the maximum encoder length were filtered out. This PR makes it possible to filter examples by a specified minimum and/or maximum length. This saves training time: after removing a small percentage of training and validation examples, larger batch sizes become possible.
Collection: [NLP]
Changelog
- Patch to allow using tokenizers without the additional_special_tokens_ids attribute.
- Filter out training and validation examples that are longer and/or shorter than the specified lengths.
Usage
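A hedged usage sketch: `build_virtual_prompt_dataset` and `max_seq_length` appear in the review thread above, but `min_seq_length`, the remaining argument names, and the return value are assumptions rather than the verified signature.

```python
# Sketch only: argument names other than max_seq_length are assumptions.
dataset = model.build_virtual_prompt_dataset(
    data=cfg.model.data.train_ds,
    batch_size=cfg.model.global_batch_size,
    max_seq_length=1024,  # examples longer than this are dropped, with a warning
    min_seq_length=8,     # assumed new parameter: examples shorter than this are dropped
    for_train=True,
)
```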
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs for various areas.
Additional Information