Packed Sequence #7945
Modify GPTSFTPackedDataset to respond to pad_to_max_length setting Signed-off-by: Chen Cui <[email protected]>
LGTM. Thanks!
Sorry, please ignore my previous approval.
Could you please add unit tests?
See tests/collections/nlp/test_chat_sft_dataset.py as an example.
LGTM. Thanks for adding the unit test.
* support packed dataset
* support packed dataset
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* fix collate_fn bug for TP > 1
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* make packed dataset work
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* fix nan bug
* support answer only loss
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* account for padding in cu_seqlens during dataloading for attn kernel
* fix path for answer_only_loss = false
* @vysarge Modify GPTSFTPackedDataset to respond to pad_to_max_length setting
* add unit test
* fix loss mask bug for answer_only_loss=False
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
---------
Signed-off-by: Chen Cui
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper
Signed-off-by: Piotr Żelasko
What does this PR do?
Support training with packed sequences for SFT and PEFT.
In this scenario, the sequences in a batch are concatenated along the sequence-length dimension instead of being stacked along the batch dimension.
This has been shown to significantly improve training throughput, since far less compute is spent on padding tokens.
Packed sequences need to be prepared in a particular way (see the script in #8682).
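To make the difference between stacking and concatenating concrete, here is a minimal, framework-free sketch. It is not the NeMo implementation: the helper names and `PAD_ID` are hypothetical and chosen only for illustration, and `cu_seqlens` is assumed here to mean the list of cumulative sequence boundaries in the packed buffer.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token id, used only in this sketch


def stack_with_padding(seqs: List[List[int]], max_len: int) -> List[List[int]]:
    """Conventional batching: one row per sequence, each padded to max_len."""
    return [s + [PAD_ID] * (max_len - len(s)) for s in seqs]


def pack(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Packed batching: concatenate along the sequence dimension and
    record the cumulative boundaries so sequences can be told apart."""
    tokens: List[int] = []
    cu_seqlens = [0]
    for s in seqs:
        tokens.extend(s)
        cu_seqlens.append(len(tokens))
    return tokens, cu_seqlens


seqs = [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35]]

stacked = stack_with_padding(seqs, max_len=5)
packed, cu_seqlens = pack(seqs)

# Slots the model has to process in each layout.
stacked_slots = sum(len(row) for row in stacked)
padded_slots = sum(row.count(PAD_ID) for row in stacked)
packed_slots = len(packed)

print(stacked_slots, padded_slots, packed_slots, cu_seqlens)
```

In this toy batch the stacked layout spends a third of its slots on padding, while the packed layout has none; the boundaries in `cu_seqlens` are what allow the individual sequences to be recovered afterwards.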
Collection: NLP
Changelog
- Pass cu_seqlens and qkv_format to the forward() function of the mcore model
Usage
Note: only train_ds is packed. Validation and test datasets are unchanged.
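As background for the cu_seqlens argument mentioned in the changelog, the sketch below shows one way cumulative boundaries can be turned into an attention pattern in which each token attends only to earlier tokens of its own sequence. This is an illustration of the general idea under stated assumptions, not the mcore attention kernel: `block_causal_mask` is a hypothetical helper, and real kernels typically consume the boundaries directly rather than materializing a dense mask.

```python
from typing import List


def block_causal_mask(cu_seqlens: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens may attend only within their own packed sequence (block
    diagonal) and only to positions at or before their own (causal).
    """
    total = cu_seqlens[-1]
    seq_id = [0] * total
    # Label every token with the index of the sequence it belongs to.
    for idx, (start, end) in enumerate(zip(cu_seqlens[:-1], cu_seqlens[1:])):
        for t in range(start, end):
            seq_id[t] = idx
    return [
        [seq_id[i] == seq_id[j] and j <= i for j in range(total)]
        for i in range(total)
    ]


# Two packed sequences of lengths 3 and 2.
cu_seqlens = [0, 3, 5]
mask = block_causal_mask(cu_seqlens)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first token of the second sequence (index 3) cannot see any token of the first sequence, which is the property that keeps packed examples from leaking into each other.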