Add Customization Dataset Preparation Tool #6029

Zhilin123 · 2023-02-15T01:20:07Z

Allows users to convert data into the prompt-and-completion .jsonl format expected by the Customization service / NeMo LLM p-tuning service

Signed-off-by: Zhilin Wang [email protected]

What does this PR do ?

Allows users to convert data into the prompt-and-completion .jsonl format expected by the Customization service / NeMo LLM p-tuning service
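For readers unfamiliar with the target format, the following is a minimal sketch of what writing prompt-and-completion JSON Lines might look like. It is not the code added in this PR: the function name `to_prompt_completion`, the templates, the sample records, and the exact key names `prompt` and `completion` are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, List


def to_prompt_completion(
    records: Iterable[Dict[str, str]],
    prompt_template: str,
    completion_template: str,
) -> List[str]:
    """Render each record as one JSON line with 'prompt' and 'completion' keys.

    Hypothetical helper for illustration; the key names are assumed, not
    taken from the tool in this PR.
    """
    lines = []
    for record in records:
        row = {
            "prompt": prompt_template.format(**record),
            "completion": completion_template.format(**record),
        }
        # One JSON object per line is what makes the output JSON Lines (.jsonl).
        lines.append(json.dumps(row, ensure_ascii=False))
    return lines


# Hypothetical input records; the field names are illustrative only.
records = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

jsonl_lines = to_prompt_completion(
    records,
    prompt_template="Question: {question} Answer:",
    completion_template=" {answer}",
)
print("\n".join(jsonl_lines))
```

Writing `jsonl_lines` to a file, one entry per line, would give a .jsonl file in this assumed shape; the actual tool's options and validation are described in the header of customization_dataset_preparation.py.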

Collection: NLP

Changelog

Add specific line by line info of high level changes in this PR.

Usage

See tutorial.ipynb

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)


okuchaiev

Please add licence headers as well as docstrings to functions

tools/customization_dataset_preparation/customization_dataset_preparation.py

okuchaiev · 2023-02-16T01:10:34Z

tools/customization_dataset_preparation/NeMo_LLM_Dataset_Preparation_Tutorial.ipynb

@@ -0,0 +1,151 @@
+{


Should this be added as a subsection in docs/source/nlp/nemo_megatron/prompt_learning.rst? Also, there is no dataset_validation.py file.

The usage documentation currently lives in the run instructions in the header of customization_dataset_preparation.py. This is much easier for people to find and read when they use the .py file on its own (without needing to download NeMo or go to the NeMo docs).

Signed-off-by: Zhilin Wang [email protected]

okuchaiev

lgtm

* Add Customization Dataset Preparation Tool
  Allows users to read data into prompt-and-completion format .jsonl as expected by the Customization service/NeMo LLM P tuning service
  Signed-off-by: Zhilin Wang [email protected]
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Add license and usage examples, remove tutorial
  Signed-off-by: Zhilin Wang [email protected]
* Fix typo
  Signed-off-by: Zhilin Wang [email protected]
* Fix some more typos

---------

Signed-off-by: Zhilin Wang [email protected]
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>

Add Customization Dataset Preparation Tool

06e7e07

Allows users to read data into prompt-and-completion format .jsonl as expected by the Customization service/NeMo LLM P tuning service Signed-off-by: Zhilin Wang [email protected]

Zhilin123 self-assigned this Feb 15, 2023

Zhilin123 requested a review from okuchaiev February 15, 2023 01:20

pre-commit-ci bot and others added 2 commits February 15, 2023 01:21

[pre-commit.ci] auto fixes from pre-commit.com hooks

3d4acf3

for more information, see https://pre-commit.ci

Merge branch 'main' into dataset_preparation_tool

d01611f

okuchaiev requested changes Feb 16, 2023

okuchaiev reviewed Feb 16, 2023

Zhilin123 added 3 commits February 15, 2023 17:19

Add license and usage examples, remove tutorial

660ccd9

Signed-off-by: Zhilin Wang [email protected]

Fix typo

95af2f6

Signed-off-by: Zhilin Wang [email protected]

Fix some more typos

68fc3c1

Zhilin123 requested a review from okuchaiev February 17, 2023 18:44

Merge branch 'main' into dataset_preparation_tool

b036e5e

okuchaiev approved these changes Feb 17, 2023

Zhilin123 merged commit 050971b into main Feb 17, 2023

nithinraok deleted the dataset_preparation_tool branch February 17, 2023 22:44
