use HuggingFace Datasets as source to build Megatron data files #48
Merged

30 commits
f68999f  indexed_dataset: use numpy to compute byte offsets faster (adammoody)
d5d20bb  preprocess with huggingface datasets and mpi (adammoody)
32fc48f  preprocess_dataset_mpi: add --shuffle and --seed options (adammoody)
a456e48  indexed_dataset: fix to handle file with 0 items (adammoody)
17ac2f9  preprocess_dataset_mpi: add --split and --count options (adammoody)
7836a32  update script comments to reflect shuffle behavior (adammoody)
7b27853  add torch.distributed version (adammoody)
92d78c4  Update tools/preprocess_dataset_mpi.py (adammoody)
38b2d8a  Update tools/preprocess_dataset_mpi.py (adammoody)
6264d7a  Update tools/preprocess_dataset_mpi.py (adammoody)
782151f  Update tools/preprocess_dataset_mpi.py (adammoody)
88f5d0b  Update tools/preprocess_dataset_mpi.py (adammoody)
31fab0e  Update tools/preprocess_dataset_mpi.py (adammoody)
520d06c  Update tools/preprocess_dataset_mpi.py (adammoody)
6e0e4fd  add estimated progress logging (adammoody)
03bf199  avoid downloading dataset unless user really wants to (adammoody)
0b2d4cd  Update tools/preprocess_dataset_mpi.py (adammoody)
cbf965e  Update tools/preprocess_dataset_mpi.py (adammoody)
7c8c1c9  refactor main into more functions (adammoody)
600c091  reformat progress messages (adammoody)
6bb27f7  move mpi4py import test to get_args (adammoody)
7ee7bf5  drop Open MPI variables from init_process_group (adammoody)
f0f45b9  add --local_rank to support torch.distributed.launch (adammoody)
3be3423  update from DeepSpeedExamples (adammoody)
8ae0cf8  raise exceptions on errors (adammoody)
a8e9b2e  drop --download option (adammoody)
fa9e323  format byte rate as MB/s (adammoody)
3db9cdb  Update tools/preprocess_dataset_mpi.py (adammoody)
764e760  move datasets import back to top (adammoody)
80ee230  import config from datasets (adammoody)
adammoody File filter
Filter by extension
Conversations
As I understand the changes, this is equivalent to the previous Python list + loop based approach. Would this make things faster because you don't need to create a list and then convert to numpy? It looks right. Did you test to see whether the output of the two methods is the same?
@ontocord, thanks for your review, and thanks for the pointers to this other project. I'll take a look when I get a chance.
Yes, it should be equivalent. I checked by verifying that the resulting files are identical with `cmp`. Though, I may put that on my list to double-check again, just to be sure. I found that creating this list of pointers was taking a long time when writing out the index file. The list for the full openwebtext dataset turned out to be ~300 million items. The cost seemed to be concentrated in the loop that computes the byte offsets. Converting that loop into numpy calls cut the compute time significantly.
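A minimal sketch of the kind of change being discussed (the names here are hypothetical; the actual `indexed_dataset` code differs): a Python loop that accumulates byte offsets one document at a time can be replaced with a numpy cumulative sum, which avoids building a ~300M-element Python list element by element.

```python
import numpy as np

def pointers_loop(sizes, itemsize):
    """Loop-based approach: append each byte offset one at a time."""
    pointers = []
    offset = 0
    for size in sizes:
        pointers.append(offset)
        offset += size * itemsize
    return pointers

def pointers_numpy(sizes, itemsize):
    """Vectorized approach: same offsets via a numpy cumulative sum."""
    byte_sizes = np.asarray(sizes, dtype=np.int64) * itemsize
    pointers = np.zeros(len(byte_sizes), dtype=np.int64)
    # Offset of document i is the sum of the byte sizes of documents 0..i-1.
    np.cumsum(byte_sizes[:-1], out=pointers[1:])
    return pointers

# The two methods should produce identical offsets.
sizes = [3, 1, 4, 1, 5]          # hypothetical document lengths (in items)
itemsize = 4                     # e.g. 4 bytes per int32 token
assert pointers_loop(sizes, itemsize) == pointers_numpy(sizes, itemsize).tolist()
```

For a list of hundreds of millions of documents, the vectorized version moves the per-element work out of the Python interpreter and into a single numpy call.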
Awesome optimization! Thank you.