use HuggingFace Datasets as source to build Megatron data files #48
Conversation
Thank you, @adammoody @thomasw21 and @ontocord - since you have been working on this area of Megatron recently, would you be inspired to work with Adam to find the best outcome?
Specifically to MPI, won't the same be accomplished if the …? Just trying to avoid a dependency on MPI, as it often camouflages an invalid distributed setup, and then the code fails elsewhere where MPI is not installed. So I actively avoid installing it so that I can detect errors early. For example, if you try to run deepspeed in a notebook w/o proper dist-like ENV emulation, the program fails, but with MPI it often still works. I think there are a few other circumstances where it makes things work, obfuscating problems.
This is just an idea - use a synthetic input made from the same single line replicated multiple times, and then compare the resulting index output with the original script and yours - they should be the same. But perhaps even normal data will work, since from what I remember the Megatron preprocessing script doesn't shuffle data.
So I might get something completely wrong here but:
Sorry, I might be mixing a lot of concepts together; I'm not super clear on how deepspeed works right now.
I am such a github noob that I didn't even see this thread until now :) Yeah, I'm happy to discuss. I actually don't know MPI. HF Datasets does not easily support multi-node cluster processing. I'm trying to get it to work with Dask in my project for data-tooling. I'm not sure if this answers your question. As for multiprocessing, I know that torch has its own fork functions and doesn't use python's native multiprocessing, if that is any help. Ideally we would want datasets, or any enhancement to it (my version is called datastore), to be able to use different types of communication channel to distribute loads across nodes: torch.distributed, MPI, whatever. You need to be able to serialize the parameters of the map and tell the node which shard you are working on, and then you need to be able to get back the results, or the .bin/.idx file in the indexed_dataset case, somehow. Ideally the channel to get back data would be fast.
I think …
I was able to use … The script is currently hardcoded to set the rank and world_size from Open MPI environment variables. Those are set for each process separately by the job launcher I happen to be using. We'll need to make that more general. If you want to test, you'll likely need to change these lines. It expects … And I've left MPI in there. It defaults to use …
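For reference, reading the rank and size from the launcher environment might look something like the sketch below (this is not the exact code in the script; the Open MPI variable names and the torch.distributed.launch fallback are assumptions, and other launchers export different names, e.g. Slurm sets SLURM_PROCID):

```python
import os

# Sketch: derive rank/world_size from Open MPI's environment variables,
# falling back to the RANK/WORLD_SIZE convention used by torch.distributed.launch.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", os.environ.get("RANK", "0")))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", os.environ.get("WORLD_SIZE", "1")))
```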
huu4ontocord
left a comment
This is very cool and thank you for writing this!
As an aside, I found that pytorch-biggraph also uses both torch.distributed/gloo and MPI to communicate, if you are interested:
dtype_size = dtype().itemsize
address = 0
pointers = []
def _get_pointers(sizes, npdtype):
As I understand the changes, this is equivalent to the previous python list + loop based approach. Would this make things faster because you don't need to create a list and then convert to numpy? It looks right. Did you test to see if the output of the two methods is the same?
@ontocord , thanks for your review, and thanks for the pointers to this other project. I'll take a look when I get a chance.
Yes, it should be equivalent. I checked by verifying that the resulting files are identical with cmp. Though, I may put that on my list to double check again, just to be sure.
I found that creating this list of pointers was taking a long time when writing out the index file. This list for the full openwebtext dataset turned out to be ~300 million items. The cost seemed to be concentrated in the loop that computes the byte offsets. Converting that loop into numpy calls cut the compute time significantly.
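To illustrate the kind of conversion being discussed, here is a minimal sketch (illustrative function names, not the actual code in indexed_dataset.py): the per-sample byte offsets are an exclusive prefix sum of the sample sizes scaled by the element size, which numpy can compute without a Python-level loop.

```python
import numpy as np

def get_pointers_loop(sizes, dtype):
    # original approach: python list + loop
    dtype_size = np.dtype(dtype).itemsize
    pointers, address = [], 0
    for size in sizes:
        pointers.append(address)
        address += size * dtype_size
    return pointers

def get_pointers_numpy(sizes, dtype):
    # vectorized: each offset is the running sum of the preceding sizes,
    # scaled by the element size (an exclusive prefix sum)
    dtype_size = np.dtype(dtype).itemsize
    sizes = np.asarray(sizes, dtype=np.int64)
    pointers = np.zeros_like(sizes)
    np.cumsum(sizes[:-1] * dtype_size, out=pointers[1:])
    return pointers
```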
Awesome optimization! Thank you.
tools/preprocess_dataset_mpi.py
Outdated
# allocate a tensor of appropriate size
# initialize tensor with list values on root
rank, size = get_rank_size(args)
if rank == root:
Are we assuming it will always be int64? If we have large lists but the values are in the int8, int32, etc. range, would it be more efficient to send based on a smaller size, perhaps as a parameter to bcast? I guess in the case of MPI, it's fixed at int64, and dtype won't do anything?
Yes, that's a good point. It ultimately depends on the values within the list. In this code, I'm only calling this broadcast function to send the sample index values of the dataset that is being processed, so those values will be in the range [0, num_samples). We could likely size things down given the value of num_samples.
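As a rough illustration of the dtype idea, here is a sketch (not the PR code; it assumes the process group is already initialized and that every rank already knows num_samples, so they can agree on the narrower dtype without extra messages):

```python
import torch
import torch.distributed as dist

def bcast_sample_index(values, num_samples, root=0):
    # pick int32 when the indices are guaranteed to fit, halving the traffic
    dtype = torch.int32 if num_samples <= torch.iinfo(torch.int32).max else torch.int64
    tensor = torch.empty(num_samples, dtype=dtype)
    if dist.get_rank() == root:
        # only the root has the actual list of sample indices
        tensor.copy_(torch.tensor(values, dtype=dtype))
    dist.broadcast(tensor, src=root)
    return tensor
```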
If the communication doesn't happen often and is small, it's probably not worth optimizing. If it's often and large, we could consider this, I think. Thanks for the explanation!
tools/preprocess_dataset_mpi.py
Outdated
j = idx[i]
for key in columns:
    # tokenize text for the given sample index
    text = dset[j][key]
If you want to increase the speed on each node, you could use dset.map(... batched=True, batch_size=..., num_proc=...) and have each process write to a .bin/.idx file for merging later.
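A sketch of what that suggestion might look like (illustrative only; the dataset name, column name, tokenizer choice, and batch/worker counts are placeholders):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

dset = load_dataset("openwebtext", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def tokenize_batch(batch):
    # encode a whole batch of rows at once; "text" is the assumed column name
    return {"input_ids": [tokenizer.encode(t) for t in batch["text"]]}

dset = dset.map(
    tokenize_batch,
    batched=True,
    batch_size=1000,
    num_proc=8,  # e.g. one worker per core on the node
    remove_columns=dset.column_names,
)
```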
@ontocord Is it really faster if num_proc = 1?
For multi-process within a node, I have been launching with one process per CPU core. For example, with a launch command like srun -n320 -N8, the job runs on 8 nodes with 40 procs per node. Each of those procs then writes its own .bin/.idx file for merging.
As you say, with some more work, one could launch say one process per node and then use python multiprocessing to run multiple procs on the node. In fact, that strategy could perhaps help with improved load balancing vs. the static work allocation that I'm using.
However, I settled on the simpler model in this first pass. For now, it's up to the user to launch multiple procs per node.
Cool. Makes sense to me. FYI, we ran some benchmarks (the code of which I can share with you) for various ways to do multiprocessing to make it faster on the machine we were working on. It turns out a queue method was fastest, and multi-reader, multi-writer was next (which is the way you are doing it here). But your mileage may vary depending on your hardware/OS. The queuing code was pushed through this PR if you are interested: #18
@thomasw21 I know we've discussed this before, and my opinion is that you can still run more processes than the number of cores in some cases, if you have to wait for disk for some processes. But if that is not an issue, num_proc=1 doesn't make sense :)
Ah right, I forgot about that one. Personally I'd suggest not mixing up multiple ways of having multiple processes; I think they are usually a nightmare and the gain in performance is just not worth it for a script we're going to run just a few times. But if you feel it's really worth a shot then maybe please go ahead!
tools/preprocess_dataset_mpi.py
Outdated
vocab_size=tokenizer.vocab_size)

# merge all ranks into one file
for rank in range(proc_size):
I don't know how fast shared disk access is, but if it is a bottleneck, you could do the merging as each .bin/.idx file becomes available from a node/process of the node. This will require some communication plumbing, unless you just want to scan the disk for files whose size hasn't changed in a while, or both.
In an ideal world, there is not much time for overlap here. All procs get about the same amount of work, and they ideally all finish writing their .bin/.idx files at about the same time.
I think where a bottleneck can show up is that only a single process (rank 0) works to merge the final file. For users who have parallel file systems, like Lustre/GPFS, that final merge step could be done in parallel by having each process write its portion of the data directly into the merge file. That could be worthwhile, though it will require some significant I/O development work in the indexed_dataset class, e.g., to compute the offsets where each process needs to write. That implementation would also require a POSIX-compliant file system, so NFS users would often have to fallback to something where just a single process writes the file.
I think even in my current implementation, NFS users may hit some snags since the .bin/.idx file written by a process on one node may not be immediately visible to rank 0. If that problem surfaces, we could probably work around it by adding a sleep to wait out the NFS cache timeout. For now, I've just got a note in there that the backing file system needs to support shared file access when running on multiple nodes.
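For the .bin side, the parallel-merge idea amounts to an exclusive prefix sum over the per-rank file sizes. A rough sketch under stated assumptions (a process group is initialized, rank 0 has already created the output file, and the file system handles concurrent writes at distinct offsets; adjusting the .idx document/pointer offsets is the harder part and is omitted here):

```python
import os
import torch.distributed as dist

def parallel_merge_bin(rank_bin, merged_bin):
    # every rank learns the sizes of all per-rank .bin files
    my_size = os.path.getsize(rank_bin)
    sizes = [None] * dist.get_world_size()
    dist.all_gather_object(sizes, my_size)

    # write this rank's data at the byte offset given by the sum of lower ranks' sizes
    offset = sum(sizes[:dist.get_rank()])
    with open(rank_bin, "rb") as src, open(merged_bin, "r+b") as dst:
        dst.seek(offset)
        dst.write(src.read())  # a real implementation would copy in chunks
```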
Gotcha. If it's not an issue because load is balanced, then no big deal. I don't really like the way the builder is written because it is not restartable from an existing .bin/.idx file, so I gave up on an idea of doing fancy merging. I find merging is pretty fast actually. It's just "cat" at the end of the day.
For the issue of files in flight, I wrote some code to see if a file has finished loading before I work on it, for what it's worth. It probably has to be modified for files that are not yet visible, because this assumes all files are there and are just being written to. Might also want to pass in a minimum wait increment. You mentioned your system has a 4 min (!) increment between updates.
You have to call it in an iteration (list, or next):
import os
import time

# This is used for seeing if files from a remote drive have finished loading.
# Files could be in flight while we try to retrieve them.
def wait_until_files_loaded(flist, max_tries=120, fs=None):  # wait 2 hrs max
    if fs is None:
        fs = os
    if isinstance(flist, str):
        flist = [[flist, 0]]
    else:
        flist = [[f, 0] for f in flist]
    for j in range(len(flist) * max_tries):
        num_done = 0
        for i, val in enumerate(flist):
            if val is None:
                num_done += 1
                continue
            (f, incr) = val
            if incr > max_tries:
                raise RuntimeError("Timed out while trying to wait for file " + str(f))
            size1 = fs.stat(f).st_size
            time.sleep(min(600, 1 + incr))
            incr += 1
            if fs.stat(f).st_size == size1:
                flist[i] = None
                num_done += 1
                yield f
            else:
                flist[i] = [f, incr]
        if num_done == len(flist):
            return
    return
You could do something like
for bin_file in wait_until_files_loaded(all_bin_files):
    idx_file = next(wait_until_files_loaded(bin_file.replace(".bin", ".idx")))
    ...
Oh, yeah. That could be handy. Thanks. We could gather the actual file size to rank 0 after each process finishes, and then we could have rank 0 wait with code like this until both the file exists and reaches the expected size. Let's keep this in mind for future work.
Sounds like a good plan.
Nice work! So I'd like to start off by saying I don't know much about MPI and such. I have a few questions:
- Is it possible to not send the entire list of indices to everyone? To my understanding, you're sending it all to every rank, and then they use their start and end variables to compute which subset they need to take care of. torch.distributed has a scatter/gather mechanism that seems to be helpful; is there something similar in mpi4py?
- I think the use of the index is not needed, as you can do all the things you're doing with indices using datasets. Then the only thing you need to communicate is the start and end index, no?
- In case of failure we probably want the exception message raised or logged somewhere.
- You're not able to monitor the current progress live, right? I think it's worth trying to keep that, especially since this can take quite some time.
| print("Bytes=", dset_stats[2], "bytes/sec=", byterate) | ||
|
|
||
| # allreduce to check whether all ranks wrote their part successfully | ||
| success = all_true(args, success) |
I don't think that's very useful, no? Unless it's used for cleanup only.
We basically skip the merge if any process fails. We'll also want to print an error message in that case.
All of this can probably go somewhere like def process_shard(rank) l.428
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
return (tensor.tolist()[0] == size)

def select_sample_list(args, dset_size):
I don't think this is needed, you can do shuffling/truncation using datasets instead. This would remove the necessity to communicate the idx between ranks I believe.
Yeah, I think that's true. The main reason I do this work on rank 0 and bcast is to handle randomness. We need all ranks to generate a consistent random sequence. I think we could have each process build the sequence locally if we ensure they all use a consistent random seed.
For really large datasets, there might also be some savings in memory by having one rank build and shuffle the sequence if we replace the bcast with a scatter.
I'll review the datasets API in more detail.
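A sketch of the locally-built alternative (not the PR code): if every rank seeds an identical generator, they all derive the same shuffled ordering with no broadcast at all.

```python
import numpy as np

def build_sample_index(dset_size, seed, count=None):
    # every rank calls this with the same seed, so the permutation is identical everywhere
    rng = np.random.default_rng(seed)
    idx = rng.permutation(dset_size)
    if count is not None:
        idx = idx[:count]  # optional truncation, like --count
    return idx
```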
@thomasw21 , the Datasets API looks pretty rich. However, it'll take some time to comprehend how to use it all. Mind if we push that change to a future PR?
More immediately, do you know if there is a good way to query whether a dataset is cached without having to load it? It seems like there must be a way.
I'm thinking it may be a good idea to suggest that people download/extract the dataset as a separate step, especially for really large datasets. That can be done with a single process rather than burning a lot of cluster CPU time waiting for one process to download/extract. We could detect that condition and print a message.
So if you have a single rank handling shuffling/truncation, you might not even need to share indices, as all you need is start and end, and since those two values can be computed on each rank, you don't need to communicate anything. I admit I don't think this is really a bottleneck, but I feel this would simplify the code a bit by removing the indexing mechanism.
I don't mind, please create an issue if you don't want to tackle it now, so we can track it and somebody else might take a look.
Concerning the way things are cached:
- Setting HF_DATASETS_OFFLINE=1 as an env variable allows for offline mode, i.e. if it tries to download, it'll raise an error.
- I don't think there's a simple way to either enforce getting from cache or check if it exists in cache. Maybe @stas00 knows?
I agree that having a separate process might be the best idea, especially since you lock a lot of resources just to wait it out. (This might also apply to the merging part, as only rank=0 is merging everything, and that can take quite some time depending on the size of your dataset. TBD, I guess.)
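A minimal sketch of the offline-mode check being discussed (mirroring the imports used later in this PR; the exact exception type and import path may differ across datasets versions, and the dataset name is a placeholder):

```python
import os

# must be set before `datasets` is imported for the env-variable route to take effect
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset
from datasets.utils.file_utils import OfflineModeIsEnabled

try:
    # succeeds only if the dataset is already in the local cache
    dset = load_dataset("openwebtext", split="train")
except OfflineModeIsEnabled:
    print("Dataset is not cached; download/extract it in a separate single-process step.")
```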
Your assert is great, Adam.
Why do you think it may break in the future? HF_DATASETS_OFFLINE is now part of the datasets public API.
Unless you refer to something else when you say "it's not a documented interface".
HF_DATASETS_OFFLINE as an env variable is documented, not the one from config.
Thank you for clarifying, @thomasw21! Perhaps that one should be documented too?
Maybe they are not meant to be tweaked? The advantage of env variables usually is that once they are set in the script, they are set for the whole duration of the script. I'm guessing playing around with the config in the script can have undesired consequences?
I filed a request to both document and define precedence: huggingface/datasets#2776
thomasw21
left a comment
Some more minor comments, but no real blocking issues, as this is a separate script that mostly has no impact on the rest of the code. I'll approve, as the only critical part for me was the changes in indexed_dataset.py that impact other scripts, and those look much better.
Thanks! Awesome work!
tools/preprocess_dataset_mpi.py
Outdated
| print(f"ERROR: At least one process failed to write its file, skipping merge and cleaning up") | ||
|
|
||
| # delete per-rank files, do this even on error | ||
| print("Deleting rank files ...", flush=True) |
You can actually delete in parallel: each rank can remove its own file. And all this code can move to remove_shards.
Sure. Will change to do a parallel delete.
tools/preprocess_dataset_mpi.py
Outdated
try:
    os.remove(binfile)
except:
    pass
I'm curious why it can fail, can we add a log here?
The main thing I was guarding against here is if some process fails before it creates its file, in which case the file for that rank won't exist. This turned up in testing when I was forcing some process failures. I could likely add an os.exists check before the remove to help mitigate that.
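That guard could be as small as something like this (a sketch; the helper name is hypothetical):

```python
import os

def remove_if_exists(path):
    # skip ranks whose file was never created, and log anything else
    # that goes wrong instead of silently passing
    if not os.path.exists(path):
        return
    try:
        os.remove(path)
    except OSError as err:
        print(f"Failed to remove {path}: {err}", flush=True)
```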
Ah I see! But if you do parallel removal and a rank crashes, it won't remove its files (since it has already crashed ...). We could wrap everything in try/catch ... Also, if one process crashes, you probably end up in a deadlock somewhere around success = all_true(args, success) (right before the merge), no? Or does MPI realise that one process has died and broadcast to only the live processes?
I think it's complicated to handle all failures perfectly; maybe we can make the assumption that everything runs well. And if not, there might be manual steps to run (I usually just ran rm {prefix}_* from the folder without much care ...). Or just realise that the next run will overwrite your files.
Right. I think we'll be able to auto-cleanup the rank files in most cases, like no failures and perhaps some common failures like running out of disk space. There are some other cases where we might leave some files behind, and the user can fall back to using rm then.
It depends on the MPI runtime and the job launcher that is used, but MPI normally tears down a job when it detects that some process in the job has failed. I have seen some hangs though in my testing, especially when python exceptions are raised. I haven't figured out why MPI is failing to tear down those jobs yet.
tools/preprocess_dataset_mpi.py
Outdated
try:
    os.remove(idxfile)
except:
    pass
ditto
| print("Bytes=", dset_stats[2], "bytes/sec=", byterate) | ||
|
|
||
| # allreduce to check whether all ranks wrote their part successfully | ||
| success = all_true(args, success) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this can probably go somewhere like def process_shard(rank) l.428
tools/preprocess_dataset_mpi.py
Outdated
byterate = numbytes / secs if secs > 0.0 else 0.0
print("Seconds to merge:", merge_end - merge_start)
print(f"Merged {proc_size} files into {args.output_prefix}")
print(f"Bytes=", numbytes, "bytes/sec=", byterate)
l.510 to here can go to merge_shards method?
Created three functions for the create, merge, and delete operations of the per-rank files. That cleans up the main body a lot.
I suggest we take that parallel merge into another PR. The reason is that this PR is getting huge. The good thing is we can probably leverage this in all the other scripts, so if it really is much faster we probably want to change the other scripts too. FYI, we could run things at 40-45 MB/s on 60 workers (though it was single-node).
Thanks for that data point, @thomasw21. I noticed that kind of rate from your PR comments last week. That's an impressive result. It also suggests that this multi-node implementation must still be leaving performance on the table somewhere.
The hot spot in my tests seems to be the sample encode cost. I ran a few tests, each with 8 nodes processing a total of 1M samples:
- Reading the samples from disk without encoding, I see a processing rate of ~10 GB/s. That's reasonable given the file system.
- Encoding those text samples without splitting sentences drops performance to ~100 MB/s -- 100x slower.
- Splitting sentences drops to ~15 MB/s -- another factor of 7x.
Alright, I'm settled on this one, assuming you guys are all ok with it. Thanks for the help. I suppose the name … Also, maybe replace … I'd be interested to know whether it works for anyone else, either with … Then if … In particular, I have been using a command like: … I have less experience with …
And a simple test program to check that … And to run: …
Okay I've tried a couple of things:
I haven't run benchmarks, just checked that I could get it working fairly easily. Btw, can we have logs that are easier to read? convert … I was only able to run on …
This is very interesting. Splitting uses nltk, so unless we want to use something else, we are stuck with that. Maybe something from spaCy might be faster, or we could split on ". ", but that's not the same semantics. As for the encoding, which encoder did you use? HFBert, HFGPT, or the GPT in the repo? We could optimize some of the encoders in the repo to use the LRU/word-length-based cache that we did for GPT to squeeze out more performance. But unless the performance is a bottleneck, we were good enough with the performance we got on the optimized GPT encoder we used.
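For readers unfamiliar with that cache trick, the gist is roughly the following (a loose sketch only; it glosses over whitespace/pre-tokenization details, the tokenizer choice is a placeholder, and the caching thresholds are illustrative):

```python
from functools import lru_cache
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder; no special tokens added

@lru_cache(maxsize=2**20)
def encode_word(word):
    # common short words are encoded once and then served from the cache
    return tuple(tokenizer.encode(word))

def encode_text(text, max_cached_len=12):
    ids = []
    for word in text.split():
        # only cache short words; long rare strings would just evict useful entries
        encoded = encode_word(word) if len(word) <= max_cached_len else tokenizer.encode(word)
        ids.extend(encoded)
    return ids
```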
Thank you for testing, @thomasw21. It's good to know that it's working for you and on another system. @ontocord, I'm basically using the same … For tokenizers, I've tried …
I think this script will need some work to handle very large datasets. There are a few aspects that come to mind:
We can look at all of that in future work. For some context, for a large dataset like oscar, how many different examples are there in the training set?
I've also got a start on an extension to allow this to process files in jsonl format. The ranks work to scan the file to identify the byte offsets of newline characters …
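One way that partitioning could work (a rough sketch of the idea, not the actual extension): split the file into equal byte ranges and slide each boundary forward to the next newline so every rank starts on a whole jsonl record.

```python
import os

def find_record_offsets(path, num_ranks):
    file_size = os.path.getsize(path)
    offsets = []
    with open(path, "rb") as f:
        for r in range(num_ranks):
            pos = (file_size * r) // num_ranks
            if pos > 0:
                f.seek(pos - 1)
                f.readline()  # skip forward to the start of the next line
                pos = f.tell()
            offsets.append(pos)
    offsets.append(file_size)
    return offsets  # rank r processes bytes [offsets[r], offsets[r+1])
```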
Sure. Just pushed a commit to make that change.
This would be fixed by migrating to datasets, as there would be no need to have the index anymore.
So I don't know if
It depends on the split you take; that information is available here: https://huggingface.co/datasets/oscar
thomasw21
left a comment
Btw I've noticed that HF datasets can sometimes require a subset config for some datasets (for example: wiki40b). Though it should just be an additional argument.
tools/preprocess_dataset_mpi.py
Outdated
# import datasets after potentially setting environment variables
from datasets import load_dataset, logging
from datasets.utils.file_utils import OfflineModeIsEnabled
Let's move this to the top with the other imports since you have updated to use config.
tools/preprocess_dataset_mpi.py
Outdated
# Alternatively, one could set datasets.config.HF_DATASETS_OFFLINE.
# This seems to work even after the import statement,
# but it feels like a bigger hack.
import datasets
ditto
Good point. Moved those up.
Feel free to merge whenever you can! This will allow us to look at the parallel merge part.
Thank you @adammoody, @thomasw21 and @ontocord for working on this. Awesome addition! @adammoody, you may now want to ask the upstreams to integrate this new tool into their arsenal, so it's not limited to this repo. Or we can add it to the list here: #10, and the upstreams may or may not pick it up from there.
Thanks, @stas00. Do you find it's easier to upstream with one vs the other? I've got a few more PRs planned to further this, so I may let it settle a bit before making the request.
I haven't had a chance to participate closely in your PR, as there are many other burning issues to attend to at the moment, so I am trusting Thomas to have experimented with it and found it to be an excellent contribution. I will definitely try it on the next occasion I need to do pre-processing and will follow up if I run into any issues.
FYI, we now have a test suite: #64. If you'd like, you can add a simple test to https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_preprocessing.py by copying the test and adjusting it to use … @thomasw21, you may want to do the same for …; hopefully it should be an easy task for both. If you have any questions please don't hesitate to ask. We will refactor as we go.
This is work in progress, but I'll post it early to get feedback.
This PR includes a few things at once.
Updates megatron/data/indexed_dataset.py to use numpy to compute sample offsets to improve speed while writing an index file.

Adds a new tools/preprocess_dataset_mpi.py script:
- uses mpi4py to support multiple nodes
- --split option to name the HF dataset split, defaults to train
- --columns option to specify dataset feature (column) names to process from each row
- --shuffle option to randomly shuffle data samples
- --seed for the random number generator on shuffle operations
- --count option to limit the number of selected samples, e.g. --count 10000 to use 10k samples
- --mpi4py option to instruct the script to use mpi4py instead of torch.distributed
- --torch-backend option to select between gloo/mpi
- --local_rank to support torch.distributed.launch when using torch.distributed
- --log-interval to specify seconds between progress messages, or 0 to disable

Assuming srun has been configured to launch MPI jobs, one can run this script with something like: …
The script can use MPI and mpi4py. It requires that a shared file system exists, like Lustre or GPFS, such that one process can read a file written by another process. In particular, there may be problems on NFS due to client-side caching.

TODO:
The size discrepancy was in the index file, which is resolved with the doc_idx fix here
Megatron-DeepSpeed/megatron/data/indexed_dataset.py
Line 574 in 752e958
- .bin and .idx files are identical using cmp after disabling the data shuffle (--shuffle and --seed options in 32fc48f)
- torch.distributed.init_process_group …
- HF_DATASETS_OFFLINE item

In my testing with 320 procs on 8 nodes, start up takes 2 minutes, it takes about 15 minutes for all processes to write their .bin/.idx files, and the merge takes 3 more minutes. It processes the full openwebtext dataset in 20 minutes.

The file system hosting the source dataset, the intermediate .bin/.idx files, and the final merged file is GPFS, which provides 120 GB/s write bandwidth.