adammoody commented Aug 19, 2021

I can't run on JZ, but as a concrete example, I think a script like the one in this PR could be used with the new preprocess_data_dist.py script. This requires the JSON support added in PR bigscience-workshop/Megatron-DeepSpeed#60.

To process a JSON file, the script first generates an "index" that records the starting byte offset and length of each line in the source JSON file. That index file is stored beside the source JSON file to be reused in future runs. The index enables quick random access to the (variable-length) lines in the JSON file.
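As a rough illustration of the indexing idea described above, here is a minimal sketch in Python. It is not the code from preprocess_data_dist.py, and it does not reproduce that script's on-disk index format; the function names `build_line_index`, `read_record`, and `demo` are hypothetical and exist only for this example. It assumes one JSON record per line.

```python
import json
import os
import tempfile


def build_line_index(path):
    """Return a list of (byte_offset, byte_length) pairs, one per line.

    Hypothetical helper: the real script's index layout may differ.
    """
    index = []
    offset = 0
    # Open in binary mode so offsets and lengths are counted in bytes,
    # not characters (this matters for non-ASCII text).
    with open(path, "rb") as f:
        for line in f:
            index.append((offset, len(line)))
            offset += len(line)
    return index


def read_record(path, index, i):
    """Seek directly to line i and parse it, without scanning earlier lines."""
    offset, length = index[i]
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.read(length))


def demo():
    """Write a small JSON-lines file, index it, and fetch the last record."""
    records = [
        {"text": "short"},
        {"text": "a much longer line of text"},
        {"text": "héllo"},
    ]
    with tempfile.NamedTemporaryFile("wb", suffix=".jsonl", delete=False) as tmp:
        for record in records:
            line = json.dumps(record, ensure_ascii=False) + "\n"
            tmp.write(line.encode("utf-8"))
        path = tmp.name
    try:
        index = build_line_index(path)
        return index, read_record(path, index, 2)
    finally:
        os.remove(path)
```

Because each entry stores both an offset and a length, a reader can fetch any line with a single seek and a single read, which is what makes random access to variable-length lines cheap.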

In the example SLURM script in this PR, for the source file:

$six_ALL_CCFRSCRATCH/datasets/oscar-small/oscar-en-shuffled-p1.jsonl

the preprocess_data_dist.py script will create the following files as a result of indexing the source JSON file:

$six_ALL_CCFRSCRATCH/datasets/oscar-small/oscar-en-shuffled-p1.jsonl.idx (persists after first run)
$six_ALL_CCFRSCRATCH/datasets/oscar-small/oscar-en-shuffled-p1.jsonl.idxtmp (created and deleted during run)

@adammoody
Author

If you verify this works well and correctly at smaller node counts, you might try scaling it higher. It has scaled well for me up to 64 nodes so far.

@adammoody
Author

Oh, and for a quick test, add something like preprocess_data_dist.py --count 1000 to limit the number of samples processed. It's good to test things with one or two nodes and a small sample count before trying large node counts and the full dataset.

@adammoody adammoody changed the title from "add oscar slurm script for preprocess_data_dist" to "WIP: add oscar slurm script for preprocess_data_dist" on Aug 20, 2021
@stas00
Contributor

stas00 commented Oct 15, 2021

Hi Adam,

I have just noticed your WIP PR here. Is it still relevant? If so, let's merge it; if not, should we move or close it?

@adammoody
Author

Thanks, @stas00. No need to merge this one.

@adammoody adammoody closed this Oct 15, 2021