Merged

Commits (30)
- f68999f  indexed_dataset: use numpy to compute byte offsets faster (adammoody, Aug 4, 2021)
- d5d20bb  preprocess with huggingface datasets and mpi (adammoody, Aug 4, 2021)
- 32fc48f  preprocess_dataset_mpi: add --shuffle and --seed options (adammoody, Aug 5, 2021)
- a456e48  indexed_dataset: fix to handle file with 0 items (adammoody, Aug 5, 2021)
- 17ac2f9  preprocess_dataset_mpi: add --split and --count options (adammoody, Aug 5, 2021)
- 7836a32  update script comments to reflect shuffle behavior (adammoody, Aug 5, 2021)
- 7b27853  add torch.distributed version (adammoody, Aug 5, 2021)
- 92d78c4  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 38b2d8a  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 6264d7a  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 782151f  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 88f5d0b  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 31fab0e  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 520d06c  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 6, 2021)
- 6e0e4fd  add estimated progress logging (adammoody, Aug 6, 2021)
- 03bf199  avoid downloading dataset unless user really wants to (adammoody, Aug 6, 2021)
- 0b2d4cd  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 8, 2021)
- cbf965e  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 8, 2021)
- 7c8c1c9  refactor main into more functions (adammoody, Aug 8, 2021)
- 600c091  reformat progress messages (adammoody, Aug 8, 2021)
- 6bb27f7  move mpi4py import test to get_args (adammoody, Aug 8, 2021)
- 7ee7bf5  drop Open MPI variables from init_process_group (adammoody, Aug 8, 2021)
- f0f45b9  add --local_rank to support torch.distributed.launch (adammoody, Aug 8, 2021)
- 3be3423  update from DeepSpeedExamples (adammoody, Aug 9, 2021)
- 8ae0cf8  raise exceptions on errors (adammoody, Aug 9, 2021)
- a8e9b2e  drop --download option (adammoody, Aug 9, 2021)
- fa9e323  format byte rate as MB/s (adammoody, Aug 10, 2021)
- 3db9cdb  Update tools/preprocess_dataset_mpi.py (adammoody, Aug 10, 2021)
- 764e760  move datasets import back to top (adammoody, Aug 10, 2021)
- 80ee230  import config from datasets (adammoody, Aug 10, 2021)
36 changes: 23 additions & 13 deletions megatron/data/indexed_dataset.py
@@ -355,28 +355,38 @@ def __enter__(self):
         return self
 
     @staticmethod
-    def _get_pointers(sizes):
-        dtype_size = dtype().itemsize
-        address = 0
-        pointers = []
-
-        for size in sizes:
-            pointers.append(address)
-            address += size * dtype_size
-
-        return pointers
+    def _get_pointers(sizes, npdtype):
+        """Return a numpy array of byte offsets given a list of sizes.
+
+        Multiplies values in the sizes array by the dtype size (bytes),
+        and then computes a zero-based (exclusive) prefix scan.
+        """
+
+        # create a numpy array of the desired numpy datatype
+        pointers = np.array(sizes, dtype=npdtype)
+
+        if len(sizes) > 0:
+            # scale each element by its dtype size
+            dtype_size = dtype().itemsize
+            pointers *= dtype_size
+
+            # in-place prefix scan to compute byte offsets
+            np.cumsum(pointers, axis=0, out=pointers)
+
+            # convert the inclusive scan to an exclusive (zero-based) scan:
+            # shift right by one element and zero the first entry
+            pointers = np.roll(pointers, 1)
+            pointers[0] = 0
+
+        return pointers
 
     def write(self, sizes, doc_idx):
-        pointers = self._get_pointers(sizes)
-
         self._file.write(struct.pack('<Q', len(sizes)))
         self._file.write(struct.pack('<Q', len(doc_idx)))
 
-        sizes = np.array(sizes, dtype=np.int32)
-        self._file.write(sizes.tobytes(order='C'))
-        del sizes
+        sizes32 = np.array(sizes, dtype=np.int32)
+        self._file.write(sizes32.tobytes(order='C'))
+        del sizes32
 
-        pointers = np.array(pointers, dtype=np.int64)
+        pointers = self._get_pointers(sizes, np.int64)
         self._file.write(pointers.tobytes(order='C'))
         del pointers
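To make the exclusive-scan behavior concrete, here is a standalone sketch of the same computation (the helper name and the hard-coded 4-byte token size are illustrative, not part of the diff). With item sizes [2, 3, 4] and 4 bytes per token, the byte offsets come out as [0, 8, 20]:

    import numpy as np

    def byte_offsets(sizes, itemsize, npdtype=np.int64):
        """Exclusive prefix scan of sizes, scaled to bytes."""
        pointers = np.array(sizes, dtype=npdtype)
        if len(sizes) > 0:
            pointers *= itemsize                       # element counts -> bytes
            np.cumsum(pointers, axis=0, out=pointers)  # inclusive scan, in place
            pointers = np.roll(pointers, 1)            # shift right by one...
            pointers[0] = 0                            # ...and zero-base the scan
        return pointers

    print(byte_offsets([2, 3, 4], itemsize=4))  # [ 0  8 20]
    print(byte_offsets([], itemsize=4))         # [] -- the length guard avoids indexing an empty array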

Conversations

Contributor (ontocord):

As I understand the changes, this is equivalent to the previous Python list + loop based approach. Would this make things faster because you don't need to create a list and then convert it to numpy? It looks right. Did you test to see if the output of the two methods is the same?

Author (adammoody):

> As an aside, I found that pytorch-biggraph also uses both torch.distributed/gloo and MPI to communicate, if you are interested:

@ontocord, thanks for your review, and thanks for the pointers to this other project. I'll take a look when I get a chance.

Author (adammoody):

Yes, it should be equivalent. I checked by verifying that the resulting files are identical with cmp. Though I may put that on my list to double-check again, just to be sure.

I found that creating this list of pointers was taking a long time when writing out the index file. This list for the full openwebtext dataset turned out to be ~300 million items. The cost was concentrated in the loop that computes the byte offsets. Converting that loop into numpy calls cut the compute time significantly.

Contributor (ontocord):

Awesome optimization! Thank you.
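The reviewer's question about matching outputs can be spot-checked directly. A minimal sketch that compares the old loop against the vectorized version on random input (the function names are illustrative, not from the PR):

    import numpy as np

    def offsets_loop(sizes, itemsize):
        # the original list + loop approach
        address, pointers = 0, []
        for size in sizes:
            pointers.append(address)
            address += size * itemsize
        return pointers

    def offsets_numpy(sizes, itemsize):
        # the vectorized replacement
        pointers = np.array(sizes, dtype=np.int64)
        if len(sizes) > 0:
            pointers *= itemsize
            np.cumsum(pointers, axis=0, out=pointers)
            pointers = np.roll(pointers, 1)
            pointers[0] = 0
        return pointers

    sizes = np.random.randint(1, 1000, size=100_000).tolist()
    assert offsets_numpy(sizes, 4).tolist() == offsets_loop(sizes, 4)
    print("outputs match")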

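The ~300 million entry figure explains why the loop hurt. Even at 10 million entries the gap is easy to reproduce; a rough, self-contained timing sketch (timings vary by machine, and the 4-byte token size is an assumption):

    import time
    import numpy as np

    itemsize = 4
    sizes = np.random.randint(1, 1000, size=10_000_000).tolist()

    t0 = time.time()
    address, pointers = 0, []              # pure-Python loop
    for size in sizes:
        pointers.append(address)
        address += size * itemsize
    t1 = time.time()

    arr = np.array(sizes, dtype=np.int64)  # vectorized version
    arr *= itemsize
    np.cumsum(arr, axis=0, out=arr)
    arr = np.roll(arr, 1)
    arr[0] = 0
    t2 = time.time()

    print(f"loop:  {t1 - t0:.2f} s")
    print(f"numpy: {t2 - t1:.2f} s")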