Add fault tolerance for the StreamingDataset 1/n #19049
Conversation
⚡ Required checks status: All passing 🟢

Groups summary:
🟢 lightning_data: CPU workflow (required after the changes)
🟢 mypy (required after the changes)
🟢 install (required after the changes)

Thank you for your contribution! 💜
Codecov Report

Additional details and impacted files:

@@            Coverage Diff            @@
##           master   #19049     +/-   ##
==========================================
- Coverage      83%      49%      -34%
==========================================
  Files         443      435        -8
  Lines       36185    36120       -65
==========================================
- Hits        30114    17614    -12500
- Misses       6071    18506    +12435
Looks great! A couple of comments
@@ -102,6 +106,20 @@ def filled(self) -> bool:
        self._is_done = os.path.exists(os.path.join(self._cache_dir, _INDEX_FILENAME))
        return self._is_done

    @property
    def checkpoint_dir(self) -> str:
It shouldn't be necessary to duplicate the code in both of these.
Sure.
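A minimal sketch of one way to avoid the duplication, assuming the two properties in question are `checkpoint_dir` and `checkpoint_rank_dir` and that both currently build and create their directory (the `"checkpoints"` subfolder name and the `rank` attribute are assumptions):

```python
import os


class Cache:
    """Minimal stand-in for the real cache object, just to show the shared helper."""

    def __init__(self, cache_dir: str, rank: int) -> None:
        self._cache_dir = cache_dir
        self.rank = rank

    @property
    def checkpoint_dir(self) -> str:
        # Single place that builds (and creates) the checkpoint directory.
        return self._create_dir(os.path.join(self._cache_dir, "checkpoints"))

    @property
    def checkpoint_rank_dir(self) -> str:
        # Reuses the same helper instead of duplicating the makedirs logic.
        return self._create_dir(os.path.join(self.checkpoint_dir, str(self.rank)))

    def _create_dir(self, path: str) -> str:
        os.makedirs(path, exist_ok=True)
        return path
```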
self.current_indexes = current_indexes[state["index"] :]

# Bump the chunk_index
self.chunk_index += 1
Why +1 the index? We're reloading it in the line above. If the chunk wasn't complete, we would now miss the remainder?
state = self._state_dict[str(self.cache.rank)]
# re-generate indexes
interval = self.worker_intervals[self.chunk_index]
current_indexes = np.arange(interval[0], interval[1])
current_indexes = self.shuffler(current_indexes, self.current_epoch, self.chunk_index)
self.current_indexes = current_indexes[state["index"] :]
# Bump the chunk_index
self.chunk_index += 1
No, it won't. The `chunk_index` is bumped only once the `current_indexes` are re-created. I will clean this up in another PR.
# 3. Move the file to avoid corrupted read from the main thread.
now = datetime.now().strftime(_TIME_FORMAT)
checkpoint_path = os.path.join(self.cache.checkpoint_rank_dir, f"checkpoint-{now}.json")
Why is the file timestamped? This will create a lot of files. IMO we should overwrite the file, because only the current state matters, and since every worker only saves to its own dedicated folder, that is safe.
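For reference, a minimal sketch of the overwrite-in-place alternative, assuming a per-worker `checkpoint_rank_dir` and a JSON payload as in the diff (the helper name is hypothetical). `os.replace` keeps the swap atomic, which also addresses the "corrupted read from the main thread" concern without timestamping:

```python
import json
import os


def save_checkpoint(checkpoint_rank_dir: str, state: dict) -> None:
    # Write to a temporary file first, then atomically swap it into place.
    tmp_path = os.path.join(checkpoint_rank_dir, "checkpoint.json.tmp")
    final_path = os.path.join(checkpoint_rank_dir, "checkpoint.json")
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # os.replace is atomic on the same filesystem, so a reader either sees the
    # previous checkpoint or the new one, never a half-written file.
    os.replace(tmp_path, final_path)
```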
        return data

    def checkpoint(self, chunk_index: int) -> None:
This shouldn't be public. The user wouldn't be able to use this effectively, because they can only call it from the main process, and from the main process it never makes sense.
I suggest we 1) make it private and 2) raise an error if it is called from the main process.
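Something along these lines (a sketch, not the actual implementation; `get_worker_info()` returning `None` in the main process is standard PyTorch behavior, and the method body is omitted):

```python
from torch.utils.data import get_worker_info


def _checkpoint(self, chunk_index: int) -> None:
    # Only dataloader worker processes hold the per-worker state to checkpoint.
    if get_worker_info() is None:
        raise RuntimeError(
            "`_checkpoint` should only be called from a dataloader worker process, not the main process."
        )
    ...  # write this worker's progress for `chunk_index` to its checkpoint directory
```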
                state_dict.update(**state)
                node_ranks.append(node_rank)
        else:
            raise NotImplementedError("The `state_dict` should be called on the main thread.")
But they aren't threads, they are proper processes. Also, we should raise immediately at the beginning; this would eliminate the entire if/else block and make the code much more readable.
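In other words, a guard-clause shape roughly like the following (a sketch; `_is_in_worker_process()` is a hypothetical helper standing in for however the worker check is done today, and the gathering logic stays as in the diff):

```python
def state_dict(self) -> dict:
    # Fail fast in worker processes, then the rest of the method needs no if/else.
    if self._is_in_worker_process():  # hypothetical helper; detection mechanism unchanged
        raise RuntimeError("`state_dict` should only be called from the main process.")

    state_dict: dict = {}
    ...  # gather and merge the per-worker checkpoints as in the diff above
    return state_dict
```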
            # TODO: Move this to fabric.
            num_devices = torch.cuda.device_count() or 1
            node_ranks = []
            for index in range(self.distributed_env.world_size):
                node_rank = index // num_devices
                if node_rank in node_ranks:
                    continue
                state = {}
                obj = [_state_dict]
                torch.distributed.broadcast_object_list(obj, index, group=_group.WORLD)
                state = obj[0]
                state_dict.update(**state)
                node_ranks.append(node_rank)
We should put this in a function; it would be way easier to unit test!
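For example (a sketch only; the function name and argument types are guesses based on the quoted diff):

```python
from typing import Any, Dict

import torch
import torch.distributed


def _collect_distributed_state_dict(local_state_dict: Dict[str, Any], world_size: int, group: Any) -> Dict[str, Any]:
    """Broadcast each node's local state and merge everything into a single dict (one broadcast per node)."""
    state_dict: Dict[str, Any] = {}
    num_devices = torch.cuda.device_count() or 1
    node_ranks = []
    for index in range(world_size):
        node_rank = index // num_devices
        if node_rank in node_ranks:
            continue
        obj = [local_state_dict]
        torch.distributed.broadcast_object_list(obj, index, group=group)
        state_dict.update(**obj[0])
        node_ranks.append(node_rank)
    return state_dict
```

This can then be exercised in a unit test with a small world size and a mocked process group, independently of the dataset itself.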
        if not os.path.exists(self.cache.checkpoint_dir):
            return state_dict

        # 2. Iterate through the workers and read the latest checkpoint
This step wouldn't be necessary, see comment above
        assert self.random_state
        return self.random_state.permutation(array).tolist()

    def __call__(self, array: np.ndarray, current_epoch: int, chunk_index: int) -> List[int]:
        return np.random.RandomState(seed=self.seed + current_epoch + chunk_index).permutation(array).tolist()
This is problematic because the seed will not be unique. For example, 5 + 4 = 4 + 5 = 9. We should at least multiply the current epoch by the number of chunks.
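To make the collision concrete, and to show one possible fix (a sketch; `num_chunks` would have to come from the dataset, and the helper name is hypothetical):

```python
import numpy as np

seed = 42

# Today epoch=5/chunk=4 and epoch=4/chunk=5 collide: both yield seed 42 + 9.
assert seed + 5 + 4 == seed + 4 + 5


def chunk_seed(seed: int, current_epoch: int, chunk_index: int, num_chunks: int) -> int:
    # Scaling the epoch by the number of chunks makes each (epoch, chunk_index) pair unique.
    return seed + current_epoch * num_chunks + chunk_index


rng = np.random.RandomState(seed=chunk_seed(seed, current_epoch=5, chunk_index=4, num_chunks=100))
permuted = rng.permutation(10).tolist()
```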
Co-authored-by: thomas <[email protected]> (cherry picked from commit 1073276)
What does this PR do?
This PR adds fault-tolerance support for the StreamingDataset, which enables using it seamlessly with Fabric.save, as sketched below:
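The original usage example appears to have been dropped from the description; the following is a hedged sketch of the intended flow (import paths, the constructor argument, and the checkpoint layout are assumptions):

```python
from lightning.data import StreamingDataset
from lightning.fabric import Fabric
from torch.utils.data import DataLoader

fabric = Fabric()
dataset = StreamingDataset("s3://my-bucket/optimized-dataset")  # hypothetical remote dataset dir
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    ...
    # Persist the dataset progress together with the rest of the training state,
    # so a restarted run can resume the stream from where it left off.
    fabric.save("checkpoint.ckpt", {"dataset": dataset.state_dict()})
```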
Fixes #<issue_number>
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--19049.org.readthedocs.build/en/19049/
cc @Borda