
Enable saving and loading stateful DataLoaders in Trainer #19361

Merged: 33 commits merged into master from feature/stateful-dataloader on Feb 1, 2024

Conversation

@awaelchli (Contributor) commented on Jan 29, 2024

What does this PR do?

Saves the state_dict of training dataloaders into the checkpoint, and enables that state to be loaded when resuming from a checkpoint. An example of a stateful dataloader is lightning.data.StreamingDataLoader.
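For context, "stateful" here simply means a dataloader that exposes `state_dict()` and `load_state_dict()`. Below is a minimal, hypothetical sketch of such a loader; `CountingDataLoader` is not part of Lightning, and a real implementation such as `StreamingDataLoader` does more work on load (e.g. fast-forwarding its samplers):

```python
from torch.utils.data import DataLoader


class CountingDataLoader(DataLoader):
    """Hypothetical stateful dataloader that records how many batches it has yielded."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._batches_yielded = 0

    def __iter__(self):
        # Count batches as they are produced so progress survives a checkpoint.
        for batch in super().__iter__():
            self._batches_yielded += 1
            yield batch

    def state_dict(self) -> dict:
        # Captured at checkpoint time.
        return {"batches_yielded": self._batches_yielded}

    def load_state_dict(self, state_dict: dict) -> None:
        # Called on resume; a real loader would also skip already-seen samples.
        self._batches_yielded = state_dict["batches_yielded"]
```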

The implementation collects the state of all iterables under the CombinedLoader that follow the stateful interface. The states are collected over the flattened view (CombinedLoader.flattened) and stored in a list, and they are restored in the same manner via the flattened view. This means the number of iterables and the order in which they are given must be exactly the same as when the checkpoint was saved; otherwise loading will fail. A rough sketch of this mechanism follows below.
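As an illustration of that save/restore path (the function names are mine, not the PR's actual internals; it assumes CombinedLoader.flattened returns the individual iterables in a stable order):

```python
from typing import Any, Dict, List


def collect_dataloader_states(combined_loader) -> List[Dict[str, Any]]:
    # Walk the flattened view; non-stateful iterables contribute an empty
    # placeholder so positions stay aligned with the flattened order.
    return [
        dl.state_dict() if hasattr(dl, "state_dict") else {}
        for dl in combined_loader.flattened
    ]


def restore_dataloader_states(combined_loader, states: List[Dict[str, Any]]) -> None:
    # Restoring assumes the same number and order of iterables as at save
    # time; this is why a mismatch makes loading fail.
    if len(states) != len(combined_loader.flattened):
        raise RuntimeError("The number of dataloaders changed since the checkpoint was saved")
    for dl, state in zip(combined_loader.flattened, states):
        if state and hasattr(dl, "load_state_dict"):
            dl.load_state_dict(state)
```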

Fixes #17105
Closes #17543

cc @Borda @justusschock @awaelchli @carmocca

@github-actions bot added the "pl" (Generic label for PyTorch Lightning package) label on Jan 29, 2024
@awaelchli added the "feature" (Is an improvement or enhancement), "data handling" (Generic data-related topic), "trainer", and "fun" (Staff contributions outside working hours) labels and removed the "pl" label on Jan 29, 2024
@awaelchli added this to the 2.2 milestone on Jan 29, 2024
@github-actions bot added the "pl" label on Jan 29, 2024
@awaelchli force-pushed the feature/stateful-dataloader branch from 1d75450 to 5317545 on January 29, 2024 at 02:45
@awaelchli changed the title from "Save state_dict for stateful DataLoaders" to "Enable saving and loading stateful DataLoaders in Trainer" on Jan 30, 2024
codecov bot commented Jan 30, 2024

Codecov Report

Merging #19361 (0f8621a) into master (5d178d0) will decrease coverage by 35%.
The diff coverage is 100%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19361      +/-   ##
==========================================
- Coverage      84%      49%     -35%     
==========================================
  Files         448      440       -8     
  Lines       37887    37762     -125     
==========================================
- Hits        31649    18392   -13257     
- Misses       6238    19370   +13132     

@tchaton (Contributor) left a comment:

Nice ;)

@mergify bot added the "ready" (PRs ready to be merged) label on Jan 30, 2024
Two review threads on src/lightning/pytorch/loops/fit_loop.py (outdated, resolved)
@awaelchli requested a review from @carmocca on February 1, 2024 at 00:28
@mergify bot added the "has conflicts" label and removed the "ready" (PRs ready to be merged) label on Feb 1, 2024
@mergify bot added the "ready" label and removed the "has conflicts" label on Feb 1, 2024
@awaelchli merged commit 34a34a0 into master on Feb 1, 2024
96 of 97 checks passed
@awaelchli deleted the feature/stateful-dataloader branch on February 1, 2024 at 02:11
Labels
"data handling" (Generic data-related topic), "fault tolerance", "feature" (Is an improvement or enhancement), "fun" (Staff contributions outside working hours, to differentiate from the "community" label), "pl" (Generic label for PyTorch Lightning package), "ready" (PRs ready to be merged), "trainer"
Development

Successfully merging this pull request may close these issues:

- Resuming training gives different model result / weights
- Support manual dataloader fault-tolerance
4 participants