Shar: tracking epochs in shard iterator with option for shard re-shuffling each epoch #894
This change implements Dan's idea of infinite iteration with shards re-shuffled at the start of each epoch. I will write a more detailed example/tutorial later, but the basic idea is the following snippet:
yields an infinite `CutSet` that has two levels of shuffling: the shard order, which is re-shuffled at the start of each epoch, and the cuts themselves, which pass through a shuffling buffer.

The rest of the dataloading workflow is the same as with WebDataset. The DataLoader never stops yielding data, so epochs need to be tracked differently. To make that possible, cuts iterated this way carry a custom field called `shar_epoch`. In practice, when the epoch is incremented, you will keep seeing cuts from the previous epoch for a number of steps until they are completely flushed out of the shuffling buffer. If that is undesirable, call `.shuffle()` first and then `.repeat()`, but the result will be a bit less random.
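
To illustrate why the order of the two calls matters, here is a rough plain-Python sketch. It does not use Lhotse: `buffered_shuffle`, `repeat_then_shuffle`, `shuffle_then_repeat` and `count_stale_cuts` are hypothetical stand-ins written for this example, cuts are modelled as dicts, and a finite `num_epochs` replaces the infinite repeat so the script terminates. Only the `shar_epoch` field name comes from this PR.

```python
import random
from typing import Dict, Iterable, Iterator, List

Cut = Dict[str, int]  # toy stand-in for a cut: {"id": ..., "shar_epoch": ...}


def buffered_shuffle(stream: Iterable[Cut], buffer_size: int, rng: random.Random) -> Iterator[Cut]:
    """Streaming shuffle: keep a fixed-size buffer and emit a random element of it."""
    buffer: List[Cut] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The stream is exhausted: flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def repeat_then_shuffle(cut_ids: List[int], num_epochs: int, buffer_size: int, seed: int = 0) -> List[Cut]:
    """Repeat first, shuffle second: one buffer spans epoch boundaries."""
    rng = random.Random(seed)

    def repeated() -> Iterator[Cut]:
        for epoch in range(num_epochs):
            for cut_id in cut_ids:
                yield {"id": cut_id, "shar_epoch": epoch}

    return list(buffered_shuffle(repeated(), buffer_size, rng))


def shuffle_then_repeat(cut_ids: List[int], num_epochs: int, buffer_size: int, seed: int = 0) -> List[Cut]:
    """Shuffle first, repeat second: the buffer is flushed at the end of every epoch."""
    rng = random.Random(seed)
    out: List[Cut] = []
    for epoch in range(num_epochs):
        one_epoch = ({"id": cut_id, "shar_epoch": epoch} for cut_id in cut_ids)
        out.extend(buffered_shuffle(one_epoch, buffer_size, rng))
    return out


def count_stale_cuts(cuts: Iterable[Cut]) -> int:
    """Count cuts that arrive after a cut from a newer epoch has already been seen."""
    newest = -1
    stale = 0
    for cut in cuts:
        if cut["shar_epoch"] < newest:
            stale += 1
        newest = max(newest, cut["shar_epoch"])
    return stale


if __name__ == "__main__":
    ids = list(range(10))
    mixed = repeat_then_shuffle(ids, num_epochs=3, buffer_size=4)
    clean = shuffle_then_repeat(ids, num_epochs=3, buffer_size=4)
    print("stale cuts, repeat then shuffle:", count_stale_cuts(mixed))
    print("stale cuts, shuffle then repeat:", count_stale_cuts(clean))
```

In the second variant the stale count is always zero, because each epoch's buffer is emptied before the next epoch starts. The price is that cuts from neighbouring epochs never mix. In a training loop, the current epoch can be read from the cuts themselves, for instance as the largest `shar_epoch` value seen so far.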