Shar: tracking epochs in shard iterator with option for shard re-shuffling each epoch #894

Merged
merged 2 commits into from
Nov 19, 2022


pzelasko
Collaborator

This change implements Dan's idea of infinite iteration with shards re-shuffled at the start of each epoch. I will write a more detailed example/tutorial later, but the basic idea is that the following snippet:

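from lhotse import CutSet

# shar_dir: path to a directory of Shar shards
cuts = (
    CutSet.from_shar(in_dir=shar_dir, shuffle_shards=True, stateful_shuffle=True)
    .repeat()  # iterate forever; shards are re-shuffled each epoch
    .shuffle(buffer_size=10000)  # cut-level shuffling through a buffer
)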

yields an infinite CutSet that has two levels of shuffling:

  • shard-level: the shards are re-shuffled every time the iterator exhausts the underlying finite CutSet
  • cut-level: the cuts are shuffled across shards using a buffer
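To make the two levels concrete, here is a minimal pure-Python sketch of the idea. It is not Lhotse's implementation: `repeat_shards` and `buffered_shuffle` are hypothetical helper names, shards are modelled as plain lists of cut IDs, and the per-epoch seeding scheme (`seed + epoch`) is an assumption chosen only to make the example deterministic.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def repeat_shards(shards: List[List[str]], seed: int = 0) -> Iterator[Tuple[str, int]]:
    """Yield (cut_id, epoch) forever, re-shuffling the shard order every epoch."""
    epoch = 0
    while True:
        order = list(shards)
        # Assumption: derive a different shard order for each epoch from seed + epoch.
        random.Random(seed + epoch).shuffle(order)
        for shard in order:
            for cut_id in shard:
                yield cut_id, epoch
        epoch += 1


def buffered_shuffle(items: Iterable, buffer_size: int, seed: int = 0) -> Iterator:
    """Shuffle a stream by emitting a random element once the buffer is full."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            idx = rng.randrange(len(buffer))
            # Swap the chosen element to the end so it can be popped in O(1).
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Only reached for finite streams: flush whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Hypothetical toy data: three shards with two cuts each.
shards = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
stream = buffered_shuffle(repeat_shards(shards), buffer_size=4)
sample = list(islice(stream, 12))  # the stream is infinite, so take a slice
```

Each element of `sample` is a `(cut_id, epoch)` pair; the epoch tag plays the role that the `shar_epoch` field plays on real cuts.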

The rest of the dataloading workflow is the same as with WebDataset. Because the DataLoader never stops yielding data, epochs need to be tracked differently. To make that possible, cuts iterated this way carry a custom field called shar_epoch. In practice, after the epoch is incremented you will keep seeing cuts from the previous epoch for a number of steps, until they are completely flushed out of the shuffling buffer. If that's undesirable, call .shuffle() first and then .repeat(), but the result will be a bit less random.
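One way to track epochs from such a stream is sketched below. This is an illustration under assumptions, not part of the PR: cuts are stood in for by plain dicts with a "shar_epoch" key (on real cuts it would be read as a custom field), and `epoch_boundaries` and `count_stale` are hypothetical helper names.

```python
from typing import Dict, Iterable


def epoch_boundaries(cuts: Iterable[dict]) -> Dict[int, int]:
    """Map each epoch to the step at which a cut from that epoch was first seen."""
    first_seen: Dict[int, int] = {}
    for step, cut in enumerate(cuts):
        first_seen.setdefault(cut["shar_epoch"], step)
    return first_seen


def count_stale(cuts: Iterable[dict]) -> int:
    """Count cuts that arrive after a cut from a later epoch was already seen."""
    current = -1
    stale = 0
    for cut in cuts:
        epoch = cut["shar_epoch"]
        if epoch < current:
            stale += 1  # leftover from a previous epoch, still in the buffer
        else:
            current = epoch
    return stale


# Hypothetical stream around an epoch boundary, as a shuffle buffer might emit it.
stream = [
    {"id": "a", "shar_epoch": 0},
    {"id": "b", "shar_epoch": 0},
    {"id": "a", "shar_epoch": 1},
    {"id": "c", "shar_epoch": 0},  # previous-epoch cut flushed late
    {"id": "b", "shar_epoch": 1},
]
boundaries = epoch_boundaries(stream)
stale = count_stale(stream)
```

In a training loop, the first step at which a new epoch value appears is a reasonable point to trigger per-epoch actions such as checkpointing, keeping in mind that a few stale cuts may still follow.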

pzelasko added this to the v1.11 milestone Nov 18, 2022
pzelasko merged commit 6e06ee8 into master Nov 19, 2022
pzelasko deleted the feature/shar-v15 branch November 19, 2022 14:13