Lhotse Shar tutorial notebook #1006
Conversation
Line #3. `shards = cuts_train.to_shar(data_dir, fields={"recording": "wav"}, shard_size=1000)`

May be helpful to explain what other values may be passed to the `fields` dict.

Also, what is the granularity of sharding? Is each recording treated as a unit?
Yes, each cut is treated as a unit, so shard_size=1000 means 1000 cuts per shard. I will add this explanation to the tutorial and also provide a link to the documentation that describes the `fields` dict in more detail (I noticed that Lhotse Shar docs were not linked in readthedocs and I'm also fixing that now).
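For a quick illustration, something along these lines (a sketch -- the manifest path and output directory names are placeholders, and the full list of supported formats is in the Shar docs):

```python
from lhotse import CutSet

# Placeholder manifest path for illustration.
cuts_train = CutSet.from_file("cuts_train.jsonl.gz")

# Each cut is one unit of sharding, so shard_size=1000 means 1000 cuts per shard tarball.
# The `fields` dict selects which data fields to export and in which format,
# e.g. "recording": "wav" stores each cut's audio as WAV inside the shards
# (other formats such as "flac" exist too -- see the docs for the complete list).
shards = cuts_train.to_shar("data_shar", fields={"recording": "wav"}, shard_size=1000)

# `shards` describes the shard files that were written for each exported field.
print(shards)
```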
What do you mean by "keep reshuffling as the full epoch is reached"? Does this mean each worker's copy is shuffled (with a different RNG) at the start of every epoch of training?
Also, what would happen if we do not use the iterable dataset wrapper? Would all the workers generate the same batches?
> What do you mean by "keep reshuffling as the full epoch is reached"? Does this mean each worker's copy is shuffled (with a different RNG) at the start of every epoch of training?
It roughly means the following: given a dataset of three shards [A, B, C], a single node, two dataloader workers W1 and W2, and a global random seed=0, the training dataloading might look like the following (assuming `stateful_shuffle=True`):
Epoch 0:

- W1 uses RNG with seed (global=0 + worker-id=1 + 1000*rank=0) + epoch=0 = 1 and has order: [B, A, C]
- W2 uses RNG with seed (global=0 + worker-id=2 + 1000*rank=0) + epoch=0 = 2 and has order: [C, B, A]

Epoch 1:

- W1 uses RNG with seed (global=0 + worker-id=1 + 1000*rank=0) + epoch=1 = 2 and has order: [C, B, A]
- W2 uses RNG with seed (global=0 + worker-id=2 + 1000*rank=0) + epoch=1 = 3 and has order: [A, B, C]

... and so on.
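As a toy paraphrase of that seed arithmetic (not the actual Lhotse implementation, just the scheme described above):

```python
def shard_shuffle_seed(global_seed: int, worker_id: int, rank: int, epoch: int) -> int:
    # Each (worker, rank, epoch) combination gets its own RNG seed, so every
    # dataloader worker sees a different shard order, and the order changes
    # again when the next epoch starts (stateful_shuffle=True).
    return global_seed + worker_id + 1000 * rank + epoch

assert shard_shuffle_seed(0, worker_id=1, rank=0, epoch=0) == 1  # W1, epoch 0 -> [B, A, C]
assert shard_shuffle_seed(0, worker_id=2, rank=0, epoch=1) == 3  # W2, epoch 1 -> [A, B, C]
```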
Note that since `.repeat()` makes the CutSet infinite, the dataloader will never stop yielding data, so you won't easily know what the current epoch is -- it's best to count steps. If you really need to know the epoch, Shar attaches a custom field `cut.shar_epoch` to each cut that you can read out. You'll also generally observe that each shar_epoch contains world_size * num_workers actual epochs in this setup.
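For example, roughly like this (the directory name and the exact `from_shar` arguments here are a sketch -- check the tutorial/docs for the precise ones):

```python
from lhotse import CutSet

# Assumed input directory; stateful_shuffle reshuffles shards with an
# epoch-dependent seed, and .repeat() makes the CutSet infinite.
cuts = CutSet.from_shar(
    in_dir="data_shar",
    shuffle_shards=True,
    stateful_shuffle=True,
    seed=0,
).repeat()

for step, cut in enumerate(cuts):
    # Count steps for training logic; read cut.shar_epoch only if you need the epoch.
    print(step, cut.shar_epoch)
    if step >= 5:
        break
```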
BTW, after writing this I realized I need to check what kind of IDs PyTorch assigns to workers, so we can avoid seeing too much of the same data order (randomized augmentations probably help with that, though, and it should matter less with large datasets).
> Also, what would happen if we do not use the iterable dataset wrapper? Would all the workers generate the same batches?
Then you'd end up with data I/O happening in the training loop process (since with WebDataset and Shar the I/O happens upon iterating the CutSet) and binary blobs being transferred to the DataLoader worker processes. But the workers wouldn't duplicate the data. It'd be great to emit a warning to the user if that happens, but I don't really have an idea how to detect it.
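For completeness, a rough sketch of the wrapped setup, where the sampler and dataset both live inside the DataLoader workers so the shard I/O happens there (directory name and sampler settings are placeholders; a typical ASR dataset class is assumed):

```python
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper

# Lazily iterate shards; I/O only happens when the CutSet is iterated.
cuts = CutSet.from_shar(in_dir="data_shar", shuffle_shards=True, stateful_shuffle=True, seed=0).repeat()

dloader = DataLoader(
    # Wrapping both the sampler and the map-style dataset in an iterable dataset
    # moves the CutSet iteration (and hence the shard I/O) into the worker processes.
    IterableDatasetWrapper(
        dataset=K2SpeechRecognitionDataset(),
        sampler=DynamicBucketingSampler(cuts, max_duration=100.0, shuffle=True),
    ),
    batch_size=None,  # batches are formed by the sampler, not the DataLoader
    num_workers=2,
)
# Iterate `dloader` in the training loop; note it never ends because of .repeat().
```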
Great tutorial! I was able to get a nice overview of Lhotse Shar.
Finally found some time to write it down a bit. It doesn't show every possible option, but it should be enough to get started. I think in general this workflow might have been a bit simpler if 3 years ago I had known that Lhotse would go in this direction :) maybe some simplifications (and breaking changes) can be made in the future, but I don't plan them right now.