
Infinite random-file random-line stateless sampler #1102

Merged
pzelasko merged 13 commits into master from feature/povey-sampler on Jul 21, 2023

Conversation

pzelasko
Collaborator

As proposed in #1096

It's pretty much what was described, with at most minor deviations. @danpovey let me know what you think; you can find the usage in the unit tests. I couldn't think of a name that would describe what this sampler does, so I initially called it PoveySampler -- we should continue to build on the legacy of the Povey feature window :)

PathlikeAndScale = Tuple[Pathlike, float]


class PoveySampler(torch.utils.data.Sampler, Dillable):
Contributor

There is a "povey window".
Now we also have a "povey sampler" 😀

Collaborator

LOL, but let's call it StatelessSampler. With the Povey Window, I think people didn't get that it was basically a joke (since no-one really cares about things like windows these days), and they perhaps thought that it was a kind of self-aggrandizement.

Collaborator Author

I don't mind StatelessSampler, but do let me know if you change your mind 😂
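For orientation, here is a rough usage sketch of the sampler under its new name. The import path and constructor arguments (cuts_paths as (path, scale) pairs, index_path, max_duration, base_seed) are assumptions pieced together from this thread; the unit tests in the PR show the actual usage.

```python
# A sketch only: argument names and the import path are assumptions, not the final API.
from lhotse.dataset.sampling.stateless import StatelessSampler  # assumed import path

sampler = StatelessSampler(
    cuts_paths=[
        ("data/train-clean-100.jsonl", 1.0),  # (manifest path, sampling weight)
        ("data/train-other-500.jsonl", 0.5),
    ],
    index_path="data/.stateless_sampler_index",  # where the line-offset index files live
    max_duration=600.0,  # mini-batch size expressed as total seconds of audio
    base_seed=42,        # fix it explicitly so the run can be reproduced
)

# The sampler is an infinite, stateless iterator over mini-batches of cuts.
for step, cuts in enumerate(sampler):
    if step == 10:
        break
```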



.. note:: This sampler works only with uncompressed jsonl manifests, as it creates extra index files with line byte offsets to quickly find and sample JSON lines.
This means this sampler will not work with the WebDataset and Lhotse Shar data formats.
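To make the note concrete, here is a minimal illustration of the line-byte-offset idea (not the PR's actual implementation, which persists the index to extra files next to the manifest): record where each JSON line starts, then seek straight to a uniformly sampled offset.

```python
import json
import random


def build_line_index(jsonl_path: str) -> list:
    """Return the byte offset at which every line of an uncompressed jsonl file starts."""
    offsets = []
    with open(jsonl_path, "rb") as f:
        offset = f.tell()
        line = f.readline()
        while line:
            offsets.append(offset)
            offset = f.tell()
            line = f.readline()
    return offsets


def sample_line(jsonl_path: str, offsets, rng: random.Random) -> dict:
    """Seek to a uniformly chosen line and parse it as JSON, without scanning the file."""
    with open(jsonl_path, "rb") as f:
        f.seek(rng.choice(offsets))
        return json.loads(f.readline())


rng = random.Random(0)
offsets = build_line_index("cuts.jsonl")  # hypothetical manifest path
print(sample_line("cuts.jsonl", offsets, rng))
```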
Collaborator
@danpovey danpovey Jul 19, 2023

Perhaps we should mention that if you restart training, you should set the random seed (in python's random module) to some function of the step you restarted at, to ensure that you don't see the exact same data that you saw at the beginning of training (or at some previous restart point). This will ensure that the random seeds used in the generator are different.

Collaborator Author

Good point, I'll also add some logging that prints out the random seeds and maybe other info at initialization so it'll be possible to keep track of these things between experiments.

@pzelasko
Collaborator Author

Check it now: I changed it to use the OS TRNG for sampling the base seed (if available), so that it doesn't depend on Python's global seed at all.
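For context, "OS TRNG" here means drawing the base seed from the operating system's entropy source rather than from Python's seeded RNG; roughly like the following (illustrative only, not necessarily the exact call used in the PR):

```python
import secrets

# Independent of random.seed(...), so the sampler's base seed does not
# depend on Python's global RNG state.
base_seed = secrets.randbelow(2**32)
print(f"StatelessSampler: using base_seed={base_seed}")  # logged for later reproduction
```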

@pzelasko
Collaborator Author

I'll merge it, let me know later if it works as you expected.

@pzelasko pzelasko merged commit e094aa6 into master Jul 21, 2023
@pzelasko pzelasko deleted the feature/povey-sampler branch July 21, 2023 23:11
@danpovey
Collaborator

danpovey commented Jul 22, 2023

I'm not sure about the decision to use the TRNG. My plan was to have the base seed depend on the Python RNG state. The idea was that, in the main thread, we set the RNG state to be a function of the start iteration if you are restarting, so a restarted training run sees different data. On the other hand, it would be consistent between runs, so that if you get a crash it can be debugged. Otherwise debugging crashes would be quite hard!

We should of course explain what is supposed to happen regarding setting the RNG state.

@pzelasko
Collaborator Author

It's going to print the seed it chose in the logs, and if you want to reproduce the run, you can pass the argument base_seed=<value> to the sampler (by default it's None, which means the seed is chosen at random). The same argument can be used to pass a seed of your own choice.

@danpovey
Collaborator

Hm, OK. When we use this I plan to pass in something like the start iteration then. If the secrets module gives a different number per worker, it seems to me it would be quite painful to pass in separate numbers per worker in order to reproduce a crash.

@pzelasko
Collaborator Author

You would be passing in only the base_seed, which is then modified for each node+worker combination, similarly to the LM dataset code you shared earlier. See the snippet linked here.

https://github.com/lhotse-speech/lhotse/pull/1102/files#diff-4b4592f9922837830d7c5a7fa206aa9c9e8b9c000f9125dc6a651d8d6e1ed446R180

Just make sure to use the IterableDatasetWrapper as described here, so that the sampler is placed into the worker subprocess (and can correctly read the worker ID).

https://github.com/lhotse-speech/lhotse/pull/1102/files#diff-4b4592f9922837830d7c5a7fa206aa9c9e8b9c000f9125dc6a651d8d6e1ed446R30
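The per node+worker derivation being described might look roughly like the following (an illustrative sketch; the exact way the PR combines rank, worker ID, and base seed is in the linked diff and may differ):

```python
import torch.distributed as dist
from torch.utils.data import get_worker_info


def derive_seed(base_seed: int) -> int:
    """Combine a shared base_seed with the DDP rank and dataloader worker ID so that
    every (node, worker) pair samples a different stream of lines. This only works if
    the sampler lives inside the worker subprocess (e.g. via IterableDatasetWrapper);
    otherwise get_worker_info() returns None and all workers would look identical."""
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    info = get_worker_info()
    worker_id = info.id if info is not None else 0
    num_workers = info.num_workers if info is not None else 1
    return base_seed + rank * num_workers + worker_id
```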

@danpovey
Collaborator

OK. I think it's good if we encourage people to use it with IterableDatasetWrapper and also with base_seed (set as a function of at least the start iteration for restarts), because if they don't use base_seed, in general they'll end up with different base_seed values in different workers, and reproducing failures would be hard. Debugging isn't something you normally plan in advance to have to do.
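Putting the recommendations together, the intended setup would presumably look something like this (a sketch under assumptions: the StatelessSampler and IterableDatasetWrapper import paths and keyword arguments are taken from this thread and from lhotse's usual dataloading pattern, not verified against the merged code; K2SpeechRecognitionDataset is just an example dataset):

```python
from torch.utils.data import DataLoader

from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper  # assumed import path
from lhotse.dataset.sampling.stateless import StatelessSampler      # assumed import path

start_iter = 0  # when restarting training, set this to the step you resume from

# Deriving base_seed from the start iteration gives a restarted run different data
# than the original run, while keeping each run reproducible for debugging crashes.
base_seed = 1000 * start_iter

sampler = StatelessSampler(  # argument names as sketched earlier in this thread
    cuts_paths=[("data/train.jsonl", 1.0)],
    index_path="data/.stateless_sampler_index",
    max_duration=600.0,
    base_seed=base_seed,
)

dataset = K2SpeechRecognitionDataset()

# Wrapping dataset + sampler moves the sampler into each dataloader worker subprocess,
# so it can read its worker ID and derive a distinct per-worker seed from base_seed.
dloader = DataLoader(
    IterableDatasetWrapper(dataset=dataset, sampler=sampler),
    batch_size=None,
    num_workers=4,
)
```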

@pzelasko
Collaborator Author

Alright, on second thought I'll simplify this code by removing the TRNG and requiring base_seed to be passed in by the user.

@pzelasko
Collaborator Author

See: #1109

Successfully merging this pull request may close these issues:

Request for feature (?), a way to sample from large jsonl manifests.