
Infinite random-file random-line stateless sampler #1102

Merged
pzelasko merged 13 commits into master from feature/povey-sampler on Jul 21, 2023

Conversation

pzelasko
Collaborator

As proposed in #1096

It's pretty much what was described, with at most minor deviations. @danpovey let me know what you think; you can find the usage in the unit tests. I couldn't think of a name that would describe what this sampler does, so I initially called it PoveySampler -- we should continue to build on the legacy of the Povey feature window :)

PathlikeAndScale = Tuple[Pathlike, float]


class PoveySampler(torch.utils.data.Sampler, Dillable):
Contributor

There is a "povey window".
Now we also have a "povey sampler" 😀

Collaborator

LOL, but let's call it StatelessSampler. With the Povey Window, I think people didn't get that it was basically a joke (since no-one really cares about things like windows these days), and they perhaps thought that it was a kind of self-aggrandizement.

Collaborator Author

I don't mind StatelessSampler, but do let me know if you change your mind 😂
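For orientation, here is a rough usage sketch of the sampler under its new name. The import path and constructor arguments (cuts_paths as (path, scale) pairs, index_path, max_duration, base_seed) are assumptions pieced together from this thread; the unit tests in the PR show the actual usage.

```python
# A sketch only: argument names and the import path are assumptions, not the final API.
from lhotse.dataset.sampling.stateless import StatelessSampler  # assumed import path

sampler = StatelessSampler(
    cuts_paths=[
        ("data/train-clean-100.jsonl", 1.0),  # (manifest path, sampling weight)
        ("data/train-other-500.jsonl", 0.5),
    ],
    index_path="data/.stateless_sampler_index",  # where the line-offset index files live
    max_duration=600.0,  # mini-batch size expressed as total seconds of audio
    base_seed=42,        # fix it explicitly so the run can be reproduced
)

# The sampler is an infinite, stateless iterator over mini-batches of cuts.
for step, cuts in enumerate(sampler):
    if step == 10:
        break
```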



.. note:: This sampler works only with uncompressed jsonl manifests, as it creates extra index files with line byte offsets to quickly find and sample JSON lines.
This means this sampler will not work with the WebDataset and Lhotse Shar data formats.
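To make the note concrete, here is a minimal illustration of the line-byte-offset idea (not the PR's actual implementation, which persists the index to extra files next to the manifest): record where each JSON line starts, then seek straight to a uniformly sampled offset.

```python
import json
import random


def build_line_index(jsonl_path: str) -> list:
    """Return the byte offset at which every line of an uncompressed jsonl file starts."""
    offsets = []
    with open(jsonl_path, "rb") as f:
        offset = f.tell()
        line = f.readline()
        while line:
            offsets.append(offset)
            offset = f.tell()
            line = f.readline()
    return offsets


def sample_line(jsonl_path: str, offsets, rng: random.Random) -> dict:
    """Seek to a uniformly chosen line and parse it as JSON, without scanning the file."""
    with open(jsonl_path, "rb") as f:
        f.seek(rng.choice(offsets))
        return json.loads(f.readline())


rng = random.Random(0)
offsets = build_line_index("cuts.jsonl")  # hypothetical manifest path
print(sample_line("cuts.jsonl", offsets, rng))
```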
Collaborator
@danpovey danpovey Jul 19, 2023

Perhaps we should mention that if you restart training, you should set the random seed (in python's random module) to some function of the step you restarted at, to ensure that you don't see the exact same data that you saw at the beginning of training (or at some previous restart point). This will ensure that the random seeds used in the generator are different.

Collaborator Author

Good point, I'll also add some logging that prints out the random seeds and maybe other info at initialization so it'll be possible to keep track of these things between experiments.

@pzelasko
Collaborator Author

Check it now: I changed it to use the OS TRNG for sampling the base seed (if available), so that it doesn't depend on Python's global seed at all.
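For context, "OS TRNG" here means drawing the base seed from the operating system's entropy source rather than from Python's seeded RNG; roughly like the following (illustrative only, not necessarily the exact call used in the PR):

```python
import secrets

# Independent of random.seed(...), so the sampler's base seed does not
# depend on Python's global RNG state.
base_seed = secrets.randbelow(2**32)
print(f"StatelessSampler: using base_seed={base_seed}")  # logged for later reproduction
```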

@pzelasko
Collaborator Author

I'll merge it, let me know later if it works as you expected.

@pzelasko pzelasko merged commit e094aa6 into master Jul 21, 2023
@pzelasko pzelasko deleted the feature/povey-sampler branch July 21, 2023 23:11
@danpovey
Collaborator

danpovey commented Jul 22, 2023

I'm not sure about the decision to use the TRNG. My plan was to have the base seed depend on the Python RNG state. The idea was that, in the main thread, we set the RNG state to be a function of the start iteration if you are restarting, so a restarted training run sees different data. On the other hand, it would be consistent between runs, so that if you get a crash it can be debugged. Otherwise debugging crashes would be quite hard!

We should of course explain what is supposed to happen regarding setting the RNG state.

@pzelasko
Collaborator Author

It's going to print the seed it chose in the logs, and if you want to reproduce the run, you can pass the argument base_seed=<value> to the sampler (by default it's None, which means the seed is chosen at random). The same argument can be used to pass a seed of your own choice.

@danpovey
Collaborator

Hm, OK. When we use this I plan to pass in something like the start iteration then. If the secrets module gives a different number per worker, it seems to me it would be quite painful to pass in separate numbers per worker in order to reproduce a crash.

@pzelasko
Collaborator Author

You would be passing in only the base_seed, which is then modified for each node+worker combination, similarly to the LM dataset code you shared earlier. See the snippet linked here.

https://github.com/lhotse-speech/lhotse/pull/1102/files#diff-4b4592f9922837830d7c5a7fa206aa9c9e8b9c000f9125dc6a651d8d6e1ed446R180

Just make sure to use the IterableDatasetWrapper as described here, so that the sampler is placed into the worker subprocess (and can correctly read the worker ID).

https://github.com/lhotse-speech/lhotse/pull/1102/files#diff-4b4592f9922837830d7c5a7fa206aa9c9e8b9c000f9125dc6a651d8d6e1ed446R30
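The per node+worker derivation being described might look roughly like the following (an illustrative sketch; the exact way the PR combines rank, worker ID, and base seed is in the linked diff and may differ):

```python
import torch.distributed as dist
from torch.utils.data import get_worker_info


def derive_seed(base_seed: int) -> int:
    """Combine a shared base_seed with the DDP rank and dataloader worker ID so that
    every (node, worker) pair samples a different stream of lines. This only works if
    the sampler lives inside the worker subprocess (e.g. via IterableDatasetWrapper);
    otherwise get_worker_info() returns None and all workers would look identical."""
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    info = get_worker_info()
    worker_id = info.id if info is not None else 0
    num_workers = info.num_workers if info is not None else 1
    return base_seed + rank * num_workers + worker_id
```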

@danpovey
Collaborator

OK. I think it's good if we encourage people to use it with IterableDatasetWrapper and also with base_seed (set as a function of at least the start iteration for restarts), because if they don't use base_seed, in general they'll end up with different base_seed values in different workers, and reproducing failures would be hard. Debugging isn't something you normally plan in advance to have to do.
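Putting the recommendations together, the intended setup would presumably look something like this (a sketch under assumptions: the StatelessSampler and IterableDatasetWrapper import paths and keyword arguments are taken from this thread and from lhotse's usual dataloading pattern, not verified against the merged code; K2SpeechRecognitionDataset is just an example dataset):

```python
from torch.utils.data import DataLoader

from lhotse.dataset import K2SpeechRecognitionDataset
from lhotse.dataset.iterable_dataset import IterableDatasetWrapper  # assumed import path
from lhotse.dataset.sampling.stateless import StatelessSampler      # assumed import path

start_iter = 0  # when restarting training, set this to the step you resume from

# Deriving base_seed from the start iteration gives a restarted run different data
# than the original run, while keeping each run reproducible for debugging crashes.
base_seed = 1000 * start_iter

sampler = StatelessSampler(  # argument names as sketched earlier in this thread
    cuts_paths=[("data/train.jsonl", 1.0)],
    index_path="data/.stateless_sampler_index",
    max_duration=600.0,
    base_seed=base_seed,
)

dataset = K2SpeechRecognitionDataset()

# Wrapping dataset + sampler moves the sampler into each dataloader worker subprocess,
# so it can read its worker ID and derive a distinct per-worker seed from base_seed.
dloader = DataLoader(
    IterableDatasetWrapper(dataset=dataset, sampler=sampler),
    batch_size=None,
    num_workers=4,
)
```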

@pzelasko
Collaborator Author

Alright, on second thought I'll simplify this code by removing the TRNG and requiring base_seed to be passed in by the user.

@pzelasko
Collaborator Author

See: #1109

Successfully merging this pull request may close these issues:

Request for feature (?), a way to sample from large jsonl manifests.