Infinite random-file random-line stateless sampler #1102
Conversation
lhotse/dataset/sampling/povey.py
Outdated
PathlikeAndScale = Tuple[Pathlike, float]
class PoveySampler(torch.utils.data.Sampler, Dillable):
There is a "povey window".
Now we also have a "povey sampler" 😀
LOL, but let's call it StatelessSampler. With the Povey Window, I think people didn't get that it was basically a joke (since no-one really cares about things like windows these days), and they perhaps thought that it was a kind of self-aggrandizement.
I don't mind StatelessSampler, but do let me know if you change your mind 😂
lhotse/dataset/sampling/povey.py
Outdated
.. note:: This sampler works only with uncompressed jsonl manifests, as it creates extra index files with line byte offsets to quickly find and sample JSON lines.
This means this sampler will not work with the WebDataset and Lhotse Shar data formats.
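The byte-offset indexing the note describes can be sketched roughly like this (a simplified illustration, not the actual Lhotse implementation; the function names here are made up):

```python
import json
import random


def build_line_index(jsonl_path):
    """Record the byte offset at which every line of an uncompressed
    JSONL file begins, so any line can later be fetched with one seek."""
    offsets = []
    with open(jsonl_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def sample_random_line(jsonl_path, offsets, rng):
    """Seek to a uniformly chosen line and parse it as JSON."""
    with open(jsonl_path, "rb") as f:
        f.seek(rng.choice(offsets))
        return json.loads(f.readline())
```

This is also why compressed or sharded formats don't work here: random `seek()` into the middle of a gzip stream or a WebDataset tar shard isn't possible without decompressing from the start.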
Perhaps we should mention that if you restart training, you should set the random seed (in python's random module) to some function of the step you restarted at, to ensure that you don't see the exact same data that you saw at the beginning of training (or at some previous restart point). This will ensure that the random seeds used in the generator are different.
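A minimal sketch of the restart-seeding idea described above (the helper name and the mixing constant are assumptions for illustration, not part of the sampler's API):

```python
import random


def seed_for_restart(base_seed: int, start_step: int) -> int:
    """Derive a seed that differs per restart point but stays reproducible:
    the same (base_seed, start_step) pair always yields the same stream,
    while resuming at a different step yields different data."""
    # Any deterministic mix that separates distinct steps works here;
    # the large prime is just one arbitrary choice.
    return base_seed + 1_000_003 * start_step


# On (re)start, before constructing the sampler:
base_seed = 42        # hypothetical experiment-level seed
start_step = 15_000   # the step we are resuming from (0 for a fresh run)
random.seed(seed_for_restart(base_seed, start_step))
```

The point is that a crash at step N can be reproduced by restarting with the same `(base_seed, start_step)`, while a restart does not replay the exact data seen at the beginning of training.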
Good point, I'll also add some logging that prints out the random seeds and maybe other info at initialization so it'll be possible to keep track of these things between experiments.
Check it now, I changed it to use the OS TRNG for sampling the base seed (if available) so that it doesn't depend on Python's global seed at all.
I'll merge it, let me know later if it works as you expected.
I'm not sure about the TRNG decisions. My plan was to have the base seed depend on the Python RNG state. The idea was that in the main thread, we set the RNG state to be a function of the start-iter if you are restarting, so a restarted training run gets a different random seed. OTOH it would be consistent between runs, so that if you get a crash it can be debugged. Otherwise debugging crashes would be quite hard! We should of course explain what is supposed to happen re: setting the RNG state.
It's going to print the seed it chose in the logs, and if you want to reproduce the run, you can pass the argument
Hm, OK. When we use this I plan to pass in something like the start iteration then.
You would be passing in only the base_seed, which is then modified in each node+worker combination similarly to the LM dataset code you shared earlier. See the snippet linked here. Just make sure to use the IterableDatasetWrapper as described here so that the sampler is placed into the worker subprocess (so it can correctly read out the worker ID).
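The per-node, per-worker seed derivation described above might look roughly like this (a hypothetical helper, not the actual sampler code; in a real DataLoader worker the worker id would come from torch.utils.data.get_worker_info(), which is why the sampler must live inside the worker subprocess):

```python
def per_worker_seed(base_seed: int, node_rank: int,
                    worker_id: int, num_workers: int) -> int:
    """Derive a distinct, reproducible seed for every (node, worker)
    pair from the single user-provided base_seed."""
    # Flatten the (node, worker) grid into a unique offset so no two
    # dataloading processes in the job sample the same stream.
    return base_seed + node_rank * num_workers + worker_id
```

With a single base_seed every worker's stream is different but deterministic, so a failing run can still be replayed.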
OK. I think it's good if we encourage people to use it with IterableDatasetWrapper and also the base_seed (set as a function of at least the start-iter for restarts), because if they don't use the base_seed, in general they'll end up with different base_seed values in different workers, and reproducing failures would be hard. Debugging isn't something you normally plan in advance to have to do.
Alright, on second thought I'll simplify this code by removing TRNG and requiring
See: #1109
As proposed in #1096
It's pretty much what was described, with at most minor deviations. @danpovey let me know what you think; you can find the usage in the unit tests. I couldn't think of a name that would describe what this sampler does, so I initially called it PoveySampler -- we should continue to build on the legacy of the Povey feature window :)