-
Notifications
You must be signed in to change notification settings - Fork 170
Open
Labels
Description
The buffer_size parameter is currently fairly inconsistent across datapipes:
| name | default buffer_size |
infinite buffer_size |
warn on infinite |
|---|---|---|---|
| Demultiplexer | 1e3 | -1 | yes |
| Forker | 1e3 | -1 | yes |
| Grouper | 1e4 | N/A | N/A |
| Shuffler | 1e4 | N/A | N/A |
| MaxTokenBucketizer | 1e3 | N/A | N/A |
| UnZipper | 1e3 | -1 | yes |
| IterKeyZipper | 1e4 | None | no |
Here are my suggestion on how to unify this:
- Use the same default
buffer_sizeeverywhere. It makes little difference whether we use1e3or1e4given that it is tightly coupled with the data we know nothing about. Given today's hardware / datasets, I would go with 1e4, but no strong opinion. - Give every datapipe with buffer the ability for an infinite buffer. Otherwise users will just be annoyed and use a workaround. For example,
torchvisionsimply usesINFINITE_BUFFER_SIZE = 1_000_000_000, which for all intents and purposes lives up to its name. Which sentinel we use, i.e.-1orNone, again makes little difference. I personally would useNoneto have a clear separation, but again no strong opinion other than being consistent. - Do not warn on infinite buffer sizes. Especially since infinite buffer is not the default behavior, the user is expected to know what they are doing when setting
buffer_size=None. I'm all for having a warning like this in the documentation, but I'm strongly against a runtime warning. For example,torchvisiondatasets need to use an infinite buffer everywhere. Thus, by using the infinite buffer sentinel, users would always get runtime warnings although neither them nor we did anything wrong.
NivekT and VitalyFedyunin