[BE] Unify `buffer_size` across datapipes

The `buffer_size` parameter is currently fairly inconsistent across datapipes:

| name               |   default `buffer_size` | infinite `buffer_size`   | warn on infinite   |
|--------------------|-------------------------|--------------------------|--------------------|
| Demultiplexer      |                    1e3 | -1                       | yes                |
| Forker             |                    1e3 | -1                       | yes                |
| Grouper            |                   1e4 | N/A                      | N/A                |
| Shuffler           |                   1e4 | N/A                      | N/A                |
| MaxTokenBucketizer |                    1e3 | N/A                      | N/A                |
| UnZipper           |                    1e3 | -1                       | yes                |
| IterKeyZipper      |                   1e4 | None                     | no                 |

Here are my suggestions on how to unify this:

- Use the same default `buffer_size` everywhere. It makes little difference whether we use `1e3` or `1e4` given that it is tightly coupled with the data we know nothing about. Given today's hardware / datasets, I would go with 1e4, but no strong opinion.
- Give every datapipe with a buffer the option of an infinite buffer. Otherwise users will just be annoyed and use a workaround. For example, `torchvision` simply uses [`INFINITE_BUFFER_SIZE = 1_000_000_000`](https://github.com/pytorch/vision/blob/1db8795733b91cd6dd62a0baa7ecbae6790542bc/torchvision/prototype/datasets/utils/_internal.py#L42-L43), which for all intents and purposes lives up to its name. Which sentinel we use, i.e. `-1` or `None`, again makes little difference. I personally would use `None` to have a clear separation from valid integer sizes, but again no strong opinion other than being consistent.
- Do not warn on infinite buffer sizes. Especially since an infinite buffer is not the default behavior, the user is expected to know what they are doing when setting `buffer_size=None`. I'm all for having a warning like this in the documentation, but I'm strongly against a runtime warning. For example, `torchvision` datasets need to use an infinite buffer everywhere. Thus, by using the infinite buffer sentinel, users would always get runtime warnings although neither they nor we did anything wrong.
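
To make the three points above concrete, here is a rough sketch of what a shared convention could look like. None of these names exist in the library today: `DEFAULT_BUFFER_SIZE`, `check_buffer_size`, and `buffer_is_full` are hypothetical helpers made up for illustration, and the default of `10_000` is just the `1e4` suggested above.

```python
from collections import deque
from typing import Deque, Optional

# Hypothetical shared default, i.e. the 1e4 proposed above.
DEFAULT_BUFFER_SIZE = 10_000


def check_buffer_size(buffer_size: Optional[int] = DEFAULT_BUFFER_SIZE) -> Optional[int]:
    """Validate a user-supplied buffer size (hypothetical helper).

    ``None`` is the sentinel for an unlimited buffer and is passed through
    silently: no runtime warning is emitted.
    """
    if buffer_size is None:
        return None
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(buffer_size, bool) or not isinstance(buffer_size, int):
        raise TypeError(f"buffer_size must be an int or None, got {buffer_size!r}")
    if buffer_size <= 0:
        raise ValueError(f"buffer_size must be positive or None, got {buffer_size}")
    return buffer_size


def buffer_is_full(buffer: Deque, buffer_size: Optional[int]) -> bool:
    """Return True if the buffer has reached its limit; never True for None."""
    return buffer_size is not None and len(buffer) >= buffer_size
```

With something like this, every buffered datapipe could call `check_buffer_size` once in `__init__` and `buffer_is_full` wherever it currently compares against its own limit, so the default, the sentinel, and the (absent) warning are defined in exactly one place.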

[BE] Unify `buffer_size` across datapipes #335

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

name	default `buffer_size`	infinite `buffer_size`	warn on infinite
Demultiplexer	1e3	-1	yes
Forker	1e3	-1	yes
Grouper	1e4	N/A	N/A
Shuffler	1e4	N/A	N/A
MaxTokenBucketizer	1e3	N/A	N/A
UnZipper	1e3	-1	yes
IterKeyZipper	1e4	None	no

[BE] Unify buffer_size across datapipes #335

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[BE] Unify `buffer_size` across datapipes #335