-
Notifications
You must be signed in to change notification settings - Fork 624
Expose common dataloader args #2097
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6763cc0 to
990d654
Compare
wwwjn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I feel like I'm slightly lean towards using kwargs instead of adding these parameters one by one. This is because the StatefulDataLoader() has a lot of supported field and it's hard to say some of them are "common" in different use cases.
Can you explain more on "but that can easily complicate things"? We can just pass all the kwargs to StatefulDataLoader and let it to check correctness. wdyt @tianyu-l
| num_workers=num_workers, | ||
| persistent_workers=persistent_workers, | ||
| prefetch_factor=prefetch_factor, | ||
| pin_memory=pin_memory, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about introducing a catch all kwargs to make it easier to specify args but that can easily complicate things (validation checks, duplication, existing defined named args in function definitions etc).
These are valid concerns. For now I'm leaning towards keeping things simple by passing **kwargs around.
Does it make sense if we only make these args explicit when sending to the actual init of StatefulDataLoader and not passing in all **kwargs from the input of ParallelAwareDataloader? The point is to not accidentally hit error inside StatefulDataLoader.
| self, | ||
| dataset: IterableDataset, | ||
| dp_rank: int, | ||
| dp_world_size: int, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you help change this: let's keep at most one positional arg (dataset) and others to be kwargs.
This diff introduces common dataloader args which are supported by statefuldataloader (and torch.utils.data dataloader). Users should be able to use them in their config files.
I was thinking about introducing a catch all kwargs to make it easier to specify args but that can easily complicate things (validation checks, duplication, existing defined named args in function definitions etc).