SplitSampler to have configurable sampling ability #792
Comments
@eric-haibin-lin any thoughts since you are the original author of this?
Sorry for the late reply. Would this SplitFilter API be something that can help with your case?
So in my use case, I use custom datasets that inherit from Dataset, and from glancing at SplitFilter it looks like it is specific to TextLineDataset. Can we extend this functionality to all types of datasets? I'll also take a closer look and confirm with you.
@chandana1332 that's correct. The purpose of the filter is to allow datasets like TextLineDataset to load a subset of the dataset into memory, in case the full dataset is too large to fit. The split sampler, on the other hand, is designed for data.SimpleDatasetStream, where the data is already split into multiple parts stored on disk. But in many cases there is a single file and we still want to do multi-GPU training. I do agree that we should enhance the sampler to make multi-GPU training easier. Actually there's …
Hi @chandana1332, you can now use the Dataset.shard API in the mxnet nightly: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataset.py#L67-L96 You can first shard the dataset, then create the sampler.
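The shard-then-sample idea can be illustrated without MXNet installed. The sketch below uses a hypothetical `shard_indices` helper that partitions a dataset of length `n` into contiguous shards, with any remainder spread over the first few shards; MXNet's actual `Dataset.shard` implementation may split differently, so treat this as an illustration of the pattern, not its exact semantics.

```python
import random

def shard_indices(n, num_shards, index):
    # Hypothetical helper (not MXNet's actual code): contiguous partition,
    # with the remainder distributed one element each to the first shards.
    base, rem = divmod(n, num_shards)
    start = index * base + min(index, rem)
    length = base + (1 if index < rem else 0)
    return list(range(start, start + length))

# Pattern from the comment above: shard first, then sample within the shard.
data = list(range(10))
part = [data[i] for i in shard_indices(len(data), num_shards=4, index=1)]

order = part[:]
random.Random(0).shuffle(order)  # per-worker random sampling over its shard
```

Every element lands in exactly one shard, so each worker in a multi-GPU job sees a disjoint slice of the data and applies its own sampling on top.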
Closing it for now. Feel free to comment or reopen if you have further questions.
Description
I'm currently trying to integrate Horovod with Gluon. I came across SplitSampler, which virtually partitions the data and randomly samples from a particular partition. I would like to use this partitioning mechanism with other samplers, such as the bucket sampler. Would it be possible to make the choice of sampling mechanism in SplitSampler configurable, rather than forcing random sampling?