
SplitSampler to have configurable sampling ability #792

Closed
chandana1332 opened this issue Jun 24, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@chandana1332

Description

I'm currently trying to integrate Horovod with Gluon. I came across SplitSampler, which virtually partitions the data and samples randomly from a particular partition. I would like to use this partitioning mechanism with other samplers, such as the bucket sampler. Would it be possible to make the choice of sampling mechanism in SplitSampler configurable rather than forcing random sampling?
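For context, SplitSampler is used roughly like this today (a minimal sketch; the toy dataset and batch size are placeholders, and horovod.mxnet is assumed to be initialized):

```python
import horovod.mxnet as hvd
from mxnet.gluon.data import ArrayDataset, DataLoader
from gluonnlp.data import SplitSampler

hvd.init()
dataset = ArrayDataset(list(range(1000)))  # toy stand-in for a real dataset

# Each worker samples randomly, but only from its own logical partition.
sampler = SplitSampler(len(dataset), num_parts=hvd.size(), part_index=hvd.rank())
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```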

References

@chandana1332 chandana1332 added the enhancement New feature or request label Jun 24, 2019
@chandana1332
Author

@eric-haibin-lin any thoughts since you are the original author of this?

@eric-haibin-lin
Member

Sorry for the late reply. Would this SplitFilter API be something that could help with your case?
#741
Basically the idea is to shard the dataset into multiple partitions, and each GPU/worker loads only its partition by applying the filter to the dataset. Please feel free to provide comments/feedback :)
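The exact interface lives in #741; as a hypothetical illustration of the idea (split_filter here is an invented name, not the real API):

```python
def split_filter(index, num_parts, part_index):
    """Return True if the sample at `index` belongs to this worker's shard."""
    return index % num_parts == part_index

all_lines = ['line %d' % i for i in range(10)]  # stands in for a large text file
# Worker 1 of 4 keeps lines 1, 5, 9, ... and never materializes the rest,
# so the full dataset does not have to fit into memory at once.
shard = [line for i, line in enumerate(all_lines)
         if split_filter(i, num_parts=4, part_index=1)]
```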

@chandana1332
Author

So in my use case I use custom datasets that inherit from Dataset, and from glancing at SplitFilter it looks like it is specific to TextLineDataset. Can we extend this functionality to all types of datasets?

I'll also take a closer look and confirm with you.
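Such a generalization could look something like this (a hypothetical wrapper; ShardedDataset is an invented name, not an existing class):

```python
from mxnet.gluon.data import ArrayDataset, Dataset

class ShardedDataset(Dataset):
    """View of `dataset` holding every `num_shards`-th sample, offset by `index`."""

    def __init__(self, dataset, num_shards, index):
        self._dataset = dataset
        self._indices = list(range(index, len(dataset), num_shards))

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, idx):
        return self._dataset[self._indices[idx]]

# Worker 1 of 4 sees samples 1, 5, 9, ...
shard = ShardedDataset(ArrayDataset(list(range(100))), num_shards=4, index=1)
```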

@eric-haibin-lin
Member

@chandana1332 that's correct. The purpose of the filter is to allow datasets like TextLineDataset to load a subset of the data into memory, in case the full dataset is too large to fit.

On the other hand, the split sampler is designed for data.SimpleDatasetStream, where the data is already split into multiple parts stored on disk. But in many cases there is only a single file and we still want to do multi-GPU training.

I do agree that we should enhance the sampler to make multi-GPU training easier. Actually, there is a num_shards option in data.FixedBucketSampler that splits the data into multiple logical shards. Currently it returns the indices for all shards on each iteration, and we could just add a shard_idx argument to support multi-process training. I'm currently busy with other tasks. Would this be something you'd like to contribute? I'm happy to give more pointers/explanations if you need :)

https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/data/sampler.py#L284-L289
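A rough sketch of that suggestion, assuming (per the comment above) that with num_shards set the sampler yields the batch indices for every shard on each iteration; the proposed shard_idx argument is emulated here on the caller's side with toy lengths:

```python
from gluonnlp.data import FixedBucketSampler

lengths = [10, 23, 7, 15, 42, 31, 8, 19]  # toy sequence lengths
sampler = FixedBucketSampler(lengths, batch_size=2, num_buckets=2, num_shards=4)

shard_idx = 1  # this worker's shard; the proposed argument would move this
               # selection inside the sampler itself
for shard_batches in sampler:
    my_batch = shard_batches[shard_idx]  # keep only this worker's batch of indices
```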

@eric-haibin-lin
Member

Hi @chandana1332, you can now use the Dataset.shard API in the mxnet nightly build: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataset.py#L67-L96

You can shard the dataset first, then create the sampler.
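A minimal sketch of that flow under a Horovod setup (the toy dataset is a placeholder):

```python
import horovod.mxnet as hvd
from mxnet.gluon.data import ArrayDataset, DataLoader

hvd.init()
dataset = ArrayDataset(list(range(1000)))  # toy stand-in for a real dataset

# Shard first so each worker sees a disjoint slice, then sample as usual.
shard = dataset.shard(hvd.size(), hvd.rank())
loader = DataLoader(shard, batch_size=32, shuffle=True)
```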

@eric-haibin-lin
Member

Closing this for now. Feel free to comment or reopen if you have further questions.
