
SplitSampler to have configurable sampling ability #792

Closed
chandana1332 opened this issue Jun 24, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@chandana1332

Description

I'm currently trying to integrate Horovod with Gluon. I came across SplitSampler, which virtually partitions the data and samples randomly from a particular partition. I would like to use this partitioning mechanism with other samplers, such as the bucket sampler. Would it be possible to make the choice of sampling mechanism in SplitSampler configurable rather than forcing random sampling?
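For context, SplitSampler is used roughly like this today (a minimal sketch; the toy dataset and batch size are placeholders, and horovod.mxnet is assumed to be initialized):

```python
import horovod.mxnet as hvd
from mxnet.gluon.data import ArrayDataset, DataLoader
from gluonnlp.data import SplitSampler

hvd.init()
dataset = ArrayDataset(list(range(1000)))  # toy stand-in for a real dataset

# Each worker samples randomly, but only from its own logical partition.
sampler = SplitSampler(len(dataset), num_parts=hvd.size(), part_index=hvd.rank())
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```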

References

@chandana1332 chandana1332 added the enhancement New feature or request label Jun 24, 2019
@chandana1332
Author

@eric-haibin-lin any thoughts since you are the original author of this?

@eric-haibin-lin
Member

Sorry for the late reply. Would this SplitFilter API be something that could help with your case?
#741
Basically the idea is to shard the dataset into multiple partitions, and each GPU/worker loads only its partition by applying the filter to the dataset. Please feel free to provide comments/feedback :)
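The exact interface lives in #741; as a hypothetical illustration of the idea (split_filter here is an invented name, not the real API):

```python
def split_filter(index, num_parts, part_index):
    """Return True if the sample at `index` belongs to this worker's shard."""
    return index % num_parts == part_index

all_lines = ['line %d' % i for i in range(10)]  # stands in for a large text file
# Worker 1 of 4 keeps lines 1, 5, 9, ... and never materializes the rest,
# so the full dataset does not have to fit into memory at once.
shard = [line for i, line in enumerate(all_lines)
         if split_filter(i, num_parts=4, part_index=1)]
```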

@chandana1332
Author

So in my use case I use custom datasets that inherit from Dataset, and from glancing at SplitFilter it looks like it is specific to TextLineDataset. Can we extend this functionality to all types of datasets?

I'll also take a closer look and confirm with you.
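Such a generalization could look something like this (a hypothetical wrapper; ShardedDataset is an invented name, not an existing class):

```python
from mxnet.gluon.data import ArrayDataset, Dataset

class ShardedDataset(Dataset):
    """View of `dataset` holding every `num_shards`-th sample, offset by `index`."""

    def __init__(self, dataset, num_shards, index):
        self._dataset = dataset
        self._indices = list(range(index, len(dataset), num_shards))

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, idx):
        return self._dataset[self._indices[idx]]

# Worker 1 of 4 sees samples 1, 5, 9, ...
shard = ShardedDataset(ArrayDataset(list(range(100))), num_shards=4, index=1)
```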

@eric-haibin-lin
Member

@chandana1332 that's correct. The purpose of the filter is to allow datasets like TextLineDataset to load a subset of the data into memory, in case the full dataset is too large to fit.

On the other hand, the split sampler is designed for data.SimpleDatasetStream, where the data is already split into multiple parts stored on disk. But in many cases there is only a single file and we still want to do multi-GPU training.

I do agree that we should enhance the sampler to make multi-GPU training easier. Actually, there is a num_shards option in data.FixedBucketSampler that splits the data into multiple logical shards. Currently it returns the indices for all shards on each iteration, and we could just add a shard_idx argument to support multi-process training. I'm currently busy with other tasks. Would this be something you'd like to contribute? I'm happy to give more pointers/explanations if you need :)

https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/data/sampler.py#L284-L289
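A rough sketch of that suggestion, assuming (per the comment above) that with num_shards set the sampler yields the batch indices for every shard on each iteration; the proposed shard_idx argument is emulated here on the caller's side with toy lengths:

```python
from gluonnlp.data import FixedBucketSampler

lengths = [10, 23, 7, 15, 42, 31, 8, 19]  # toy sequence lengths
sampler = FixedBucketSampler(lengths, batch_size=2, num_buckets=2, num_shards=4)

shard_idx = 1  # this worker's shard; the proposed argument would move this
               # selection inside the sampler itself
for shard_batches in sampler:
    my_batch = shard_batches[shard_idx]  # keep only this worker's batch of indices
```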

@eric-haibin-lin
Member

Hi @chandana1332, you can now use the Dataset.shard API in the mxnet nightly build: https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/data/dataset.py#L67-L96

You can shard the dataset first, then create the sampler.
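A minimal sketch of that flow under a Horovod setup (the toy dataset is a placeholder):

```python
import horovod.mxnet as hvd
from mxnet.gluon.data import ArrayDataset, DataLoader

hvd.init()
dataset = ArrayDataset(list(range(1000)))  # toy stand-in for a real dataset

# Shard first so each worker sees a disjoint slice, then sample as usual.
shard = dataset.shard(hvd.size(), hvd.rank())
loader = DataLoader(shard, batch_size=32, shuffle=True)
```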

@eric-haibin-lin
Member

Closing this for now. Feel free to comment or reopen if you have further questions.
