Rewrite dataloader, improves responsiveness and reliability #13447

zhreshold · 2018-11-28T23:30:47Z

Description

Rewrite dataloader with multiprocess.pool, the logic of dataloader with num_worker>0 is cleaner and more robust.

Changes with this PR:

Secure MXIndexRecordIO reader handler with process id, note that to perseve perf, no lock is introduced, separate processes will obtain new handlers to the record file.
Replaced manual queue handling with built-in multiprocessing.Pool, no matter how many epochs and iterators are launched, they share same process pool. The benefit is two fold: 1) It's faster, since new iterators will no longer trying to fork new processes 2) it's more reliable, the process pool is managed along with the entire life of dataloader, not iterators.
Benchmarked with ResNet 50 on ImageNet, p3-x16 instance with 8 GPUs and 60 CPU workers, images/sec bumped from (1500-2400) to (3000-3100), it's faster and more stable
Added an option prefetch=None:

prefetch : int, default is `num_workers * 2`
        The number of prefetching batches only works if `num_workers` > 0.
        If `prefetch` > 0, it allow worker process to prefetch certain batches before
        acquiring data from iterators.
        Note that using large prefetching batch will provide smoother bootstrapping performance,
        but will consume more shared_memory. Using smaller number may forfeit the purpose of using
        multiple worker processes, try reduce `num_workers` in this case.
        By default it defaults to `num_workers * 2`.

The rest functionality and APIs are intact with this PR. No action is required for exisiting users.

Thanks @szha @leezu for the ideas.
@piiswrong @eric-haibin-lin @hetong007 @YutingZhang for review and questions

This PR may affect gluoncv and gluonnlp dataloader variants, I will address them separately.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

ThomasDelteil · 2018-11-28T23:48:59Z

Thanks that's a massive improvement, especially for datasets that gets iterated in a small number of iterations (small datasets or big batch sizes). The killing and respawning of workers was one reason why I was questioning more and more the fact that Gluon is privileging the concept of epochs vs number of iterations in measuring progress through training for checkpointing, iterating, etc.

The next improvements I would love to see for the data loader would be the cold start issue at beginning of epoch:

prefetching the first batch of the next epoch before the previous epoch is completed.
Or when the queue is empty, first batch should be prefetched in a distributed manner across workers to avoid cold start problem on large batch-sizes. But that would become secondary if the first point is addressed.

vandanavk · 2018-11-29T02:10:52Z

@mxnet-label-bot add [Data-loading, pr-awaiting-review]

@sandeep-krishnamurthy @anirudhacharya @kalyc

leezu · 2018-11-29T07:04:44Z

python/mxnet/gluon/data/dataloader.py

+        self._prefetch = max(0, int(prefetch) if prefetch is not None else 2 * self._num_workers)
+        if self._num_workers > 0:
+            def worker_initializer(data):
+                global _worker_dataset


Won't this break when using multiple DataLoader with different datasets?

No, there is one global dataset per worker process, different data loaders have independent process pools

leezu · 2018-11-29T07:13:23Z

@ThomasDelteil regarding point 1, this is currently available in GluonNLP with PrefetchingStream. The current API is as follows: Consider a some Stream (generalization of Dataset that does not support random access but only sequential) of batches, we can wrap it in a PrefetchingStream as data = PrefetchingStream(streamofbatches). When calling iter(data) will begin prefetching. It is currently used as follows

    prefetched_iters = []
    for _ in range(min(args.num_prefetch_epoch, args.epochs)):
        prefetched_iters.append(iter(data))
    for epoch in range(args.epochs):
        if epoch + len(prefetched_iters) < args.epochs:
            prefetched_iters.append(iter(data))
        data_iter = prefetched_iters.pop(0)

The usage-pattern may be made more user-friendly, but the basic idea may also be applicable to Dataset API?

ThomasDelteil · 2018-11-29T18:43:35Z

@leezu, the problem with the stream is that the shuffling if it happens is to be done ahead of time so that the sequential access remains random.

Currently one solution that works and that I prefer, with the current Dataset and DataLoader API, is to use a continuous batch sampler and effectively iterate through your batches rather than using the concept of epoch.
You iterate your dataloader until you break out of the loop after N iterations.

You can design your continuous batch sampler to return an iterator that is effectively creating epochs times sequences of indices without replacement in a single iterator.

eric-haibin-lin

Looks good! Nice improvement

eric-haibin-lin · 2018-11-29T07:31:17Z

python/mxnet/recordio.py

@@ -182,6 +188,10 @@ def read(self):
            Buffer read.
        """
        assert not self.writable
+        if not self.pid == current_process().pid:
+            # in forked process, obtain a new handle
+            # print("PID not matching, reset")


remove unused code

leezu · 2018-11-30T05:16:57Z

@ThomasDelteil I'm not sure if I can follow the argument. You can view the current DataLoader as exposing a stream of batches. The sampling (shuffling) still happens lazily. In that case the data pipeline is composed of both Dataset (with random access for the lazy shuffling) and the notion of Stream over batches that can be prefetched

leezu · 2018-11-30T05:22:10Z

python/mxnet/gluon/data/dataloader.py

@@ -299,7 +278,7 @@ def shutdown(self):
            self._shutdown = True


-class DataLoader(object):
+class DataLoaderV1(object):


It seems the new DataLoader preserves the API, so why not remove DataLoaderV1?

If some specific implementations were relying on the old queue based methods, I think leaving the older version is still preferable for some user, what do you think?

I don't have a strong opinion about it. If you deem the reliance on queue as reliance on an undocumented implementation detail of V1, then it may be better to remove from Gluon and copy the V1 code over to the specific implementation? It may be reduce the number of APIs to maintain in the future. But I'm also OK with keeping it here if you prefer

ThomasDelteil · 2018-11-30T16:33:36Z

@leezu ok, I believe I hadn't well understood the PrefetchingStream API you described above. Makes sense now after second read.

eric-haibin-lin · 2018-11-30T18:09:02Z

I am merging this now so that it can be cherry-picked to v1.4 branch quickly

…iability (#13447) * fix recordio.py * rewrite dataloader with pool * fix batch as tuple * fix prefetching * fix pylint * picklable function * use pickle * add missing commit

eric-haibin-lin · 2018-11-30T23:30:41Z

I've added this PR to 1.4.x branch.

…ile (#13478) * updated to v1.5.0 * Bumped minor version from 1.4.0 to 1.5.0 on master * added Anirudh as maintainer for R package ... adding something useful and re-trigger PR check * Updated license file for clojure, onnx-tensorrt, gtest, R-package * Get the correct include path in pip package (#13452) * add find_include_path API * address reviewer comment * change return type from list to string * add unit test * address reviewer comment * address reviewer comment * address reviewer comment * address reviewer comment * fix include path problem in pip package * add comment * fix lint error * address reviewer comment * address reviewer comment * Use ~/.ccache as default ccache directory so is not cache is not erased on reboot (#13431) * Skip flaky test #13446 (#13480) * Rewrite dataloader with process pool, improves responsiveness and reliability (#13447) * fix recordio.py * rewrite dataloader with pool * fix batch as tuple * fix prefetching * fix pylint * picklable function * use pickle * add missing commit * Fix errors in docstrings for subgraph op; use code directive (#13463) * [MXNET-1158] JVM Memory Management Documentation (#13105) * update train_mnist * Add documentation for JVM Memory Management * update doc * address nit picks * address nit picks * Grammar and clarity edits for memory management doc * Edits for scala memory management * Update memory-management.md * Update memory-management.md * Update memory-management.md * capitalization fix * Update row_sparse tutorial (#13414) Update row_sparse tutorial * Add resiliency to onnx export code (#13426) * Added resiliency to onnx export code - With previous infer-shape implementation, if input shape was list instead of tuple or if extra non-existent parameters were provided, the code would still work. The fixes in this commit make sure that behavior is restored to prevent any compatibility issues with existing export code. * Fixed name of net in unittest * Fix pylint * [MXNET-1185] Support large array in several operators (part 1) (#13418) * fix a few operators with large arrays (# of elements) * fix bug in broadcast_div and add tests * address reviewer comment * add unit test * add empty line * retrigger CI * [MXNET-1210 ] Gluon Audio - Example (#13325) * Initialized the example * Addressed PR comments, about existing synset.txt file - no overwrite * RST - docstring issues fixed * added README * Addressed PR comments * Addressed PR comments, checking Divide by 0 * Raising error if format is not supported. * changed a line for ndarray of labels * Trigger CI * Trigger CI * PR comments addressed around skip_header argument * Addressed PR comments around librosa import * PR Comments * Passing lazy=lazy from argument * Added PR comments, labels to README.MD * Trigger CI * Addressing PR Comments in README * Modified README.md * Added example under audio folder * Retrigger CI * Retrigger CI * ONNX export: Instance normalization, Shape (#12920) * ONNX import/export: Make backend_rep common * ONNX export: Instance Normalization * ONNX export: Shape operator * Clarify dependency on OpenCV in CNN Visualization tutorial. (#13495) * clarify ops faq regarding docs strings (#13492) * Add graph_compact operator. (#13436) * add graph_compact. * fix. * add doc. * add tests for graph_compact. * address comments. * update docs. * trigger CI * Deprecate Jenkinsfile (#13474) * update github location for sampled_block.py (#13508) Updated to https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/model/sampled_block.py * #13453 [Clojure] - Add Spec Validations to the Optimizer namespace (#13499) * ONNX export: Logical operators (#12852) * Fix cmake options parsing in dev_menu (#13458) Add GPU+MKLDNN unittests to dev_menu * Revert "Manually track num_max_thread (#12380)" (#13501) This reverts commit 7541021. * Feature/mkldnn static 2 (#13503) * build mkldnn as static lib * update makefile to statically build mkldnn * build static mkldnn * fix static name * fix static name * update static for mac * rename mkldnn dep in ci * remove moving mkldnn dynamic lib * remove commented code * remove mkldnn dnaymic for unitest * force static for mkldnn lib * remove dynamic mkldnn bind * only link windows * add mkldnn.mk * try force linking * remove mkldnn dynanmic check * remove test mkldnn install * fix spacing * fix index * add artifacts * add comment about windows * remove static * update makefile * fix toctree Sphinx errors (#13489) * fix toctree errors * nudging file for CI * Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (#13527) * [MXNET-1234] Fix shape inference problems in Activation backward (#13409) * Provide a failing test for ReLU activation shape inference bug * Fix Activation backward shape inference fixes: #13333 * Add softsign Activation to test_gluon.py * Use activation in GPU if we are using CUDNN and not MKLDNN as it's happening right now * Don't disable MKLDNN

kohillyang · 2018-12-08T03:09:39Z

Is it possible to implement another Dataloader with threadpool? According to my test, Dataloader with threadpool is faster than Dataloader with multiprocess(about 2~5x), especially when the dataset is small or most of codes are written with C++, for example, rotating the image is written by Opencv library. And it has less bugs, less requirements. Because some people are running there program with docker and maybe their shared memory is not enough. And Dataloader with threadpool start more quickly as its overhead is smaller.

I am wonder if mxnet can give us a choise, using threadpool or multiprocess.pool.

leezu · 2018-12-08T04:25:45Z

You may just add an option to use multiprocessing.pool.ThreadPool instead of the process pool that is currently always used. Probably the worker_initializer function needs to be changed too due to the global variable. Do you want to contribute a PR?

anirudhacharya · 2018-12-08T04:41:09Z

Dataloader with threadpool is faster than Dataloader with multiprocess(about 2~5x), especially when the dataset is small or most of codes are written with C++, for example, rotating the image is written by Opencv library

what is the basis for this comment. If you have multiple CPU cores isn't multiprocessing more efficient, has easier memory management and bypasses GIL?

leezu · 2018-12-08T05:00:49Z

@anirudh2290 multiprocessing is often extremely inefficient. The reason is that any object that is passed between the processes must first be pickled and unpickled. Pickling is very slow. Using multi-threading of course requires extra care to bypass the GIL, but can make sense if the pickling overhead is too much. I assume @kohillyang has such case.

ThomasDelteil · 2018-12-08T05:36:13Z

Yes I think the problem with multi-threading is to make sure the user understands that to get a real parallel processing boost, it needs to be done in the underlying libraries like opencv to bypass the GIL. I think there's a case for it but it shouldn't be the default and have appropriate warning in the documentation. I believe different pickling protocols can be chosen to trade off compute for memory @leezu ? Maybe we could also expose that to the users

leezu · 2018-12-08T05:49:14Z

@ThomasDelteil yes, threading should certainly not be the default. Unfortunately Python does not allow choosing the Pickle protocol version in the multiprocessing context. "Multiprocessing uses default protocol and there is no simple (without hacking the stdlib) way to pass the right protocol manually." https://bugs.python.org/issue26507 https://bugs.python.org/issue28053 https://bugs.python.org/issue23403

Thus we are stuck with a high pickling overhead when using multiprocessing for the foreseeable future

zhreshold · 2018-12-08T07:35:01Z

Regarding threadpool version, I think it's quite easy to swap to. Performance-wise, it's really case-by-case. And @kohillyang 's suggestion is good to have, we can have both version ready without a lot of change.

…iability (apache#13447) * fix recordio.py * rewrite dataloader with pool * fix batch as tuple * fix prefetching * fix pylint * picklable function * use pickle * add missing commit

…ile (apache#13478) * updated to v1.5.0 * Bumped minor version from 1.4.0 to 1.5.0 on master * added Anirudh as maintainer for R package ... adding something useful and re-trigger PR check * Updated license file for clojure, onnx-tensorrt, gtest, R-package * Get the correct include path in pip package (apache#13452) * add find_include_path API * address reviewer comment * change return type from list to string * add unit test * address reviewer comment * address reviewer comment * address reviewer comment * address reviewer comment * fix include path problem in pip package * add comment * fix lint error * address reviewer comment * address reviewer comment * Use ~/.ccache as default ccache directory so is not cache is not erased on reboot (apache#13431) * Skip flaky test apache#13446 (apache#13480) * Rewrite dataloader with process pool, improves responsiveness and reliability (apache#13447) * fix recordio.py * rewrite dataloader with pool * fix batch as tuple * fix prefetching * fix pylint * picklable function * use pickle * add missing commit * Fix errors in docstrings for subgraph op; use code directive (apache#13463) * [MXNET-1158] JVM Memory Management Documentation (apache#13105) * update train_mnist * Add documentation for JVM Memory Management * update doc * address nit picks * address nit picks * Grammar and clarity edits for memory management doc * Edits for scala memory management * Update memory-management.md * Update memory-management.md * Update memory-management.md * capitalization fix * Update row_sparse tutorial (apache#13414) Update row_sparse tutorial * Add resiliency to onnx export code (apache#13426) * Added resiliency to onnx export code - With previous infer-shape implementation, if input shape was list instead of tuple or if extra non-existent parameters were provided, the code would still work. The fixes in this commit make sure that behavior is restored to prevent any compatibility issues with existing export code. * Fixed name of net in unittest * Fix pylint * [MXNET-1185] Support large array in several operators (part 1) (apache#13418) * fix a few operators with large arrays (# of elements) * fix bug in broadcast_div and add tests * address reviewer comment * add unit test * add empty line * retrigger CI * [MXNET-1210 ] Gluon Audio - Example (apache#13325) * Initialized the example * Addressed PR comments, about existing synset.txt file - no overwrite * RST - docstring issues fixed * added README * Addressed PR comments * Addressed PR comments, checking Divide by 0 * Raising error if format is not supported. * changed a line for ndarray of labels * Trigger CI * Trigger CI * PR comments addressed around skip_header argument * Addressed PR comments around librosa import * PR Comments * Passing lazy=lazy from argument * Added PR comments, labels to README.MD * Trigger CI * Addressing PR Comments in README * Modified README.md * Added example under audio folder * Retrigger CI * Retrigger CI * ONNX export: Instance normalization, Shape (apache#12920) * ONNX import/export: Make backend_rep common * ONNX export: Instance Normalization * ONNX export: Shape operator * Clarify dependency on OpenCV in CNN Visualization tutorial. (apache#13495) * clarify ops faq regarding docs strings (apache#13492) * Add graph_compact operator. (apache#13436) * add graph_compact. * fix. * add doc. * add tests for graph_compact. * address comments. * update docs. * trigger CI * Deprecate Jenkinsfile (apache#13474) * update github location for sampled_block.py (apache#13508) Updated to https://github.com/dmlc/gluon-nlp/blob/master/src/gluonnlp/model/sampled_block.py * apache#13453 [Clojure] - Add Spec Validations to the Optimizer namespace (apache#13499) * ONNX export: Logical operators (apache#12852) * Fix cmake options parsing in dev_menu (apache#13458) Add GPU+MKLDNN unittests to dev_menu * Revert "Manually track num_max_thread (apache#12380)" (apache#13501) This reverts commit 7541021. * Feature/mkldnn static 2 (apache#13503) * build mkldnn as static lib * update makefile to statically build mkldnn * build static mkldnn * fix static name * fix static name * update static for mac * rename mkldnn dep in ci * remove moving mkldnn dynamic lib * remove commented code * remove mkldnn dnaymic for unitest * force static for mkldnn lib * remove dynamic mkldnn bind * only link windows * add mkldnn.mk * try force linking * remove mkldnn dynanmic check * remove test mkldnn install * fix spacing * fix index * add artifacts * add comment about windows * remove static * update makefile * fix toctree Sphinx errors (apache#13489) * fix toctree errors * nudging file for CI * Disabled flaky test test_gluon_data.test_recordimage_dataset_with_data_loader_multiworker (apache#13527) * [MXNET-1234] Fix shape inference problems in Activation backward (apache#13409) * Provide a failing test for ReLU activation shape inference bug * Fix Activation backward shape inference fixes: apache#13333 * Add softsign Activation to test_gluon.py * Use activation in GPU if we are using CUDNN and not MKLDNN as it's happening right now * Don't disable MKLDNN

zhreshold added 5 commits November 28, 2018 14:52

fix recordio.py

54c0d0c

rewrite dataloader with pool

965dc2e

fix batch as tuple

43e315c

fix prefetching

9ed47a3

fix pylint

a400375

zhreshold requested a review from szha as a code owner November 28, 2018 23:30

zhreshold mentioned this pull request Nov 28, 2018

Allow dataloader iterator to be reused #13209

Closed

6 tasks

marcoabreu added Data-loading pr-awaiting-review PR is waiting for code review labels Nov 29, 2018

leezu reviewed Nov 29, 2018

View reviewed changes

zhreshold added 2 commits November 29, 2018 00:22

picklable function

e5b11d4

use pickle

7de1f2e

eric-haibin-lin approved these changes Nov 29, 2018

View reviewed changes

add missing commit

1b2fc73

leezu reviewed Nov 30, 2018

View reviewed changes

eric-haibin-lin merged commit 883d771 into apache:master Nov 30, 2018

zhreshold deleted the rewrite-dataloader branch November 30, 2018 20:50

zhreshold mentioned this pull request Dec 11, 2018

Complimentary gluon DataLoader improvements #13606

Merged

4 tasks

zhreshold mentioned this pull request Dec 17, 2018

[Feature] update sharded loader dmlc/gluon-nlp#468

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite dataloader, improves responsiveness and reliability #13447

Rewrite dataloader, improves responsiveness and reliability #13447

zhreshold commented Nov 28, 2018

ThomasDelteil commented Nov 28, 2018 •

edited

Loading

vandanavk commented Nov 29, 2018

leezu Nov 29, 2018

zhreshold Nov 29, 2018

leezu commented Nov 29, 2018

ThomasDelteil commented Nov 29, 2018

eric-haibin-lin left a comment

eric-haibin-lin Nov 29, 2018

leezu commented Nov 30, 2018

leezu Nov 30, 2018

zhreshold Nov 30, 2018

leezu Nov 30, 2018

ThomasDelteil commented Nov 30, 2018

eric-haibin-lin commented Nov 30, 2018

eric-haibin-lin commented Nov 30, 2018

kohillyang commented Dec 8, 2018

leezu commented Dec 8, 2018

anirudhacharya commented Dec 8, 2018

leezu commented Dec 8, 2018 •

edited

Loading

ThomasDelteil commented Dec 8, 2018 •

edited

Loading

leezu commented Dec 8, 2018

zhreshold commented Dec 8, 2018

Rewrite dataloader, improves responsiveness and reliability #13447

Rewrite dataloader, improves responsiveness and reliability #13447

Conversation

zhreshold commented Nov 28, 2018

Description

Checklist

Essentials

Changes

Comments

ThomasDelteil commented Nov 28, 2018 • edited Loading

vandanavk commented Nov 29, 2018

leezu Nov 29, 2018

Choose a reason for hiding this comment

zhreshold Nov 29, 2018

Choose a reason for hiding this comment

leezu commented Nov 29, 2018

ThomasDelteil commented Nov 29, 2018

eric-haibin-lin left a comment

Choose a reason for hiding this comment

eric-haibin-lin Nov 29, 2018

Choose a reason for hiding this comment

leezu commented Nov 30, 2018

leezu Nov 30, 2018

Choose a reason for hiding this comment

zhreshold Nov 30, 2018

Choose a reason for hiding this comment

leezu Nov 30, 2018

Choose a reason for hiding this comment

ThomasDelteil commented Nov 30, 2018

eric-haibin-lin commented Nov 30, 2018

eric-haibin-lin commented Nov 30, 2018

kohillyang commented Dec 8, 2018

leezu commented Dec 8, 2018

anirudhacharya commented Dec 8, 2018

leezu commented Dec 8, 2018 • edited Loading

ThomasDelteil commented Dec 8, 2018 • edited Loading

leezu commented Dec 8, 2018

zhreshold commented Dec 8, 2018

ThomasDelteil commented Nov 28, 2018 •

edited

Loading

leezu commented Dec 8, 2018 •

edited

Loading

ThomasDelteil commented Dec 8, 2018 •

edited

Loading