This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[v1.x] provide a faster PrefetchedDataLoader #19748

Merged
merged 16 commits into from
Feb 5, 2021

Conversation

Neutron3529
Contributor

@Neutron3529 Neutron3529 commented Jan 13, 2021

Description

MXNet 2.0 already provides a faster dataloader, but in v1.x the existing dataloader is slower; it can be improved by changing its prefetch behavior to match what 2.0 does.
test:

```python
$ cat iternew.py && python iternew.py
import mxnet as mx
from mxnet.gluon.data import PrefetchedDataLoader as DataLoader,ArrayDataset
from time import sleep,perf_counter_ns
train_data=ArrayDataset(mx.nd.array([[i] for i in range(50000)]),mx.nd.array([[99-i] for i in range(50000)]))
test_data=ArrayDataset(mx.nd.array([[i] for i in range(10000)]),mx.nd.array([[99-i] for i in range(10000)]))
def transform_train(sample):
  sleep(0.0016)
  return sample

def transform_test(sample):
  sleep(0.0008)
  return sample

train_iter=DataLoader(train_data.transform_first(transform_train),batch_size=500,num_workers=10)
test_iter =DataLoader(test_data .transform_first(transform_test ),batch_size=500,num_workers=10)
if True:
  tic=perf_counter_ns()
  for epoch in range(10):
    print("epoch"+str(epoch)+" start at "+str(round((perf_counter_ns()-tic)*1e-9,2))+"s")
    for i in train_iter:
      sleep(0.1)
    print("       finished train phase at "+str(round((perf_counter_ns()-tic)*1e-9,2))+"s")
    for i in test_iter:
      sleep(0.05)
    print("        finished test phase at "+str(round((perf_counter_ns()-tic)*1e-9,2))+"s")
  print("cost="+str((perf_counter_ns()-tic)*1e-9)+"s")
epoch0 start at 0.0s
       finished train phase at 11.28s
        finished test phase at 12.35s
epoch1 start at 12.35s
       finished train phase at 22.73s
        finished test phase at 23.79s
epoch2 start at 23.79s
       finished train phase at 34.15s
        finished test phase at 35.21s
epoch3 start at 35.22s
       finished train phase at 45.59s
        finished test phase at 46.66s
epoch4 start at 46.66s
       finished train phase at 57.01s
        finished test phase at 58.07s
epoch5 start at 58.07s
       finished train phase at 68.43s
        finished test phase at 69.5s
epoch6 start at 69.5s
       finished train phase at 79.87s
        finished test phase at 80.93s
epoch7 start at 80.93s
       finished train phase at 91.3s
        finished test phase at 92.37s
epoch8 start at 92.37s
       finished train phase at 102.74s
        finished test phase at 103.8s
epoch9 start at 103.8s
       finished train phase at 114.17s
        finished test phase at 115.23s
cost=115.23376344s
```

(cost is ~`129.67192333600002s` if we use `DataLoader` rather than `PrefetchedDataLoader`)
(The test was done with v1.7.0; the newest v1.x build is running on my GPU server, and I do not want to interrupt it.)

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Add an `auto_reload` flag to `DataLoader`, which loads the first several batches faster than the default `DataLoader` does.
  • (Don't know whether the test case is appropriate.)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here
    Now, the default behavior of `DataLoader` is to prefetch immediately after it is created, rather than waiting for its `__iter__()` to be called.
    This behavior matches MXNet 2.0's dataloader in nopython mode. It really speeds up the program (~5% faster with my handwritten autoaugment transform function when training CIFAR-100 on an RTX 3090 with batch_size=250, a wide resnet 16-4, and the deep mutual learning technique) and makes almost no observable difference (all the parameters used to create the prefetched dataloader are private and should not be modified by other code), so I switched the default behavior to prefetch.
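The prefetch-on-creation idea described above can be sketched in plain Python, independent of MXNet. This is an illustrative toy, not the PR's actual implementation (the class `EagerLoader` and its names are made up for the demo): the loader starts filling a queue in a background thread as soon as it is constructed, and kicks off prefetching for the next pass as soon as an iteration begins.

```python
import queue
import threading

class EagerLoader:
    """Toy loader that prefetches at construction time (illustration only)."""
    _SENTINEL = object()  # marks the end of one pass over the data

    def __init__(self, make_batches):
        self._make_batches = make_batches  # callable returning an iterable of batches
        self._start()                      # prefetch immediately on creation

    def _start(self):
        q = queue.Queue()
        def worker():
            for batch in self._make_batches():
                q.put(batch)
            q.put(EagerLoader._SENTINEL)
        threading.Thread(target=worker, daemon=True).start()
        self._queue = q

    def __iter__(self):
        q = self._queue
        self._start()  # like auto_reload: start prefetching the next epoch now
        while True:
            item = q.get()
            if item is EagerLoader._SENTINEL:
                return
            yield item

loader = EagerLoader(lambda: iter([[0, 99], [1, 98]]))
for epoch in range(2):
    print(list(loader))  # each pass consumes batches fetched ahead of time
```

Because construction and iteration both trigger prefetching, the first batches of every epoch are already in flight before the training loop asks for them, which is where the observed speedup comes from.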

Now, since my programming skill is limited, this `PrefetchedDataLoader` only allows generating a single iter at a time.
The benefit of `PrefetchedDataLoader` is that it provides better performance as a simple drop-in replacement in most existing code.
test:
```python
$ cat iternew.py && python iternew.py
import mxnet as mx
from mxnet.gluon.data import PrefetchedDataLoader as DataLoader,ArrayDataset
from time import sleep,perf_counter_ns
train_data=ArrayDataset(mx.nd.array([[i] for i in range(50000)]),mx.nd.array([[99-i] for i in range(50000)]))
test_data=ArrayDataset(mx.nd.array([[i] for i in range(10000)]),mx.nd.array([[99-i] for i in range(10000)]))
def transform_train(sample):
  sleep(0.0016)
  return sample

def transform_test(sample):
  sleep(0.0008)
  return sample

train_iter=DataLoader(train_data.transform_first(transform_train),batch_size=500,num_workers=10)
test_iter =DataLoader(test_data .transform_first(transform_test ),batch_size=500,num_workers=10)
if True:
  tic=perf_counter_ns()
  for epoch in range(10):
    print("epoch"+str(epoch)+" start at "+str(round((perf_counter_ns()-tic)*1e-9,2))+"s")
    for i in train_iter:
      sleep(0.1)
    print("       finished train phase at "+str(round((perf_counter_ns()-tic)*1e-9,2))+"s")
    for i in test_iter:
      sleep(0.05)
    print("        finished test phase at "+str(round((perf_counter_ns()-tic)*1e-9,2))+"s")
  print("cost="+str((perf_counter_ns()-tic)*1e-9)+"s")

epoch0 start at 0.0s
       finished train phase at 11.25s
        finished test phase at 12.31s
epoch1 start at 12.31s
       finished train phase at 22.62s
        finished test phase at 23.68s
epoch2 start at 23.68s
       finished train phase at 34.03s
        finished test phase at 35.09s
epoch3 start at 35.09s
       finished train phase at 45.41s
        finished test phase at 46.48s
epoch4 start at 46.48s
       finished train phase at 56.82s
        finished test phase at 57.88s
epoch5 start at 57.88s
       finished train phase at 68.24s
        finished test phase at 69.3s
epoch6 start at 69.3s
       finished train phase at 79.65s
        finished test phase at 80.71s
epoch7 start at 80.71s
       finished train phase at 91.04s
        finished test phase at 92.11s
epoch8 start at 92.11s
       finished train phase at 102.46s
        finished test phase at 103.53s
epoch9 start at 103.53s
       finished train phase at 113.89s
        finished test phase at 114.95s
cost=114.94954171600001s
```
(cost is ~`129.67192333600002s` if we use `DataLoader` rather than `PrefetchedDataLoader`)
@mxnet-bot

Hey @Neutron3529 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, centos-gpu, edge, windows-gpu, website, miscellaneous, sanity, centos-cpu, windows-cpu, unix-cpu, unix-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added the pr-awaiting-testing PR is reviewed and waiting CI build and test label Jan 13, 2021
@Neutron3529 Neutron3529 changed the title provide a faster PrefetchedDataLoader [v1.x] provide a faster PrefetchedDataLoader Jan 13, 2021
@Neutron3529

This comment has been minimized.

add unittest for PrefetchedDataLoader
@Neutron3529 Neutron3529 reopened this Jan 13, 2021
@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 13, 2021
update document
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 13, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Jan 14, 2021
@szha szha requested a review from zhreshold January 14, 2021 01:52
@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 14, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 14, 2021
previous tests show that there may be something wrong with `_MultiWorkerIter` when `__iter__()` is called inappropriately; I tried to fix it by moving the call here.
@lanking520 lanking520 removed the pr-work-in-progress PR is still work in progress label Jan 14, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Jan 17, 2021
fix the outdated PrefetchedDataLoader
@Neutron3529
Contributor Author

The reason for using `DataLoader` with `auto_reload` is that MXNet 2.0's `DataLoader`, with the default nopython mode, prefetches data by default.

MXNet 2 uses version number 2 because it breaks APIs. MXNet uses https://semver.org/ and we must not introduce backward incompatible changes in the v1.x branch. (Changing defaults with major impact is backwards incompatible). It's fine to add new features in v1.x.

Most of the behavior is unchanged, since it only prefetches data rather than modifying it.

There is only one iter per `DataLoader` in most cases (thus only one prefetched iter is generated).
If we call `iter` explicitly, we would have to call it twice (once right after defining the `DataLoader`, and again after the previous iter is consumed).

So what's the problem here? Currently I'm not convinced your code / documentation is correct. For example:

    >>> train_iter = DataLoader(train_data.transform_first(transform_train),
    ...                         batch_size=1,num_workers=1)
    (pre)fetching data here
    >>> it = iter(train_iter) # nothing is generated since lazy-evaluation occurs
    >>> it2 = iter(train_iter)
    >>> it3 = iter(train_iter)
    >>> it4 = iter(train_iter)
    >>> _ = next(it2) # the first iter we are using is the prefetched iter.
    >>> _ = next(it) # since the prefetched iter is consumed, we have to fetch data for `it`.

However, looking at your implementation, actually 4 prefetched iters are created and the comments in the last two lines are wrong. Please correct me if you disagree.

Due to lazy evaluation, the iter will not call `self.refresh`/`self.clean` until the first `__next__()` is called; thus we have 4 iters, but only the first iter we actually use (`it2` here) is the prefetched one.
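The lazy-evaluation point can be checked in plain Python: a generator function's body, including any refresh/clean bookkeeping placed before the first `yield`, does not run when the generator object is created, only on the first `next()`. A minimal illustration (the names here are made up for the demo):

```python
def make_iter(log):
    # stands in for the loader's refresh/clean bookkeeping before yielding batches
    log.append("refresh")
    yield "batch0"
    yield "batch1"

log = []
it = make_iter(log)   # creating the generator runs nothing
it2 = make_iter(log)
print(log)            # prints []  -- creation is lazy
next(it2)
print(log)            # prints ['refresh']  -- only the iterator actually consumed did work
```

This mirrors the four-iters example above: creating `it`..`it4` costs nothing, and whichever iterator is consumed first is the one that picks up the prefetched data.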

what's more, for a regular training procedure:

    >>> train_data = ArrayDataset([i for i in range(10)],[9-i for i in range(10)])
    >>> def transform_train(sample):
    ...   if sample == 0 : print('(pre)fetching data here')
    ...   return sample
    ...
    >>> train_iter = DataLoader(train_data.transform_first(transform_train),
    ...                         auto_reload=False, batch_size=1,num_workers=1)
    >>> test_data = ArrayDataset([i for i in range(10)],[9-i for i in range(10)])
    >>> test_iter = DataLoader(test_data, batch_size=1,num_workers=1)
    >>> for epoch in range(200):
    ...   # there is almost no difference between it and the default DataLoader
    ...   for data, label in train_iter:
    ...     pass # training...
    ...   for data, label in test_iter:
    ...     pass # testing...

there is only one iter per `DataLoader` at a time. Most of the time, users will not think about what happens inside the dataloader.

(Maybe we should not be using `with ag.record():` either, since "Explicit is better than implicit." (Zen of Python).)

What's the relation to the current discussion?

Here we implicitly modify something that helps compute the gradient of the network.
My point is only that it is fine for us to use some implicit operations to simplify the execution of the program.

@lanking520 lanking520 added pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Jan 17, 2021
@Neutron3529
Contributor Author

@mxnet-bot run ci [centos-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [centos-cpu]

@lanking520 lanking520 added pr-awaiting-review PR is waiting for code review and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 17, 2021
Contributor

@leezu leezu left a comment

Please change the default of auto_reload to ensure legacy code is not affected by this PR.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels Jan 17, 2021
Contributor

@leezu leezu left a comment

Thank you!

@lanking520 lanking520 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 17, 2021
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jan 17, 2021
@szha szha merged commit 7d934a7 into apache:v1.x Feb 5, 2021
@Neutron3529 Neutron3529 deleted the patch-1 branch February 7, 2021 05:05
access2rohit pushed a commit to access2rohit/incubator-mxnet that referenced this pull request Feb 10, 2021