Fix problematic backward of take & embedding #11795

Merged (1 commit) on Jul 25, 2018

Conversation

haojin2 (Contributor) commented Jul 18, 2018

Description

Fix for #11314

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Modify the backward of take and embedding to use the previous AddTakeGrad kernel

Comments

@leezu I've used the script in #11314 and verified that this can pass more than 3000 trials.
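
For readers unfamiliar with the kernel being discussed: conceptually, the backward of take/Embedding scatter-adds each output-gradient row into the weight-gradient row selected by the corresponding index, and repeated indices must all accumulate. Below is a minimal NumPy sketch of that accumulation (an illustration only, not the actual MXNet CUDA kernel; the function and variable names are made up):

import numpy as np

def add_take_grad(grad_weight, indices, grad_out):
    # Accumulate every row of grad_out into grad_weight at its index.
    # np.add.at is used because a plain fancy-index assignment
    # (grad_weight[indices] += grad_out) silently drops duplicate indices.
    np.add.at(grad_weight,
              indices.reshape(-1),
              grad_out.reshape(-1, grad_out.shape[-1]))
    return grad_weight

# Toy example: vocabulary of 5, embedding dim of 3, index 1 repeated 3 times.
grad_weight = np.zeros((5, 3))
indices = np.array([[1, 1], [4, 1]])
grad_out = np.ones((2, 2, 3))
print(add_take_grad(grad_weight, indices, grad_out))  # row 1 accumulates to 3s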

haojin2 (Contributor, Author) commented Jul 18, 2018

Benchmark script:

import mxnet as mx
import numpy as np
import time

N = 50000
ctx = mx.gpu(0)

embedding = mx.gluon.nn.Embedding(N, 300)
embedding.initialize(ctx=ctx)

np.random.seed(1)
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)

# Maximum number of repeats of any single index; heavy duplication is what
# stresses the gradient-accumulation kernel in the backward pass.
print(np.max(np.bincount(idx.asnumpy().flatten().astype(np.int64))))

a = time.time()
for i in range(500000):
    with mx.autograd.record():
        emb_in = embedding(idx)
        loss = emb_in.sum()
        loss.backward()
print(time.time() - a)

Benchmark results (wall-clock seconds for the loop above):
Original implementation: 146.33599019050598
New implementation: 147.44622945785522

The slowdown is negligible after switching to the new kernel.

leezu (Contributor) commented Jul 18, 2018

Thanks @haojin2! Regarding your test, did you run it on Tesla V100 GPUs with CUDA 9.2? Also, it seems that the AddTakeGradLargeBatch kernel and related code are unused now and should be removed?

@haojin2 haojin2 changed the title Fix problematic backward of take & embedding [WIP] Fix problematic backward of take & embedding Jul 18, 2018
haojin2 (Contributor, Author) commented Jul 18, 2018

@leezu I'm encountering some unit test problems with this PR and will debug them a bit. The benchmarks were run on a single K80 on my dev machine; I'll also benchmark on a p3 instance tomorrow. Regarding the original kernels, I'll check whether they are still used elsewhere; we may even need to bring them back if we observe a severe performance regression on V100. I'll keep them for now and mark this PR as WIP. I'll let you know once I have the latest results. Thanks for your quick reply!

haojin2 (Contributor, Author) commented Jul 23, 2018

@leezu I found that my fix is only slightly faster than turning MXNET_FORCE_ADDTAKEGRAD on with the previously shown benchmark script, so I think we're good with the current implementation now?
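
For anyone following along, MXNET_FORCE_ADDTAKEGRAD is an environment variable, so the workaround discussed here can be tried without code changes. A minimal sketch (setting the variable before MXNet is imported is the safest option, since exactly when MXNet reads it may vary between versions):

import os
# Force the AddTakeGrad kernel for the Embedding/take backward pass.
os.environ["MXNET_FORCE_ADDTAKEGRAD"] = "1"
import mxnet as mx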

leezu (Contributor) commented Jul 24, 2018

Ok, thanks for looking into this! In that case we should make MXNET_FORCE_ADDTAKEGRAD the default behavior and delete the AddTakeGradLargeBatch kernel.

leezu (Contributor) commented Jul 24, 2018

@Roshrini I believe this should be addressed before freezing code for MXNet 1.3. The MXNet 1.3 release is likely to be used a lot on P3 instances with CUDA 9.2, meaning that the buggy backward pass of the Embedding operator will hit many users.

haojin2 (Contributor, Author) commented Jul 24, 2018

@leezu So I can change my PR to make AddTakeGrad the default behavior; does that sound good to you?

leezu (Contributor) commented Jul 24, 2018

Yes, I think either your improved take operator should be used or MXNET_FORCE_ADDTAKEGRAD should be the default. I'll send an email to dev@ to see what other people think.

haojin2 (Contributor, Author) commented Jul 24, 2018

@leezu I would prefer the original AddTakeGrad, as that requires minimal code changes.

eric-haibin-lin (Member) commented

Is this complete? I see some discussions on dev@ about this bug.

haojin2 (Contributor, Author) commented Jul 24, 2018

@eric-haibin-lin This PR will be modified to make AddTakeGrad the new default.

KellenSunderland (Contributor) commented Jul 24, 2018

First of all, great work diving into this @haojin2.

One concern I have is that the overhead of running the imperative Python loop could dominate the runtime and make the two implementations seem equivalent in terms of execution time. Would it be possible to run the same test prefixed with nvprof and then paste the kernel execution time summary?
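
For anyone reproducing this, a sketch of how such a summary could be collected, assuming nvprof is on the PATH and the benchmark above is saved as embedding_benchmark.py (a hypothetical filename, ideally with a much smaller loop count):

import subprocess

# nvprof prints a per-kernel GPU time summary when the profiled process exits.
subprocess.run(["nvprof", "python", "embedding_benchmark.py"], check=True)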

haojin2 (Contributor, Author) commented Jul 24, 2018

@KellenSunderland One argument I would like to make here is that end-to-end performance is what matters most to actual users, since that's what they experience on their end. On the other hand, since the LargeBatch kernel exhibits flaky, problematic behavior, I don't think it should be kept (someone else on the dev list thread about this agrees with that point), not to mention that it isn't making any fundamental performance difference here.
In conclusion, I think the LargeBatch version should be removed anyway, but I'll investigate with a profiler whether the new general take-grad kernel or the AddTakeGrad kernel gives better performance. Does that sound good to you?

KellenSunderland (Contributor) commented Jul 24, 2018

Yeah, I agree with that logic (and the arguments on the list). I just think the numbers are suspiciously close, and I've often found in the past that when numbers are this close, it's actually because Python imperative calls and CUDA overhead are the bottleneck of the measurement.

Edit: since I'm bringing it up, maybe the onus is on me to verify this. I'll do a quick test and let you know what the results are.

haojin2 (Contributor, Author) commented Jul 24, 2018

I'll also do some research on my side. Thanks for bringing this issue up!

KellenSunderland (Contributor) commented Jul 25, 2018

Actually, looking at a quick profile, it seems the measurement should be pretty representative; it's mostly GPU work going on.
[Screenshot: profiler output, 2018-07-24]

haojin2 (Contributor, Author) commented Jul 25, 2018

@KellenSunderland So I think there shouldn't be much difference between the two versions of the kernel?

@haojin2 haojin2 changed the title [WIP] Fix problematic backward of take & embedding Fix problematic backward of take & embedding Jul 25, 2018
KellenSunderland (Contributor) commented Jul 25, 2018 via email

haojin2 (Contributor, Author) commented Jul 25, 2018

@leezu The AddTakeGrad kernel passes more than 3500 trials of the reproduction script, so I think this PR should be a good fix without any fundamental performance loss.

haojin2 (Contributor, Author) commented Jul 25, 2018

@KellenSunderland Thanks for the help on profiling!

@eric-haibin-lin eric-haibin-lin merged commit fe1c7ab into apache:master Jul 25, 2018
@haojin2 haojin2 deleted the fix_11314 branch July 25, 2018 21:07
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
sxjscience (Member) commented

I see that the previous performance test was conducted using time.time(). That's not a safe way to measure, due to the tremendous overhead of the imperative API in MXNet; we should rely on nvprof in the future.
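
As a reference point, here is a hedged sketch of a wall-clock measurement that at least waits for MXNet's asynchronous engine to finish before reading the timer; kernel-level numbers still need a profiler such as nvprof. It recreates a small version of the earlier benchmark setup and assumes a GPU is available:

import time
import mxnet as mx
import numpy as np

ctx = mx.gpu(0)
embedding = mx.gluon.nn.Embedding(50000, 300)
embedding.initialize(ctx=ctx)
idx = mx.nd.array(np.random.randint(0, 50000, size=(1024, 160)), ctx=ctx)

mx.nd.waitall()                 # drain pending asynchronous work before timing
start = time.time()
for _ in range(1000):           # far fewer iterations than the original script
    with mx.autograd.record():
        loss = embedding(idx).sum()
    loss.backward()
mx.nd.waitall()                 # block until the backward kernels have finished
print('elapsed seconds:', time.time() - start)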

KellenSunderland (Contributor) commented Aug 30, 2019 via email

sxjscience (Member) commented Aug 30, 2019 via email

sxjscience (Member) commented Aug 30, 2019 via email
