Fix problematic backward of take & embedding #11795

Merged (1 commit) on Jul 25, 2018

Conversation

haojin2 (Contributor) commented Jul 18, 2018

Description

Fix for #11314

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Modify the backward of take and embedding to use the previous AddTakeGrad kernel

Comments

@leezu I've used the script in #11314 and verified that this can pass more than 3000 trials.
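
For readers unfamiliar with the kernel being discussed: conceptually, the backward of take/Embedding scatter-adds each output-gradient row into the weight-gradient row selected by the corresponding index, and repeated indices must all accumulate. Below is a minimal NumPy sketch of that accumulation (an illustration only, not the actual MXNet CUDA kernel; the function and variable names are made up):

import numpy as np

def add_take_grad(grad_weight, indices, grad_out):
    # Accumulate every row of grad_out into grad_weight at its index.
    # np.add.at is used because a plain fancy-index assignment
    # (grad_weight[indices] += grad_out) silently drops duplicate indices.
    np.add.at(grad_weight,
              indices.reshape(-1),
              grad_out.reshape(-1, grad_out.shape[-1]))
    return grad_weight

# Toy example: vocabulary of 5, embedding dim of 3, index 1 repeated 3 times.
grad_weight = np.zeros((5, 3))
indices = np.array([[1, 1], [4, 1]])
grad_out = np.ones((2, 2, 3))
print(add_take_grad(grad_weight, indices, grad_out))  # row 1 accumulates to 3s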

haojin2 (Contributor, Author) commented Jul 18, 2018

Benchmark script:

import mxnet as mx
import numpy as np
import time

N = 50000
ctx = mx.gpu(0)

embedding = mx.gluon.nn.Embedding(N, 300)
embedding.initialize(ctx=ctx)

np.random.seed(1)
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)

# Maximum number of repeats of any single index; heavy duplication is what
# stresses the gradient-accumulation kernel in the backward pass.
print(np.max(np.bincount(idx.asnumpy().flatten().astype(np.int64))))

a = time.time()
for i in range(500000):
    with mx.autograd.record():
        emb_in = embedding(idx)
        loss = emb_in.sum()
        loss.backward()
print(time.time() - a)

Benchmark results (wall-clock seconds for the loop above):
Original implementation: 146.33599019050598
New implementation: 147.44622945785522

The slowdown is negligible after switching to the new kernel.

leezu (Contributor) commented Jul 18, 2018

Thanks @haojin2! Regarding your test, did you run it on Tesla V100 GPUs with CUDA 9.2? Also, it seems that the AddTakeGradLargeBatch kernel and related code are unused now and should be removed?

@haojin2 haojin2 changed the title Fix problematic backward of take & embedding [WIP] Fix problematic backward of take & embedding Jul 18, 2018
haojin2 (Contributor, Author) commented Jul 18, 2018

@leezu I'm encountering some unit test problems with this PR and will debug them a bit. The benchmarks were run on a single K80 on my dev machine; I'll also benchmark on a p3 instance tomorrow. Regarding the original kernels, I'll check whether they are still used elsewhere; we may even need to bring them back if we observe a severe performance regression on V100. I'll keep them for now and mark this PR as WIP. I'll let you know once I have the latest results. Thanks for your quick reply!

haojin2 (Contributor, Author) commented Jul 23, 2018

@leezu I found that my fix is only slightly faster than turning MXNET_FORCE_ADDTAKEGRAD on with the previously shown benchmark script, so I think we're good with the current implementation now?
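
For anyone following along, MXNET_FORCE_ADDTAKEGRAD is an environment variable, so the workaround discussed here can be tried without code changes. A minimal sketch (setting the variable before MXNet is imported is the safest option, since exactly when MXNet reads it may vary between versions):

import os
# Force the AddTakeGrad kernel for the Embedding/take backward pass.
os.environ["MXNET_FORCE_ADDTAKEGRAD"] = "1"
import mxnet as mx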

leezu (Contributor) commented Jul 24, 2018

Ok, thanks for looking into this! In that case we should make MXNET_FORCE_ADDTAKEGRAD the default behavior and delete the AddTakeGradLargeBatch kernel.

leezu (Contributor) commented Jul 24, 2018

@Roshrini I believe this should be addressed before freezing code for MXNet 1.3. The MXNet 1.3 release is likely to be used a lot on P3 instances with CUDA 9.2, meaning that the buggy backward pass of the Embedding operator will hit many users.

haojin2 (Contributor, Author) commented Jul 24, 2018

@leezu So I can change my PR to make AddTakeGrad the default behavior; does that sound good to you?

leezu (Contributor) commented Jul 24, 2018

Yes, I think either your improved take operator should be used or MXNET_FORCE_ADDTAKEGRAD should be the default. I'll send an email to dev@ to see what other people think.

haojin2 (Contributor, Author) commented Jul 24, 2018

@leezu I would prefer the original AddTakeGrad, as that requires minimal code changes.

eric-haibin-lin (Member) commented

Is this complete? I see some discussions on dev@ about this bug.

haojin2 (Contributor, Author) commented Jul 24, 2018

@eric-haibin-lin This PR will be modified to make AddTakeGrad the new default.

KellenSunderland (Contributor) commented Jul 24, 2018

First of all, great work diving into this @haojin2.

One concern I have is that the overhead of running the imperative Python loop could dominate the runtime and make the two implementations seem equivalent in terms of execution time. Would it be possible to run the same test prefixed with nvprof and then paste the kernel execution time summary?
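
For anyone reproducing this, a sketch of how such a summary could be collected, assuming nvprof is on the PATH and the benchmark above is saved as embedding_benchmark.py (a hypothetical filename, ideally with a much smaller loop count):

import subprocess

# nvprof prints a per-kernel GPU time summary when the profiled process exits.
subprocess.run(["nvprof", "python", "embedding_benchmark.py"], check=True)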

haojin2 (Contributor, Author) commented Jul 24, 2018

@KellenSunderland One argument I would like to make here is that end-to-end performance is what matters most to actual users, since that's what they experience on their end. On the other hand, since the LargeBatch kernel exhibits flaky, problematic behavior, I don't think it should be kept (someone else on the dev list thread about this agrees with that point), not to mention that it isn't making any fundamental performance difference here.
In conclusion, I think the LargeBatch version should be removed anyway, but I'll investigate with a profiler whether the new general take-grad kernel or the AddTakeGrad kernel gives better performance. Does that sound good to you?

KellenSunderland (Contributor) commented Jul 24, 2018

Yeah, I agree with that logic (and the arguments on the list). I just think the numbers are suspiciously close, and I've often found in the past that when numbers are this close, it's actually because Python imperative calls and CUDA overhead are the bottleneck of the measurement.

Edit: since I'm bringing it up, maybe the onus is on me to verify this. I'll do a quick test and let you know what the results are.

haojin2 (Contributor, Author) commented Jul 24, 2018

I'll also do some research on my side. Thanks for bringing this issue up!

KellenSunderland (Contributor) commented Jul 25, 2018

Actually, looking at a quick profile, it seems the measurement should be pretty representative; it's mostly GPU work going on.
[Screenshot: profiler output, 2018-07-24]

haojin2 (Contributor, Author) commented Jul 25, 2018

@KellenSunderland So I think there shouldn't be much difference between the two versions of the kernel?

@haojin2 haojin2 changed the title [WIP] Fix problematic backward of take & embedding Fix problematic backward of take & embedding Jul 25, 2018
KellenSunderland (Contributor) commented Jul 25, 2018 via email

haojin2 (Contributor, Author) commented Jul 25, 2018

@leezu The AddTakeGrad kernel passes more than 3500 trials of the reproduction script, so I think this PR should be a good fix without any fundamental performance loss.

haojin2 (Contributor, Author) commented Jul 25, 2018

@KellenSunderland Thanks for the help on profiling!

@eric-haibin-lin eric-haibin-lin merged commit fe1c7ab into apache:master Jul 25, 2018
@haojin2 haojin2 deleted the fix_11314 branch July 25, 2018 21:07
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
sxjscience (Member) commented

I see that the previous performance test was conducted using time.time(). That's not a safe way to measure, due to the tremendous overhead of the imperative API in MXNet; we should rely on nvprof in the future.
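
As a reference point, here is a hedged sketch of a wall-clock measurement that at least waits for MXNet's asynchronous engine to finish before reading the timer; kernel-level numbers still need a profiler such as nvprof. It recreates a small version of the earlier benchmark setup and assumes a GPU is available:

import time
import mxnet as mx
import numpy as np

ctx = mx.gpu(0)
embedding = mx.gluon.nn.Embedding(50000, 300)
embedding.initialize(ctx=ctx)
idx = mx.nd.array(np.random.randint(0, 50000, size=(1024, 160)), ctx=ctx)

mx.nd.waitall()                 # drain pending asynchronous work before timing
start = time.time()
for _ in range(1000):           # far fewer iterations than the original script
    with mx.autograd.record():
        loss = embedding(idx).sum()
    loss.backward()
mx.nd.waitall()                 # block until the backward kernels have finished
print('elapsed seconds:', time.time() - start)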

KellenSunderland (Contributor) commented Aug 30, 2019 via email

sxjscience (Member) commented Aug 30, 2019 via email

sxjscience (Member) commented Aug 30, 2019 via email
