This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[BUGFIX] fix issue in elemwise_add #20380

Closed
wants to merge 11 commits into from

Conversation

KexinFeng
Contributor

@KexinFeng KexinFeng commented Jun 23, 2021

Description

This PR fixes issue #20293, in which the elemwise_add operator produces incorrect results when CachedOp uses static_alloc.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mxnet-bot

Hey @KexinFeng, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-gpu, edge, clang, centos-gpu, sanity, centos-cpu, windows-cpu, website, windows-gpu, miscellaneous, unix-cpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added the pr-awaiting-testing PR is reviewed and waiting CI build and test label Jun 23, 2021
This reverts commit 6c2f76f.
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jun 23, 2021
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jun 23, 2021
@matteosal
Contributor

I have verified that this change fixes the issue

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Jun 24, 2021
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jun 24, 2021
@KexinFeng KexinFeng requested a review from szha as a code owner June 24, 2021 19:19
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jun 24, 2021
@ptrendx
Member

ptrendx commented Jun 25, 2021

Unfortunately, this is not a valid fix for the issue, as shown by the fact that everything works fine without static allocation. The bug seems to lie inside CachedOp and its handling of static memory allocation. The CloneGradient method used in the backward of elemwise_add simply passes the incoming gradient data through to both inputs (rather than copying it, as the proposed ElemwiseGradUseNone does), which is the right thing to do.
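
To make the pass-through versus copy distinction concrete, here is a minimal pure-Python sketch. It assumes plain lists as stand-ins for gradient buffers, and the two function names are hypothetical; it is not MXNet's C++ implementation of CloneGradient or ElemwiseGradUseNone. For c = a + b, both inputs receive the same gradient values either way; the strategies differ only in whether the results share memory with the incoming gradient.

```python
# Hypothetical stand-ins for the two backward strategies; this is not MXNet code.

def add_backward_passthrough(out_grad):
    """Hand the incoming gradient buffer to both inputs without copying."""
    return out_grad, out_grad


def add_backward_copy(out_grad):
    """Give each input its own freshly allocated copy of the gradient."""
    return list(out_grad), list(out_grad)


out_grad = [1.0, 2.0, 3.0]
pass_a, pass_b = add_backward_passthrough(out_grad)
copy_a, copy_b = add_backward_copy(out_grad)

# Both strategies produce the same gradient values for c = a + b.
assert pass_a == pass_b == copy_a == copy_b == [1.0, 2.0, 3.0]

# Only the pass-through results alias the incoming buffer.
print(pass_a is out_grad, copy_a is out_grad)  # True False

# If the shared buffer is later overwritten in place, aliased results change
# with it, while the copies keep the original values.
out_grad[0] = 99.0
print(pass_a[0], copy_a[0])  # 99.0 1.0
```

This only illustrates general aliasing behaviour: pass-through avoids an allocation and a copy, but its results are tied to the lifetime and contents of the shared buffer. It is not a diagnosis of how CachedOp's static allocation goes wrong.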

@matteosal
Contributor

Any news on this?

So this is an inefficient workaround because it wastes some time copying the data? Does this fix have any shortcomings other than performance? We really need this to work, and we don't mind a small performance drop while waiting for a proper fix.

@KexinFeng
Contributor Author

Hi, this fix currently resolves your problem but fails other tests in this pull request, so it is still under investigation. It doesn't seem to me to cause inefficiency; it's just not valid yet. It may still serve as a workaround in the meantime.

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Jul 17, 2021
@KexinFeng KexinFeng closed this Jul 6, 2022
Labels
pr-work-in-progress PR is still work in progress
5 participants