This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

MKLDNN fallback when not recording gradients and calling backwards #12411

Closed
wants to merge 17 commits

Conversation

azai91
Contributor

@azai91 azai91 commented Aug 30, 2018

Description

PR to address #10994. There is a case where users may want to run the backward pass but not record any gradients (see https://mxnet.incubator.apache.org/api/python/autograd/autograd.html#mxnet.autograd.record; whether this usage makes sense should be revisited in a later PR). MKLDNN does not handle this case, so we fall back to the native implementation instead.
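The fallback idea can be sketched as follows. This is a minimal illustration only: the `backward_dispatch` helper and the implementation names are hypothetical, not actual MXNet internals. The point is that when the forward pass was run without recording in training mode, the MKLDNN backward primitive cannot rely on a training-mode forward (e.g. a cached workspace), so the backward call takes the native path.

```python
def backward_dispatch(op, train_mode_recorded, mkldnn_supported=True):
    """Pick a backward implementation name for `op` (illustrative sketch).

    Falls back to the native CPU path when gradients were not recorded
    in training mode, or when the operator has no MKLDNN backward kernel.
    """
    if mkldnn_supported and train_mode_recorded:
        return op + "_backward_mkldnn"
    return op + "_backward_fallback"

print(backward_dispatch("pooling", True))   # MKLDNN path
print(backward_dispatch("pooling", False))  # fallback path
```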

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@azai91
Contributor Author

azai91 commented Aug 30, 2018

This PR is waiting on the merge of #12019.

@azai91 azai91 force-pushed the fix/10994 branch 3 times, most recently from 1ccef95 to a56a306 Compare August 30, 2018 22:16
@stu1130
Contributor

stu1130 commented Sep 18, 2018

@azai91 #12019 was merged. Is it still WIP?

@azai91
Contributor Author

azai91 commented Sep 24, 2018

Still a work in progress. Working with a customer on a different issue; will fix this next.

@vandanavk
Contributor

@mxnet-label-bot [pr-work-in-progress]

@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Sep 25, 2018
@vrakesh
Contributor

vrakesh commented Oct 9, 2018

@azai91 requesting an update on this PR.

@Roshrini
Member

@azai91 Any update on this PR?

@ankkhedia
Contributor

@azai91 Thanks for the contribution!
Could you please update this PR?

@anirudhacharya
Member

@azai91 Any update on this PR? You could close it and reopen once the changes are ready.

@sandeep-krishnamurthy @anirudh2290

@stu1130
Contributor

stu1130 commented Nov 21, 2018

@azai91 could you address the CI failure?

@vandanavk
Contributor

@mxnet-label-bot add [pr-awaiting-testing]

@marcoabreu marcoabreu added the pr-awaiting-testing PR is reviewed and waiting CI build and test label Nov 27, 2018
@azai91
Contributor Author

azai91 commented Nov 28, 2018

Still investigating. This PR goes beyond the original issue I was addressing.

@azai91 azai91 changed the title [WIP] MKLDNN fallback when not recording gradients and calling backwards MKLDNN fallback when not recording gradients and calling backwards Nov 28, 2018
@azai91
Contributor Author

azai91 commented Nov 28, 2018

Addressed the original problem in issue #10994. However, I discovered another issue and created a new PR (#13445) to address it.

check_hybrid_static_memory()
check_hybrid_static_memory(static_alloc=True)
check_hybrid_static_memory(static_alloc=True, static_shape=True)
check_hybrid_static_memory(train_mode=[True, False])

Shouldn't it be train_modes?

@roywei
Member

roywei commented Dec 11, 2018

@azai91 Thanks for taking the time to dive into the issue. Could you resolve the conflicts and trigger CI if you are still working on this?

@sandeep-krishnamurthy
Contributor

@azai91 - can you please rebase and fix CI issues? @pengzhao-intel - You may be interested in this PR?

@pengzhao-intel
Contributor

Actually, this is a system-level issue: how do we handle the situation where backward is called with is_train=False? I don't think it can be done with a simple bypass.

We will have an offline discussion first and come back later. @TaoLv


@TaoLv TaoLv left a comment


It seems I missed some context on this issue. What is the expectation in mxnet if backward is called with is_train=false? What is the grad_output for the backward function and, if needed, what is the workspace for the backward function?

@Roshrini
Member

Roshrini commented Jan 8, 2019

@azai91 Can you please rebase this PR?

@stu1130
Contributor

stu1130 commented Jan 16, 2019

@azai91 ping again! thanks

@pengzhao-intel
Contributor

IMO, this usage and its default behavior are not clear even in the original code path.
@azai91 @stu1130 Could you help get a clear picture first, and then we can discuss how to handle it in the MKLDNN backend?

@sandeep-krishnamurthy
Contributor

@apeforest - @TaoLv raised a good question here. Can you please help us answer -

What's the expectation in mxnet if backward is called with is_train=false? What's the grad_output for the backward function and, if needed, what's the workspace for the backward function?

@pengzhao-intel
Contributor

Ping, any update @azai91?

@abhinavs95
Contributor

@azai91 Could you please rebase and fix the CI issues?

@abhinavs95
Contributor

@mxnet-label-bot update [pr-awaiting-response, pr-work-in-progress]

@marcoabreu marcoabreu added pr-awaiting-response PR is reviewed and waiting for contributor to respond and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels Mar 27, 2019
@azai91 azai91 closed this Mar 27, 2019
Labels
pr-awaiting-response PR is reviewed and waiting for contributor to respond pr-work-in-progress PR is still work in progress