This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix for flaky test_operator_gpu.test_spatial_transformer_with_type #12557

Merged

1 commit merged into apache:master on Sep 14, 2018

Conversation

@haojin2 (Contributor) commented Sep 13, 2018

Description

Fix for #11839. The consistency check fails between the in-house CPU and GPU implementations due to subtle differences in the code. This fixes a missing calculation step in the GPU implementation used when cuDNN is off.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Add cudnn_off parameter to SpatialTransformer
  • Fix the inconsistency between CPU & GPU implementation
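For readers unfamiliar with the operator, here is a hedged sketch (not MXNet's actual implementation): with transform_type='affine', a spatial transformer first builds a sampling grid by mapping normalized output coordinates through a 2x3 affine matrix theta, and then bilinearly samples the input at those grid points. The grid-generation step looks roughly like this in NumPy; the coordinate conventions below are assumptions for illustration, not copied from MXNet's kernels:

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map normalized target coords through a 2x3 affine matrix theta.

    Sketch only: not a copy of MXNet's SpatialTransformer/GridGenerator code.
    """
    # Normalized target coordinates in [-1, 1]; shape (3, H*W): rows x, y, 1
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    # Source-space sampling locations; shape (2, H*W): rows x_src, y_src
    return theta @ coords

# With the identity transform, the grid reproduces the target coordinates.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
```

The consistency test then checks that the CPU, non-cuDNN GPU, and cuDNN backends all produce the same sampled output for such grids.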

Comments

The new test passed more than 10,000 runs:

MXNET_TEST_COUNT=10000 nosetests -s --verbose tests/python/gpu/test_operator_gpu.py:test_spatial_transformer_with_type
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=2123178900 to reproduce.
test_operator_gpu.test_spatial_transformer_with_type ... ok

----------------------------------------------------------------------
Ran 1 test in 452.145s

OK
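For context on what a run like this does: MXNET_TEST_COUNT reruns the test body with fresh random inputs, and the test itself compares the outputs of interchangeable backends within a tolerance. A minimal stand-in for that loop (a hypothetical helper for illustration, not MXNet's actual check_consistency utility):

```python
import numpy as np

def check_consistency(impls, make_input, trials=1000, rtol=1e-3, atol=1e-5,
                      seed=0):
    """Run every implementation on the same random inputs many times.

    Stand-in for the MXNET_TEST_COUNT loop; `impls` are interchangeable
    functions (e.g. a CPU and a GPU kernel) that must agree numerically.
    Returns (all_passed, index of first failing trial or trials run).
    """
    rng = np.random.default_rng(seed)
    for t in range(trials):
        x = make_input(rng)
        ref = impls[0](x)
        for impl in impls[1:]:
            if not np.allclose(ref, impl(x), rtol=rtol, atol=atol):
                return False, t  # mismatch only on inputs that expose a bug
    return True, trials

# Two "backends" that genuinely agree pass every trial.
ok, trials_run = check_consistency(
    [np.square, lambda x: x * x],
    lambda rng: rng.normal(size=8))
```

A backend with a small systematic error passes or fails depending on which random inputs the seed produces, which is exactly how an implementation bug surfaces as a "flaky" test rather than a deterministic failure.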

@haojin2 (Contributor, Author) commented Sep 13, 2018

@access2rohit (Contributor)

LGTM

@lupesko (Contributor) commented Sep 14, 2018

Thanks for contributing a fix, Hao!
@mxnet-label-bot [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Sep 14, 2018
@apeforest (Contributor)

Please provide more description in the PR to facilitate reviewers. Why is this test case flaky, and what did you do to fix it?

@eric-haibin-lin (Member) left a comment


Ready to merge pending description update

@haojin2 (Contributor, Author) commented Sep 14, 2018

@apeforest Please take a look at the Changes section for more details. Please read all sections of the description before you say it's not enough.

@apeforest (Contributor)

IMO, the Changes section is not the place to explain the rationale of a PR but the place to list the specific changes it includes, such as "added a parameter X to function Y" or "added a unit test in module Z". Nonetheless, I am still not clear why "Fix the inconsistency between CPU & GPU implementation" would fix the flakiness of the test. Please elaborate in the Description section. It not only helps reviewers understand the PR but also helps other developers learn from this fix. Thank you!

@haojin2 (Contributor, Author) commented Sep 14, 2018

@apeforest I think "lack of description" should at least not be a blocker for merging the PR, as it's not part of the standards for PRs to be accepted. Plus, a competent reviewer would be able to parse the code changes together with the info provided in all sections of the PR description. A one-sentence description has been added to the Description section. @eric-haibin-lin Please merge the PR ASAP.

@ThomasDelteil (Contributor) commented Sep 14, 2018

@haojin2 "a competent reviewer would be able to parse the code changes together with all info provided in all sections of the description of the PR" could be interpreted as gatekeeping. People will have a hard time moving from beginners to competent reviewers if each PR contains only the minimum amount of information a competent reviewer needs to understand it.

This is a general comment more than one specific to this PR: I think we should strive for descriptions and explanations that allow newbies and MXNet amateurs to understand the gist of the changes contributed to the repo. This will help lower the barrier to entry and make the community more inclusive.

Each PR is an opportunity to educate and inform the community about how MXNet runs under the hood. Feel free to put more rather than less in your descriptions; competent reviewers will skip the pieces of information they already know, and everybody else will applaud you for it 👍

edit:
To give a concrete example, here you could explain why we sometimes offer both a CUDA and a cuDNN version of an operator and how to switch between the two implementations. For this specific fix, my understanding is that the CUDA implementation was missing a value update to bottom_left_v, which led to a wrong value being computed?
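If that reading is right, a toy version of the omission looks like the sketch below. This is for illustration only, not the actual CUDA kernel; the drop_bottom_left flag is hypothetical and just mimics skipping the bottom_left_v accumulation named above:

```python
import numpy as np

def bilinear_sample(img, y, x, drop_bottom_left=False):
    """Toy bilinear interpolation of a 2-D array at fractional coords (y, x).

    Not MXNet's kernel, just a sketch. With drop_bottom_left=True it mimics
    an implementation that forgets to accumulate the bottom-left corner,
    the kind of omission this PR reportedly fixed.
    """
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    out = (img[y0, x0] * (1 - dy) * (1 - dx)   # top-left
           + img[y0, x1] * (1 - dy) * dx       # top-right
           + img[y1, x1] * dy * dx)            # bottom-right
    if not drop_bottom_left:
        out += img[y1, x0] * dy * (1 - dx)     # bottom-left: the missing term
    return out
```

Note that the buggy path still agrees with the correct one whenever dy is 0 (the sampling coordinate lands on a pixel row), so two backends can match on some random inputs and diverge on others, which would make a randomized consistency check look flaky.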

@haojin2 (Contributor, Author) commented Sep 14, 2018

@ThomasDelteil IMHO, education is not the same as spoon-feeding, and we should balance the workload we put on contributors against how easy the code is for new members to understand. Moreover, I would claim that one should be competent enough to understand the code in order to give constructive code reviews, don't you think? Note that I'm not saying one cannot ask questions on PRs. Asking questions is actually what I would encourage community members to do on my PRs, rather than instructing me to do something on my PR (which is not good practice for a more friendly and inclusive community).

@haojin2 (Contributor, Author) commented Sep 14, 2018

@ThomasDelteil cudnn_off is more of a testing facility that lets us compare all versions (in-house CPU, in-house GPU, cuDNN) without rebuilding the package, but users who are not satisfied with cuDNN for some reason can also use it in their code. I think you got the idea behind the change to the GPU impl. I'm not familiar with the operator either, but we know all three versions should give the same results, and the GPU one was violating that rule; after a comparison I found the GPU code was a bit different, hence the change.

@ThomasDelteil (Contributor)

Thanks @haojin2 for your answer. I wonder if we could leverage onnx to check whether our operator implementations are consistent with PyTorch / TF / Caffe2 etc.

@eric-haibin-lin eric-haibin-lin merged commit f8ed533 into apache:master Sep 14, 2018
@haojin2 haojin2 deleted the fix_spatial_transfomer branch September 14, 2018 22:37
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018