This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix for flaky test_operator_gpu.test_spatial_transformer_with_type #12557

Merged

1 commit merged into apache:master on Sep 14, 2018

Conversation

@haojin2 (Contributor) commented Sep 13, 2018

Description

Fix for #11839. The consistency check fails between the in-house CPU and GPU implementations due to subtle differences in the code. This fixes a missing calculation step in the GPU implementation used when cuDNN is off.

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it

Changes

  • Add cudnn_off parameter to SpatialTransformer
  • Fix the inconsistency between CPU & GPU implementation
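For readers unfamiliar with the operator, here is a hedged sketch (not MXNet's actual implementation): with transform_type='affine', a spatial transformer first builds a sampling grid by mapping normalized output coordinates through a 2x3 affine matrix theta, and then bilinearly samples the input at those grid points. The grid-generation step looks roughly like this in NumPy; the coordinate conventions below are assumptions for illustration, not copied from MXNet's kernels:

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map normalized target coords through a 2x3 affine matrix theta.

    Sketch only: not a copy of MXNet's SpatialTransformer/GridGenerator code.
    """
    # Normalized target coordinates in [-1, 1]; shape (3, H*W): rows x, y, 1
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    # Source-space sampling locations; shape (2, H*W): rows x_src, y_src
    return theta @ coords

# With the identity transform, the grid reproduces the target coordinates.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
grid = affine_grid(identity, 4, 4)
```

The consistency test then checks that the CPU, non-cuDNN GPU, and cuDNN backends all produce the same sampled output for such grids.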

Comments

The new test passed more than 10,000 runs:

MXNET_TEST_COUNT=10000 nosetests -s --verbose tests/python/gpu/test_operator_gpu.py:test_spatial_transformer_with_type
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=2123178900 to reproduce.
test_operator_gpu.test_spatial_transformer_with_type ... ok

----------------------------------------------------------------------
Ran 1 test in 452.145s

OK
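For context on what a run like this does: MXNET_TEST_COUNT reruns the test body with fresh random inputs, and the test itself compares the outputs of interchangeable backends within a tolerance. A minimal stand-in for that loop (a hypothetical helper for illustration, not MXNet's actual check_consistency utility):

```python
import numpy as np

def check_consistency(impls, make_input, trials=1000, rtol=1e-3, atol=1e-5,
                      seed=0):
    """Run every implementation on the same random inputs many times.

    Stand-in for the MXNET_TEST_COUNT loop; `impls` are interchangeable
    functions (e.g. a CPU and a GPU kernel) that must agree numerically.
    Returns (all_passed, index of first failing trial or trials run).
    """
    rng = np.random.default_rng(seed)
    for t in range(trials):
        x = make_input(rng)
        ref = impls[0](x)
        for impl in impls[1:]:
            if not np.allclose(ref, impl(x), rtol=rtol, atol=atol):
                return False, t  # mismatch only on inputs that expose a bug
    return True, trials

# Two "backends" that genuinely agree pass every trial.
ok, trials_run = check_consistency(
    [np.square, lambda x: x * x],
    lambda rng: rng.normal(size=8))
```

A backend with a small systematic error passes or fails depending on which random inputs the seed produces, which is exactly how an implementation bug surfaces as a "flaky" test rather than a deterministic failure.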

@haojin2 (Contributor, Author) commented Sep 13, 2018

@access2rohit (Contributor)

LGTM

@lupesko (Contributor) commented Sep 14, 2018

Thanks for contributing a fix, Hao!
@mxnet-label-bot [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Sep 14, 2018
@apeforest (Contributor)

Please provide more description in the PR to facilitate reviewers. Why is this test case flaky, and what did you do to fix it?

@eric-haibin-lin (Member) left a comment


Ready to merge pending description update

@haojin2 (Contributor, Author) commented Sep 14, 2018

@apeforest Please take a look at the Changes section for more details. Please read all sections of the description before you say it's not enough.

@apeforest (Contributor)

IMO, the Changes section is not the place to explain the rationale of a PR but the place to list the specific changes it includes, such as "added a parameter X to function Y" or "added a unit test in module Z". Nonetheless, I am still not clear why "Fix the inconsistency between CPU & GPU implementation" would fix the flakiness of the test. Please elaborate in the Description section. It not only helps reviewers understand the PR but also helps other developers learn from this fix. Thank you!

@haojin2 (Contributor, Author) commented Sep 14, 2018

@apeforest I think "lack of description" should at least not be a blocker for merging the PR, as it's not part of the standards for PRs to be accepted. Plus, a competent reviewer would be able to parse the code changes together with the info provided in all sections of the PR description. A one-sentence description has been added to the Description section. @eric-haibin-lin Please merge the PR ASAP.

@ThomasDelteil (Contributor) commented Sep 14, 2018

@haojin2 "a competent reviewer would be able to parse the code changes together with all info provided in all sections of the description of the PR" could be interpreted as gatekeeping. People will have a hard time moving from beginners to competent reviewers if each PR contains only the minimum amount of information a competent reviewer needs to understand it.

This is a general comment more than one specific to this PR: I think we should strive for descriptions and explanations that allow newbies and MXNet amateurs to understand the gist of the changes contributed to the repo. This will help lower the barrier to entry and make the community more inclusive.

Each PR is an opportunity to educate and inform the community about how MXNet runs under the hood. Feel free to put more rather than less in your descriptions; competent reviewers will skip the pieces of information they already know, and everybody else will applaud you for it 👍

edit:
To give a concrete example, here you could explain why we sometimes offer both a CUDA and a cuDNN version of an operator and how to switch between the two implementations. For this specific fix, my understanding is that the CUDA implementation was missing a value update to bottom_left_v, which led to a wrong value being computed?
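If that reading is right, a toy version of the omission looks like the sketch below. This is for illustration only, not the actual CUDA kernel; the drop_bottom_left flag is hypothetical and just mimics skipping the bottom_left_v accumulation named above:

```python
import numpy as np

def bilinear_sample(img, y, x, drop_bottom_left=False):
    """Toy bilinear interpolation of a 2-D array at fractional coords (y, x).

    Not MXNet's kernel, just a sketch. With drop_bottom_left=True it mimics
    an implementation that forgets to accumulate the bottom-left corner,
    the kind of omission this PR reportedly fixed.
    """
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    out = (img[y0, x0] * (1 - dy) * (1 - dx)   # top-left
           + img[y0, x1] * (1 - dy) * dx       # top-right
           + img[y1, x1] * dy * dx)            # bottom-right
    if not drop_bottom_left:
        out += img[y1, x0] * dy * (1 - dx)     # bottom-left: the missing term
    return out
```

Note that the buggy path still agrees with the correct one whenever dy is 0 (the sampling coordinate lands on a pixel row), so two backends can match on some random inputs and diverge on others, which would make a randomized consistency check look flaky.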

@haojin2 (Contributor, Author) commented Sep 14, 2018

@ThomasDelteil IMHO, education is not the same as spoon-feeding, and we should balance the workload we put on contributors against how easy the code is for new members to understand. Moreover, I would claim that one should be competent enough to understand the code in order to give constructive code reviews, don't you think? Note that I'm not saying one cannot ask questions on PRs. Asking questions is actually what I would encourage community members to do on my PRs, rather than instructing me to do something on my PR (which is not good practice for a more friendly and inclusive community).

@haojin2 (Contributor, Author) commented Sep 14, 2018

@ThomasDelteil cudnn_off is more of a testing facility that lets us compare all versions (in-house CPU, in-house GPU, cuDNN) without rebuilding the package, but users who are not satisfied with cuDNN for some reason can also use it in their code. I think you got the idea behind the change to the GPU impl. I'm not familiar with the operator either, but we know all three versions should give the same results, and the GPU one was violating that rule; after a comparison I found the GPU code was a bit different, hence the change.

@ThomasDelteil (Contributor)

Thanks @haojin2 for your answer. I wonder if we could leverage onnx to check whether our operator implementations are consistent with PyTorch / TF / Caffe2 etc.

@eric-haibin-lin eric-haibin-lin merged commit f8ed533 into apache:master Sep 14, 2018
@haojin2 haojin2 deleted the fix_spatial_transfomer branch September 14, 2018 22:37
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request Sep 19, 2018