[v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes #18632
Hey @ciyongch, thanks for submitting the PR.
CI supported jobs: [windows-gpu, website, centos-cpu, windows-cpu, unix-gpu, sanity, edge, miscellaneous, clang, centos-gpu, unix-cpu]
Great job root-causing this bug :)
@ChaiBapchya @leezu it looks like there are CI issues on the current v1.6.x branch, which already existed in the previous commit #18586. Do you know if anyone is working on this? Thanks!
edge
unix-gpu
@sandeep-krishnamurthy @ChaiBapchya for help :)
@ciyongch @PatricZhao Going forward, I created another PR on the 1.6.x branch: #18597
However, it fails on setuptools as you pointed out. I'll try to get that fixed so that we can get the CI fixed for 1.6.x.
7d554b8 to 72ba804
It passed all 11. Why did we have to retrigger? Is codecov blocking the merge?
Hi @ChaiBapchya, I saw that the codecov test cases failed, and the mxnet-bot doesn't support re-triggering them. I'm not sure whether they're a merge blocker, so I just re-triggered the cases.
I don't think that's the case.
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
Yes, this is a common issue for all the current branches. I will backport the fix to the other branches as well.
Currently, we've only verified this via a customized case which is kind of complicated. I will try to add some tests later to cover it.
Codecov failures are still there, which shouldn't be a blocker, I think.
Codecov is not a blocker.
@pengzhao-intel @TaoLv this will be good to go after your review and approval.
Can we add a basic test to verify this? I guess reviewers would feel confident to approve this once they know there is a proper test to verify it and that it passes. @sandeep-krishnamurthy wdyt?
To "fix" the codecov showing up on the 1.x branches, you can include the 3 lines from https://github.com/apache/incubator-mxnet/pull/18497/files
@ChaiBapchya We've verified the fix via an offline customized case. Anyway, it's quite reasonable to add a UT to cover this case. I will try to add this today.
LGTM
Hi @ChaiBapchya @leezu @pengzhao-intel @TaoLv, all the CI jobs have now passed and the UT is added as well. Please help to merge, thanks.
Thanks for adding the UT. LGTM!
… variable input shapes (apache#18632)
* Fix the monitor_callback invalid issue during calibration with variable input shapes
* retrigger CI
* Add UT for monitor check and disable codecov
… variable input shapes (#18632) (#18703)
* Fix the monitor_callback invalid issue during calibration with variable input shapes
* retrigger CI
* Add UT for monitor check and disable codecov
Co-authored-by: Tao Lv <[email protected]>
* Fix einsum gradient (#18482)
* [v1.7.x] Backport PRs of numpy features (#18653)
* add zero grad for npi_unique (#18080)
* fix np.clip scalar input case (#17788)
* fix true_divide (#18393)
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)
* Fix Windows GPU CI (#17962)
Update Windows CI to use VS 2019 and enable x64 bit toolchain. Previously we are using an older 32 bit toolchain causing OOM errors during linking. Switching to x64 bit toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work.
Update to Cuda 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up build as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR as cmake 3.17 is now preinstalled. Add build retrials due to cuda thrust + VS2019 flakyness.
Co-authored-by: vexilligera <[email protected]>
* backport mixed type
Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
* revise activations (#18700)
* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)
* Fix the monitor_callback invalid issue during calibration with variable input shapes
* retrigger CI
* Add UT for monitor check and disable codecov
Co-authored-by: Tao Lv <[email protected]>
* Fail build_windows.py if all retries failed (#18177)
* Update to thrust 1.9.8 on Windows (#18218)
* Update to thrust 1.9.8 on Windows
* Remove debug logic
* Re-enable build retries on MSVC (#18230)
Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub.
Tracked upstream at NVIDIA/thrust#1090
Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
Description
When doing calibration with variable input shapes, a new `executor` is created whenever the current input has a different shape from the previous one. However, the `callback` function is bound only to the very first `executor` and is not passed down to the succeeding `executors`, which share the same symbol.
This PR passes the `callback` function down to those executors to address the calibration skipping issue.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments
@pengzhao-intel @TaoLv @ChaiBapchya @szha
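The behavior described under Description can be illustrated with a small, self-contained sketch. It does not use MXNet at all: `ToyExecutor`, `ToyModule`, `calibrate`, and the `pass_down_callback` flag are hypothetical names invented here for illustration, and the toy only mimics the idea that a new executor is created per input shape. The actual fix lives in MXNet's executor code and may look quite different.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor (not the MXNet class)."""

    def __init__(self, shape, monitor_callback=None):
        self.shape = shape
        self.monitor_callback = monitor_callback

    def forward(self):
        # The monitor callback only fires if this executor has one attached.
        if self.monitor_callback is not None:
            self.monitor_callback("output", self.shape)


class ToyModule:
    """Hypothetical module that creates one executor per input shape."""

    def __init__(self, first_shape, pass_down_callback):
        self.pass_down_callback = pass_down_callback
        self._callback = None
        self._executors = {first_shape: ToyExecutor(first_shape)}

    def set_monitor_callback(self, callback):
        # Binds the callback to the executors that exist right now.
        self._callback = callback
        for executor in self._executors.values():
            executor.monitor_callback = callback

    def forward(self, shape):
        if shape not in self._executors:
            # Assumed model of the bug: without passing the callback down,
            # an executor created for a new shape has no callback.
            callback = self._callback if self.pass_down_callback else None
            self._executors[shape] = ToyExecutor(shape, callback)
        self._executors[shape].forward()


def calibrate(pass_down_callback, shapes):
    """Return the shapes for which the monitor callback was invoked."""
    seen = []
    module = ToyModule(shapes[0], pass_down_callback)
    module.set_monitor_callback(lambda name, shape: seen.append(shape))
    for shape in shapes:
        module.forward(shape)
    return seen


shapes = [(1, 3), (2, 3), (4, 3)]
print("without pass-down:", calibrate(False, shapes))
print("with pass-down:   ", calibrate(True, shapes))
```

In this toy, the run without pass-down only observes the first input shape, which corresponds to calibration being skipped for later batches; the run with pass-down observes every shape.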