Fix Windows GPU CI #17962

leezu · 2020-04-02T19:39:27Z

Description

Minimal version of #17808

Update the Windows CI to use VS 2019 and enable the x64 toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to CUDA 10.2, as VS 2019 is not supported by the CUDA 9.2 used previously. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic that installs cmake 3.16 on every PR, as cmake 3.17 is now preinstalled.
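As a rough illustration of the toolchain switch described above, the sketch below assembles a Windows command line that first loads the 64-bit MSVC environment and then configures the project with the Ninja generator. The function name, the default `vcvars64.bat` location, and the directory arguments are assumptions made for illustration and are not taken from `ci/build_windows.py`; `-G Ninja`, `-S`, `-B`, and `-DCMAKE_BUILD_TYPE` are standard CMake options.

```python
# Hypothetical default; the real location depends on the VS 2019 edition installed.
DEFAULT_VCVARS64 = (
    r"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community"
    r"\VC\Auxiliary\Build\vcvars64.bat"
)


def ninja_configure_command(source_dir, build_dir, vcvars=DEFAULT_VCVARS64,
                            build_type="Release"):
    """Return a cmd.exe command line that configures a Ninja build.

    vcvars64.bat puts the x64-hosted, x64-targeting MSVC tools on PATH,
    so the 64-bit compiler and linker are used instead of the 32-bit ones.
    """
    parts = [
        f'"{vcvars}"',
        f'cmake -G Ninja -DCMAKE_BUILD_TYPE={build_type} '
        f'-S "{source_dir}" -B "{build_dir}"',
    ]
    # "&&" runs cmake only if the environment script succeeded.
    return " && ".join(parts)
```

The returned string would typically be passed to a shell, for example via `subprocess.check_call(cmd, shell=True)`, so that the environment set up by the batch file is still in effect when cmake runs.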

CC: @marcoabreu

mxnet-bot · 2020-04-02T19:39:32Z

Hey @leezu , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, unix-cpu, centos-cpu, sanity, unix-gpu, website, clang, windows-cpu, edge, miscellaneous, centos-gpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.
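To make the trigger syntax above concrete, here is a small sketch that extracts the requested job names from a comment. It is not the bot's actual implementation: the function name `parse_ci_trigger`, the regular expression, and the choice to raise on unknown jobs are all assumptions; only the command format and the job list come from the message above.

```python
import re

# Job names as listed in the bot message.
SUPPORTED_JOBS = {
    "windows-gpu", "unix-cpu", "centos-cpu", "sanity", "unix-gpu", "website",
    "clang", "windows-cpu", "edge", "miscellaneous", "centos-gpu",
}

# Matches "@mxnet-bot run ci [ ... ]" and captures the text inside the brackets.
_TRIGGER = re.compile(r"@mxnet-bot\s+run\s+ci\s+\[([^\]]*)\]", re.IGNORECASE)


def parse_ci_trigger(comment):
    """Return the list of CI jobs requested in `comment` (empty if none)."""
    match = _TRIGGER.search(comment)
    if match is None:
        return []
    requested = [name.strip() for name in match.group(1).split(",") if name.strip()]
    if requested == ["all"]:
        return sorted(SUPPORTED_JOBS)
    unknown = [name for name in requested if name not in SUPPORTED_JOBS]
    if unknown:
        raise ValueError(f"unsupported CI jobs: {unknown}")
    return requested
```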


marcoabreu · 2020-04-02T19:41:15Z

ci/build_windows.py

        if 'CUDA_PATH' not in os.environ:
-            os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v9.2"
+            os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"


Please revert the CUDA change. That does not represent a fix.

It's a requirement for VS2019

Usually visual studio does allow more than one cuda version. You just have to install the respective Toolkit.

What do you mean? Cuda 9.2 does not support VS 2019

You can read for example https://superuser.com/questions/1506044/installing-cuda-9-2-with-vs-2019 or https://devblogs.microsoft.com/cppblog/cuda-10-1-available-now-with-support-for-latest-microsoft-visual-studio-2019-versions/

Oh, they intentionally broke that backwards compatibility feature. Usually it was possible to compile older cuda versions in later vs versions by installing toolkits which make sure that the integration is available. Seems like that caused issues and thus Microsoft and Nvidia decided to not go that route any further.

In that case, fine to proceed.

But is there still some kind of compatibility mode which checks that the code is still compliant with older CUDA standards?

But is there still some kind of compatibility mode which checks that the code is still compliant with older CUDA standards?

No. Unix & CentOS tests Cuda 10.1. Windows now tests Cuda 10.2.
But the risk is low that we'd break Cuda 9 support within the next few days. So it's not a one-way door decision. I suggest we discuss on dev if we want to support cuda 9. If we decide to support it, let's switch the CentOS tests to use Cuda 9.
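The constraint discussed in this thread can be sketched as a small guard around the `CUDA_PATH` default. This is an illustration only: the helper names and the version check are hypothetical and not part of `ci/build_windows.py`, and the minimum version is an assumption based on the linked Microsoft blog post, which announces VS 2019 support in CUDA 10.1.

```python
import os

# Assumption based on the linked blog post: CUDA 10.1 is the first release
# that supports Visual Studio 2019.
MIN_CUDA_FOR_VS2019 = (10, 1)

CUDA_ROOT = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA"


def parse_cuda_version(text):
    """Turn a string such as 'v10.2' or '9.2' into a tuple like (10, 2)."""
    major, minor = text.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def default_cuda_path(version, environ=None):
    """Return CUDA_PATH from the environment, or a default for `version`.

    An explicitly set CUDA_PATH always wins; the version check applies
    only when the default path is about to be used.
    """
    environ = os.environ if environ is None else environ
    if "CUDA_PATH" in environ:
        return environ["CUDA_PATH"]
    if parse_cuda_version(version) < MIN_CUDA_FOR_VS2019:
        raise ValueError(f"CUDA {version} predates VS 2019 support")
    return f"{CUDA_ROOT}\\v{version.lstrip('v')}"
```

Failing early here would surface a toolkit mismatch before the compiler does, at the cost of hard-coding a version threshold that has to be maintained.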

tests/python/unittest/test_gluon_data.py

leezu · 2020-04-02T19:44:31Z

Re #17808 (comment)

Connectivity errors were never an issue with Windows slaves

Connectivity issues became a problem after switching to Windows Server 2019. The switch was done as the old AMI apparently can't be rebuilt anymore and a new AMI had to be started. I was not involved in that effort, but I think it's reasonable to get the Windows AMI instructions working again and choose the latest Windows Server for doing that.

Jenkins connectivity issues typically result from system or network load problems. The newer version of Windows may have some issues causing network problems on a slower machine such as g3. If you have an alternative fix, please propose it.

Moving to a g4 instance to run the Windows GPU tests resolved the connectivity issue.

marcoabreu · 2020-04-02T19:47:07Z

Well, the new AMI should not have been moved into production then. Sorry, but changing a hundred knobs to facilitate one change isn't right. Either get a stable replacement and deploy that or leave stuff as it is. Replacing an existing system with an inferior version does not make sense to me.

These are some standards which I do not see aligned with the project's interest.

leezu · 2020-04-02T19:52:08Z

Could you elaborate on how the distinction between running tests on a g3 or g4 does not align with the project's interest?

leezu · 2020-04-02T22:39:55Z

Based on offline discussion with Marco, let's use a patched version of the old AMI first to fix the CI. @josephevans helped to install VS 2019 on the old AMI. I have further reduced the diff of this PR to include only the minimal changes needed to switch to VS 2019 and the x64 toolchain.

If this fixes the issue, we'll update to the new AMI with updated cuda and g4 instances at a later point after running it in the dev environment for a while.

aaronmarkham

Can you please summarize the changes and the reasoning in the PR description?
For example, I think it is important to note why you're bumping the CUDA version, and details like this shouldn't be buried deep in the PR's communication. Anyone researching what happened here might have a hard time. Also, when setting up Windows for yourself, you might want to see how CI does it and why.

So are we going to x version of VS as a default?
And what's this cmake change?

leezu · 2020-04-03T16:52:41Z

Can you please summarize the changes and the reasonings in the PR description?

Done

Also, setting up Windows for yourself, you might want to see how CI does it and why.

#17808 will provide an updated setup. This PR is only an emergency fix.

leezu · 2020-04-03T17:21:53Z

@marcoabreu gpu build is still flaky due to thrust + VS2019 issues. Adding back the retries. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17962/17/pipeline
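A retry wrapper of the kind mentioned here could look roughly like the sketch below. It is not the code added in this PR: the function name, the default attempt count, and the injectable `runner` parameter are assumptions chosen to keep the example self-contained and testable.

```python
import subprocess
import time


def _shell_runner(command):
    """Run `command` in a shell and return its exit code."""
    return subprocess.call(command, shell=True)


def run_with_retries(command, attempts=3, delay_seconds=0.0, runner=_shell_runner):
    """Run `command` until it succeeds or `attempts` runs have failed.

    Returns the 1-based number of the successful attempt. Raises
    RuntimeError if every attempt fails, so a CI job cannot pass
    silently after exhausting its retries.
    """
    for attempt in range(1, attempts + 1):
        if runner(command) == 0:
            return attempt
        if attempt < attempts:
            # Brief pause before the next try; flaky failures are often transient.
            time.sleep(delay_seconds)
    raise RuntimeError(f"{command!r} failed after {attempts} attempts")
```

Retries of this sort mask a flaky compiler step rather than fix it, so raising once all attempts are exhausted keeps genuine build breakage visible.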

Update Windows CI to use VS 2019 and enable the x64 toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to CUDA 10.2, as VS 2019 is not supported by the CUDA 9.2 used previously. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic that installs cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to CUDA thrust + VS2019 flakiness.

Co-authored-by: vexilligera


* Fix einsum gradient (#18482)
* [v1.7.x] Backport PRs of numpy features (#18653)
  * add zero grad for npi_unique (#18080)
  * fix np.clip scalar input case (#17788)
  * fix true_divide (#18393)
  Co-authored-by: Hao Jin
  Co-authored-by: Xi Wang
* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)
  * Fix Windows GPU CI (#17962)
  * backport mixed type
  Co-authored-by: Leonard Lausen
  Co-authored-by: vexilligera
* revise activations (#18700)
* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)
  * Fix the monitor_callback invalid issue during calibration with variable input shapes
  * retrigger CI
  * Add UT for monitor check and disable codecov
  Co-authored-by: Tao Lv
* Fail build_windows.py if all retries failed (#18177)
* Update to thrust 1.9.8 on Windows (#18218)
  * Update to thrust 1.9.8 on Windows
  * Remove debug logic
* Re-enable build retries on MSVC (#18230)
  Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. Tracked upstream at NVIDIA/thrust#1090

Co-authored-by: Ke Han
Co-authored-by: Xingjian Shi
Co-authored-by: Hao Jin
Co-authored-by: Xi Wang
Co-authored-by: Yijun Chen
Co-authored-by: vexilligera
Co-authored-by: ciyong
Co-authored-by: Tao Lv


leezu requested review from aaronmarkham and marcoabreu as code owners April 2, 2020 19:39

marcoabreu suggested changes Apr 2, 2020


leezu mentioned this pull request Apr 2, 2020

[WIP] Windows dev environment configuration, update install instructions from source in the docs #17808

Closed

7 tasks

leezu requested a review from marcoabreu April 2, 2020 20:04

leezu force-pushed the fixci branch from c310881 to 8084ac3 Compare April 2, 2020 22:38

leezu force-pushed the fixci branch 3 times, most recently from 184ca36 to f70dc39 Compare April 3, 2020 03:30

vexilligera added 2 commits April 3, 2020 03:50

vs 2019 and cuda 10.2

8fb3563

update dmlc-core

3fd7f4a

leezu force-pushed the fixci branch 4 times, most recently from a348cfd to 3fd7f4a Compare April 3, 2020 16:12

marcoabreu suggested changes Apr 3, 2020


marcoabreu approved these changes Apr 3, 2020


leezu force-pushed the fixci branch from 567c82f to 89e34bb Compare April 3, 2020 16:27

Remove OpenCV_RUNTIME and OpenCV_ARCH based on fixed autodetection

4d647b7

leezu force-pushed the fixci branch from 89e34bb to 4d647b7 Compare April 3, 2020 16:32

aaronmarkham reviewed Apr 3, 2020


Add build retrials due to cuda thrust + VS2019 flakyness

7536ce4

leezu force-pushed the fixci branch from 00f00df to 7536ce4 Compare April 3, 2020 17:39

vinitra-zz mentioned this pull request Apr 3, 2020

[ONNX export] Fixing spatial export for batchnorm #17711

Merged

4 tasks

leezu merged commit 66ee118 into apache:master Apr 3, 2020

leezu deleted the fixci branch April 3, 2020 20:20

ChaiBapchya mentioned this pull request Apr 3, 2020

Try ningyuan win gpu build #17947

Closed

leezu mentioned this pull request Apr 3, 2020

Backport CI fixes #17967

Closed

This was referenced Apr 5, 2020

update default windows Visual Studio from 2015 to 2017 #16712

Closed

Fix cudnn Dropout reproducibility #17547

Merged

leezu mentioned this pull request Apr 24, 2020

CUDA: unspecified launch failure on CI Windows #17616

Closed

ChaiBapchya mentioned this pull request Jul 3, 2020

[v1.7.x] backport mixed type binary ops to v1.7.x #18649

Merged
