This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix Windows GPU CI #17962

Merged
merged 4 commits into from
Apr 3, 2020

Conversation

leezu
Contributor

@leezu leezu commented Apr 2, 2020

Description

Minimal version of #17808

Update the Windows CI to use VS 2019 and enable the x64 toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to CUDA 10.2, as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic that installed cmake 3.16 on every PR, as cmake 3.17 is now preinstalled.
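
For readers reproducing the setup, the toolchain choice described above might look roughly like the sketch below. It is an illustration only and not the code in ci/build_windows.py: the helper name `cmake_configure_args` and its parameters are hypothetical. The `-G Ninja` generator and the `-A x64` / `-T host=x64` options of the Visual Studio generator are standard CMake options.

```python
def cmake_configure_args(source_dir, use_ninja=True, build_type="Release"):
    """Return a cmake configure command as a list (hypothetical helper)."""
    if use_ninja:
        # Ninja takes the compiler from the environment, so the 64-bit
        # MSVC tools must already be active (for example via vcvars64.bat).
        generator_args = ["-G", "Ninja", f"-DCMAKE_BUILD_TYPE={build_type}"]
    else:
        # Visual Studio generator: target x64 and use the 64-bit host tools
        # instead of the 32-bit ones that ran out of memory while linking.
        generator_args = ["-G", "Visual Studio 16 2019", "-A", "x64", "-T", "host=x64"]
    return ["cmake", *generator_args, source_dir]


if __name__ == "__main__":
    print(" ".join(cmake_configure_args(".")))
```

The list form can be passed directly to `subprocess.check_call`, which avoids quoting problems with paths that contain spaces.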

CC: @marcoabreu

@mxnet-bot

Hey @leezu, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more of them again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, unix-cpu, centos-cpu, sanity, unix-gpu, website, clang, windows-cpu, edge, miscellaneous, centos-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

ci/build_windows.py Outdated Show resolved Hide resolved
  if 'CUDA_PATH' not in os.environ:
-     os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v9.2"
+     os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"
Contributor

Please revert the CUDA change. That does not represent a fix.

Contributor Author

@leezu leezu Apr 2, 2020

It's a requirement for VS2019

Contributor

Usually Visual Studio does allow more than one CUDA version. You just have to install the respective toolkit.

Contributor Author

What do you mean? Cuda 9.2 does not support VS 2019

Contributor Author

@leezu leezu Apr 3, 2020

Contributor

Oh, they intentionally broke that backwards-compatibility feature. Previously it was possible to build against older CUDA versions in later VS versions by installing toolkits that made the integration available. It seems that caused issues, so Microsoft and NVIDIA decided not to go that route any further.

In that case, fine to proceed.

But is there still some kind of compatibility mode which checks that the code is still compliant with older CUDA standards?

Contributor Author

> But is there still some kind of compatibility mode which checks that the code is still compliant with older CUDA standards?

No. The Unix and CentOS jobs test CUDA 10.1. Windows now tests CUDA 10.2.
But the risk is low that we'd break CUDA 9 support within the next few days, so it's not a one-way-door decision. I suggest we discuss on dev whether we want to support CUDA 9. If we decide to support it, let's switch the CentOS tests to use CUDA 9.
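
The CUDA_PATH default changed at the top of this thread only takes effect when the variable is not already set, so a CI machine can still point the build at another toolkit. A minimal sketch of that pattern, assuming the default install path from the diff; the helper name `resolve_cuda_path` is hypothetical and does not appear in ci/build_windows.py:

```python
import os

# Default install location taken from the diff in this thread.
DEFAULT_CUDA_PATH = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"


def resolve_cuda_path(environ=os.environ, default=DEFAULT_CUDA_PATH):
    """Return CUDA_PATH, writing the default only if it is not already set."""
    # setdefault leaves an existing value untouched, which mirrors the
    # "if 'CUDA_PATH' not in os.environ" guard in the build script.
    return environ.setdefault("CUDA_PATH", default)


if __name__ == "__main__":
    print(resolve_cuda_path())
```

Passing a plain dict instead of `os.environ` makes the behavior easy to check without touching the real environment.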

tests/python/unittest/test_gluon_data.py Outdated Show resolved Hide resolved
@leezu
Contributor Author

leezu commented Apr 2, 2020

Re #17808 (comment)

> Connectivity errors were never an issue with Windows slaves

Connectivity issues became a problem after switching to Windows Server 2019. The switch was done as the old AMI apparently can't be rebuilt anymore and a new AMI had to be started. I was not involved in that effort, but I think it's reasonable to get the Windows AMI instructions working again and choose the latest Windows Server for doing that.

Jenkins connectivity issues typically result from system or network load problems. The newer version of Windows may have some issues causing network problems on a slower machine such as g3. If you have an alternative fix, please propose it.

Moving to a g4 instance to run the Windows GPU tests resolved the connectivity issue.

@marcoabreu
Contributor

marcoabreu commented Apr 2, 2020

Well, the new AMI should not have been moved into production then. Sorry, but changing a hundred knobs to facilitate one change isn't right. Either get a stable replacement and deploy that, or leave things as they are. Replacing an existing system with an inferior version does not make sense to me.

These are standards which I do not see as aligned with the project's interest.

@leezu
Contributor Author

leezu commented Apr 2, 2020

Could you elaborate on how the distinction between running tests on a g3 or g4 does not align with the project's interest?

@leezu
Contributor Author

leezu commented Apr 2, 2020

Based on an offline discussion with Marco, let's first use a patched version of the old AMI to fix the CI. @josephevans helped to install VS 2019 on the old AMI. I have further reduced the diff of this PR to include only the minimal changes needed to switch to VS 2019 and the x64 toolchain.

If this fixes the issue, we'll update to the new AMI with updated cuda and g4 instances at a later point after running it in the dev environment for a while.

@leezu leezu force-pushed the fixci branch 3 times, most recently from 184ca36 to f70dc39 Compare April 3, 2020 03:30
@leezu leezu force-pushed the fixci branch 4 times, most recently from a348cfd to 3fd7f4a Compare April 3, 2020 16:12
Contributor

@aaronmarkham aaronmarkham left a comment

Can you please summarize the changes and the reasonings in the PR description?
For example, I think it is important to note why you're bumping the CUDA version and details like this shouldn't be buried deep in the PR's communication. Anyone researching what happened here might have a hard time. Also, setting up Windows for yourself, you might want to see how CI does it and why.

So are we going to x version of VS as a default?
And what's this cmake change?

@leezu
Contributor Author

leezu commented Apr 3, 2020

> Can you please summarize the changes and the reasonings in the PR description?

Done

> Also, setting up Windows for yourself, you might want to see how CI does it and why.

#17808 will provide an updated setup. This PR is only an emergency fix.

@leezu
Contributor Author

leezu commented Apr 3, 2020

@marcoabreu gpu build is still flaky due to thrust + VS2019 issues. Adding back the retries. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17962/17/pipeline
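
The retry idea mentioned here can be sketched as follows. This is a minimal illustration under stated assumptions, not the logic in ci/build_windows.py: the name `run_with_retries`, the default of three attempts, and the delay parameter are all hypothetical. The last failure is re-raised so that a build that never succeeds is not reported as passing.

```python
import logging
import time


def run_with_retries(build_step, max_attempts=3, delay_seconds=0.0):
    """Call build_step() until it succeeds or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return build_step()
        except Exception as error:  # broad on purpose for this sketch
            logging.warning("Build attempt %d/%d failed: %s",
                            attempt, max_attempts, error)
            if attempt == max_attempts:
                raise  # surface the failure instead of reporting success
            time.sleep(delay_seconds)
```

In a real build script, `build_step` would typically wrap a `subprocess.check_call` invocation of the build command, which raises `CalledProcessError` on a non-zero exit code.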

@leezu leezu merged commit 66ee118 into apache:master Apr 3, 2020
@leezu leezu deleted the fixci branch April 3, 2020 20:20
leezu added a commit to leezu/mxnet that referenced this pull request Apr 3, 2020
Update the Windows CI to use VS 2019 and enable the x64 toolchain. Previously we were using an older 32-bit toolchain, which caused OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in apache#17912 and did not work. Update to CUDA 10.2, as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove the logic that installed cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to CUDA thrust + VS2019 flakiness.

Co-authored-by: vexilligera <[email protected]>
@leezu leezu mentioned this pull request Apr 3, 2020
mk-61 pushed a commit to mk-61/incubator-mxnet that referenced this pull request Apr 7, 2020
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
yijunc pushed a commit to yijunc/incubator-mxnet that referenced this pull request Jul 4, 2020
szha pushed a commit that referenced this pull request Jul 5, 2020
* Fix Windows GPU CI (#17962)

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
shuo-ouyang pushed a commit to shuo-ouyang/incubator-mxnet that referenced this pull request Aug 9, 2020
leezu added a commit to leezu/mxnet that referenced this pull request Oct 1, 2020
* Fix Windows GPU CI (apache#17962)

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
samskalicky pushed a commit that referenced this pull request Oct 2, 2020
* * Fix einsum gradient (#18482)

* [v1.7.x] Backport PRs of numpy features (#18653)

* add zero grad for npi_unique (#18080)

* fix np.clip scalar input case (#17788)

* fix true_divide (#18393)

Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>

* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)

* Fix Windows GPU CI (#17962)

* backport mixed type

Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: vexilligera <[email protected]>

* revise activations (#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <[email protected]>

* Fail build_windows.py if all retries failed (#18177)

* Update to thrust 1.9.8 on Windows (#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
samskalicky pushed a commit to samskalicky/incubator-mxnet that referenced this pull request Oct 2, 2020
samskalicky added a commit that referenced this pull request Oct 3, 2020
Co-authored-by: Leonard Lausen <[email protected]>
Co-authored-by: Ke Han <[email protected]>
Co-authored-by: Xingjian Shi <[email protected]>
Co-authored-by: Hao Jin <[email protected]>
Co-authored-by: Xi Wang <[email protected]>
Co-authored-by: Yijun Chen <[email protected]>
Co-authored-by: vexilligera <[email protected]>
Co-authored-by: ciyong <[email protected]>
Co-authored-by: Tao Lv <[email protected]>
5 participants