[WIP] Windows dev environment configuration, update install instructions from source in the docs #17808

vexilligera · 2020-03-11T03:04:33Z

Description

Taking over #17206 with a fix for #17635, and updating the toolchains to VS2019 and CUDA 10.2.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

leezu · 2020-03-11T03:20:42Z

ci/windows_dev_env/windows_deps_headless_installer.py

+    'openblas': 'https://windows-post-install.s3-us-west-2.amazonaws.com/OpenBLAS-windows-v0_2_19.zip',
+    'opencv': 'https://windows-post-install.s3-us-west-2.amazonaws.com/opencv-windows-4.1.2-vc14_vc15.zip',
+    'cudnn': 'https://windows-post-install.s3-us-west-2.amazonaws.com/cudnn-9.2-windows10-x64-v7.4.2.24.zip',
+    'nvdriver': 'https://windows-post-install.s3-us-west-2.amazonaws.com/nvidia_display_drivers_398.75_server2016.zip',


Who owns the windows-post-install bucket?

I currently don't have access to windows-post-install, so I opened another bucket (mxnet-windows-build). It would be easy to merge them.

mxnet-ci-dev

leezu · 2020-03-11T03:27:00Z

ci/windows_dev_env/windows_deps_headless_installer.py

+    logging.info("Perl install complete")
+
+
+def install_clang():


Why do we need clang when compiling with visual studio?

They probably need to support tvm op build on windows as well, which may need this.

leezu · 2020-03-11T03:27:11Z

ci/windows_dev_env/windows_deps_headless_installer.py

+        logging.info("Visual studio install complete.")
+
+
+def install_cmake():


leezu · 2020-03-11T03:30:14Z

docs/static_site/src/pages/get_started/windows_setup.md

 ```
-cmake -G "Visual Studio 15 2017 Win64" -T cuda=9.2,host=x64 -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NVRTC=1 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_BLAS=open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_LIST=Common -DCUDA_TOOLSET=9.2 -DCUDNN_INCLUDE=C:\cuda\include -DCUDNN_LIBRARY=C:\cuda\lib\x64\cudnn.lib "C:\incubator-mxnet"
+.\setup.ps1


Is it feasible to provide short non-blackbox instructions to install MXNet?

For example, if users have a standard installation of visual studio, can the installation experience be pretty similar to https://mxnet.apache.org/get_started/ubuntu_setup#build-the-mxnet-shared-library-from-source Step 2 and Step 3?

Yes, we can explicitly use the cmd command invoked from ci/build_windows.py
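As a rough illustration of what such explicit instructions could look like, the sketch below composes a single command line from the pieces visible in the CI log quoted later in this thread (the `vcvarsx86_amd64.bat` call followed by `cmake -G Ninja` and a set of `-D` flags). The helper name `cmake_command`, the flag subset, and the source directory are assumptions for illustration; this is not the code in ci/build_windows.py.

```python
def cmake_command(vcvars_bat, source_dir, flags):
    """Compose one cmd.exe line: set up the MSVC environment, then configure with CMake + Ninja."""
    definitions = " ".join("-D{}={}".format(name, value) for name, value in flags.items())
    return '"{}" && cmake -G Ninja {} {}'.format(vcvars_bat, definitions, source_dir)


# Path as quoted from ci/build_windows.py in this thread.
VCVARS = r"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsx86_amd64.bat"

# A subset of the flags from the CPU build log in this thread (illustrative, not exhaustive).
CPU_FLAGS = {
    "CMAKE_C_COMPILER": "cl",
    "CMAKE_CXX_COMPILER": "cl",
    "USE_CUDA": "OFF",
    "USE_CUDNN": "OFF",
    "USE_OPENCV": "ON",
    "USE_OPENMP": "ON",
    "USE_LAPACK": "ON",
    "USE_DIST_KVSTORE": "OFF",
    "CMAKE_BUILD_TYPE": "Release",
}

if __name__ == "__main__":
    # The source directory below is a placeholder taken from the old docs command.
    print(cmake_command(VCVARS, r"C:\incubator-mxnet", CPU_FLAGS))
```

Printing the command rather than running it keeps the steps visible, which is the point of non-blackbox instructions; users could paste the result into a regular command prompt.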

aaronmarkham · 2020-03-11T16:17:34Z

Let me know when this is ready, and I'll test it out.

leezu · 2020-03-19T00:49:38Z

@vexilligera what is the status of this PR?

vexilligera · 2020-03-19T02:19:19Z

> @vexilligera what is the status of this PR?

I'm trying to update the AMI to VS2019 and CUDA 10.2 and there's a bug with dmlc-core, probably caused by different MSVC standards. I'm working on that. If it's an issue with MSVC then I will try to use Clang-cl for building.

vexilligera · 2020-03-26T03:31:59Z

Making it public is redistribution and thus license infringement.

I may also need to upload the VS2019 web installer as in install_vs, since the original download page needs redirection to the file url and I couldn't find any short url as for VS2017.

ChaiBapchya · 2020-03-26T06:15:47Z

Updated links for cudnn & VS2019 [private bucket: restricted access]

leezu · 2020-03-26T23:32:15Z

@mxnet-bot run ci [windows-cpu, windows-gpu]

mxnet-bot · 2020-03-26T23:32:23Z

Jenkins CI successfully triggered : [windows-cpu, windows-gpu]

ChaiBapchya · 2020-03-27T18:51:28Z

retriggering.. windows

ChaiBapchya · 2020-03-27T19:21:21Z

It's failing only for windows-cpu MKLDNN MKL part with the error
"can't find specified path"

[2020-03-27T18:52:40.264Z] "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsx86_amd64.bat" && cmake -G Ninja -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=cl -DUSE_CUDA=OFF 
[2020-03-27T18:52:40.264Z] -DUSE_CUDNN=OFF -DENABLE_CUDA_RTC=OFF -DUSE_OPENCV=ON -DOpenCV_RUNTIME=vc15 -DOpenCV_ARCH=x64 -DUSE_OPENMP=ON -DUSE_BLAS=mkl -DUSE_LAPACK=ON -DUSE_DIST_KVSTORE=OFF -DUSE_MKL_IF_AVAILABLE=ON 
[2020-03-27T18:52:40.264Z] -DUSE_MKLDNN=ON -DCMAKE_BUILD_TYPE=Release C:\jenkins_slave\workspace\build-cpu-mkldnn-mkl

@larroy is this related to the path not being on one line? Or is the path itself incorrect?

ChaiBapchya · 2020-03-27T19:42:02Z

Turns out the path mentioned in ci/build_windows.py is correct

'VS 2019': r'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsx86_amd64.bat'

That is exactly where the bat file is.

[verified this by launching an instance using the updated windows AMI]

However, this path: "C:\jenkins_slave\workspace\build-cpu-mkldnn-mkl" doesn't exist on the AMI. Joe fixed it and updated the AMI. Testing on that.
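A pre-flight check along these lines might make a "can't find specified path" failure easier to attribute, since both the vcvars script and the workspace directory appear in the same command line. This is only a sketch; `missing_paths` is a hypothetical helper, not something in ci/build_windows.py.

```python
import os


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on this machine, in order."""
    return [p for p in paths if not os.path.exists(p)]


if __name__ == "__main__":
    # Both paths are the ones quoted in this thread.
    required = [
        r"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsx86_amd64.bat",
        r"C:\jenkins_slave\workspace\build-cpu-mkldnn-mkl",
    ]
    for path in missing_paths(required):
        print("missing:", path)
```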

vexilligera · 2020-03-28T13:32:28Z

WIN_GPU_MKLDNN is having a flaky issue similar to pytorch/pytorch#25393

Also need to update 3rdparty/dmlc-core to the latest to support VS2019

vexilligera · 2020-03-29T01:52:12Z

@mxnet-bot run ci [windows-gpu]

mxnet-bot · 2020-03-29T01:52:20Z

Jenkins CI successfully triggered : [windows-gpu]

vexilligera · 2020-03-29T02:39:05Z

@mxnet-bot run ci [unix-gpu]

mxnet-bot · 2020-03-29T02:39:13Z

Jenkins CI successfully triggered : [unix-gpu]

vexilligera · 2020-03-29T02:59:35Z

@mxnet-bot run ci [windows-gpu]

mxnet-bot · 2020-03-29T02:59:43Z

Jenkins CI successfully triggered : [windows-gpu]

ChaiBapchya · 2020-03-29T04:04:37Z

@marcoabreu Even in this case, @vexilligera is unable to reproduce the error locally. And WIN_GPU and WIN_GPU_MKLDNN are flaky. How do you suggest we proceed in trying to resolve the thrust issue on CI?

Fail : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/13/pipeline
Pass : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/12/pipeline

As pointed out in the pytorch issue : pytorch/pytorch#25393
they face a similar problem with their Windows CI

ChaiBapchya · 2020-03-29T04:25:58Z

@vexilligera as discussed offline, let's try testing locally for WIN_GPU and WIN_GPU_MKLDNN build 10 times each (since 1 run takes 20-30 mins) to come up with some basis... (ideally we would have tried 100 times, but given the resource & time constraints)

vexilligera · 2020-03-29T05:39:48Z

> @vexilligera as discussed offline, let's try testing locally for WIN_GPU and WIN_GPU_MKLDNN build 10 times each (since 1 run takes 20-30 mins) to come up with some basis... (ideally we would have tried 100 times, but given the resource & time constraints)

On my local test, the WIN_GPU_MKLDNN is much more flaky than WIN_GPU, as all WIN_GPU builds passed while about 1/3 of WIN_GPU_MKLDNN builds failed, based on my historical test data.

@ChaiBapchya suggests introducing a maximum retry number to circumvent this flaky issue as pytorch has done here pytorch/pytorch#35375
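A bounded retry of the kind suggested here could look roughly like the sketch below. It is an illustration only: `run_with_retries` and its parameters are hypothetical names, and it is not the mechanism used in pytorch/pytorch#35375 or in this PR.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_retries=3, delay_seconds=0.0):
    """Run `cmd` until it exits with code 0 or `max_retries` attempts are used up.

    Returns a tuple (succeeded, attempts_made).
    """
    for attempt in range(1, max_retries + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True, attempt
        if attempt < max_retries:
            # Optional pause before the next attempt.
            time.sleep(delay_seconds)
    return False, max_retries


if __name__ == "__main__":
    # Stand-in for a flaky build step: a child Python process that exits cleanly.
    print(run_with_retries([sys.executable, "-c", "pass"]))
```

If each build failed independently with a probability of about 1/3, as in the local WIN_GPU_MKLDNN runs mentioned above, three attempts would all fail with probability (1/3)^3, roughly 3.7%. Independence is an assumption here, and a retry only hides the flakiness rather than fixing it.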

@haojin2 suggests removing the WIN_GPU_MKLDNN test entirely, since MKLDNN doesn't make much sense when we are running on GPU, and the GPU_MKLDNN case is covered on other platforms.

ChaiBapchya · 2020-03-29T06:42:01Z

We can get rid of the WIN_GPU_MKLDNN tests altogether, but that still leaves us with the flakiness of WIN_GPU, as can be seen in these builds.
For roughly the same code of this PR and the same Windows AMI, below are the results so far:

| WIN_GPU | WIN_GPU_MKLDNN | Build Number | Link |
| --- | --- | --- | --- |
| ✖︎ | ✔︎ | 15 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/15/pipeline |
| ✖︎ | ✔︎ | 14 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/14/pipeline |
| ✔︎ | ✔︎ | 12 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/12/pipeline |
| ✔︎ | ✖︎ | 13 | http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/13/pipeline |

Of course, your local tests have a different story to tell...

vexilligera · 2020-03-29T08:17:38Z

@larroy do you have any idea how to solve the Win 126 error on CI? Basically the binaries aren't in the search path, and dlopen fails because of that.
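Windows error 126 is ERROR_MOD_NOT_FOUND ("The specified module could not be found"), which is also reported when a dependency of the requested DLL cannot be located. A small diagnostic such as the sketch below could show which `PATH` entries actually contain a given binary on the CI worker. `find_on_path` is a hypothetical helper and the DLL name in the usage example is only a placeholder.

```python
import os


def find_on_path(filename, path_value=None):
    """Return the directories listed in PATH (or `path_value`) that contain `filename`."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        # Skip empty entries that come from stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Placeholder name; substitute the DLL that fails to load.
    print(find_on_path("libmxnet.dll"))
```

On Windows with Python 3.8 or newer, dependencies of DLLs loaded through `ctypes` are no longer resolved via `PATH`, and `os.add_dll_directory` is the documented way to add a search directory. Whether that applies to the Python version on the CI workers is not stated in this thread.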

marcoabreu · 2020-03-29T09:36:25Z

I'll have a look at the logs when I'm at my computer

Just to be clear about two things:

Retries are fine for now, but must be removed before merging.
It is not an option to remove GPU mkldnn. It does make sense for cases where an operator is not supported by the GPU or the user wants to run some parts on CPU. Just because it breaks, it doesn't mean that we can remove it to ease our situation.

stdout output of the two print calls is missing in http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/26/nodes/108/steps/154/log/?start=0

leezu · 2020-04-02T18:12:23Z

Connectivity errors appear fixed now by switching to g4 (with faster cpu and network).

Removed the debug statements and rebased on latest master.

leezu · 2020-04-02T18:18:30Z

@marcoabreu Do you still stand by your veto?

"Retries are fine for now, but must be removed before merging."

I suggest going ahead and merging this with the retries. We don't have control over the thrust + VS code incompatibility and may need to wait for a fix on NVIDIA's or Microsoft's side. It doesn't seem worth blocking all MXNet development until the bug in "thrust + VS code" is fixed, given it's an open issue affecting other projects as well: pytorch/pytorch#25393

The issue is tracked on our side in #17935

leezu · 2020-04-02T18:31:50Z

CMakeLists.txt

@@ -163,6 +163,8 @@ if(MSVC)
  add_definitions(-DDMLC_STRICT_CXX11)
  add_definitions(-DNOMINMAX)
  set(CMAKE_C_FLAGS "/MP")
+  # report an accurate value for recent C++ language standards support
+  set(CMAKE_CXX_FLAGS "/Zc:__cplusplus")


@ChaiBapchya @vexilligera you added this line, but it doesn't have any effect as it is overwritten in the line below.

set(flag ${flag} "secondval")

does append right?
I thought we are appending to CMAKE_CXX_FLAGS

the typo!!!! CMAKE_CXX_FLAGS instead of CMAKE_C_FLAGS

My bad. I thought it would append...

marcoabreu

On one hand the argument is that the build has to be unblocked ASAP, but at the same time 700 changed lines are introduced.

I'd propose to rethink what a "hotfix" exactly is and come back with a minimal changeset which resolves the issue. THEN we can discuss other changes. But it's not okay to bring in a rushed refactor under the umbrella of fixing stuff.

marcoabreu · 2020-04-02T19:14:07Z

ci/build_windows.py

        if 'CUDA_PATH' not in os.environ:
-            os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v9.2"
+            os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"


What is your concern?

We didn't agree on increasing the minimum cuda version

We agreed on using VS 2019. Cuda 10 is a requirement for using VS 2019

Are you sure about that? Usually Visual Studio can be upgraded by installing the VC++ toolset, which then grants the possibility to compile other CUDA versions. I'm not aware that this is categorically excluded.

https://superuser.com/questions/1506044/installing-cuda-9-2-with-vs-2019

https://devblogs.microsoft.com/cppblog/cuda-10-1-available-now-with-support-for-latest-microsoft-visual-studio-2019-versions/

marcoabreu · 2020-04-02T19:14:21Z

ci/safe_docker_run.py

@@ -38,8 +38,9 @@

 from util import config_logging

-DOCKER_STOP_TIMEOUT_SECONDS = 3
+DOCKER_STOP_TIMEOUT_SECONDS = 10


Unrelated change

marcoabreu · 2020-04-02T19:15:49Z

ci/windows_dev_env/windows_deps_headless_installer.py

+
+def install_mkl():
+    logging.info("Installing MKL 2019.3.203...")
+    file_path = download("http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/15247/w_mkl_2019.3.203.exe")


Move links and similar things into constants

apache#17961
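One way to read the suggestion above is to keep download locations in a module-level table, in the same style as the dependency URLs shown earlier in this file. The sketch below is an assumption about how that could look, not the merged code: the constant names are hypothetical, and `download` is passed in as a parameter only so the snippet is self-contained.

```python
import logging

# Hypothetical constants; the URL is the one from the hunk under review.
MKL_VERSION = "2019.3.203"
DEPS = {
    "mkl": "http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/15247/w_mkl_2019.3.203.exe",
}


def install_mkl(download):
    """Fetch the MKL installer via `download(url)` and return the local file path."""
    logging.info("Installing MKL %s...", MKL_VERSION)
    file_path = download(DEPS["mkl"])
    return file_path


if __name__ == "__main__":
    # Stand-in downloader that only derives a file name instead of fetching anything.
    print(install_mkl(lambda url: url.rsplit("/", 1)[-1]))
```

With the URL in one table, updating a version means touching a single line, and reviewers can audit every external download in one place.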

marcoabreu · 2020-04-02T19:21:28Z

Connectivity errors were never an issue with Windows slaves. It sounds like you've changed something as part of generating the AMIs. Thus, I'm hesitant to just accept a bunch more changes (which have not been made transparent) to fix a previously created issue.

Please just revert stuff back to what it was and stop putting duct tape over things. The build used to work and the slaves used to be stable. But I have only seen attempts to change things and no root-causing (I know about the 2GB limit). Some action resulted in our build breaking, and the only sane choice is to revert whatever was done and then do these efforts again.

I will not accept multiple layers of fixes which could have been avoided if the changes were properly developed and tested in a separate environment. At the moment, it seems like open heart surgery on a productive system.

leezu · 2020-04-02T19:46:19Z

Discussion continued at #17962 (comment)

vexilligera · 2020-04-17T02:22:13Z

Closing since this is now fixed in #17962

leezu · 2020-04-17T03:10:57Z

@vexilligera it's not fixed. We need a script to generate the working AMI. #17962 is only a hotfix.

Will you be working on the automated setup or is someone else taking over the work? Please clarify. Thank you

vexilligera requested review from aaronmarkham, marcoabreu and szha as code owners March 11, 2020 03:04

leezu reviewed Mar 11, 2020

vexilligera changed the title ~~Windows dev environment configuration, update install instructions from source in the docs~~ [WIP] Windows dev environment configuration, update install instructions from source in the docs Mar 11, 2020

This was referenced Mar 21, 2020

Fix and optimize handling of vectorized memory accesses #17767

Merged

Use x64 toolchain for CI windows build #17912

Closed

leezu and others added 17 commits April 2, 2020 18:06

print to stderr as well

af0d910

stdout output of the two print calls is missing in http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/26/nodes/108/steps/154/log/?start=0

Drop 7.0

ef64105

Remove debug statements in load_lib

73437de

add 7.0 back to build both

19cc08b

print env vars in gpu ci

01e3faf

drop 7.0

2fef756

add path during test, print stuff

a786bfc

build_windows.py cleanup

f6f5a6a

print PATH

5d14df3

update print and add 7.0 back

9944a2e

drop 7.0 and print instance type

ca81f3a

print instance type

dddbf2a

add 7.0 again

3997316

drop 7.0 before the previous build is done

2054768

pip install -e python

ba1896e

Switch to g4 instance

e0b8a23

Remove debug statements

bac2b67

leezu force-pushed the win_build_gpu branch from c9eb500 to bac2b67 Compare April 2, 2020 18:11

leezu reviewed Apr 2, 2020

marcoabreu suggested changes Apr 2, 2020

leezu added 2 commits April 2, 2020 19:21

Fix CMakeLists

27cce4a

Workaround broken gluon dataloader test_dataloader_context test

12e8a6c

apache#17961

leezu mentioned this pull request Apr 2, 2020

Fix Windows GPU CI #17962

Merged

vexilligera closed this Apr 17, 2020
