[WIP] Windows dev environment configuration, update install instructions from source in the docs #17808
'openblas': 'https://windows-post-install.s3-us-west-2.amazonaws.com/OpenBLAS-windows-v0_2_19.zip',
'opencv': 'https://windows-post-install.s3-us-west-2.amazonaws.com/opencv-windows-4.1.2-vc14_vc15.zip',
'cudnn': 'https://windows-post-install.s3-us-west-2.amazonaws.com/cudnn-9.2-windows10-x64-v7.4.2.24.zip',
'nvdriver': 'https://windows-post-install.s3-us-west-2.amazonaws.com/nvidia_display_drivers_398.75_server2016.zip',
Who owns the windows-post-install bucket?
I currently don't have access to windows-post-install, so I opened another bucket (mxnet-windows-build). It would be easy to merge them.
mxnet-ci-dev
logging.info("Perl install complete")


def install_clang():
Why do we need clang when compiling with visual studio?
They probably need to support tvm op build on windows as well, which may need this.
logging.info("Visual studio install complete.")


def install_cmake():
Unused
``` | ||
cmake -G "Visual Studio 15 2017 Win64" -T cuda=9.2,host=x64 -DUSE_CUDA=1 -DUSE_CUDNN=1 -DUSE_NVRTC=1 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_BLAS=open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_LIST=Common -DCUDA_TOOLSET=9.2 -DCUDNN_INCLUDE=C:\cuda\include -DCUDNN_LIBRARY=C:\cuda\lib\x64\cudnn.lib "C:\incubator-mxnet"
.\setup.ps1
Is it feasible to provide short non-blackbox instructions to install MXNet?
For example, if users have a standard installation of visual studio, can the installation experience be pretty similar to https://mxnet.apache.org/get_started/ubuntu_setup#build-the-mxnet-shared-library-from-source Step 2 and Step 3?
Yes, we can explicitly use the cmd command invoked from ci/build_windows.py
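A minimal sketch of what explicit, non-blackbox instructions could wrap, assuming the flags from the `cmake` line quoted in the diff above. The helper name `cmake_configure_command` and its parameters are hypothetical; this is not the actual content of `ci/build_windows.py`.

```python
import subprocess


def cmake_configure_command(source_dir, cuda_version="9.2", cudnn_root=r"C:\cuda"):
    """Return the CMake configure command as an argument list (hypothetical helper)."""
    return [
        "cmake",
        "-G", "Visual Studio 15 2017 Win64",
        "-T", f"cuda={cuda_version},host=x64",
        "-DUSE_CUDA=1",
        "-DUSE_CUDNN=1",
        "-DUSE_NVRTC=1",
        "-DUSE_OPENCV=1",
        "-DUSE_OPENMP=1",
        "-DUSE_BLAS=open",
        "-DUSE_LAPACK=1",
        "-DUSE_DIST_KVSTORE=0",
        "-DCUDA_ARCH_LIST=Common",
        f"-DCUDA_TOOLSET={cuda_version}",
        f"-DCUDNN_INCLUDE={cudnn_root}\\include",
        f"-DCUDNN_LIBRARY={cudnn_root}\\lib\\x64\\cudnn.lib",
        source_dir,
    ]


if __name__ == "__main__":
    # Print the command line so it can be copied into a prompt or the docs.
    print(subprocess.list2cmdline(cmake_configure_command(r"C:\incubator-mxnet")))
```

Keeping the flags in one list would let the docs and the CI script share a single source for the command, which is one way to address the blackbox concern.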
Let me know when this is ready, and I'll test it out.
@vexilligera what is the status of this PR?
I'm trying to update the AMI to VS2019 and CUDA 10.2 and there's a bug with dmlc-core, probably caused by different MSVC standards. I'm working on that. If it's an issue with MSVC then I will try to use Clang-cl for building.
I may also need to upload the VS2019 web installer as in
Updated links for cudnn & VS2019 [private bucket: restricted access]
@mxnet-bot run ci [windows-cpu, windows-gpu]
Jenkins CI successfully triggered : [windows-cpu, windows-gpu]
retriggering.. windows
It's failing only for windows-cpu MKLDNN MKL part with the error
@larroy is this related to the path not being on one line, or is the path itself incorrect?
Turns out the path mentioned in ci/build_windows.py is correct
That is exactly where the bat file is (verified by launching an instance using the updated Windows AMI). However, the path "C:\jenkins_slave\workspace\build-cpu-mkldnn-mkl" doesn't exist on the AMI. Joe fixed it and updated the AMI; testing on that now.
WIN_GPU_MKLDNN is having a flaky issue similar to pytorch/pytorch#25393. We also need to update 3rdparty/dmlc-core to the latest to support VS2019.
@mxnet-bot run ci [windows-gpu]
Jenkins CI successfully triggered : [windows-gpu]
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered : [unix-gpu]
@mxnet-bot run ci [windows-gpu]
Jenkins CI successfully triggered : [windows-gpu]
@marcoabreu Even in this case, @vexilligera is unable to reproduce the error locally, and WIN_GPU and WIN_GPU_MKLDNN are flaky. How do you suggest we proceed with resolving the thrust issue on CI? Failure: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17808/13/pipeline As pointed out in the pytorch issue: pytorch/pytorch#25393
@vexilligera as discussed offline, let's try running the WIN_GPU and WIN_GPU_MKLDNN builds locally 10 times each (since one run takes 20-30 minutes) to establish a baseline. Ideally we would have tried 100 times, but resource and time constraints rule that out.
In my local tests, WIN_GPU_MKLDNN is much flakier than WIN_GPU: all WIN_GPU builds passed, while about 1/3 of WIN_GPU_MKLDNN builds failed, based on my historical test data. @ChaiBapchya suggests introducing a maximum retry count to work around this flakiness, as pytorch has done in pytorch/pytorch#35375. @haojin2 suggests removing the WIN_GPU_MKLDNN test entirely, since MKLDNN doesn't make much sense when we are running on GPU, and the GPU_MKLDNN case is covered on other platforms.
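The maximum-retry idea could be sketched roughly as follows. This is an illustrative assumption, not the change made in this PR or in pytorch/pytorch#35375; the name `run_with_retries` and its parameters are hypothetical.

```python
import logging
import time


def run_with_retries(build_step, max_retries=3, delay_seconds=0.0):
    """Call build_step() until it succeeds or max_retries attempts are used up."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return build_step()
        except Exception as error:  # a real script would catch a narrower exception type
            last_error = error
            logging.warning("Attempt %d/%d failed: %s", attempt, max_retries, error)
            if attempt < max_retries:
                time.sleep(delay_seconds)
    # All attempts failed: surface the last underlying error as the cause.
    raise RuntimeError(f"Build failed after {max_retries} attempts") from last_error
```

Note that a retry only masks the flakiness: with a failure rate of about 1/3 per run, three attempts would still all fail in roughly 1 in 27 cases, assuming the failures are independent.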
We can get rid of the WIN_GPU_MKLDNN tests altogether, but that still leaves us with the flakiness of WIN_GPU, as can be seen in these builds. Of course, your local tests have a different story to tell...
@larroy do you have any idea how to solve Win 126 on CI? Basically the binaries aren't in the search path, so dlopen fails.
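For context, Windows error 126 is `ERROR_MOD_NOT_FOUND`, which typically means a DLL or one of its dependencies could not be located. A minimal sketch of one possible workaround, assuming the missing binaries live in known directories: the helper name `add_dll_search_dirs` is hypothetical and this is not the fix applied on CI.

```python
import os


def add_dll_search_dirs(directories, environ=None):
    """Make directories visible to the DLL loader (hypothetical helper).

    Prepends the directories to PATH and, where available, also registers
    them through os.add_dll_directory (Windows, Python 3.8+), because newer
    Python versions no longer consult PATH for dependent DLLs.
    """
    environ = os.environ if environ is None else environ
    existing = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    new_entries = [d for d in directories if d not in existing]
    environ["PATH"] = os.pathsep.join(new_entries + existing)

    if hasattr(os, "add_dll_directory"):
        for directory in directories:
            if os.path.isdir(directory):
                os.add_dll_directory(os.path.abspath(directory))
    return environ["PATH"]
```

It would be called before the library is loaded, with the directories that hold the dependent DLLs, for example the CUDA or OpenBLAS binary directories on the build machine.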
I'll have a look at the logs when I'm at my computer.
Just to be clear about two things:
stdout output of the two print calls is missing in http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/26/nodes/108/steps/154/log/?start=0
Connectivity errors appear fixed now by switching to g4 (with faster CPU and network). Removed the debug statements and rebased on latest master.
@marcoabreu Do you still stand by your veto?
I suggest we go ahead and merge this with the retries. We don't have control over the thrust + VS code incompatibility and may need to wait for a fix on NVIDIA's or Microsoft's side. It doesn't seem worthwhile to block all MXNet development until the bug in "thrust + VS code" is fixed, given that it's an open issue affecting other projects as well: pytorch/pytorch#25393. The issue is tracked on our side in #17935.
CMakeLists.txt
@@ -163,6 +163,8 @@ if(MSVC)
add_definitions(-DDMLC_STRICT_CXX11)
add_definitions(-DNOMINMAX)
set(CMAKE_C_FLAGS "/MP")
# report an accurate value for recent C++ language standards support
set(CMAKE_CXX_FLAGS "/Zc:__cplusplus")
@ChaiBapchya @vexilligera you added this line, but it doesn't have any effect as it is overwritten in the line below
set(flag ${flag} "secondval")
does append right?
I thought we are appending to CMAKE_CXX_FLAGS
the typo!!!! CMAKE_CXX_FLAGS
instead of CMAKE_C_FLAGS
My bad. I thought it would append...
On the one hand, the argument is that the build has to be unblocked ASAP, but at the same time 700 changed lines are introduced.
I'd propose to rethink what a "hotfix" exactly is and come back with a minimal changeset which resolves the issue. THEN we can discuss other changes. But it's not okay to bring in a rushed refactor under the umbrella of fixing stuff.
if 'CUDA_PATH' not in os.environ:
os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v9.2"
os.environ["CUDA_PATH"] = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2"
Same here
What is your concern?
We didn't agree on increasing the minimum cuda version
We agreed on using VS 2019. Cuda 10 is a requirement for using VS 2019
Are you sure about that? Usually Visual Studio can be upgraded by installing the VC++ toolset, which then makes it possible to compile against other CUDA versions. I'm not aware that this is categorically excluded.
@@ -38,8 +38,9 @@

from util import config_logging

DOCKER_STOP_TIMEOUT_SECONDS = 3
DOCKER_STOP_TIMEOUT_SECONDS = 10
Unrelated change
def install_mkl():
logging.info("Installing MKL 2019.3.203...")
file_path = download("http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/15247/w_mkl_2019.3.203.exe")
Move links and similar things into constants
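A rough sketch of what this suggestion might look like, using the URL and version string from the diff above. The constant names are hypothetical, and `download` is passed in as a parameter here only because its definition is not shown in this thread.

```python
import logging

# Module-level constants: one place to update when the installer changes.
MKL_VERSION = "2019.3.203"
MKL_INSTALLER_URL = (
    "http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/15247/"
    "w_mkl_2019.3.203.exe"
)


def install_mkl(download):
    """Fetch the MKL installer via the supplied download callable."""
    logging.info("Installing MKL %s...", MKL_VERSION)
    file_path = download(MKL_INSTALLER_URL)
    return file_path
```

With the URL in a constant, a version bump becomes a change to the constants rather than an edit inside the function body.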
Connectivity errors were never an issue with Windows slaves. It sounds like you've changed something as part of generating the AMIs. Thus, I'm hesitant to just accept a bunch more changes (which have not been made transparent) to fix a previously created issue. Please just revert things to what they were and stop putting duct tape over things. The build used to work and the slaves used to be stable. But I have only seen attempts to change things and no root-causing (I know about the 2GB limit). Some action resulted in our build breaking, and the only sane choice is to revert whatever was done and then make these efforts again. I will not accept multiple layers of fixes which could have been avoided if the changes had been properly developed and tested in a separate environment. At the moment, it seems like open-heart surgery on a production system.
Discussion continued at #17962 (comment)
Closing since this is now fixed in #17962
@vexilligera it's not fixed. We need a script to generate the working AMI. #17962 is only a hotfix. Will you be working on the automated setup or is someone else taking over the work? Please clarify. Thank you
Description
Taking over #17206 with fix to #17635, update toolchains to vs2019 and cuda 10.2