Some multi-GPU unit tests hang when running in a different Docker environment #5963
Comments
How did you obtain the device ID? Dask assigns GPUs by setting CUDA_VISIBLE_DEVICES, so inside XGBoost we always use the first visible device, which is 0.
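To illustrate the point about device ordinals, here is a minimal sketch of how a process-local ordinal maps back to a physical GPU once CUDA_VISIBLE_DEVICES is set. The helper names and the example values are hypothetical and are not taken from XGBoost or Dask.

```python
import os


def visible_devices(env=None):
    """Return the entries of CUDA_VISIBLE_DEVICES, in order, as strings."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Entries may be indices or UUIDs, so they are kept as strings.
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def physical_id(ordinal, env=None):
    """Map a process-local ordinal (what the process sees) to the listed device.

    Raises IndexError if the ordinal is not visible to this process.
    """
    return visible_devices(env)[ordinal]


# Hypothetical per-worker environments: each worker sees a different first entry,
# so ordinal 0 refers to a different physical GPU in each worker.
worker_envs = [
    {"CUDA_VISIBLE_DEVICES": "0"},
    {"CUDA_VISIBLE_DEVICES": "1"},
]
for rank, env in enumerate(worker_envs):
    print(f"worker {rank}: ordinal 0 is physical GPU {physical_id(0, env)}")
```

Under this assumption, every worker can report device 0 while still running on a distinct GPU, so the ordinal alone does not say how many GPUs are in use.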
@trivialfis I used
Thanks. That's good to know. On the other hand, I have no idea as to why
So, given #5873 (comment), can we close this issue?
@trivialfis No, this issue is present in the master branch. You can try the commands yourself:
tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu
I found the error by setting env var
Commands:
tests/ci_build/ci_build.sh gpu_build docker --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON -DGPU_COMPUTE_VER=75
CI_DOCKER_EXTRA_PARAMS_INIT='-e DMLC_WORKER_STOP_PROCESS_ON_ERROR=false' \
  tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu
I turned on extra diagnostics from NCCL, per suggestion of pytorch/pytorch#20313.
Command:
CI_DOCKER_EXTRA_PARAMS_INIT='-e NCCL_DEBUG=INFO -e DMLC_WORKER_STOP_PROCESS_ON_ERROR=false' \
  tests/ci_build/ci_build.sh gpu nvidia-docker -it --build-arg CUDA_VERSION=10.2 \
  tests/ci_build/test_python.sh mgpu
Error log:
7635c25c81cd:41590:41835 [0] NCCL INFO Channel 00 : 3[1c0] -> 2[1b0] via direct shared memory
The By default, Docker allocates only 64 MB for
CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=2g' tests/ci_build/ci_build.sh gpu nvidia-docker -it \
  --build-arg CUDA_VERSION=10.2 tests/ci_build/test_python.sh mgpu
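Since the workaround above raises the container's shared memory with --shm-size, a quick pre-flight check inside the container can show whether the limit took effect. The sketch below assumes the shared-memory mount is /dev/shm (the usual location in Docker containers) and uses a 1 GiB threshold chosen only for illustration; the function name is hypothetical.

```python
import shutil


def has_enough_shm(path="/dev/shm", min_bytes=1 << 30):
    """Return True if the filesystem at `path` has at least `min_bytes` total capacity.

    Both the default path and the default threshold are assumptions for this
    sketch; adjust them to the environment being checked.
    """
    total = shutil.disk_usage(path).total
    return total >= min_bytes


if __name__ == "__main__":
    try:
        ok = has_enough_shm()
    except FileNotFoundError:
        # No /dev/shm on this system (for example, outside Linux).
        print("/dev/shm not found")
    else:
        print("shared memory looks sufficient" if ok else "shared memory is small; consider --shm-size")
```

Running this before the multi-GPU tests would turn a silent hang into an early, readable message when the container still has the 64 MB default.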
#5873 (comment)
Log from 4-process setup:
even though here we should have been using 4 GPUs. See #5963 (comment).
The undefined behavior exists in the following tests:
tests/python-gpu/test_gpu_with_dask.py::TestDistributedGPU::test_dask_array
tests/distributed/runtests-gpu.sh
The behavior is "undefined" in the sense that using a different Docker container causes the tests to fail, even though they were succeeding previously.
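Because the failure mode is a hang rather than an error, a wrapper with a timeout can make it show up as an explicit result. This is a minimal sketch using only the Python standard library; the function name, the timeout values, and the placeholder commands are hypothetical and are not part of the XGBoost CI scripts.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run `cmd` and classify the outcome as 'pass', 'fail', or 'hang'."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "hang"
    return "pass" if proc.returncode == 0 else "fail"


if __name__ == "__main__":
    # Placeholder commands standing in for a real test invocation.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(quick, 10))
    print(run_with_timeout(stuck, 1))
```

In practice the placeholder command would be replaced by the test invocation, with a timeout comfortably above the normal runtime.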
Passing (current CI setup):
Failing (the test just hangs):