
Simplify UCX configs, permitting UCX_TLS=all #792

Merged 11 commits into rapidsai:branch-22.02 on Nov 29, 2021

Conversation

pentschev (Member)

Up until now, we required users to specify which transports UCX should use, pushing the configuration burden onto the user and making the setup error-prone. We can now reduce this burden to a single configuration option added in dask/distributed#5526: DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT/distributed.comm.ucx.create_cuda_context, which creates the CUDA context before UCX is initialized.
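For reference, the same option can also be set programmatically through Dask's configuration system. The snippet below is a minimal sketch, assuming the key accepts the spelling used above:

# Sketch: enable the option from Python instead of the environment variable.
import dask

dask.config.set({"distributed.comm.ucx.create_cuda_context": True})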

This is an example of how to set up a cluster with dask-cuda-worker after this change:

# Scheduler
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True dask-scheduler --protocol ucx --interface ib0

# Workers
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda dask-cuda-worker ucx://${SCHEDULER_IB0_IP}:8786 --interface ib0 --rmm-pool-size 29GiB

# Client
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True python client.py

Similarly, one can set up LocalCUDACluster(protocol="ucx", interface="ib0").
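For reference, a minimal Python sketch of that LocalCUDACluster setup, mirroring the commands above (the RMM pool size is just the illustrative value from the worker command):

# Sketch: LocalCUDACluster equivalent of the scheduler/worker commands above.
from dask_cuda import LocalCUDACluster
from distributed import Client

cluster = LocalCUDACluster(
    protocol="ucx",         # communicate over UCX
    interface="ib0",        # bind listeners to the InfiniBand interface
    rmm_pool_size="29GiB",  # optional per-worker RMM pool, as in the example above
)
client = Client(cluster)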

Note above how ib0 is intentionally specified. That is mandatory for using RDMACM, since listeners must bind to an InfiniBand interface, but the interface can be left unspecified on systems without InfiniBand or when RDMACM isn't required (leaving it out is discouraged on systems that do have InfiniBand connectivity). The UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda option is specified for optimal InfiniBand performance with CUDA; it will become the default in UCX 1.12, at which point specifying it won't be necessary anymore.

Changes introduced here are backwards-compatible, meaning old options such as --enable-nvlink/enable_nvlink=True are still valid. However, if any of those options is specified, the user is responsible for enabling/disabling all desired transports, which can also be useful for benchmarking specific transports.
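For comparison, below is a sketch of that explicit-transport style, which remains valid after this change; the keyword arguments mirror the existing CLI flags, and every desired transport must be listed by the user:

# Sketch: backwards-compatible explicit-transport configuration.
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    protocol="ucx",
    enable_tcp_over_ucx=True,  # TCP over UCX
    enable_nvlink=True,        # NVLink for intra-node GPU transfers
    enable_infiniband=True,    # InfiniBand transport
    enable_rdmacm=True,        # RDMA connection manager, requires an InfiniBand interface
    interface="ib0",
)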

Finally, UCX may not require a pre-created CUDA context in the future, at which point it will be possible to remove DASK_DISTRIBUTED__COMM__UCX__CREATE_CUDA_CONTEXT=True from the scheduler/client processes entirely.

@github-actions github-actions bot added the python (python code needed) label Nov 18, 2021
@pentschev pentschev changed the base branch from branch-21.12 to branch-22.02 November 18, 2021 22:36
@pentschev pentschev added the 2 - In Progress (Currently a work in progress), feature request (New feature or request), and non-breaking (Non-breaking change) labels Nov 18, 2021

@codecov-commenter commented Nov 18, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.02@f1b0e27).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.02     #792   +/-   ##
===============================================
  Coverage                ?   89.30%           
===============================================
  Files                   ?       16           
  Lines                   ?     2057           
  Branches                ?        0           
===============================================
  Hits                    ?     1837           
  Misses                  ?      220           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1b0e27...ad669fc.

@github-actions github-actions bot added the doc (Documentation) label Nov 19, 2021
@pentschev pentschev removed the doc (Documentation) label Nov 19, 2021
@pentschev pentschev marked this pull request as ready for review November 19, 2021 20:42
@pentschev pentschev requested a review from a team as a code owner November 19, 2021 20:42
@pentschev (Member Author)

rerun tests

@pentschev (Member Author)

Since this is a somewhat large change, I'm targeting 22.02 to avoid any major breakage at this time. If anyone feels differently and thinks this is a great feature to have in 21.12 that we shouldn't wait for, please let me know.

@madsbk madsbk (Member) left a comment

Nice work @pentschev

pentschev added a commit to pentschev/dask-cuda that referenced this pull request Nov 22, 2021
This is necessary to fill the missing create_cuda_context configuration
option recently added to Distributed. In
rapidsai#792 it will be used to
simplify UCX configuration.
pentschev added a commit to pentschev/dask-cuda that referenced this pull request Nov 22, 2021
This is necessary to fill the missing create_cuda_context configuration
option recently added in Distributed. In
rapidsai#792 it will be used to
simplify UCX configuration.
rapids-bot bot pushed a commit that referenced this pull request Nov 22, 2021
This is necessary to fill the missing create_cuda_context configuration option recently added in Distributed. In
#792 it will be used to simplify UCX configuration.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #801
@github-actions github-actions bot added the doc (Documentation) label Nov 22, 2021
@pentschev pentschev removed the doc (Documentation) label Nov 22, 2021
@pentschev (Member Author)

rerun tests

@pentschev (Member Author)

Thanks @madsbk for the review here!

@pentschev (Member Author)

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e4a7754 into rapidsai:branch-22.02 Nov 29, 2021
@jakirkham (Member)

Thanks Peter! 😄
