
Added high-speed interconnect support by enabling UCX #87

Merged
leofang merged 23 commits into conda-forge:master from shankar1729:master on Apr 23, 2021

Conversation

@shankar1729
Contributor

@shankar1729 shankar1729 commented Apr 2, 2021

Checklist

Fixes #42
Fixes #38 (missing openib support). Instead of enabling ibverbs, which is marked as deprecated within openmpi anyway, enabling ucx support is much easier and cleaner as the ucx package is already available on conda-forge.

Adding ucx as a dependency (during build and run) automatically gets openmpi to include support for a number of high-speed interconnects (InfiniBand, Omni-Path, etc.). No change to the build scripts is needed beyond the dependency in meta.yaml!
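
For illustration only (a minimal sketch of the kind of change described, not the exact diff in this PR), adding ucx to the recipe's requirements in meta.yaml looks roughly like:

    # sketch: relevant fragment of meta.yaml with ucx added as a build-time and runtime dependency
    requirements:
      host:
        - ucx
      run:
        - ucx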

@conda-forge-linter

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@jakirkham
Member

Edited OP to note issues fixed. Hope that is ok :)

@jakirkham
Member

Also I think we need the --with-ucx flag in build.sh’s call to ./configure

@jakirkham
Member

Thanks Ravishankar! 😀

@shankar1729
Contributor Author

Great! Hope it's useful. OpenMPI's configure seems to enable UCX automatically if it finds the headers. So I didn't need to add --with-ucx since its headers and libs get pulled into the standard path in the build environment.
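
For reference, Open MPI's configure does accept an explicit --with-ucx flag; a sketch of what that call in build.sh could look like if one wanted the build to fail loudly when UCX is missing rather than rely on auto-detection (the flags shown are an assumption, not the feedstock's actual configure line):

    # sketch: force UCX support instead of relying on header auto-detection
    ./configure --prefix="$PREFIX" --with-ucx="$PREFIX"
    make -j"${CPU_COUNT}" install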

Member

@jakirkham jakirkham left a comment

CI failures are due to ucx only being available on Linux, so I'm suggesting we add selectors.

Currently we don't have pinning worked out in the ucx package, so I've included a pin to 1.9.0. Raised an issue (conda-forge/ucx-split-feedstock#88) about fixing this in the ucx package.
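
A sketch of what the suggested selector and pin could look like in meta.yaml (conda-forge line selectors in comments; illustrative only, not necessarily the exact lines applied here):

    requirements:
      host:
        - ucx 1.9.0.*  # [linux]
      run:
        - ucx 1.9.0.*  # [linux]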

Co-authored-by: jakirkham <jakirkham@gmail.com>
Member

@jakirkham jakirkham left a comment

As we are building this on other architectures with Linux, this constrains it to x86_64
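
Illustratively, tightening the selector as described would mean something like:

    - ucx 1.9.0.*  # [linux64]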

shankar1729 and others added 2 commits April 2, 2021 17:11
Co-authored-by: jakirkham <jakirkham@gmail.com>
Co-authored-by: jakirkham <jakirkham@gmail.com>
@shankar1729
Contributor Author

Seems fine to me: I'm new to conda packaging though, so I'm not yet familiar with how to test these properly.

@jakirkham
Member

No worries and thanks again for working on this 🙂

@leofang
Member

leofang commented Apr 4, 2021

@shankar1729 @jakirkham Thanks. I didn't know UCX is available on Conda-Forge. A couple questions:

  1. Is the ucx package smart enough to locate the necessary NIC drivers, modules, etc., in situations where they are installed in non-standard paths?
  2. Can we list ucx in "run_constrained" instead of "run"? This makes it optional (but it would be used if present). I am assuming ucx is dlopen'd at runtime so that we can do this. (A sketch of such an entry follows this list.)
  3. On CUDA awareness: I see that ucx is compiled using CUDA. Does it work for all CUDA versions? Would Open MPI's CUDA awareness be impacted if we go through ucx instead of smcuda? If UCX is installed, can we set its CUDA support off by default (as it'd look up the pointer attribute every time, which incurs extra overhead)? Do we still use the same MCA parameter opal_cuda_support to turn it on?
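
As a point of reference for question 2 (an assumption about what such an entry could look like, not something taken from this PR), a run_constrained entry in meta.yaml would be roughly:

    requirements:
      run_constrained:
        # constrains ucx only if the user installs it; does not pull it in by default
        - ucx >=1.9.0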

@shankar1729
Contributor Author

  1. I don't know about this one.
  2. I checked by force-removing ucx. It causes warnings during MPI startup about mca_osc_ucx and mca_pml_ucx due to missing libucp.so.0, but otherwise works fine with the alternate (low-speed) interfaces.
  3. OpenMPI automatically uses smcuda when applicable for multiple GPUs on a single node. As far as I understand, node-to-node GPU communication doesn't work in the present conda openmpi without ucx. All my GPUs are on one node, so I can't test this easily.

@leofang
Member

leofang commented Apr 6, 2021

I am testing this locally and can only get CUDA awareness to work by setting UCX_MEMTYPE_CACHE=n:

UCX_MEMTYPE_CACHE=n mpirun -n 4 ./mpi_hello_world_cuda

Seems to be documented here.

I am in favor of getting the ucx support in, but not enabling it by default. @shankar1729 Do you mind if I push to your branch? I'll make the necessary changes (setting default MCA parameters, adding user instructions, etc.) following the same treatment as for plain (non-UCX) CUDA awareness.
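
For example (a sketch of the opt-in mechanism only, not the exact defaults this PR ends up shipping), a user could enable the UCX components through Open MPI's standard MCA environment variables instead of passing --mca flags on every command line:

    # sketch: opt in to the UCX PML/OSC components via environment variables
    export OMPI_MCA_pml=ucx
    export OMPI_MCA_osc=ucx
    mpirun -n 4 ./mpi_hello_world_cuda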

Member

@leofang leofang left a comment

Mark "changes requested" to block this while we sort things out.

On CUDA awareness: I see that ucx is compiled using CUDA. Does it work for all CUDA versions?

I just realized this was already discussed elsewhere (conda-forge/ucx-split-feedstock#66 and I was involved): UCX is sensitive to CUDA versions. Does it mean Open MPI will now be sensitive to CUDA version as well (i.e., do we have to build it using several CUDA versions as done in UCX, from 10.2 to 11.2)?

@jakirkham
Member

Responded in the issue (conda-forge/ucx-split-feedstock#66), though it will require some more investigation.

Should add that I am in the middle of other work, so it might take a bit before I have the bandwidth to look into this.

@shankar1729
Contributor Author

@leofang Yes, please do use my branch to stage these changes.

Also, with respect to the CUDA builds of UCX, it seems the build recipe has been modified to support them, but the released versions on conda-forge are without CUDA.

Consequently, in my local tests I need to bypass the ucx pml when using GPU memory. I have always run my GPU codes with --mca pml ob1 when using openmpi (else I get a segfault in libucp.so), so ignore my response to point 3 above. With the currently released ucx package linked to openmpi, openmpi bypasses ucx because I specified this switch in my tests.

Best,
Shankar
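
Concretely, the bypass described above is just forcing the ob1 PML on the mpirun command line (test binary name is hypothetical):

    mpirun -n 2 --mca pml ob1 ./my_gpu_test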

leofang added 2 commits April 10, 2021 00:15
MNT: Re-rendered with conda-build 3.20.3, conda-smithy 3.10.0, and conda-forge-pinning 2021.04.09.17.43.25
leofang added 3 commits April 21, 2021 17:39
MNT: Re-rendered with conda-build 3.20.3, conda-smithy 3.10.0, and conda-forge-pinning 2021.04.21.17.39.23
@leofang
Member

leofang commented Apr 22, 2021

@conda-forge-admin, please restart ci

@leofang leofang marked this pull request as ready for review April 22, 2021 01:09
@isuruf
Member

isuruf commented Apr 22, 2021

Why does this need multiple builds? Can't this be one single build as it used to be?

@leofang
Member

leofang commented Apr 22, 2021

Why does this need multiple builds? Can't this be one single build as it used to be?

So for CUDA it needs two builds: 9.2 and 10.0+. This is because UCX changes its internal code for CUDA 10.0+; see the discussion around conda-forge/ucx-split-feedstock#66 (comment).

As for the additional pure-CPU build, my reasoning is that we use ucx-proc=*=gpu as one of the host dependencies, and I am not sure whether that could cause problems at runtime (say, because we build with GPU-enabled UCX but run with GPU-disabled UCX), so I wanted to stay on the safe side until conda-forge/ucx-split-feedstock#66 is fixed. But perhaps I worry too much.
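
For context, the host dependency being discussed would appear in meta.yaml roughly as follows (a sketch; the exact spelling in the recipe may differ):

    requirements:
      host:
        - ucx
        - ucx-proc=*=gpu  # selects the CUDA-enabled UCX variant at build time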

@isuruf
Member

isuruf commented Apr 22, 2021

This is because UCX changes its internal code for CUDA 10.0+; see the discussion around conda-forge/ucx-split-feedstock#66 (comment).

That is internal code and the public interface is the same, so I don't see any reason to have multiple builds for CUDA version.

As for the additional pure-CPU build, my reasoning is that we use ucx-proc=*=gpu as one of the host dependencies, and I am not sure whether that could cause problems at runtime (say, because we build with GPU-enabled UCX but run with GPU-disabled UCX), so I wanted to stay on the safe side until conda-forge/ucx-split-feedstock#66 is fixed. But perhaps I worry too much.

That just shows that the public interface is the same, so there's no need for openmpi to have a dependence on GPU enabled or disabled UCX.

@leofang
Member

leofang commented Apr 22, 2021

That is internal code and the public interface is the same, so I don't see any reason to have multiple builds for CUDA version.

I don't think it is true. The CUDA runtime must have the new cudaLaunchHostFunc API supported.

@isuruf
Member

isuruf commented Apr 22, 2021

I don't think it is true. The CUDA runtime must have the new cudaLaunchHostFunc API supported.

That's an internal detail of ucx and openmpi doesn't care.

@leofang
Member

leofang commented Apr 22, 2021

I don't think it is true. The CUDA runtime must have the new cudaLaunchHostFunc API supported.

That's an internal detail of ucx and openmpi doesn't care.

@jakirkham @pentschev I think @isuruf made a convincing argument here. I think we are guarded by ucx handling the CUDA dependency for us. I am under the impression that the handling of CUDA GPU buffers in Open MPI is mutually exclusive --- it goes solely through either ucx or smcuda, so it should not cause issues if at build time Open MPI sees a different CUDA version from what UCX was built with. I will restore the single build as Isuru suggested.

Copying @jsquyres in case we missed something 😅

MNT: Re-rendered with conda-build 3.20.3, conda-smithy 3.10.0, and conda-forge-pinning 2021.04.21.17.39.23
@leofang
Member

leofang commented Apr 22, 2021

I confirmed that for all ucx builds (CPU & GPU with different CUDA versions) the headers are identical, so ucx doesn't do any code-generation magic that writes build-time info (say, CUDA_VERSION) into its headers. We should be good.
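
One way such a check can be done (illustrative commands with hypothetical directory names, not the exact commands used):

    # extract two ucx variants into separate directories, then compare their installed headers
    diff -r ucx-cpu/include ucx-gpu-cuda11/include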

@pentschev

Yeah, I agree this is a UCX implementation detail, therefore the OpenMPI package doesn't need to worry about that.

@leofang
Member

leofang commented Apr 22, 2021

@conda-forge/openmpi this is ready!

@jsquyres

FYI -- the long-awaited DDT fix came in for v4.1.x last night (open-mpi/ompi#8837) -- we're making a new 4.1.1RC right now, and may well release 4.1.1 as early as tomorrow. Just in case this influences your thinking...

@leofang
Member

leofang commented Apr 22, 2021

Thanks for sharing the good news, Jeff! Yeah we will get this rebuild (of v4.1.0) in, and then build v4.1.1 afterwards so it continues to have ucx optionally enabled.

@leofang leofang self-assigned this Apr 22, 2021
@leofang leofang mentioned this pull request Apr 22, 2021
@leofang
Member

leofang commented Apr 23, 2021

Let's get it in and see how it goes! Thanks @shankar1729 @jakirkham @pentschev @isuruf @jsquyres!

@leofang leofang merged commit b1bb684 into conda-forge:master Apr 23, 2021
@leofang
Member

leofang commented Apr 23, 2021

I have verified locally with a very simple MPI code that the package is working. The following combinations were tested:

  1. CPU code, without UCX (mpirun -n 2 ./my_test)
  2. CPU code, with UCX (mpirun -n 2 --mca pml ucx --mca osc ucx ./my_test)
  3. GPU code, without UCX (mpirun -n 2 --mca opal_cuda_support 1 ./my_gpu_test)
  4. GPU code, with UCX (mpirun -n 2 --mca pml ucx --mca osc ucx ./my_gpu_test)

For 4, unlike the local build that I tested with, for some reason it just works out of the box without setting UCX_MEMTYPE_CACHE=n.

tl;dr: we are cool

Development

Successfully merging this pull request may close these issues.

Allow extra configuration arguments
Missing openib support

7 participants