Conversation

ovidiusm
Contributor

@ovidiusm ovidiusm commented Oct 10, 2025

What?

The ibverbs dev packages are broken in the CUDA base images (missing symlinks).
Reinstalling them directly from the base image is not possible because the deb files have been deleted (probably to shrink the image).
Reinstalling them from the Ubuntu repository is not always possible or desirable because the Ubuntu versions are older.
The DOCA repository contains the deb packages, but it must be added before the reinstallation is attempted, not after. The current scripts add it too late.
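
To make the "missing symlinks" symptom concrete, here is a minimal sketch of a check that could be run inside an image. The helper name `check_dev_symlink` and the example path are assumptions for illustration; they are not taken from this PR's scripts.

```shell
# Hypothetical helper (not part of this PR): report whether a dev-package
# symlink is usable. Returns 0 if the path exists and resolves to an
# existing target, 1 if it is missing or dangling.
check_dev_symlink() {
    link_path="$1"
    # -e follows symlinks, so it is false for a dangling link.
    if [ -e "$link_path" ]; then
        echo "ok: $link_path"
        return 0
    fi
    if [ -L "$link_path" ]; then
        echo "dangling symlink: $link_path"
    else
        echo "missing: $link_path"
    fi
    return 1
}

# Example call; the library directory is an assumption for x86_64 Ubuntu images.
# check_dev_symlink /usr/lib/x86_64-linux-gnu/libibverbs.so
```

A non-zero return for the unversioned `.so` path would be consistent with the broken dev package described above, since the linker looks for that name at build time.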

Why?

We keep seeing cases where the ibverbs installation is not functional. This has caused problems in both internal and external projects.

How?

Updated procedure to prepare the environment:

  1. Install Ubuntu build dependencies
  2. Add the DOCA repository, then run apt update and apt upgrade. This upgrades the ibverbs packages to the DOCA version on images where the versions differ, and does nothing where they are identical.
  3. Force a reinstallation of the ibverbs packages, for images where the versions were identical (so the upgrade had no effect).
  4. Start downloading and building other dependencies from source.
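
The ordering of the four steps above can be sketched as follows. This is an illustration only, not the PR's actual scripts: the package lists are placeholders, and `add_doca_repository` and `build_source_dependencies` are hypothetical stubs, because the repository setup details are not given in this description. By default the sketch runs in dry-run mode and only prints the commands.

```shell
#!/bin/sh
# Sketch of the step order only. DRY_RUN=1 (the default here) prints each
# command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Placeholder package lists: assumptions for illustration, not the PR's lists.
BUILD_DEPS="build-essential cmake"
IBVERBS_PKGS="libibverbs-dev ibverbs-providers librdmacm-dev"

# Hypothetical stubs: the real repository setup and source builds are not
# described here, so these fail loudly if actually executed.
add_doca_repository() { echo "stub: register the DOCA apt repository" >&2; return 1; }
build_source_dependencies() { echo "stub: build other dependencies" >&2; return 1; }

prepare_env() {
    # 1. Ubuntu build dependencies.
    run apt-get update
    run apt-get install -y $BUILD_DEPS
    # 2. DOCA repository first, then update and upgrade.
    run add_doca_repository
    run apt-get update
    run apt-get upgrade -y
    # 3. Forced reinstall, for images where the upgrade changed nothing.
    run apt-get install -y --reinstall $IBVERBS_PKGS
    # 4. Everything built from source comes last.
    run build_source_dependencies
}
```

Calling `prepare_env` with the default `DRY_RUN=1` prints the seven commands in order. The point of the sketch is that the repository is registered before the `--reinstall` step, so apt can fetch the deb files that are no longer present in the image.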

The same procedure should be used on both CUDA 12.x and 13.x images.

Also added a version check for the GPUNETIO plugin, since it only supports CUDA 12.8.x and 12.9.x. A newer plugin version with support for 13.x will be added separately. The check is also needed to avoid building the plugin on the CUDA 12.6 image used in one of the CI tests.
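
A version gate of this kind could look roughly like the sketch below. The function name `gpunetio_supported` is hypothetical, and reading the version from a `CUDA_VERSION` environment variable is an assumption; the actual check in this PR may be implemented differently (for example in the build system rather than in shell).

```shell
# Hypothetical helper: returns 0 when the given CUDA version is 12.8.x or
# 12.9.x, and 1 for anything else (including an empty string).
gpunetio_supported() {
    case "$1" in
        12.8|12.8.*|12.9|12.9.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Assumes the CUDA version is exposed as CUDA_VERSION in the build environment.
if gpunetio_supported "${CUDA_VERSION:-}"; then
    echo "building GPUNETIO plugin"
else
    echo "skipping GPUNETIO plugin (CUDA ${CUDA_VERSION:-unknown})"
fi
```

With this pattern, a 12.6 or 13.x image skips the plugin instead of failing the build.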


👋 Hi ovidiusm! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@ovidiusm
Contributor Author

/build

Signed-off-by: Ovidiu Mara <[email protected]>
@ovidiusm
Contributor Author

/build

@ovidiusm
Contributor Author

/build

@ovidiusm ovidiusm force-pushed the nixl-fix-build-cuda-13 branch from 85292aa to 40648a6 Compare October 10, 2025 19:48
@ovidiusm
Contributor Author

/build

Signed-off-by: Ovidiu Mara <[email protected]>
@ovidiusm
Contributor Author

/build

@ovidiusm ovidiusm marked this pull request as ready for review October 10, 2025 21:32
@ovidiusm ovidiusm requested review from a team, aranadive and brminich as code owners October 10, 2025 21:32
@ovidiusm ovidiusm requested a review from mkhazraee October 10, 2025 21:35
@aranadive aranadive merged commit 51a11eb into ai-dynamo:main Oct 13, 2025
21 checks passed
@ovidiusm ovidiusm deleted the nixl-fix-build-cuda-13 branch October 13, 2025 07:34