NIXL EP wheels: Add support for multiple CUDA and PyTorch versions#1646
NIXL EP wheels: Add support for multiple CUDA and PyTorch versions#1646ovidiusm wants to merge 27 commits into
Conversation
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
|
👋 Hi ovidiusm! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
|
👀 Investigating |
|
🤖 CI Triage Agent — I have everything needed to produce a complete diagnosis. Here is the full breakdown: Summary: Pre-commit hooks ( Root cause: Two independent problems detected by pre-commit: 1.
|
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
|
/build |
…he code Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
|
/build |
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: ovidiusm <ovidium@nvidia.com>
|
/build |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@contrib/build-wheel.sh`:
- Around line 159-169: The probe and installer flags currently use
--extra-index-url which leaves PyPI as the primary index and can pull non-CUDA
wheels; update the torch resolution in torch_classify(), torch_uv_flags(), and
build_wheel to pin CUDA indexes as the primary --index-url: for the stable
probe/branch use --index-url "$TORCH_STABLE_INDEX" (remove index-strategy), for
nightly use --index-url "$TORCH_NIGHTLY_INDEX" and keep --extra-index-url
"$TORCH_STABLE_INDEX" as a fallback, and split the build-deps vs torch install
steps so build deps are installed from PyPI without torch index flags while
torch itself is installed with the CUDA-pinned --index-url. Ensure the commands
invoking uv pip install use the named functions torch_classify(),
torch_uv_flags(), and build_wheel locations when applying these changes.
In `@examples/device/ep/nixl_ep/__init__.py`:
- Around line 26-30: Detect the torch minor version string currently built into
_torch_mm and validate it against an explicit supported set before calling
importlib.import_module; if it is not supported, raise an ImportError that
includes the full detected torch.__version__ and a list of supported minor
versions. Specifically, before calling
importlib.import_module(f".nixl_ep_cpp_torch{_torch_mm}", __package__), check
_torch_mm against the allowed values (e.g., a tuple/list you define), and if
absent raise ImportError with a clear message naming torch.__version__ and the
supported minors; otherwise proceed to import and set
sys.modules[f"{__package__}.nixl_ep_cpp"] = _nixl_ep_cpp as before.
In `@src/bindings/python/nixl-meta/nixl_ep/__init__.py`:
- Around line 44-50: The loop that tries backends using
importlib.import_module(mod_name) currently only catches ModuleNotFoundError, so
import-time failures like ImportError or OSError will propagate and stop
fallback; change the except to catch (ModuleNotFoundError, ImportError, OSError)
as e, and keep the existing special-case check for ModuleNotFoundError (if
e.name != mod_name: raise) but for ImportError/OSError simply continue to the
next mod_name; ensure the successful path still returns
importlib.import_module(mod_name).__name__ inside the try block.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 77bcd126-f819-48ff-ae85-b8b81fc126ca
📒 Files selected for processing (11)
contrib/Dockerfile.manylinuxcontrib/build-wheel.shcontrib/wheel_merge.pyexamples/device/ep/meson.buildexamples/device/ep/nixl_ep/__init__.pypyproject.tomlsrc/bindings/python/nixl-meta/meson.buildsrc/bindings/python/nixl-meta/nixl/meson.buildsrc/bindings/python/nixl-meta/nixl_ep/__init__.pysrc/bindings/python/nixl-meta/nixl_ep/meson.buildsrc/bindings/python/nixl-meta/pyproject.toml.in
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bindings/python/nixl-meta/nixl_ep/__init__.py`:
- Around line 1-17: There are two top-level packages named nixl_ep (this
__init__.py and examples/device/ep/nixl_ep) causing mypy's "Duplicate module"
error; fix by either renaming this package (e.g., to nixl_ep_meta) and updating
any imports referencing nixl_ep, or keep the name and disambiguate CI by
excluding the example package from type-checking (add an exclude or files/paths
entry in mypy.ini to ignore examples/device/ep/nixl_ep); locate the module in
__init__.py (nixl_ep) and the example package path (examples/device/ep/nixl_ep)
when applying the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: ead6418f-f2b7-4879-8526-9378403ce77e
📒 Files selected for processing (3)
src/bindings/python/nixl-meta/nixl/meson.buildsrc/bindings/python/nixl-meta/nixl_ep/__init__.pysrc/bindings/python/nixl-meta/nixl_ep/meson.build
…comments Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
|
👀 Investigating |
|
🤖 CI Triage Agent — I have everything I need. The diagnosis is complete and unambiguous. Summary: Root cause: Commit All other hooks (mypy, isort, black, flake8 — skipped as no Python files touched; case conflicts, shebangs, merge conflicts, line endings, trailing whitespace — all Passed). Only Implicated commit: File: Suggested fix: On line 230 of - # This doesnt require CUDA 13 ...
+ # This doesn't require CUDA 13 ...After applying the fix, re-run Related: PR #1646 — NIXL EP wheels: Add support for multiple CUDA and PyTorch versions |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
contrib/build-wheel.sh (1)
153-158:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPin the nightly probe to the nightly CUDA index.
This branch still leaves PyPI as the primary index by using only
--extra-index-url. That meanstorch_classify()can mark a version asnightlybased on a non-CUDA/PyPI match, theninstall_torch()fails later because it installs from the nightly CUDA index only.Suggested fix
- elif uv pip install --dry-run --pre \ - --python "$PROBE/bin/python" \ - --extra-index-url "$TORCH_STABLE_INDEX" \ - --extra-index-url "$TORCH_NIGHTLY_INDEX" \ - --index-strategy unsafe-best-match \ + elif uv pip install --dry-run --pre \ + --python "$PROBE/bin/python" \ + --index-url "$TORCH_NIGHTLY_INDEX" \ + --extra-index-url "$TORCH_STABLE_INDEX" \ "torch==${VER}.*" >/dev/null 2>&1; then CLASS="nightly"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@contrib/build-wheel.sh` around lines 153 - 158, The nightly probe currently leaves PyPI as the primary index by using --extra-index-url for $TORCH_NIGHTLY_INDEX, which can misclassify a version as nightly; update the pip dry-run probe in build-wheel.sh (the `elif` branch that runs `pip install --dry-run ... "torch==${VER}.*"`) to use --index-url "$TORCH_NIGHTLY_INDEX" (make the nightly CUDA index the primary index) instead of --extra-index-url for the nightly index so the probe checks the correct source; this change ensures `torch_classify()` matches the same index that `install_torch()` will use.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@contrib/build-wheel.sh`:
- Around line 229-230: Fix the typo in the comment string that currently reads
"doesnt" to include the missing apostrophe ("doesn't") so codespell no longer
fails; update the comment in the contrib/build-wheel.sh diff block (the comment
beginning "torch + nvidia-* in each venv is several GB; tear down so the docker
layer doesnt blow up across the (python, torch) matrix.") to use "doesn't".
- Around line 71-75: The new CLI flag --torch-versions is parsed into
TORCH_VERSIONS in the case block but not documented in the script's help/usage
output; update the usage/help text (the function or block that prints the CLI
usage) to include a short description of --torch-versions and its expected
value, matching the style of the other options so users see the flag when
running --help; ensure the option name and variable TORCH_VERSIONS are
referenced so the docs stay consistent with the parser.
---
Duplicate comments:
In `@contrib/build-wheel.sh`:
- Around line 153-158: The nightly probe currently leaves PyPI as the primary
index by using --extra-index-url for $TORCH_NIGHTLY_INDEX, which can misclassify
a version as nightly; update the pip dry-run probe in build-wheel.sh (the `elif`
branch that runs `pip install --dry-run ... "torch==${VER}.*"`) to use
--index-url "$TORCH_NIGHTLY_INDEX" (make the nightly CUDA index the primary
index) instead of --extra-index-url for the nightly index so the probe checks
the correct source; this change ensures `torch_classify()` matches the same
index that `install_torch()` will use.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 44cbbc54-f86e-4ccc-b5d2-984f558c06d1
📒 Files selected for processing (1)
contrib/build-wheel.sh
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@contrib/build-wheel.sh`:
- Line 272: The assignment BASE_WHL=$(ls "$TMP_DIR"/dist/*.whl) is fragile
because it parses ls output and can yield multiple newline-separated names;
replace it with glob expansion and explicit handling of matches (e.g., use an
array like files=( "$TMP_DIR"/dist/*.whl ) or set -- "$TMP_DIR"/dist/*.whl, then
validate count and pick the intended entry) and update any downstream uses
(e.g., cp "$BASE_WHL") to use the chosen single path; modify the same pattern at
the other occurrence around the earlier noted line (the second glob usage) so
both BASE_WHL and the related variable(s) come from safe shell globbing and
include a check for zero or multiple matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bc6364c8-24a7-4d99-8f74-de7d541ca4a2
📒 Files selected for processing (1)
contrib/build-wheel.sh
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@contrib/build-wheel.sh`:
- Around line 239-240: The shell commands calling auditwheel repair and
./contrib/wheel_add_ucx_plugins.py should quote directory/plateform variables to
prevent word-splitting: update the auditwheel repair invocation to quote
"$WHL_PLATFORM" and update the wheel_add_ucx_plugins.py call to quote
"$UCX_PLUGINS_DIR", "$NIXL_PLUGINS_DIR" (keep $AUDITWHEEL_EXCLUDES unquoted to
allow splitting) so the variables are treated as single arguments even when they
contain spaces.
- Around line 312-314: The final copy step uses a raw glob to copy wheels (cp
"$TMP_DIR"/dist/*.whl "$OUTPUT_DIR") instead of validating/normalizing the
output like the multi-torch path; replace that raw copy with a call to the
existing helper single_wheel(...) to pick/validate the built wheel and then move
it to OUTPUT_DIR. Locate the block that calls build_wheel and repair_wheel
(functions build_wheel and repair_wheel) and change the final cp step to use
single_wheel pointing at "$TMP_DIR/dist" (and ensure it writes or moves the
selected wheel into OUTPUT_DIR), preserving TMP_DIR and OUTPUT_DIR variables.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 745e2b49-c6b0-474a-8df7-5d7c52d4ad21
📒 Files selected for processing (1)
contrib/build-wheel.sh
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
|
/build |
guy-ealey-morag
left a comment
There was a problem hiding this comment.
We discussed some points to improve in later PRs, but other than that LGTM
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
What?
Adds a
nixl_epPython dispatch layer that ships in the meta wheel and routesimport nixl_epto the correct backend at runtime, picked from two axes:torch.version.cuda.torch.__version__.Each
nixl-cu12/nixl-cu13backend wheel ships all supported torch ABIs as separate compiled extensions in the same package, so a single backend wheel works against any installed torch in the supported range without reinstalling.Why?
vLLM is working on upgrading torch from 2.11 to 2.12. Currently NIXL EP wheel ships binary with ABI compatibility only with 2.11, by the time we release this may be obsolete.
Also, there may be a transition period where vLLM will ship a mix of images, some requiring PyTorch 2.11 (stable) and some 2.12 (nightly) so it is best if we can support both.
vLLM PR: vllm-project/vllm#40077
Also fixes vllm-project/vllm#42525
How?
The full import flow works like this:
import nixl_eplands in the meta dispatcher,src/bindings/python/nixl-meta/nixl_ep/__init__.py. It readstorch.version.cudaand picksnixl_ep_cu12ornixl_ep_cu13:On CPU-only torch it falls through to a
cu13→cu12probe order so that something gets loaded and the import does not fail, since vLLM imports unconditionally, but it will not be used. This is the same asimport nixlworks.The selected backend (e.g.
nixl_ep_cu13/__init__.py, shipped in thenixl-cu13wheel) readstorch.__version__and loads the matching ABI extension bindings through a second dispatch:With torch 2.12 installed, this resolves to
nixl_ep_cu13.nixl_ep_cpp_torch212.The meta dispatcher then re-exposes the backend's public symbols and submodules under the top-level
nixl_epnamespace so that the import is fully pass-through:So
nixl_ep.Buffer,nixl_ep.EventOverlap,nixl_ep.Config,nixl_ep.utils,nixl_ep.bufferall work regardless of which (CUDA, torch) backend was selected.If torch reports a CUDA major for which no backend wheel is installed, the meta dispatcher raises
ImportErrorwith a clear message (e.g."torch reports CUDA 13 but nixl-cu13 is not installed").Wheel layout
Supported PyTorch versions: 2.11,2.12.
PyTorch 2.13 is intentionally excluded since it is a nightly package that is not built for all platforms at the moment, and also requires C++20. Out of scope of this PR.
Meta wheel —
nixl-1.1.0-py3-none-any.whl(~12 KB, pure-Python, dispatch only):Backend wheel —
nixl_cu13-1.1.0-cp312-cp312-manylinux_2_28_x86_64.whl:The
cu12backend wheel mirrors this, withnixl_cu12/+nixl_ep_cu12/and the same per-torch.soset.Testing
Test matrix
.solibcudart.so.X2.11.0+cu129(stable)nixl_ep_cu12nixl_ep_cpp_torch21112✓2.12.0.dev20260401+cu129(nightly)nixl_ep_cu12nixl_ep_cpp_torch21212✓2.11.0+cu130(stable)nixl_ep_cu13nixl_ep_cpp_torch21113✓2.12.0+cu130(stable)nixl_ep_cu13nixl_ep_cpp_torch21213✓For each case, we check that:
torch.version.cudamatches the requested cuda-major).nixl_epdispatcher routed to the correctnixl_ep_cu{12,13}backend based ontorch.version.cuda.nixl_ep_cpp_torch{211,212}.sobased ontorch.__version__.Buffer,Config,EventOverlapall resolve through both layers (thenixl_ep_cppalias trick is working inbuffer.py/utils.py).lddshows the wheel-bundledlibucp-…/libucs-…fromnixl_cu{12,13}.libs/, the rightlibcudart.so.{12,13}for the cu-major, andlibtorch*.sofrom the resolved torch in the venv. Nonot found.import nixl(non-EP meta) also imports clean.End-to-end: dispatch contract from
nixl_ep/__init__.py(cuda-major fromtorch.version.cuda) andnixl_ep_cu{N}/__init__.py(torch ABI slug fromtorch.__version__) is verified across the full {cu12, cu13} × {2.11, 2.12} matrix.Testing strategy
Smoke test script
Test log:
Details
Summary by CodeRabbit
New Features
Chores
Style