This repository was archived by the owner on Aug 28, 2025. It is now read-only.

Conversation

@jakirkham
Member

@jakirkham jakirkham commented May 2, 2025

Now that CUDA 12.9 is out, update pynvjitlink to CUDA 12.9.

Part of issue: rapidsai/build-planning#173

@jakirkham jakirkham requested a review from a team as a code owner May 2, 2025 23:46
@jakirkham jakirkham added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels May 2, 2025
@jakirkham
Member Author

jakirkham commented May 2, 2025

Looks like the Conda test for Python 3.10 on ARM had a segfault on CI. Snippet of the error below:

Details
test_pynvjitlink_api.py::test_add_fatbin_with_cubin_error PASSED         [ 91%]
Fatal Python error: Segmentation fault

Current thread 0x0000ffff86c547e0 (most recent call first):
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pynvjitlink/api.py", line 53 in add_data
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pynvjitlink/api.py", line 77 in add_fatbin
  File "/__w/pynvjitlink/pynvjitlink/pynvjitlink/tests/test_pynvjitlink_api.py", line 92 in test_duplicate_symbols_cubin_and_fatbin
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/python.py", line 159 in pytest_pyfunc_call
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/python.py", line 1627 in runtest
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/runner.py", line 174 in pytest_runtest_call
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/runner.py", line 242 in <lambda>
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/runner.py", line 241 in call_and_report
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/main.py", line 362 in pytest_runtestloop
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/main.py", line 337 in _main
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/main.py", line 283 in wrap_session
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/main.py", line 330 in pytest_cmdline_main
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/opt/conda/envs/test/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201 in console_main
  File "/opt/conda/envs/test/lib/python3.10/site-packages/pytest/__main__.py", line 9 in <module>
  File "/opt/conda/envs/test/lib/python3.10/runpy.py", line 86 in _run_code
  File "/opt/conda/envs/test/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, numba.core.typeconv._typeconv, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.typing.builtins.itertools, numba.cpython.builtins.math, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, numba.mviewbuf, numba.core.typing.cmathdecl.cmath, pynvjitlink._nvjitlinklib, numba.types.itertools, numba.cpython.numbers.math, numba.cpython.hashing.math, numba.cpython.hashing.sys, numba.cpython.mathimpl.math, numba.cpython.mathimpl.sys (total: 29)
ci/test_conda.sh: line 58:  1479 Segmentation fault      (core dumped) python -m pytest --cache-clear --junitxml="${RAPIDS_TESTS_DIR}/junit-pynvjitlink.xml" -v
test_pynvjitlink_api.py::test_duplicate_symbols_cubin_and_fatbin /__w/pynvjitlink/pynvjitlink

Edit: Seeing the same error in the Conda test for Python 3.13 on x86_64 on CI
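
To keep a crash like this from taking down the whole pytest run while investigating, one option is to run the suspect test in a child interpreter and inspect its exit status. The following is only a sketch: `run_isolated` is a hypothetical helper, not part of pynvjitlink or the CI scripts, and it relies on the POSIX convention that a child killed by a signal reports a negative return code.

```python
import subprocess
import sys


def run_isolated(args, timeout=600):
    """Run a Python command line in a child interpreter with faulthandler on.

    Hypothetical helper for illustration. Returns (crashed, returncode,
    stderr). On POSIX, a child killed by a signal has a negative return code
    equal to minus the signal number (for example -11 for SIGSEGV).
    """
    # -X faulthandler makes the child dump a Python traceback on fatal signals,
    # which is what produces the "Fatal Python error" report in the log.
    cmd = [sys.executable, "-X", "faulthandler", *args]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    crashed = proc.returncode < 0
    return crashed, proc.returncode, proc.stderr
```

For example, `run_isolated(["-m", "pytest", "-v", "-k", "test_duplicate_symbols_cubin_and_fatbin"])` would report `crashed=True` if the child segfaults, while the parent process keeps running.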

@jakirkham
Member Author

Also seeing the following error in the wheel builds on CI:

Details
Rocky Linux 8 - AppStream                        16 MB/s |  16 MB     00:00    
Rocky Linux 8 - BaseOS                           37 MB/s |  18 MB     00:00    
Rocky Linux 8 - Extras                          1.1 MB/s | 942 kB     00:00    
Rocky Linux 8 - PowerTools                      7.4 MB/s | 2.8 MB     00:00    
cuda                                             60  B/s |  30  B     00:00    
Errors during downloading metadata for repository 'cuda':
  - Status code: 404 for https://developer.download.nvidia.com/compute/cuda/repos/rhel8/sbsa/repodata/141181c011407fd81d902a3092b27cd5892ece66276cc6bc82d5357c98d2e25a-modules.yaml.gz (IP: 23.45.46.200)
  - Status code: 404 for https://developer.download.nvidia.com/compute/cuda/repos/rhel8/sbsa/repodata/cb820d25bd923414e078f1ab24d0b1d2c1f92aebc913e8411fc1da3de393c677-filelists.xml.gz (IP: 23.45.46.200)
  - Status code: 404 for https://developer.download.nvidia.com/compute/cuda/repos/rhel8/sbsa/repodata/4c46d330fca13ffc74f1f818c58b27cb901275d6c2848345af717be62163e96d-primary.xml.gz (IP: 23.45.46.200)
Error: Failed to download metadata for repo 'cuda': Yum repo downloading error: Downloading error(s): repodata/4c46d330fca13ffc74f1f818c58b27cb901275d6c2848345af717be62163e96d-primary.xml.gz - Cannot download, all mirrors were already tried without success; repodata/cb820d25bd923414e078f1ab24d0b1d2c1f92aebc913e8411fc1da3de393c677-filelists.xml.gz - Cannot download, all mirrors were already tried without success; repodata/141181c011407fd81d902a3092b27cd5892ece66276cc6bc82d5357c98d2e25a-modules.yaml.gz - Cannot download, all mirrors were already tried without success
Error: Process completed with exit code 1

However, I am able to download these URLs locally.

I think we need a retry workflow for yum to handle these network issues. Filed an upstream issue: rapidsai/gha-tools#169
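
As a rough illustration of what such a retry could look like (this is not what gha-tools implements; `run_with_retries` and its parameters are hypothetical), here is a minimal sketch that reruns a command with exponential backoff until it exits with status 0:

```python
import subprocess
import time


def run_with_retries(cmd, attempts=5, initial_delay=2.0, backoff=2.0,
                     sleep=time.sleep, run=subprocess.run):
    """Run cmd until it exits 0 or the attempts run out; return the last result.

    Hypothetical helper for illustration. `sleep` and `run` are injectable so
    the retry logic can be exercised without real delays or real commands.
    """
    delay = initial_delay
    result = None
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            print(f"attempt {attempt}/{attempts} failed "
                  f"(exit {result.returncode}); retrying in {delay:.0f}s")
            sleep(delay)
            delay *= backoff  # wait longer after each failure
    return result
```

A call such as `run_with_retries(["yum", "makecache"])` would then tolerate a few transient 404s from the repository mirror. Which command the wheel build actually runs at that step is an assumption here.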

@jakirkham
Member Author

Rerunning to see whether the wheel CI failures clear.

@gmarkall
Contributor

gmarkall commented May 3, 2025

I suspect libnvjitlink 12.9 is not handling errors as gracefully as previous versions have done (there have been similar problems in the past).

@gmarkall
Contributor

gmarkall commented May 3, 2025

There is some misbehaviour inside nvjitlink when an error occurs in test_duplicate_symbols_cubin_and_fatbin:

pynvjitlink/tests/test_pynvjitlink_api.py::test_duplicate_symbols_cubin_and_fatbin ==78711== Invalid read of size 8
==78711==    at 0xC50B30EC: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5E282A9: libnvJitLink_static_ca9688632607b8f844ea5ec22638306551329a87 (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5E2840A: libnvJitLink_static_48deece8fe1e49bdb3c717ab477bf39b4f55b33f (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5E2C14F: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5E2C3B2: libnvJitLink_static_bb885e489c3011fa30d603028f4491924721fd1f (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5DE8D03: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5DE9521: libnvJitLink_static_8c7a44a5c809a3fca7bb3f91b8b7d51b1d7e85a9 (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC684D48B: libnvJitLink_static_7af337afe102509b14fbdcb4f3a9b4e448359af2 (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC68488EF: libnvJitLink_static_17b6d48c497352bc2646b6fc0dfa0e2065a26dbb (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC50B31D1: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC50B44E0: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC50A41E1: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==  Address 0xbf53d0f0 is 48 bytes inside a block of size 192 free'd
==78711==    at 0x484BB0C: free (vg_replace_malloc.c:985)
==78711==    by 0xC50B48EF: libnvJitLink_static_8b1d4706728f95a8a8732c4e797e4ae179c05ba2 (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC50A4017: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0x2D9AD1: cfunction_call (methodobject.c:551)
==78711==    by 0x2B68FB: _PyObject_MakeTpCall (call.c:242)
==78711==    by 0x1A687C: _PyEval_EvalFrameDefault.cold (generated_cases.c.h:813)
==78711==    by 0x373C0B: _PyObject_VectorcallDictTstate (call.c:146)
==78711==    by 0x3C37A9: UnknownInlinedFun (call.c:504)
==78711==    by 0x3C37A9: slot_tp_call (typeobject.c:9556)
==78711==    by 0x2B68FB: _PyObject_MakeTpCall (call.c:242)
==78711==    by 0x1ACD3A: _PyEval_EvalFrameDefault.cold (generated_cases.c.h:1502)
==78711==    by 0x373C0B: _PyObject_VectorcallDictTstate (call.c:146)
==78711==    by 0x3C37A9: UnknownInlinedFun (call.c:504)
==78711==    by 0x3C37A9: slot_tp_call (typeobject.c:9556)
==78711==  Block was alloc'd at
==78711==    at 0x48487EF: malloc (vg_replace_malloc.c:442)
==78711==    by 0xC5E2FF61: libnvJitLink_static_b58a5458563440498452cf6c27f13bbf6e2b1c96 (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC5E2AE59: libnvJitLink_static_26d4fc05d50c93644001f719b371ff3de747fe26 (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC50B4C7E: libnvJitLink_static_5c65ac30ac74eb28d70bd90bee9a4b47eafc075f (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0xC50A4371: ??? (in /home/gmarkall/miniforge3/envs/pynvjitlink-12-9/lib/python3.13/site-packages/pynvjitlink/_nvjitlinklib.cpython-313-x86_64-linux-gnu.so)
==78711==    by 0x2D9AD1: cfunction_call (methodobject.c:551)
==78711==    by 0x2B68FB: _PyObject_MakeTpCall (call.c:242)
==78711==    by 0x1A687C: _PyEval_EvalFrameDefault.cold (generated_cases.c.h:813)
==78711==    by 0x373C0B: _PyObject_VectorcallDictTstate (call.c:146)
==78711==    by 0x3C37A9: UnknownInlinedFun (call.c:504)
==78711==    by 0x3C37A9: slot_tp_call (typeobject.c:9556)
==78711==    by 0x2B68FB: _PyObject_MakeTpCall (call.c:242)
==78711==    by 0x1ACD3A: _PyEval_EvalFrameDefault.cold (generated_cases.c.h:1502)

With CUDA 12.9 this leads to invalid reads within nvjitlink.
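
The key lines in the report are the "Invalid read" and the "Address ... inside a block ... free'd" pair, which together indicate a read from memory that had already been freed. As a convenience for scanning longer logs, here is a small sketch that pulls those pairs out of Valgrind text output. `find_use_after_free` is a hypothetical helper, and the patterns only assume the message wording shown in the log above.

```python
import re

# Patterns follow the wording seen in the Valgrind output above.
INVALID_ACCESS = re.compile(r"==\d+== Invalid (read|write) of size (\d+)")
FREED_BLOCK = re.compile(
    r"==\d+==\s+Address (0x[0-9a-fA-F]+) is ([\d,]+) bytes inside "
    r"a block of size ([\d,]+) free'd"
)


def _to_int(text):
    # Tolerate thousands separators in case large sizes are printed with them.
    return int(text.replace(",", ""))


def find_use_after_free(log_text):
    """Return one dict per invalid access that lands in an already-freed block."""
    events = []
    pending = None  # latest invalid access still waiting for its "Address" line
    for line in log_text.splitlines():
        access = INVALID_ACCESS.search(line)
        if access:
            pending = {"kind": access.group(1), "size": int(access.group(2))}
            continue
        freed = FREED_BLOCK.search(line)
        if freed and pending is not None:
            pending.update(
                address=freed.group(1),
                offset=_to_int(freed.group(2)),
                block_size=_to_int(freed.group(3)),
            )
            events.append(pending)
            pending = None
    return events
```

Applied to the output above, this would report a single 8-byte read at offset 48 of a freed 192-byte block.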
@gmarkall
Contributor

gmarkall commented May 3, 2025

Skipping the test in question removes the invalid reads - I've pushed a commit with the skip that I think should enable us to move forward (the test was previously xfailed anyway).
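
The difference between the two markers matters here: an expected-failure test still executes its body, so the invalid reads (and a possible segfault) can still happen, whereas a skipped test never runs. The sketch below shows that behaviour using the standard library's `unittest` as a stand-in so it stays self-contained; in a pytest suite the analogous markers are `pytest.mark.xfail` (which runs the test by default) and `pytest.mark.skip`. The class and test names are made up for illustration.

```python
import unittest

executed = []  # records which test bodies actually ran


class Demo(unittest.TestCase):
    @unittest.expectedFailure
    def test_expected_failure(self):
        executed.append("expected_failure")  # the body still runs
        raise RuntimeError("stand-in for the linker error")

    @unittest.skip("invalid reads inside nvjitlink with CUDA 12.9")
    def test_skipped(self):
        executed.append("skipped")  # never reached


def run_demo():
    """Run both demo tests and return the unittest result object."""
    executed.clear()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

After `run_demo()`, only the expected-failure body has been recorded in `executed`, which is why a skip is the safer choice when the body itself corrupts memory.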

Contributor

@gmarkall gmarkall left a comment

I'm comfortable moving forward with this, and dealing with the test issue separately / later (I see no practical resolution within a reasonable timeframe for a CUDA 12.9-based release of pynvjitlink).

@gmarkall gmarkall merged commit f663c39 into rapidsai:main May 4, 2025
49 checks passed
gmarkall added a commit to gmarkall/pynvjitlink that referenced this pull request May 4, 2025
- Update to CUDA 12.9 (rapidsai#138)
- feat(conda): port conda recipe to rattler-build (rapidsai#137)
- Download build artifacts from Github for CI (rapidsai#136)
- Moving wheel builds to specified location and uploading build artifacts to Github (rapidsai#135)
- Use mainline shared-workflows again (rapidsai#134)
@gmarkall gmarkall mentioned this pull request May 4, 2025
gmarkall added a commit that referenced this pull request May 4, 2025
- Update to CUDA 12.9 (#138)
- feat(conda): port conda recipe to rattler-build (#137)
- Download build artifacts from Github for CI (#136)
- Moving wheel builds to specified location and uploading build artifacts to Github (#135)
- Use mainline shared-workflows again (#134)

@jakirkham jakirkham deleted the update_cuda_12.9 branch May 6, 2025 21:27
@jakirkham
Member Author

Thanks Graham and Bradley! 🙏
