fix(library): avoid spurious close of cached shared library when calling cudart.getLocalRuntimeVersion
#1010
Changes from 4 commits
bc41afe
09a594c
8cf16e2
73c43ae
cd89f39
@@ -1,9 +1,11 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0

+import ctypes
 import functools
 import struct
 import sys
+import weakref

 from cuda.pathfinder._dynamic_libs.find_nvidia_dynamic_lib import _FindNvidiaDynamicLib
 from cuda.pathfinder._dynamic_libs.load_dl_common import LoadedDL, load_dependencies

@@ -14,12 +16,14 @@
         check_if_already_loaded_from_elsewhere,
         load_with_abs_path,
         load_with_system_search,
+        unload_dl,
     )
 else:
     from cuda.pathfinder._dynamic_libs.load_dl_linux import (
         check_if_already_loaded_from_elsewhere,
         load_with_abs_path,
         load_with_system_search,
+        unload_dl,
     )

@@ -117,4 +121,13 @@ def load_nvidia_dynamic_lib(libname: str) -> LoadedDL:
             f" Currently running: {pointer_size_bits}-bit Python"
             f" {sys.version_info.major}.{sys.version_info.minor}"
         )
-    return _load_lib_no_cache(libname)
+    library = _load_lib_no_cache(libname)
+
+    # Ensure that the library is unloaded after GC runs on `library`
+    #
+    # We only need the address, so the rest of whatever is in `library` is free
+    # to be cleaned up. The integer address is immutable, so it gets copied
+    # upon being referenced here
+    weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle_uint))

with gil, __symbol_lock:
    # Load library
    handle = load_nvidia_dynamic_lib("nvJitLink")._handle_uint
    # Load function
    global __nvJitLinkCreate
    __nvJitLinkCreate = GetProcAddress(handle, 'nvJitLinkCreate')
    global __nvJitLinkDestroy
    __nvJitLinkDestroy = GetProcAddress(handle, 'nvJitLinkDestroy')
    global __nvJitLinkAddData
    __nvJitLinkAddData = GetProcAddress(handle, 'nvJitLinkAddData')
    global __nvJitLinkAddFile
    __nvJitLinkAddFile = GetProcAddress(handle, 'nvJitLinkAddFile')
    global __nvJitLinkComplete
    __nvJitLinkComplete = GetProcAddress(handle, 'nvJitLinkComplete')
    global __nvJitLinkGetLinkedCubinSize
    __nvJitLinkGetLinkedCubinSize = GetProcAddress(handle, 'nvJitLinkGetLinkedCubinSize')
    global __nvJitLinkGetLinkedCubin
    __nvJitLinkGetLinkedCubin = GetProcAddress(handle, 'nvJitLinkGetLinkedCubin')
    global __nvJitLinkGetLinkedPtxSize
    __nvJitLinkGetLinkedPtxSize = GetProcAddress(handle, 'nvJitLinkGetLinkedPtxSize')
    global __nvJitLinkGetLinkedPtx
    __nvJitLinkGetLinkedPtx = GetProcAddress(handle, 'nvJitLinkGetLinkedPtx')
    global __nvJitLinkGetErrorLogSize
    __nvJitLinkGetErrorLogSize = GetProcAddress(handle, 'nvJitLinkGetErrorLogSize')
    global __nvJitLinkGetErrorLog
    __nvJitLinkGetErrorLog = GetProcAddress(handle, 'nvJitLinkGetErrorLog')
    global __nvJitLinkGetInfoLogSize
    __nvJitLinkGetInfoLogSize = GetProcAddress(handle, 'nvJitLinkGetInfoLogSize')
    global __nvJitLinkGetInfoLog
    __nvJitLinkGetInfoLog = GetProcAddress(handle, 'nvJitLinkGetInfoLog')
    global __nvJitLinkVersion
    __nvJitLinkVersion = GetProcAddress(handle, 'nvJitLinkVersion')
    __py_nvjitlink_init = True
    return 0
I need to find out: What happens if we trigger unloading a library while we still have the addresses where the symbols were loaded into memory?
It only runs when the process is finalizing.
I don't think that's true.
You're right, sorry.
Actually, no, based on this quick experiment it is true:
- I checked out your leave-dl-dangling branch (this PR)
- I added this one line:
--- a/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
+++ b/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
@@ -47,6 +47,7 @@ RTLD_DI_ORIGIN = 6
 def unload_dl(handle: ctypes.c_void_p) -> None:
+    print(f"\nLOOOK unload_dl({handle=!r})", flush=True)
     result = LIBDL.dlclose(handle)
     if result:
         raise RuntimeError(LIBDL.dlerror())
- Then I ran the cuda_bindings unit tests:
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $ pytest -ra -s -v tests/
====================================================================================================== test session starts =======================================================================================================
platform linux -- Python 3.12.3, pytest-8.4.2, pluggy-1.6.0 -- /home/rgrossekunst/forked/cuda-python/cuda_pathfinder/WslLocalCudaVenv/bin/python3
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/rgrossekunst/forked/cuda-python/cuda_bindings/tests
configfile: pytest.ini
plugins: benchmark-5.1.0
collected 192 items / 1 skipped
tests/test_cuda.py::test_cuda_memcpy PASSED
tests/test_cuda.py::test_cuda_array PASSED
tests/test_cuda.py::test_cuda_repr_primitive PASSED
<... snip ...>
tests/test_cudart.py::test_getLocalRuntimeVersion PASSED
<... snip ...>
tests/test_utils.py::test_cyclical_imports[nvvm] PASSED
tests/test_utils.py::test_cyclical_imports[runtime] PASSED
tests/test_utils.py::test_cyclical_imports[cufile] PASSED
==================================================================================================== short test summary info =====================================================================================================
SKIPPED [1] tests/test_cufile.py:40: skipping cuFile tests on WSL
================================================================================================= 192 passed, 1 skipped in 7.03s =================================================================================================
LOOOK unload_dl(handle=c_void_p(791712928))
LOOOK unload_dl(handle=c_void_p(781336800))
LOOOK unload_dl(handle=c_void_p(732155104))
LOOOK unload_dl(handle=c_void_p(731635248))
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $
Because of functools.cache, the finalizers only run in the process tear-down, when the handles are sure to disappear anyway.
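The interaction described above can be reproduced without any CUDA library. The following is a minimal sketch, assuming a stand-in `FakeLoadedDL` class and a `load_fake_lib` function (both hypothetical, not part of cuda.pathfinder). Because `functools.cache` keeps a strong reference to every object it returns, a `weakref.finalize` callback registered on that object cannot fire until the cache entry goes away, which in a normal run means interpreter exit.

```python
import functools
import gc
import weakref

unloaded = []  # records which fake handles were "closed"


class FakeLoadedDL:
    """Hypothetical stand-in for a loaded-library record."""

    def __init__(self, handle: int) -> None:
        self.handle = handle


@functools.cache
def load_fake_lib(libname: str) -> FakeLoadedDL:
    library = FakeLoadedDL(handle=0x1000 + len(libname))
    # Pass only the integer, never `library` itself: a strong reference
    # from the finalizer back to the object would keep it alive forever.
    weakref.finalize(library, unloaded.append, library.handle)
    return library


lib = load_fake_lib("nvJitLink")
handle = lib.handle
del lib
gc.collect()
ran_while_cached = list(unloaded)  # the cache still holds a reference

load_fake_lib.cache_clear()  # drop the cache's reference explicitly
gc.collect()
ran_after_clear = list(unloaded)  # the finalizer has now run
```

Dropping the caller's own reference is not enough to trigger the finalizer; only clearing the cache (or exiting the process) is.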
I am very nervous about the finalizer solution: as per Phillip's investigation last week, we will call the C APIs to do cleanup, but aren't the finalizers called before that step, leaving us with dangling pointers?
We have been leaking DSO handles on purpose since pretty much the beginning of this project (~2021) and no one complained, in particular with respect to our handling of the driver (libcuda). Why is this leaking suddenly not acceptable here?
... aren't the finalizers called before that step and leaving us dangling pointers?
Is it possible that the void* function pointers in various places get called after the library is closed, which happens at the module level? This would mean that the function (literally, the function object corresponding to the thing created by def; this matters because that's where the cache is stored) gets torn down and the finalizer called on LoadedDL before invoking one of those functions.
I can imagine the following scenario:
lib = load_nvidia_shared_library("nvjitlink")
foo = create_something_that_eventually_calls_a_cleanup_routine() # note the possible implicit call to `load_nvidia_shared_library`
# ... many lines of code later
#
# Python shuts down
# Python calls the finalizer for `lib`
# Python GC's `foo` which needs an open handle corresponding to `lib`'s address
So, we are probably always going to be constrained by this if we support code paths that can call globally-defined-and-expected-to-be-valid-pointers at any point in a program's lifetime.
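The ordering hazard in that scenario can be simulated in pure Python. This is only a sketch under stated assumptions: `fake_dlopen`, `fake_dlclose`, `make_destroy_pointer`, and `shutdown` are hypothetical names, and a `RuntimeError` stands in for what would be a crash in a real process. The point is that a cached function pointer remembers only an address, so it does nothing to keep the library open.

```python
OPEN_HANDLES: set[int] = set()


def fake_dlopen(handle: int) -> int:
    OPEN_HANDLES.add(handle)
    return handle


def fake_dlclose(handle: int) -> None:
    OPEN_HANDLES.discard(handle)


def make_destroy_pointer(handle: int):
    """Mimic a cached raw function pointer tied to an integer handle."""

    def destroy() -> str:
        if handle not in OPEN_HANDLES:
            # A real process would jump into unmapped memory here.
            raise RuntimeError("call through dangling function pointer")
        return "destroyed"

    return destroy


def shutdown(close_library_first: bool) -> str:
    handle = fake_dlopen(1)
    destroy = make_destroy_pointer(handle)  # e.g. a global set at init time
    if close_library_first:
        fake_dlclose(handle)  # the library's finalizer runs first
    try:
        return destroy()  # cleanup of an object that still needs the library
    except RuntimeError as exc:
        return str(exc)


finalizer_first = shutdown(close_library_first=True)
leaked_handle = shutdown(close_library_first=False)
```

With the handle leaked, the late cleanup call still succeeds; with the finalizer run first, it fails.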
I just spoke with @rwgk in person and we agreed that leaking the handles is preferable for now. I will revert the most recent commit, leave the comment, and then we can merge this fix.
So, as best any of us knows, the answer is "we don't know, but we know that we might need to call these functions at any point, so we leak the handles to avoid a worse outcome".
Copying my comment from the private PR:
Could you help me understand the context more? Where is the functools.cache?
I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.
https://github.com/nvidia/cuda-python/blob/main/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py?plain=1#L54
The caching isn't done for the function's result.
It's that this function loads the library and then closes it, invalidating the handle to that library that is cached by the functools.cache that decorates load_nvidia_dynamic_lib. The pointer itself remains valid, but the symbol table (at least in the ELF loader) contains NULL pointers that are eventually dereferenced during a subsequent dlsym call with that (now invalid) handle.
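A toy model may make the failure mode easier to follow. It is a sketch only: `fake_dlopen`, `fake_dlclose`, `fake_dlsym`, `load_cached`, and the two `get_version_*` functions are hypothetical, and a `RuntimeError` stands in for the NULL dereference inside the real loader. The symbol name is used purely for illustration.

```python
import functools
import itertools

_open_handles: dict[int, str] = {}
_next_handle = itertools.count(1)


def fake_dlopen(libname: str) -> int:
    handle = next(_next_handle)
    _open_handles[handle] = libname
    return handle


def fake_dlclose(handle: int) -> None:
    del _open_handles[handle]


def fake_dlsym(handle: int, symbol: str) -> str:
    if handle not in _open_handles:
        raise RuntimeError("dlsym on closed handle")
    return f"{_open_handles[handle]}::{symbol}"


@functools.cache
def load_cached(libname: str) -> int:
    return fake_dlopen(libname)


def get_version_buggy() -> str:
    handle = load_cached("cudart")
    address = fake_dlsym(handle, "cudaRuntimeGetVersion")
    fake_dlclose(handle)  # spurious close: the cache still hands this out
    return address


def get_version_fixed() -> str:
    handle = load_cached("cudart")
    return fake_dlsym(handle, "cudaRuntimeGetVersion")  # handle stays open


first = get_version_buggy()
try:
    second = get_version_buggy()
except RuntimeError as exc:
    second = str(exc)

load_cached.cache_clear()  # start over with a fresh handle
fixed = [get_version_fixed(), get_version_fixed()]
```

The first buggy call works; the second receives the same cached handle, which is by then closed. Leaving the handle open, as the fixed variant does, keeps every later lookup valid.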
Reposting my response also:
Ah ... thanks. Could you please give me a moment to think about this?
I didn't realize that the caching implies: never close the handle. That's not good.
We can keep the discussion here to avoid duplicating on cuda-python-private.
I spent most of the day going down this rabbit hole, so I'm happy to talk it through IRL if that helps.
Assuming you need to get this issue out of the way asap:
WDYT about:
Comment out the code here (but don't delete for easy reference).
Add this comment:
Not really a huge fan of that in general.
We have git history if someone really needs the exact code.
There's no rush here since 3.13t is experimental and 3.14 is still an RC.
If you have a solution you want to explore, have at it!
I think we will not have a solution overnight. For the moment I'd just do this:
It's fine to delete the code for closing the handle entirely. From what I learned yesterday afternoon, the code here will have to change for sure, if we decide to support closing the handles.
I'm not following what problem remains with the weakref finalize solution. I'm not saying it is without problems, but I currently don't see how it doesn't solve the problem of correctly closing a CDLL-opened library at the right time.