20 changes: 13 additions & 7 deletions cuda_bindings/cuda/bindings/cyruntime.pyx.in
@@ -1917,13 +1917,19 @@ cdef cudaError_t getLocalRuntimeVersion(int* runtimeVersion) except ?cudaErrorCa
cdef cudaError_t err = cudaSuccess
err = (<cudaError_t (*)(int*) except ?cudaErrorCallRequiresNewerDriver nogil> __cudaRuntimeGetVersion)(runtimeVersion)

# Unload
{{if 'Windows' == platform.system()}}
windll.FreeLibrary(handle)
{{else}}
dlfcn.dlclose(handle)
{{endif}}
# We explicitly do *NOT* cleanup the library handle here, acknowledging
# that, yes, the handle leaks. The reason is that there's a
# `functools.cache` on the top-level caller of this function.
#
# This means the library would be opened once and then immediately closed,
# while the (now stale) handle remained in the cache for later callers.
#
# Since we open the library one time (technically once per unique library name),
# there's not a ton of leakage, which we deem acceptable for the 1000x speedup
# achieved by caching (ultimately) `ctypes.CDLL` calls.
#
# Long(er)-term we can explore cleaning up the library using higher-level
# Python mechanisms, like `__del__` or `weakref.finalize`.
Collaborator:
Copying my comment from the private PR:

Could you help me understand the context more? Where is the functools.cache?

I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.

Contributor Author:
Where is the functools.cache?

https://github.com/nvidia/cuda-python/blob/main/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py?plain=1#L54

I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.

The caching isn't done for the function's result.

It's that this function loads the library and then closes it, invalidating the handle cached by the functools.cache that decorates load_nvidia_dynamic_lib. The pointer itself remains valid, but the symbol table (at least in the ELF loader) ends up containing NULL pointers that are eventually dereferenced during a subsequent dlsym call made with that (now invalid) handle.
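For concreteness, here is a minimal sketch of that failure mode (Linux/glibc only; the library name, symbol name, and cached_load helper are placeholders for illustration, not the actual pathfinder code):

import ctypes
import functools

_libdl = ctypes.CDLL("libdl.so.2", use_errno=True)  # glibc; on newer glibc these symbols also live in libc
_libdl.dlopen.restype = ctypes.c_void_p
_libdl.dlopen.argtypes = [ctypes.c_char_p, ctypes.c_int]
_libdl.dlsym.restype = ctypes.c_void_p
_libdl.dlsym.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
_libdl.dlclose.argtypes = [ctypes.c_void_p]
_libdl.dlclose.restype = ctypes.c_int

RTLD_NOW = 2  # glibc value

@functools.cache
def cached_load(name: bytes) -> int:
    # The handle is computed once per name and then served from the cache forever.
    return _libdl.dlopen(name, RTLD_NOW)

handle = cached_load(b"libcudart.so.12")
# ... the caller uses the library, then "cleans up" its local handle:
_libdl.dlclose(ctypes.c_void_p(handle))

# A later caller receives the *same* cached handle, which now refers to an
# unloaded library; dlsym through it is undefined behavior and can segfault.
stale = cached_load(b"libcudart.so.12")
_libdl.dlsym(ctypes.c_void_p(stale), b"cudaRuntimeGetVersion")  # may crash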

Collaborator:
Reposting my response here also:

Ah ... thanks. Could you please give me a moment to think about this?

I didn't realize that the caching implies: never close the handle. That's not good.

Contributor Author:
We can keep the discussion here to avoid duplicating on cuda-python-private.

Contributor Author:
I spent most of the day going down this rabbit hole, so I'm happy to talk it through IRL if that helps.

Collaborator:
Assuming you need to get this issue out of the way asap:

WDYT about:

Comment out the code here (but don't delete for easy reference).

Add this comment:

# Skip closing handle until https://github.com/NVIDIA/cuda-python/issues/1011 is resolved.

Contributor Author:
Comment out the code here (but don't delete for easy reference).

Not really a huge fan of that in general.

We have git history if someone really needs the exact code.

Contributor Author:
Assuming you need to get this issue out of the way asap:

There's no rush here since 3.13t is experimental and 3.14 is still an RC.

If you have a solution you want to explore, have at it!

Collaborator:
I think we will not have a solution overnight. For the moment I'd just do this:

# Currently pathfinder does not support closing the handle.
# See https://github.com/NVIDIA/cuda-python/issues/1011 for background.

It's fine to delete the code for closing the handle entirely. From what I learned yesterday afternoon, the code here will have to change for sure, if we decide to support closing the handles.

Contributor Author:
I'm not following what problem remains with the weakref finalize solution. I'm not saying it is without problems, but I currently don't see how it doesn't solve the problem of correctly closing a CDLL-opened library at the right time.
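To make that concrete, here's a minimal sketch of what I mean by the weakref.finalize approach (hypothetical names, Linux-only; a sketch of the idea, not the code in this PR): open the library via ctypes and register a finalizer that calls dlclose on the raw handle once the wrapper object is collected.

import ctypes
import weakref

_libdl = ctypes.CDLL("libdl.so.2")  # glibc; on newer glibc dlclose also lives in libc
_libdl.dlclose.argtypes = [ctypes.c_void_p]
_libdl.dlclose.restype = ctypes.c_int


class LoadedLib:
    def __init__(self, soname: str):
        self._cdll = ctypes.CDLL(soname)
        # The finalizer captures only the raw address (not `self`), so it
        # cannot keep the wrapper alive; it runs when `self` is collected
        # or, at the latest, during interpreter shutdown.
        weakref.finalize(self, _libdl.dlclose, ctypes.c_void_p(self._cdll._handle))


lib = LoadedLib("libcudart.so.12")
del lib  # collection of `lib` triggers dlclose on its handle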


# Return
return err
{{endif}}
16 changes: 9 additions & 7 deletions cuda_bindings/tests/test_cudart.py
@@ -1404,10 +1404,12 @@ def test_struct_pointer_comparison(target):


def test_getLocalRuntimeVersion():
try:
err, version = cudart.getLocalRuntimeVersion()
except pathfinder.DynamicLibNotFoundError:
pytest.skip("cudart dynamic lib not available")
else:
assertSuccess(err)
assert version >= 12000 # CUDA 12.0
# verify that successive calls do not segfault the interpreter
for _ in range(10):
try:
err, version = cudart.getLocalRuntimeVersion()
except pathfinder.DynamicLibNotFoundError:
pytest.skip("cudart dynamic lib not available")
else:
assertSuccess(err)
assert version >= 12000 # CUDA 12.0
cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
@@ -38,11 +38,20 @@ def _load_libdl() -> ctypes.CDLL:
LIBDL.dlerror.argtypes = []
LIBDL.dlerror.restype = ctypes.c_char_p

LIBDL.dlclose.argtypes = [ctypes.c_void_p]
LIBDL.dlclose.restype = ctypes.c_int

# First appeared in 2004-era glibc. Universally correct on Linux for all practical purposes.
RTLD_DI_LINKMAP = 2
RTLD_DI_ORIGIN = 6


def unload_dl(handle: ctypes.c_void_p) -> None:
result = LIBDL.dlclose(handle)
if result:
raise RuntimeError(LIBDL.dlerror())


class _LinkMapLNameView(ctypes.Structure):
"""
Prefix-only view of glibc's `struct link_map` used **solely** to read `l_name`.
(Windows counterpart of load_dl_linux.py)
@@ -46,6 +46,9 @@
kernel32.AddDllDirectory.argtypes = [ctypes.wintypes.LPCWSTR]
kernel32.AddDllDirectory.restype = ctypes.c_void_p # DLL_DIRECTORY_COOKIE

kernel32.FreeLibrary.argtypes = [ctypes.wintypes.HMODULE]
kernel32.FreeLibrary.restype = ctypes.c_bool


def ctypes_handle_to_unsigned_int(handle: ctypes.wintypes.HMODULE) -> int:
"""Convert ctypes HMODULE to unsigned int."""
@@ -157,3 +160,9 @@ def load_with_abs_path(libname: str, found_path: str) -> LoadedDL:
raise RuntimeError(f"Failed to load DLL at {found_path}: Windows error {error_code}")

return LoadedDL(found_path, False, ctypes_handle_to_unsigned_int(handle))


def unload_dl(handle: ctypes.c_void_p) -> None:
result = kernel32.FreeLibrary(handle)
if not result:
raise RuntimeError(f"Failed to load windows DLL with error code: {ctypes.GetLastError()}")
cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py
@@ -1,9 +1,11 @@
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

import ctypes
import functools
import struct
import sys
import weakref

from cuda.pathfinder._dynamic_libs.find_nvidia_dynamic_lib import _FindNvidiaDynamicLib
from cuda.pathfinder._dynamic_libs.load_dl_common import LoadedDL, load_dependencies
@@ -14,12 +16,14 @@
check_if_already_loaded_from_elsewhere,
load_with_abs_path,
load_with_system_search,
unload_dl,
)
else:
from cuda.pathfinder._dynamic_libs.load_dl_linux import (
check_if_already_loaded_from_elsewhere,
load_with_abs_path,
load_with_system_search,
unload_dl,
)


@@ -117,4 +121,13 @@ def load_nvidia_dynamic_lib(libname: str) -> LoadedDL:
f" Currently running: {pointer_size_bits}-bit Python"
f" {sys.version_info.major}.{sys.version_info.minor}"
)
return _load_lib_no_cache(libname)

library = _load_lib_no_cache(libname)

# Ensure that the library is unloaded after GC runs on `library`
#
# We only need the address, so the rest of whatever is in `library` is free
# to be cleaned up. The integer address is immutable, so it gets copied
# upon being referenced here
weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle_uint))
Collaborator:
In the case where, within _load_lib_no_cache, check_if_already_loaded_from_elsewhere returns the LoadedDL, we probably shouldn't close the handle either, since we're just resolving the handle to a library that was already loaded from elsewhere?

Collaborator:
Yeah, it's quite complex. It took me 5 minutes to write this prompt and upload the files, then the reasoning part took a whopping 10 minutes (the longest I've seen so far):

https://chatgpt.com/share/68d3709e-a60c-8008-95c2-978251b18874

I already spent 20 minutes reviewing the response. I think it's really good, although I already discovered one small flaw in the reference implementation. I'll work on it a little more to be sure it can be made to work in general, then we can decide what balance of complexity vs features we want.

Contributor Author:
@kkraus14

we probably shouldn't close the handle either since we're just resolving the handle to an already loaded library from elsewhere

If we're dlopen-ing it, then we need a corresponding dlclose. Are we dlopen-ing the library when we resolve the handle?

@rwgk

I really think we should try to think through the problem and not just go with whatever chatgpt gives us.

I personally find it hard to justify a complicated AI-generated solution that adds extra APIs, each introducing concepts (such as pinning) that are novel in Python APIs, and that requires maintaining something a human did not come up with and likely cannot explain with the level of understanding they would have if they had produced the solution themselves.

That said, this is just my opinion, so take it with a grain of salt. If the AI solution is really better than getting to the bottom of the problem and building a tailor-made solution then perhaps it's fine.

Contributor Author:
Since RTLD_NOLOAD increments the library's reference count, the finalize here is still necessary even if the library was loaded from elsewhere.
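As a minimal illustration (hypothetical library name, Linux/glibc only, not the pathfinder code): opening an already-loaded library with RTLD_NOLOAD still takes a reference that someone has to drop.

import ctypes
import os

try:
    # Succeeds only if the library is already mapped into this process;
    # on success the loader's reference count for it is incremented.
    lib = ctypes.CDLL("libnvJitLink.so.12", mode=os.RTLD_NOW | os.RTLD_NOLOAD)
except OSError:
    lib = None  # not loaded anywhere in this process

if lib is not None:
    # ctypes never calls dlclose on our behalf, so the extra reference taken
    # by the RTLD_NOLOAD dlopen stays held until something (e.g. a finalizer)
    # releases it explicitly.
    pass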

Collaborator:
The RTLD_NOLOAD is done via a ctypes.CDLL call, though, as opposed to calling dlopen directly. It seems ctypes doesn't actually close the handle, which is surprising to me and makes my previous statement incorrect.

Collaborator:
Wow, ok, tx.

I've been blissfully ignorant of the handle refcount system; I need to review the existing code with a new mindset. My preference is to unblock you here with a minimal change and then look at this carefully in a separate PR.

One thing I want to look into: Is it ok for the existing clients (mainly cython-gen, cybind) if we close the handle, which may trigger unloading the library? The clients are holding on to pointers into the dynamic libs, e.g.

with gil, __symbol_lock:
    # Load library
    handle = load_nvidia_dynamic_lib("nvJitLink")._handle_uint
    # Load function
    global __nvJitLinkCreate
    __nvJitLinkCreate = GetProcAddress(handle, 'nvJitLinkCreate')
    global __nvJitLinkDestroy
    __nvJitLinkDestroy = GetProcAddress(handle, 'nvJitLinkDestroy')
    global __nvJitLinkAddData
    __nvJitLinkAddData = GetProcAddress(handle, 'nvJitLinkAddData')
    global __nvJitLinkAddFile
    __nvJitLinkAddFile = GetProcAddress(handle, 'nvJitLinkAddFile')
    global __nvJitLinkComplete
    __nvJitLinkComplete = GetProcAddress(handle, 'nvJitLinkComplete')
    global __nvJitLinkGetLinkedCubinSize
    __nvJitLinkGetLinkedCubinSize = GetProcAddress(handle, 'nvJitLinkGetLinkedCubinSize')
    global __nvJitLinkGetLinkedCubin
    __nvJitLinkGetLinkedCubin = GetProcAddress(handle, 'nvJitLinkGetLinkedCubin')
    global __nvJitLinkGetLinkedPtxSize
    __nvJitLinkGetLinkedPtxSize = GetProcAddress(handle, 'nvJitLinkGetLinkedPtxSize')
    global __nvJitLinkGetLinkedPtx
    __nvJitLinkGetLinkedPtx = GetProcAddress(handle, 'nvJitLinkGetLinkedPtx')
    global __nvJitLinkGetErrorLogSize
    __nvJitLinkGetErrorLogSize = GetProcAddress(handle, 'nvJitLinkGetErrorLogSize')
    global __nvJitLinkGetErrorLog
    __nvJitLinkGetErrorLog = GetProcAddress(handle, 'nvJitLinkGetErrorLog')
    global __nvJitLinkGetInfoLogSize
    __nvJitLinkGetInfoLogSize = GetProcAddress(handle, 'nvJitLinkGetInfoLogSize')
    global __nvJitLinkGetInfoLog
    __nvJitLinkGetInfoLog = GetProcAddress(handle, 'nvJitLinkGetInfoLog')
    global __nvJitLinkVersion
    __nvJitLinkVersion = GetProcAddress(handle, 'nvJitLinkVersion')
    __py_nvjitlink_init = True
    return 0

I need to find out: What happens if we trigger unloading a library while we still have the addresses where the symbols were loaded into memory?

Collaborator:
It only runs when the process is finalizing.

I don't think that's true.

You're right, sorry.

Actually, no, based on this quick experiment it is true:

  • I checked out your leave-dl-dangling branch (this PR)

  • I added this one line:

--- a/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
+++ b/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
@@ -47,6 +47,7 @@ RTLD_DI_ORIGIN = 6


 def unload_dl(handle: ctypes.c_void_p) -> None:
+    print(f"\nLOOOK unload_dl({handle=!r})", flush=True)
     result = LIBDL.dlclose(handle)
     if result:
         raise RuntimeError(LIBDL.dlerror())
  • Then I ran the cuda_bindings unit tests:
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $ pytest -ra -s -v tests/
====================================================================================================== test session starts =======================================================================================================
platform linux -- Python 3.12.3, pytest-8.4.2, pluggy-1.6.0 -- /home/rgrossekunst/forked/cuda-python/cuda_pathfinder/WslLocalCudaVenv/bin/python3
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/rgrossekunst/forked/cuda-python/cuda_bindings/tests
configfile: pytest.ini
plugins: benchmark-5.1.0
collected 192 items / 1 skipped

tests/test_cuda.py::test_cuda_memcpy PASSED
tests/test_cuda.py::test_cuda_array PASSED
tests/test_cuda.py::test_cuda_repr_primitive PASSED
<... snip ...>
tests/test_cudart.py::test_getLocalRuntimeVersion PASSED
<... snip ...>
tests/test_utils.py::test_cyclical_imports[nvvm] PASSED
tests/test_utils.py::test_cyclical_imports[runtime] PASSED
tests/test_utils.py::test_cyclical_imports[cufile] PASSED

==================================================================================================== short test summary info =====================================================================================================
SKIPPED [1] tests/test_cufile.py:40: skipping cuFile tests on WSL
================================================================================================= 192 passed, 1 skipped in 7.03s =================================================================================================

LOOOK unload_dl(handle=c_void_p(791712928))

LOOOK unload_dl(handle=c_void_p(781336800))

LOOOK unload_dl(handle=c_void_p(732155104))

LOOOK unload_dl(handle=c_void_p(731635248))
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $

Because of functools.cache, the finalizers only run in the process tear-down, when the handles are sure to disappear anyway.

Member:
I am very nervous about the finalizer solution because, as per Phillip's investigation last week, we will call the C APIs to do cleanup, but aren't the finalizers called before that step, leaving us with dangling pointers?

We have been leaking DSO handles on purpose since pretty much the beginning of this project (~2021) and no one complained, in particular with respect to our handling of the driver (libcuda). Why is this leaking suddenly not acceptable here?

Contributor Author:
... aren't the finalizers called before that step and leaving us dangling pointers?

Is it possible that the void* function pointers in various places get called after the library is closed, which happens at the module level? This would mean that the function (literally, the function object corresponding to the thing created by def; this matters because that's where the cache is stored) gets torn down and the finalizer is called on LoadedDL before one of those functions is invoked.

I can imagine the following scenario:

lib = load_nvidia_dynamic_lib("nvjitlink")
foo = create_something_that_eventually_calls_a_cleanup_routine() # note the possible implicit call to `load_nvidia_dynamic_lib`

# ... many lines of code later
#
# Python shuts down
# Python calls the finalizer for `lib`
# Python GC's `foo` which needs an open handle corresponding to `lib`'s address

So, we are probably always going to be constrained by this if we support code paths that can call globally-defined-and-expected-to-be-valid-pointers at any point in a program's lifetime.

I just spoke with @rwgk in person and we agreed that leaking the handles is preferable for now. I will revert the most recent commit, leave the comment, and then we can merge this fix.

Contributor Author:
So, as far as any of us knows, the answer is: "we don't know, but we know it might become necessary to call these functions at any point, so we leak the handles to avoid a worse outcome."

return library