Skip to content

Support consuming NRT functions from external libraries#167

Merged
gmarkall merged 44 commits intoNVIDIA:mainfrom
brandon-b-miller:support-cudf-strings
May 8, 2025
Merged

Support consuming NRT functions from external libraries#167
gmarkall merged 44 commits intoNVIDIA:mainfrom
brandon-b-miller:support-cudf-strings

Conversation

@brandon-b-miller
Copy link
Contributor

@brandon-b-miller brandon-b-miller commented Mar 19, 2025

This PR moves some NRT function declarations to a header that can be found at build time by external libraries. It also fixes a bug where NRT wasn't linked if its declarations were present in the passed user source, but not the half of the PTX numba generates itself.

@gmarkall gmarkall added the 2 - In Progress Currently a work in progress label Mar 19, 2025

def maybe_link_nrt(self, link, tgt_ctx, asm):
if not tgt_ctx.enable_nrt:
if not tgt_ctx.enable_nrt or self.nrt is False:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want the nrt kwarg for the JIT decorator to take precedence over the global NRT setting for enabling it? I think this will never link the NRT if it is globally disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. I think the best behavior might be

  • kwarg takes precedence, allowing users definitive control over if the NRT code should be added to the link for a particular kernel
  • global flag enables the detection mechanism

But I'm not sure if numba's code generation will actually provide increfs/decrefs without enable_nrt set to true in the target context, is that correct? In that case we might be adding the refcounting functions to the link but not actually calling them.

The motivation for this kwarg is twofold: Where it came up was an edge case I observed in cuDF where an NRT managed object is created and immediately returned, such as in lambda string: return string. In this case it works out that NRT functions are used, but they only appear in the separately provided PTX file. The ASM that numba inspects when deciding to link NRT or not contains only the declarations for the parent functions that eventually call the NRT ones, so this defeats numba's NRT detection.

Overall I feel that unless the NRT detection is absolutely definitive somehow it might be more reliable to let the user decide when to link NRT. This has the drawback of creating two switches, one to enable the possiblity of NRT, and then a second one to actually add it to the link. But even that I think its a little better than somewhat shaky auto detection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ASM that numba inspects when deciding to link NRT or not contains only the declarations for the parent functions that eventually call the NRT ones, so this defeats numba's NRT detection.

Are you saying that numba-cuda doesn't link the NRT when it should? If so, then I would have seen that as a bug in numba-cuda?

Or, if it adds NRT to the asm when it's not necessary, we should be running the refcount pruning pass to elide those unnecessary NRT calls.

Is there a third situation I'm missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re you saying that numba-cuda doesn't link the NRT when it should? If so, then I would have seen that as a bug in numba-cuda?

Yes, this is the case. It is the PTX file that is provided by cuDF that is not checked. The PTX from the numba side is checked, but doesn't contain the actual NRT calls - it only contains a call to a shim function that uses NRT functions within its implementation.

Initially I felt that since cuDF was somewhat burying the NRT calls maybe it should also be the one to signal to numba that the NRT implementations are also needed (or the same for any consuming library that does the same thing).

But I also see that it could be considered a bug in numba-cuda that all the input PTX files are not checked for NRT functions. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we support using the NRT functions from external code, then ideally we should be checking for calls to NRT functions in external code rather than asking the user to do additional work - particularly if it requires signalling in the @cuda.jit decorator, because the person writing the jitted function can often be someone different to the person who implemented another library.

We've seen this problem with the link and extensions kwargs already, where libraries want to use these features, but then have to ask the users to remember to use them with no explicit connection between use of the library and the requirement to add the kwargs.

Copy link
Contributor Author

@brandon-b-miller brandon-b-miller Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the changes in 0b23a09 and dd568e7 ? Here we drop the kwarg and scan the ptx files at the expense of a double read of the same file, something theres probably an easy solution for.

Copy link
Contributor

@isVoid isVoid Apr 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for being late to the discussion.

particularly if it requires signalling in the @cuda.jit decorator, because the person writing the jitted function can often be someone different to the person who implemented another library.

If it's the library dev duty to specify nrt arg, how about adding it to the LinkableCode interface? The downside is that this cannot be specified if user would want to directly use a path to the link argument. Perhaps we can make that an unsupported case or simply deprecate passing in a raw path (and reintroduce it as a Path subclass to LinkableCode).

What let me to think of this is that 0b23a09 NRT discovery mechanism may not be able to handle binary format input such as LTOIR (disassembly could work but I think that's a too much price to pay).

I increasingly see that Numba as a compiler need to separate the duty from lib dev and lib user (kernel writer). And LinkableCode interface can be the perfect vehicle to separate those needs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's the library dev duty to specify nrt arg, how about adding it to the LinkableCode interface?

I think this sounds like a good idea.

The downside is that this cannot be specified if user would want to directly use a path to the link argument.

I think that's not too significant a downside. I see the provision of a path to link as a bit of a legacy anyway, because it's not very convenient - it's just what happened to be implemented first, a long time ago, perhaps before the API design had a lot of consideration.

Perhaps we can make that an unsupported case or simply deprecate passing in a raw path (and reintroduce it as a Path subclass to LinkableCode).

I think introducing a Path subclass might not fit exactly right, because the linker needs to know what type of object is being passed in. A helper function that determines the type based on the extension, reads it in, and then returns the appropriate LinkableCode subclass instance would probably be OK though. (Perhaps a class method LinkableCode.from_path(), maybe?).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Perhaps a class method LinkableCode.from_path(), maybe?).

I think that's a good design.

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review and removed 2 - In Progress Currently a work in progress labels Mar 19, 2025
@gmarkall
Copy link
Contributor

Would this be better broken into two PRs? One for making the NRT functions available externally, and one for the kwarg?

@gmarkall
Copy link
Contributor

Also, if we're going to make nrt.cuh available for external users, I think we need a documentation section explaining it and outlining what we consider the public API and expected behaviour to be.

@brandon-b-miller
Copy link
Contributor Author

Also, if we're going to make nrt.cuh available for external users, I think we need a documentation section explaining it and outlining what we consider the public API and expected behaviour to be.

I agree. I think that given that the numpy API is available now we're getting to the point where a doc update is needed to describe the available NRT features. What would you say to giving that section of the docs a broad overhaul, either as part of this PR or separately, including what nrt.cuh is for and how to use it?

@brandon-b-miller brandon-b-miller added 3 - Ready for Review Ready for review by team 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Mar 31, 2025
link_nrt = nrt_in_asm(asm)
if not link_nrt:
for file in link:
if file.endswith('ptx'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, there can also be linkable_code.PTXSource in the input. Could there be input in other format that also contains NRT calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a reasonable question. This would only currently catch the case where the user supplied a PTX file that utilized NRT functions. I suppose it could also occur with a .cu file, which it shouldn't be hard to update things for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should just be {".cu", ".ptx"} since that's all we accept in link=

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@gmarkall gmarkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of add_data() (#235) should probably be fixed within this PR in ctypes and cuda-python linkers, because without that fix the additional tests fail, e.g.:

======================================================================
ERROR: test_nrt_detect_linkable_code (numba.cuda.tests.nrt.test_nrt.TestNrtLinking.test_nrt_detect_linkable_code) (code=<numba.cuda.cudadrv.linkable_code.Cubin object at 0x706fbc9763d0>)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/tests/nrt/test_nrt.py", line 204, in test_nrt_detect_linkable_code
    kernel[1, 1]()
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/dispatcher.py", line 737, in __call__
    return self.dispatcher.call(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/dispatcher.py", line 883, in call
    kernel = _dispatcher.Dispatcher._cuda_call(self, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/dispatcher.py", line 891, in _compile_for_args
    return self.compile(tuple(argtypes))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/dispatcher.py", line 1159, in compile
    kernel.bind()
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/dispatcher.py", line 375, in bind
    cufunc = self._codelibrary.get_cufunc()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/codegen.py", line 260, in get_cufunc
    cubin = self.get_cubin(cc=device.compute_capability)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/codegen.py", line 237, in get_cubin
    self._link_all(linker, cc, ignore_nonlto=False)
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/codegen.py", line 204, in _link_all
    linker.add_file_guess_ext(path, ignore_nonlto)
  File "/home/gmarkall/numbadev/numba-cuda/numba_cuda/numba/cuda/cudadrv/driver.py", line 2886, in add_file_guess_ext
    self.add_data(
    ^^^^^^^^^^^^^
AttributeError: 'CtypesLinker' object has no attribute 'add_data'

I haven't looked into why this is not also failing on CI, but I think it ought to also be checked on CI.

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review and removed 4 - Waiting on reviewer Waiting for reviewer to respond to author labels Apr 30, 2025
@gmarkall
Copy link
Contributor

gmarkall commented May 3, 2025

I just merged main so this branch can pick up the recent CI fixes.

detect .extern .func decls of NRT functions in any file in the link or
any of the passed LinkableCode objects have NRT enabled. In the special
case of PTXSource or CUSource objects, the source code will be inspected
even if the NRT flag is not set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we inspecting the source of PTX and CUDA C/C++ source files? Shouldn't the user be expected to explicitly state that they require NRT by setting the LinkableCode nrt property? There's a couple of reasons I think it would be better to require this:

  • I think a while back we concluded it would be hard to perfectly determine whether arbitrary linkable code used the NRT (potential false positives and false negatives), so we shouldn't try.
  • A user might pass in an object as PTX or CUDA source, and rely on NRT getting linked in automatically. If they then decide to switch to LTO-IR for better performance, the automatic linking will no longer work, and their kernels will appear to suddenly have unresolved NRT symbols.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am quite +1 on requiring the user the be explicit about when they need NRT and less autodetection. Does the following seem right then? We'd link NRT if:

  • it is globally enabled

And either:

  • NRT is autodetected in the main asm (this is the only thing we try and autodetect NRT symbols in)
  • a LinkableCode object with NRT specifically enabled is passed

If a user passes a .cu or .ptx file (in any form, either as a path or as a LinkableCode instance) without NRT explicitly set, it may fail to link if the code does contain NRT symbols.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that sounds good. I'm trying to think of a way to remove the autodetection even for internal code, too (similar to how #240 should be able to remove it for CGs) but it's a little trickier with it coming from Numba core as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be addressed in ff12e16

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is addressed from a technical standpoint (modulo the minor comment below) but the docstring here is now out of date, suggesting that PTX and CU sources will be inspected.

@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels May 3, 2025
@brandon-b-miller brandon-b-miller requested a review from gmarkall May 6, 2025 12:15
brandon-b-miller and others added 2 commits May 6, 2025 07:16
Co-authored-by: Graham Markall <535640+gmarkall@users.noreply.github.com>
Copy link
Contributor

@gmarkall gmarkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is generally looking good - just a couple of minor comments on the diff.

I tried merging in #241 locally and it got the tests passing for me.

Once #241 is in and the comments are addressed, I think this will be good to merge.

@gmarkall gmarkall merged commit bbfce31 into NVIDIA:main May 8, 2025
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4 - Waiting on reviewer Waiting for reviewer to respond to author

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants