Skip to content

Conversation

@isVoid
Copy link
Contributor

@isVoid isVoid commented Apr 22, 2025

Per nvvm documentation, shuffle APIs takes a IR constant for the mode parameter. In current Numba implementation, it is a variable. This could crash the NVVM lowering when the ConstantFolds are not applied to IR (-opt=3 e.g.). This PR fixes #228

@isVoid isVoid changed the title Fix Invalid NVVM IR emitted when lowering shfl_sync APIs #230 Fix Invalid NVVM IR emitted when lowering shfl_sync APIs Apr 22, 2025
@isVoid isVoid marked this pull request as ready for review April 22, 2025 20:53
@isVoid isVoid requested a review from gmarkall April 22, 2025 21:05
Copy link
Contributor

@gmarkall gmarkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving this PR forward and fixing the correctness issues in it.

When I tried to follow through the implementation to understand what's going on and correlate it with the NVVM IR and CUDA C/C++ programming documentation, I found it very hard to follow what was going on, because the implementation diverges from both in a lot of small ways (because it always did in Numba, not as a result of this PR).

There was also some redundant work with creating structs and extracting values from them that was not quite redundant in the old implementation. Now that the implementation is all combined into a single intrinsic instead of the old way of having typing, lowering, a stub, and jitted functions, the extra packing and unpacking is un-needed.

To get to a point where I could follow through what was going on I found I had to iterate through modifying the code to align it with the docs and C/C++ implementation, and get rid of the extra packing and unpacking - this is in gmarkall@931f865

A couple of other notes from whilst I was doing that:

  • I kept making mistakes that involved test_shfl_sync_types test failing, so it was helpful to use subTests for it: gmarkall@de76f69
  • The docstrings for the intrinsics were missing, so they wouldn't have been visible in an IDE or shown up in the documentation.

I suspect we will also have a similar issue for the vote sync intrinsic, which is documented as requiring a constant mode as well: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#vote - we should get this PR in and then repeat the process for that intrinsic (and maybe check if there are other intrinsics we use where we make this mistake too).

compiled[1, nelem](ary, xor)
self.assertTrue(np.all(ary == exp))

def test_shfl_sync_const_mode_val(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also have a variant of this test that uses compile_ptx() with cc=(10, 0) to ensure that all is well when going through the path that uses LLVM 18.x IR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gmarkall gmarkall added the 4 - Waiting on author Waiting for author to respond to review label Apr 23, 2025
This consists of a number of changes:

- Add notes on implementation and links to docs
- Map public API arguments to those referred to in the NVVM IR docs
- Restore docstrings for intrinsics
- Simplify some of the intrinsic logic - there is no need to construct
  aggregates containing the full returned structure from the intrinsic
  only to extract the computed value just prior to returning from the
  intrinsic.
@isVoid
Copy link
Contributor Author

isVoid commented Apr 23, 2025

The docstrings for the intrinsics were missing, so they wouldn't have been visible in an IDE or shown up in the documentation.

Can you elaborate? The code is directly documented here: https://github.com/NVIDIA/numba-cuda/blob/main/docs/source/reference/kernel.rst#warp-intrinsics. It should show up when the API is accessible from numba.cuda?

typingctx,
membermask_type,
mode_value,
a_type,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a slight confusion when reading the code, in that a_type is both passed in from typing and extracted from lowering arg parameter. Later to realize that the first a_type is used for typing and the second for lowering. And they are executed in different times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - there's no need to get it again from the signature, so I think that definition later of it can be removed - see below. What do you think?

@isVoid isVoid requested a review from gmarkall April 23, 2025 19:27
@gmarkall gmarkall added 4 - Waiting on reviewer Waiting for reviewer to respond to author and removed 4 - Waiting on author Waiting for author to respond to review labels Apr 30, 2025
membermask, a, b = args

# Types
a_type = sig.args[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need this, as per https://github.com/NVIDIA/numba-cuda/pull/231/files#r2070337358:

Suggested change
a_type = sig.args[1]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further to our discussion earlier where you suggested not doing this so that it's not captured from the outer function - I'll not commit this suggestion, and merge the PR as-is.


for func in subtest:
with self.subTest(func=func.__name__):
compile_ptx(func, (int32[:], int32), cc=(10, 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised this passed on CI on toolkits that pre-date CC 10.0. I'm looking into what's going on here now, but do you know why this didn't fail on older toolkits?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I now recall it's a long-standing behaviour that if we ask for a higher CC than supported, we just use the highest supported CC (because PTX for an older CC can be compiled for hardware of a newer CC).

# CC higher than supported
return supported_ccs[-1] # Choose the highest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok TIL.

Copy link
Contributor

@gmarkall gmarkall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this if you are - whether you want to remove the second assignment of a_type marked on the diff or to proceed with it as-is.

@gmarkall gmarkall added 4 - Waiting on author Waiting for author to respond to review 5 - Ready to merge Testing and reviews complete, ready to merge and removed 4 - Waiting on reviewer Waiting for reviewer to respond to author 4 - Waiting on author Waiting for author to respond to review labels May 1, 2025
@gmarkall gmarkall merged commit 02df550 into NVIDIA:main May 2, 2025
37 checks passed
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request May 2, 2025
- Fix Invalid NVVM IR emitted when lowering shfl_sync APIs (NVIDIA#231)
- Add Bfloat16 Low++ Bindings (NVIDIA#166)
- Fix cuda.jit decorator inline (NVIDIA#181)
- Feature: cuda specific make_attribute_wrapper (NVIDIA#193)
- return a none tuple if no libdevice path is found (NVIDIA#234)
@gmarkall gmarkall mentioned this pull request May 2, 2025
gmarkall added a commit to gmarkall/numba-cuda that referenced this pull request May 3, 2025
- Local variable debug info deduplication (NVIDIA#222)
- Fix package installation for wheels CI  (NVIDIA#238)
- Fix Invalid NVVM IR emitted when lowering shfl_sync APIs (NVIDIA#231)
- Add Bfloat16 Low++ Bindings (NVIDIA#166)
- Fix cuda.jit decorator inline (NVIDIA#181)
- Feature: cuda specific make_attribute_wrapper (NVIDIA#193)
- return a none tuple if no libdevice path is found (NVIDIA#234)
@gmarkall gmarkall mentioned this pull request May 3, 2025
gmarkall added a commit that referenced this pull request May 3, 2025
- Local variable debug info deduplication (#222)
- Fix package installation for wheels CI  (#238)
- Fix Invalid NVVM IR emitted when lowering shfl_sync APIs (#231)
- Add Bfloat16 Low++ Bindings (#166)
- Fix cuda.jit decorator inline (#181)
- Feature: cuda specific make_attribute_wrapper (#193)
- return a none tuple if no libdevice path is found (#234)
gmarkall pushed a commit that referenced this pull request Nov 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5 - Ready to merge Testing and reviews complete, ready to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Warp operations (shfl_sync, etc.) must use a constant int for the mode

2 participants