-
Notifications
You must be signed in to change notification settings - Fork 54
Fix Invalid NVVM IR emitted when lowering shfl_sync APIs #231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
gmarkall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for moving this PR forward and fixing the correctness issues in it.
When I tried to follow through the implementation to understand what's going on and correlate it with the NVVM IR and CUDA C/C++ programming documentation, I found it very hard to follow what was going on, because the implementation diverges from both in a lot of small ways (because it always did in Numba, not as a result of this PR).
There was also some redundant work with creating structs and extracting values from them that was not quite redundant in the old implementation. Now that the implementation is all combined into a single intrinsic instead of the old way of having typing, lowering, a stub, and jitted functions, the extra packing and unpacking is un-needed.
To get to a point where I could follow through what was going on I found I had to iterate through modifying the code to align it with the docs and C/C++ implementation, and get rid of the extra packing and unpacking - this is in gmarkall@931f865
A couple of other notes from whilst I was doing that:
- I kept making mistakes that involved
test_shfl_sync_typestest failing, so it was helpful to use subTests for it: gmarkall@de76f69 - The docstrings for the intrinsics were missing, so they wouldn't have been visible in an IDE or shown up in the documentation.
I suspect we will also have a similar issue for the vote sync intrinsic, which is documented as requiring a constant mode as well: https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#vote - we should get this PR in and then repeat the process for that intrinsic (and maybe check if there are other intrinsics we use where we make this mistake too).
| compiled[1, nelem](ary, xor) | ||
| self.assertTrue(np.all(ary == exp)) | ||
|
|
||
| def test_shfl_sync_const_mode_val(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also have a variant of this test that uses compile_ptx() with cc=(10, 0) to ensure that all is well when going through the path that uses LLVM 18.x IR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This consists of a number of changes: - Add notes on implementation and links to docs - Map public API arguments to those referred to in the NVVM IR docs - Restore docstrings for intrinsics - Simplify some of the intrinsic logic - there is no need to construct aggregates containing the full returned structure from the intrinsic only to extract the computed value just prior to returning from the intrinsic.
Can you elaborate? The code is directly documented here: https://github.com/NVIDIA/numba-cuda/blob/main/docs/source/reference/kernel.rst#warp-intrinsics. It should show up when the API is accessible from |
| typingctx, | ||
| membermask_type, | ||
| mode_value, | ||
| a_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a slight confusion when reading the code, in that a_type is both passed in from typing and extracted from lowering arg parameter. Later to realize that the first a_type is used for typing and the second for lowering. And they are executed in different times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - there's no need to get it again from the signature, so I think that definition later of it can be removed - see below. What do you think?
| membermask, a, b = args | ||
|
|
||
| # Types | ||
| a_type = sig.args[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need this, as per https://github.com/NVIDIA/numba-cuda/pull/231/files#r2070337358:
| a_type = sig.args[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further to our discussion earlier where you suggested not doing this so that it's not captured from the outer function - I'll not commit this suggestion, and merge the PR as-is.
|
|
||
| for func in subtest: | ||
| with self.subTest(func=func.__name__): | ||
| compile_ptx(func, (int32[:], int32), cc=(10, 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised this passed on CI on toolkits that pre-date CC 10.0. I'm looking into what's going on here now, but do you know why this didn't fail on older toolkits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I now recall it's a long-standing behaviour that if we ask for a higher CC than supported, we just use the highest supported CC (because PTX for an older CC can be compiled for hardware of a newer CC).
numba-cuda/numba_cuda/numba/cuda/cudadrv/nvvm.py
Lines 478 to 479 in 9c727b0
| # CC higher than supported | |
| return supported_ccs[-1] # Choose the highest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok TIL.
gmarkall
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy with this if you are - whether you want to remove the second assignment of a_type marked on the diff or to proceed with it as-is.
- Fix Invalid NVVM IR emitted when lowering shfl_sync APIs (NVIDIA#231) - Add Bfloat16 Low++ Bindings (NVIDIA#166) - Fix cuda.jit decorator inline (NVIDIA#181) - Feature: cuda specific make_attribute_wrapper (NVIDIA#193) - return a none tuple if no libdevice path is found (NVIDIA#234)
- Local variable debug info deduplication (NVIDIA#222) - Fix package installation for wheels CI (NVIDIA#238) - Fix Invalid NVVM IR emitted when lowering shfl_sync APIs (NVIDIA#231) - Add Bfloat16 Low++ Bindings (NVIDIA#166) - Fix cuda.jit decorator inline (NVIDIA#181) - Feature: cuda specific make_attribute_wrapper (NVIDIA#193) - return a none tuple if no libdevice path is found (NVIDIA#234)
- Local variable debug info deduplication (#222) - Fix package installation for wheels CI (#238) - Fix Invalid NVVM IR emitted when lowering shfl_sync APIs (#231) - Add Bfloat16 Low++ Bindings (#166) - Fix cuda.jit decorator inline (#181) - Feature: cuda specific make_attribute_wrapper (#193) - return a none tuple if no libdevice path is found (#234)
Per nvvm documentation, shuffle APIs takes a IR constant for the mode parameter. In current Numba implementation, it is a variable. This could crash the NVVM lowering when the ConstantFolds are not applied to IR (-opt=3 e.g.). This PR fixes #228