Update spec to make kernel validation optional #2564

RossBrunton · 2025-01-15T14:14:30Z

Several adapters don't support validating kernel signatures when
enqueued. To handle this, we now allow urEnqueueKernelLaunch to return
SUCCESS even when parameters are invalid.

frasercrmck · 2025-02-03T10:28:34Z

source/adapters/cuda/enqueue.cpp

+          ThreadsPerBlock[0], ThreadsPerBlock[1], ThreadsPerBlock[2], LocalSize,
+          CuStream, const_cast<void **>(ArgPointers.data()), nullptr));
+    } catch (ur_result_t Err) {
+      // Cuda returns UR_RESULT_ERROR_INVALID_VALUE if the args are incorrect


This may be true, but it's not only if the args are incorrect. I suspect cuLaunchKernel can return CUDA_ERROR_INVALID_VALUE -> UR_RESULT_ERROR_INVALID_VALUE for many reasons (the docs don't say, obviously). Would that be a problem, if we return UR_RESULT_ERROR_INVALID_KERNEL_ARGS mistakenly in some cases?

Also just as a matter of style, should we say that "Cuda" returns a UR error code? Could that mislead people? It's really UR_CHECK_ERROR that's translating CUDA_ERROR_INVALID_VALUE to UR_RESULT_ERROR_INVALID_VALUE.
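To make the concern above concrete, here is a minimal sketch of the translation chain. Every name in it (`MockCuResult`, `MockUrResult`, `LaunchFault`, `mockLaunch`, `translate`, `remapLaunchError`, `reportedResult`) is a hypothetical stand-in, not the adapter's code, and it assumes, as the thread suggests, that both wrong kernel arguments and a bad block shape surface as the same CUDA error.

```cpp
// Hypothetical stand-ins for the CUDA and UR result enums (illustrative only).
enum class MockCuResult { Success, InvalidValue };
enum class MockUrResult { Success, InvalidValue, InvalidKernelArgs };

// Possible reasons a launch might be rejected (assumed for this sketch).
enum class LaunchFault { None, WrongKernelArgs, BadBlockDims };

// Stand-in for the driver call: both faults yield the same error code.
MockCuResult mockLaunch(LaunchFault Fault) {
  return Fault == LaunchFault::None ? MockCuResult::Success
                                    : MockCuResult::InvalidValue;
}

// Stand-in for the translation step attributed to UR_CHECK_ERROR.
// The catch block in the diff suggests the real macro throws the
// translated code; this sketch simply returns it.
MockUrResult translate(MockCuResult Cu) {
  switch (Cu) {
  case MockCuResult::Success:
    return MockUrResult::Success;
  case MockCuResult::InvalidValue:
    return MockUrResult::InvalidValue;
  }
  return MockUrResult::InvalidValue;
}

// The remapping under review: every INVALID_VALUE from the launch is
// reported as INVALID_KERNEL_ARGS, whatever actually caused it.
MockUrResult remapLaunchError(MockUrResult Ur) {
  return Ur == MockUrResult::InvalidValue ? MockUrResult::InvalidKernelArgs
                                          : Ur;
}

// What the caller would finally see for a given fault.
MockUrResult reportedResult(LaunchFault Fault) {
  return remapLaunchError(translate(mockLaunch(Fault)));
}
```

In this sketch a bad block shape is reported as invalid kernel arguments, which is the false positive being asked about.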

I'm not sure there's a better option; INVALID_VALUE isn't a correct return for the test, since all the arguments to enqueueKernelLaunch are valid. It's just the arguments passed to the kernel itself that are wrong.

Should I leave it like it is in this MR (and issues can be patched when they are found), or revert this change and mark the test as XFAIL until a better solution is found?

I've searched the internet a bit for cuLaunchKernel and INVALID_VALUE, and all the results I've seen are because people are compiling for a different architecture than the one they're running on ([1]), or they're launching with incorrect threads/dimensions ([2], [3]), or even CUDA context mismanagement ([4]). I've not seen anything saying that this error is returned when arguments are invalid.

There's also [5] which states:

The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).

Of course, some of the situations I've encountered above may well never occur in the corresponding UR usage, because they'd be caught by other UR APIs.

But still my gut is telling me we're going to report false positives, which is less useful than just accepting the CUDA API's weaknesses. If we can't be sure, I'd lean towards keeping it a known test failure (XFAIL).

I'd welcome @npmiller's thoughts on this.

[1], [2], [3], [4], [5]

Oh, that's a tough one. In practice I suspect that error code mostly comes from [1], but we can ensure, and likely already do, that the sizes are correct in the adapter before calling cuLaunchKernel. It also still goes through UR_CHECK_ERROR, so the actual error code information still goes in the logger no matter what code we actually return.

That being said, I think I also lean towards just keeping the error code as-is and leaving the test XFAIL. Ultimately there are still some cases where this error code might get returned that aren't about the kernel arguments, and that would be very confusing. Since this is optional now, I think it makes more sense to just not support it for the CUDA adapter, as we can't support it properly.
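As an illustration of checking sizes in the adapter before calling cuLaunchKernel, here is a minimal sketch. `MockDeviceLimits`, `exampleLimits`, and `blockShapeIsValid` are hypothetical names, and the limit values are illustrative assumptions; a real adapter would query the limits from the device rather than hard-code them.

```cpp
#include <array>
#include <cstddef>

// Hypothetical per-device limits (assumed to be queried elsewhere).
struct MockDeviceLimits {
  std::array<std::size_t, 3> MaxBlockDim;
  std::size_t MaxThreadsPerBlock;
};

// Illustrative values only; not taken from any particular device.
MockDeviceLimits exampleLimits() {
  return MockDeviceLimits{{1024, 1024, 64}, 1024};
}

// Returns true when no dimension is zero, each dimension is within its
// per-axis limit, and the total thread count fits in one block.
bool blockShapeIsValid(const std::array<std::size_t, 3> &ThreadsPerBlock,
                       const MockDeviceLimits &Limits) {
  std::size_t Total = 1;
  for (std::size_t I = 0; I < 3; ++I) {
    if (ThreadsPerBlock[I] == 0 || ThreadsPerBlock[I] > Limits.MaxBlockDim[I])
      return false;
    Total *= ThreadsPerBlock[I];
  }
  return Total <= Limits.MaxThreadsPerBlock;
}
```

A check like this can rule out one known cause of the error before the launch, but it does not make the remaining INVALID_VALUE cases unambiguous, which is the point made above.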

I'll make it XFAIL for the moment, and it can be fixed properly in a separate task.

Several adapters don't support validating kernel signatures when enqueued. To handle this, we now allow urEnqueueKernelLaunch to return `SUCCESS` even when parameters are invalid. Some tests have also been updated. The CUDA adapter has also been updated to handle invalid arguments.
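A rough sketch of what this could mean for a test that launches a kernel with invalid arguments: the check accepts either outcome rather than requiring an error. `MockUrResult` and `invalidArgsLaunchResultIsAcceptable` are hypothetical names, and the exact set of accepted codes is an assumption based on the discussion in this thread, not a quote from the spec or the test suite.

```cpp
// Hypothetical stand-in for the UR result enum (illustrative only).
enum class MockUrResult { Success, InvalidValue, InvalidKernelArgs };

// With validation optional, an adapter that cannot validate may report
// success, while one that can validate reports invalid kernel arguments.
// Other codes, such as INVALID_VALUE, are treated as wrong for this case
// because the arguments to the enqueue call itself are valid.
bool invalidArgsLaunchResultIsAcceptable(MockUrResult Result) {
  return Result == MockUrResult::Success ||
         Result == MockUrResult::InvalidKernelArgs;
}
```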

RossBrunton force-pushed the ross/argval branch from b58991e to e933dfd Compare January 15, 2025 14:14

github-actions bot added the loader, conformance, and specification labels Jan 15, 2025

RossBrunton mentioned this pull request Jan 15, 2025

Look into and reconsider CTS tests with high fail rate across adapters. #2514

Open

RossBrunton force-pushed the ross/argval branch 4 times, most recently from 7c1118b to f249b9b Compare January 22, 2025 17:19

github-actions bot added the cuda label Jan 22, 2025

RossBrunton force-pushed the ross/argval branch 3 times, most recently from 25f458d to 7affcd2 Compare January 31, 2025 12:38

RossBrunton changed the title ~~RFC: Update spec to make kernel validation optional~~ Update spec to make kernel validation optional Jan 31, 2025

RossBrunton marked this pull request as ready for review January 31, 2025 15:50

RossBrunton requested review from a team as code owners January 31, 2025 15:50

RossBrunton requested a review from frasercrmck January 31, 2025 15:50

frasercrmck reviewed Feb 3, 2025

RossBrunton force-pushed the ross/argval branch from 7affcd2 to 38d12a5 Compare February 12, 2025 11:54

RossBrunton force-pushed the ross/argval branch from 38d12a5 to 9f6415d Compare February 19, 2025 11:59

RossBrunton mentioned this pull request Feb 19, 2025

[UR] Update spec to make kernel argument validation in urEnqueueKernelLaunch optional intel/llvm#17068

Closed

RossBrunton closed this Feb 19, 2025

RossBrunton commented Jan 15, 2025

frasercrmck Feb 3, 2025

frasercrmck Feb 3, 2025

RossBrunton Feb 12, 2025

frasercrmck Feb 17, 2025 (edited)

npmiller Feb 17, 2025

RossBrunton Feb 18, 2025
