-
Notifications
You must be signed in to change notification settings - Fork 124
Update spec to make kernel validation optional #2564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b58991e
to
e933dfd
Compare
7c1118b
to
f249b9b
Compare
25f458d
to
7affcd2
Compare
source/adapters/cuda/enqueue.cpp
Outdated
ThreadsPerBlock[0], ThreadsPerBlock[1], ThreadsPerBlock[2], LocalSize, | ||
CuStream, const_cast<void **>(ArgPointers.data()), nullptr)); | ||
} catch (ur_result_t Err) { | ||
// Cuda returns UR_RESULT_ERROR_INVALID_VALUE if the args are incorrect |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be true, but it's not only if the args are incorrect. I suspect cuLaunchKernel
can return CUDA_ERROR_INVALID_VALUE
-> UR_RESULT_ERROR_INVALID_VALUE
for many reasons (the docs don't say, obviously). Would that be a problem, if we return UR_RESULT_ERROR_INVALID_KERNEL_ARGS
mistakenly in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also just as a matter of style, should we say that "Cuda" returns a UR error code? Could that mislead people? Its' really UR_CHECK_ERROR
that's translating CUDA_ERROR_INVALID_VALUE
to UR_RESULT_ERROR_INVALID_VALUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure there's a better option; INVALID_VALUE
isn't a correct return for the test (since all the arguments to enqueueKernelLaunch
are valid - It's just the arguments passed to the kernel itself that are wrong).
Should I leave it like it is in this MR (and issues can be patched when they are found), or revert this change and mark the test as XFAIL until a better solution is found?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've searched the internet a bit for cuLaunchKernel
and INVALID_VALUE
and all results I've seen are because people are compiling for a different architecture than the one they're running on ([1]) or they're launching with incorrect threads/dimensions ([2], [3]) or even CUDA context mismanagement ([4]). I've not seen anything saying that this error is returning when arguments are invalid.
There's also [5] which states:
The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).
Of course, some of the situations I've encountered above may well never occur in the corresponding UR usage, because they'd be caught by other UR APIs.
But still my gut is telling me we're going to report false positives, which is less useful than just accepting the CUDA API's weaknesses. If we can't be sure, I'd lean towards keeping it a known test failure (XFAIL).
I'd take @npmiller's thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's a tough one, in practice I suspect that error code mostly comes from [1], but we can, and likely already are, ensuring the sizes are correct in the adapter before calling cuLaunchKernel
, and it still goes through UR_CHECK_ERROR
so the actual error code information still goes in the logger no matter what code we actually return.
That being said I think I also lean towards just keeping the error code as-is and leaving the test XFAIL, ultimately there are still some cases where this error code might get returned that aren't about the kernel arguments and that would be very confusing. Since this is optional now I think it makes more sense to just not support it for the CUDA adapter as we can't support it properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make it XFAIL for the moment, and it can be fixed properly in a separate task.
7affcd2
to
38d12a5
Compare
Several adapters don't support validating kernel signatures when enqueued. To handle this, we now allow urEnqueueKernelLaunch to return `SUCCESS` even when parameters are invalid. Some tests have also been updated. The CUDA adapter has also been updated to handle invalid arguments.
38d12a5
to
9f6415d
Compare
Several adapters don't support validating kernel signatures when
enqueued. To handle this, we now allow urEnqueueKernelLaunch to return
SUCCESS
even when parameters are invalid.