Skip to content

Conversation

@GeorgeWeb
Copy link
Contributor

@GeorgeWeb GeorgeWeb commented Sep 4, 2024

This change does the grid dimensions validation earlier as it was previously deferred to cuLaunchKernel. For now it returns UR_RESULT_ERROR_INVALID_VALUE as it is also handled this way in the DPC++ sycl runtime, but it allows better control to target this with a more suitable error code in the future.

Related Q&A discussion: intel/llvm#14160

intel/llvm PR: intel/llvm#15390

@GeorgeWeb GeorgeWeb force-pushed the georgi/max-wg-dim-error branch from a429618 to 95cf885 Compare September 4, 2024 17:25
@github-actions github-actions bot added the cuda CUDA adapter specific issues label Sep 4, 2024
@GeorgeWeb GeorgeWeb force-pushed the georgi/max-wg-dim-error branch from 95cf885 to 391ef40 Compare September 13, 2024 13:24
@GeorgeWeb GeorgeWeb force-pushed the georgi/max-wg-dim-error branch from 391ef40 to 8ab63e2 Compare September 13, 2024 13:25
@GeorgeWeb GeorgeWeb marked this pull request as ready for review September 13, 2024 13:32
@GeorgeWeb GeorgeWeb requested a review from a team as a code owner September 13, 2024 13:32
@GeorgeWeb GeorgeWeb requested a review from JackAKirk September 13, 2024 13:32
@JackAKirk
Copy link
Contributor

JackAKirk commented Sep 13, 2024

I've got mixed feelings. On the one hand it allows you to give the user more information, that the invalid value is due to invalid grid size. On the other hand this is more information than you get compared to cuda runtime, and I'm not sure that it needs to be this specific at the expense of slowing down the kernel launch to do the query. It also sets a precedent to do this.

Personally I think we should just be preserving the native errors but emitting a sycl exception, which is what this achieves:
#500 (comment)

I had thought that we had already agreed this since we had a meeting about it and I thought that the above was the agreed solution: see this #589 and this: intel/llvm#10066

This way users get the same error messages as the native cuda backend runtime, and they can look them up using the native documentation, but they are packaged in a sycl::exception satisfying the spec and allowing them to be caught.

@GeorgeWeb
Copy link
Contributor Author

GeorgeWeb commented Sep 13, 2024

I've got mixed feelings. On the one hand it allows you to give the user more information, that the invalid value is due to invalid grid size. On the other hand this is more information than you get compared to cuda runtime, and I'm not sure that it needs to be this specific at the expense of slowing down the kernel launch to do the query. It also sets a precedent to do this.

Personally I think we should just be preserving the native errors but emitting a sycl exception, which is what this achieves: #500 (comment)

I had thought that we had already agreed this since we had a meeting about it and I thought that the above was the agreed solution: see this #589 and this: intel/llvm#10066

This way users get the same error messages as the native cuda backend runtime, and they can look them up using the native documentation, but they are packaged in a sycl::exception satisfying the spec and allowing them to be caught.

I wasn't aware where the direction was regarding preserving the native errors. I agree we'll get the native invalid-value error anyways but in order to give the user more information regarding why they are getting invalid value here, we have this bit of code in the sycl runtime that does the above check to report better information to the user, see https://github.com/intel/llvm/blob/sycl/sycl/source/detail/error_handling/error_handling.cpp#L341. The sycl runtime will always throw accordingly, but in case the result code ever changes from invalid to something else in the cuda runtime, we'll miss on delivering the extra bit of information. I guess since this change does not require an actual change of the resulting error code for the sycl runtime to specialize the error handling in error_handling.cpp, then it's okay to not do the extra early check here.

Update: setPluginSpecificMessage should fix this where needed. As agreed with Jack, it's not needed in this instance yet.

@GeorgeWeb
Copy link
Contributor Author

GeorgeWeb commented Sep 13, 2024

I see your point now. I'll close this PR because at the moment the same INVALID_VALUE mapping is returned from the driver API anyways. I'll use the above approach with setPluginSpecificMessage you already linked in future instances where it is needed. Thanks!

@GeorgeWeb GeorgeWeb closed this Sep 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda CUDA adapter specific issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants