[Cuda] Check for exceeding grid dimension limits (work-group count) #2050

GeorgeWeb · 2024-09-04T17:24:17Z

This change does the grid dimensions validation earlier as it was previously deferred to cuLaunchKernel. For now it returns UR_RESULT_ERROR_INVALID_VALUE as it is also handled this way in the DPC++ sycl runtime, but it allows better control to target this with a more suitable error code in the future.

Related Q&A discussion: intel/llvm#14160

intel/llvm PR: intel/llvm#15390

JackAKirk · 2024-09-13T13:49:23Z

I've got mixed feelings. On the one hand it allows you to give the user more information, that the invalid value is due to invalid grid size. On the other hand this is more information than you get compared to cuda runtime, and I'm not sure that it needs to be this specific at the expense of slowing down the kernel launch to do the query. It also sets a precedent to do this.

Personally I think we should just be preserving the native errors but emitting a sycl exception, which is what this achieves:
#500 (comment)

I had thought that we had already agreed this since we had a meeting about it and I thought that the above was the agreed solution: see this #589 and this: intel/llvm#10066

This way users get the same error messages as the native cuda backend runtime, and they can look them up using the native documentation, but they are packaged in a sycl::exception satisfying the spec and allowing them to be caught.

GeorgeWeb · 2024-09-13T15:29:49Z

I've got mixed feelings. On the one hand it allows you to give the user more information, that the invalid value is due to invalid grid size. On the other hand this is more information than you get compared to cuda runtime, and I'm not sure that it needs to be this specific at the expense of slowing down the kernel launch to do the query. It also sets a precedent to do this.

Personally I think we should just be preserving the native errors but emitting a sycl exception, which is what this achieves: #500 (comment)

I had thought that we had already agreed this since we had a meeting about it and I thought that the above was the agreed solution: see this #589 and this: intel/llvm#10066

This way users get the same error messages as the native cuda backend runtime, and they can look them up using the native documentation, but they are packaged in a sycl::exception satisfying the spec and allowing them to be caught.

I wasn't aware where the direction was regarding preserving the native errors. I agree we'll get the native invalid-value error anyways but in order to give the user more information regarding why they are getting invalid value here, we have this bit of code in the sycl runtime that does the above check to report better information to the user, see https://github.com/intel/llvm/blob/sycl/sycl/source/detail/error_handling/error_handling.cpp#L341. The sycl runtime will always throw accordingly, but in case the result code ever changes from invalid to something else in the cuda runtime, we'll miss on delivering the extra bit of information. I guess since this change does not require an actual change of the resulting error code for the sycl runtime to specialize the error handling in error_handling.cpp, then it's okay to not do the extra early check here.

Update: setPluginSpecificMessage should fix this where needed. As agreed with Jack, it's not needed in this instance yet.

GeorgeWeb · 2024-09-13T15:59:51Z

I see your point now. I'll close this PR because at the moment the same INVALID_VALUE mapping is returned from the driver API anyways. I'll use the above approach with setPluginSpecificMessage you already linked in future instances where it is needed. Thanks!

GeorgeWeb force-pushed the georgi/max-wg-dim-error branch from a429618 to 95cf885 Compare September 4, 2024 17:25

github-actions bot added the cuda CUDA adapter specific issues label Sep 4, 2024

GeorgeWeb force-pushed the georgi/max-wg-dim-error branch from 95cf885 to 391ef40 Compare September 13, 2024 13:24

[Cuda] Check for exceeding grid dimension limits (work-group count)

8ab63e2

GeorgeWeb force-pushed the georgi/max-wg-dim-error branch from 391ef40 to 8ab63e2 Compare September 13, 2024 13:25

GeorgeWeb mentioned this pull request Sep 13, 2024

Corresponding DPC++ PR for unified-runtime/pull/2050 intel/llvm#15390

Closed

GeorgeWeb marked this pull request as ready for review September 13, 2024 13:32

GeorgeWeb requested a review from a team as a code owner September 13, 2024 13:32

GeorgeWeb requested a review from JackAKirk September 13, 2024 13:32

JackAKirk approved these changes Sep 13, 2024

View reviewed changes

GeorgeWeb closed this Sep 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Cuda] Check for exceeding grid dimension limits (work-group count) #2050

[Cuda] Check for exceeding grid dimension limits (work-group count) #2050

Uh oh!

GeorgeWeb commented Sep 4, 2024 •

edited

Loading

Uh oh!

JackAKirk commented Sep 13, 2024 •

edited

Loading

Uh oh!

GeorgeWeb commented Sep 13, 2024 •

edited

Loading

Uh oh!

GeorgeWeb commented Sep 13, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[Cuda] Check for exceeding grid dimension limits (work-group count) #2050

[Cuda] Check for exceeding grid dimension limits (work-group count) #2050

Uh oh!

Conversation

GeorgeWeb commented Sep 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JackAKirk commented Sep 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GeorgeWeb commented Sep 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

GeorgeWeb commented Sep 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

GeorgeWeb commented Sep 4, 2024 •

edited

Loading

JackAKirk commented Sep 13, 2024 •

edited

Loading

GeorgeWeb commented Sep 13, 2024 •

edited

Loading

GeorgeWeb commented Sep 13, 2024 •

edited

Loading