[SYCL][CUDA] Decouple CUDA contexts from PI contexts #8197
Conversation
7185416 to b1be579 (Compare)
This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts differ from SYCL contexts in that they are tied to a single device and must be active on a thread for most calls to the CUDA driver API. As shown in intel#8124 and intel#7526, the current mapping of CUDA context to PI context causes issues for device-based entry points that still need to call the CUDA APIs; we have workarounds for that, but they're a bit hacky, inefficient, and have a lot of edge-case issues. The peer-to-peer interface proposal in intel#6104 is also device-based, but enabling peer-to-peer access in CUDA is done on CUDA contexts, so the current mapping would make it difficult to implement.

This patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing the CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context. This approach has a number of advantages:

* Use of the primary context is recommended by Nvidia.
* Simplifies the CUDA context management in the plugin.
* A CUDA context is available in device-based entry points.
* Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime applications.
* Easier to expose P2P capabilities.
* Easier to support multiple devices in a SYCL context.

It does have a few drawbacks compared to the previous approach:

* Drops support for `make_context` interop, as there is no sensible "native handle" to pass in (`get_native` is still supported fine).
* No opportunity for users to separate their work into different CUDA contexts. It's unclear whether there is any actual use case for this; it seems very uncommon in CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.

So overall I believe this should be a net benefit in general, and we could revisit if we run into an edge case that would need more fine-grained CUDA context management.
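For reference, a minimal sketch of the primary-context pattern described above, using the plain CUDA driver API. The `DeviceData` struct and function names are illustrative stand-ins, not the actual PI/plugin code:

```cpp
#include <cuda.h>

// Illustrative per-device state; the real PI device structures differ.
struct DeviceData {
  CUdevice cuDevice;
  CUcontext primaryCtx; // retained primary context for this device
};

void initDevice(DeviceData &dev, int ordinal) {
  cuInit(0);
  cuDeviceGet(&dev.cuDevice, ordinal);
  // Retain the device's primary context instead of creating a new one.
  cuDevicePrimaryCtxRetain(&dev.primaryCtx, dev.cuDevice);
}

void callDriverAPI(DeviceData &dev) {
  // Driver API calls require an active context on the calling thread.
  cuCtxSetCurrent(dev.primaryCtx);
  // ... cuMemAlloc, cuLaunchKernel, etc. ...
}

void releaseDevice(DeviceData &dev) {
  cuDevicePrimaryCtxRelease(dev.cuDevice);
}
```

Because the primary context is shared with the CUDA runtime API, this is also what makes interop with CUDA runtime applications cheaper: both sides end up using the same context per device.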
Older versions of gcc struggle with attributes on namespaces
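For context, the construct in question is a C++17 attribute applied to a namespace, which some older gcc releases reject or warn about. A minimal illustration of the feature (not the actual code touched by this commit):

```cpp
// C++17 allows attributes on namespace definitions; older gcc versions
// may warn about or ignore them, hence the workaround mentioned above.
namespace [[deprecated("use new_api instead")]] old_api {
int helper();
}
```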
cf2bb17 to 8685475 (Compare)
Admittedly I am a little worried about decoupling the contexts fully, though I agree with the benefits of using a common default context (it is also the motivation for sycl_ext_oneapi_default_context). Are we confident there would be no benefit to keeping the relation, just changing the default to use the primary context, and adding an inverse of the deprecated property?
If we keep the existing relation, we would still have issues for device-only entry points, and having a primary context on the device while also allowing regular CUDA contexts in the PI context would just add an extra layer of complexity and possible issues with mismatched contexts. As far as I can tell, the use case of having multiple CUDA contexts for one device within the same process, which the current architecture enables, is very rare; people who need multiple CUDA contexts for the same device are usually more likely to be using, say, MPS and have multiple processes anyway. It's also interesting to note that, as I understand it, hipSYCL uses the CUDA runtime API for its CUDA support, which automatically manages contexts and uses primary contexts, so it already has this decoupling between SYCL contexts and CUDA contexts.

So overall I'm pretty confident that in most cases breaking this relationship between the SYCL and CUDA contexts is fine, but I can't say for 100% certain that it won't break some very specific workflows. However, this is also partly why this patch is a little aggressive: I think we're probably better off taking a chance on breaking some very specific workflow, even if it means we need to revisit later on, than carrying over all this complexity for a hypothetical workflow we don't even know exists. And if it does break something, we'll hopefully have a better idea of what people are trying to do, and we can maybe come up with a solution that works better with how CUDA works; for example, we could do like the CUDA runtime and allow the SYCL runtime to use contexts that users set on the current thread.
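To illustrate that last idea, the CUDA-runtime-style behaviour of picking up whatever context the user has made current on the thread could look roughly like the sketch below. This is purely an illustration of the fallback discussed here, not part of this patch:

```cpp
#include <cuda.h>

// Sketch: prefer a context the user already made current on this thread,
// otherwise fall back to the device's primary context.
CUcontext selectContext(CUdevice device) {
  CUcontext current = nullptr;
  cuCtxGetCurrent(&current);
  if (current != nullptr)
    return current; // respect the user's active context

  CUcontext primary = nullptr;
  cuDevicePrimaryCtxRetain(&primary, device);
  return primary;
}
```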
Patch seems to do what it intends to do. I am okay with giving it a go.
@bader @romanovvlad - Anyone we should check with before going ahead with this?
No objections given the explanation in the description (thanks for that, btw).
Thanks! We need formal approval from the @intel/llvm-reviewers-cuda and @intel/llvm-reviewers-runtime teams to merge the PR.
This is no longer needed with #8197
Since intel/llvm#8197, the SYCL CUDA backend uses the CUDA primary context by default, so individual context setting is no longer required.
…27)

* Ignoring VIM temporary files
* Vector Addition example update: works with the latest DPC++ and SYCL 2020 features; README updated to reflect that the CUDA backend has USM support
* Removing unnecessary CUDA Driver types: since intel/llvm#8197, the SYCL CUDA backend uses the CUDA primary context by default, so individual context setting is no longer required
* Using modern queue construction: SYCL 1.2.1 device selectors have been deprecated in favour of a new simplified form using lambdas
* Format files: ran clang-format on the files, in a separate commit to avoid noise
* Explicitly setting the CUDA context on the host task: because of the changes to the SYCL context, it is now necessary to set the active CUDA context manually inside the host task (note there was some clang-formatting here as well); see the sketch below
* Addressing feedback from Gordon
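A hedged sketch of what that host-task change amounts to: inside the `host_task`, make the device's primary context current before issuing CUDA calls. The `CUdevice` handle (`cuDev`) is assumed to have been obtained elsewhere (e.g. via `get_native`), and the exact interop spelling may differ between DPC++ versions:

```cpp
#include <cuda.h>
#include <sycl/sycl.hpp>

// Sketch only: activate the device's primary context inside a host task
// before making CUDA driver/runtime calls, then release it again.
void submitHostWork(sycl::queue &q, CUdevice cuDev) {
  q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle) {
      CUcontext primary = nullptr;
      cuDevicePrimaryCtxRetain(&primary, cuDev);
      cuCtxSetCurrent(primary); // make it active on this thread
      // ... CUDA driver or runtime calls ...
      cuDevicePrimaryCtxRelease(cuDev);
    });
  });
}
```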
The primary context has been the default for a while in the CUDA PI/adapter; see #8197. This PR brings the HIP adapter up to speed. It also changes the scoped context to only take a `ur_device_handle_t`, since this is coupled with a native primary context in HIP.
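In spirit, the "scoped context" described here is an RAII guard keyed on the device handle that activates that device's primary context for the duration of a call. A minimal sketch of the pattern using the CUDA driver API, with illustrative names standing in for the adapter's `ur_device_handle_t`-based implementation:

```cpp
#include <cuda.h>

// Illustrative RAII guard: activates the given device's primary context
// for the lifetime of the object, then restores the previous context.
class ScopedContext {
  CUcontext previous_ = nullptr;
  CUdevice device_;

public:
  explicit ScopedContext(CUdevice device) : device_(device) {
    cuCtxGetCurrent(&previous_);
    CUcontext primary = nullptr;
    cuDevicePrimaryCtxRetain(&primary, device_);
    if (primary != previous_)
      cuCtxSetCurrent(primary);
  }

  ~ScopedContext() {
    cuDevicePrimaryCtxRelease(device_);
    cuCtxSetCurrent(previous_); // restore whatever was active before
  }
};
```

Keying the guard on the device rather than on a context object is what makes it line up with the primary-context model: the context to activate is fully determined by the device.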