
[SYCL][CUDA] Decouple CUDA contexts from PI contexts #8197

Merged — 2 commits merged on Feb 9, 2023

Conversation

@npmiller (Contributor) commented Feb 3, 2023

This patch moves the CUDA context from the PI context to the PI device, and switches to always using the primary context.

CUDA contexts are different from SYCL contexts in that they're tied to a single device, and that they are required to be active on a thread for most calls to the CUDA driver API.

As shown in #8124 and #7526, the current mapping of CUDA contexts to PI contexts causes issues for device-based entry points that still need to call CUDA APIs. We have workarounds for that, but they are hacky, inefficient, and have many edge-case issues.

The peer-to-peer interface proposal in #6104 is also device-based, but enabling peer-to-peer access in CUDA is done on the CUDA contexts, so the current mapping would make it difficult to implement.

This patch solves most of these issues by decoupling the CUDA context from the SYCL context and simply managing CUDA contexts in the devices; it also changes the CUDA context management to always use the primary context.

This approach has a number of advantages:

  • Use of the primary context is recommended by Nvidia
  • Simplifies the CUDA context management in the plugin
  • Available CUDA context in device based entry points
  • Likely more efficient in the general case, with fewer opportunities to accidentally cause costly CUDA context switches.
  • Easier and likely more efficient interactions with CUDA runtime applications.
  • Easier to expose P2P capabilities
  • Easier to support multiple devices in a SYCL context

It does have a few drawbacks compared to the previous approach:

  • Drops support for make_context interop: there is no longer a sensible "native handle" to pass in (get_native is still fully supported).
  • No opportunity for users to separate their work into different CUDA contexts. It's unclear whether there is an actual use case for this; it seems very uncommon for CUDA codebases to have multiple CUDA contexts for a single CUDA device in the same process.
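As a rough illustration of the interop that remains, device-level `get_native` looks something like this (sketch only, untested; assumes a DPC++ build with the `ext_oneapi_cuda` backend):

```cpp
// Sketch: requires DPC++ built with the CUDA backend; not a complete program.
sycl::device d{sycl::gpu_selector_v};

// get_native on a device is still supported and yields the underlying CUdevice.
CUdevice cu_dev = sycl::get_native<sycl::backend::ext_oneapi_cuda>(d);

// make_context interop, by contrast, is dropped: with devices owning primary
// contexts there is no sensible CUcontext "native handle" to pass in.
```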

Overall I believe this should be a net benefit, and we can revisit if we run into an edge case that needs more fine-grained CUDA context management.

@npmiller npmiller requested review from a team as code owners February 3, 2023 14:19
Commit messages: the first commit's message matches the PR description above; the second reads "Older versions of gcc struggle with attributes on namespaces".
@steffenlarsen (Contributor) left a comment:

Admittedly I am a little worried about decoupling the contexts fully, though I agree with the benefits of using a common default context (it is also the motivation for sycl_ext_oneapi_default_context). Are we confident there would be no benefit to keeping the relation, just changing the default to use the primary context and adding an inverse of the deprecated property?

@npmiller (Contributor, Author) commented Feb 7, 2023

> Are we confident there would be no benefit to keeping the relation and just change the default to use primary context and making an inverse of the deprecated property?

If we keep the existing relation, we would still have issues for device-only entry points, and having a primary context on the device while also allowing regular CUDA contexts in the PI context would just add an extra layer of complexity and potential issues with mismatched contexts.

As far as I can tell, having multiple CUDA contexts for one device within the same process, which the current architecture enables, is a very rare use case; people who need multiple CUDA contexts for the same device are usually more likely to be using something like MPS with multiple processes anyway.

It's also interesting to note that, as I understand it, hipSYCL uses the CUDA runtime API for its CUDA support, which automatically manages contexts and uses primary contexts, so it already has this decoupling between SYCL contexts and CUDA contexts.

So overall I'm pretty confident that breaking this relationship between the SYCL and CUDA contexts is fine in most cases, but I can't say with 100% certainty that it won't break some very specific workflows.

This is also partly why this patch is a little aggressive: I think we're better off taking a chance on breaking some very specific workflow, even if it means we need to revisit later on, than carrying over all this complexity for a hypothetical workflow we don't even know exists. And if it does break something, we'll hopefully have a better idea of what people are trying to do, and we can come up with a solution that works better with how CUDA works; for example, like the CUDA runtime, we could allow the SYCL runtime to use contexts that users set on the current thread.

@steffenlarsen (Contributor) left a comment:

Patch seems to do what it intends to do. I am okay with giving it a go.

@bader | @romanovvlad - Anyone we should check with before going ahead with this?

@bader (Contributor) commented Feb 7, 2023

> Patch seems to do what it intends to do. I am okay with giving it a go.
>
> @bader | @romanovvlad - Anyone we should check with before going ahead with this?

@smaslov-intel?

@smaslov-intel (Contributor) commented:

> Patch seems to do what it intends to do. I am okay with giving it a go.
> @bader | @romanovvlad - Anyone we should check with before going ahead with this?
>
> @smaslov-intel?

No objections given the explanation in the description (thanks for that, btw)

@bader (Contributor) commented Feb 7, 2023

> Patch seems to do what it intends to do. I am okay with giving it a go.
> @bader | @romanovvlad - Anyone we should check with before going ahead with this?
>
> @smaslov-intel?
>
> No objections given the explanation in the description (thanks for that, btw)

Thanks! We need formal approval from @intel/llvm-reviewers-cuda and @intel/llvm-reviewers-runtime teams to merge the PR.

@bader bader merged commit a3542ed into intel:sycl Feb 9, 2023
npmiller added a commit that referenced this pull request Feb 9, 2023
bader pushed a commit that referenced this pull request Feb 9, 2023
Ruyk added a commit to codeplaysoftware/SYCL-For-CUDA-Examples that referenced this pull request Feb 13, 2023
Since intel/llvm#8197, SYCL CUDA backend uses
CUDA primary context by default, so individual context setting is no
longer required.
Ruyk added a commit to codeplaysoftware/SYCL-For-CUDA-Examples that referenced this pull request Apr 19, 2023
…27)

* Ignoring VIM temporary files

* Vector Addition example update

Works with latest DPC++ and SYCL2020 features
README updated to reflect CUDA backend has USM support

* Removing unnecessary CUDA Driver types

Since intel/llvm#8197, SYCL CUDA backend uses
CUDA primary context by default, so individual context setting is no
longer required.

* Using modern queue construction

SYCL 1.2.1 device selectors have been deprecated in favour of a new
simplified form using lambdas.

* Format files

Run clang-format on files, separate commit to avoid noise

* Explicitly setting CUDA context on host task

Because of the changes to the SYCL context handling, it is now necessary to set the active CUDA context manually inside the host task.

Note there was some clang-formatting here as well

* Addressing feedback from Gordon
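The "explicitly setting CUDA context on host task" commit above can be sketched roughly as follows (untested; `get_native_context` is assumed here to return the device's `CUcontext` under the `ext_oneapi_cuda` backend, and `q` is an existing CUDA-backed queue):

```cpp
// Sketch: inside a SYCL host task, CUDA driver calls need the right context
// made current on the host-task thread, since the runtime no longer sets
// individual contexts up for arbitrary user threads.
q.submit([&](sycl::handler &cgh) {
  cgh.host_task([=](sycl::interop_handle ih) {
    CUcontext cu_ctx =
        ih.get_native_context<sycl::backend::ext_oneapi_cuda>(); // assumed accessor
    cuCtxSetCurrent(cu_ctx); // activate it on this thread
    // ... native CUDA driver API calls here ...
  });
});
```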
steffenlarsen pushed a commit that referenced this pull request Jul 26, 2023
The primary context has been default for a while in CUDA PI/Adapter. See
#8197.

This PR brings the HIP adapter up to speed.

It also changes the scoped context to only take a `ur_device_handle_t`, since this is coupled with a native primary context in HIP.
veselypeta pushed a commit to veselypeta/llvm that referenced this pull request Sep 28, 2023 (same commit message as above)
szadam pushed a commit to szadam/unified-runtime that referenced this pull request Oct 13, 2023 (same commit message as above)
mdtoguchi pushed a commit to mdtoguchi/llvm that referenced this pull request Oct 18, 2023 (same commit message as above)
omarahmed1111 pushed a commit to omarahmed1111/unified-runtime that referenced this pull request Oct 23, 2023 (same commit message as above)
4 participants