[SYCL][CUDA][PI][runtime][ABI-break] Add support for multi-device context #6446

Closed · wants to merge 50 commits

Conversation

@t4c1 (Contributor) commented Jul 15, 2022

Introduces support for having multiple CUDA devices in one context.

To facilitate moving buffer and image memory between devices within the same context, some ABI-breaking changes had to be made to the runtime and PI interface.

This includes expanding the check for whether memory needs to be moved from only checking whether the context is the same to also checking whether the device is the same. This creates a performance regression for multiple devices that share memory within the same context: these will now also make copies of memory allocations for each device. This will be resolved in a future pull request, when we introduce a memory migration PI API for direct transfers between devices without going through the host.

Tests in intel/llvm-test-suite#1102.
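
For illustration, here is a minimal sketch of the expanded check described above; the type and function names are hypothetical placeholders rather than the actual runtime code.

```cpp
// Hypothetical sketch of the expanded "does the memory need to be moved?"
// check: previously only the context was compared, now the device is too.
using pi_context = void *; // stand-ins for the opaque PI handle types
using pi_device = void *;

struct AllocationSite {
  pi_context Context;
  pi_device Device;
};

// Before this PR the decision was effectively
//   Current.Context != Requested.Context
// With multi-device contexts, the device must match as well.
bool needsMemoryMove(const AllocationSite &Current,
                     const AllocationSite &Requested) {
  return Current.Context != Requested.Context ||
         Current.Device != Requested.Device;
}
```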

Reviewed declaration (piMemBufferCreate as changed in this PR):

__SYCL_EXPORT pi_result
piMemBufferCreate(pi_context context, pi_device device, pi_mem_flags flags,
                  size_t size, void *host_ptr, pi_mem *ret_mem,
                  const pi_mem_properties *properties = nullptr);
Contributor:

Why did you need to add "device"? How is it used, and what if it is not given?
We didn't need that for buffer migration in the L0 BE.

Contributor (author):

This is the device the memory will be allocated on. Even if we are putting multiple CUDA devices in a single context, they still have distinct device memories, so for the allocation we need to know which device to allocate on. It needs to be given, and the runtime is changed in this PR so that it is always given. On backends that do not need this information, the device argument can be ignored.
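
For reference, a hedged sketch of what a call through the extended entry point could look like; the helper name, flag choice, and header path are illustrative assumptions, and only the piMemBufferCreate signature itself comes from the reviewed hunk above.

```cpp
// Illustrative only: create a buffer backed by memory on one specific device
// of a multi-device context, using the extended piMemBufferCreate from this PR.
#include <CL/sycl/detail/pi.h> // header path may differ between trees
#include <cstddef>

pi_result createBufferOnDevice(pi_context Ctx, pi_device Dev, size_t Size,
                               pi_mem *OutBuf) {
  // The device argument tells the plugin which device memory to allocate in;
  // backends that keep a single shared allocation per context may ignore it.
  return piMemBufferCreate(Ctx, Dev, PI_MEM_FLAGS_ACCESS_RW, Size,
                           /*host_ptr=*/nullptr, OutBuf);
}
```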

Contributor:

The buffer would be represented by multiple allocations, one on each device in the context. It would then "migrate" (copy between allocations) depending on where the buffer is being accessed from. So what is that single "device" that you are adding, and why was it not necessary for OpenCL and Level-Zero but is needed for CUDA?

Contributor (author):

For the CUDA backend we want to be able to put multiple devices in the same context even when they do not share the same memory. So a memory allocation cannot live in a context (which could contain different devices with distinct memories), but must be located on a specific device.

As far as I understand it, other backends have so far only supported putting multiple devices in the same context when they share memory.

Contributor:

No, both the OpenCL and Level-Zero backends support a context with multiple devices in it even if they have distinct memories. That's why I said there are multiple allocations representing such a buffer in the context, and it is essentially copied from the device where it was used last to the one where it is needed next.

Contributor:

I agree that it is a possible way to go, but I am not sure it justifies the "complexity" of breaking backwards compatibility (btw, I am still not sure why you want to have some "initial" device for a buffer allocation, since that can immediately change as the app starts using it from other devices). The buffer "migration" across devices in the context implemented in OpenCL and Level-Zero is not a big deal, in my view, so how do we justify a change of course?

Adding a few folks for extra opinions: @steffenlarsen, @romanovvlad

Contributor (author):

> (btw, I am still not sure why you want to have some "initial" device for a buffer allocation, since that can immediately change as the app starts using it from other devices)

There is no preemptive allocation on any device with this approach. You only get a device allocation when a buffer is first used on that device. It can only migrate to another device after that allocation is first used on the device it was allocated on.
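
To make that lifecycle concrete, here is a minimal sketch of the lazy per-device allocation idea, assuming a hypothetical data layout and helper names rather than the PR's actual implementation.

```cpp
#include <cstddef>
#include <unordered_map>

// Stand-ins for the opaque PI handle and hypothetical backend helpers.
using pi_device = void *;
using Allocation = void *;
Allocation allocateOnDevice(pi_device Dev, std::size_t Size);
void copyAllocation(Allocation Src, Allocation Dst, std::size_t Size); // via host for now

struct BufferState {
  std::unordered_map<pi_device, Allocation> PerDeviceAllocs;
  pi_device LastUsedOn = nullptr;

  // No allocation is made up front: the first use on a device creates the
  // allocation there, and later uses on other devices copy from the device
  // that used the buffer last.
  Allocation getForDevice(pi_device Dev, std::size_t Size) {
    auto It = PerDeviceAllocs.find(Dev);
    if (It == PerDeviceAllocs.end()) {
      Allocation New = allocateOnDevice(Dev, Size);
      if (LastUsedOn != nullptr && LastUsedOn != Dev)
        copyAllocation(PerDeviceAllocs[LastUsedOn], New, Size);
      It = PerDeviceAllocs.emplace(Dev, New).first;
    }
    LastUsedOn = Dev;
    return It->second;
  }
};
```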

Contributor:

> There is no preemptive allocation on any device with this approach. You only get a device allocation when a buffer is first used on that device. It can only migrate to another device after that allocation is first used on the device it was allocated on.

So this is just the device that first used that buffer, and nothing more? Eventually it will be migrated (by the SYCL RT in your proposal) to whatever other device needs it in this context, right?

Contributor (author):

Exactly, except if the backend returns PLUGIN_MANAGED_OR_SAME from piGetMemoryConnection. In that case the RT will do nothing and assume the backend can handle any migrations it needs. For the OpenCL backend, that can be used for any pair of devices in the same context.
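
As an illustration of how the runtime could consult that query: the sketch below assumes a plausible signature and a second enumerator for piGetMemoryConnection; only the name PLUGIN_MANAGED_OR_SAME comes from this discussion, the rest is hypothetical.

```cpp
// Illustrative sketch only; the real piGetMemoryConnection signature and
// remaining enumerators are defined by this PR and may differ from this.
using pi_device = void *; // stand-in for the opaque PI handle

enum pi_memory_connection {
  DISTINCT,              // assumed name: the SYCL RT must copy between devices
  PLUGIN_MANAGED_OR_SAME // from this PR: backend migrates or shares memory itself
};

pi_memory_connection piGetMemoryConnection(pi_device Src, pi_device Dst);
void runtimeCopyAllocation(pi_device Src, pi_device Dst); // hypothetical helper

void ensureVisibleOn(pi_device LastUsed, pi_device Next) {
  if (piGetMemoryConnection(LastUsed, Next) == PLUGIN_MANAGED_OR_SAME)
    return; // e.g. OpenCL: the backend handles any migration itself
  // Otherwise the SYCL RT copies the data (currently through the host;
  // direct device-to-device transfer is planned in a follow-up PR).
  runtimeCopyAllocation(LastUsed, Next);
}
```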

Contributor (author):

I added the query.

@kbobrovs previously approved these changes Jul 18, 2022

@kbobrovs (Contributor) left a comment:

sycl/plugins/esimd_emulator part OK

@smaslov-intel (Contributor)

What's the status of this, please?

@npmiller (Contributor) commented Jan 2, 2023

So this is in a fairly good state (I do need to resolve conflicts), and it works well on CUDA platforms, but the CI does flag some issues on other platforms that still need to be investigated.

But even once this is resolved I'm not sure we're ready to merge, because it does change the PI API in a way that works well in the SYCL runtime but may be problematic for other uses, so I've brought up the changes here to the Unified Runtime:

So it might make sense to hold off on merging this until we discuss it further for the Unified Runtime.

@npmiller (Contributor)

/verify with intel/llvm-test-suite#1102

@veselypeta (Contributor)

Should these changes be mirrored in unified-runtime, @npmiller?

github-actions (bot)

This pull request is stale because it has been open for 180 days with no activity. Remove the Stale label or comment, or this will be automatically closed in 30 days.

@github-actions bot added the Stale label Sep 16, 2024
github-actions (bot)

This pull request was closed because it has been stalled for 30 days with no activity.

@github-actions bot closed this Oct 16, 2024