[SYCL][CUDA][PI][runtime][ABI-break] Add support for multi-device context #6446
Conversation
```cpp
__SYCL_EXPORT pi_result
piMemBufferCreate(pi_context context, pi_device device, pi_mem_flags flags,
                  size_t size, void *host_ptr, pi_mem *ret_mem,
                  const pi_mem_properties *properties = nullptr);
```
Why did you need to add "device"? How is it used and what if it is not given?
We didn't need that for buffer migration in L0 BE.
This is the device the memory will be allocated on. Even if we are putting multiple CUDA devices in a single context, they still have distinct device memories, so for the allocation we need to know which device to allocate on. It needs to be given, and the runtime is changed in this PR so that it is always given. On backends that do not need this information, the device argument can be ignored.
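For illustration, here is a minimal sketch of a caller passing the new device argument; the `createBufferOn` helper is made up for this example, and it assumes the PI headers are in scope:

```cpp
// Minimal sketch, assuming the PI headers (pi.h) are included; the
// createBufferOn helper is hypothetical, not part of the PR.
pi_mem createBufferOn(pi_context Ctx, pi_device Dev, size_t Size) {
  pi_mem Buf = nullptr;
  // The device argument names where the backing memory should live;
  // backends that do not need this information can ignore it.
  pi_result Res = piMemBufferCreate(Ctx, Dev, PI_MEM_FLAGS_ACCESS_RW, Size,
                                    /*host_ptr=*/nullptr, &Buf);
  return Res == PI_SUCCESS ? Buf : nullptr;
}
```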
The buffer would be represented by multiple allocations, one on each device in the context. It would then "migrate" (copy between allocations) depending on where the buffer is being accessed from. So what is that single "device" that you are adding, and why was it not necessary for OpenCL and Level-Zero but is needed for CUDA?
For the CUDA backend we want to be able to put multiple devices in the same context even when they do not share the same memory. So a memory allocation cannot be located in a context (which could contain different devices with distinct memories), but must be located on a specific device.
As far as I understand it, other backends have so far only supported putting multiple devices in the same context when they share memory.
No, both the OpenCL and Level-Zero backends support a context with multiple devices in it even if they have distinct memories. That's why I said there are multiple allocations representing such a buffer in the context, and it is essentially copied from the device where it was last used to the device where it is needed next.
I agree that it is a possible way to go, but I am not sure it justifies the "complexity" of breaking backwards compatibility (btw, I am still not sure why you want to have some "initial" device for a buffer allocation since that can immediately change as the app starts using it from other devices). The buffer "migration" across devices in the context implemented in OpenCL and Level-Zero is not a big deal, in my view, so how do we justify a change of course?
Adding a few folks for extra opinions: @steffenlarsen , @romanovvlad
> (btw, I am still not sure why you want to have some "initial" device for a buffer allocation since that can immediately change as the app starts using it from other devices)
There is no preemptive allocation on some device with this approach. You only have a device allocation when you are first using a buffer on that device. It can only migrate to another device after that allocation is first used on the device it was allocated on.
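A rough sketch of that lazy, per-device scheme (the struct and the `allocateOn`/`copyBetween` helpers are hypothetical stand-ins, not the actual runtime code):

```cpp
#include <cstddef>
#include <unordered_map>

// Assumes the PI headers provide pi_device; these two helpers are
// hypothetical stand-ins for the backend allocation and copy calls.
void *allocateOn(pi_device Dev, size_t Size);
void copyBetween(pi_device From, pi_device To, size_t Size);

// One allocation per device, created only when the buffer is first used
// on that device; data migrates from the device holding the valid copy.
struct BufferRecord {
  std::unordered_map<pi_device, void *> Allocations;
  pi_device HoldsValidCopy = nullptr; // device that last used the buffer

  void *getForDevice(pi_device Dev, size_t Size) {
    auto It = Allocations.find(Dev);
    if (It == Allocations.end()) // no preemptive allocation elsewhere
      It = Allocations.emplace(Dev, allocateOn(Dev, Size)).first;
    if (HoldsValidCopy && HoldsValidCopy != Dev)
      copyBetween(HoldsValidCopy, Dev, Size); // migrate the valid data
    HoldsValidCopy = Dev;
    return It->second;
  }
};
```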
> There is no preemptive allocation on some device with this approach. You only have a device allocation when you are first using a buffer on that device. It can only migrate to another device after that allocation is first used on the device it was allocated on.
So this is the device that first used that buffer, and nothing more? Eventually it will be migrated (by the SYCL RT in your proposal) to whatever other device needs it in this context, right?
Exactly. Except if the backend returns `PLUGIN_MANAGED_OR_SAME` from `piGetMemoryConnection`. In this case the RT will do nothing and assume the backend can handle any migrations it needs. That can be used for any pair of devices in the same context for the OpenCL backend.
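For context, a sketch of how the runtime might act on such a query; the exact `piGetMemoryConnection` signature and enum spelling below are guesses based on this thread, not the PR's actual API:

```cpp
// Hypothetical sketch; the real signature and enum names in this PR may
// differ. migrateAllocation stands in for the RT's copy path.
void migrateAllocation(pi_context Ctx, pi_device From, pi_device To);

void ensureUsableOn(pi_context Ctx, pi_device From, pi_device To) {
  pi_memory_connection Conn;
  piGetMemoryConnection(Ctx, From, To, &Conn);
  if (Conn == PLUGIN_MANAGED_OR_SAME) {
    // The backend (e.g. OpenCL) manages any migration itself; the RT
    // does nothing.
  } else {
    // The RT copies the buffer's allocation from From to To before use.
    migrateAllocation(Ctx, From, To);
  }
}
```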
I added the query.
sycl/plugins/esimd_emulator part OK
What's the status of this, please?
So this is in a fairly good state (I do need to resolve conflicts), and it works well on CUDA platforms, but the CI does flag some issues on other platforms that still need to be investigated. But even once this is resolved I'm not sure we're ready to merge, because it does change the PI API in a way that works well in the SYCL runtime but that may be problematic for other uses, so I've brought these changes up with the Unified Runtime. It might make sense to hold off on merging this until we discuss it further for the Unified Runtime.
/verify with intel/llvm-test-suite#1102
Should these changes be mirrored in unified-runtime, @npmiller?
This pull request is stale because it has been open 180 days with no activity. Remove the stale label or comment, or this will be automatically closed in 30 days.
This pull request was closed because it has been stalled for 30 days with no activity.
Introduces support for having multiple CUDA devices in one context.
To facilitate moving buffer and image memory between devices within the same context, some ABI-breaking changes had to be made to the runtime and PI interface.
This includes expanding the check for whether memory needs to be moved: instead of only checking whether the context is the same, it now also checks whether the device is the same. This creates a performance regression for multiple devices that share memory within the same context, since they will now also get separate copies of memory allocations per device. This will be resolved in a future pull request, when we introduce a memory migration PI API for direct transfers between devices without going through the host.
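Schematically, the reuse check now keys on the device as well as the context (a hypothetical simplification, not the runtime's actual code):

```cpp
// Before this PR only the contexts were compared; the device comparison
// is new, and is what currently forces per-device copies even when two
// devices in one context share memory.
bool canReuseAllocation(pi_context OldCtx, pi_device OldDev,
                        pi_context NewCtx, pi_device NewDev) {
  return OldCtx == NewCtx && OldDev == NewDev;
}
```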
Tests in intel/llvm-test-suite#1102.