SYCL_USM_ARRAY_INTERFACE protocol for DPPY #1

Open · diptorupd opened this issue Mar 17, 2021 · 9 comments

@diptorupd (Collaborator) commented Mar 17, 2021

We need a protocol similar to the __cuda_array_interface__ (CAI) protocol for DPPY. @oleksandr-pavlyk has already drafted an initial version of a CAI-like protocol called __sycl_usm_array_interface__ (SUAI) in [1]. This ticket opens the discussion on finalizing SUAI before we publish it publicly.

There are three main goals for the CAI (and, by extension, the SUAI) protocol (an illustrative SUAI dictionary is sketched after the list):

  1. Enable a producer to inform a consumer of the memory location of data passed from the producer to the consumer.
  2. Provide information to a consumer about how to copy the data. A copy can be via host, or via device-to-device IPC, or even zero-copy. The information typically requires the consumer to parse the pointer type and other attributes. For CUDA, cuPointerGetAttribute is typically used; for SYCL, get_pointer_type is a similar function for USM pointers.
  3. Finally, ensure safety guarantees, i.e., make sure a consumer can use the data only after all operations on the data are completed on the producer side.
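For concreteness, here is a minimal sketch of a producer-side dictionary in the spirit of the draft SUAI in [1]; the exact field set is what this ticket is meant to finalize, so treat the keys as illustrative rather than normative:

```python
def make_suai(usm_ptr: int, nelems: int, syclobj) -> dict:
    """Illustrative __sycl_usm_array_interface__ dictionary (cf. [1])."""
    return {
        "data": (usm_ptr, False),  # (USM pointer as int, read-only flag)
        "shape": (nelems,),
        "strides": None,           # None means C-contiguous
        "typestr": "<f4",          # NumPy-style dtype string (here float32)
        "offset": 0,               # element offset into the allocation
        "version": 1,
        "syclobj": syclobj,        # queue, context, or device triple (see below)
    }
```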

The third point has stalled CAI’s adoption in TensorFlow [2] and led to a year-long debate [6] before the v3 revision of CAI [4] was finalized in Numba 0.53.

Limitations in the current SUAI:

A) SUAI is ambiguous about point 2).

The syclobj attribute of SUAI can be a dpctl.SyclQueue, a dpctl.SyclContext, or a 3-tuple identifying a specific device.

Although our proposal allows multiple ways of specifying the sycl::queue or sycl::context, this is not foolproof. The decision as to how the SUAI is parsed at the consumer end cannot be controlled by the producer. A producer may return a dpctl.SyclContext inside SUAI, but a consumer that does not use dpctl will be unable to consume the pointer.

This is also pertinent for cases where the consumer is completely oblivious to the SYCL API and relies on some lower-level implementation to deal with a CUDA device/SYCL USM pointer. In [4], a user (leofang) gives mpi4py as a specific example: mpi4py is completely CUDA-unaware and relies on the underlying MPI library to deal with the CUDA pointer in whichever way the MPI library sees fit.

I feel we should either separate out the various options for syclobj into distinct dictionary attributes, at the cost of making the protocol too verbose, or explore an alternative similar to CAI v3 that stores a capsule containing a SYCL queue in SUAI. A sketch of what a dpctl-aware consumer has to do today follows.
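To illustrate the problem, here is a minimal sketch of consumer-side parsing of the current syclobj variants using dpctl's Python API; a consumer without dpctl cannot take the first two branches at all, which is exactly the interoperability gap described above:

```python
import dpctl

def context_from_syclobj(syclobj) -> dpctl.SyclContext:
    """Recover a SYCL context from the current syclobj variants."""
    if isinstance(syclobj, dpctl.SyclQueue):
        return syclobj.sycl_context
    if isinstance(syclobj, dpctl.SyclContext):
        return syclobj
    if isinstance(syclobj, tuple):  # (backend, device_type, relative_id)
        dev = dpctl.SyclDevice(":".join(str(f) for f in syclobj))
        return dpctl.SyclContext(dev)
    raise TypeError("unrecognized syclobj")
```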

B) We have not addressed the synchronization requirements that CAI v3 addresses.

How do we guarantee that the consumer can safely use the data pointer?

CAI addresses the issue in two ways:

  • It allows passing the CUDA stream to enable asynchronous execution if the consumer chooses to use the same stream. Otherwise, the consumer may choose to use explicit synchronization before accessing the data from another stream.
  • If a producer has not enqueued any work on the data and the data is ready, the stream option in CAI can be omitted.

The synchronization question needs to be addressed, either by explicitly stating that SUAI is a blocking protocol or by designing something similar to CAI (see the sketch of CAI v3's consumer-side contract below).
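For comparison, a minimal sketch of the consumer-side contract that CAI v3 establishes through its optional stream key (integer meanings per the CAI v3 specification [4]); synchronize_on_stream is a placeholder for whatever mechanism the consumer's CUDA layer provides:

```python
def consume_cai(cai: dict, synchronize_on_stream) -> tuple:
    """Honor CAI v3's stream entry before touching the data pointer."""
    stream = cai.get("stream")  # absent/None: data is ready, no sync needed
    if stream is not None:
        # 1 = legacy default stream, 2 = per-thread default stream,
        # any other integer = address of a cudaStream_t (0 is disallowed)
        synchronize_on_stream(stream)
    ptr, read_only = cai["data"]
    return ptr, read_only
```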

[1] https://github.com/IntelPython/dpctl/wiki/Zero-copy-data-exchange-using-SYCL-USM
[2] tensorflow/tensorflow#29039
[3] https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html
[4] https://gmarkall.github.io/numba-rtd-theme/cuda/cuda_array_interface.html
[5] numba/numba#4933
[6] numba/numba#5162

@fschlimb

@diptorupd I am not sure I understand the ambiguity argument. I understand that we want to support consumers who do not (want to) use dpctl; using types defined in dpctl should hence be avoided. Adding a dpctl.queue and a dpctl.context would actually introduce the possibility of ambiguity, while the suggested protocol is not ambiguous. Can you explain where the ambiguity is?

With respect to data readiness, we should consider explicit dependence semantics. It seems natural to use SYCL's USM events, and it could also help provide a cleaner/more explicit async feature in Numba.

@oleksandr-pavlyk (Contributor)

Relevant to this issue is DLPack's discussion of synchronization semantics.

@oleksandr-pavlyk (Contributor)

I would rephrase "Provide information to a consumer about how to copy the data." as "Provide a consumer with information necessary to copy the data."

It's at the consumer's discretion whether to copy via p2p or via host.

SUAI is not ambiguous about item 2, but it does introduce a dependency on dpctl in cases where sharing of dpctl.SyclContext or dpctl.SyclQueue is required (not mainstream workflows).

The DLPack team's request was to support named PyCapsules as well, with "sycl_context_ptr" and "sycl_queue_ptr" being the names I would propose. Capsules should be single-use only, in that retrieving the pointer from the capsule should be accompanied by a change to the PyCapsule instance's name, as is done with DLPack's capsule.

@michael-smirnov

> Enable a producer to inform a consumer of the memory location of data

The memory location of the data is known only for host and device USM pointers; shared data exists on both the host and the device, and it is undefined in any particular case where the data actually resides (and whether a data transfer will occur before the computation or not). So my understanding is that we provide information to a consumer about which devices can access the data efficiently, not the actual data location.

> The syclobj attribute of SUAI can be a dpctl.SyclQueue, a dpctl.SyclContext, or a 3-tuple identifying a specific device

The context does not contain the whole information about the data: it does not know the device where the data was (potentially) allocated, and therefore where the data can be accessed most efficiently. So it is strange to have the context as one of the options. Why can't we replace syclobj, with all these options, by a queueobj that always contains just a capsule of a sycl::queue? It always carries enough information about the data and can be used in packages without dpctl support. And dpctl can provide additional helpers, for example, to create dpctl.SyclQueue or dpctl.SyclContext objects from queueobj; a sketch of such a helper follows.
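A minimal sketch of such a helper, assuming dpctl can construct a SyclQueue from a named PyCapsule (the capsule-based constructor is dpctl's convention; the "queueobj" key is the proposal above, not part of the current draft):

```python
import dpctl

def queue_from_suai(suai: dict) -> dpctl.SyclQueue:
    """Recover a dpctl.SyclQueue from a hypothetical 'queueobj' capsule."""
    capsule = suai["queueobj"]       # PyCapsule wrapping a sycl::queue
    return dpctl.SyclQueue(capsule)  # dpctl unpacks the named capsule
```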

> The synchronization question needs to be addressed

Would it solve the problem if we extended __sycl_usm_array_interface__ with an eventlist parameter that contains a capsule of a std::vector<sycl::event>?

@michael-smirnov

> Capsules should be single-use only

@oleksandr-pavlyk, why should capsules be single-use only? How then will sharing the same resources between DPPY packages be organized?

@oleksandr-pavlyk (Contributor)

> > Enable a producer to inform a consumer of the memory location of data
>
> The memory location of the data is known only for host and device USM pointers; shared data exists on both the host and the device, and it is undefined in any particular case where the data actually resides (and whether a data transfer will occur before the computation or not). So my understanding is that we provide information to a consumer about which devices can access the data efficiently, not the actual data location.

I suggest rephrasing that sentence as "Enable a producer to inform a consumer how to access the data, i.e., the shared USM pointer and the SYCL context the pointer is bound to".

The internal workings of USM shared memory are not pertinent to the ability to share data.

> > The syclobj attribute of SUAI can be a dpctl.SyclQueue, a dpctl.SyclContext, or a 3-tuple identifying a specific device
>
> The context does not contain the whole information about the data: it does not know the device where the data was (potentially) allocated, and therefore where the data can be accessed most efficiently. So it is strange to have the context as one of the options. Why can't we replace syclobj, with all these options, by a queueobj that always contains just a capsule of a sycl::queue? It always carries enough information about the data and can be used in packages without dpctl support.

Per SYCL, all you need is the USM pointer and a sycl::context. The device can be recovered with sycl::get_pointer_device(ptr, ctx), and the context and device can be used to create a queue if needed. The context is also all you need to free the memory: sycl::free(ptr, ctx).

The SUAI does not insist on having a queue, to enable passing just the minimal information required. I do not anticipate this usage to be common. I also think the most commonly used value for syclobj will be filter-selector data (an integer, a triple of backend/device_type/relative_id, or perhaps even a string). This info will be used to select a root device and query the DPC++ runtime for its default context. A sketch of parsing such values follows.
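A minimal sketch of mapping filter-selector style syclobj values to a root device with dpctl (the triple layout follows the description above; querying the platform default context is left as a comment since the exact dpctl API for it is not settled in this discussion):

```python
import dpctl

def device_from_syclobj(syclobj) -> dpctl.SyclDevice:
    """Select a root device from a filter-selector style syclobj."""
    if isinstance(syclobj, str):    # e.g. "level_zero:gpu:0"
        return dpctl.SyclDevice(syclobj)
    if isinstance(syclobj, int):    # relative id in the default device order
        return dpctl.SyclDevice(str(syclobj))
    if isinstance(syclobj, tuple):  # (backend, device_type, relative_id)
        return dpctl.SyclDevice(":".join(str(f) for f in syclobj))
    raise TypeError("unsupported syclobj")

# The consumer would then ask the DPC++ runtime for the default context
# associated with the selected root device's platform.
```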

> > The synchronization question needs to be addressed
>
> Would it solve the problem if we extended __sycl_usm_array_interface__ with an eventlist parameter that contains a capsule of a std::vector<sycl::event>?

We could, but the feedback I received from the community (data-apis/consortium-feedback#1 (comment), data-apis/consortium-feedback#1 (comment), dmlc/dlpack#57 (comment)) is that imposing on consumers of SYCL USM arrays to work with SYCL native objects like events is discouraged.

Instead, it was proposed that synchronization be performed by the consumer handing the producer an in-order queue; the producer submits into it a trivial kernel that depends on its internal events. The consumer then submits its own trivial kernel into the in-order queue and uses the returned event in its own, possibly out-of-order, queues, thus achieving the needed synchronization.
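A minimal sketch of this handshake in dpctl's Python API; submit_barrier (which postdates this discussion) stands in for the "trivial kernel", and how the producer exposes its internal events is left hypothetical:

```python
import dpctl

def synchronize_via_inorder_queue(producer_events, consumer_queue):
    """Order consumer work after producer work through an in-order queue."""
    # Consumer-owned in-order queue handed to the producer.
    sync_q = dpctl.SyclQueue(property="in_order")
    # Producer side: order the in-order queue after its pending work.
    sync_q.submit_barrier(producer_events)
    # Consumer side: an event that fires once the producer's work is done,
    # usable as a dependency in the consumer's out-of-order queues.
    ready = sync_q.submit_barrier()
    consumer_queue.submit_barrier([ready])
    return ready
```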

@oleksandr-pavlyk (Contributor)

> > Capsules should be single-use only
>
> @oleksandr-pavlyk, why should capsules be single-use only? How then will sharing the same resources between DPPY packages be organized?

The producer creates a copy, packages its reference and a destructor in a capsule, and shares it with a consumer. Rinse and repeat for different consumer packages.

Regarding why "single use": because subsequent operations on the queue modify the internal structure of the queue. The queue may well be thread-safe, and that is OK.

The practice of renaming a capsule after use comes from cuDF; see data-apis/consortium-feedback#1 (comment). The link to source code in that comment is stale; please use https://github.com/rapidsai/cudf/blob/branch-0.19/python/cudf/cudf/_lib/interop.pyx#L34
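A minimal sketch of single-use capsule consumption through the CPython capsule API via ctypes; the names "sycl_queue_ptr" / "used_sycl_queue_ptr" follow the naming proposed earlier in this thread and mirror DLPack's "dltensor"/"used_dltensor" convention:

```python
import ctypes

PyCapsule_GetPointer = ctypes.pythonapi.PyCapsule_GetPointer
PyCapsule_GetPointer.restype = ctypes.c_void_p
PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

PyCapsule_SetName = ctypes.pythonapi.PyCapsule_SetName
PyCapsule_SetName.restype = ctypes.c_int
PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]

def consume_queue_capsule(capsule):
    """Extract the raw queue pointer and mark the capsule as used."""
    ptr = PyCapsule_GetPointer(capsule, b"sycl_queue_ptr")
    # Renaming makes a second consumption fail; note that in C the new
    # name buffer must outlive the capsule (a long-lived bytes literal
    # suffices in this sketch).
    PyCapsule_SetName(capsule, b"used_sycl_queue_ptr")
    return ptr
```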

@michael-smirnov

> imposing on consumers of SYCL USM arrays to work with SYCL native objects like events is discouraged

I'm sorry, but I don't see a real problem with using wrappers over native SYCL objects like events internally. We already have wrappers over the queue and are using them. Or is using the queue also discouraged by the community? 😄
For convenience, we can provide additional functions in dpctl that work with such wrappers, like

dpctl.wait(obj.__sycl_usm_array_interface__['eventlist'])
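A minimal sketch of what such a helper could look like, assuming the hypothetical eventlist entry holds dpctl.SyclEvent wrappers (dpctl.wait as a module-level function does not exist today; this is the proposal, not current API):

```python
import dpctl

def wait(eventlist):
    """Block until every event in the hypothetical 'eventlist' completes."""
    for ev in eventlist:
        ev.wait()  # dpctl.SyclEvent.wait() blocks on the wrapped sycl::event
```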

@oleksandr-pavlyk (Contributor) commented Mar 24, 2021

> > imposing on consumers of SYCL USM arrays to work with SYCL native objects like events is discouraged
>
> I'm sorry, but I don't see a real problem with using wrappers over native SYCL objects like events internally. We already have wrappers over the queue and are using them. Or is using the queue also discouraged by the community? 😄

The feedback I received from DLPack maintainers encouraged us to work out an interoperability protocol that avoids requiring pkgA to use and manipulate SYCL entities created by another package pkgB. This includes sycl::event and sycl::queue.

Essentially, the packages exchange tokens (representable as native Python objects) that can be traded with the offload runtime for the actual runtime objects.

This does not mean we cannot exchange these objects internally, but there should be a common path where this is not required, like working with root devices and their default contexts.

> For convenience, we can provide additional functions in dpctl that work with such wrappers, like
>
> dpctl.wait(obj.__sycl_usm_array_interface__['eventlist'])

We already have dpctl.SyclQueue.wait, but that waits on all events recorded in the queue. Currently, DPC++ does not provide a barrier predicated on an event list that applies to an out-of-order queue. There is a SYCL extension proposal for that, though.
