Specify synchronization semantics #57
I am also curious as to whether DLPack is blocking. |
I would like to add that another data exchange format, the CUDA Array Interface, has a recent revision that @kkraus14, I, and many others have contributed to; see the Synchronization section of its v3 protocol. It turns out to be nontrivial to ensure the data being exchanged is correct with respect to the semantics of CUDA (or HIP or OpenCL, if one cares). It appears that DLPack currently lacks this level of consideration, and implicitly assumes that the Producer and the Consumer exchanging the data live on the same (whatever) stream/queue. This is a very strong assumption in hindsight. I understand no one has complained about the lack of such a synchronization requirement in DLPack (this should have been a serious issue since the advent of DLPack, but in practice even on the CUDA Array Interface side we never received a single bug report about the missing synchronization guarantee), but if the Python Array API Consortium is going to adopt DLPack as the official zero-copy standard, this will need to be rigorously sorted out as was done in CUDA Array Interface v3. cc: @rgommers (for awareness) |
I think that making synchronization optional might also be an option. In most high-level CUDA tasks, as @leofang has already mentioned, most people do use the array on the same stream. However, forcing synchronization is time-consuming and goes against one of the main benefits of CUDA programming: asynchronous host/device execution. |
Thanks @leofang @kkraus14. The issue of synchronization has been thought about, and it is certainly possible to come up with a specification that allows more asynchronous behavior. See also my previous related comment on this issue here data-apis/consortium-feedback#1 (comment) The main design goal is to get folks to agree on a common set of things. The main problem is not a solution for synchronization, but getting frameworks to agree on the mechanism. Most modern frameworks have their own opinions about allocation and scheduling, and in many cases these opinions vary. For example, both TF-RT and MXNet have their own internal schedulers. New abstractions are also being proposed (e.g. the case of graph-based scheduling). Based on that observation, we concluded that it is hard to get people to agree on the synchronization and allocation mechanism (yet), and resorted to a simpler convention -- we require producer and consumer to be on the same stream. Most protocols that bring in synchronization behavior will inevitably embed a stream or context into the data structure. This defers the synchronization burden to the runtime implementation of the consuming system. While on one hand that is certainly more flexible, it does bring engineering cost to the developers who are supposed to consume the DLPack array. Additionally, few expert programmers want to think about synchronization, while there is a broader range of programmers who may not even consider this perspective and only use one stream throughout their applications. Enabling context in DLPack means this complexity falls on the crowd, rather than on a few experts. It also brings extra burden to the compilers themselves: a compiler would need to generate optional synchronization code based on the streams, which is non-trivial. Finally, the simple protocol (of asking producer/consumer to be on the same stream) does not prevent async computation, as no synchronization is needed if both are on the same stream (which they are, for most apps). Alternative APIs can also be designed to request a DLPack array that is presented in a certain context, for example an optional API like: array.to_dlpack_on_stream(stream_number) Such an API puts the burden of synchronization on the producer, which is not necessarily more complex than putting it on the consumer, but reduces the mental burden on the consumer side (think about common developers who might want to consume the array). To summarize, I think it is good to have a discussion about a synchronization API. However, based on the current state, I think it would be useful to de-couple the discussion of synchronization (which is harder to get consensus on) from the data structure (which most frameworks already agree on). I would also argue that simplicity is not necessarily a bad thing here, since we could build APIs that are just as powerful, but drive adoption in common cases |
To summarize things further, let us discuss the following two variants of possible APIs to support synchronization
From the flexibility PoV, the S1 style API is certainly more flexible. For example, the consumer can choose to continue running computation on the streams provided by the producer. The reality is, however, that most application/framework developers (rightfully) have opinions about what streams to run computation on (attached to their own internal scheduler). So the best thing the consumer can do is run a dependency sync to bring the array onto its internal stream and then consume the data, if the developer knows how to do so. In the case of the S0 style API, simplicity is also a merit -- more developers can agree on and implement such a protocol correctly. While being limited, it won't slow down computation, because it can also support asynchronous computation when possible. It also separates the opinion about the data structure from the opinion about synchronization. |
Hi @veritas9872 @tqchen Thanks for your replies! Looks like @tqchen you have put serious thought into this, and it's great to have you write it up here for future reference 🙂 I would like to kindly nudge you to read the Synchronization section of the CUDA Array Interface (CAI) v3, as many of the factors you raised are already considered thoroughly and covered there. An over-brief summary is below:
All these considerations together guarantee the correctness of computations, which was deemed critical across the libraries at the time (PyTorch, TF, Numba, CuPy, RAPIDS, etc), while leaving flexibility to expert users, in particular advanced HPC/ML users. In the CAI documentation a fairly involved example is given to demonstrate that without such a synchronization requirement a computation would almost certainly go wrong. My conjecture for why CAI (before v3) and DLPack have worked well so far is that most libraries live on CUDA's legacy default stream, which implicitly synchronizes/serializes kernels launched on it. But given that more libraries are adopting the per-thread default stream, this will eventually become a real concern. If DLPack is going to assume the Producer and Consumer live on the same stream, that is fine with me, but it should be clearly stated in the documentation, which is currently not the case. Also, we would need a way to make sure of this if correctness is a top concern, which circles back to the 2nd point above that we need access to the Producer's stream pointer.
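To make the hazard concrete, here is a small sketch of my own (using CuPy, not the example from the CAI docs) of how a consumer on a different stream can race with the producer's still-pending write when no dependency is established:

import cupy as cp

s_prod = cp.cuda.Stream()   # producer's internal stream
s_cons = cp.cuda.Stream()   # consumer's internal stream

with s_prod:
    x = cp.zeros(10_000_000)
    x += 1                  # queued on s_prod; may still be running on the GPU

# Unsafe hand-off: nothing orders work on s_cons after s_prod's pending write.
# with s_cons:
#     y = x * 2             # can overlap the producer's write -> undefined result

# Safe hand-off, under any of the schemes discussed in this thread: record an
# event on s_prod and make s_cons wait on it (a queue dependency, not a host block).
ev = cp.cuda.Event()
ev.record(s_prod)
s_cons.wait_event(ev)
with s_cons:
    y = x * 2               # now guaranteed to observe x == 1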
I totally agree. We had a very hard time when revising v3. Library and framework maintainers have their own opinions. But it's very odd to me that you mentioned TF, because TF was one of the main drivers that were unwilling to adopt CAI unless we specified the synchronization behavior clearly (hence the v3 revision), so it is difficult for me to understand how they are OK with DLPack's (currently) unsafe behavior but gave us tons of headache (and in the end it's still unclear to me whether they'd implement CAI v3). On the API consideration, I think one important distinction between CAI and DLPack is that we do not expose the intermediate object to the User, as all of the information is simply dumped into a dict and generated (consumed) internally by the Producer (Consumer). The benefit is threefold:
So I think this is the reason that you have to consider the S0 vs S1 styles above. To us it's really not an issue if the struct has an additional field to hold the stream pointer (following S1), and Users can simply write (pseudo) code as simple as this (see the CAI doc for a better, practical example):

a = producer.array([1, 2, 3])
b = consumer.asarray(a)  # <-- sync happens here, if a stream pointer is given

that is, there are no explicit calls to any intermediate exchange function. It is necessary to revise DLPack's struct, yes, but to us that is not an issue. Now consider instead that we add an optional stream argument to the Python API:

def to_dlpack(stream=None):
    pass

def from_dlpack(object):
    pass

this has three disadvantages:
Given the above reasons, I just don't think your S0 style would work if stream synchronization is considered an issue. |
A note just on this: we discussed that recently and agreed to change the Python API to from_dlpack(x), backed by an x.__dlpack__() method. |
Taking a step back from CUDA streams: to support synchronization, I think DLPack may need to support other modes of scheduling as well (e.g. barriers). It might make sense to start with a more limited scenario first (i.e. always synchronize) that can be implemented in all abstractions, and then add more interfaces to relax it. |
Right now the options are clearly summarized as the S0 and S1 design choices. The choice of Python API has converged on what @rgommers described. The synchronization perspective of CAI mentioned by @leofang is summarized in S1 (I tried to summarize it after reading the CAI v3 proposal). Right now, DLPack adopts a simple synchronization semantics (per S0), specified as
It would be great to talk about the S0 and S1 style APIs further, perhaps from the framework implementer's PoV. |
Here is my (personal) take on the S1 style API (e.g. CAI v3). First of all, it is certainly possible to specify the runtime-dependent synchronization behavior correctly in an S1 API (CAI v3 should be such an example). Notably, such a specification also brings additional complications to the implementer that consumes a DLPack tensor, specifically in the following scenarios:
Both K0 and K1 are possible barriers to a fully compatible implementation. In contrast, the S0 style API, while limited, is simpler to implement overall, so we are more likely to get fully compatible implementations from frameworks. Our question here is not really about which one is more advanced and flexible (we all agree that S1 is more flexible and "advanced", while S0 is simpler and limits the possible ways to synchronize), but about the trade-off between simplicity and flexibility. |
The reason I don't think either S0 or S1 reaches the heart of the problem is that they differ only in where to store the reference that's necessary for synchronization, which we could coordinate as part of defining the synchronization semantics in DLPack. To me, the key difference between the current DLPack and the potential synchronization extension is that in an asynchronous setting the provider needs to return a promise object. Such a promise object can be a reference that the frameworks on the supplying and consuming ends have common knowledge of how to synchronize on, such as a stream number, but that's not always true. It's entirely possible that the provider supplies a promise whose synchronization depends on a mechanism the consumer has no knowledge of. More generally, the synchronization method need not be coupled with the memory. For example, imagine two frameworks, TF and MX, running on a platform with both a TPU and a GPU, and TF passes an array promise to MX for which the computation happens on the TPU first. It's entirely possible that TF is the only framework that knows how to implement synchronization on the TPU. Requiring frameworks to know how to synchronize may not even be possible, which means that DLPack might have to choose not to support such hardware even if the memory space is addressable by all frameworks. In general, I think it makes sense for the promise object to supply a query function for whether the data is ready or not. Beyond that, I think of CUDA streams, or other synchronization methods commonly supported across frameworks, more as special cases. |
Wait... this brings up a question I've been wondering about: what is DLPack's requirement for cross-architecture transfer (say CPU <-> GPU)? I thought that, for the purpose of enabling zero copy with no surprises, it should error out and such transfers must be done by other means (usually a sane library would implement it). Is such behavior documented somewhere? |
I believe DLPack only does one job at the moment, which is to facilitate the exchange of data in commonly addressable memory spaces. It does not yet handle data transfer across devices. What I described above is the scenario where the final memory address is being filled asynchronously as part of the final step of computation on a special device. |
Yeah I think cudaMallocAsync etc appear in CUDA 11.2. |
The scenario is a more complex version of what was considered in CAI v3, so in this regard I agree with you that CAI v3 is a special case, but well, it was meant for CUDA only 😗 I brought up CAI v3 mainly to emphasize and echo the original issue: the synchronization (or lack of it) behavior at the interoperating points should be considered, determined, and documented. |
Yes, for practical reasons I think it's valid (and most likely desired), and I'm happy to continue discussion on coordinating CUDA synchronization as a special case. My above comments are meant to make it clear that the same abstraction doesn't necessarily apply generally. |
Great discussion; personally I do not have a strong opinion on the style of possible synchronizations, although both memory copies and other operations (allocations) can have a stream-based abstraction, so the S0 specification is still sound (although limited). One observation from the current discussion is that while most of us agree on the data structure, there are many opinions about the possible synchronization API (e.g. explicit synchronization, stream-based, or an opaque future/promise). Each of them brings some level of complexity and its own pros and cons. My main point is to ideally de-couple the standardization of synchronization (which is more complicated) from the standardization of the data structure itself. This would, however, mean we do not couple synchronization-related data (e.g. a stream) into DLTensor, and limit synchronization to an S0-level requirement at the DLPack level. I would also argue that the S0-level semantics is simple and self-contained for now, and can handle async settings (via the default stream or the same stream). Additional discussions about synchronization API standardization are great; we could open a separate thread for that :) |
Let me attempt to summarize the proposals so far and their pros and cons.

S0: Producer Sync based on Stream Specified by Consumer

def __dlpack__(self, stream=None):
    """Get a DLTensor capsule that is visible in stream.

    The producer will take charge of the dependency synchronization to stream.
    If stream is None, it defaults to the default stream (0 in CUDA).
    """
    pass

Pros
Cons:
A further simplification of S0 would be to require the exchange to always be synced to the default stream (note this is still different from explicit synchronization); this would remove the stream argument from the __dlpack__ signature.

S1: Producer Stores the Stream in the Exchange Data Structure, Defers Sync to Consumer

Mostly the API style described by @leofang, CAI-style synchronization.

class DLManagedTensor:
    stream: int

def __dlpack__(self):
    """Return an exchange structure that stores the stream."""
    pass

Pros:
Cons:
S2: Producer Stores an Opaque Promise Object in the Exchange Data Structure, Defers Sync to Consumer

Proposal by @szha.

class DLManagedTensor:
    sync: function<()>
    ready: function<()>

def __dlpack__():
    """Return an exchange structure that stores the promise object."""
    pass

Pros:
Cons:
|
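For illustration only, here is a rough consumer-side sketch (my own, in Python with CuPy; exchange_struct and its fields follow the S1/S2 shapes above, and the names are hypothetical) of the burden that S1/S2 defer to the consumer before it may touch the data:

import cupy as cp

def establish_dependency(exchange_struct, consumer_stream):
    """What a consumer would have to do under S1/S2 before using the tensor."""
    if getattr(exchange_struct, "stream", None) is not None:
        # S1: the struct carries the producer's stream as an integer handle.
        producer_stream = cp.cuda.ExternalStream(exchange_struct.stream)
        event = cp.cuda.Event()
        event.record(producer_stream)        # capture the producer's pending work
        consumer_stream.wait_event(event)    # queue-level dependency, non-blocking
    elif getattr(exchange_struct, "sync", None) is not None:
        # S2: the struct carries an opaque promise with ready()/sync() callables.
        if not exchange_struct.ready():
            exchange_struct.sync()           # opaque, possibly host-blocking
    # Only after this point can the consumer convert the capsule and launch its
    # own kernels on consumer_stream.

Note how the lifetime question surfaces immediately: the consumer has to trust that the stream handle stored in the struct is still alive when it records the event.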
My current proposal is to start simple and adopt an S0 style API. The simplicity and benefit brought by A1 should be sufficient for most of the high-performance cases. In reality I believe the default stream is the only one that different frameworks could have agreed on, given that frameworks have their own opinions about internal stream context. This means the power of B0 in S1 is rarely used. Additionally, A2 is critical for framework adoption, as we want the simple support to be implemented correctly for most of the cases, while additional effort can be put into bringing in stronger support if necessary. |
S0 with
|
@rgommers it applies to driver APIs that have a global stream space that can be converted to an integer, so for now that is the CUDA and ROCm case. In such cases, streams are guaranteed to be the same thing within the same process. For other driver APIs, the producer and consumer need to agree on the convention of a "default stream". The lack of a global default is a limitation of those APIs considering the needs of exchange -- in the single-app scenario, of course, being able to create an app-specific context was considered flexible, but such flexibility limits the sharing of context between apps brought by standardization. The producer and consumer need to agree on a default context in this case. A possible later extension could be to create such a global context table for other device APIs, which each application can access. 'None' (instead of '0') indicates the default stream, so it is unambiguous, but I believe '0' is also fine. You are right that |
If
From what I can tell, the only way to guarantee that the stream isn't prematurely destroyed here is to either synchronize it before creating dlpack already has lifetime management semantics for the memory in the form of the deleter, and I'd argue we need something similar for any stream resources used. We could say that synchronization needs to happen by producers before handing off the
Isn't this just a way of saying "I prefer S1 over S0", or "I want optimal performance even when both libraries use non-default streams, at the cost of implementation complexity"? The trade-off here has some unstated assumption about how often this kind of data interchange is used. If one is gluing together two ML models by exchanging some final or intermediate output once in a while, the occasional extra sync when both of those models use non-default streams isn't a big deal. If you have use cases that use
As a meta-comment: I do not think it's necessarily the right goal to end up with essentially a copy of CAI v3. Other options include:
|
This isn't true in many cases. Take for example a library which uses the default stream asynchronously, which is not uncommon. If I'm forced to synchronize the stream to make the memory immediately usable by any other stream, I'm required to synchronize the entire device, which prevents me from doing any other work on the device until that synchronization is finished. I'd argue this is a big deal.
I 100% agree that it shouldn't be a copy, but if we're interested in being an interchange protocol for devices that have an asynchronous execution model and an asynchronous memory model, then we should really support that asynchrony in the interchange protocol.
I personally don't like that we're passing things like memory pointers / streams around as integers in Python when we could be passing some type of object-oriented wrapper that handles the lifetime management instead, but we followed the spec for
We are getting back to the argument of the S1 style API vs the S0 style API :) It would be great to dissect the argument a bit, though. Right now the discussion goes back to full synchronization vs async handling, and implies that "S0 == sync entire device, S1 == async handling". This is not necessarily true. I want to highlight that S0 does not imply synchronization of the entire device, and was never meant to say so. The decision of synchronization vs async handling can be made by the implementer in either the S0 or the S1 setting. Let us discuss how async exchange can be handled in S0:
// The event can also be created on the fly, or a synchronizer object can be created and cached.
// We could build an auxiliary function that can be called from the Python side if that helps the frameworks.
void PushStreamDep(cudaStream_t src, cudaStream_t dst) {
  cudaEvent_t event;
  cudaEventCreate(&event);
  // Record the producer's pending work on src, then make dst wait for it.
  // This establishes a queue-level dependency without blocking the host.
  cudaEventRecord(event, src);
  cudaStreamWaitEvent(dst, event, 0);
  cudaEventDestroy(event);
}
Note that neither W0 nor W1 requires synchronization of the entire device, and both fall under the S0 API. The only difference from S1 is that in both W0 and W1 the producer and consumer streams are managed by the producer and consumer only, without having to worry about the stream lifecycle due to exportation. This is again another simplicity brought by the S0 style handling. To summarize, I think we all agree that it is ideal to handle the exchange asynchronously when possible, and both the S0 and S1 style APIs have mechanisms to do so. |
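For completeness, here is a rough Python-level equivalent of the PushStreamDep helper above, sketched with CuPy's stream/event wrappers under the assumption that raw CUDA stream handles are exchanged as integers:

import cupy as cp

def push_stream_dep(src_ptr: int, dst_ptr: int) -> None:
    # Make work submitted later to dst wait for work already queued on src,
    # without blocking the host (mirrors the CUDA snippet above).
    src = cp.cuda.ExternalStream(src_ptr)
    dst = cp.cuda.ExternalStream(dst_ptr)
    event = cp.cuda.Event(block=False, disable_timing=True)
    event.record(src)
    dst.wait_event(event)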
Another case we covered in We decided to handle this using
What would C/C++ libraries do? How do a producer C/C++ library and a consumer C/C++ library exchange stream information in a standard way? CUDA streams could likely be an implementation detail of the library where they're not exposed to a user developing with the library. If we want to avoid putting streams, events, etc. into the
Unfortunately, calling |
@kkraus14 I believe a common util might be useful, although it should be simple enough for frameworks to just have their own impl |
Can someone remind me why we wanna sync on the Consumer's stream, not the Producer's? Isn't the whole point of synchronization to wait until the Producer is ready to hand out the data? Did I remember it wrong? |
@leofang The "sync" (PushDep) happens asynchronously, and marks a dependency in the queue, which won't block the producer or the consumer, at least in the context of CUDA |
Thanks for the quick reply and for the quotation marks around "sync", @tqchen! But if the Producer does not expose a stream or event handle (in CUDA), how does the Consumer establish such a stream order (the dependency you referred to)? |
In the current API, the producer implements |
I'm coming to this discussion late, and I have not yet had a chance to internalize all the prior comments. The way I understand the problem, If my understanding is correct, this does not generalize to SYCL. For the consumer to ensure that SYCL kernels submitted by the producer have finished updating the memory in the array to be shared, the
@oleksandr-pavlyk I am not that familiar with SYCL. If SYCL has constructs like a fence (barrier) that can depend on events, an analogy to CUDA's |
Ah right, thanks for the reminder @tqchen. @oleksandr-pavlyk I think if SYCL's queues and events are analogous to CUDA's streams and events, |
It is not possible in SYCL to insert a barrier into an out-of-order queue that will make all further submissions to the queue dependent on that event. It is possible to do it synchronously from the host with In view of this, an implementation of To synchronize in an out-of-order queue requires passing an event, which will be used when submitting new kernels (e.g. see the USM version of GEMM in oneMKL). Regarding whether SYCL handles can be passed as integers in Python, I am afraid not (just as shared pointers cannot be passed as integers in Python). The dpctl Python package exposes The opaque pointers to SYCL objects can of course be passed around as named |
Thanks @oleksandr-pavlyk. I agree that in such cases we would require exchanges through in-order queues provided by the consumer. To adapt to this case when the consumer wants to operate on out-of-order queues, the consumer can first create an in-order queue just for the exchange, create an event X after the exchange finishes, and use X as a dependency for future out-of-order submissions that depend on the data. Likely this won't create additional overhead other than the event dependency tracking. Such an adaptation would decouple the issues of synchronization and lifecycle management from the data structure. Speaking from a framework implementer's PoV: the generalization to arbitrary event dependency chains, although flexible, creates additional overhead during exchange for lifecycle management, asynchronization conventions, etc. Having a simpler layered view (e.g. a default in-order queue) would help SYCL in the long run, and here we can learn from CUDA (simple and good enough for most cases) |
@tqchen In SYCL, the consumer would always need to provide an in-order queue to ensure synchronization (since queues are out-of-order by default). Not passing any queue would require the There is a SYCL extension proposed to enable one to asynchronously wait on an event on the entire queue, which may allow Even so, using an in-order queue is going to be more performant. |
Let me try to summarize the current state of the discussion as well as point out another missing piece (per-GPU streams). Here is a complete proposal under S0.

S0a: Producer Sync based on Stream Specified by Consumer

def __dlpack__(self, stream=None):
    """Get a DLTensor capsule that is visible in stream.

    The producer will take charge of the dependency synchronization to stream.
    If stream is None, it defaults to the legacy default stream.
    """
    pass

def __dlpack_device__(self) -> Tuple[int, int]:
    """Return a (device_type, device_id) pair in the DLPack convention."""

# consumer code:
def from_dlpack(producer):
    device = producer.__dlpack_device__()
    consumer_stream = consumer.find_exchange_stream(device)
    dlpack_caps = producer.__dlpack__(consumer_stream)
    return convert_to_consumer_array(dlpack_caps)

Note that most systems associate different streams with a particular device, so if we want to use a non-default stream, knowing which stream to synchronize to is important. As discussed earlier, the main benefits of this style of API are:
Given the need to move on quickly, we can also start with a reduced version, and continue more discussions on stream exchange.

S0reduced: Producer Syncs to the Legacy Default Stream

def __dlpack__(self):
    """Get a DLTensor capsule that is visible in the legacy default stream.

    The producer will take charge of the dependency synchronization to the
    legacy default stream of the corresponding device.
    """
    pass

This way the |
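To show how little the producer has to do under S0a, here is a sketch of my own of a possible producer-side implementation (assuming CuPy; MyArray and its internal stream are hypothetical, and device type 2 is DLPack's kDLCUDA):

import cupy as cp

class MyArray:
    """Hypothetical producer array whose work runs on an internal stream."""

    def __init__(self, data: cp.ndarray, stream: cp.cuda.Stream):
        self._data = data        # produced asynchronously on `stream`
        self._stream = stream

    def __dlpack_device__(self):
        return (2, self._data.device.id)      # (kDLCUDA, device_id)

    def __dlpack__(self, stream=None):
        # None means the consumer will use the legacy default stream.
        consumer = cp.cuda.Stream.null if stream is None else cp.cuda.ExternalStream(stream)
        # Dependency push: the consumer's stream waits on our pending work.
        event = cp.cuda.Event()
        event.record(self._stream)
        consumer.wait_event(event)
        return self._data.toDlpack()          # capsule now safe to use on `consumer`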
I think S0a is the way to go. |
Good catch regarding the need for
I agree, adding this second method makes sense and it's not too likely there's something that would need changing later. |
Seems like we've aligned on the Python side with having __dlpack__ and __dlpack_device__. Presumably we need to solve this same problem for C libraries now. Should I open a new issue or do we want to continue in this issue? |
Thanks @kkraus14, how about we open a new issue? Given that there is no standardized interface for C library exchange, perhaps we could add that as an interface recommendation?
Summary: Partially Fixes #55090 Depends on #55365 Inspired by dmlc/dlpack#57 (comment) Questions, in PyTorch we can't create streams or easily synchronize them from just an integer. Should we add an [`ExternalStream`](https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.ExternalStream.html) object like the one we have in CuPy? TODO: Add tests Would like some feedback as this design needs quite a few iterations rgommers leofang Pull Request resolved: #57110 Reviewed By: saketh-are Differential Revision: D30761481 Pulled By: mruberry fbshipit-source-id: e85d78df3c1f8defc2a698878da89cd843cb1209
From the CUDA side, nearly all of the DL frameworks and array libraries that support dlpack use CUDA streams and stream-order both their computations and memory allocations. In its current form, dlpack doesn't specify any synchronization semantics, nor does it give a producer-consumer pair a way to exchange the information necessary to continue stream-ordering computations.
I imagine there's a similar problem in other contexts as well (OpenCL, ROCm, etc.) where maybe it's possible to generalize an approach.
cc @oleksandr-pavlyk