
Conversation

@MasterJH5574
Contributor

This PR introduces a new function `GetCurrentStream` to the device API, which returns the current stream of the given device.

Meanwhile, this PR updates CUDA's `CreateStream` to create a non-blocking stream, so that execution on this stream can overlap with execution on other streams.

This PR also changes `GPUCopy` in the CUDA device API to always use `cudaMemcpyAsync`.

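For readers less familiar with the CUDA calls involved, a minimal sketch of the two CUDA-side changes described above (creating a non-blocking stream and copying asynchronously) could look like the following. The buffer names and sizes are illustrative and not taken from the TVM implementation.

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
  // A non-blocking stream does not implicitly synchronize with the legacy
  // default stream, so work issued on it can overlap with other streams.
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  const size_t nbytes = 1024 * sizeof(float);
  std::vector<float> host(1024, 1.0f);
  float* device_buf = nullptr;
  cudaMalloc(&device_buf, nbytes);

  // cudaMemcpyAsync returns immediately; the copy is complete only after the
  // stream has been synchronized (or an event recorded on it has been waited on).
  cudaMemcpyAsync(device_buf, host.data(), nbytes,
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);  // explicit sync before relying on the data

  cudaFree(device_buf);
  cudaStreamDestroy(stream);
  return 0;
}
```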
@tqchen tqchen merged commit 48992a4 into apache:main Mar 9, 2024
Lunderberg pushed a commit to Lunderberg/tvm that referenced this pull request Mar 12, 2024
@Lunderberg
Contributor

Lunderberg commented Mar 13, 2024

This PR also changes `GPUCopy` in the CUDA device API to always use `cudaMemcpyAsync`.

I think this portion of the commit needs to be reverted. Prior to this commit, `NDArray::CopyTo` could be called to transfer an array to/from the GPU and return the transferred array. After this commit, there is no synchronization point after the `cudaMemcpyAsync` before control returns to the caller of `NDArray::CopyTo`.

  • The caller may read from the `NDArray` result immediately after `CopyTo` returns. After this commit, this is a read from uninitialized memory (a sketch of this race follows below).
  • The caller may free the backing allocation of the `NDArray` argument immediately after `NDArray::CopyTo` returns. After this commit, this causes CUDA to read from a dangling pointer.

This function is used in many locations which relied on the previous semantics. For example:
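To illustrate the first hazard in isolation (this is a generic sketch, not one of the TVM call sites referenced above), here is a minimal CUDA example with illustrative buffer names: reading the destination before the asynchronous copy has completed is a race.

```cpp
#include <cuda_runtime.h>

int main() {
  const size_t n = 1024;
  float* host = nullptr;
  cudaMallocHost(&host, n * sizeof(float));  // pinned, so the copy is truly async

  float* device_buf = nullptr;
  cudaMalloc(&device_buf, n * sizeof(float));
  cudaMemset(device_buf, 0, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  cudaMemcpyAsync(host, device_buf, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  // BUG (the scenario described above): the copy may still be in flight,
  // so this read can observe whatever the pinned buffer held beforehand.
  volatile float maybe_stale = host[0];
  (void)maybe_stale;

  cudaStreamSynchronize(stream);  // only after this is `host` valid to read
  cudaFree(device_buf);
  cudaFreeHost(host);
  cudaStreamDestroy(stream);
  return 0;
}
```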

@tqchen
Member

tqchen commented Mar 13, 2024

Indeed, I agree that this makes things more relaxed than the previous behavior.

On the other hand, from the device API's point of view, we don't really guarantee sync behavior in the generic DeviceAPI:

  • In most GPU APIs, copies both to/from the host and across devices are async (e.g., in the case of Metal or Vulkan).
  • The default CUDA sync behavior of CopyFromTo actually applied mainly to the default stream, so it was somewhat specific to CUDA under the default-stream setting.

One possible middle ground is to update `CopyTo` to always run a `StreamSync` before `CopyTo` returns. This would preserve the original usage of `CopyTo`, while still allowing the low-level device API to use async copy behavior, which generally provides more optimization opportunities.
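A minimal sketch of that middle ground, using plain CUDA calls and a hypothetical wrapper name (`CopyToSync`) rather than the actual DeviceAPI/NDArray code: the copy is still issued asynchronously on the device's current stream, but the wrapper synchronizes before returning, so callers keep the old "data is ready on return" semantics.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical wrapper illustrating the proposal: the low-level copy stays
// asynchronous, but the high-level entry point syncs before it returns.
void CopyToSync(void* dst, const void* src, size_t nbytes, cudaStream_t stream) {
  cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDefault, stream);  // async issue
  cudaStreamSynchronize(stream);  // StreamSync-equivalent before returning
}
```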

@tqchen
Member

tqchen commented Mar 13, 2024

Actually, it is great that this topic got raised! Since the original logic would cause issues for backends like Metal/Vulkan, where these copies are async, we also explicitly documented the possible sync/async behavior in the NDArray interface (a sketch follows the list):

  • `CopyFromBytes`/`CopyToBytes` are always sync
  • `CopyTo(NDArray)` can be async
  • `CopyTo(Device)` should be changed to sync
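A hedged sketch of what that documented contract could look like as interface comments; the signatures below are illustrative and may not match the actual NDArray header.

```cpp
#include <cstddef>

struct Device { int device_type; int device_id; };  // stand-in for DLDevice

class NDArray {
 public:
  // Always synchronous: these return only once the bytes are visible
  // to the caller.
  void CopyFromBytes(const void* data, size_t nbytes);
  void CopyToBytes(void* data, size_t nbytes) const;

  // May be asynchronous: the copy is queued on the device's current
  // stream; the caller must synchronize before reading `other`.
  void CopyTo(const NDArray& other) const;

  // Synchronous: returns only after the transferred array is valid.
  NDArray CopyTo(const Device& dev) const;
};
```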

@tqchen
Member

tqchen commented Mar 13, 2024

#16716 contains the follow-up.

@Lunderberg
Contributor

Thank you for the quick turnaround on the fix; I like it. I agree that most GPU frameworks are asynchronous by design, and by necessity. My concern was mainly that it was a change in the existing behavior.

The default CUDA sync behavior of CopyFromTo actually applied mainly to the default stream, so it was somewhat specific to CUDA under the default-stream setting.

Ah, I had thought that was intentional: absent any explicit opt-in, GPU operations would be synchronized when attempting to read, but sequences of GPU operations would be asynchronous with respect to each other, and with the stream parameter, transfers to the GPU would also be async.

I like the change: the most common API is synchronous, while the internal APIs are asynchronous.

thaisacs pushed a commit to thaisacs/tvm that referenced this pull request Apr 3, 2024