
Conversation

@MasterJH5574
Contributor

This PR introduces a new function `GetCurrentStream` to the device API, which returns the current stream of the given device.

Meanwhile, this PR updates CUDA's `CreateStream` to create a non-blocking stream, so that execution on this stream can overlap with execution on other streams.

This PR also changes `GPUCopy` in the CUDA device API to always use `cudaMemcpyAsync`.

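For readers less familiar with the CUDA calls involved, a minimal sketch of the two CUDA-side changes described above (creating a non-blocking stream and copying asynchronously) could look like the following. The buffer names and sizes are illustrative and not taken from the TVM implementation.

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
  // A non-blocking stream does not implicitly synchronize with the legacy
  // default stream, so work issued on it can overlap with other streams.
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  const size_t nbytes = 1024 * sizeof(float);
  std::vector<float> host(1024, 1.0f);
  float* device_buf = nullptr;
  cudaMalloc(&device_buf, nbytes);

  // cudaMemcpyAsync returns immediately; the copy is complete only after the
  // stream has been synchronized (or an event recorded on it has been waited on).
  cudaMemcpyAsync(device_buf, host.data(), nbytes,
                  cudaMemcpyHostToDevice, stream);
  cudaStreamSynchronize(stream);  // explicit sync before relying on the data

  cudaFree(device_buf);
  cudaStreamDestroy(stream);
  return 0;
}
```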
@tqchen tqchen merged commit 48992a4 into apache:main Mar 9, 2024
Lunderberg pushed a commit to Lunderberg/tvm that referenced this pull request Mar 12, 2024
@Lunderberg
Contributor

Lunderberg commented Mar 13, 2024

This PR also changes `GPUCopy` in the CUDA device API to always use `cudaMemcpyAsync`.

I think this portion of the commit needs to be reverted. Prior to this commit, `NDArray::CopyTo` could be called to transfer an array to/from the GPU and return the transferred array. After this commit, there is no synchronization point after the `cudaMemcpyAsync` before control returns to the caller of `NDArray::CopyTo`.

  • The caller may read from the `NDArray` result immediately after `CopyTo` returns. After this commit, this is a read from uninitialized memory (a sketch of this race follows below).
  • The caller may free the backing allocation of the `NDArray` argument immediately after `NDArray::CopyTo` returns. After this commit, this causes CUDA to read from a dangling pointer.

This function is used in many locations which relied on the previous semantics. For example:
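To illustrate the first hazard in isolation (this is a generic sketch, not one of the TVM call sites referenced above), here is a minimal CUDA example with illustrative buffer names: reading the destination before the asynchronous copy has completed is a race.

```cpp
#include <cuda_runtime.h>

int main() {
  const size_t n = 1024;
  float* host = nullptr;
  cudaMallocHost(&host, n * sizeof(float));  // pinned, so the copy is truly async

  float* device_buf = nullptr;
  cudaMalloc(&device_buf, n * sizeof(float));
  cudaMemset(device_buf, 0, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  cudaMemcpyAsync(host, device_buf, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  // BUG (the scenario described above): the copy may still be in flight,
  // so this read can observe whatever the pinned buffer held beforehand.
  volatile float maybe_stale = host[0];
  (void)maybe_stale;

  cudaStreamSynchronize(stream);  // only after this is `host` valid to read
  cudaFree(device_buf);
  cudaFreeHost(host);
  cudaStreamDestroy(stream);
  return 0;
}
```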

@tqchen
Member

tqchen commented Mar 13, 2024

Indeed, I agree that this makes things more relaxed than the previous behavior.

On the other hand, from the device API's point of view, we don't really guarantee sync behavior in the generic DeviceAPI:

  • In most GPU APIs, copies both to/from the host and across devices are async (e.g., in the case of Metal or Vulkan).
  • The default CUDA sync behavior of CopyFromTo actually applied mainly to the default stream, so it was somewhat specific to CUDA under the default-stream setting.

One possible middle ground is to update `CopyTo` to always run a `StreamSync` before `CopyTo` returns. This would preserve the original usage of `CopyTo`, while still allowing the low-level device API to use async copy behavior, which generally provides more optimization opportunities.
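A minimal sketch of that middle ground, using plain CUDA calls and a hypothetical wrapper name (`CopyToSync`) rather than the actual DeviceAPI/NDArray code: the copy is still issued asynchronously on the device's current stream, but the wrapper synchronizes before returning, so callers keep the old "data is ready on return" semantics.

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical wrapper illustrating the proposal: the low-level copy stays
// asynchronous, but the high-level entry point syncs before it returns.
void CopyToSync(void* dst, const void* src, size_t nbytes, cudaStream_t stream) {
  cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDefault, stream);  // async issue
  cudaStreamSynchronize(stream);  // StreamSync-equivalent before returning
}
```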

@tqchen
Member

tqchen commented Mar 13, 2024

Actually, it is great that this topic got raised! Since the original logic would cause issues for backends like Metal/Vulkan, where these copies are async, we also explicitly documented the possible sync/async behavior in the NDArray interface (a sketch follows the list):

  • `CopyFromBytes`/`CopyToBytes` are always sync
  • `CopyTo(NDArray)` can be async
  • `CopyTo(Device)` should be changed to sync
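A hedged sketch of what that documented contract could look like as interface comments; the signatures below are illustrative and may not match the actual NDArray header.

```cpp
#include <cstddef>

struct Device { int device_type; int device_id; };  // stand-in for DLDevice

class NDArray {
 public:
  // Always synchronous: these return only once the bytes are visible
  // to the caller.
  void CopyFromBytes(const void* data, size_t nbytes);
  void CopyToBytes(void* data, size_t nbytes) const;

  // May be asynchronous: the copy is queued on the device's current
  // stream; the caller must synchronize before reading `other`.
  void CopyTo(const NDArray& other) const;

  // Synchronous: returns only after the transferred array is valid.
  NDArray CopyTo(const Device& dev) const;
};
```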

@tqchen
Member

tqchen commented Mar 13, 2024

#16716 contains the follow-up.

@Lunderberg
Contributor

Thank you for the quick turnaround on the fix; I like it. I agree that most GPU frameworks are asynchronous by design, and by necessity. My concern was mainly that it was a change in the existing behavior.

The default CUDA sync behavior of CopyFromTo actually applied mainly to the default stream, so it was somewhat specific to CUDA under the default-stream setting.

Ah, I had thought that was intentional: absent any explicit opt-in, GPU operations would be synchronized when attempting to read, but sequences of GPU operations would be asynchronous with respect to each other, and with the stream parameter, transfers to the GPU would also be async.

I like the change: the most common API is synchronous, while the internal APIs are asynchronous.

thaisacs pushed a commit to thaisacs/tvm that referenced this pull request Apr 3, 2024