-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cuda::device::memcpy_async_tx #405
Conversation
f400754
to
3398386
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review, this is probably blocked on the other PRs about arrive_tx and the TMA intrinsics.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
Copies `size` bytes from global memory `src` to shared memory `dest` and arrives | ||
on a shared memory barrier `bar`, updating its transaction count by `size` | ||
bytes. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs should mention preconditions and effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the preconditions. I really don't grok what should be documented under effects, so I have left that as is. The current memcpy_async docs didn't have anything I could use as inspiration.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine. If we ever come up with the formulation for pipeline_tx
we'll want to move the function into its own header and reuse much of the code between pipeline and barrier versions, but since that's very much a future thing, this is sufficient for now.
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
// unconditionally. This permits checking for the feature in host code. When | ||
// __CUDA_MINIMUM_ARCH__ is available, we only enable the feature when the | ||
// hardware supports it. | ||
#if (!defined(__CUDA_MINIMUM_ARCH__)) || (defined(__CUDA_MINIMUM_ARCH__) && 900 <= __CUDA_MINIMUM_ARCH__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
3398386
to
9aab0cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. I have moved the docs, added tests, and improved the docs.
The only thing that I am still unsure of is the what the effects section should list. @gonzalobg
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
Copies `size` bytes from global memory `src` to shared memory `dest` and arrives | ||
on a shared memory barrier `bar`, updating its transaction count by `size` | ||
bytes. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did the preconditions. I really don't grok what should be documented under effects, so I have left that as is. The current memcpy_async docs didn't have anything I could use as inspiration.
libcudacxx/docs/extended_api/synchronization_primitives/barrier/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
5ea0488
to
7ac7ce7
Compare
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/synchronization_primitives/barrier.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/.upstream-tests/test/cuda/memcpy_async/memcpy_async_tx.pass.cpp
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. Suggestions have been implemented and branch has been rebased on top of main.
libcudacxx/.upstream-tests/test/cuda/memcpy_async/memcpy_async_tx.pass.cpp
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
9f85f3c
to
abe3457
Compare
libcudacxx/.upstream-tests/test/cuda/memcpy_async/memcpy_async_tx.pass.cpp
Outdated
Show resolved
Hide resolved
libcudacxx/include/cuda/std/detail/libcxx/include/__cuda/barrier.h
Outdated
Show resolved
Hide resolved
abe3457
to
b886685
Compare
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
libcudacxx/docs/extended_api/asynchronous_operations/memcpy_async_tx.md
Outdated
Show resolved
Hide resolved
…ync_tx.md Co-authored-by: gonzalobg <[email protected]>
Thanks a lot for adding this new feature 🎉 |
Description
Closes #404
Add
cuda::device::memcpy_async_tx
for global to shared memory loads.Checklist