[PoC]: Implement cuda::experimental::uninitialized_buffer
#1831
Conversation
🟨 CI finished in 2h 48m: Pass: 95%/361 | Total: 2d 04h | Avg: 8m 44s | Max: 40m 50s | Hits: 96%/481443
| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| | CUB |
| | Thrust |
| | CUDA Experimental |

Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |

🏃 Runner counts (total jobs: 361)

| # | Runner |
|---|---|
| 264 | linux-amd64-cpu16 |
| 52 | linux-amd64-gpu-v100-latest-1 |
| 24 | linux-arm64-cpu16 |
| 21 | windows-amd64-cpu16 |
Force-pushed 4631d6b to 04f31bc
Is this not stream-ordered? Why? In RAPIDS a
🟨 CI finished in 5h 41m: Pass: 99%/361 | Total: 1d 22h | Avg: 7m 47s | Max: 42m 51s | Hits: 97%/519879
We discussed this briefly yesterday in our code review hour, and I advocated for having a container interface instead of a

We also discussed the untyped interface, but concluded it to be too dangerous and less practical. You would have to reinterpret the storage before you could do anything useful. And if you needed a buffer of bytes, you could just instantiate the buffer with

However, I later concluded that maybe what I really want (and what's also safer) is a policy to
The reason this is not stream-ordered is that I want to add a

That said, there will be no real difference between these buffers, so I did not want to write both just to change both when there are comments.
I believe that these are orthogonal issues. A simple

As we discussed
🟩 CI finished in 15h 15m: Pass: 100%/361 | Total: 2d 00h | Avg: 8m 03s | Max: 1h 02m | Hits: 97%/522432
🟨 CI finished in 8h 04m: Pass: 99%/361 | Total: 4d 01h | Avg: 16m 08s | Max: 1h 02m | Hits: 81%/519938
There is a big and important difference between a stream-ordered and non-stream-ordered buffer: the memory owned by a stream-ordered buffer object may still be valid after the object is destroyed. Consider:
Stream-ordering and RAII object lifetime don't mix together very cleanly, and I've never been able to come up with a sane way to reconcile them other than to just pretend the problem doesn't exist :).
🟩 CI finished in 8h 04m: Pass: 100%/361 | Total: 4d 01h | Avg: 16m 08s | Max: 1h 02m | Hits: 81%/521640
🟨 CI finished in 9h 39m: Pass: 99%/361 | Total: 1d 19h | Avg: 7m 15s | Max: 46m 29s | Hits: 97%/520789
As mentioned in the other PR,

As mentioned in your async_buffer PR, I don't think you should synchronize in the dtor.
🟩 CI finished in 1d 02h: Pass: 100%/361 | Total: 1d 20h | Avg: 7m 19s | Max: 46m 29s | Hits: 97%/521640
🟨 CI finished in 11h 43m: Pass: 99%/417 | Total: 2d 00h | Avg: 6m 55s | Max: 48m 40s | Hits: 95%/523630
Really looking forward to having this utility!
Just two minor, non-blocking, optional comments related to docs.
cudax/include/cuda/experimental/__container/uninitialized_buffer.h
🟨 CI finished in 8h 16m: Pass: 99%/417 | Total: 4d 01h | Avg: 13m 58s | Max: 1h 47m | Hits: 75%/523960
It serves no purpose as it only ever forwards via ADL and also breaks older nvcc
`cuda::uninitialized_buffer` provides an allocation of `N` elements of type `T` utilizing a `cuda::mr::resource` to allocate the storage. `cuda::uninitialized_buffer` takes care of alignment and deallocation of the storage. The user is required to ensure that the lifetime of the memory resource exceeds the lifetime of the buffer.
🟨 CI finished in 4h 28m: Pass: 99%/417 | Total: 4d 04h | Avg: 14m 25s | Max: 1h 07m | Hits: 61%/524077
Pull Request is not mergeable
🟩 CI finished in 5h 41m: Pass: 100%/417 | Total: 4d 05h | Avg: 14m 32s | Max: 1h 07m | Hits: 61%/525816
Why would we merge a PR that has [PoC] in the title? I would have reviewed more carefully...
Two things:

It is easy to make mistakes with uninitialized memory, but many are willing to take the risk as a performance optimization. To minimize that risk, applications often work with uninitialized memory as follows:

I think that this design does not currently account for the above (but could maybe be extended later! see below):

I think there is room for a utility that does not account for any of the above:
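A minimal sketch of that common pattern (plain host C++ for illustration, not the cudax API): obtain raw storage, run a single explicit initialization step, and only then treat the memory as initialized.

```cpp
#include <cstddef>
#include <memory>
#include <new>

// Sketch of the allocate -> initialize-once -> use -> release pattern.
int run_pattern() {
  constexpr std::size_t n = 4;
  // (1) raw, uninitialized storage; no int objects exist yet
  void* raw = ::operator new(n * sizeof(int));
  int* first = static_cast<int*>(raw);
  // (2) the one initialization step: construct every element exactly once
  std::uninitialized_fill_n(first, n, 42);
  // (3) from here on, the memory may be treated as initialized
  int sum = 0;
  for (std::size_t i = 0; i < n; ++i) sum += first[i];
  // (4) int is trivially destructible, so just release the storage
  ::operator delete(raw);
  return sum;
}
```

Reads are only valid after step (2); the risk the comment describes comes from code paths that reach step (3) without it.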
Thanks a lot for the comment, much appreciated. I believe that we need to better explain the intent of this type. What it does is take away many of the sharp edges that raw allocations have.

That is the whole purpose of this class, and I believe it can be incredibly useful as a building block for higher-level types such as

Looking forward to seeing this inside
* Drop `cuda::get_property` CPO: it serves no purpose as it only ever forwards via ADL and also breaks older nvcc
* Ensure that we test memory resources
* Implement `cuda::uninitialized_buffer`: provides an allocation of `N` elements of type `T` utilizing a `cuda::mr::resource` to allocate the storage. `cuda::uninitialized_buffer` takes care of alignment and deallocation of the storage. The user is required to ensure that the lifetime of the memory resource exceeds the lifetime of the buffer.
It would be nice to have utilities for easy conversion from RMM's

But I agree it would be nice to be able to use those only between allocation and initialization inside a kernel, and to (cheaply) convert them to initialized types or spans afterwards.
This `uninitialized_buffer` provides an allocation of `N` elements of type `T` utilizing a `cuda::mr::resource` to allocate the storage. The buffer takes care of alignment and deallocation of the storage. The user is required to ensure that the lifetime of the memory resource exceeds the lifetime of the buffer.

Design choices:

- `uninitialized` is in the name. Reading uninitialized memory is one of the more common security issues in C++, so we want it to be in the name.
- `T`: what about alignment? We do not want to put that on the user, but `uninitialized_buffer` does it for you.
- `data()`, `size()` and `to_span()` member functions.

What it is not: there is `unique_ptr<T[]>`, but that is a horrible design:

- `unique_ptr` does initialize its element(s), which we explicitly do not want
- `unique_ptr` in its current form has no way of passing in the allocator / memory resource used
- `unique_ptr` does not offer the common `size()` and `data()` interface