Ensure that `cuda_memory_resource` allocates memory on the proper device #2073

Conversation
🟨 CI finished in 2h 13m: Pass: 99%/417 | Total: 2d 06h | Avg: 7m 54s | Max: 58m 10s | Hits: 91%/524869
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
@wence- I know you've spent a lot of time thinking about the relationship among memory resources, buffers, and devices. I'd be interested to have your eyes on this. @miscco for some helpful historical context on the issues that can arise here, it would be worth reviewing these two RMM issues:
Force-pushed from 542aaa8 to 53d5439
🟨 CI finished in 9h 19m: Pass: 99%/417 | Total: 2d 08h | Avg: 8m 04s | Max: 49m 17s | Hits: 95%/524002
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Pull Request is not mergeable
🟩 CI finished in 3d 02h: Pass: 100%/417 | Total: 2d 08h | Avg: 8m 06s | Max: 49m 17s | Hits: 95%/524869
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Small nits
constexpr cuda_memory_resource() noexcept
    : __device_id_(0)
nit: Document which device the default constructor uses. One might plausibly assume that, without an explicit device ID, the resource would use the currently active device.
libcudacxx/include/cuda/__memory_resource/cuda_memory_resource.h (outdated)
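One way the requested note might read (a sketch only; the wording is an assumption, and only the behavior — allocating on device 0 — comes from this PR):

```cpp
//! @brief Default-constructs a \c cuda_memory_resource.
//! @note The resource allocates on device 0, *not* on the currently active
//!       device. Pass a device id to the constructor to target another device.
constexpr cuda_memory_resource() noexcept
    : __device_id_(0)
{}
```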
//! @param __other The other \c cuda_memory_resource
//! @return true, if both resources hold the same device id
_CCCL_NODISCARD constexpr bool operator==(cuda_memory_resource const& __other) const noexcept |
thought: This documents when two `cuda_memory_resource`s are equivalent, but as new memory resources are introduced, one should think about what it means semantically for two memory resources to be equal, so that all resources have a consistent definition.

I think what you want for equality between two resources `A` and `B` is that an allocation from resource `A` is morally equivalent to one from resource `B` (that is, stream-ordered in the same way, if async, and on the same device), and an allocation from `A` can be deallocated by resource `B` (and vice versa).
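To make the proposed contract concrete, here is a rough usage sketch (the device-id constructor and `operator==` come from this PR; the `<cuda/memory_resource>` header, the `cuda::mr` namespace, and the exact `allocate`/`deallocate` call shapes are assumptions):

```cpp
#include <cuda/memory_resource>

int main()
{
  cuda::mr::cuda_memory_resource a{0}; // allocates on device 0
  cuda::mr::cuda_memory_resource b{0}; // also device 0
  cuda::mr::cuda_memory_resource c{1}; // device 1

  void* p = a.allocate(256);
  if (a == b)
  {
    // Equal resources are interchangeable: same device, same allocation
    // mechanism, so memory obtained from `a` may be returned to `b`.
    b.deallocate(p, 256);
  }

  // `a == c` is false: the allocations live on different devices, so the
  // resources are not interchangeable and must not free each other's memory.
  return 0;
}
```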
I believe we want to be a tiny bit stricter here, in the sense that we want to ensure that users do not accidentally mix and match memory resources.

Technically you can deallocate managed memory allocated with `cudaMallocManaged` with a call to `cudaFree`. Both are synchronous, so the question is: should a managed memory resource compare equal to a plain `cudaMalloc` one?
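For concreteness, a minimal illustration of the runtime behavior being discussed, using plain CUDA runtime calls (independent of the resource classes):

```cpp
#include <cuda_runtime.h>

int main()
{
  // Managed (unified) memory: allocated with cudaMallocManaged...
  void* managed = nullptr;
  cudaMallocManaged(&managed, 256);
  // ...but legally released with the same cudaFree used for device memory.
  cudaFree(managed);

  // Plain device memory, allocated on the current device.
  void* device = nullptr;
  cudaMalloc(&device, 256);
  cudaFree(device);
  return 0;
}
```

Even though the deallocation path is shared, the two allocations have different properties (host-accessible vs. device-only), which is the argument for keeping a managed-memory resource and a plain device-memory resource unequal.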
> I believe we want to be a tiny bit stricter here, in the sense that we want to ensure that users do not accidentally mix and match memory resources.

That seems reasonable, so I think the current definition is fine; we should just document more generally that this is what equality means.
Force-pushed from 53d5439 to 7d0cb9d
Force-pushed from 7d0cb9d to 4341f07
//! By default uses device 0 to allocate memory
class cuda_memory_resource
important: This feels like an important design point that will have wide implications.
As @wence-'s comment shows, always using device 0 may be astonishing to the average CUDA user, who will typically expect that, when a device isn't specified, the resource will use the "current device".
@pciolkosz @ericniebler what is the latest thinking on the notion of the "default device" in the runtime modernization work?
This aligns with the current plan for the C++ runtime, where the default device is always 0 and the current device is ignored at all times.
FWIW, I would love it if there were no such concept as a default device (I like that there is no "current device" concept) and everything that cares about devices had to be explicitly constructed on one.
That's generally the direction we're headed in. With `cuda::mr` we are taking a more gradual approach, since it's already in use and we don't want to abruptly break people by requiring a device at construction. The idea is that we add the default argument to the ctor now, and then deprecate/remove it later.
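A sketch of the interim shape described here (the signature below is an assumption for illustration, not the exact code in this PR):

```cpp
// The device id is an explicit constructor parameter that currently defaults
// to 0, so existing code that default-constructs the resource keeps compiling.
// The stated plan is to deprecate and later remove the default, so that a
// device must eventually always be named at construction.
class cuda_memory_resource
{
public:
  constexpr explicit cuda_memory_resource(const int __device_id = 0) noexcept
      : __device_id_(__device_id)
  {}

private:
  int __device_id_; // device on which allocations will be placed
};

int main()
{
  cuda_memory_resource r0;    // device 0 via the (temporary) default
  cuda_memory_resource r1{1}; // device 1, stated explicitly
  (void)r0;
  (void)r1;
  return 0;
}
```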
🟨 CI finished in 8h 50m: Pass: 99%/417 | Total: 6d 13h | Avg: 22m 41s | Max: 1h 14m | Hits: 80%/524387
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
🟩 CI finished in 1d 02h: Pass: 100%/417 | Total: 6d 14h | Avg: 22m 44s | Max: 1h 14m | Hits: 80%/525254
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
🟨 CI finished in 6h 44m: Pass: 99%/417 | Total: 1d 23h | Avg: 6m 49s | Max: 57m 04s | Hits: 98%/523532
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Force-pushed from 1dc9e4a to 7a0c1b1
Force-pushed from 578eb97 to 5b5e7a8
Force-pushed from 5b5e7a8 to 1d08fc3
🟨 CI finished in 10h 41m: Pass: 99%/417 | Total: 7d 08h | Avg: 25m 24s | Max: 1h 09m | Hits: 61%/523235
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
Pull Request is not mergeable
🟩 CI finished in 19h 21m: Pass: 100%/417 | Total: 7d 09h | Avg: 25m 34s | Max: 1h 09m | Hits: 61%/525761
| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | |
| CUDA Experimental | |
| pycuda | |

Modifications in project or dependencies?

| Project | Modified |
|---|---|
| CCCL Infrastructure | |
| libcu++ | +/- |
| CUB | +/- |
| Thrust | +/- |
| CUDA Experimental | +/- |
| pycuda | +/- |

🏃 Runner counts (total jobs: 417)

| # | Runner |
|---|---|
| 305 | linux-amd64-cpu16 |
| 61 | linux-amd64-gpu-v100-latest-1 |
| 28 | linux-arm64-cpu16 |
| 23 | windows-amd64-cpu16 |
…ice (NVIDIA#2073)
* Ensure that `cuda_memory_resource` allocates memory on the proper device
* Move `__ensure_current_device` to own header
`cudaMalloc` always allocates memory on the current device, so we must make sure that we set the current device to the right one. By default we always allocate on device 0.
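A minimal sketch of that approach (the real `__ensure_current_device` helper lives in its own header per the commit above; the guard below is a simplified stand-in with error checking omitted):

```cpp
#include <cuda_runtime.h>

#include <cstddef>

// RAII guard: switch to the target device for the guard's lifetime and
// restore the previously current device on destruction.
struct ensure_current_device_guard
{
  int old_device_;

  explicit ensure_current_device_guard(int target_device)
  {
    cudaGetDevice(&old_device_);  // remember the caller's current device
    cudaSetDevice(target_device); // switch to the resource's device
  }

  ~ensure_current_device_guard()
  {
    cudaSetDevice(old_device_);   // restore the caller's device
  }
};

// Allocation path: cudaMalloc allocates on whatever device is current, so the
// guard makes the resource's stored device current for the duration of the call.
void* allocate_on_device(int device_id, std::size_t bytes)
{
  ensure_current_device_guard guard{device_id};
  void* ptr = nullptr;
  cudaMalloc(&ptr, bytes);
  return ptr; // memory resides on device_id, regardless of the caller's device
}
```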