
[FEA] Make device_vector safer to use in multi-device setting #1527

Closed
wence- opened this issue Apr 11, 2024 · 2 comments · Fixed by #1533
Assignees: wence-
Labels: 1 - On Deck (To be worked on next), cpp (Pertains to C++ code), feature request (New feature or request)

Comments

@wence-
Contributor

wence- commented Apr 11, 2024

Is your feature request related to a problem? Please describe.

Since #1370, device_buffer is safe to use in a multi-device setting with respect to the active device when the destructor runs. While it was always possible (and relatively straightforward) to arrange for the correct device to be active in code paths where no exceptions occur, doing so in the presence of exceptions was much more complicated.

We therefore added the cuda_set_device_raii helper object and stored the active device id in the device_buffer to ensure that the correct device is always active when calling allocate/deallocate functions.

In contrast, since device_vector is just an alias for thrust::device_vector, it still suffers from the old issue: the user must manually arrange for the correct device to be active when the destructor runs.
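For concreteness, here is a minimal sketch of the hazard (the device ids and the do_work helper are illustrative, not part of RMM):

```cpp
#include <rmm/device_vector.hpp>

#include <cuda_runtime_api.h>

void do_work();  // illustrative helper; may throw

void hazard()
{
  cudaSetDevice(1);
  rmm::device_vector<int> vec(1000);  // allocated via device 1's memory resource
  cudaSetDevice(0);
  do_work();         // if this throws...
  cudaSetDevice(1);  // ...this restore is never reached
}  // ~device_vector() can run with device 0 active, deallocating on the wrong device
```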

Describe the solution you'd like

#1523 documents this restriction, but it would be good if we could lift it. One way would be to store the active device in the thrust allocator wrapper that we use to interface RMM's memory resources with the thrust allocator model.

We would then use cuda_set_device_raii in all the allocate/deallocate functions.
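A rough sketch of what that could look like is below. This is not RMM's actual thrust_allocator (which lives in rmm/mr/device/thrust_allocator_adaptor.hpp); the class and member names are illustrative, and the stream/memory-resource handling is simplified:

```cpp
#include <rmm/cuda_device.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>
#include <rmm/mr/device/per_device_resource.hpp>

#include <thrust/device_ptr.h>

#include <cstddef>

template <typename T>
class device_aware_allocator {
 public:
  using value_type = T;
  using pointer    = thrust::device_ptr<T>;

  pointer allocate(std::size_t n)
  {
    rmm::cuda_set_device_raii guard{device_};  // make the saved device active
    return thrust::device_pointer_cast(
      static_cast<T*>(mr_->allocate(n * sizeof(T), stream_)));
  }

  void deallocate(pointer ptr, std::size_t n)
  {
    rmm::cuda_set_device_raii guard{device_};  // same guard on the way out
    mr_->deallocate(thrust::raw_pointer_cast(ptr), n * sizeof(T), stream_);
  }

 private:
  rmm::cuda_device_id device_{rmm::get_current_cuda_device()};  // captured at construction
  rmm::cuda_stream_view stream_{rmm::cuda_stream_default};
  rmm::mr::device_memory_resource* mr_{rmm::mr::get_current_device_resource()};
};
```

Capturing the device at construction mirrors what device_buffer already does since #1370; the guard then restores whatever device was previously active once allocate/deallocate returns.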

This approach was discounted in #1370 because it produces more device switches than strictly necessary in some circumstances (pushing the device switching as far out as possible was preferred), so there would be some overhead compared to using device_buffer, though hopefully a small one. Note also that since device_vector is not stream-ordered there are other disadvantages to using it, so the small performance cost is probably not a deal-breaker.

Describe alternatives you've considered

Maintain the status quo, and eventually deprecate and then remove device_vector, since it is not stream-ordered anyway and we are trying to move away from that model.

@wence- wence- added the feature request (New feature or request) and ? - Needs Triage (Need team to review and classify) labels Apr 11, 2024
@wence- wence- self-assigned this Apr 11, 2024
@wence- wence- added the 1 - On Deck (To be worked on next) and cpp (Pertains to C++ code) labels and removed the ? - Needs Triage (Need team to review and classify) label Apr 11, 2024
@harrism
Member

harrism commented Apr 11, 2024

Perhaps we should benchmark it. In the single-device case, rmm::cuda_set_device_raii should be fast.

@harrism
Member

harrism commented Apr 11, 2024

This was discounted as an approach in #1370 since it produces more device switches than necessary in some circumstances (pushing the device switching as far out as possible was preferred),

I think you are referring to this: #1370 (comment). I think the case in RMM's thrust allocator isn't as bad -- putting the set-device call in allocate/deallocate wouldn't be the same as putting it way down in device_buffer::allocate_async()/deallocate_async(). In the latter case, we would need it both there and in the higher-level functions that call those, resulting in duplicate device checks. Even if device_vector calls the allocator's functions multiple times (e.g. in a resize call -- allocate and deallocate), the RAII class would basically just call cudaGetDevice() multiple times, which isn't that expensive AFAIK.
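For reference, a minimal illustrative sketch of what such an RAII device-switch guard does (this is not RMM's actual cuda_set_device_raii, whose implementation may differ):

```cpp
#include <cuda_runtime_api.h>

class set_device_guard {
 public:
  explicit set_device_guard(int target)
  {
    cudaGetDevice(&old_device_);                  // cheap query of the active device
    needs_reset_ = (old_device_ != target);
    if (needs_reset_) { cudaSetDevice(target); }  // switch only when necessary
  }
  ~set_device_guard()
  {
    if (needs_reset_) { cudaSetDevice(old_device_); }  // restore on scope exit
  }
  set_device_guard(set_device_guard const&)            = delete;
  set_device_guard& operator=(set_device_guard const&) = delete;

 private:
  int old_device_{};
  bool needs_reset_{};
};
```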

wence- added a commit to wence-/rmm that referenced this issue Apr 15, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 16, 2024
Previously, the user had to arrange that the device that was active when
a thrust_allocator object was created was also active when allocate and
deallocate were called. This is hard to manage if exceptions are thrown.
Instead, save the active device on construction and ensure that it is
active when calling allocate and deallocate. This means that
device_vector is safe to destruct with RAII semantics in a multi-device
setting.

Add tests of this facility, and correct the parameterization usage in
the other thrust allocator tests so that we actually check the MRs we're
parameterizing over.

- Closes #1527

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Rong Ou (https://github.com/rongou)
  - Mark Harris (https://github.com/harrism)

URL: #1533
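With this change, a pattern like the following sketch should be safe (device ids are illustrative): the allocator remembers the device that was active at construction and makes it active again around the deallocation, even though a different device is current when the vector is destroyed.

```cpp
#include <rmm/device_vector.hpp>

#include <cuda_runtime_api.h>

void now_safe()
{
  cudaSetDevice(1);
  rmm::device_vector<int> vec(1000);  // the allocator records device 1
  cudaSetDevice(0);                   // e.g. an error path switched devices
}  // the deallocation runs with device 1 active, then the previous device is restored
```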
@bdice bdice moved this from To-do to Done in RMM Project Board Jan 15, 2025