Adds fabric handle and memory protection flags to cuda_async_memory_resource #1743

Merged
merged 4 commits into rapidsai:branch-25.02 from allow_fabric_handles on Dec 7, 2024

Conversation

abellina
Contributor

@abellina abellina commented Dec 3, 2024

Closes #1741

Description

This PR adds a new fabric handle type to allocation_handle_type. It also adds an optional access_flags value that sets the desired memory access when exporting (prot_none or prot_read_write). Pools that are not meant to be shareable should omit these flags.

Please note that I can't add a unit test that exports or imports these fabric handles, because it would require system setup that doesn't look to be portable.
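
As a rough illustration of the description above, the following is a minimal sketch using stand-in types only; it does not call CUDA or RMM. The enumerator names follow the PR text, the numeric values are assumed to mirror the CUDA runtime's `cudaMemAllocationHandleType` and `cudaMemAccessFlags` enums and should be checked against the headers, and `pool_config` and `make_pool_config` are hypothetical helpers invented for this example.

```cpp
#include <cassert>
#include <optional>

// Stand-in for the handle-type enum described above. Names and numeric values
// are assumptions chosen to mirror cudaMemAllocationHandleType.
enum class allocation_handle_type : int {
  none                  = 0x0,
  posix_file_descriptor = 0x1,
  win32                 = 0x2,
  win32_kmt             = 0x4,
  fabric                = 0x8  // the handle type this PR adds
};

// Stand-in for the optional access flags (names taken from the description).
enum class access_flags : int { prot_none = 0, prot_read_write = 3 };

// Hypothetical summary of what a pool would be configured with.
struct pool_config {
  int handle_types;  // value that would be used as the pool's exportable handle types
  bool sets_access;  // whether the caller explicitly requested an access flag
};

// Hypothetical helper: a pool that is not meant to be shared leaves both
// optionals empty, which yields "no handle type" and "no explicit access".
inline pool_config make_pool_config(std::optional<allocation_handle_type> handle,
                                    std::optional<access_flags> access = std::nullopt)
{
  return pool_config{static_cast<int>(handle.value_or(allocation_handle_type::none)),
                     access.has_value()};
}
```

Under these assumptions, `make_pool_config(allocation_handle_type::fabric, access_flags::prot_read_write)` describes a shareable pool, while `make_pool_config(std::nullopt)` describes a private one.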

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Dec 3, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the cpp Pertains to C++ code label Dec 3, 2024
@abellina abellina marked this pull request as ready for review December 3, 2024 19:15
@abellina abellina requested a review from a team as a code owner December 3, 2024 19:15
@abellina abellina requested review from rongou and vyasr December 3, 2024 19:15
@abellina
Contributor Author

abellina commented Dec 3, 2024

@bdice if you get a chance, I can't set the labels here, so I will need some help with this.

@rongou rongou added feature request New feature or request 3 - Ready for review Ready for review by team non-breaking Non-breaking change labels Dec 3, 2024
@abellina
Contributor Author

abellina commented Dec 3, 2024

/ok to test

@abellina
Contributor Author

abellina commented Dec 3, 2024

I may not be in the right team or have the right permissions. If a reviewer could try triggering the build that would be great.

@abellina abellina force-pushed the allow_fabric_handles branch from 23c651e to e91c3ca Compare December 3, 2024 22:12
@abellina
Contributor Author

abellina commented Dec 3, 2024

Note, I force-pushed to have a single signed commit. I didn't realize I needed to sign every commit in the PR.

@bdice
Contributor

bdice commented Dec 3, 2024

/ok to test

@abellina
Contributor Author

abellina commented Dec 4, 2024

hi @vyasr when you get a chance, could you take a look?

if (access_flag) {
  cudaMemAccessDesc desc;
  desc.location = pool_props.location;
  desc.flags = static_cast<cudaMemAccessFlags>(access_flag.value_or(access_flags::prot_none));

Contributor

Inside this block it must be true that access_flag has a value, so I think you can use

desc.flags = static_cast<cudaMemAccessFlags>(*access_flag);

Thinking more. In the case that access_flag is nullopt we never call cudaMemPoolSetAccess at all. Is this the intention?

I think you probably mean something like (without the need for the conditional)

cudaMemAccessDesc desc = {
  .location = pool_props.location,
  .flags = static_cast<cudaMemAccessFlags>(access_flag.value_or(access_flags::prot_none))
};
RMM_CUDA_TRY(cudaMemPoolSetAccess(pool_handle(), &desc, 1));

Contributor Author

So before this change, we never called cudaMemPoolSetAccess. I didn't want to add another API call in the default case where the caller doesn't want to set access at all, reserving this for the case where the caller states they want to set it. Would you be OK if I took your:

desc.flags = static_cast<cudaMemAccessFlags>(*access_flag);

suggestion in the block, but didn't call the API in all cases?

Contributor

> never called cudaMemPoolSetAccess

I tried to understand from the runtime API docs what the consequence of the current status quo is (and whether it differs from unilaterally calling cudaMemPoolSetAccess(..., PROT_NONE)), but I could not figure it out. So unless we have good evidence that just doing so is fine, I suspect we should do as you did.

I would encourage the use of the named struct field initialiser though :).
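
The two options weighed in this thread can be compared with a small sketch. It uses a fake pool that only records calls and never touches the CUDA runtime; `fake_pool`, `configure_conditional`, and `configure_unconditional` are hypothetical names for this example, and the enum values are assumed to mirror `cudaMemAccessFlags`.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Stand-in for the access flags discussed in the review (values assumed).
enum class access_flags : int { prot_none = 0, prot_read_write = 3 };

// Hypothetical pool that records what would have been passed to
// cudaMemPoolSetAccess instead of calling it.
struct fake_pool {
  std::vector<access_flags> set_access_calls;
  void set_access(access_flags flag) { set_access_calls.push_back(flag); }
};

// Variant kept in the PR: only touch the pool when the caller asked for it.
inline void configure_conditional(fake_pool& pool, std::optional<access_flags> access_flag)
{
  if (access_flag) {
    // access_flag is known to hold a value here, so value_or is unnecessary.
    pool.set_access(*access_flag);
  }
}

// Variant floated in review: always make the call, defaulting to prot_none.
inline void configure_unconditional(fake_pool& pool, std::optional<access_flags> access_flag)
{
  pool.set_access(access_flag.value_or(access_flags::prot_none));
}
```

With `std::nullopt`, the conditional variant records no call, whereas the unconditional variant records one call with `prot_none`. That extra call is the behavior change the author wanted to avoid in the default case.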

Contributor Author

The named struct field initializer is a change I was able to make. Pushed.

Contributor Author

@abellina abellina Dec 9, 2024

@bdice @wence- FYI, #1753 is relevant to this discussion. I apologize for the noise. I will file a follow-on PR with a fix/clarification to the API so that others don't get tripped up.

Signed-off-by: Alessandro Bellina <[email protected]>
@abellina abellina force-pushed the allow_fabric_handles branch from 8d40966 to 56220a4 Compare December 4, 2024 20:35
@abellina
Contributor Author

abellina commented Dec 4, 2024

@wence- please take another look, let me know if you still want part of the std::underlying_type_t change. I could use it just to get the type of the enum?

@wence-
Contributor

wence- commented Dec 5, 2024

> @wence- please take another look, let me know if you still want part of the std::underlying_type_t change. I could use it just to get the type of the enum?

I think it's ok without.

@abellina
Contributor Author

abellina commented Dec 5, 2024

/ok to test

@abellina
Contributor Author

abellina commented Dec 5, 2024

Thanks @wence-, updated with a comment. If you or the other reviewers (@vyasr, @rongou) don't mind, could you trigger /ok to test? It is still not working for me.

@bdice
Contributor

bdice commented Dec 5, 2024

/ok to test

@bdice
Contributor

bdice commented Dec 5, 2024

I’ll take another review pass today.

@abellina
Contributor Author

abellina commented Dec 6, 2024

/ok to test

@abellina
Contributor Author

abellina commented Dec 6, 2024

Hi @bdice let me know if you have further comments. Thanks!

Contributor

@bdice bdice left a comment

Thanks @abellina! All looks fine to me.

@bdice
Contributor

bdice commented Dec 7, 2024

/merge

@bdice bdice removed the 3 - Ready for review Ready for review by team label Dec 7, 2024
@rapids-bot rapids-bot bot merged commit 83a8971 into rapidsai:branch-25.02 Dec 7, 2024
59 of 60 checks passed
rapids-bot bot pushed a commit that referenced this pull request Dec 9, 2024
Closes #1753
It is a follow up from #1743

I would like for rapidsai/cudf#17553 to merge first, that way I don't break the build.

I've learned that I was using `cudaMemPoolSetAccess` incorrectly. This API should only be used from a `peer` device, not from the device that created the pool. This is why calling `cudaMemPoolSetAccess` with none throws an error, as documented in #1753. I have tested that I can still export the fabric handles and import them using UCX on a peer device with the default access that the pool owner device gets (read+write is the default). Note that this default read+write access cannot be revoked from the owner, as it wouldn't make sense to have memory that nobody can access, but peers can call `cudaMemPoolSetAccess` to gain read+write access to a peer's pool memory or to stop accessing it (none).

Authors:
  - Alessandro Bellina (https://github.com/abellina)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #1754
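
The access rules described in the follow-up commit message above can be pictured with a toy model. This is only a sketch of the stated behavior, not CUDA code: `toy_pool` is a hypothetical class, the enum values are assumed to mirror `cudaMemAccessFlags`, and treating an owner-side request for read+write as a harmless success is an assumption the message does not confirm.

```cpp
#include <cassert>
#include <map>

// Stand-in for the access flags (values assumed to mirror cudaMemAccessFlags).
enum class access_flags : int { prot_none = 0, prot_read_write = 3 };

// Toy model of a pool's per-device access table; not a CUDA API.
class toy_pool {
 public:
  explicit toy_pool(int owner_device) : owner_(owner_device) {}

  // Models a set-access request for `device`. Returns false for the case the
  // commit message says fails: revoking the owning device's own access.
  bool set_access(int device, access_flags flag)
  {
    if (device == owner_) {
      // Assumption: any owner-side request other than revocation is a no-op.
      return flag != access_flags::prot_none;
    }
    peers_[device] = flag;
    return true;
  }

  // The owner always has read+write; peers have no access until they ask.
  access_flags access_of(int device) const
  {
    if (device == owner_) { return access_flags::prot_read_write; }
    auto it = peers_.find(device);
    return it == peers_.end() ? access_flags::prot_none : it->second;
  }

 private:
  int owner_;
  std::map<int, access_flags> peers_;
};
```

In this model, a pool owned by device 0 rejects `set_access(0, access_flags::prot_none)`, while device 1 can switch its own access between `prot_read_write` and `prot_none`.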
Labels
cpp Pertains to C++ code feature request New feature or request non-breaking Non-breaking change
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] support fabric handles and exporting in the cuda_async_memory_resource
4 participants