
Moving MG functions into unified API + raft::device_resources_snmg as device resource type for MG functions #454

Merged
rapids-bot[bot] merged 26 commits into NVIDIA:branch-25.06 from viclafargue:account-for-raft-update on Apr 23, 2025

Conversation

@viclafargue

@viclafargue viclafargue commented Nov 8, 2024

Contributor

raft::device_resources_snmg is introduced in this PR : NVIDIA/raft#2549

@cjnolet cjnolet changed the title Account for RAFT update Account for RAFT update to SNMG APIs Dec 18, 2024
Comment thread cpp/include/cuvs/neighbors/mg.hpp Outdated
   * @return the constructed IVF-Flat MG index
   */
- auto build(const raft::device_resources& handle,
+ auto build(const raft::device_resources_snmg& clique,

Contributor


One of the main goals of having the "snmg" resources object match the single-GPU object was to let us remove this additional MG API. We should now be able to accept device_resources and then check whether an NCCL clique has been set on it (which would imply that it's a multi-GPU resources object rather than a single-GPU one). The whole point of doing this was to consolidate code paths.

@viclafargue viclafargue Jan 8, 2025

Contributor Author


The NCCL clique is no longer set as a resource, but we should still be able to implement the dispatching by checking the dynamic type of the device_resources. The real question is whether we truly want dispatching on both the regular API (cuvs::neighbors::build) and the mg namespace (cuvs::neighbors::mg::build). It makes sense that a user who passes a device_resources_snmg instance to the regular API (cuvs::neighbors::build) would want the work deployed on multiple GPUs. However, the reverse is not necessarily true: a user who explicitly chose the mg namespace but did not provide a device_resources_snmg would fall back to a single GPU, potentially unintentionally. Is this what we want?

I propose that we implement the dispatching mechanism solely on the regular API (cuvs::neighbors::build), in a dedicated follow-up PR. This also allows the MG docs to explicitly warn users that they must pass a device_resources_snmg to use the MG API. What do you think?
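
To make the proposal concrete, here is a minimal sketch of dispatching on the dynamic type of the resources object. It deliberately uses hypothetical stand-in types (`device_resources`, `device_resources_snmg`, the `num_gpus` field, and the `build_*` helpers are all invented for illustration) and not the real RAFT or cuVS classes. It assumes the base type is polymorphic so that `dynamic_cast` is usable.

```cpp
#include <string>

// Hypothetical stand-ins for illustration only; NOT the real RAFT classes.
struct device_resources {
  virtual ~device_resources() = default;  // polymorphic base so dynamic_cast works
};

struct device_resources_snmg : device_resources {
  int num_gpus = 2;  // hypothetical field standing in for the GPU clique
};

// Hypothetical single-GPU and multi-GPU implementations.
inline std::string build_single_gpu(const device_resources&) {
  return "single-gpu build";
}

inline std::string build_multi_gpu(const device_resources_snmg& clique) {
  return "multi-gpu build on " + std::to_string(clique.num_gpus) + " GPUs";
}

// Regular API: dispatch on the dynamic type of the resources object.
inline std::string build(const device_resources& res) {
  if (auto* snmg = dynamic_cast<const device_resources_snmg*>(&res)) {
    return build_multi_gpu(*snmg);  // caller passed an SNMG resources object
  }
  return build_single_gpu(res);  // plain resources object: single GPU
}

// MG namespace: accept only the SNMG type, so a silent single-GPU
// fallback is impossible; passing a plain object fails to compile.
namespace mg {
inline std::string build(const device_resources_snmg& clique) {
  return ::build_multi_gpu(clique);
}
}  // namespace mg
```

With this shape, `build(res)` picks the multi-GPU path only when `res` really is an SNMG object, while `mg::build` rejects a plain `device_resources` at compile time instead of quietly running on one GPU.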

@cjnolet

cjnolet commented Jan 17, 2025

Contributor

Not sure why CI is still failing. The RAFT PR was merged quite a while ago, but I'm still seeing errors like this:

/home/coder/cuvs/cpp/include/cuvs/neighbors/mg.hpp:24:10: fatal error: raft/core/device_resources_snmg.hpp: No such file or directory
     24 | #include <raft/core/device_resources_snmg.hpp>
        |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

@bdice

bdice commented Jan 17, 2025

Contributor

This fell back to old 24.12 RAFT packages for unclear reasons:

    libraft-headers:             24.12.00-cuda12_241211_geaf9cc72_0 rapidsai   
    libraft-headers-only:        24.12.00-cuda12_241211_geaf9cc72_0 rapidsai   

https://github.com/rapidsai/cuvs/actions/runs/12818869804/job/35749392606?pr=454#step:8:521

@cjnolet cjnolet changed the base branch from branch-24.12 to branch-25.02 January 17, 2025 00:47
@cjnolet

cjnolet commented Jan 17, 2025

Contributor

> This fell back to old 24.12 RAFT packages for unclear reasons:

Oh fooey! Thanks for noticing that @bdice, I just realized the darn target branch was never updated!

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jan 17, 2025
@viclafargue viclafargue force-pushed the account-for-raft-update branch from 6ae1c70 to b0c7ab9 Compare January 20, 2025 17:56
@copy-pr-bot

copy-pr-bot Bot commented Jan 20, 2025


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@viclafargue viclafargue changed the title Account for RAFT update to SNMG APIs Moving MG functions into unified API + raft::device_resources_snmg as device resource type for MG functions Jan 22, 2025
@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2025


/ok to test

@cjnolet, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@cjnolet

cjnolet commented Apr 22, 2025

Contributor

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2025


/ok to test

@cjnolet, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@cjnolet cjnolet removed the request for review from msarahan April 22, 2025 22:35
@cjnolet

cjnolet commented Apr 22, 2025

Contributor

/ok to test da48376

@cjnolet

cjnolet commented Apr 22, 2025

Contributor

/merge

@cjnolet

cjnolet commented Apr 23, 2025

Contributor

/ok to test da48

@cjnolet

cjnolet commented Apr 23, 2025

Contributor

/ok to test da48376

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2025


/ok to test da48376

@cjnolet, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@divyegala

Contributor

/ok to test d2e2be9

@divyegala

Contributor

/merge

@viclafargue viclafargue force-pushed the account-for-raft-update branch from 92ced86 to 1e1d97e Compare April 23, 2025 11:27
@viclafargue

Contributor Author

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2025


/ok to test

@viclafargue, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@achirkin

Contributor

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2025


/ok to test

@achirkin, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@viclafargue

Contributor Author

/ok to test 1e1d97e

@achirkin

Contributor

/ok to test 1e1d97e

Comment thread cpp/cmake/thirdparty/get_raft.cmake Outdated
Co-authored-by: Artem M. Chirkin <9253178+achirkin@users.noreply.github.com>
@cjnolet

cjnolet commented Apr 23, 2025

Contributor

/ok to test 34d67ee

@cjnolet

cjnolet commented Apr 23, 2025

Contributor

/merge


Labels

CMake cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change Python

Projects

Status: Done


7 participants