
add --rpc-layers flag to explicitly set RPC layers #11606

Closed

Conversation

kallewoof
Contributor

The current setup does not allow very precise control over how many layers to put on the local GPUs versus the remote RPC-connected server(s). This adds a --rpc-layers flag (-nrl) that lets the user explicitly set the number of layers to offload to the RPC end.

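For illustration, the intended usage would have looked roughly like this. This is only a sketch based on the PR description; the host, port, model path, and layer counts are placeholders, and --rpc-layers/-nrl is the flag proposed here, which was never merged:

$ llama-server -m model.gguf --rpc <IP>:<PORT> \
  -ngl 33 --rpc-layers 10
# 33 layers offloaded in total: 10 of them to the RPC server,
# the remaining 23 to the local GPU(s)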
@rgerganov
Collaborator

In which cases would you need this instead of using the existing --tensor-split?

@kallewoof
Contributor Author

> In which cases would you need this instead of using the existing --tensor-split?

This lets you specify layers without having to think about proportions, which is a lot simpler than --tensor-split. I can find the right amount for the RPC server and GPUs separately.

@abc-nix

abc-nix commented Feb 3, 2025

> In which cases would you need this instead of using the existing --tensor-split?
>
> This lets you specify layers without having to think about proportions, which is a lot simpler than --tensor-split. I can find the right amount for the RPC server and GPUs separately.

You don't need to think about the proportions when using --tensor-split. You can directly use the number of layers instead. Example:

$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> \
--device RPC[IP1:PORT1],CUDA0,CUDA1,RPC[IP2:PORT2] \
-ngl 33 --tensor-split 3/20/10/0 --device-draft CUDA1,RPC[IP2:PORT2] -ngld 99 [...]

33 layers are offloaded in total, divided between 3 of the 4 listed devices (3 layers to the first RPC device, 20 to the first CUDA device, and 10 to the second CUDA device; the trailing 0 means the second RPC device gets none).

OT: The only thing I am missing is a tensor-split for the draft model.

@kallewoof kallewoof closed this Feb 3, 2025
@kallewoof kallewoof deleted the 202502-nrl-param branch February 3, 2025 14:16
@kallewoof
Contributor Author

kallewoof commented Feb 3, 2025

Got it. The name should've been "layer-split" instead. It sounds like some low-level voodoo, so I never touched it. :)
Also, the -h output does not indicate at all that you can use layer counts. You just have to know that.

@abc-nix

abc-nix commented Feb 3, 2025

There are a lot of undocumented things in llama.cpp. I learned many tricks by reading responses to issues and pull requests, and by trying things out. As an example, the recent device-order pull request is another magic trick that makes llama.cpp much more flexible when using RPC.
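For reference, a sketch of that device-ordering trick, assuming --list-devices and --device behave as in recent llama.cpp builds; the address and layer counts below are placeholders:

$ llama.cpp/build/bin/llama-server --rpc <IP>:<PORT> --list-devices
# prints the available devices, e.g. CUDA0, CUDA1, RPC[<IP>:<PORT>]
$ llama.cpp/build/bin/llama-server --rpc <IP>:<PORT> \
  --device RPC[<IP>:<PORT>],CUDA0 -ngl 33 --tensor-split 10,23 [...]
# the RPC device is listed first, so the first tensor-split entry (10 layers)
# goes to the RPC server and the remaining 23 layers go to CUDA0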
