
add --rpc-layers flag to explicitly set RPC layers #11606

Closed

Conversation

kallewoof
Contributor

The current setup does not allow very precise control over how many layers to put on the local GPUs versus the remote RPC-connected server(s). This adds a --rpc-layers flag (-nrl) that lets the user explicitly set the number of layers to offload to the RPC end.

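For illustration, the intended usage would have looked roughly like this. This is only a sketch based on the PR description; the host, port, model path, and layer counts are placeholders, and --rpc-layers/-nrl is the flag proposed here, which was never merged:

$ llama-server -m model.gguf --rpc <IP>:<PORT> \
  -ngl 33 --rpc-layers 10
# 33 layers offloaded in total: 10 of them to the RPC server,
# the remaining 23 to the local GPU(s)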
@rgerganov
Collaborator

In which cases would you need this instead of using the existing --tensor-split?

@kallewoof
Contributor Author

> In which cases would you need this instead of using the existing --tensor-split?

This lets you specify layers without having to think about proportions, which is a lot simpler than --tensor-split. I can find the right amount for the RPC server and GPUs separately.

@abc-nix

abc-nix commented Feb 3, 2025

> In which cases would you need this instead of using the existing --tensor-split?
>
> This lets you specify layers without having to think about proportions, which is a lot simpler than --tensor-split. I can find the right amount for the RPC server and GPUs separately.

You don't need to think about the proportions when using --tensor-split. You can directly use the number of layers instead. Example:

$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> \
--device RPC[IP1:PORT1],CUDA0,CUDA1,RPC[IP2:PORT2] \
-ngl 33 --tensor-split 3/20/10/0 --device-draft CUDA1,RPC[IP2:PORT2] -ngld 99 [...]

33 layers are offloaded in total, divided between 3 of the 4 listed devices (3 layers to the first RPC device, 20 to the first CUDA device, and 10 to the second CUDA device; the trailing 0 means the second RPC device gets none).

OT: The only thing I am missing is a tensor-split for the draft model.

@kallewoof kallewoof closed this Feb 3, 2025
@kallewoof kallewoof deleted the 202502-nrl-param branch February 3, 2025 14:16
@kallewoof
Contributor Author

kallewoof commented Feb 3, 2025

Got it. The name should've been "layer-split" instead. It sounds like some low-level voodoo, so I never touched it. :)
Also, the -h output does not indicate at all that you can use layer counts. You just have to know that.

@abc-nix

abc-nix commented Feb 3, 2025

There are a lot of undocumented things in llama.cpp. I learned many tricks by reading responses to issues and pull requests, and by trying things out. As an example, the recent device-order pull request is another magic trick that makes llama.cpp much more flexible when using RPC.
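For reference, a sketch of that device-ordering trick, assuming --list-devices and --device behave as in recent llama.cpp builds; the address and layer counts below are placeholders:

$ llama.cpp/build/bin/llama-server --rpc <IP>:<PORT> --list-devices
# prints the available devices, e.g. CUDA0, CUDA1, RPC[<IP>:<PORT>]
$ llama.cpp/build/bin/llama-server --rpc <IP>:<PORT> \
  --device RPC[<IP>:<PORT>],CUDA0 -ngl 33 --tensor-split 10,23 [...]
# the RPC device is listed first, so the first tensor-split entry (10 layers)
# goes to the RPC server and the remaining 23 layers go to CUDA0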
