rpc : make RPC servers come first in the device list #9296

Merged · 3 commits merged into ggml-org:master on Sep 4, 2024

Conversation

rgerganov
Collaborator

This patch implements an idea suggested by @slaren, which is to make RPC servers come first in the device list. When the last device is a local one, we don't have to transfer the logits over the network, which has a significant impact on performance.
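For illustration, here is a minimal standalone sketch (assumed code, not the actual llama.cpp implementation) of the index mapping this ordering implies: RPC devices occupy the first indices and local GPUs come last, so the device that ends up with the output layer and produces the logits is local.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Minimal sketch of the "RPC servers first" device ordering (illustrative only).
// With this layout the highest device index, which receives the output layer
// and produces the logits, is a local GPU, so the logits never cross the network.
int main() {
    std::vector<std::string> rpc_servers = {"192.168.1.10:50052"}; // hypothetical endpoint
    int local_gpu_count = 1;                                       // e.g. one local CUDA device
    int rpc_count = (int)rpc_servers.size();
    int dev_count = rpc_count + local_gpu_count;

    for (int gpu = 0; gpu < dev_count; ++gpu) {
        if (gpu < rpc_count) {
            std::printf("device %d -> RPC[%s]\n", gpu, rpc_servers[gpu].c_str());
        } else {
            std::printf("device %d -> local GPU %d\n", gpu, gpu - rpc_count);
        }
    }
    return 0;
}
```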

Here are the results for my laptop (NVIDIA T1200) offloading to an RPC server with a Tesla T4 over a 100 Mb/s network:

| GPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s rpc-first | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA T1200 Laptop GPU | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 1742.68 | 1987.90 | 1.14 |
| NVIDIA T1200 Laptop GPU | llama 1B Q8_0 | 1.09 | 1100048384 | tg128 | 67.98 | 82.49 | 1.21 |
| NVIDIA T1200 Laptop GPU | llama 8B Q4_K_M | 4.58 | 8030261248 | pp512 | 356.96 | 465.93 | 1.31 |
| NVIDIA T1200 Laptop GPU | llama 8B Q4_K_M | 4.58 | 8030261248 | tg128 | 27.36 | 29.95 | 1.09 |

With this patch, the amount of network traffic sent from the RPC server to the main host is 4x to 15x lower.

```cpp
int rpc_count = (int)model.rpc_servers.size();
if (gpu >= dev_count - rpc_count) {
    const char * endpoint = model.rpc_servers[gpu - dev_count + rpc_count].c_str();
    int local_gpu = gpu - rpc_count;
```
Member

I don't think this will work correctly if a list of RPC servers is given in a build without the RPC backend (they should be ignored). The device ids should be from 0 to llama_get_device_count() - 1.

Collaborator Author

I think giving a list of RPC servers in a non-RPC build should produce an error, i.e. the --rpc command line option must not be available.
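As a rough illustration of that suggestion (assumed code, not the actual llama.cpp argument parser; the GGML_USE_RPC guard name and the helper are hypothetical), the option would only be registered in builds that have the RPC backend:

```cpp
#include <string>
#include <vector>

// Illustrative only: expose --rpc solely in builds with the RPC backend,
// so passing it to a non-RPC build is rejected as an unknown argument
// instead of being silently ignored.
static std::vector<std::string> supported_options() {
    std::vector<std::string> opts = {"-m", "-ngl"};
#ifdef GGML_USE_RPC
    opts.push_back("--rpc");
#endif
    return opts;
}
```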

Member

Fixing that in the examples is good, but it still needs to be handled properly for 3rd party applications that use the llama.cpp API directly. llama_model::rpc_servers should probably only exist in builds with the RPC backend.

Collaborator Author

I have added ifdefs for rpc_count. If we want to remove llama_model::rpc_servers entirely, we will need more ifdefs.
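A minimal sketch of that kind of guard (assumed code, not the exact patch; model_t stands in for llama_model), in line with the "rpc : rpc_count always zero for non-RPC builds" commit referenced further down:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for llama_model, for illustration only.
struct model_t {
    std::vector<std::string> rpc_servers;
};

// In a build without the RPC backend this always returns zero, so any RPC
// endpoints supplied by the caller are effectively ignored and device ids
// stay in the range 0 .. llama_get_device_count() - 1.
static int get_rpc_count(const model_t & model) {
#ifdef GGML_USE_RPC
    return (int)model.rpc_servers.size();
#else
    (void)model;
    return 0;
#endif
}
```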

@rgerganov rgerganov merged commit 82e3b03 into ggml-org:master Sep 4, 2024
51 checks passed
@awatuna
Contributor

awatuna commented Sep 4, 2024

I think this broke RPC in the Windows CUDA llama-server.

The official Windows CUDA llama-server used to ignore --rpc.

However, if you build it with GGML_RPC set, the Windows CUDA llama-server (e.g. b3649 with the #9184 patch) works, utilizing both the local Windows GPU and the remote Linux GPU.

Now, even when built with GGML_RPC set, the Windows CUDA llama-server ignores --rpc again.

@rgerganov
Collaborator Author

> Now, even when built with GGML_RPC set, the Windows CUDA llama-server ignores --rpc again.

Hm, I don't have a Windows machine at the moment, and I cannot reproduce this on Linux. What do you mean the --rpc option is ignored? Can you start rpc-server on localhost and paste the output of:

```
llama-server -m <some_model.gguf> --rpc localhost:50052 -ngl 99
```

@awatuna
Contributor

awatuna commented Sep 5, 2024

The official Windows CUDA build
https://github.com/ggerganov/llama.cpp/releases/download/b3669/llama-b3669-bin-win-cuda-cu12.2.0-x64.zip

does not include rpc-server, and its llama-server just prints the help text when --rpc is passed. No RPC at all.

The merged patch from another collaborator that enables RPC by default only enabled it for the Windows builds, not the Windows CUDA builds. Windows builds and Windows CUDA builds sit in different parts of the CI workflow, so that patch missed them. That is why I made another patch, #9184, to enable it.

With the resulting b3649 Windows CUDA build with GGML_RPC set
https://github.com/awatuna/llama.cpp/releases/tag/b3655

llama-server RPC works:

```
llm_load_tensors: ggml ctx size = 0.68 MiB
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 47/47 layers to GPU
llm_load_tensors: RPC[192.168.200.246:50052] buffer size = 9228.81 MiB
llm_load_tensors: CPU buffer size = 1195.31 MiB
llm_load_tensors: CUDA0 buffer size = 18362.25 MiB
```

However, with the b3664 Windows CUDA build with GGML_RPC set
https://github.com/awatuna/llama.cpp/releases/tag/b3671

llama-server accepts --rpc but does not even try to connect to the RPC server:

```
llm_load_tensors: ggml ctx size = 0.68 MiB
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
```

@rgerganov
Collaborator Author

> llama-server accepts --rpc but does not even try to connect to the RPC server:
>
> llm_load_tensors: ggml ctx size = 0.68 MiB
> llama_model_load: error loading model: unable to allocate backend buffer
> llama_load_model_from_file: failed to load model

It doesn't connect because it fails to allocate the backend buffer; you should troubleshoot that problem first.

@awatuna
Contributor

awatuna commented Sep 5, 2024

If the RPC server is down, the Windows CUDA b3649 build with RPC will spam

Failed to connect to 192.168.200.246:50052

while the Windows CUDA b3664 build with RPC just ignores --rpc and doesn't try to connect at all.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds