rpc : make RPC servers come first in the device list #9296

Merged · 3 commits merged into ggml-org:master on Sep 4, 2024

Conversation

rgerganov
Collaborator

This patch implements an idea suggested by @slaren, which is to make RPC servers come first in the device list. When the last device is a local one, we don't have to transfer the logits over the network, which has a significant impact on performance.
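For illustration, here is a minimal standalone sketch (assumed code, not the actual llama.cpp implementation) of the index mapping this ordering implies: RPC devices occupy the first indices and local GPUs come last, so the device that ends up with the output layer and produces the logits is local.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Minimal sketch of the "RPC servers first" device ordering (illustrative only).
// With this layout the highest device index, which receives the output layer
// and produces the logits, is a local GPU, so the logits never cross the network.
int main() {
    std::vector<std::string> rpc_servers = {"192.168.1.10:50052"}; // hypothetical endpoint
    int local_gpu_count = 1;                                       // e.g. one local CUDA device
    int rpc_count = (int)rpc_servers.size();
    int dev_count = rpc_count + local_gpu_count;

    for (int gpu = 0; gpu < dev_count; ++gpu) {
        if (gpu < rpc_count) {
            std::printf("device %d -> RPC[%s]\n", gpu, rpc_servers[gpu].c_str());
        } else {
            std::printf("device %d -> local GPU %d\n", gpu, gpu - rpc_count);
        }
    }
    return 0;
}
```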

Here are the results for my laptop (NVIDIA T1200) offloading to an RPC server with a Tesla T4 over a 100 Mb/s network:

| GPU | Model | Model Size [GiB] | Num. of Par. | Test | t/s master | t/s rpc-first | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA T1200 Laptop GPU | llama 1B Q8_0 | 1.09 | 1100048384 | pp512 | 1742.68 | 1987.90 | 1.14 |
| NVIDIA T1200 Laptop GPU | llama 1B Q8_0 | 1.09 | 1100048384 | tg128 | 67.98 | 82.49 | 1.21 |
| NVIDIA T1200 Laptop GPU | llama 8B Q4_K_M | 4.58 | 8030261248 | pp512 | 356.96 | 465.93 | 1.31 |
| NVIDIA T1200 Laptop GPU | llama 8B Q4_K_M | 4.58 | 8030261248 | tg128 | 27.36 | 29.95 | 1.09 |

With this patch, the amount of network traffic sent from the RPC server to the main host is 4x to 15x lower.

```cpp
int rpc_count = (int)model.rpc_servers.size();
if (gpu >= dev_count - rpc_count) {
    const char * endpoint = model.rpc_servers[gpu - dev_count + rpc_count].c_str();
    int local_gpu = gpu - rpc_count;
```
Member

I don't think this will work correctly if a list of RPC servers is given in a build without the RPC backend (they should be ignored). The device ids should be from 0 to llama_get_device_count() - 1.

Collaborator Author

I think giving a list of RPC servers in a non-RPC build should produce an error, i.e. the --rpc command line option must not be available.
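As a rough illustration of that suggestion (assumed code, not the actual llama.cpp argument parser; the GGML_USE_RPC guard name and the helper are hypothetical), the option would only be registered in builds that have the RPC backend:

```cpp
#include <string>
#include <vector>

// Illustrative only: expose --rpc solely in builds with the RPC backend,
// so passing it to a non-RPC build is rejected as an unknown argument
// instead of being silently ignored.
static std::vector<std::string> supported_options() {
    std::vector<std::string> opts = {"-m", "-ngl"};
#ifdef GGML_USE_RPC
    opts.push_back("--rpc");
#endif
    return opts;
}
```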

Member

Fixing that in the examples is good, but it still needs to be handled properly for 3rd party applications that use the llama.cpp API directly. llama_model::rpc_servers should probably only exist in builds with the RPC backend.

Collaborator Author

I have added ifdefs for rpc_count. If we want to remove llama_model::rpc_servers entirely, we will need more ifdefs.
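A minimal sketch of that kind of guard (assumed code, not the exact patch; model_t stands in for llama_model), in line with the "rpc : rpc_count always zero for non-RPC builds" commit referenced further down:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for llama_model, for illustration only.
struct model_t {
    std::vector<std::string> rpc_servers;
};

// In a build without the RPC backend this always returns zero, so any RPC
// endpoints supplied by the caller are effectively ignored and device ids
// stay in the range 0 .. llama_get_device_count() - 1.
static int get_rpc_count(const model_t & model) {
#ifdef GGML_USE_RPC
    return (int)model.rpc_servers.size();
#else
    (void)model;
    return 0;
#endif
}
```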

@rgerganov rgerganov merged commit 82e3b03 into ggml-org:master Sep 4, 2024
51 checks passed
@awatuna
Contributor

awatuna commented Sep 4, 2024

I think this broke RPC in the Windows CUDA llama-server.

The official Windows CUDA llama-server used to ignore --rpc.

However, if you build it with GGML_RPC set, the Windows CUDA llama-server (e.g. b3649 with the #9184 patch) works, utilizing both the local Windows GPU and the remote Linux GPU.

Now, even when built with GGML_RPC set, the Windows CUDA llama-server ignores --rpc again.

@rgerganov
Collaborator Author

> Now, even when built with GGML_RPC set, the Windows CUDA llama-server ignores --rpc again.

Hm, I don't have a Windows machine at the moment, and I cannot reproduce this on Linux. What do you mean the --rpc option is ignored? Can you start rpc-server on localhost and paste the output of:

```
llama-server -m <some_model.gguf> --rpc localhost:50052 -ngl 99
```

@awatuna
Contributor

awatuna commented Sep 5, 2024

The official Windows CUDA build
https://github.com/ggerganov/llama.cpp/releases/download/b3669/llama-b3669-bin-win-cuda-cu12.2.0-x64.zip

does not include rpc-server, and its llama-server just prints the help text when --rpc is passed. No RPC at all.

The merged patch from another collaborator that enables RPC by default only enabled it for the Windows builds, not the Windows CUDA builds. Windows builds and Windows CUDA builds sit in different parts of the CI workflow, so that patch missed them. That is why I made another patch, #9184, to enable it.

With the resulting b3649 Windows CUDA build with GGML_RPC set
https://github.com/awatuna/llama.cpp/releases/tag/b3655

llama-server RPC works:

```
llm_load_tensors: ggml ctx size = 0.68 MiB
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 47/47 layers to GPU
llm_load_tensors: RPC[192.168.200.246:50052] buffer size = 9228.81 MiB
llm_load_tensors: CPU buffer size = 1195.31 MiB
llm_load_tensors: CUDA0 buffer size = 18362.25 MiB
```

However, with the b3664 Windows CUDA build with GGML_RPC set
https://github.com/awatuna/llama.cpp/releases/tag/b3671

llama-server accepts --rpc but does not even try to connect to the RPC server:

```
llm_load_tensors: ggml ctx size = 0.68 MiB
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
```

@rgerganov
Collaborator Author

> llama-server accepts --rpc but does not even try to connect to the RPC server:
>
> llm_load_tensors: ggml ctx size = 0.68 MiB
> llama_model_load: error loading model: unable to allocate backend buffer
> llama_load_model_from_file: failed to load model

It doesn't connect because it fails to allocate the backend buffer; you should troubleshoot that problem first.

@awatuna
Contributor

awatuna commented Sep 5, 2024

If the RPC server is down, the Windows CUDA b3649 build with RPC will spam

Failed to connect to 192.168.200.246:50052

while the Windows CUDA b3664 build with RPC just ignores --rpc and doesn't try to connect at all.

dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* rpc : make RPC servers come first in the device list

* rpc : disable options for non-RPC builds

* rpc : rpc_count always zero for non-RPC builds