rpc : make RPC servers come first in the device list #9296
Conversation
```cpp
int rpc_count = (int)model.rpc_servers.size();
if (gpu >= dev_count - rpc_count) {
    const char * endpoint = model.rpc_servers[gpu - dev_count + rpc_count].c_str();
```

```cpp
int local_gpu = gpu - rpc_count;
```
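For context, here is a minimal sketch of the device-id mapping this change implies once RPC servers come first: ids below rpc_count resolve to RPC endpoints and the remaining ids to local GPUs. The struct and function names are illustrative only, not the actual llama.cpp code.

```cpp
#include <string>
#include <vector>

// Illustrative only: with RPC servers first in the device list, a global
// device id maps to either an RPC endpoint or a local GPU index.
struct device_ref {
    bool        is_rpc;
    std::string endpoint;  // valid when is_rpc
    int         local_gpu; // valid when !is_rpc
};

static device_ref resolve_device(int gpu, const std::vector<std::string> & rpc_servers) {
    const int rpc_count = (int) rpc_servers.size();
    if (gpu < rpc_count) {
        return { true, rpc_servers[gpu], -1 };          // ids [0, rpc_count) -> RPC
    }
    return { false, std::string(), gpu - rpc_count };   // remaining ids -> local GPUs
}
```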
I don't think this will work correctly if a list of RPC servers is given in a build without the RPC backend (they should be ignored). The device ids should be from 0 to llama_get_device_count() - 1.
I think giving a list of RPC servers in a non-RPC build should produce an error, i.e. the --rpc command line option must not be available.
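A minimal sketch of what that could look like, assuming GGML_USE_RPC is the macro that guards the RPC backend; the parser shape below is hypothetical, not the actual common.cpp code:

```cpp
#include <cstring>
#include <string>

// Hypothetical argument handler: --rpc is only recognized when the RPC
// backend is compiled in, so a non-RPC build fails with an unknown-argument
// error instead of silently accepting the option.
static bool parse_rpc_arg(int & i, int argc, char ** argv, std::string & rpc_servers) {
#ifdef GGML_USE_RPC
    if (std::strcmp(argv[i], "--rpc") == 0 && i + 1 < argc) {
        rpc_servers = argv[++i]; // comma-separated list of host:port endpoints
        return true;
    }
#endif
    (void) i; (void) argc; (void) argv; (void) rpc_servers; // unused in non-RPC builds
    return false;
}
```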
Fixing that in the examples is good, but it still needs to be handled properly for 3rd party applications that use the llama.cpp API directly. llama_model::rpc_servers should probably only exist in builds with the RPC backend.
I have added ifdefs for the rpc_count. If we want to remove the existence of llama_model::rpc_servers, we will need more ifdefs.
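Presumably something along these lines (a sketch under the assumption that GGML_USE_RPC is the guard macro; llama_model is stubbed here to keep the snippet self-contained):

```cpp
#include <string>
#include <vector>

struct llama_model_stub {                    // stand-in for llama_model
    std::vector<std::string> rpc_servers;
};

// Sketch of the ifdef described above: rpc_count collapses to zero in
// non-RPC builds, so any RPC servers passed in are ignored there and
// device ids stay in [0, llama_get_device_count()).
static int get_rpc_count(const llama_model_stub & model) {
#ifdef GGML_USE_RPC
    return (int) model.rpc_servers.size();
#else
    (void) model;
    return 0;
#endif
}
```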
I think this broke RPC in the Windows CUDA llama-server. The Windows CUDA llama-server used to ignore --rpc. However, if you build it with GGML_RPC set, then the Windows CUDA llama-server (e.g. b3649 with the #9184 patch) works, utilizing both the local Windows GPU and a remote Linux GPU. Now, built with GGML_RPC set, the Windows CUDA llama-server ignores --rpc again.
Hm, I don't have a Windows machine atm, but I cannot reproduce this on Linux. What do you mean by "ignores --rpc"?
The official Windows CUDA build does not include rpc-server, and llama-server prints the help text when --rpc is set. With the merged patch from another collaborator that enables RPC by default, the resulting b3649 Windows CUDA build of llama-server works with --rpc:

llm_load_tensors: ggml ctx size = 0.68 MiB

However, the b3664 Windows CUDA build with GGML_RPC set accepts --rpc, but it didn't even try to connect to the RPC server:

llm_load_tensors: ggml ctx size = 0.68 MiB
It doesn't connect because it fails to allocate a backend buffer; you should troubleshoot this problem first.
If the RPC server is down, the Windows CUDA b3649 build with RPC will spam:

Failed to connect to 192.168.200.246:50052

The Windows CUDA b3664 build with RPC just ignores --rpc and doesn't try to connect at all.
* rpc : make RPC servers come first in the device list
* rpc : disable options for non-RPC builds
* rpc : rpc_count always zero for non-RPC builds
This patch implements an idea suggested by @slaren, which is to make RPC servers come first in the device list. When the last device is a local one, we don't have to transfer the logits over the network, and this has a significant impact on performance.
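To make the saving concrete, a rough back-of-the-envelope estimate with assumed numbers (not measurements from this PR): with a 32000-entry vocabulary in fp32, the logits are about 32000 × 4 B ≈ 128 KB per token; at 100 Mb/s (≈ 12.5 MB/s) that alone would cost roughly 10 ms per generated token, which keeping the last device local avoids.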
Here are the results for my laptop (NVIDIA T1200) offloading to an RPC server with a Tesla T4 on a 100 Mb/s network:
The amount of network traffic sent from the RPC server to the main host is 4x to 15x lower with this patch.