rpc: fix register position #11424
Conversation
Thanks for catching this! Changes look fine to me, but please wait for @slaren's approval before merging.
Applications should not depend on the order of devices, and we definitely should not modify the ggml-backend API in this way. Instead, either pass a sorted list of devices to llama.cpp, or add the necessary logic to sort the device list here:
https://github.com/ggerganov/llama.cpp/blob/2cc9b8c32c78d09cd1b4df0aaa605ab2d0176243/src/llama.cpp#L9407-L9422
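For reference, a minimal sketch of the sorting approach suggested above (not actual llama.cpp code), assuming the device list is a `std::vector<ggml_backend_dev_t>` as used in llama.cpp; identifying RPC devices by their `RPC` name prefix is an assumption of this sketch:

```cpp
// Sketch only: reorder a device list so that local devices come before RPC
// devices, leaving the ggml-backend registry itself untouched.
#include <algorithm>
#include <cstring>
#include <vector>

#include "ggml-backend.h"

static void sort_devices_local_first(std::vector<ggml_backend_dev_t> & devices) {
    // stable_partition preserves the relative order within each group
    std::stable_partition(devices.begin(), devices.end(), [](ggml_backend_dev_t dev) {
        // assumption: RPC devices are named "RPC[<endpoint>]"
        return std::strncmp(ggml_backend_dev_name(dev), "RPC", 3) != 0;
    });
}
```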
Signed-off-by: thxCode <[email protected]>
@slaren PTAL
@slaren
Modify
Is that a user-facing option or an implementation detail?
Both. The user can pass a custom list of devices with the `--device` argument.
For those that don't get it (like me initially), you first need to check the device names using the `--list-devices` argument:

```
$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX XXXX, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce GTX YYYY, compute capability 7.5, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX XXXX (A MiB, B MiB free)
  CUDA1: NVIDIA GeForce GTX YYYY (A MiB, B MiB free)
  RPC[IP1:PORT1]: RPC[IP1:PORT1] (A MiB, B MiB free)
  RPC[IP2:PORT2]: RPC[IP2:PORT2] (A MiB, B MiB free)
```

The order is then under your control with the `--device` argument:

```
$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> \
    --device RPC[IP1:PORT1],CUDA0,CUDA1,RPC[IP2:PORT2] \
    -ngl 33 --tensor_split 3/20/10/0 --device-draft CUDA1,RPC[IP2:PORT2] -ngld 99 [...]
```

This way, you can set up the order however you want. In the complicated example above, the main model is offloaded to the first RPC device (using the IP1:PORT1 address), mostly to the CUDA0 device, and partially to the CUDA1 device, while the draft model is offloaded to the CUDA1 device and the second RPC device (using the IP2:PORT2 address).
Signed-off-by: thxCode <[email protected]>
PR #11262 reverted the changes introduced by PR #9296, which changed the device assigned to the output layer to the remote RPC server.
This PR keeps assigning the output layer to the local device.
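For illustration only, a hedged sketch of the intended behavior described above (not the actual patch in this PR): pick the device for the output layer from the local devices rather than simply taking the last one, which may be a remote RPC server. The `pick_output_device` helper and the `RPC` name-prefix check are assumptions of this sketch:

```cpp
// Sketch only: prefer the last local (non-RPC) device for the output layer,
// falling back to the last device overall if every entry is an RPC device.
#include <cstring>
#include <vector>

#include "ggml-backend.h"

static ggml_backend_dev_t pick_output_device(const std::vector<ggml_backend_dev_t> & devices) {
    for (auto it = devices.rbegin(); it != devices.rend(); ++it) {
        // assumption: RPC devices are named "RPC[<endpoint>]"
        if (std::strncmp(ggml_backend_dev_name(*it), "RPC", 3) != 0) {
            return *it;
        }
    }
    return devices.empty() ? nullptr : devices.back();
}
```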
Tested with:

```
-m ../Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-fp16.gguf --tensor-split 1,9 --rpc 127.0.0.1:50052
```
Current:

After this PR: