rpc: fix register position #11424

Merged
merged 1 commit into from Jan 26, 2025
Conversation

@thxCode thxCode (Contributor) commented Jan 26, 2025


PR #11262 reverted the changes introduced by PR #9296, which moved the output layer's assigned device to the remote RPC server.

This PR keeps the output layer assigned to the local device.

-m ../Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-fp16.gguf --tensor-split 1,9 --rpc 127.0.0.1:50052

Current:

load_tensors: layer   0 assigned to device Metal
load_tensors: layer   1 assigned to device Metal
load_tensors: layer   2 assigned to device Metal
load_tensors: layer   3 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   4 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   5 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   6 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   7 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   8 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   9 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  10 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  11 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  12 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  13 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  14 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  15 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  16 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  17 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  18 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  19 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  20 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  21 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  22 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  23 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer  24 assigned to device RPC[127.0.0.1:50052]

After this PR:

load_tensors: layer   0 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   1 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   2 assigned to device RPC[127.0.0.1:50052]
load_tensors: layer   3 assigned to device Metal
load_tensors: layer   4 assigned to device Metal
load_tensors: layer   5 assigned to device Metal
load_tensors: layer   6 assigned to device Metal
load_tensors: layer   7 assigned to device Metal
load_tensors: layer   8 assigned to device Metal
load_tensors: layer   9 assigned to device Metal
load_tensors: layer  10 assigned to device Metal
load_tensors: layer  11 assigned to device Metal
load_tensors: layer  12 assigned to device Metal
load_tensors: layer  13 assigned to device Metal
load_tensors: layer  14 assigned to device Metal
load_tensors: layer  15 assigned to device Metal
load_tensors: layer  16 assigned to device Metal
load_tensors: layer  17 assigned to device Metal
load_tensors: layer  18 assigned to device Metal
load_tensors: layer  19 assigned to device Metal
load_tensors: layer  20 assigned to device Metal
load_tensors: layer  21 assigned to device Metal
load_tensors: layer  22 assigned to device Metal
load_tensors: layer  23 assigned to device Metal
load_tensors: layer  24 assigned to device Metal
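
As an aside, here is a minimal C++ sketch of how the --tensor-split proportions appear to map layers to devices, based on the logs above; this is an illustrative assumption, not llama.cpp's actual code. The proportions are normalized into cumulative split points, and each layer il goes to the first device whose split point exceeds il / n_layers, which with 1,9 over 25 layers puts layers 0-2 on the first device in the list:

```cpp
// Illustrative sketch (assumption, not llama.cpp's exact code):
// --tensor-split proportions are normalized into cumulative split
// points; layer il goes to the first device whose split point
// exceeds il / n_layers.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int n_layers = 25;                  // qwen2.5-0.5b, as in the logs above
    std::vector<float> splits = {1.0f, 9.0f}; // --tensor-split 1,9

    // normalize to cumulative fractions: {0.1, 1.0}
    float total = 0.0f;
    for (float s : splits) total += s;
    for (size_t i = 0; i < splits.size(); ++i) {
        splits[i] = splits[i] / total + (i > 0 ? splits[i - 1] : 0.0f);
    }

    for (int il = 0; il < n_layers; ++il) {
        const int dev = (int) (std::upper_bound(splits.begin(), splits.end(),
                                                (float) il / n_layers) - splits.begin());
        std::printf("layer %2d -> device %d\n", il, dev); // layers 0-2 -> device 0
    }
    return 0;
}
```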

@github-actions github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 26, 2025
@rgerganov rgerganov (Collaborator)

Thanks for catching this! Changes look fine to me but please wait for @slaren's approval before merging.

@slaren slaren (Member) left a comment

Applications should not depend on the order of devices, and we definitely should not modify the ggml-backend API in this way. Instead, either pass a sorted list of devices to llama.cpp, or add the necessary logic to sort the device list here:
https://github.com/ggerganov/llama.cpp/blob/2cc9b8c32c78d09cd1b4df0aaa605ab2d0176243/src/llama.cpp#L9407-L9422
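
For illustration, a minimal sketch of what that sorting could look like: move RPC devices to the front of the list while preserving the relative order of everything else. This is a sketch only, not the merged change, and identifying RPC devices by the name of their backend registry is an assumption made here:

```cpp
// Sketch only: partition the default device list so RPC devices come
// first. stable_partition keeps the relative order within each group,
// so CUDA0 still comes before CUDA1, etc.
#include <algorithm>
#include <cstring>
#include <vector>

#include "ggml-backend.h"

static void devices_rpc_first(std::vector<ggml_backend_dev_t> & devices) {
    std::stable_partition(devices.begin(), devices.end(), [](ggml_backend_dev_t dev) {
        ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
        return reg != nullptr && std::strcmp(ggml_backend_reg_name(reg), "RPC") == 0;
    });
}
```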

@thxCode thxCode (Contributor, Author) commented Jan 26, 2025

@slaren PTAL

@slaren slaren merged commit 1d8ee06 into ggml-org:master Jan 26, 2025
44 of 45 checks passed
@lucyknada lucyknada

@slaren
From a user-facing perspective this becomes an issue: --list-devices shows an entirely different order (RPC devices last) than the order you are supposed to pass to --tensor-split (RPC devices first), so instead of just reading the order from --list-devices, you are left to guess what it ends up being through sheer brute force. Is there a way to adjust this so they match?

@slaren slaren (Member) commented Jan 30, 2025

Modify --list-devices so that it also sorts the device list in the same way as llama.cpp. Or as already suggested:

pass a sorted list of devices to llama.cpp

@lucyknada lucyknada

pass a sorted list of devices to llama.cpp

Is that a user-facing option or an implementation detail?

@slaren slaren (Member) commented Jan 30, 2025

Both. The user can pass a custom list of devices with -dev, but the llama.cpp examples can also pass a list instead of relying on the default list of devices that llama.cpp uses.
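
For the programmatic side, a minimal sketch of passing an explicit device list through llama_model_params, assuming the devices field exposed in llama.h; the helper name and the way the caller obtains the device handles are illustrative:

```cpp
// Sketch only: load a model with an explicit, caller-chosen device
// order instead of relying on llama.cpp's default ordering.
#include <vector>

#include "llama.h"

llama_model * load_with_device_order(const char * model_path,
                                     std::vector<ggml_backend_dev_t> devices) {
    devices.push_back(nullptr); // the devices list must be NULL-terminated

    llama_model_params mparams = llama_model_default_params();
    mparams.devices = devices.data(); // overrides the default device order

    return llama_model_load_from_file(model_path, mparams);
}
```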

@abc-nix abc-nix commented Jan 30, 2025

For those who don't get it (like me initially), you first need to check the device names using the --list-devices option (example below):

 $ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX XXXX, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce GTX YYYY, compute capability 7.5, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX XXXX (A MiB, B MiB free)
  CUDA1: NVIDIA GeForce GTX YYYY (A MiB, B MiB free)
  RPC[IP1:PORT1]: RPC[IP1:PORT1] (A MiB, B MiB free)
  RPC[IP2:PORT2]: RPC[IP2:PORT2] (A MiB, B MiB free)

The device names are listed under Available devices. The next time you launch llama-server, use the --device option with the order you want for your devices. An example:

$ llama.cpp/build/bin/llama-server --rpc <IP1>:<PORT1> --rpc <IP2>:<PORT2> \
--device RPC[IP1:PORT1],CUDA0,CUDA1,RPC[IP2:PORT2] \
-ngl 33 --tensor-split 3/20/10/0 --device-draft CUDA1,RPC[IP2:PORT2] -ngld 99 [...]

This way, you can set up the order however you want. In the example above, the main model is offloaded partly to the first RPC device (at IP1:PORT1), mostly to the CUDA0 device, and partially to the CUDA1 device, while the draft model is offloaded to the CUDA1 device and the second RPC device (at IP2:PORT2).

tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025