ggml-rpc: add native RDMA transport for RPC backend (RoCEv2) #20590
Conversation
|
@dvv47 Do you know whether this implementation will be compatible if I connect 2 DGX Sparks with a 200 Gb/s QSFP cable? I am planning to buy the cable and test it. |
|
@ggerganov Yes, it will work with DGX Spark as-is; it uses a Mellanox ConnectX-7 and exactly the same libibverbs API. The low-latency benefits will be there, but full bandwidth utilization for faster model weight transfer could be optimized even further
|
@Mithras it's possible that this is a Docker limitation. Have you tried running it natively? I will also test it with Docker and will let you know the results. Maybe using --ipc=host will help. |
|
Fixed the issue; it now reconnects better, and I have retested with the ROCm 7.2 Docker container.
Also, according to my tests, the Vulkan backend sees a smaller gain from RDMA, about 3%, in both PP and TG. I am also interested in how this affects CUDA devices; I will probably try it this week. @Mithras, for this Strix Halo: and if you build it yourself: |
|
@Mithras fixed that and tested on some agentic tasks |
|
@dvv47 this works great, thank you so much! Real perf boost for free! |
|
Works with virtual RoCEv2, too, but of course slightly slower than the Ethernet stack below it. Skipped the tg/s since there is barely any traffic (4 MB here and there) |
ggml-rpc: add native RDMA transport for RPC backend (RoCEv2) Base PR from dvv47 adding optional RDMA transport with auto-negotiation during RPC HELLO handshake. This commit is the unmodified PR applied to current master; macOS Thunderbolt adaptations follow in subsequent commits. Ref: ggml-org#20590 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
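For readers following along, here is a minimal sketch of what capability auto-negotiation during the HELLO handshake can look like. The flag names, the structs, and the exact field set are illustrative assumptions, not the PR's actual wire format; the general shape matches the logs below, where the two sides exchange QPN/GID pairs over TCP before switching transports.

```cpp
// Sketch only: hypothetical wire format for RDMA auto-negotiation in the
// RPC HELLO handshake. Field names and layout are assumptions, not the PR's.
#include <cstdint>

enum : uint8_t {
    RPC_TRANSPORT_TCP  = 1 << 0,
    RPC_TRANSPORT_RDMA = 1 << 1,  // set only if libibverbs probing succeeded
};

struct rpc_hello {
    uint8_t version;
    uint8_t transports;  // bitmask of transports this side supports
};

// Exchanged over the existing TCP connection when both sides set the RDMA
// bit; carries what is needed to bring the queue pair to RTR/RTS.
struct rpc_rdma_params {
    uint32_t qpn;      // queue pair number ("qpn=552->260" in the logs below)
    uint32_t psn;      // initial packet sequence number
    uint8_t  gid[16];  // RoCEv2 GID of the local port
    uint8_t  mtu;      // ibv_mtu enum value to use for the path
};

static uint8_t negotiate(uint8_t ours, uint8_t theirs) {
    uint8_t common = ours & theirs;
    // Prefer RDMA when both sides advertise it; TCP remains the fallback.
    return (common & RPC_TRANSPORT_RDMA) ? RPC_TRANSPORT_RDMA : RPC_TRANSPORT_TCP;
}
```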
I got the QSFP cable today. Will report results when I get some time to try it out. |
|
Got the two DGX Sparks connected. Not sure I am getting the full 200 Gb/s speed, but it should be at least 100 Gb/s if I am reading the numbers correctly. Cable: https://www.naddod.com/products/102069.html |
|
I heard that the networking of the DGX Spark is a bit complicated. In reality, each port is split into two links with a speed of 100 Gbps each. So, in theory, you can set up link bonding and achieve 200 Gbps over one cable. Or, without bonding, you can get 100 Gbps over one port and, at the same time, 100 Gbps to a second device through the second port. In the case of this RPC RDMA feature, most of the benefit comes from a stable, low-latency connection, so I expect no difference between a 25 Gbps link and even a 200 Gbps one. |
|
Yup, that's what I figured as well. Added the latency numbers to the previous comment for completeness. |
|
EDIT 3: Got it working between RTX Pro 6000 (x86_64 host) and DGX Spark (arm64), loading Qwen3.5 122B Q8 (200 GB+ model)
EDIT 2: I rebased on master, which has a bug in rpc-server: fixed in #21030
EDIT: The MTU issue was the first problem; fixed that, but hitting another error now... will update later
Hey, just checking out this PR. I tried to run rpc-server between a DGX Spark -> Linux PC (RTX Pro 6000), but I get an exception running llama-bench:
kieran@spark-95d6:~/git/llama.cpp/build$ bin/llama-bench --rpc 10.100.50.1:50052 \
-m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-122B-A10B-GGUF/snapshots/51eab4d59d53f573fb9206cb3ce613f1d0aa392b/Q4_K_M/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
RDMA probed: dev=rocep1s0f1 gid=2 qpn=552 inline=316
RDMA activated: qpn=552->260 mtu=4096 rx_depth=24
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
/home/kieran/git/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1125: Remote RPC server crashed or returned malformed response
[New LWP 2814330]
[New LWP 2814326]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libhns-rdmav34.so
(... same warning repeated for the other libibverbs provider libraries ...)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000f68df09c7b74 in __GI___wait4 (pid=2815007, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x0000f68df09c7b74 in __GI___wait4 (pid=2815007, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000f68df13a5a1c in ggml_print_backtrace () from /home/kieran/git/llama.cpp/build/bin/libggml-base.so.0
#2 0x0000f68df13a5bc0 in ggml_abort () from /home/kieran/git/llama.cpp/build/bin/libggml-base.so.0
#3 0x0000f68dee6aa8a0 in ggml_backend_rpc_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /home/kieran/git/llama.cpp/build/bin/libggml-rpc.so.0
#4 0x0000f68df158b790 in llama_model_loader::load_all_data(ggml_context*, std::unordered_map<unsigned int, ggml_backend_buffer*, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, ggml_backend_buffer*> > >&, std::vector<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> >, std::allocator<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> > > >*, bool (*)(float, void*), void*) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#5 0x0000f68df15a8d7c in llama_model::load_tensors(llama_model_loader&) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#6 0x0000f68df14f3098 in llama_model_load_from_file_impl(gguf_context*, void (*)(ggml_tensor*, void*), void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, llama_model_params) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#7 0x0000f68df14f43a0 in llama_model_load_from_file () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#8 0x0000acf83a58d9fc in main ()
[Inferior 1 (process 2814325) detached]
Aborted (core dumped)PC side: kieran@kieran-x ~/g/l/build (feat-rdma-9493)> bin/rpc-server -H 0.0.0.0 -p 50052 -c
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97204 MiB):
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97204 MiB
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
Never expose the RPC server to an open network!
This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Starting RPC server v3.7.1
endpoint : 0.0.0.0:50052
local cache : /home/kieran/.cache/llama.cpp/rpc/
Devices:
CUDA0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (97204 MiB, 95458 MiB free)
transport : TCP (RDMA auto-negotiate enabled)
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=260 inline=316
RDMA activated: qpn=260->552 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=262 inline=316
RDMA activated: qpn=262->553 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=263 inline=316
RDMA activated: qpn=263->554 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=264 inline=316
RDMA activated: qpn=264->555 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=265 inline=316
RDMA activated: qpn=265->556 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=266 inline=316
RDMA activated: qpn=266->557 mtu=1024 rx_depth=24
Client connection closed |
|
@dvv47 thank you for this PoC, I finally got some time to play with this. For the record I am using the same testbed as @ggerganov -- two DGX Sparks connected with a single QSFP cable. I benchmarked token generation (tg128) with gpt-oss-20b using a single RPC server with different backends and transports.
It's really impressive that using a remote server over RDMA is faster than one running on localhost with TCP/IP. And here I am using the TCP/IP stack provided by the ConnectX-7 NIC; I suspect the performance of a standard Ethernet NIC will be lower (will test this soon). As a next step, I think we should come up with a better way to abstract the underlying transport, so we can both reuse code and implement transport-specific optimizations. For example, the current implementation (ggml/src/ggml-rpc/ggml-rpc.cpp, lines 459 to 473 at 6b949d1) is doing multiple small sends per RPC command; these could be aggregated into a single transfer. Another possible way to leverage RDMA is to expose host buffers and load tensors directly into the RDMA buffers. This should speed up model loading. |
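For illustration, the aggregation idea can be as small as a buffering wrapper around the transport: instead of one send per field of an RPC command, the pieces are coalesced and flushed as a single transfer. A sketch under assumed names (raw_send stands in for whatever primitive the transport exposes; this is not the PR's code):

```cpp
// Sketch: coalesce the small writes that make up one RPC command
// (command id, payload size, payload) into a single send. 'raw_send'
// is a placeholder for the underlying TCP/RDMA send primitive.
#include <cstddef>
#include <cstdint>
#include <vector>

struct send_buffer {
    std::vector<uint8_t> buf;

    void append(const void * data, size_t size) {
        const uint8_t * p = static_cast<const uint8_t *>(data);
        buf.insert(buf.end(), p, p + size);
    }

    // One transfer per RPC command instead of three or more.
    bool flush(bool (*raw_send)(const void *, size_t)) {
        bool ok = raw_send(buf.data(), buf.size());
        buf.clear();
        return ok;
    }
};

// Usage for a hypothetical command:
//   send_buffer sb;
//   sb.append(&cmd, 1);        // command byte
//   sb.append(&size, 8);       // payload length
//   sb.append(payload, size);  // payload
//   sb.flush(raw_send);        // single syscall / single RDMA work request
```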
Yes, I was actually thinking about the same idea - raising RDMA to a higher level to be able to utilize RDMA WRITE, which requires less synchronization. However, when I made a test implementation of that, I got exactly the same inference performance. So I agree with you that this would improve model loading time, but at the cost of more code complexity, because the RPC transports would then have two different high-level interfaces (TCP vs. RDMA). If you'd like, I can bring this idea back to life and publish it as a separate branch/pull request. |
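For context on the RDMA WRITE variant discussed above: once the peer has shared the address and rkey of a registered buffer, the sender can place bytes into remote memory without the receiver posting matching receives, which is where the reduced synchronization comes from. A hedged libibverbs sketch (error handling trimmed; the connected QP and registered memory regions are assumed to exist already):

```cpp
// Sketch: one-sided RDMA WRITE with libibverbs. Assumes the QP is already
// connected (RTS) and the remote side shared { remote_addr, rkey } for a
// buffer registered with IBV_ACCESS_REMOTE_WRITE.
#include <infiniband/verbs.h>
#include <cstdint>

bool rdma_write(ibv_qp * qp, ibv_mr * local_mr, void * local_buf, uint32_t len,
                uint64_t remote_addr, uint32_t rkey) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uint64_t>(local_buf);
    sge.length = len;
    sge.lkey   = local_mr->lkey;

    ibv_send_wr wr = {};
    wr.opcode              = IBV_WR_RDMA_WRITE;  // no receiver involvement
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  // ask for a completion
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr * bad = nullptr;
    return ibv_post_send(qp, &wr, &bad) == 0;
}
```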
|
By raising the level of abstraction I actually mean something very simple -- instead of abstracting primitive send/recv operations, abstract sending and receiving whole RPC messages. |
|
@rgerganov At these RDMA speeds, the latency savings from aggregation are not visible in inference or even in model loading time: transferring the tensor payloads takes significantly more time than the per-message overhead. For retesting, I merged current master (b8611), added metrics for model loading time, and implemented this send aggregation to reduce small transfers. The results were within the margin of error. With my homelab cluster, two Ryzen AI Max+ 395 (gfx1151) nodes connected with a Mellanox ConnectX-4 Lx and a ConnectX-6 Lx: |
|
I just gave it a shot tonight; it's very neat. First with plain RPC: A second time with the container running as root, /dev/infiniband passed through, shared memory set, the additional IPC_LOCK capability, etc. I'm curious about the MTU, though; I thought it should be 9000 per NVIDIA's configuration guide for connecting two devices? Also tried running it as llama-server and using eugr/llama-benchy to compare: (I normally run -FP8 in vLLM on my Sparks) |
Raising the TCP MTU from 1500 to 9000 and the RDMA MTU from 1024 to 4096 helps reduce CPU load and, in the case of large data transfers, increases bandwidth. This is beneficial for training and model loading, but requires network isolation: all devices on the segment must share the same MTU settings. On the other hand, keeping the default MTU preserves the ability to use these devices on a regular network with internet access and other devices. In my case, I have a flat home network where the cluster nodes are connected to the same router as a laptop, mobile phones, and some IoT devices. RoCEv2 works in this case too and provides very low, stable latency. In your case, it seems your TCP stack is already highly optimized, which is why the benefit from RDMA is not as large, but it's still there.
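As a reference for where the activated MTU values in the logs above (mtu=4096 vs mtu=1024) can come from: the RDMA path MTU is set when the queue pair transitions to RTR and is typically clamped to the minimum of what both ports support. A minimal libibverbs sketch of that selection, assuming the peer's MTU was exchanged during the handshake:

```cpp
// Sketch: pick the RDMA path MTU as the minimum of the local port's active
// MTU and the value the peer reported during negotiation.
#include <infiniband/verbs.h>
#include <algorithm>

ibv_mtu pick_path_mtu(ibv_context * ctx, uint8_t port_num, ibv_mtu remote_mtu) {
    ibv_port_attr attr = {};
    if (ibv_query_port(ctx, port_num, &attr) != 0) {
        return IBV_MTU_1024;  // conservative fallback
    }
    // enum values are ordered: IBV_MTU_256 < ... < IBV_MTU_4096
    return std::min(attr.active_mtu, remote_mtu);
}

// Applied later, during the INIT -> RTR transition:
//   ibv_qp_attr qp_attr = {};
//   qp_attr.qp_state = IBV_QPS_RTR;
//   qp_attr.path_mtu = pick_path_mtu(ctx, 1, remote_mtu);
//   ibv_modify_qp(qp, &qp_attr, IBV_QP_STATE | IBV_QP_PATH_MTU | /* ... */);
```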
Thanks, this also confirms my experiments so far. I think giving up on partial reads is not a big deal, and your current implementation with swapping the transport behind socket_t works well. |
rgerganov
left a comment
Please squash the commits and rebase on current master
|
Thank you for doing this! I can check the difference on pure 56G InfiniBand (though only on CPU; there are no GPUs there). In theory, the InfiniBand protocol has a 2-8x lower per-port delay, which should greatly reduce latency and increase performance. Ethernet (RoCEv2): latency ~10-50 µs (after tuning); jitter is higher and depends on tuning. I have Mellanox CX-3 VPI cards and a Mellanox InfiniBand switch. It's very cheap to set up if you buy it on eBay :) By the way, the DGX Spark is supposed to support an InfiniBand port mode. If so, you need to run the OpenSM service on one of the nodes; without it, the InfiniBand layer will not come up and there will be no link. I also have two NVIDIA Sparks coming before the end of the year, so I'm really following your progress :) If you need me to build something and run a test, send the commands to run it (keeping in mind that I have CPU-only servers there) |
That doesn't sound right. I have a Mellanox ConnectX-4 and, in Ethernet mode with RoCEv2, I get ~1 µs latency. |
|
On my testbed (two DGX Sparks connected with QSFP) I get almost identical performance from CUDA-over-RPC and local CUDA backend:
build: b1a5fde (8769)
build: b1a5fde (8769)
Unfortunately, loading the model over RPC is still slow. We need to figure out how to improve this. |
rgerganov
left a comment
I think this is good to go as an initial version of RDMA support. I will prepare a follow-up patch that moves all transport-related code into a separate file (e.g. transport.cpp) and cleans up the socket_t interface.
To summarize the design decisions made here:
- socket_t encapsulates how client-server communication is done
- partial reads are no longer supported due to RDMA
- transport capabilities are negotiated as part of RPC_CMD_HELLO
- rdma_poll() is doing a busy loop, causing 100% CPU usage on one core (not sure how we can avoid this; one possible direction is sketched below)
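On the rdma_poll() busy-loop question: libibverbs does offer a blocking alternative, in which the CQ is armed for notifications and the polling thread sleeps on a completion channel until an event arrives. This trades the pegged core for extra wakeup latency per completion, which may be why the PR busy-polls. A sketch of the event-driven variant (the standard verbs pattern, not code from this PR):

```cpp
// Sketch: event-driven CQ polling instead of a busy loop. Trades ~100% CPU
// on one core for wakeup latency on each completion. Error handling trimmed.
#include <infiniband/verbs.h>

bool wait_for_completion(ibv_comp_channel * channel) {
    ibv_cq * ev_cq  = nullptr;
    void *   ev_ctx = nullptr;
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx)) {  // blocks, 0% CPU
        return false;
    }
    ibv_ack_cq_events(ev_cq, 1);

    // Re-arm before draining so no completion slips through un-notified.
    if (ibv_req_notify_cq(ev_cq, 0 /* all completions, not only solicited */)) {
        return false;
    }
    ibv_wc wc;
    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            return false;
        }
    }
    return true;
}

// Prerequisites: the CQ must be created against the channel and armed once
// before the first wait:
//   ibv_comp_channel * ch = ibv_create_comp_channel(ctx);
//   ibv_cq * cq = ibv_create_cq(ctx, depth, nullptr, ch, 0);
//   ibv_req_notify_cq(cq, 0);
```

Whether the extra microseconds per completion are acceptable for this workload would need measuring; a hybrid (spin briefly, then block) is another common compromise.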
|
I also want to make the RPC RDMA transport cross-platform (Windows/Linux) in the next version. Probably not a very popular setup, but fine for a home lab. |
|
This is the follow-up refactoring: #21998 |
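To make the direction of that refactoring concrete, here is one shape the abstraction could take, in the spirit of the earlier comment about abstracting whole RPC messages rather than primitive send/recv. All names are illustrative assumptions; the actual interface in #21998 may look different:

```cpp
// Sketch: hypothetical message-level transport interface for ggml-rpc.
// Names are illustrative; the actual refactoring may differ.
#include <cstddef>
#include <cstdint>

struct rpc_transport {
    virtual ~rpc_transport() = default;

    // Send one complete RPC message (command + payload) as a unit, letting
    // the implementation pick its strategy: a single TCP write, or an RDMA
    // send / one-sided WRITE for large tensor payloads.
    virtual bool send_msg(uint8_t cmd, const void * payload, size_t size) = 0;

    // Receive one complete message; partial reads are never exposed.
    virtual bool recv_msg(uint8_t & cmd, void * payload, size_t & size) = 0;
};

// One implementation per transport, selected after the HELLO negotiation:
struct tcp_transport  : rpc_transport { /* ... */ };
struct rdma_transport : rpc_transport { /* ... */ };
```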





Adds optional RDMA (RoCEv2) transport to the RPC backend
Two-node cluster: AMD Radeon 8060S (gfx1151) iGPUs, ConnectX-4 Lx / ConnectX-6 Lx 25GbE, RoCEv2. Model: Qwen3-Coder-Next 80B Q8_K_XL, layer split (-sm layer -ts 1/1) across both nodes, ROCm backend.
Optional dependency: libibverbs-dev / rdma-core (Linux only). Without the CMake flag, nothing changes -- no new headers, no new linkage.