
ggml-rpc: add native RDMA transport for RPC backend (RoCEv2)#20590

Merged
rgerganov merged 1 commit into ggml-org:master from dvv101111:feat-rdma-9493 on Apr 15, 2026

Conversation

@dvv101111
Contributor

Adds optional RDMA (RoCEv2) transport to the RPC backend

Two-node cluster: AMD Radeon 8060S (gfx1151) iGPUs, ConnectX-4 Lx / ConnectX-6 Lx 25GbE, RoCEv2. Model: Qwen3-Coder-Next 80B Q8_K_XL, layer split (-sm layer -ts 1/1) across both nodes, ROCm backend.

| Metric | TCP (t/s) | RDMA (t/s) | Improvement |
| --- | ---: | ---: | ---: |
| Prompt processing (pp2048) | 651.48 | 678.42 | +4.1% |
| Token generation (tg256) | 30.19 | 32.16 | +6.5% |

Optional dependency: libibverbs-dev / rdma-core (Linux only). Without the CMake flag, nothing changes -- no new headers, no new linkage.
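Rough sketch of what the device probing step boils down to with libibverbs (illustrative only, not the patch code; the helper name and the GGML_RDMA_DEV / GGML_RDMA_GID overrides shown here mirror the environment variables discussed later in this thread):

#include <infiniband/verbs.h>
#include <cstdlib>
#include <cstring>
#include <string>

// hypothetical helper: pick the first ACTIVE Ethernet (RoCE) port, honoring the
// GGML_RDMA_DEV / GGML_RDMA_GID overrides if they are set
static bool probe_rdma_device(std::string & dev_out, int & gid_out) {
    const char * want_dev = std::getenv("GGML_RDMA_DEV");
    const char * want_gid = std::getenv("GGML_RDMA_GID");
    int num = 0;
    ibv_device ** list = ibv_get_device_list(&num);
    if (!list) {
        return false;
    }
    bool found = false;
    for (int i = 0; i < num && !found; i++) {
        const char * name = ibv_get_device_name(list[i]);
        if (want_dev && std::strcmp(want_dev, name) != 0) {
            continue; // user pinned a specific device
        }
        ibv_context * ctx = ibv_open_device(list[i]);
        if (!ctx) {
            continue;
        }
        ibv_port_attr port {};
        if (ibv_query_port(ctx, 1, &port) == 0 &&
            port.state == IBV_PORT_ACTIVE &&
            port.link_layer == IBV_LINK_LAYER_ETHERNET) {
            dev_out = name;
            gid_out = want_gid ? std::atoi(want_gid) : 0; // real GID selection would use ibv_query_gid
            found   = true;
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return found;
}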

@github-actions github-actions bot added the documentation, examples, and ggml labels on Mar 15, 2026
@ggerganov
Member

@dvv47 Do you know if this implementation is going to be compatible if I connect 2 DGX Sparks with a 200Gb/s QSFP cable? I am planning to buy the cable and test it.

@Mithras

Mithras commented Mar 15, 2026

I might be doing something wrong but this is as far as I was able to get:
image
I have dual strix halo with x-5 and x-4 in ETH mode. No matter what GGML_RDMA_GID I set, I'm just getting infinite

llama-1  | Accepted client connection
llama-1  | RDMA probed: dev=rocep101s0 gid=5 qpn=322 inline=316
llama-1  | RDMA activated: qpn=322->463 mtu=1024 rx_depth=24
llama-1  | [get_device_memory] device: 0, free_mem: 134044098560, total_mem: 134217728000
llama-1  | Client connection closed
llama-1  | RDMA CQ poll timeout

loop. Models never finish loading. It might be something specific to containers... I've tried host network, privileged, mapping /dev/infiniband, etc. Wasn't able to make this work. Is there something obvious I'm missing?

@dvv101111
Contributor Author

@ggerganov Yes, it will work with DGX Spark as-is; they use Mellanox ConnectX-7 and exactly the same libibverbs API. The low-latency benefits will be there, but full bandwidth utilization for faster model weight transfer could be optimized even further.

@dvv101111
Contributor Author

@Mithras it's possible that this is a docker limitation. Have you tried running it natively? I will also test it with Docker and will let you know the results. Maybe using --ipc=host will help.

@dvv101111 dvv101111 requested a review from a team as a code owner March 15, 2026 21:56
@dvv101111
Contributor Author

dvv101111 commented Mar 15, 2026

Fixed the issue; it now reconnects better. I have retested with the ROCm 7.2 Docker container; my previous test was with the nightly version.

image

Also, according to my tests, the Vulkan backend shows a smaller RDMA gain, about 3%, in both PP and TG. I am also interested in testing how this affects CUDA devices. I will probably try it this week.

@Mithras for this strix halo:

docker run \
    --network host \
    --ipc host \
    --privileged \
    --ulimit memlock=-1:-1 \
    --security-opt seccomp=unconfined \
    --device /dev/dri \
    --device /dev/kfd \
    -v /dev/infiniband:/dev/infiniband \
    -e HSA_OVERRIDE_GFX_VERSION=11.5.1

and if you build it yourself:

RUN cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_NO_VMM=ON \
    -DLLAMA_HIP_UMA=ON \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON \
    -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_RPC=ON \
    -DGGML_RPC_RDMA=ON

@Mithras

Mithras commented Mar 16, 2026

@dvv47 it mostly works now, thank you so much! A weird issue that is still present: it works until you let it idle for a few minutes, and then it throws the same timeout error:
2026-03-15_21-25
But that only happens after letting it idle for a couple of minutes. Prior to that I was able to run pretty big prompts from Perplexica just fine.


Also, it seems the only thing needed to make RDMA work in Docker is passing the /dev/infiniband device.

@dvv101111 dvv101111 marked this pull request as draft March 16, 2026 10:45
@dvv101111
Contributor Author

@Mithras fixed that and tested on some agentic tasks

@Mithras

Mithras commented Mar 16, 2026

@dvv47 this works great, thank you so much! Real perf boost for free!
Another feature/question: would it be possible to set GGML_RDMA_DEV and GGML_RDMA_GID per RPC server? I have two Strix Halos and one of them has a two-port X-5 NIC with one port connected to my desktop PC. It will probably just work if I use network: host, but it would be nice to be able to use a multi-port setup in an isolated network where automatic dev/gid detection doesn't work.

@krampenschiesser

Works with virtual RoCE v2 too, but of course slightly slower than the Ethernet stack below it.
(used rdma link add rxe0 type rxe netdev eth0)

ggml_cuda_init: found 5 CUDA devices (Total VRAM: 79251 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 2: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 3: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 4: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
RDMA probed: dev=rxe0 gid=2 qpn=20 inline=256
RDMA activated: qpn=20->20 mtu=4096 rx_depth=24
| model                          |       size |     params | backend    | ngl | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --: | --------------: | -------------------: |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |           pp512 |        577.62 ± 6.70 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |          pp1024 |       412.20 ± 12.36 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |          pp2048 |        413.64 ± 5.54 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |         pp10000 |        370.39 ± 4.64 |

Skipped the tg numbers since there is barely any traffic (4 MB here and there).

AlexWorland pushed a commit to AlexWorland/llama.cpp that referenced this pull request Mar 19, 2026
…Ev2)

Base PR from dvv47 adding optional RDMA transport with auto-negotiation
during RPC HELLO handshake. This commit is the unmodified PR applied to
current master; macOS Thunderbolt adaptations follow in subsequent commits.

Ref: ggml-org#20590

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlexWorland pushed a commit to AlexWorland/llama.cpp that referenced this pull request Mar 19, 2026
…Ev2)

Base PR from dvv47 adding optional RDMA transport with auto-negotiation
during RPC HELLO handshake. This commit is the unmodified PR applied to
current master; macOS Thunderbolt adaptations follow in subsequent commits.

Ref: ggml-org#20590

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ggerganov
Member

> @dvv47 Do you know if this implementation is going to be compatible if I connect 2 DGX Sparks with a 200Gb/s QSFP cable? I am planning to buy the cable and test it.

> @ggerganov Yes, it will work with DGX Spark as-is; they use Mellanox ConnectX-7 and exactly the same libibverbs API. The low-latency benefits will be there, but full bandwidth utilization for faster model weight transfer could be optimized even further.

I got the QSFP cable today. Will report results when I get some time to try it out.

@ggerganov
Member

ggerganov commented Mar 21, 2026

Got the two DGX Sparks connected. Not sure I am getting the full 200Gb/s speed, but it should be at least 100Gb/s if I am reading the numbers correctly:

Details

Cable: https://www.naddod.com/products/102069.html

$ ibdev2netdev 
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
$ ib_write_bw -d rocep1s0f0 --report_gbits -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : rocep1s0f0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	: OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0178 PSN 0xe286ee RKey 0x18431d VAddr 0x00f2e9affd8000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:78:122
 remote address: LID 0000 QPN 0x017a PSN 0x476605 RKey 0x184300 VAddr 0x00e9841df15000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:48:241
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             108.30             108.29 		  0.206553
---------------------------------------------------------------------------------------
# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
#  Rank  0 Group  0 Pid  31967 on spark-17ed device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  32456 on spark-a163 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
 17179869184    2147483648     float    none      -1   495564   34.67   17.33       0   396500   43.33   21.66       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 19.499 
#
# Collective test concluded: all_gather_perf
#
$ ib_write_lat -d rocep1s0f0 --report_gbits
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF		Device         : rocep1s0f0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: OFF
 ibv_wr* API     : ON
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 220[B]
 rdma_cm QPs	: OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x017a PSN 0x7b2cdb RKey 0x1707c4 VAddr 0x00c31234c02000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:78:122
 remote address: LID 0000 QPN 0x017c PSN 0x90bf28 RKey 0x169b59 VAddr 0x00b51d8310f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:48:241
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          1.65           2.81         1.74     	      1.74        	0.03   		1.86    		2.81   
---------------------------------------------------------------------------------------

@dvv101111
Contributor Author

I heard that the networking on the DGX Spark is a bit complicated. In reality, each port is split into two links with a speed of 100 Gbps each. So, in theory, you can set up link bonding and achieve 200 Gbps over one cable. Or, without bonding, you can get 100 Gbps over one port and, at the same time, 100 Gbps to a second device through the second port. For this RPC RDMA feature, most of the benefit comes from a stable, low-latency connection, so I expect no difference between 25 Gbps and even 200 Gbps links.

@ggerganov
Member

Yup, that's what I figured as well. Added the latency numbers to the previous comment for completeness.

@v0l

v0l commented Mar 26, 2026

EDIT 3: Got it working between RTX Pro 6000 (x86_64 host) and DGX Spark (arm64) loading Qwen3.5 122B Q8 (200GB+ model)

EDIT 2: I rebased on master which has a bug in rpc-server: fixed in #21030

EDIT: MTU issue was first problem, fixed that but having another error now... will update later

Hey just checking out this PR,

Details

I tried to run rpc-server between DGX Spark -> Linux PC (RTX Pro 6000), but I get an exception running llama-bench:
kieran@spark-95d6:~/git/llama.cpp/build$ bin/llama-bench --rpc 10.100.50.1:50052 \
  -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-122B-A10B-GGUF/snapshots/51eab4d59d53f573fb9206cb3ce613f1d0aa392b/Q4_K_M/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
RDMA probed: dev=rocep1s0f1 gid=2 qpn=552 inline=316
RDMA activated: qpn=552->260 mtu=4096 rx_depth=24
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
/home/kieran/git/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1125: Remote RPC server crashed or returned malformed response
[New LWP 2814330]
[New LWP 2814326]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libhns-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmthca-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libhfi1verbs-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libirdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmlx5-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libefa-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libocrdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libcxgb4-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libvmw_pvrdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libqedr-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/liberdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmlx4-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libipathverbs-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libsiw-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmana-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/librxe-rdmav34.so
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000f68df09c7b74 in __GI___wait4 (pid=2815007, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x0000f68df09c7b74 in __GI___wait4 (pid=2815007, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000f68df13a5a1c in ggml_print_backtrace () from /home/kieran/git/llama.cpp/build/bin/libggml-base.so.0
#2  0x0000f68df13a5bc0 in ggml_abort () from /home/kieran/git/llama.cpp/build/bin/libggml-base.so.0
#3  0x0000f68dee6aa8a0 in ggml_backend_rpc_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /home/kieran/git/llama.cpp/build/bin/libggml-rpc.so.0
#4  0x0000f68df158b790 in llama_model_loader::load_all_data(ggml_context*, std::unordered_map<unsigned int, ggml_backend_buffer*, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, ggml_backend_buffer*> > >&, std::vector<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> >, std::allocator<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> > > >*, bool (*)(float, void*), void*) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#5  0x0000f68df15a8d7c in llama_model::load_tensors(llama_model_loader&) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#6  0x0000f68df14f3098 in llama_model_load_from_file_impl(gguf_context*, void (*)(ggml_tensor*, void*), void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, llama_model_params) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#7  0x0000f68df14f43a0 in llama_model_load_from_file () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#8  0x0000acf83a58d9fc in main ()
[Inferior 1 (process 2814325) detached]
Aborted (core dumped)

PC side:

kieran@kieran-x ~/g/l/build (feat-rdma-9493)> bin/rpc-server -H 0.0.0.0 -p 50052 -c
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97204 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97204 MiB

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting RPC server v3.7.1
  endpoint       : 0.0.0.0:50052
  local cache    : /home/kieran/.cache/llama.cpp/rpc/
Devices:
  CUDA0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (97204 MiB, 95458 MiB free)
  transport      : TCP (RDMA auto-negotiate enabled)
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=260 inline=316
RDMA activated: qpn=260->552 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=262 inline=316
RDMA activated: qpn=262->553 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=263 inline=316
RDMA activated: qpn=263->554 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=264 inline=316
RDMA activated: qpn=264->555 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=265 inline=316
RDMA activated: qpn=265->556 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=266 inline=316
RDMA activated: qpn=266->557 mtu=1024 rx_depth=24
Client connection closed

@rgerganov
Member

@dvv47 thank you for this PoC, I finally got some time to play with this. For the record I am using the same testbed as @ggerganov -- two DGX Sparks connected with a single QSFP cable.

I benchmarked token generation (tg128) with gpt-oss-20b using a single RPC server with different backends and transports.

| Transport | CPU backend (t/s) | CUDA backend (t/s) |
| --- | ---: | ---: |
| TCP (remote) | 32.58 ± 0.17 | 67.25 ± 0.37 |
| TCP (local) | 33.47 ± 0.17 | 72.61 ± 0.49 |
| RDMA | 35.12 ± 0.14 | 75.04 ± 0.21 |

It's really impressive that using a remote server over RDMA is faster than one running on localhost with TCP/IP. And here I am using the TCP/IP stack provided by the ConnectX-7 NIC; I suspect the performance of a standard Ethernet NIC will be lower (will test this soon).

As a next step I think we should try to come up with a better way to abstract the underlying transport, so we can both reuse code and implement transport specific optimizations. For example the current implementation of send_rpc_cmd is:

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// No response
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    uint8_t cmd_byte = cmd;
    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }
    return true;
}

Doing multiple send_data in a row with TCP/IP is fine as we don't have to wait for reply as I explained here, but this is no longer the case with RDMA. I think we need to abstract the transport on a higher level than plain send/recv functions.

Another possible way to leverage RDMA is to expose host buffers and load tensors directly into the RDMA buffers. This should speed up model loading.
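For reference, "exposing host buffers" in ibverbs terms means registering them as memory regions so the NIC can access them directly. A minimal sketch under that assumption (illustrative only, not part of this PR; error handling and cleanup elided):

#include <infiniband/verbs.h>
#include <cstdlib>

// Register a host buffer so the NIC can read/write it directly. A remote peer
// that learns (addr, rkey) could then RDMA WRITE tensor data into it without
// an intermediate copy.
struct rdma_buffer {
    void   * addr;
    size_t   size;
    ibv_mr * mr;
};

static rdma_buffer register_host_buffer(ibv_pd * pd, size_t size) {
    rdma_buffer buf {};
    buf.size = size;
    buf.addr = std::aligned_alloc(4096, size);       // page-aligned allocation
    buf.mr   = ibv_reg_mr(pd, buf.addr, size,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);   // pin + register with the NIC
    // buf.mr->rkey is what the peer needs for one-sided RDMA WRITE/READ
    return buf;
}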

@dvv101111
Contributor Author

@rgerganov

> Doing multiple send_data in a row with TCP/IP is fine as we don't have to wait for reply as I explained #16892 (comment), but this is no longer the case with RDMA. I think we need to abstract the transport on a higher level than plain send/recv functions.

> Another possible way to leverage RDMA is to expose host buffers and load tensors directly into the RDMA buffers. This should speed up model loading.

Yes, I was actually thinking about the same idea: raising RDMA to a higher level to get the ability to use RDMA WRITE operations, which would require less synchronization. However, when I made a test implementation of that, I got exactly the same inference performance.

So I agree with you that this would improve model loading time, but at the cost of more code complexity, because the RPC transport would then have two different high-level interfaces, TCP vs RDMA.

If you'd like, I can bring this idea back to life and publish it as a separate branch/pull request.

@rgerganov
Member

By raising the level of abstraction I actually mean something very simple: instead of abstracting primitive send/recv functions, let's try to abstract send_rpc_cmd. The TCP/IP implementation stays as-is, but the RDMA implementation concatenates rpc_cmd, request_size and request_data and sends them at once. This way we make one round-trip to the server instead of three, which should give us some measurable improvement IMO.
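A minimal sketch of what that could look like (hypothetical names, not this PR's code; rdma_send_msg is assumed to post a single RDMA SEND of the given size and wait for its completion):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// assumed helper: ships one complete RDMA message
bool rdma_send_msg(const void * data, size_t size);

static bool send_rpc_cmd_rdma(uint8_t cmd, const void * input, size_t input_size) {
    std::vector<uint8_t> msg(1 + sizeof(uint64_t) + input_size);
    uint64_t size64 = input_size;
    msg[0] = cmd;                                         // | rpc_cmd (1 byte) |
    std::memcpy(msg.data() + 1, &size64, sizeof(size64)); // | request_size (8 bytes) |
    std::memcpy(msg.data() + 9, input, input_size);       // | request_data |
    return rdma_send_msg(msg.data(), msg.size());         // one message instead of three
}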

@dvv101111
Contributor Author

dvv101111 commented Apr 1, 2026

@rgerganov
My conclusion is that merging multiple sends into one does not affect performance but requires significant code changes. Unlike TCP, where sent data is appended to the driver buffer on the receiver and can be partially read, RDMA is message-oriented: sending cmd+size+payload as one message requires reading exactly the same data via a single rdma_recv(). This leads to refactoring rpc_serve_client() to use a different receive/dispatch pattern for the RDMA path.

At these RDMA speeds, the aggregation latency savings are not visible over inference or even model loading time. The transfer of tensor payloads takes significantly more time than the per-message negotiation overhead.
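For completeness, the server-side shape of that single-message path would be roughly the following (again hypothetical names, not this PR's code; rdma_recv_msg is assumed to block until one complete message arrives and return its size):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// assumed helper: receives exactly one RDMA message into buf
size_t rdma_recv_msg(void * buf, size_t max_size);

static bool serve_one_request_rdma(std::vector<uint8_t> & scratch) {
    size_t n = rdma_recv_msg(scratch.data(), scratch.size());
    if (n < 1 + sizeof(uint64_t)) {
        return false;                          // malformed: the header alone is 9 bytes
    }
    uint8_t  cmd  = scratch[0];
    uint64_t size = 0;
    std::memcpy(&size, scratch.data() + 1, sizeof(size));
    if (n != 1 + sizeof(uint64_t) + size) {
        return false;                          // partial reads are not possible here
    }
    // dispatch(cmd, scratch.data() + 9, size);  // hand off to the existing handlers
    return true;
}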

For retesting, I merged the current master (b8611), added metrics for model loading time, and implemented this send aggregation to reduce small transfers.
Here it is:
diff PR: https://github.com/dvv47/llama.cpp/pull/1/changes
diff patch: rdma_v1.1.patch
I decided not to commit it here and just keep it as a patch.

The results were within margin of error:

With my homelab cluster: two Ryzen 395+ gfx1151 nodes connected with Mellanox ConnectX-4 Lx and ConnectX-6 Lx NICs.
Model: Qwen3-Coder-Next-UD-Q8_K_XL

Vulkan backend:
image

ROCm backend:
image

@pfn

pfn commented Apr 2, 2026

I just gave it a shot tonight, it's very neat

First with plain RPC

pfnguyen@neuron:~/llama.cpp$ docker compose exec -it llama-server bash
ubuntu@4a7893229529:/app$ /app/llama-bench -m /models/Qwen3.5-122B-A10B-GGUF-Q8_0/Qwen3.5-122B-A10B-Q8_0-00001-of-00004.gguf --rpc 192.168.177.11:50052,192
.168.177.12:50052 -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124546 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124546 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |          pp2048 |       493.04 ± 15.96 |
| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |           tg256 |         17.35 ± 0.09 |

build: 72a13c73b (8633)

A second time with the container running as root, /dev/infiniband, shared mem set, additional cap IPC_LOCK, etc.

root@neuron:/app# /app/llama-bench -m /models/Qwen3.5-122B-A10B-GGUF-Q8_0/Qwen3.5-122B-A10B-Q8_0-00001-of-00004.gguf --rpc 192.168.177.11:50052,192.168.177
.12:50052 -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124546 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124546 MiB
RDMA probed: dev=rocep1s0f1 gid=2 qpn=648 inline=316
RDMA activated: qpn=648->649 mtu=4096 rx_depth=24
RDMA probed: dev=rocep1s0f1 gid=2 qpn=650 inline=316
RDMA activated: qpn=650->744 mtu=4096 rx_depth=24
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |

| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |          pp2048 |        537.62 ± 1.64 |
| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |           tg256 |         18.42 ± 0.03 |

build: 72a13c73b (8633)

I'm curious about the MTU, though; I thought it should be 9000 per NVIDIA's configuration guide for connecting 2 devices?

Also tried running it as llama-server and using eugr/llama-benchy to compare:

root@neuron:/app# /app/llama-server -m /models/Qwen3.5-122B-A10B-GGUF-Q8_0/Qwen3.5-122B-A10B-Q8_0-00001-of-00004.gguf --rpc 192.168.177.12:50052 --jinja --
mmproj /models/Qwen3.5-122B-A10B-GGUF-Q8_0/mmproj-BF16.gguf -c 262144 --port 8000 --host 0.0.0.0 -ctv q8_0 -ctk q8_0
...
sandbox@c37ff8f29d08:~$ uvx llama-benchy --base-url http://neuron:8000/v1 --model Qwen/Qwen3.5-122B-A10B-FP8 --tg 256
...

| model                      |   test |            t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------|-------:|---------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 | 584.96 ± 10.76 |              | 3506.71 ± 64.71 | 3504.55 ± 64.71 | 3506.77 ± 64.70 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  tg256 |   18.44 ± 0.07 | 19.00 ± 0.00 |                 |                 |                 |

(I normally run -FP8 in vllm on my sparks)

@dvv101111 dvv101111 marked this pull request as ready for review April 2, 2026 06:28
@dvv101111
Contributor Author

@pfn

> I'm curious about the MTU, though; I thought it should be 9000 per NVIDIA's configuration guide for connecting 2 devices?

Raising the TCP MTU from 1500 to 9000 and the RDMA MTU from 1024 to 4096 helps reduce CPU load and, for large data transfers, increases bandwidth. This is beneficial for training and model loading, but requires network isolation: all devices on the segment must share the same MTU settings.

On the other hand, keeping the default MTU preserves the ability to use these devices on a regular network with internet access and other devices. In my case, I have a flat home network where the cluster nodes are connected to the same router as a laptop, mobile phones, and some IoT devices. RoCEv2 also works in this case and provides very low, stable latency.

In your case, it seems your TCP stack is already highly optimized, which is why the benefit from RDMA is not as large. But it's still there.
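For context, the mtu=1024 / mtu=4096 values in the "RDMA activated" log lines are the queue pair's path MTU, which is capped by the active MTU of the ports on both sides. In libibverbs it is set when the QP transitions to RTR, roughly like this (sketch only, not the PR code):

#include <infiniband/verbs.h>

// The path MTU must not exceed min(local, remote) active_mtu, which is why
// different setups above report mtu=1024 vs mtu=4096.
static bool qp_to_rtr(ibv_qp * qp, uint32_t dest_qpn, const ibv_gid & dest_gid,
                      uint8_t gid_index, ibv_mtu path_mtu /* e.g. IBV_MTU_4096 */) {
    ibv_qp_attr attr {};
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = path_mtu;
    attr.dest_qp_num        = dest_qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.is_global      = 1;          // RoCE always uses the GRH
    attr.ah_attr.grh.dgid       = dest_gid;
    attr.ah_attr.grh.sgid_index = gid_index;
    attr.ah_attr.grh.hop_limit  = 1;
    attr.ah_attr.port_num       = 1;
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                         IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER) == 0;
}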

@rgerganov
Member

> My conclusion is that merging multiple sends into one does not affect performance but requires significant code changes. Unlike TCP, where sent data is appended to the driver buffer on the receiver and can be partially read, RDMA is message-oriented: sending cmd+size+payload as one message requires reading exactly the same data via a single rdma_recv(). This leads to refactoring rpc_serve_client() to use a different receive/dispatch pattern for the RDMA path.

Thanks, this also confirms my experiments so far. I think giving up on partial reads is not a big deal, and your current implementation with swapped send/recv functions is quite straightforward. We can still optimize the protocol further by sending cmd+request_size (9 bytes) in one message followed by request_body (request_size bytes), but I prefer to do this in a follow-up PR.

Member

@rgerganov rgerganov left a comment


Please squash the commits and rebase on current master

@slavonnet

slavonnet commented Apr 11, 2026

Thank you for doing this!

I can check the difference on pure InfiniBand 56G (though only on CPUs; there are no GPUs there). In theory, the InfiniBand protocol has a 2-8x lower per-port latency, and this should greatly reduce delays and increase performance.

Ethernet (RoCE v2): latency ~10–50 µs (after tuning); jitter higher, depends on tuning.
InfiniBand: latency ~1–2 µs (native); jitter very low, deterministic.

I have Mellanox CX-3 VPI cards and a Mellanox InfiniBand switch. It's very cheap if you buy it on eBay :)

By the way, the DGX Spark is supposed to support InfiniBand port mode. If so, you need to run the OpenSM service on one of the nodes; without it the InfiniBand layer will not come up and there will be no link. I also have two NVIDIA Sparks coming before the end of the year, so I'm really following your progress :)

If you need me to build something and run a test, just send the commands to run (keeping in mind that I only have the CPU versions of the servers there).

@Mithras

Mithras commented Apr 12, 2026

> Ethernet (RoCE v2): latency ~10–50 µs

That doesn't sound right. I have a Mellanox X4 and in Eth mode with RoCE v2 I get ~1 µs latency.

@dvv101111 dvv101111 force-pushed the feat-rdma-9493 branch 2 times, most recently from 5433c79 to b1a5fde on April 12, 2026 20:54
@rgerganov
Member

On my testbed (two DGX Sparks connected with QSFP) I get almost identical performance from CUDA-over-RPC and local CUDA backend:

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 3113.46 ± 26.16 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 77.19 ± 0.60 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | CUDA | 99 | pp512 | 687.29 ± 8.00 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | CUDA | 99 | tg128 | 10.07 ± 0.01 |

build: b1a5fde (8769)

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | pp512 | 3170.49 ± 51.74 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | tg128 | 76.71 ± 0.18 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | RPC | 99 | pp512 | 638.88 ± 4.14 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | RPC | 99 | tg128 | 10.35 ± 0.01 |

build: b1a5fde (8769)

Unfortunately, loading the model with -sm tensor over RPC is super slow because we make thousands of set_tensor calls with very small chunks of data:

...
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38064105, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38064870, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38065635, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38066400, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38067165, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38067930, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38068695, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38069460, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38070225, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38070990, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38071755, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38072520, size: 765
...

We need to figure out how to improve this.
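One possible direction (purely a sketch, not something implemented in this PR; rpc_set_tensor here is a hypothetical stand-in for the real RPC call) would be to coalesce back-to-back writes that target contiguous offsets of the same remote buffer before going over the wire:

#include <cstddef>
#include <cstdint>
#include <vector>

// assumed entry point that performs the actual set_tensor RPC
void rpc_set_tensor(void * buffer, size_t offset, const void * data, size_t size);

struct write_coalescer {
    void               * buffer = nullptr;
    size_t               offset = 0;       // offset where the pending run starts
    std::vector<uint8_t> pending;

    void add(void * buf, size_t off, const void * data, size_t size) {
        const bool contiguous = (buf == buffer) && (off == offset + pending.size());
        if (!contiguous) {
            flush();                       // start a new run
            buffer = buf;
            offset = off;
        }
        const uint8_t * p = static_cast<const uint8_t *>(data);
        pending.insert(pending.end(), p, p + size);
    }

    void flush() {
        if (!pending.empty()) {
            rpc_set_tensor(buffer, offset, pending.data(), pending.size());
            pending.clear();
        }
    }
};

In the log above the 765-byte writes land at strictly increasing, adjacent offsets of the same buffer, so a coalescer like this would collapse thousands of tiny RPCs into a few large ones; whether that is the right layer to fix it at is an open question.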

Member

@rgerganov rgerganov left a comment


I think this is good to go as an initial version for RDMA support. I will prepare a follow up patch which moves all transport related stuff in a separate file (e.g. transport.cpp) and clean up the socket_t interface.

To summarize the design decisions made here:

  • socket_t encapsulates how client-server communication is done
  • partial reads are no longer supported due to RDMA
  • transport capabilities are negotiated as part of RPC_CMD_HELLO
  • rdma_poll() is doing a busy loop causing 100% CPU usage on one core (not sure how we can avoid this)
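On the busy-loop point, one possible alternative (just a sketch, not validated against this code) is to create the CQ with a completion channel and block on it instead of spinning:

#include <infiniband/verbs.h>

// Event-driven completion waiting instead of a busy poll. The CQ must have been
// created with a completion channel:
//   ibv_comp_channel * ch = ibv_create_comp_channel(ctx);
//   ibv_cq * cq = ibv_create_cq(ctx, depth, nullptr, ch, 0);
static bool wait_completion(ibv_comp_channel * ch, ibv_cq * cq, ibv_wc * wc) {
    // fast path: a completion may already be queued
    if (ibv_poll_cq(cq, 1, wc) == 1) {
        return wc->status == IBV_WC_SUCCESS;
    }
    if (ibv_req_notify_cq(cq, 0) != 0) {       // arm the CQ before sleeping
        return false;
    }
    // re-check to close the race between the poll above and arming
    if (ibv_poll_cq(cq, 1, wc) == 1) {
        return wc->status == IBV_WC_SUCCESS;
    }
    ibv_cq * ev_cq  = nullptr;
    void   * ev_ctx = nullptr;
    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx) != 0) {  // blocks without spinning
        return false;
    }
    ibv_ack_cq_events(ev_cq, 1);
    return ibv_poll_cq(cq, 1, wc) == 1 && wc->status == IBV_WC_SUCCESS;
}

The tradeoff is extra wakeup latency per completion, which may matter for the small-message latencies this transport is optimizing for.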

@dvv101111
Contributor Author

I also want to make cross-platform Windows/Linux RPC RDMA work in the next version. Probably not a very popular setup, but OK for a home lab.
I already have a proof of concept.

@rgerganov rgerganov merged commit adb541a into ggml-org:master Apr 15, 2026
46 of 47 checks passed
@rgerganov
Member

This is the follow-up refactoring: #21998
It's still WIP but feedback is welcome

mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026