
ggml-rpc: add native RDMA transport for RPC backend (RoCEv2)#20590

Merged
rgerganov merged 1 commit into ggml-org:master from dvv101111:feat-rdma-9493 on Apr 15, 2026

Conversation

@dvv101111
Contributor

Adds optional RDMA (RoCEv2) transport to the RPC backend

Two-node cluster: AMD Radeon 8060S (gfx1151) iGPUs, ConnectX-4 Lx / ConnectX-6 Lx 25GbE, RoCEv2. Model: Qwen3-Coder-Next 80B Q8_K_XL, layer split (-sm layer -ts 1/1) across both nodes, ROCm backend.

| Metric | TCP (t/s) | RDMA (t/s) | Improvement |
| --- | ---: | ---: | ---: |
| Prompt processing (pp2048) | 651.48 | 678.42 | +4.1% |
| Token generation (tg256) | 30.19 | 32.16 | +6.5% |

Optional dependency: libibverbs-dev / rdma-core (Linux only). Without the CMake flag, nothing changes -- no new headers, no new linkage.
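Rough sketch of what the device probing step boils down to with libibverbs (illustrative only, not the patch code; the helper name and the GGML_RDMA_DEV / GGML_RDMA_GID overrides shown here mirror the environment variables discussed later in this thread):

#include <infiniband/verbs.h>
#include <cstdlib>
#include <cstring>
#include <string>

// hypothetical helper: pick the first ACTIVE Ethernet (RoCE) port, honoring the
// GGML_RDMA_DEV / GGML_RDMA_GID overrides if they are set
static bool probe_rdma_device(std::string & dev_out, int & gid_out) {
    const char * want_dev = std::getenv("GGML_RDMA_DEV");
    const char * want_gid = std::getenv("GGML_RDMA_GID");
    int num = 0;
    ibv_device ** list = ibv_get_device_list(&num);
    if (!list) {
        return false;
    }
    bool found = false;
    for (int i = 0; i < num && !found; i++) {
        const char * name = ibv_get_device_name(list[i]);
        if (want_dev && std::strcmp(want_dev, name) != 0) {
            continue; // user pinned a specific device
        }
        ibv_context * ctx = ibv_open_device(list[i]);
        if (!ctx) {
            continue;
        }
        ibv_port_attr port {};
        if (ibv_query_port(ctx, 1, &port) == 0 &&
            port.state == IBV_PORT_ACTIVE &&
            port.link_layer == IBV_LINK_LAYER_ETHERNET) {
            dev_out = name;
            gid_out = want_gid ? std::atoi(want_gid) : 0; // real GID selection would use ibv_query_gid
            found   = true;
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return found;
}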

@github-actions github-actions bot added the documentation, examples, and ggml labels on Mar 15, 2026
@ggerganov
Member

@dvv47 Do you know if this implementation is going to be compatible if I connect 2 DGX Sparks with a 200Gb/s QSFP cable? I am planning to buy the cable and test it.

@Mithras

Mithras commented Mar 15, 2026

I might be doing something wrong but this is as far as I was able to get:
image
I have dual strix halo with x-5 and x-4 in ETH mode. No matter what GGML_RDMA_GID I set, I'm just getting infinite

llama-1  | Accepted client connection
llama-1  | RDMA probed: dev=rocep101s0 gid=5 qpn=322 inline=316
llama-1  | RDMA activated: qpn=322->463 mtu=1024 rx_depth=24
llama-1  | [get_device_memory] device: 0, free_mem: 134044098560, total_mem: 134217728000
llama-1  | Client connection closed
llama-1  | RDMA CQ poll timeout

loop. Models never finish loading. It might be something specific to containers... I've tried host network, privileged, mapping /dev/infiniband, etc. Wasn't able to make this work. Is there something obvious I'm missing?

@dvv101111
Contributor Author

@ggerganov Yes, it will work with DGX Spark as-is; they use Mellanox ConnectX-7 and exactly the same libibverbs API. The low-latency benefits will be there, but full bandwidth utilization for faster model weight transfer could be optimized even further.

@dvv101111
Contributor Author

@Mithras it's possible that this is a docker limitation. Have you tried running it natively? I will also test it with Docker and will let you know the results. Maybe using --ipc=host will help.

@dvv101111 dvv101111 requested a review from a team as a code owner March 15, 2026 21:56
@dvv101111
Contributor Author

dvv101111 commented Mar 15, 2026

Fixed the issue; it now reconnects better. I have retested with the ROCm 7.2 Docker container; my previous test was with the nightly version.

image

Also, according to my tests, the Vulkan backend shows a smaller RDMA gain, about 3%, in both PP and TG. I am also interested in testing how this affects CUDA devices. I will probably try it this week.

@Mithras for this strix halo:

docker run \
    --network host \
    --ipc host \
    --privileged \
    --ulimit memlock=-1:-1 \
    --security-opt seccomp=unconfined \
    --device /dev/dri \
    --device /dev/kfd \
    -v /dev/infiniband:/dev/infiniband \
    -e HSA_OVERRIDE_GFX_VERSION=11.5.1

and if you build it yourself:

RUN cmake -B build \
    -DGGML_HIP=ON \
    -DGGML_HIP_NO_VMM=ON \
    -DLLAMA_HIP_UMA=ON \
    -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON \
    -DCMAKE_HIP_FLAGS="--rocm-path=/opt/rocm -mllvm --amdgpu-unroll-threshold-local=600" \
    -DAMDGPU_TARGETS=gfx1151 \
    -DGGML_RPC=ON \
    -DGGML_RPC_RDMA=ON

@Mithras

Mithras commented Mar 16, 2026

@dvv47 it mostly works now, thank you so much! A weird issue that is still present: it works until you let it idle for a few minutes, and then it throws the same timeout error:
2026-03-15_21-25
But that only happens after letting it idle for a couple of minutes. Prior to that I was able to run pretty big prompts from Perplexica just fine.


Also, it seems the only thing needed to make RDMA work in Docker is passing the /dev/infiniband device.

@dvv101111 dvv101111 marked this pull request as draft March 16, 2026 10:45
@dvv101111
Contributor Author

@Mithras fixed that and tested on some agentic tasks

@Mithras

Mithras commented Mar 16, 2026

@dvv47 this works great, thank you so much! Real perf boost for free!
Another feature/question: would it be possible to set GGML_RDMA_DEV and GGML_RDMA_GID per RPC server? I have two Strix Halos and one of them has a two-port X-5 NIC with one port connected to my desktop PC. It will probably just work if I use network: host, but it would be nice to be able to use a multi-port setup in an isolated network where automatic dev/gid detection doesn't work.

@krampenschiesser

Works with virtual RoCE v2 too, but of course slightly slower than the Ethernet stack below it.
(used rdma link add rxe0 type rxe netdev eth0)

ggml_cuda_init: found 5 CUDA devices (Total VRAM: 79251 MiB):
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 2: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 3: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
  Device 4: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 15850 MiB
RDMA probed: dev=rxe0 gid=2 qpn=20 inline=256
RDMA activated: qpn=20->20 mtu=4096 rx_depth=24
| model                          |       size |     params | backend    | ngl | fa | mmap | dio |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --: | --------------: | -------------------: |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |           pp512 |        577.62 ± 6.70 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |          pp1024 |       412.20 ± 12.36 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |          pp2048 |        413.64 ± 5.54 |
| minimax-m2 230B.A10B IQ4_XS - 4.25 bpw | 113.52 GiB |   228.69 B | CUDA,RPC   |  99 |  1 |    0 |   1 |         pp10000 |        370.39 ± 4.64 |

Skipped the tg numbers since there is barely any traffic (4 MB here and there).

AlexWorland pushed a commit to AlexWorland/llama.cpp that referenced this pull request Mar 19, 2026
…Ev2)

Base PR from dvv47 adding optional RDMA transport with auto-negotiation
during RPC HELLO handshake. This commit is the unmodified PR applied to
current master; macOS Thunderbolt adaptations follow in subsequent commits.

Ref: ggml-org#20590

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlexWorland pushed a commit to AlexWorland/llama.cpp that referenced this pull request Mar 19, 2026
…Ev2)

Base PR from dvv47 adding optional RDMA transport with auto-negotiation
during RPC HELLO handshake. This commit is the unmodified PR applied to
current master; macOS Thunderbolt adaptations follow in subsequent commits.

Ref: ggml-org#20590

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ggerganov
Member

> @dvv47 Do you know if this implementation is going to be compatible if I connect 2 DGX Sparks with a 200Gb/s QSFP cable? I am planning to buy the cable and test it.

> @ggerganov Yes, it will work with DGX Spark as-is; they use Mellanox ConnectX-7 and exactly the same libibverbs API. The low-latency benefits will be there, but full bandwidth utilization for faster model weight transfer could be optimized even further.

I got the QSFP cable today. Will report results when I get some time to try it out.

@ggerganov
Member

ggerganov commented Mar 21, 2026

Got the two DGX Sparks connected. Not sure I am getting the full 200Gb/s speed, but it should be at least 100Gb/s if I am reading the numbers correctly:

Details

Cable: https://www.naddod.com/products/102069.html

$ ibdev2netdev 
rocep1s0f0 port 1 ==> enp1s0f0np0 (Up)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Down)
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Up)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Down)
$ ib_write_bw -d rocep1s0f0 --report_gbits -F
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF		Device         : rocep1s0f0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs	: OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x0178 PSN 0xe286ee RKey 0x18431d VAddr 0x00f2e9affd8000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:78:122
 remote address: LID 0000 QPN 0x017a PSN 0x476605 RKey 0x184300 VAddr 0x00e9841df15000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:48:241
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             108.30             108.29 		  0.206553
---------------------------------------------------------------------------------------
# nccl-tests version 2.18.2 nccl-headers=22809 nccl-library=22809
# Collective test starting: all_gather_perf
# nThread 1 nGpus 1 minBytes 17179869184 maxBytes 17179869184 step: 2(factor) warmup iters: 1 iters: 20 agg iters: 1 validation: 1 graph: 0 unalign: 0
#
# Using devices
#  Rank  0 Group  0 Pid  31967 on spark-17ed device  0 [000f:01:00] NVIDIA GB10
#  Rank  1 Group  0 Pid  32456 on spark-a163 device  0 [000f:01:00] NVIDIA GB10
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong 
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)         
 17179869184    2147483648     float    none      -1   495564   34.67   17.33       0   396500   43.33   21.66       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 19.499 
#
# Collective test concluded: all_gather_perf
#
$ ib_write_lat -d rocep1s0f0 --report_gbits
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write Latency Test
 Dual-port       : OFF		Device         : rocep1s0f0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: OFF
 ibv_wr* API     : ON
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 220[B]
 rdma_cm QPs	: OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x017a PSN 0x7b2cdb RKey 0x1707c4 VAddr 0x00c31234c02000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:78:122
 remote address: LID 0000 QPN 0x017c PSN 0x90bf28 RKey 0x169b59 VAddr 0x00b51d8310f000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:169:254:48:241
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec] 
 2       1000          1.65           2.81         1.74     	      1.74        	0.03   		1.86    		2.81   
---------------------------------------------------------------------------------------

@dvv101111
Contributor Author

I heard that the networking on the DGX Spark is a bit complicated. In reality, each port is split into two links with a speed of 100 Gbps each. So, in theory, you can set up link bonding and achieve 200 Gbps over one cable. Or, without bonding, you can get 100 Gbps over one port and, at the same time, 100 Gbps to a second device through the second port. For this RPC RDMA feature, most of the benefit comes from a stable, low-latency connection, so I expect no difference between 25 Gbps and even 200 Gbps links.

@ggerganov
Member

Yup, that's what I figured as well. Added the latency numbers to the previous comment for completeness.

@v0l

v0l commented Mar 26, 2026

EDIT 3: Got it working between RTX Pro 6000 (x86_64 host) and DGX Spark (arm64) loading Qwen3.5 122B Q8 (200GB+ model)

EDIT 2: I rebased on master which has a bug in rpc-server: fixed in #21030

EDIT: MTU issue was first problem, fixed that but having another error now... will update later

Hey just checking out this PR,

Details

I tried to run rpc-server between DGX Spark -> Linux PC (RTX Pro 6000), but I get an exception running llama-bench:
kieran@spark-95d6:~/git/llama.cpp/build$ bin/llama-bench --rpc 10.100.50.1:50052 \
  -m ~/.cache/huggingface/hub/models--unsloth--Qwen3.5-122B-A10B-GGUF/snapshots/51eab4d59d53f573fb9206cb3ce613f1d0aa392b/Q4_K_M/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
RDMA probed: dev=rocep1s0f1 gid=2 qpn=552 inline=316
RDMA activated: qpn=552->260 mtu=4096 rx_depth=24
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
/home/kieran/git/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:1125: Remote RPC server crashed or returned malformed response
[New LWP 2814330]
[New LWP 2814326]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libhns-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmthca-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libhfi1verbs-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libirdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmlx5-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libefa-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libocrdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libbnxt_re-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libcxgb4-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libvmw_pvrdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libqedr-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/liberdma-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmlx4-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libipathverbs-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libsiw-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/libmana-rdmav34.so
warning: could not find '.gnu_debugaltlink' file for /usr/lib/aarch64-linux-gnu/libibverbs/librxe-rdmav34.so
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000f68df09c7b74 in __GI___wait4 (pid=2815007, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x0000f68df09c7b74 in __GI___wait4 (pid=2815007, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x0000f68df13a5a1c in ggml_print_backtrace () from /home/kieran/git/llama.cpp/build/bin/libggml-base.so.0
#2  0x0000f68df13a5bc0 in ggml_abort () from /home/kieran/git/llama.cpp/build/bin/libggml-base.so.0
#3  0x0000f68dee6aa8a0 in ggml_backend_rpc_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /home/kieran/git/llama.cpp/build/bin/libggml-rpc.so.0
#4  0x0000f68df158b790 in llama_model_loader::load_all_data(ggml_context*, std::unordered_map<unsigned int, ggml_backend_buffer*, std::hash<unsigned int>, std::equal_to<unsigned int>, std::allocator<std::pair<unsigned int const, ggml_backend_buffer*> > >&, std::vector<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> >, std::allocator<std::unique_ptr<llama_mlock, std::default_delete<llama_mlock> > > >*, bool (*)(float, void*), void*) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#5  0x0000f68df15a8d7c in llama_model::load_tensors(llama_model_loader&) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#6  0x0000f68df14f3098 in llama_model_load_from_file_impl(gguf_context*, void (*)(ggml_tensor*, void*), void*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, llama_model_params) () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#7  0x0000f68df14f43a0 in llama_model_load_from_file () from /home/kieran/git/llama.cpp/build/bin/libllama.so.0
#8  0x0000acf83a58d9fc in main ()
[Inferior 1 (process 2814325) detached]
Aborted (core dumped)

PC side:

kieran@kieran-x ~/g/l/build (feat-rdma-9493)> bin/rpc-server -H 0.0.0.0 -p 50052 -c
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97204 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97204 MiB

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING: Host ('0.0.0.0') is != '127.0.0.1'
         Never expose the RPC server to an open network!
         This is an experimental feature and is not secure!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Starting RPC server v3.7.1
  endpoint       : 0.0.0.0:50052
  local cache    : /home/kieran/.cache/llama.cpp/rpc/
Devices:
  CUDA0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (97204 MiB, 95458 MiB free)
  transport      : TCP (RDMA auto-negotiate enabled)
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=260 inline=316
RDMA activated: qpn=260->552 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=262 inline=316
RDMA activated: qpn=262->553 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=263 inline=316
RDMA activated: qpn=263->554 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=264 inline=316
RDMA activated: qpn=264->555 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=265 inline=316
RDMA activated: qpn=265->556 mtu=1024 rx_depth=24
Client connection closed
Accepted client connection
RDMA probed: dev=mlx5_0 gid=2 qpn=266 inline=316
RDMA activated: qpn=266->557 mtu=1024 rx_depth=24
Client connection closed

@rgerganov
Member

@dvv47 thank you for this PoC, I finally got some time to play with this. For the record I am using the same testbed as @ggerganov -- two DGX Sparks connected with a single QSFP cable.

I benchmarked token generation (tg128) with gpt-oss-20b using a single RPC server with different backends and transports.

| Transport | CPU backend (t/s) | CUDA backend (t/s) |
| --- | ---: | ---: |
| TCP (remote) | 32.58 ± 0.17 | 67.25 ± 0.37 |
| TCP (local) | 33.47 ± 0.17 | 72.61 ± 0.49 |
| RDMA | 35.12 ± 0.14 | 75.04 ± 0.21 |

It's really impressive that using a remote server over RDMA is faster than one running on localhost with TCP/IP. And here I am using the TCP/IP stack provided by the ConnectX-7 NIC; I suspect the performance of a standard Ethernet NIC will be lower (will test this soon).

As a next step I think we should try to come up with a better way to abstract the underlying transport, so we can both reuse code and implement transport specific optimizations. For example the current implementation of send_rpc_cmd is:

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// No response
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    uint8_t cmd_byte = cmd;
    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }
    return true;
}

Doing multiple send_data in a row with TCP/IP is fine as we don't have to wait for reply as I explained here, but this is no longer the case with RDMA. I think we need to abstract the transport on a higher level than plain send/recv functions.

Another possible way to leverage RDMA is to expose host buffers and load tensors directly into the RDMA buffers. This should speed up model loading.
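For reference, "exposing host buffers" in ibverbs terms means registering them as memory regions so the NIC can access them directly. A minimal sketch under that assumption (illustrative only, not part of this PR; error handling and cleanup elided):

#include <infiniband/verbs.h>
#include <cstdlib>

// Register a host buffer so the NIC can read/write it directly. A remote peer
// that learns (addr, rkey) could then RDMA WRITE tensor data into it without
// an intermediate copy.
struct rdma_buffer {
    void   * addr;
    size_t   size;
    ibv_mr * mr;
};

static rdma_buffer register_host_buffer(ibv_pd * pd, size_t size) {
    rdma_buffer buf {};
    buf.size = size;
    buf.addr = std::aligned_alloc(4096, size);       // page-aligned allocation
    buf.mr   = ibv_reg_mr(pd, buf.addr, size,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);   // pin + register with the NIC
    // buf.mr->rkey is what the peer needs for one-sided RDMA WRITE/READ
    return buf;
}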

@dvv101111
Contributor Author

@rgerganov

> Doing multiple send_data in a row with TCP/IP is fine as we don't have to wait for reply as I explained #16892 (comment), but this is no longer the case with RDMA. I think we need to abstract the transport on a higher level than plain send/recv functions.

> Another possible way to leverage RDMA is to expose host buffers and load tensors directly into the RDMA buffers. This should speed up model loading.

Yes, I was actually thinking about the same idea: raising RDMA to a higher level to get the ability to use RDMA WRITE operations, which would require less synchronization. However, when I made a test implementation of that, I got exactly the same inference performance.

So I agree with you that this would improve model loading time, but at the cost of more code complexity, because the RPC transport would then have two different high-level interfaces, TCP vs RDMA.

If you'd like, I can bring this idea back to life and publish it as a separate branch/pull request.

@rgerganov
Member

By raising the level of abstraction I actually mean something very simple: instead of abstracting primitive send/recv functions, let's try to abstract send_rpc_cmd. The TCP/IP implementation stays as-is, but the RDMA implementation concatenates rpc_cmd, request_size and request_data and sends them at once. This way we make one round-trip to the server instead of three, which should give us some measurable improvement IMO.
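A minimal sketch of what that could look like (hypothetical names, not this PR's code; rdma_send_msg is assumed to post a single RDMA SEND of the given size and wait for its completion):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// assumed helper: ships one complete RDMA message
bool rdma_send_msg(const void * data, size_t size);

static bool send_rpc_cmd_rdma(uint8_t cmd, const void * input, size_t input_size) {
    std::vector<uint8_t> msg(1 + sizeof(uint64_t) + input_size);
    uint64_t size64 = input_size;
    msg[0] = cmd;                                         // | rpc_cmd (1 byte) |
    std::memcpy(msg.data() + 1, &size64, sizeof(size64)); // | request_size (8 bytes) |
    std::memcpy(msg.data() + 9, input, input_size);       // | request_data |
    return rdma_send_msg(msg.data(), msg.size());         // one message instead of three
}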

@dvv101111
Contributor Author

dvv101111 commented Apr 1, 2026

@rgerganov
My conclusion is that merging multiple sends into one does not affect performance but requires significant code changes. Unlike TCP, where sent data is appended to the driver buffer on the receiver and can be partially read, RDMA is message-oriented: sending cmd+size+payload as one message requires reading exactly the same data via a single rdma_recv(). This leads to refactoring rpc_serve_client() to use a different receive/dispatch pattern for the RDMA path.

At these RDMA speeds, the aggregation latency savings are not visible over inference or even model loading time. The transfer of tensor payloads takes significantly more time than the per-message negotiation overhead.
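For completeness, the server-side shape of that single-message path would be roughly the following (again hypothetical names, not this PR's code; rdma_recv_msg is assumed to block until one complete message arrives and return its size):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// assumed helper: receives exactly one RDMA message into buf
size_t rdma_recv_msg(void * buf, size_t max_size);

static bool serve_one_request_rdma(std::vector<uint8_t> & scratch) {
    size_t n = rdma_recv_msg(scratch.data(), scratch.size());
    if (n < 1 + sizeof(uint64_t)) {
        return false;                          // malformed: the header alone is 9 bytes
    }
    uint8_t  cmd  = scratch[0];
    uint64_t size = 0;
    std::memcpy(&size, scratch.data() + 1, sizeof(size));
    if (n != 1 + sizeof(uint64_t) + size) {
        return false;                          // partial reads are not possible here
    }
    // dispatch(cmd, scratch.data() + 9, size);  // hand off to the existing handlers
    return true;
}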

For retesting, I merged the current master (b8611), added metrics for model loading time, and implemented this send aggregation to reduce small transfers.
Here it is:
diff PR: https://github.com/dvv47/llama.cpp/pull/1/changes
diff patch: rdma_v1.1.patch
I decided not to commit it here and just keep it as a patch.

The results were within margin of error:

With my homelab cluster: two Ryzen 395+ gfx1151 nodes connected with Mellanox ConnectX-4 Lx and ConnectX-6 Lx NICs.
Model: Qwen3-Coder-Next-UD-Q8_K_XL

Vulkan backend:
image

ROCm backend:
image

@pfn

pfn commented Apr 2, 2026

I just gave it a shot tonight, it's very neat

First with plain RPC

pfnguyen@neuron:~/llama.cpp$ docker compose exec -it llama-server bash
ubuntu@4a7893229529:/app$ /app/llama-bench -m /models/Qwen3.5-122B-A10B-GGUF-Q8_0/Qwen3.5-122B-A10B-Q8_0-00001-of-00004.gguf --rpc 192.168.177.11:50052,192
.168.177.12:50052 -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124546 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124546 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |          pp2048 |       493.04 ± 15.96 |
| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |           tg256 |         17.35 ± 0.09 |

build: 72a13c73b (8633)

A second time with the container running as root, /dev/infiniband, shared mem set, additional cap IPC_LOCK, etc.

root@neuron:/app# /app/llama-bench -m /models/Qwen3.5-122B-A10B-GGUF-Q8_0/Qwen3.5-122B-A10B-Q8_0-00001-of-00004.gguf --rpc 192.168.177.11:50052,192.168.177
.12:50052 -p 2048 -n 256
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124546 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124546 MiB
RDMA probed: dev=rocep1s0f1 gid=2 qpn=648 inline=316
RDMA activated: qpn=648->649 mtu=4096 rx_depth=24
RDMA probed: dev=rocep1s0f1 gid=2 qpn=650 inline=316
RDMA activated: qpn=650->744 mtu=4096 rx_depth=24
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |

| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |          pp2048 |        537.62 ± 1.64 |
| qwen35moe 122B.A10B Q8_0       | 120.94 GiB |   122.11 B | CUDA,RPC   |  99 |           tg256 |         18.42 ± 0.03 |

build: 72a13c73b (8633)

I'm curious about the MTU, though; I thought it should be 9000 per NVIDIA's configuration guide for connecting 2 devices?

Also tried running it as llama-server and using eugr/llama-benchy to compare:

root@neuron:/app# /app/llama-server -m /models/Qwen3.5-122B-A10B-GGUF-Q8_0/Qwen3.5-122B-A10B-Q8_0-00001-of-00004.gguf --rpc 192.168.177.12:50052 --jinja --
mmproj /models/Qwen3.5-122B-A10B-GGUF-Q8_0/mmproj-BF16.gguf -c 262144 --port 8000 --host 0.0.0.0 -ctv q8_0 -ctk q8_0
...
sandbox@c37ff8f29d08:~$ uvx llama-benchy --base-url http://neuron:8000/v1 --model Qwen/Qwen3.5-122B-A10B-FP8 --tg 256
...

| model                      |   test |            t/s |     peak t/s |       ttfr (ms) |    est_ppt (ms) |   e2e_ttft (ms) |
|:---------------------------|-------:|---------------:|-------------:|----------------:|----------------:|----------------:|
| Qwen/Qwen3.5-122B-A10B-FP8 | pp2048 | 584.96 ± 10.76 |              | 3506.71 ± 64.71 | 3504.55 ± 64.71 | 3506.77 ± 64.70 |
| Qwen/Qwen3.5-122B-A10B-FP8 |  tg256 |   18.44 ± 0.07 | 19.00 ± 0.00 |                 |                 |                 |

(I normally run -FP8 in vllm on my sparks)

@dvv101111 dvv101111 marked this pull request as ready for review April 2, 2026 06:28
@dvv101111
Contributor Author

@pfn

> I'm curious about the MTU, though; I thought it should be 9000 per NVIDIA's configuration guide for connecting 2 devices?

Raising the TCP MTU from 1500 to 9000 and the RDMA MTU from 1024 to 4096 helps reduce CPU load and, for large data transfers, increases bandwidth. This is beneficial for training and model loading, but requires network isolation: all devices on the segment must share the same MTU settings.

On the other hand, keeping the default MTU preserves the ability to use these devices on a regular network with internet access and other devices. In my case, I have a flat home network where the cluster nodes are connected to the same router as a laptop, mobile phones, and some IoT devices. RoCEv2 also works in this case and provides very low, stable latency.

In your case, it seems your TCP stack is already highly optimized, which is why the benefit from RDMA is not as large. But it's still there.
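For context, the mtu=1024 / mtu=4096 values in the "RDMA activated" log lines are the queue pair's path MTU, which is capped by the active MTU of the ports on both sides. In libibverbs it is set when the QP transitions to RTR, roughly like this (sketch only, not the PR code):

#include <infiniband/verbs.h>

// The path MTU must not exceed min(local, remote) active_mtu, which is why
// different setups above report mtu=1024 vs mtu=4096.
static bool qp_to_rtr(ibv_qp * qp, uint32_t dest_qpn, const ibv_gid & dest_gid,
                      uint8_t gid_index, ibv_mtu path_mtu /* e.g. IBV_MTU_4096 */) {
    ibv_qp_attr attr {};
    attr.qp_state           = IBV_QPS_RTR;
    attr.path_mtu           = path_mtu;
    attr.dest_qp_num        = dest_qpn;
    attr.rq_psn             = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer      = 12;
    attr.ah_attr.is_global      = 1;          // RoCE always uses the GRH
    attr.ah_attr.grh.dgid       = dest_gid;
    attr.ah_attr.grh.sgid_index = gid_index;
    attr.ah_attr.grh.hop_limit  = 1;
    attr.ah_attr.port_num       = 1;
    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
                         IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER) == 0;
}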

@rgerganov
Member

> My conclusion is that merging multiple sends into one does not affect performance but requires significant code changes. Unlike TCP, where sent data is appended to the driver buffer on the receiver and can be partially read, RDMA is message-oriented: sending cmd+size+payload as one message requires reading exactly the same data via a single rdma_recv(). This leads to refactoring rpc_serve_client() to use a different receive/dispatch pattern for the RDMA path.

Thanks, this also confirms my experiments so far. I think giving up on partial reads is not a big deal, and your current implementation with swapped send/recv functions is quite straightforward. We can still optimize the protocol further by sending cmd+request_size (9 bytes) in one message followed by request_body (request_size bytes), but I prefer to do this in a follow-up PR.

Member

@rgerganov rgerganov left a comment


Please squash the commits and rebase on current master

@slavonnet

slavonnet commented Apr 11, 2026

Thank you for doing this!

I can check the difference on pure InfiniBand 56G (though only on CPUs; there are no GPUs there). In theory, the InfiniBand protocol has a 2-8x lower per-port latency, and this should greatly reduce delays and increase performance.

Ethernet (RoCE v2): latency ~10–50 µs (after tuning); jitter higher, depends on tuning.
InfiniBand: latency ~1–2 µs (native); jitter very low, deterministic.

I have Mellanox CX-3 VPI cards and a Mellanox InfiniBand switch. It's very cheap if you buy it on eBay :)

By the way, the DGX Spark is supposed to support InfiniBand port mode. If so, you need to run the OpenSM service on one of the nodes; without it the InfiniBand layer will not come up and there will be no link. I also have two NVIDIA Sparks coming before the end of the year, so I'm really following your progress :)

If you need me to build something and run a test, just send the commands to run (keeping in mind that I only have the CPU versions of the servers there).

@Mithras

Mithras commented Apr 12, 2026

> Ethernet (RoCE v2): latency ~10–50 µs

That doesn't sound right. I have a Mellanox X4 and in Eth mode with RoCE v2 I get ~1 µs latency.

@dvv101111 dvv101111 force-pushed the feat-rdma-9493 branch 2 times, most recently from 5433c79 to b1a5fde on April 12, 2026 20:54
@rgerganov
Member

On my testbed (two DGX Sparks connected with QSFP) I get almost identical performance from CUDA-over-RPC and local CUDA backend:

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | pp512 | 3113.46 ± 26.16 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | tg128 | 77.19 ± 0.60 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | CUDA | 99 | pp512 | 687.29 ± 8.00 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | CUDA | 99 | tg128 | 10.07 ± 0.01 |

build: b1a5fde (8769)

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | ---: | ---: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | pp512 | 3170.49 ± 51.74 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | tg128 | 76.71 ± 0.18 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | RPC | 99 | pp512 | 638.88 ± 4.14 |
| gemma4 ?B Q4_K - Medium | 17.39 GiB | 30.70 B | RPC | 99 | tg128 | 10.35 ± 0.01 |

build: b1a5fde (8769)

Unfortunately, loading the model with -sm tensor over RPC is super slow because we make thousands of set_tensor calls with very small chunks of data:

...
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38064105, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38064870, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38065635, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38066400, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38067165, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38067930, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38068695, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38069460, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38070225, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38070990, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38071755, size: 765
[set_tensor] buffer: 0xafc800794370, data: 0xe2e99767b300, offset: 38072520, size: 765
...

We need to figure out how to improve this.
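One possible direction (purely a sketch, not something implemented in this PR; rpc_set_tensor here is a hypothetical stand-in for the real RPC call) would be to coalesce back-to-back writes that target contiguous offsets of the same remote buffer before going over the wire:

#include <cstddef>
#include <cstdint>
#include <vector>

// assumed entry point that performs the actual set_tensor RPC
void rpc_set_tensor(void * buffer, size_t offset, const void * data, size_t size);

struct write_coalescer {
    void               * buffer = nullptr;
    size_t               offset = 0;       // offset where the pending run starts
    std::vector<uint8_t> pending;

    void add(void * buf, size_t off, const void * data, size_t size) {
        const bool contiguous = (buf == buffer) && (off == offset + pending.size());
        if (!contiguous) {
            flush();                       // start a new run
            buffer = buf;
            offset = off;
        }
        const uint8_t * p = static_cast<const uint8_t *>(data);
        pending.insert(pending.end(), p, p + size);
    }

    void flush() {
        if (!pending.empty()) {
            rpc_set_tensor(buffer, offset, pending.data(), pending.size());
            pending.clear();
        }
    }
};

In the log above the 765-byte writes land at strictly increasing, adjacent offsets of the same buffer, so a coalescer like this would collapse thousands of tiny RPCs into a few large ones; whether that is the right layer to fix it at is an open question.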

Member

@rgerganov rgerganov left a comment


I think this is good to go as an initial version for RDMA support. I will prepare a follow up patch which moves all transport related stuff in a separate file (e.g. transport.cpp) and clean up the socket_t interface.

To summarize the design decisions made here:

  • socket_t encapsulates how client-server communication is done
  • partial reads are no longer supported due to RDMA
  • transport capabilities are negotiated as part of RPC_CMD_HELLO
  • rdma_poll() is doing a busy loop causing 100% CPU usage on one core (not sure how we can avoid this)
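On the busy-loop point, one possible alternative (just a sketch, not validated against this code) is to create the CQ with a completion channel and block on it instead of spinning:

#include <infiniband/verbs.h>

// Event-driven completion waiting instead of a busy poll. The CQ must have been
// created with a completion channel:
//   ibv_comp_channel * ch = ibv_create_comp_channel(ctx);
//   ibv_cq * cq = ibv_create_cq(ctx, depth, nullptr, ch, 0);
static bool wait_completion(ibv_comp_channel * ch, ibv_cq * cq, ibv_wc * wc) {
    // fast path: a completion may already be queued
    if (ibv_poll_cq(cq, 1, wc) == 1) {
        return wc->status == IBV_WC_SUCCESS;
    }
    if (ibv_req_notify_cq(cq, 0) != 0) {       // arm the CQ before sleeping
        return false;
    }
    // re-check to close the race between the poll above and arming
    if (ibv_poll_cq(cq, 1, wc) == 1) {
        return wc->status == IBV_WC_SUCCESS;
    }
    ibv_cq * ev_cq  = nullptr;
    void   * ev_ctx = nullptr;
    if (ibv_get_cq_event(ch, &ev_cq, &ev_ctx) != 0) {  // blocks without spinning
        return false;
    }
    ibv_ack_cq_events(ev_cq, 1);
    return ibv_poll_cq(cq, 1, wc) == 1 && wc->status == IBV_WC_SUCCESS;
}

The tradeoff is extra wakeup latency per completion, which may matter for the small-message latencies this transport is optimizing for.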

@dvv101111
Contributor Author

I also want to make cross-platform Windows/Linux RPC RDMA work in the next version. Probably not a very popular setup, but OK for a home lab.
I already have a proof of concept.

@rgerganov rgerganov merged commit adb541a into ggml-org:master Apr 15, 2026
46 of 47 checks passed
@rgerganov
Member

This is the follow-up refactoring: #21998
It's still WIP but feedback is welcome

mengqin pushed a commit to mengqin/llama.cpp that referenced this pull request Apr 20, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Apr 21, 2026
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Apr 23, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
jimbothigpen pushed a commit to jimbothigpen/frankenturbo2 that referenced this pull request May 2, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026