
rpc : reuse compute graphs #15405

Merged: rgerganov merged 1 commit into ggml-org:master from rgerganov:rpc-reuse-graphs on Nov 28, 2025
Conversation

@rgerganov
Member

@rgerganov rgerganov commented Aug 18, 2025

Store compute graphs on the server side and reuse them when possible. Compute graphs are kept in a fixed-size ring buffer, so we can avoid serializing and deserializing the same graph every time. This adds two new commands:
* RPC_CMD_GRAPH_COMPUTE_AND_STORE -- compute the graph and store it
* RPC_CMD_GRAPH_RECOMPUTE -- recompute the graph with the given ID

Currently there is no good way to associate an ID with a ggml_cgraph, so we abuse tensor->extra of the first node for this purpose.
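The ring-buffer storage can be sketched roughly as follows. This is an illustrative sketch only: `graph_ring` and `stored_graph` are made-up names, not the PR's actual code, and the real server keeps deserialized graph state rather than raw bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a fixed-size ring buffer of stored graphs.
struct stored_graph {
    uint64_t id    = 0;     // ID a client would pass to RPC_CMD_GRAPH_RECOMPUTE
    bool     valid = false;
    std::vector<uint8_t> graph; // stand-in for the server-side graph state
};

template <size_t N>
struct graph_ring {
    std::array<stored_graph, N> slots{};
    size_t   next    = 0;
    uint64_t last_id = 0;

    // store a graph, overwriting the oldest slot when the ring is full
    uint64_t store(std::vector<uint8_t> g) {
        stored_graph & s = slots[next];
        s.id    = ++last_id;
        s.graph = std::move(g);
        s.valid = true;
        next    = (next + 1) % N;
        return s.id;
    }

    // look up a stored graph by ID; returns nullptr if it was evicted
    const stored_graph * find(uint64_t id) const {
        for (const auto & s : slots) {
            if (s.valid && s.id == id) {
                return &s;
            }
        }
        return nullptr;
    }
};
```

A client holding an evicted ID would fall back to the full compute-and-store path.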

@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Aug 18, 2025
@rgerganov
Member Author

The time spent serializing/deserializing the graph is quite small compared to the graph compute time. I measured a ~1% tg improvement on a 1 Gbps connection:

master

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.93 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.62 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.53 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.27 ± 0.01 |

PR

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.79 ± 0.28 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.76 ± 0.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.66 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.37 ± 0.02 |

Not sure if it's worth the complexity we are adding here ...

@jukofyork
Collaborator

When I'm running over RPC, I'm noticing the amount of data sent from the main host to the RPC servers is very asymmetric during token generation, eg:

  • Something like 200MB/s getting sent to each RPC server.
  • Something like 5MB/s getting sent from each RPC server.

Is this the compute graph getting sent (which I can only think must include all the sin/cos values for the positional encodings, or something similar, to cause this large a disparity?) and the hidden state(s) getting sent back?

@rgerganov
Member Author

You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed. From the debug logs you can see how much data is being sent and received. For example when using gemma-3-1b-it-q4_0.gguf with a single rpc-server, these are the commands which are used for generating one token:

...
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf17ea040, offset: 0, size: 4608
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 4
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2b840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2c840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2c840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2d840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2e840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf252e840, offset: 0, size: 4
[graph_compute] device: 0, n_nodes: 1048, n_tensors: 1449
[get_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 1048576
...

For set_tensor, the size shows how many bytes are received by the server.
For graph_compute, the received data in bytes is `n_nodes*8 + n_tensors*296`.
For get_tensor, the size shows how many bytes are sent by the server.
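As a quick check of the formula above (assuming the stated 8 bytes per node and 296 bytes per tensor), a one-liner computes the graph payload; `graph_payload_bytes` is just an illustrative name:

```cpp
#include <cassert>
#include <cstdint>

// Serialized graph payload: 8 bytes per node reference plus
// 296 bytes per serialized tensor descriptor.
static uint64_t graph_payload_bytes(uint64_t n_nodes, uint64_t n_tensors) {
    return n_nodes * 8 + n_tensors * 296;
}
```

For the gemma-3 log above (n_nodes = 1048, n_tensors = 1449) this gives 437288 bytes per token.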

@jukofyork
Collaborator

> You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed. From the debug logs you can see how much data is being sent and received. [...]
>
> For set_tensor you get how many bytes are received by the server. For graph_compute, the received data in bytes is `n_nodes*8 + n_tensors*296`. For get_tensor you get how many bytes are sent by the server.

Thanks! I'll give this a try and see if I can see what is getting sent.

@slaren
Member

slaren commented Oct 24, 2025

Running the client with GGML_SCHED_DEBUG=1 could also give you an idea of what tensors are being transferred to each server when evaluating a graph (these are the "inputs" of each split). Unfortunately, many tensors don't have very informative names at the moment, but the big one is likely to be the attention mask.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

This is what I'm getting for the 2 RPC servers (each with 2 GPUs):

[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 122880
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c603c000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e03c000, offset: 0, size: 122880

`1080*8 + 1400*296 = 423040` bytes

So nothing huge here for 1 token.


But this is what I'm seeing get moved over network whilst generating:

[screenshot: network traffic while generating]

and here are the last couple of tokens generated:

[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 61440
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 143360
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 143360
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 61440
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 143360
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 143360

Is there anything here that can explain the huge discrepancy between what the main host is sending to the 2 clients and what they are sending back?

The amount of data they are sending back looks to be in line with what you would expect for the hidden states.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

I think I've deciphered this now (it was confusing because I was also running speculative decoding!).

> For graph_compute, the received data in bytes is `n_nodes*8 + n_tensors*296`

[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400

So this is `1080*8 + 1400*296 = 423040` bytes.


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480

This is the hidden state stored as float32 (ie: 20480/4 = 5120 elements).


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072

This appears to be 2x the context length, so I assume it is something holding position encodings.


So overall we are sending 423040 + 131072 + 20480 = 574592 bytes and receiving 20480 bytes back here:

[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480

which is a ~28:1 ratio, in line with the measurements.
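The same arithmetic in code (integer division, so this yields the rounded-down ratio; `traffic_ratio` is just an illustrative helper, and the 131072-byte tensor is assumed to be the mask/position data discussed above):

```cpp
#include <cassert>
#include <cstdint>

// Rounded-down ratio of bytes sent to bytes received per token.
static uint64_t traffic_ratio(uint64_t bytes_sent, uint64_t bytes_recv) {
    return bytes_sent / bytes_recv;
}
```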

@rgerganov
Member Author

The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

@jukofyork
Collaborator

> The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

Thanks!

@rgerganov
Member Author

@jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.

Note that you need to rebuild both client and server parts.

@jukofyork
Collaborator

> @jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.
>
> Note that you need to rebuild both client and server parts.

Thanks! I'm away for a couple of days so will be Thursday/Friday before I can boot up the machines to test it.

@jukofyork
Collaborator

This makes quite a big difference for me:

master

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 318.55 ± 0.86 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 15.64 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 15.64 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 15.54 ± 0.01 |

PR

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 320.18 ± 0.99 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 19.47 ± 0.02 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 19.48 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 19.30 ± 0.01 |

(19.47 + 19.48 + 19.30) / (15.64 + 15.64 + 15.54) ≈ 1.244, i.e. a ~24.4% tg improvement.

@jukofyork
Collaborator

[screenshot: tokens/s measurements at different batch sizes]

This also changes the relative costs: batch_size = 1 and batch_size = 2 now both generate ~19.5 tokens/s (not really relevant for this PR, but it will make more sense when I get time to explain my new "profile guided speculative decoding" code that uses these values...).

@rgerganov rgerganov marked this pull request as ready for review October 28, 2025 17:19
@rgerganov
Member Author

@jukofyork this is great news. I didn't have a chance to test this patch with larger graphs, so thanks for confirming it makes a significant improvement.

@slaren Is there a better way to associate an ID with ggml_cgraph?

@jukofyork
Collaborator

jukofyork commented Oct 30, 2025

@rgerganov @slaren

I've been looking at the RPC code and wonder if there might be a way to simplify/unify all this and remove the RPC_CMD_SET_TENSOR_HASH command, etc.

It's pretty cheap to hash data using #16753 (and with a few hours' work I can probably decipher the 128-bit version and get an extra 1.5x performance for SSE2...), so what if we removed the existing tensor-specific hashing and instead did it in the send_rpc_cmd code:

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// No response
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    uint8_t cmd_byte = cmd;
    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }
    return true;
}

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// RPC response: | response_size (8 bytes) | response_data (response_size bytes) |
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size, void * output, size_t output_size) {
    if (!send_rpc_cmd(sock, cmd, input, input_size)) {
        return false;
    }
    // TODO: currently the output_size is always known, do we need support for commands with variable output size?
    // even if we do, we can skip sending output_size from the server for commands with known output size
    uint64_t out_size;
    if (!recv_data(sock->fd, &out_size, sizeof(out_size))) {
        return false;
    }
    if (out_size != output_size) {
        return false;
    }
    if (!recv_data(sock->fd, output, output_size)) {
        return false;
    }
    return true;
}

So:

The RPC servers would still have the (optional) non-volatile cache for large transfers, and the code above would then need to negotiate these transfers. Since these are the large tensors, the added latency would be insignificant, and this is essentially what the RPC_CMD_SET_TENSOR_HASH command is doing anyway; we just move the logic here. As with the current implementation, some host-side-only heuristic(s) would be applied (minimum size, etc.).

But we would also introduce a volatile LRU cache:

  • It could either cache all data sent/received, or use similar heuristic(s) as the non-volatile cache, but the heuristic(s) would need to be performed both host-side and server-side.
  • The host wouldn't actually need to store the data itself and its LRU cache would just need the hashes storing.
  • As the host knows the LRU cache size and ejection policy the RPC servers will be using, it should be able to keep track of exactly what they have stored in their LRU caches and there would be no additional latency nor any negotiation needed (ie: it would just be the same as above but with the hash in place of the actual data).
  • The LRU cache would need to be large enough that important data like the compute graph and position embeddings (if that is what the other tensor I see getting sent is) don't get ejected by the batches of hidden-state data during PP (or some additional heuristic could be applied to avoid storing hidden-state data at all...).
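The third bullet can be sketched with a deterministic LRU of content hashes (hypothetical code, not part of any PR; `hash_lru` is a made-up name): if host and server run the same policy with the same capacity, the host can predict server-side hits from hashes alone, with no negotiation round-trip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Deterministic LRU of content hashes; identical instances on host and
// server stay in lockstep because eviction depends only on the access order.
struct hash_lru {
    size_t capacity;
    std::list<uint64_t> order; // most-recently-used at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos;

    explicit hash_lru(size_t cap) : capacity(cap) {}

    // record an access; returns true if the hash was already cached (a hit)
    bool touch(uint64_t h) {
        auto it = pos.find(h);
        if (it != pos.end()) {
            order.splice(order.begin(), order, it->second); // move to front
            return true;
        }
        if (order.size() == capacity) { // evict the least-recently-used hash
            pos.erase(order.back());
            order.pop_back();
        }
        order.push_front(h);
        pos[h] = order.begin();
        return false;
    }
};
```

The host would call touch() with the payload hash before sending; on a predicted hit it sends only the hash in place of the data.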

This seems to have lots of benefits.

The only thing I'm not so sure about is whether we are ever sending any pointers (or non-static node indices, etc.) over the RPC channel. If so, that will make things much harder and probably a lot more work (but not impossible, eg: using the ideas from https://theboostcpplibraries.com/boost.serialization-pointers-and-references).

@jukofyork
Collaborator

I made a lot of ninja edits just now so if reading from the email notification, please see the comment instead!

@rgerganov
Member Author

I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?

Also note that my implementation reuses compute graphs without serializing them first, so we don't waste CPU time on serialization and hashing.

@slaren
Member

slaren commented Oct 30, 2025

> @slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it has changed, similar to what the CUDA backend does to implement CUDA graphs. In the future, with the graph plan API, this may be unnecessary.

    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }

With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

@jukofyork
Collaborator

> With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

@jukofyork
Collaborator

> I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?
>
> Also note that my implementation reuses compute graphs without serializing them first, so we don't waste CPU time on serialization and hashing.

I've got to go out, but I will try to give a more detailed example of what I mean when I get back. The basic idea is just to extract all the "communication channel" stuff (requiring tunable heuristics, etc.) away from the clean RPC interface. send_rpc_cmd is just the point where the heuristics have the most helpful data to make their decisions, but it could equivalently be done in the lower-level functions (although I think it would be harder to design the heuristics without the actual command name, etc.).

eg: the simple >10MB hashing heuristic is already failing for me when sending huge batches; I've tried thresholds up to 100MB but am still getting some batches hashed and then saved.

@jukofyork
Collaborator

> With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.
>
> I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all:

5ce87d1

static bool send_msg(sockfd_t sockfd, const void * msg, size_t msg_size) {
    const size_t header_size = sizeof(msg_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + msg_size);

    // header
    memcpy(buf.data(), &msg_size, sizeof(msg_size));

    // payload
    if (msg_size > 0) {
        memcpy(buf.data() + header_size, msg, msg_size);
    }

    // single send
    return send_data(sockfd, buf.data(), buf.size());
}
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    const size_t header_size = 1 + sizeof(input_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + input_size);

    // header
    buf[0] = static_cast<uint8_t>(cmd);
    memcpy(buf.data() + 1, &input_size, sizeof(input_size));

    // payload
    if (input_size > 0) {
        memcpy(buf.data() + header_size, input, input_size);
    }

    // single send (send_data may still chunk very large buffers, which is fine)
    return send_data(sock->fd, buf.data(), buf.size());
}

@jukofyork
Collaborator

> With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.
>
> I tried this a few days ago but it didn't make any difference for me. [...]
>
> Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all: 5ce87d1 [code quoted above]

Actually this now seems to be giving me quite a boost! With this PR and that branch merged, I'm now getting this:

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 354.63 ± 1.07 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 22.70 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 22.41 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 22.07 ± 0.05 |

compared to this the other day:

> This makes quite a big difference for me: on master I got tg128/tg256/tg512 of 15.64/15.64/15.54 t/s, while with the PR I got 19.47/19.48/19.30 t/s (~+24% tg) -- full tables in my earlier comment.

It may be something else that has changed, so I will check next week to be sure, but it seems odd that I got around 15 tokens/s when I tested this branch a couple of days ago.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I made a draft PR #16892 to see if this makes any difference to others, or if it only helps with this specific PR, etc.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I also had a go at writing a (very!) hacky proof of concept for the volatile cache last night:

// Only try the volatile cache when the data size is within this range
const size_t MIN_CACHE_THRESHOLD = 20 * 1024;
const size_t MAX_CACHE_THRESHOLD = 1024 * 1024;

static bool send_data(sockfd_t sockfd, const void * data, size_t size) {
    static std::unordered_set<uint64_t> sent_hashes;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint64_t hash = generate_hash((const uint8_t*)data, size);
        bool is_new = sent_hashes.find(hash) == sent_hashes.end();

        uint8_t flag = is_new ? 1 : 0;
        if (send(sockfd, (const char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (send(sockfd, (const char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (!is_new) {
            return true;
        }
        sent_hashes.insert(hash);
    }

    size_t bytes_sent = 0;
    while (bytes_sent < size) {
        size_t size_to_send = std::min(size - bytes_sent, MAX_CHUNK_SIZE);
        ssize_t n = send(sockfd, (const char *)data + bytes_sent, size_to_send, 0);
        if (n < 0) {
            GGML_LOG_ERROR("send failed (bytes_sent=%zu, size_to_send=%zu)\n",
                           bytes_sent, size_to_send);
            return false;
        }
        bytes_sent += (size_t)n;
    }
    return true;
}

static bool recv_data(sockfd_t sockfd, void * data, size_t size) {
    static std::unordered_map<uint64_t, std::vector<uint8_t>> recv_cache;

    uint64_t hash = 0;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint8_t flag;
        if (recv(sockfd, (char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (recv(sockfd, (char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (flag == 0) {
            auto it = recv_cache.find(hash);
            if (it != recv_cache.end()) {
                memcpy(data, it->second.data(), size);
                return true;
            }
            return false;
        }
    }

    size_t bytes_recv = 0;
    while (bytes_recv < size) {
        size_t size_to_recv = std::min(size - bytes_recv, MAX_CHUNK_SIZE);
        ssize_t n = recv(sockfd, (char *)data + bytes_recv, size_to_recv, 0);
        if (n < 0) {
            GGML_LOG_ERROR("recv failed (bytes_recv=%zu, size_to_recv=%zu)\n",
                           bytes_recv, size_to_recv);
            return false;
        }
        if (n == 0) {
            LOG_DBG("recv returned 0 (peer closed?)\n");
            return false;
        }
        bytes_recv += (size_t)n;
    }

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        recv_cache[hash] = std::vector<uint8_t>((uint8_t*)data, (uint8_t*)data + size);
    }

    return true;
}
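generate_hash() is not shown above; any fast non-cryptographic hash would do. Purely as a hypothetical stand-in (not necessarily the hash actually used), a minimal 64-bit FNV-1a could look like:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the generate_hash() used above: 64-bit FNV-1a.
// Any fast non-cryptographic hash (e.g. xxhash64) would slot in the same way.
static uint64_t generate_hash(const uint8_t * data, size_t size) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < size; ++i) {
        h ^= data[i];                    // xor the next byte in
        h *= 0x100000001b3ULL;           // multiply by the 64-bit FNV prime
    }
    return h;
}
```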

It did give a small improvement (from 15.5 tokens/s to 17.5 tokens/s), but either the hashing still has a lot of overhead, or the repeated serialisation of the compute graph, as mentioned by @rgerganov, is very costly; with this PR I get 19.5 tokens/s in comparison! ☹️

I then destroyed the branch trying to hack in the code to join the packets after the push linked above, so will have to take another look at this next week and likely start fresh.

I did find that the other tensors that were exactly 2x the context length (which I suspected had something to do with the position embeddings) are the same each time and could also be cached (this finally produced a balanced flow over the network of about 4-5 MB/s, which is purely the hidden-state data).

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

@slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it is changed, similar to what the CUDA backend does to implement CUDA graphs. In the future with the graph plan API this may be unnecessary.

It's a pity, as this PR does seem to help a lot (though maybe mainly for large MoE models?).

From my hacky experiments last night, it definitely looks like @rgerganov is correct that the serialisation+hashing (even using xxhash64) is going to be worse than this PR (my only hope was that it was the multiple packets, but after reading @rgerganov's post on the other PR, I'm not hopeful).

By "visiting every node", does that mean the data of every node needs comparing (as in the full ~450 kB of data in my example), or is it just a set of "node IDs" or similar that needs comparing?
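For reference, a node-by-node check along the lines slaren describes would compare each node's metadata (op, shape, source tensors) rather than the tensor contents themselves. A rough sketch with a hypothetical, minimal node record (the real ggml_tensor has many more fields a proper check would also have to compare):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal node record; ggml_tensor has many more fields
// (strides, op params, views, ...) that a real check would also compare.
struct node_key {
    int32_t      op;      // operation id
    int64_t      ne[4];   // tensor shape
    const void * src[2];  // identities of the source tensors
};

static bool node_equal(const node_key & a, const node_key & b) {
    if (a.op != b.op) return false;
    for (int d = 0; d < 4; ++d) {
        if (a.ne[d] != b.ne[d]) return false;
    }
    return a.src[0] == b.src[0] && a.src[1] == b.src[1];
}

// The graph is reusable only if every node matches the previous run.
static bool graph_unchanged(const node_key * prev, const node_key * cur, int n_nodes) {
    for (int i = 0; i < n_nodes; ++i) {
        if (!node_equal(prev[i], cur[i])) return false;
    }
    return true;
}
```

If this sketch is representative, it is the node metadata rather than the full tensor contents that gets compared, although it still means walking every node on every compute call.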

@kyuz0

kyuz0 commented Nov 10, 2025

On my setup (two AMD Strix Halo hosts), this PR has improved prompt processing by ~5.5%:

  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |        175.56 ± 1.40 |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |         17.50 ± 0.08 |

Right now I only have a screenshot of the performance without this PR:


Happy to do more testing. My setup currently uses USB-to-USB at 10 Gbps, and I'm trying to see if I can get to 20-40 Gbps.

@kyuz0

kyuz0 commented Nov 10, 2025

Improvement was there but minimal for GLM 4.6:

llama-bench -mmp 0 -fa 1 -m GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf --rpc "192.168.2.2:50052"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |         34.18 ± 0.05 |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |          8.81 ± 0.01 |

@rgerganov
Member Author

@kyuz0 thanks for taking the time to test this on your setup

I will try to rework this with the approach suggested by @slaren by using the CUDA backend as reference

@ggerganov
Member

I wonder if you can simply reuse this approach that I prototyped some time ago for the Metal backend to determine when a graph is the same as the previous one:

https://github.com/ggml-org/llama.cpp/pull/14570/files#diff-4045ccc2edf36af53c4a28fca4a6a24f628b3c9bee7ccb43b06e55dc5e5e199eR5482-R5514

At the time, I thought it was hacky, but I don't actually see a failure case - it simply does a memcmp of the entire graph data at the byte level.

It didn't end up being used because it didn't lead to any improvements for Metal (#14570). But here it could be a reasonable stopgap until the graph plan API lands.
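That memcmp stopgap can be sketched like this (a hypothetical helper, not the actual Metal or RPC code): serialize the graph into a flat byte buffer, keep the previous buffer around, and reuse the cached graph whenever the bytes are identical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of the byte-level comparison: cache the previous
// serialized graph and report a hit when the new one is byte-identical.
struct graph_cache {
    std::vector<uint8_t> last;  // serialized bytes of the previous graph

    // Returns true if `serialized` matches the cached graph (safe to reuse);
    // otherwise stores it as the new cached graph and returns false.
    bool try_reuse(const std::vector<uint8_t> & serialized) {
        if (!last.empty() && serialized.size() == last.size() &&
            memcmp(serialized.data(), last.data(), last.size()) == 0) {
            return true;
        }
        last = serialized;
        return false;
    }
};
```

The appeal is that it needs no per-node bookkeeping: any change anywhere in the graph changes the bytes and forces a full resend.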

@rgerganov
Member Author

I have reimplemented the graph caching using the approach suggested by @ggerganov. Both client and server maintain a ring buffer of cached compute graphs and reuse them when possible. The newly added RPC commands for computing graphs do not return a result -- if graph_compute() returns something different from GGML_STATUS_SUCCESS, the server simply closes the connection. This saves a network round trip and should bring a noticeable TG speed improvement.

@kyuz0 @jukofyork Could you please test this PR on your testbeds and report the results here?
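The ring buffer described above might look roughly like this (a sketch under assumptions, not the actual ggml-rpc implementation): keep N serialized graphs, reuse a slot id on a content match, and evict the oldest entry otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size ring buffer of serialized compute graphs.
// On a hit the caller can send a short "recompute graph <id>" command
// instead of re-serializing and re-sending the whole graph.
template <size_t N>
struct graph_ring {
    std::array<std::vector<uint8_t>, N> slots;
    size_t next = 0;  // next slot to overwrite (oldest entry)

    // Returns the slot id; sets `hit` if the graph was already cached.
    size_t put(const std::vector<uint8_t> & g, bool & hit) {
        for (size_t i = 0; i < N; ++i) {
            if (!slots[i].empty() && slots[i].size() == g.size() &&
                memcmp(slots[i].data(), g.data(), g.size()) == 0) {
                hit = true;
                return i;
            }
        }
        hit = false;
        size_t id = next;
        slots[id] = g;          // evict whatever was in the oldest slot
        next = (next + 1) % N;
        return id;
    }
};
```

As long as client and server evict in the same order, the slot id alone identifies the graph on both sides.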

@kyuz0

kyuz0 commented Nov 25, 2025

Thanks @rgerganov . I did some initial testing, and now I can't see a difference in performance on MiniMax with or without the PR. I will run a proper test bench in a bit to confirm, there might be an issue with the testing I did.

@rgerganov
Member Author

I am seeing 13% TG improvement with gpt-oss-20b when running on localhost:

master

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | pp512 | 9332.25 ± 120.10 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | tg128 | 251.03 ± 1.38 |

PR

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | pp512 | 9473.22 ± 108.71 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | tg128 | 283.68 ± 0.83 |

@eugr

eugr commented Nov 25, 2025

I'm seeing a bit of improvement with dual DGX Sparks connected via their 200G port on Qwen3-VL-235B (and even GPT-OSS-120B), but for some reason not on MiniMax M2.

Qwen3-235B

Before

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 | 528.52 ± 2.88 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 | 12.98 ± 0.05 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d4096 | 469.70 ± 5.47 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d4096 | 11.62 ± 0.08 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d8192 | 420.87 ± 8.01 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d8192 | 11.15 ± 0.08 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d16384 | 340.40 ± 8.40 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d16384 | 9.90 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d32768 | 226.70 ± 35.69 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d32768 | 8.03 ± 0.04 |

After

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 | 551.75 ± 1.11 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 | 13.78 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d4096 | 491.34 ± 22.40 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d4096 | 12.47 ± 0.09 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d8192 | 454.67 ± 1.51 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d8192 | 12.15 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d16384 | 377.73 ± 0.39 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d16384 | 10.91 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d32768 | 279.02 ± 0.82 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d32768 | 8.99 ± 0.07 |

MiniMax M2

Before

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 | 906.42 ± 1.27 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 | 25.32 ± 0.28 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d4096 | 822.09 ± 4.14 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d4096 | 21.47 ± 0.16 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d8192 | 736.49 ± 6.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d8192 | 19.03 ± 0.12 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d16384 | 615.61 ± 5.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d16384 | 15.49 ± 0.22 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d32768 | 460.02 ± 5.47 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d32768 | 11.14 ± 0.07 |

After

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 | 901.43 ± 2.48 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 | 25.90 ± 0.08 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d4096 | 810.92 ± 1.31 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d4096 | 23.10 ± 0.27 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d8192 | 731.97 ± 3.02 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d8192 | 19.81 ± 0.14 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d16384 | 606.83 ± 1.97 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d16384 | 16.08 ± 0.08 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d32768 | 456.50 ± 1.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d32768 | 11.71 ± 0.07 |

@kyuz0

kyuz0 commented Nov 26, 2025

So, I think my current issue is that the baseline, without the PR, has improved to match the performance of the PR:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |        173.79 ± 0.42 |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |         17.10 ± 0.03 |

I will look at the pipelines that I use to build my containers, but I am building the container WITHOUT the PR using this Dockerfile, with the env APPLY_LLAMA_PR_15405 set to 0:

https://github.com/kyuz0/amd-strix-halo-toolboxes/blob/main/toolboxes/Dockerfile.rocm-6.4.4-rocwmma

So I cannot see a way that my container inadvertently includes the PR when APPLY_LLAMA_PR_15405 is set to 0.

@Djip007
Contributor

Djip007 commented Nov 26, 2025

> https://github.com/kyuz0/amd-strix-halo-toolboxes/blob/main/toolboxes/Dockerfile.rocm-6.4.4-rocwmma

On Fedora 43, ROCm 6.4.4 is now upstream, so there is no need to use the "el9" packages.

In my case I create a toolbox (I'm on Silverblue ;)):

toolbox create -d fedora -r 43 rocm-devel-fc43
toolbox enter rocm-devel-fc43

# add rocm+dep:
 sudo dnf upgrade --refresh
 sudo dnf install hipblas-devel rocm-hip-devel rocblas-devel
 sudo dnf install cmake gcc-c++
 sudo dnf install rocm-smi rocminfo amd-smi
 sudo dnf install libcurl-devel

# build with
 cmake -S . -B build/rocm6.4.4 -DGGML_HIP=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
 cmake --build build/rocm6.4.4 --config Release -- -j 16

# ...

@kyuz0

kyuz0 commented Nov 26, 2025

> https://github.com/kyuz0/amd-strix-halo-toolboxes/blob/main/toolboxes/Dockerfile.rocm-6.4.4-rocwmma
>
> On Fedora 43, ROCm 6.4.4 is now upstream, so there is no need to use the "el9" packages.
>
> In my case I create a toolbox (I'm on Silverblue ;)):
>
> toolbox create -d fedora -r 43 rocm-devel-fc43
> toolbox enter rocm-devel-fc43
>
> # add rocm+dep:
>  sudo dnf upgrade --refresh
>  sudo dnf install hipblas-devel rocm-hip-devel rocblas-devel
>  sudo dnf install cmake gcc-c++
>  sudo dnf install rocm-smi rocminfo amd-smi
>  sudo dnf install libcurl-devel
>
> # build with
>  cmake -S . -B build/rocm6.4.4 -DGGML_HIP=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
>  cmake --build build/rocm6.4.4 --config Release -- -j 16
>
> # ...

Thank you. Regardless of where ROCm comes from, I am still not sure why I am getting the same performance with and without the PR; and again, even without the PR, performance now seems better.

I suspect user error, so I might just create a toolbox manually instead of my pipelines.

@Djip007
Contributor

Djip007 commented Nov 26, 2025

It is sometimes hard to bench: with the Framework Desktop in Performance mode, the power limit is:

  • 160 W for a few seconds
  • 140 W for 10 min
  • 120 W after that
    It is sometimes hard to know whether it is at 140 W or has already dropped to 120 W... waiting 10 min idle is "long" ;)
    I can see the CPU speed for llama.cpp CPU benches, but I'm not sure how to get the GPU speed for now. (edit: OK with radeontop ;) )

Note: in most cases I test the fa-wmma slowdown with high pp (>2000...)

Note: did you connect the two MAX machines with an RJ45 network, or USB4?

@rgerganov
Member Author

@ggerganov I changed the implementation to cache only the last computed graph for each device and reuse it when possible. It is much simpler and cleaner this way. Also removed the response from RPC_CMD_GRAPH_COMPUTE to save a network round trip. If computing the graph is not successful, the server closes the connection.

@ggerganov
Member

The CI needs a fix before merging: https://github.com/ggml-org/llama.cpp/actions/runs/19734283728/job/56542411768?pr=15405#step:4:1857

Store the last computed graph and reuse it when possible.
Also do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
@rgerganov rgerganov merged commit 15d2b46 into ggml-org:master Nov 28, 2025
72 of 74 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* RPC support multiple devices

* rpc : update documentation (ggml-org#16441)

Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

# Conflicts:
#	examples/rpc/README.md

* Remove memory settings

* rpc : cache and reuse compute graphs (ggml-org#15405)

Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.

* Add -cpu to include cpu backend

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

Labels

ggml changes relating to the ggml tensor library for machine learning


7 participants