
rpc : reuse compute graphs #15405

Merged: rgerganov merged 1 commit into ggml-org:master from rgerganov:rpc-reuse-graphs on Nov 28, 2025
Conversation

@rgerganov
Member

@rgerganov rgerganov commented Aug 18, 2025

Store compute graphs on the server side and reuse them when possible. Compute graphs are kept in a fixed-size ring buffer, so we can avoid serializing and deserializing the same graph every time. This adds two new commands:
* RPC_CMD_GRAPH_COMPUTE_AND_STORE -- compute the graph and store it
* RPC_CMD_GRAPH_RECOMPUTE -- recompute the graph with the given ID

Currently there is no good way to associate an ID with a ggml_cgraph, so we abuse tensor->extra of the first node for this purpose.
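The ring-buffer storage can be sketched roughly as follows. This is an illustrative sketch only: `graph_ring` and `stored_graph` are made-up names, not the PR's actual code, and the real server keeps deserialized graph state rather than raw bytes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of a fixed-size ring buffer of stored graphs.
struct stored_graph {
    uint64_t id    = 0;     // ID a client would pass to RPC_CMD_GRAPH_RECOMPUTE
    bool     valid = false;
    std::vector<uint8_t> graph; // stand-in for the server-side graph state
};

template <size_t N>
struct graph_ring {
    std::array<stored_graph, N> slots{};
    size_t   next    = 0;
    uint64_t last_id = 0;

    // store a graph, overwriting the oldest slot when the ring is full
    uint64_t store(std::vector<uint8_t> g) {
        stored_graph & s = slots[next];
        s.id    = ++last_id;
        s.graph = std::move(g);
        s.valid = true;
        next    = (next + 1) % N;
        return s.id;
    }

    // look up a stored graph by ID; returns nullptr if it was evicted
    const stored_graph * find(uint64_t id) const {
        for (const auto & s : slots) {
            if (s.valid && s.id == id) {
                return &s;
            }
        }
        return nullptr;
    }
};
```

A client holding an evicted ID would fall back to the full compute-and-store path.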

@github-actions github-actions Bot added the ggml changes relating to the ggml tensor library for machine learning label Aug 18, 2025
@rgerganov
Member Author

The time spent serializing/deserializing the graph is quite small compared to the graph compute time. I measured a ~1% tg improvement on a 1 Gbps connection:

master

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.93 ± 0.34 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.62 ± 0.07 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.53 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.27 ± 0.01 |

PR

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | pp512 | 116.79 ± 0.28 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg128 | 13.76 ± 0.06 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg256 | 13.66 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA,RPC | 99 | tg512 | 13.37 ± 0.02 |

Not sure if it's worth the complexity we are adding here ...

@jukofyork
Collaborator

When I'm running over RPC, I'm noticing the amount of data sent from the main host to the RPC servers is very asymmetric during token generation, eg:

  • Something like 200MB/s getting sent to each RPC server.
  • Something like 5MB/s getting sent from each RPC server.

Is this the compute graph getting sent (which I can only think must include all the sin/cos values for the positional encodings, or something similar, to cause this large a disparity?) and the hidden state(s) getting sent back?

@rgerganov
Member Author

You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed. From the debug logs you can see how much data is being sent and received. For example when using gemma-3-1b-it-q4_0.gguf with a single rpc-server, these are the commands which are used for generating one token:

...
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf17ea040, offset: 0, size: 4608
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 4
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2b840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2c840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2c840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2d840, offset: 0, size: 8
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1d2e840, offset: 0, size: 65536
[set_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf252e840, offset: 0, size: 4
[graph_compute] device: 0, n_nodes: 1048, n_tensors: 1449
[get_tensor] buffer: 0x5b6613a44f00, data: 0x7bebf1a2a040, offset: 0, size: 1048576
...

For set_tensor, the size shows how many bytes are received by the server.
For graph_compute, the received data in bytes is `n_nodes*8 + n_tensors*296`.
For get_tensor, the size shows how many bytes are sent by the server.
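As a quick check of the formula above (assuming the stated 8 bytes per node and 296 bytes per tensor), a one-liner computes the graph payload; `graph_payload_bytes` is just an illustrative name:

```cpp
#include <cassert>
#include <cstdint>

// Serialized graph payload: 8 bytes per node reference plus
// 296 bytes per serialized tensor descriptor.
static uint64_t graph_payload_bytes(uint64_t n_nodes, uint64_t n_tensors) {
    return n_nodes * 8 + n_tensors * 296;
}
```

For the gemma-3 log above (n_nodes = 1048, n_tensors = 1449) this gives 437288 bytes per token.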

@jukofyork
Collaborator

> You can set GGML_RPC_DEBUG=1 and then rpc-server will print the commands which are being executed. From the debug logs you can see how much data is being sent and received. [...]
>
> For set_tensor you get how many bytes are received by the server. For graph_compute, the received data in bytes is `n_nodes*8 + n_tensors*296`. For get_tensor you get how many bytes are sent by the server.

Thanks! I'll give this a try and see if I can see what is getting sent.

@slaren
Member

slaren commented Oct 24, 2025

Running the client with GGML_SCHED_DEBUG=1 could also give you an idea of what tensors are being transferred to each server when evaluating a graph (these are the "inputs" of each split). Unfortunately, many tensors don't have very informative names at the moment, but the big one is likely to be the attention mask.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

This is what I'm getting for the 2 RPC servers (each with 2 GPUs):

[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 122880
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430201e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e000, offset: 0, size: 24
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e080, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e100, offset: 0, size: 48
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c601e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c603c000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 122880
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a01e180, offset: 0, size: 32768
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e000, offset: 0, size: 24
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e080, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e100, offset: 0, size: 48
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e01e180, offset: 0, size: 32768
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e03c000, offset: 0, size: 122880

`1080*8 + 1400*296 = 423040` bytes

So nothing huge here for 1 token.


But this is what I'm seeing get moved over network whilst generating:

[screenshot: network traffic while generating]

and here are the last couple of tokens generated:

[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 61440
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 4
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 8
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 20480
[set_tensor] buffer: 0x562f5d724340, data: 0x7f4302000000, offset: 0, size: 143360
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d724340, data: 0x7f430205f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x562f5d724340, dst->buffer: 0x562f5d162200
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f000, offset: 0, size: 28
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f080, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f180, offset: 0, size: 56
[set_tensor] buffer: 0x562f5d162200, data: 0x7f42c605f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x562f5d162200, data: 0x7f42c60be000, offset: 0, size: 143360
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 61440
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 4
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 8
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 143360
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072
[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400
[copy_tensor] src->buffer: 0x561d64889540, dst->buffer: 0x561d642c72c0
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f000, offset: 0, size: 28
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f080, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f180, offset: 0, size: 56
[set_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e05f280, offset: 0, size: 131072
[graph_compute] device: 1, n_nodes: 1080, n_tensors: 1400
[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 143360

Is there anything here that can explain the huge discrepancy between what the main host is sending to the 2 clients and what they are sending back?

The amount of data they are sending back looks to be in line with what you would expect for the hidden states.

@jukofyork
Collaborator

jukofyork commented Oct 24, 2025

I think I've deciphered this now (it was confusing because I was also running speculative decoding!).

> For graph_compute, the received data in bytes is `n_nodes*8 + n_tensors*296`

[graph_compute] device: 0, n_nodes: 1080, n_tensors: 1400

So this is `1080*8 + 1400*296 = 423040` bytes.


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a000000, offset: 0, size: 20480

This is the hidden state stored as float32 (ie: 20480/4 = 5120 elements).


[set_tensor] buffer: 0x561d64889540, data: 0x7f7c8a05f280, offset: 0, size: 131072

This appears to be 2x the context length, so I assume it is something holding position encodings.


So overall we are sending 423040 + 131072 + 20480 = 574592 bytes and receiving 20480 bytes back here:

[get_tensor] buffer: 0x561d642c72c0, data: 0x7f7c4e0be000, offset: 0, size: 20480

which is a ~28:1 ratio, in line with the measurements.
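The same arithmetic in code (integer division, so this yields the rounded-down ratio; `traffic_ratio` is just an illustrative helper, and the 131072-byte tensor is assumed to be the mask/position data discussed above):

```cpp
#include <cassert>
#include <cstdint>

// Rounded-down ratio of bytes sent to bytes received per token.
static uint64_t traffic_ratio(uint64_t bytes_sent, uint64_t bytes_recv) {
    return bytes_sent / bytes_recv;
}
```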

@rgerganov
Member Author

The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

@jukofyork
Collaborator

> The size of the compute graph is substantial in your case (423040 bytes), so reusing the graph (what this PR is about) could bring a noticeable improvement for you. I will rebase this PR so you can test it in your environment.

Thanks!

@rgerganov
Member Author

@jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.

Note that you need to rebuild both client and server parts.

@jukofyork
Collaborator

> @jukofyork I have rebased the PR and I would appreciate it if you could test it in your environment. With this patch the compute graph is stored on the server side and there is no need to send it for every token.
>
> Note that you need to rebuild both client and server parts.

Thanks! I'm away for a couple of days so will be Thursday/Friday before I can boot up the machines to test it.

@jukofyork
Collaborator

This makes quite a big difference for me:

master

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 318.55 ± 0.86 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 15.64 ± 0.00 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 15.64 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 15.54 ± 0.01 |

PR

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 320.18 ± 0.99 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 19.47 ± 0.02 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 19.48 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 231.00 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 19.30 ± 0.01 |

(19.47 + 19.48 + 19.30) / (15.64 + 15.64 + 15.54) ≈ 1.244, i.e. a ~24.4% tg improvement.

@jukofyork
Collaborator

[screenshot: tokens/s measurements at different batch sizes]

This also changes the relative costs: batch_size = 1 and batch_size = 2 now both generate ~19.5 tokens/s (not really relevant for this PR, but it will make more sense when I get time to explain my new "profile guided speculative decoding" code that uses these values...).

@rgerganov rgerganov marked this pull request as ready for review October 28, 2025 17:19
@rgerganov
Member Author

@jukofyork this is great news. I didn't have a chance to test this patch with larger graphs, so thanks for confirming it makes a significant improvement.

@slaren Is there a better way to associate an ID with ggml_cgraph?

@jukofyork
Collaborator

jukofyork commented Oct 30, 2025

@rgerganov @slaren

I've been looking at the RPC code and wonder if there might be a way to simplify/unify all this and remove the RPC_CMD_SET_TENSOR_HASH command, etc.

It's pretty cheap to hash data using #16753 (and with a few hours' work I can probably decipher the 128-bit version and get an extra 1.5x performance for SSE2...), so what if we removed the existing tensor-specific hashing and instead did it in the send_rpc_cmd code:

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// No response
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    uint8_t cmd_byte = cmd;
    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }
    return true;
}

// RPC request : | rpc_cmd (1 byte) | request_size (8 bytes) | request_data (request_size bytes) |
// RPC response: | response_size (8 bytes) | response_data (response_size bytes) |
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size, void * output, size_t output_size) {
    if (!send_rpc_cmd(sock, cmd, input, input_size)) {
        return false;
    }
    // TODO: currently the output_size is always known, do we need support for commands with variable output size?
    // even if we do, we can skip sending output_size from the server for commands with known output size
    uint64_t out_size;
    if (!recv_data(sock->fd, &out_size, sizeof(out_size))) {
        return false;
    }
    if (out_size != output_size) {
        return false;
    }
    if (!recv_data(sock->fd, output, output_size)) {
        return false;
    }
    return true;
}

So:

The RPC servers would still have the (optional) non-volatile cache for large transfers, and the code above would then need to negotiate these transfers. Since these are the large tensors, the added latency would be insignificant, and this is essentially what the RPC_CMD_SET_TENSOR_HASH command is doing anyway; we just move the logic here. As with the current implementation, some host-side-only heuristic(s) would be applied (minimum size, etc.).

But we would also introduce a volatile LRU cache:

  • It could either cache all data sent/received, or use similar heuristic(s) as the non-volatile cache, but the heuristic(s) would need to be performed both host-side and server-side.
  • The host wouldn't actually need to store the data itself and its LRU cache would just need the hashes storing.
  • As the host knows the LRU cache size and ejection policy the RPC servers will be using, it should be able to keep track of exactly what they have stored in their LRU caches and there would be no additional latency nor any negotiation needed (ie: it would just be the same as above but with the hash in place of the actual data).
  • The LRU cache would need to be large enough that important data like the compute graph and position embeddings (if that is what the other tensor I see getting sent is) don't get ejected by the batches of hidden-state data during PP (or some additional heuristic could be applied to avoid storing hidden-state data at all...).
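The third bullet can be sketched with a deterministic LRU of content hashes (hypothetical code, not part of any PR; `hash_lru` is a made-up name): if host and server run the same policy with the same capacity, the host can predict server-side hits from hashes alone, with no negotiation round-trip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Deterministic LRU of content hashes; identical instances on host and
// server stay in lockstep because eviction depends only on the access order.
struct hash_lru {
    size_t capacity;
    std::list<uint64_t> order; // most-recently-used at the front
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> pos;

    explicit hash_lru(size_t cap) : capacity(cap) {}

    // record an access; returns true if the hash was already cached (a hit)
    bool touch(uint64_t h) {
        auto it = pos.find(h);
        if (it != pos.end()) {
            order.splice(order.begin(), order, it->second); // move to front
            return true;
        }
        if (order.size() == capacity) { // evict the least-recently-used hash
            pos.erase(order.back());
            order.pop_back();
        }
        order.push_front(h);
        pos[h] = order.begin();
        return false;
    }
};
```

The host would call touch() with the payload hash before sending; on a predicted hit it sends only the hash in place of the data.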

This seems to have lots of benefits.

The only thing I'm not so sure about is whether we are ever sending any pointers (or non-static node indices, etc.) over the RPC channel. If so, that will make things much harder and probably a lot more work (but not impossible, eg: using the ideas from https://theboostcpplibraries.com/boost.serialization-pointers-and-references).

@jukofyork
Collaborator

I made a lot of ninja edits just now so if reading from the email notification, please see the comment instead!

@rgerganov
Member Author

I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?

Also note that my implementation reuses compute graphs without serializing them first, so we don't waste CPU time on serialization and hashing.

@slaren
Member

slaren commented Oct 30, 2025

> @slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it has changed, similar to what the CUDA backend does to implement CUDA graphs. In the future, with the graph plan API, this may be unnecessary.

    if (!send_data(sock->fd, &cmd_byte, sizeof(cmd_byte))) {
        return false;
    }
    if (!send_data(sock->fd, &input_size, sizeof(input_size))) {
        return false;
    }
    if (!send_data(sock->fd, input, input_size)) {
        return false;
    }

With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

@jukofyork
Collaborator

> With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.

I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

@jukofyork
Collaborator

> I am not sure I understand your proposal. As of now, there are only two commands that benefit from caching -- SET_TENSOR and GRAPH_COMPUTE. Why do we need to modify send_rpc_cmd(), which is used for all RPC commands, and how does this make anything simpler?
>
> Also note that my implementation reuses compute graphs without serializing them first, so we don't waste CPU time on serialization and hashing.

I've got to go out, but I will try to give a more detailed example of what I mean when I get back. The basic idea is just to extract all the "communication channel" stuff (requiring tunable heuristics, etc.) away from the clean RPC interface. send_rpc_cmd is just the point where the heuristics have the most helpful data to make their decisions, but it could equivalently be done in the lower-level functions (although I think it would be harder to design the heuristics without the actual command name, etc.).

eg: the simple >10MB hashing heuristic is already failing for me when sending huge batches; I've tried thresholds up to 100MB but am still getting some batches hashed and then saved.

@jukofyork
Collaborator

> With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.
>
> I tried this a few days ago but it didn't make any difference for me. Not sure if it's the extra memcpy required to place the 9 bytes before the data, or if it's just that my network stack is still buffering small packets behind the scenes with TCP_NODELAY turned off.

Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all:

5ce87d1

static bool send_msg(sockfd_t sockfd, const void * msg, size_t msg_size) {
    const size_t header_size = sizeof(msg_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + msg_size);

    // header
    memcpy(buf.data(), &msg_size, sizeof(msg_size));

    // payload
    if (msg_size > 0) {
        memcpy(buf.data() + header_size, msg, msg_size);
    }

    // single send
    return send_data(sockfd, buf.data(), buf.size());
}
static bool send_rpc_cmd(const std::shared_ptr<socket_t> & sock, enum rpc_cmd cmd, const void * input, size_t input_size) {
    const size_t header_size = 1 + sizeof(input_size);
    std::vector<uint8_t> buf;
    buf.resize(header_size + input_size);

    // header
    buf[0] = static_cast<uint8_t>(cmd);
    memcpy(buf.data() + 1, &input_size, sizeof(input_size));

    // payload
    if (input_size > 0) {
        memcpy(buf.data() + header_size, input, input_size);
    }

    // single send (send_data may still chunk very large buffers, which is fine)
    return send_data(sock->fd, buf.data(), buf.size());
}

@jukofyork
Collaborator

> With TCP_NODELAY, each of these calls may be generating a new packet, which may add to the network latency. It may be worth building the entire packet first in a buffer and sending it with a single call to send.
>
> I tried this a few days ago but it didn't make any difference for me. [...]
>
> Here's what I tried for both send_msg and send_rpc_cmd, but it didn't seem to help me at all: 5ce87d1 [code quoted above]

Actually this now seems to be giving me quite a boost! With this PR and that branch merged, I'm now getting this:

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | pp512 | 354.63 ± 1.07 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg128 | 22.70 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg256 | 22.41 ± 0.01 |
| glm4moe 355B.A32B Q6_K | 230.78 GiB | 356.79 B | CUDA,RPC | 99 | 1 | CUDA0/RPC0/RPC1/RPC2/RPC3/CUDA1 | 16.00/15.00/15.00/15.00/15.00/17.00 | tg512 | 22.07 ± 0.05 |

compared to this the other day:

> This makes quite a big difference for me: on master I got tg128/tg256/tg512 of 15.64/15.64/15.54 t/s, while with the PR I got 19.47/19.48/19.30 t/s (~+24% tg) -- full tables in my earlier comment.

It may be something else that has changed, so I will check next week to be sure, but it seems odd that I got around 15 tokens/s when I tested this branch a couple of days ago.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I made a draft PR #16892 to see if this makes any difference to others, or if it only helps with this specific PR, etc.

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

I also had a go at writing a (very!) hacky proof of concept for the volatile cache last night:

// Only try the volatile cache when the data size is within this range
const size_t MIN_CACHE_THRESHOLD = 20 * 1024;
const size_t MAX_CACHE_THRESHOLD = 1024 * 1024;

static bool send_data(sockfd_t sockfd, const void * data, size_t size) {
    static std::unordered_set<uint64_t> sent_hashes;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint64_t hash = generate_hash((const uint8_t*)data, size);
        bool is_new = sent_hashes.find(hash) == sent_hashes.end();

        uint8_t flag = is_new ? 1 : 0;
        if (send(sockfd, (const char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (send(sockfd, (const char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (!is_new) {
            return true;
        }
        sent_hashes.insert(hash);
    }

    size_t bytes_sent = 0;
    while (bytes_sent < size) {
        size_t size_to_send = std::min(size - bytes_sent, MAX_CHUNK_SIZE);
        ssize_t n = send(sockfd, (const char *)data + bytes_sent, size_to_send, 0);
        if (n < 0) {
            GGML_LOG_ERROR("send failed (bytes_sent=%zu, size_to_send=%zu)\n",
                           bytes_sent, size_to_send);
            return false;
        }
        bytes_sent += (size_t)n;
    }
    return true;
}

static bool recv_data(sockfd_t sockfd, void * data, size_t size) {
    static std::unordered_map<uint64_t, std::vector<uint8_t>> recv_cache;

    uint64_t hash = 0;

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        uint8_t flag;
        if (recv(sockfd, (char*)&flag, sizeof(flag), 0) != sizeof(flag)) {
            return false;
        }
        if (recv(sockfd, (char*)&hash, sizeof(hash), 0) != sizeof(hash)) {
            return false;
        }

        if (flag == 0) {
            auto it = recv_cache.find(hash);
            if (it != recv_cache.end()) {
                memcpy(data, it->second.data(), size);
                return true;
            }
            return false;
        }
    }

    size_t bytes_recv = 0;
    while (bytes_recv < size) {
        size_t size_to_recv = std::min(size - bytes_recv, MAX_CHUNK_SIZE);
        ssize_t n = recv(sockfd, (char *)data + bytes_recv, size_to_recv, 0);
        if (n < 0) {
            GGML_LOG_ERROR("recv failed (bytes_recv=%zu, size_to_recv=%zu)\n",
                           bytes_recv, size_to_recv);
            return false;
        }
        if (n == 0) {
            LOG_DBG("recv returned 0 (peer closed?)\n");
            return false;
        }
        bytes_recv += (size_t)n;
    }

    if (size > MIN_CACHE_THRESHOLD && size < MAX_CACHE_THRESHOLD) {
        recv_cache[hash] = std::vector<uint8_t>((uint8_t*)data, (uint8_t*)data + size);
    }

    return true;
}
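generate_hash() is not shown above; any fast non-cryptographic hash would do. Purely as a hypothetical stand-in (not necessarily the hash actually used), a minimal 64-bit FNV-1a could look like:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the generate_hash() used above: 64-bit FNV-1a.
// Any fast non-cryptographic hash (e.g. xxhash64) would slot in the same way.
static uint64_t generate_hash(const uint8_t * data, size_t size) {
    uint64_t h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < size; ++i) {
        h ^= data[i];                    // xor the next byte in
        h *= 0x100000001b3ULL;           // multiply by the 64-bit FNV prime
    }
    return h;
}
```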

It did give a small improvement (from 15.5 tokens/s to 17.5 tokens/s), but either the hashing still has a lot of overhead, or the repeated serialisation of the compute graph, as mentioned by @rgerganov, is very costly; with this PR I get 19.5 tokens/s in comparison! ☹️

I then destroyed the branch trying to hack in the code to join the packets after the push linked above, so will have to take another look at this next week and likely start fresh.

I did find that the other tensors that were exactly 2x the context length (which I suspected had something to do with the position embeddings) are the same each time and could also be cached (this finally produced a balanced flow over the network of about 4-5 MB/s, which is purely the hidden-state data).

@jukofyork
Collaborator

jukofyork commented Oct 31, 2025

@slaren Is there a better way to associate an ID with ggml_cgraph?

I don't think there is a good way to do this at the moment. This implementation is not reliable either. To do this in a reliable way, you would need to check every node of the graph to determine if it is changed, similar to what the CUDA backend does to implement CUDA graphs. In the future with the graph plan API this may be unnecessary.

It's a pity, as this PR does seem to help a lot (though maybe mainly for large MoE models?).

From my hacky experiments last night, it definitely looks like @rgerganov is correct that the serialisation+hashing (even using xxhash64) is going to be worse than this PR (my only hope was that it was the multiple packets, but after reading @rgerganov's post on the other PR, I'm not hopeful).

By "visiting every node", does that mean the data of every node needs comparing (as in the full ~450 kB of data in my example), or is it just a set of "node IDs" or similar that needs comparing?
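For reference, a node-by-node check along the lines slaren describes would compare each node's metadata (op, shape, source tensors) rather than the tensor contents themselves. A rough sketch with a hypothetical, minimal node record (the real ggml_tensor has many more fields a proper check would also have to compare):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal node record; ggml_tensor has many more fields
// (strides, op params, views, ...) that a real check would also compare.
struct node_key {
    int32_t      op;      // operation id
    int64_t      ne[4];   // tensor shape
    const void * src[2];  // identities of the source tensors
};

static bool node_equal(const node_key & a, const node_key & b) {
    if (a.op != b.op) return false;
    for (int d = 0; d < 4; ++d) {
        if (a.ne[d] != b.ne[d]) return false;
    }
    return a.src[0] == b.src[0] && a.src[1] == b.src[1];
}

// The graph is reusable only if every node matches the previous run.
static bool graph_unchanged(const node_key * prev, const node_key * cur, int n_nodes) {
    for (int i = 0; i < n_nodes; ++i) {
        if (!node_equal(prev[i], cur[i])) return false;
    }
    return true;
}
```

If this sketch is representative, it is the node metadata rather than the full tensor contents that gets compared, although it still means walking every node on every compute call.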

@kyuz0

kyuz0 commented Nov 10, 2025

On my setup (two AMD Strix Halo hosts), this PR has improved prompt processing by ~5.5%:

  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |        175.56 ± 1.40 |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |         17.50 ± 0.08 |

Right now I only have a screenshot of the performance without this PR:


Happy to do more testing. My setup currently uses USB-to-USB at 10 Gbps, and I'm trying to see if I can get to 20-40 Gbps.

@kyuz0

kyuz0 commented Nov 10, 2025

Improvement was there but minimal for GLM 4.6:

llama-bench -mmp 0 -fa 1 -m GLM-4.6-UD-Q4_K_XL-00001-of-00005.gguf --rpc "192.168.2.2:50052"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |         34.18 ± 0.05 |
| glm4moe 355B.A32B Q4_K - Medium | 189.69 GiB |   356.79 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |          8.81 ± 0.01 |

@rgerganov
Member Author

@kyuz0 thanks for taking the time to test this on your setup

I will try to rework this with the approach suggested by @slaren by using the CUDA backend as reference

@ggerganov
Member

I wonder if you can simply reuse this approach that I prototyped some time ago for the Metal backend to determine when a graph is the same as the previous one:

https://github.com/ggml-org/llama.cpp/pull/14570/files#diff-4045ccc2edf36af53c4a28fca4a6a24f628b3c9bee7ccb43b06e55dc5e5e199eR5482-R5514

At the time, I thought it was hacky, but I don't actually see a failure case - it simply does a memcmp of the entire graph data at the byte level.

It didn't end up being used because it didn't lead to any improvements for Metal (#14570). But here it could be a reasonable stopgap until the graph plan API lands.
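That memcmp stopgap can be sketched like this (a hypothetical helper, not the actual Metal or RPC code): serialize the graph into a flat byte buffer, keep the previous buffer around, and reuse the cached graph whenever the bytes are identical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of the byte-level comparison: cache the previous
// serialized graph and report a hit when the new one is byte-identical.
struct graph_cache {
    std::vector<uint8_t> last;  // serialized bytes of the previous graph

    // Returns true if `serialized` matches the cached graph (safe to reuse);
    // otherwise stores it as the new cached graph and returns false.
    bool try_reuse(const std::vector<uint8_t> & serialized) {
        if (!last.empty() && serialized.size() == last.size() &&
            memcmp(serialized.data(), last.data(), last.size()) == 0) {
            return true;
        }
        last = serialized;
        return false;
    }
};
```

The appeal is that it needs no per-node bookkeeping: any change anywhere in the graph changes the bytes and forces a full resend.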

@rgerganov
Member Author

I have reimplemented the graph caching using the approach suggested by @ggerganov. Both client and server maintain a ring buffer of cached compute graphs and reuse them when possible. The newly added RPC commands for computing graphs do not return a result -- if graph_compute() returns something different from GGML_STATUS_SUCCESS, the server simply closes the connection. This saves a network round trip and should bring a noticeable TG speed improvement.

@kyuz0 @jukofyork Could you please test this PR on your testbeds and report the results here?
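The ring buffer described above might look roughly like this (a sketch under assumptions, not the actual ggml-rpc implementation): keep N serialized graphs, reuse a slot id on a content match, and evict the oldest entry otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-size ring buffer of serialized compute graphs.
// On a hit the caller can send a short "recompute graph <id>" command
// instead of re-serializing and re-sending the whole graph.
template <size_t N>
struct graph_ring {
    std::array<std::vector<uint8_t>, N> slots;
    size_t next = 0;  // next slot to overwrite (oldest entry)

    // Returns the slot id; sets `hit` if the graph was already cached.
    size_t put(const std::vector<uint8_t> & g, bool & hit) {
        for (size_t i = 0; i < N; ++i) {
            if (!slots[i].empty() && slots[i].size() == g.size() &&
                memcmp(slots[i].data(), g.data(), g.size()) == 0) {
                hit = true;
                return i;
            }
        }
        hit = false;
        size_t id = next;
        slots[id] = g;          // evict whatever was in the oldest slot
        next = (next + 1) % N;
        return id;
    }
};
```

As long as client and server evict in the same order, the slot id alone identifies the graph on both sides.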

@kyuz0

kyuz0 commented Nov 25, 2025

Thanks @rgerganov . I did some initial testing, and now I can't see a difference in performance on MiniMax with or without the PR. I will run a proper test bench in a bit to confirm, there might be an issue with the testing I did.

@rgerganov
Member Author

I am seeing 13% TG improvement with gpt-oss-20b when running on localhost:

master

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | pp512 | 9332.25 ± 120.10 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | tg128 | 251.03 ± 1.38 |

PR

| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | pp512 | 9473.22 ± 108.71 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC | 99 | tg128 | 283.68 ± 0.83 |

@eugr

eugr commented Nov 25, 2025

I'm seeing a bit of improvement with dual DGX Sparks connected via their 200G port on Qwen3-VL-235B (and even GPT-OSS-120B), but for some reason not on MiniMax M2.

Qwen3-235B

Before

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 | 528.52 ± 2.88 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 | 12.98 ± 0.05 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d4096 | 469.70 ± 5.47 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d4096 | 11.62 ± 0.08 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d8192 | 420.87 ± 8.01 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d8192 | 11.15 ± 0.08 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d16384 | 340.40 ± 8.40 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d16384 | 9.90 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d32768 | 226.70 ± 35.69 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d32768 | 8.03 ± 0.04 |

After

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 | 551.75 ± 1.11 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 | 13.78 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d4096 | 491.34 ± 22.40 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d4096 | 12.47 ± 0.09 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d8192 | 454.67 ± 1.51 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d8192 | 12.15 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d16384 | 377.73 ± 0.39 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d16384 | 10.91 ± 0.02 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | pp2048 @ d32768 | 279.02 ± 0.82 |
| qwen3vlmoe 235B.A22B Q4_K - Medium | 124.91 GiB | 235.09 B | CUDA,RPC | tg32 @ d32768 | 8.99 ± 0.07 |

MiniMax M2

Before

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 | 906.42 ± 1.27 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 | 25.32 ± 0.28 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d4096 | 822.09 ± 4.14 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d4096 | 21.47 ± 0.16 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d8192 | 736.49 ± 6.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d8192 | 19.03 ± 0.12 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d16384 | 615.61 ± 5.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d16384 | 15.49 ± 0.22 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d32768 | 460.02 ± 5.47 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d32768 | 11.14 ± 0.07 |

After

| model | size | params | backend | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --------------: | -------------------: |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 | 901.43 ± 2.48 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 | 25.90 ± 0.08 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d4096 | 810.92 ± 1.31 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d4096 | 23.10 ± 0.27 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d8192 | 731.97 ± 3.02 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d8192 | 19.81 ± 0.14 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d16384 | 606.83 ± 1.97 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d16384 | 16.08 ± 0.08 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | pp2048 @ d32768 | 456.50 ± 1.00 |
| minimax-m2 230B.A10B Q4_K - Medium | 122.58 GiB | 228.69 B | CUDA,RPC | tg32 @ d32768 | 11.71 ± 0.07 |

@kyuz0

kyuz0 commented Nov 26, 2025

So, I think my current issue is that the baseline, without the PR, has improved to match the performance of the PR:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
| model                          |       size |     params | backend    | ngl | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ---: | --------------: | -------------------: |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           pp512 |        173.79 ± 0.42 |
| minimax-m2 230B.A10B Q6_K      | 180.94 GiB |   228.69 B | ROCm,RPC   |  99 |  1 |    0 |           tg128 |         17.10 ± 0.03 |

I will look at the pipelines that I use to build my containers, but I am building the container WITHOUT the PR using this Dockerfile, with the env APPLY_LLAMA_PR_15405 set to 0:

https://github.com/kyuz0/amd-strix-halo-toolboxes/blob/main/toolboxes/Dockerfile.rocm-6.4.4-rocwmma

So I cannot see a way that my container inadvertently includes the PR when APPLY_LLAMA_PR_15405 is set to 0.

@Djip007
Contributor

Djip007 commented Nov 26, 2025

> https://github.com/kyuz0/amd-strix-halo-toolboxes/blob/main/toolboxes/Dockerfile.rocm-6.4.4-rocwmma

On Fedora 43, ROCm 6.4.4 is now upstream, so there is no need to use the "el9" packages.

In my case I create a toolbox (I'm on Silverblue ;)):

toolbox create -d fedora -r 43 rocm-devel-fc43
toolbox enter rocm-devel-fc43

# add rocm+dep:
 sudo dnf upgrade --refresh
 sudo dnf install hipblas-devel rocm-hip-devel rocblas-devel
 sudo dnf install cmake gcc-c++
 sudo dnf install rocm-smi rocminfo amd-smi
 sudo dnf install libcurl-devel

# build with
 cmake -S . -B build/rocm6.4.4 -DGGML_HIP=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
 cmake --build build/rocm6.4.4 --config Release -- -j 16

# ...

@kyuz0

kyuz0 commented Nov 26, 2025

> https://github.com/kyuz0/amd-strix-halo-toolboxes/blob/main/toolboxes/Dockerfile.rocm-6.4.4-rocwmma
>
> On Fedora 43, ROCm 6.4.4 is now upstream, so there is no need to use the "el9" packages.
>
> In my case I create a toolbox (I'm on Silverblue ;)):
>
> toolbox create -d fedora -r 43 rocm-devel-fc43
> toolbox enter rocm-devel-fc43
>
> # add rocm+dep:
>  sudo dnf upgrade --refresh
>  sudo dnf install hipblas-devel rocm-hip-devel rocblas-devel
>  sudo dnf install cmake gcc-c++
>  sudo dnf install rocm-smi rocminfo amd-smi
>  sudo dnf install libcurl-devel
>
> # build with
>  cmake -S . -B build/rocm6.4.4 -DGGML_HIP=ON -DGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
>  cmake --build build/rocm6.4.4 --config Release -- -j 16
>
> # ...

Thank you. Regardless of where ROCm comes from, I am still not sure why I am getting the same performance with and without the PR; and again, even without the PR, performance now seems better.

I suspect user error, so I might just create a toolbox manually instead of my pipelines.

@Djip007
Contributor

Djip007 commented Nov 26, 2025

It is sometimes hard to bench: with the Framework Desktop in Performance mode, the power limit is:

  • 160 W for a few seconds
  • 140 W for 10 min
  • 120 W after that
    It is sometimes hard to know whether it is at 140 W or has already dropped to 120 W... waiting 10 min idle is "long" ;)
    I can see the CPU speed for llama.cpp CPU benches, but I'm not sure how to get the GPU speed for now. (edit: OK with radeontop ;) )

Note: in most cases I test the fa-wmma slowdown with high pp (>2000...)

Note: did you connect the two MAX machines with an RJ45 network, or USB4?

@rgerganov
Member Author

@ggerganov I changed the implementation to cache only the last computed graph for each device and reuse it when possible. It is much simpler and cleaner this way. Also removed the response from RPC_CMD_GRAPH_COMPUTE to save a network round trip. If computing the graph is not successful, the server closes the connection.

@ggerganov
Member

The CI needs a fix before merging: https://github.com/ggml-org/llama.cpp/actions/runs/19734283728/job/56542411768?pr=15405#step:4:1857

Store the last computed graph and reuse it when possible.
Also do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
@rgerganov rgerganov merged commit 15d2b46 into ggml-org:master Nov 28, 2025
72 of 74 checks passed
Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request Apr 28, 2026
* RPC support multiple devices

* rpc : update documentation (ggml-org#16441)

Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

# Conflicts:
#	examples/rpc/README.md

* Remove memory settings

* rpc : cache and reuse compute graphs (ggml-org#15405)

Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.

* Add -cpu to include cpu backend

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026

Labels

ggml changes relating to the ggml tensor library for machine learning


7 participants