b2950 broke RPC mode #7427

Closed
steampunque opened this issue May 21, 2024 · 3 comments

@steampunque

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

After the b2950 patch, RPC functionality is broken. When offloading to 3 machines, the first server crashes with the message below. Reverting back to b2949 fixes the problem.

ll_startrpc
create_backend: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
Starting RPC server on 0.0.0.0:50052, backend memory: 8022 MB
Accepted client connection, free_mem=8412266496, total_mem=8500477952
GGML_ASSERT: /usr/local/src/ai/llamacpp/llama.cpp/ggml-backend.c:226: offset + size <= ggml_nbytes(tensor) && "tensor write out of bounds"
[New LWP 11678]
[New LWP 11684]
[New LWP 11685]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f66930dc3c7 in wait4 () from /lib64/libc.so.6
#0 0x00007f66930dc3c7 in wait4 () from /lib64/libc.so.6
#1 0x0000000000411f4b in ggml_print_backtrace ()
#2 0x000000000046639a in ggml_backend_tensor_set ()
#3 0x0000000000541d20 in start_rpc_server ()
#4 0x0000000000406ebc in main ()
[Inferior 1 (process 11677) detached]
/usr/local/bin/ll_startrpc: line 14: 11677 Aborted rpc-server -H 0.0.0.0 -p 50052

@rgerganov
Collaborator

You are most probably running an old rpc-server with a new build of llama.cpp. We have added #pragma pack(push, 1) to rpc_tensor and now it is serialized into 292 bytes instead of 296 bytes. Make sure that you are building rpc-server from the same source tree as the rest of the binaries.
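
For illustration, here is a minimal sketch of why the packing change matters; the fields below are made up and are not the actual rpc_tensor layout. Packing removes the padding the compiler would otherwise insert, so the struct shrinks on the wire, and an old rpc-server that still expects the padded layout misreads offsets and sizes, which is consistent with the out-of-bounds assert above:

```cpp
// Hypothetical struct, for illustration only -- not the real rpc_tensor fields.
#include <cstdint>
#include <cstdio>

struct tensor_example {            // old layout: compiler inserts padding for alignment
    uint64_t id;
    uint32_t type;                 // 4 bytes of padding follow so 'offset' stays 8-byte aligned
    uint64_t offset;
};

#pragma pack(push, 1)
struct tensor_example_packed {     // new layout: packing removes the padding
    uint64_t id;
    uint32_t type;
    uint64_t offset;
};
#pragma pack(pop)

int main() {
    std::printf("unpacked: %zu bytes\n", sizeof(tensor_example));        // typically 24
    std::printf("packed:   %zu bytes\n", sizeof(tensor_example_packed)); // 20
}
```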

I may add a new HELLO command which advertises the version of the rpc-server when a new client connects. This may prevent problems like this in the long term.
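
A rough sketch of what such a handshake could look like; this is purely illustrative, and the message layout, constant, and function names below are assumptions rather than the actual llama.cpp RPC protocol:

```cpp
// Illustrative version handshake: the server announces its protocol version
// in a HELLO message, and the client refuses to proceed on a mismatch
// instead of sending tensors the server will misparse.
#include <cstdint>
#include <cstdio>

constexpr uint32_t RPC_PROTO_VERSION = 1;  // hypothetical version constant, bumped on wire-format changes

struct hello_msg {                         // hypothetical HELLO message layout
    uint32_t magic;                        // identifies the stream as an RPC handshake
    uint32_t version;                      // protocol version of the sender
};

// Client-side check after receiving the server's HELLO.
bool check_server_version(const hello_msg & msg) {
    if (msg.version != RPC_PROTO_VERSION) {
        std::fprintf(stderr,
            "rpc-server speaks protocol v%u, client expects v%u - rebuild both from the same tree\n",
            (unsigned) msg.version, (unsigned) RPC_PROTO_VERSION);
        return false;
    }
    return true;
}
```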

@steampunque
Author

steampunque commented May 21, 2024

I am pretty sure client and server were both built from the same source tree, b2950. I upgraded all 3 machines to b2950 at the same time.

There is a small chance the source tree didn't sync right to the 1070 machine; I will double-check and rebuild it tonight.

@steampunque
Author

That was the problem: my source tree got out of sync on the 1070 machine somehow. Thanks.
