
Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17116

fix #16657
ref ggml-org/llama.cpp#16276 (review)

This fixes RPC inference when the Metal backend is involved.

Testing:

# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls? (one possible shape is sketched below)
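
A minimal sketch of what such response caching could look like on the client side: memoize alloc-size responses by the serialized request bytes, so repeated identical requests skip the network round-trip. The function name, cache key, and fetch callback here are hypothetical; only the idea comes from the TODO above.

#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical memoization of get-alloc-size responses, keyed by the raw
// request bytes. fetch_over_rpc stands in for the real round-trip
// (RPC_CMD_GET_ALLOC_SIZE in ggml-rpc.cpp).
static uint64_t get_alloc_size_cached(const std::vector<uint8_t> & req_bytes,
                                      const std::function<uint64_t()> & fetch_over_rpc) {
    static std::map<std::string, uint64_t> cache;       // request bytes -> reported size
    std::string key(req_bytes.begin(), req_bytes.end());
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;                              // cache hit: no network call
    }
    uint64_t size = fetch_over_rpc();                   // one RPC round-trip on a miss
    cache.emplace(std::move(key), size);
    return size;
}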

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: RPC Allocation Size Logic Fix

Overview

PR #262 implements a fix for RPC inference when the Metal backend is involved, addressing the allocation size calculation logic in the RPC system. The changes are contained within the GGML RPC subsystem (ggml-rpc.h and ggml-rpc.cpp) and do not modify core inference functions.

Analysis Results

Performance Metrics: No performance data was available for the specified version comparison, indicating either an incomplete analysis pipeline run or changes too localized to produce measurable differences in the core inference path.

Code Changes Scope: The modifications are limited to:

  • RPC protocol version bump (breaking change requiring client/server sync)
  • Enhanced allocation size request structure to include source tensors (see the sketch after this list)
  • Null pointer safety improvements in tensor serialization
  • Expanded allocation logic for specific operations (GGML_OP_FLASH_ATTN_EXT, GGML_OP_MUL_MAT_ID)
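
A minimal sketch of what the enlarged request could look like, assuming the rpc_tensor wire struct (internal to ggml-rpc.cpp) and the GGML_MAX_SRC constant from ggml.h; the struct and field names below are illustrative, not the actual wire format.

#include "ggml.h"  // GGML_MAX_SRC

// Illustrative request layout: the tensor to size plus its sources, so the
// server can apply op-specific sizing without chasing client-side pointers.
// rpc_tensor is the fixed-size wire representation defined in ggml-rpc.cpp.
struct rpc_msg_get_alloc_size_req {
    rpc_tensor tensor;              // tensor whose allocation size is requested
    rpc_tensor srcs[GGML_MAX_SRC];  // source tensors, now serialized with the request
};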

Core Function Impact: The changes do not affect primary inference functions (llama_decode, llama_encode, llama_tokenize) or other performance-critical components identified in the project structure. The modifications are isolated to RPC backend allocation logic.

Network and Memory Impact: The fix introduces additional RPC message overhead by serializing source tensors (GGML_MAX_SRC * sizeof(rpc_tensor) per allocation request) and increases server-side memory allocation. However, this overhead only affects distributed inference scenarios using RPC backends.
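
For illustration, a hedged sketch of how those sources might be packed null-safely into such a request. serialize_tensor() is a helper in ggml-rpc.cpp; the helper name and the convention of treating a zeroed rpc_tensor as "no source" are assumptions of this sketch.

#include <cstring>  // std::memset

// Hypothetical helper that packs tensor->src[] into the request null-safely;
// assumes ggml_tensor from ggml.h and the request sketch above. A null source
// becomes a zeroed slot rather than being dereferenced.
static void pack_srcs(rpc_msg_get_alloc_size_req & req, const ggml_tensor * tensor) {
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (tensor->src[i] != nullptr) {
            req.srcs[i] = serialize_tensor(tensor->src[i]);
        } else {
            std::memset(&req.srcs[i], 0, sizeof(req.srcs[i]));
        }
    }
}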

Correctness Benefits: The implementation addresses a fundamental issue where allocation size calculations were insufficient for certain tensor operations, particularly affecting Metal backend compatibility. The fix prevents potential allocation failures that could cause crashes or incorrect results in distributed inference setups.

Binary Impact: Changes affect RPC-enabled binaries (llama-cli, rpc-server) when used with distributed inference configurations. Standard local inference remains unaffected.

The changes represent a targeted correctness fix with minimal performance impact on typical usage patterns. The modifications improve system reliability for distributed inference scenarios while maintaining compatibility with existing local inference workflows.
