
Conversation

@DajanaV (Collaborator) commented Nov 18, 2025

Mirrored from ggml-org/llama.cpp#17116

fix #16657
ref ggml-org/llama.cpp#16276 (review)

This fixes RPC inference when the Metal backend is involved.

Testing:

# server
make -j && ./bin/rpc-server

# cli
make -j && ./bin/llama-cli -m ../models/gemma-3-4b-it/ggml-model-f16.gguf --rpc localhost:50052 -ngl 99 --no-mmap -no-cnv -p "Hello" --top-k 1 -n 32 -fa on

TODO:

  • Check performance impact
  • Cache the responses to avoid extra RPC calls? (one possible shape is sketched below)
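
A minimal sketch of what such response caching could look like on the client side: memoize alloc-size responses by the serialized request bytes, so repeated identical requests skip the network round-trip. The function name, cache key, and fetch callback here are hypothetical; only the idea comes from the TODO above.

#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical memoization of get-alloc-size responses, keyed by the raw
// request bytes. fetch_over_rpc stands in for the real round-trip
// (RPC_CMD_GET_ALLOC_SIZE in ggml-rpc.cpp).
static uint64_t get_alloc_size_cached(const std::vector<uint8_t> & req_bytes,
                                      const std::function<uint64_t()> & fetch_over_rpc) {
    static std::map<std::string, uint64_t> cache;       // request bytes -> reported size
    std::string key(req_bytes.begin(), req_bytes.end());
    auto it = cache.find(key);
    if (it != cache.end()) {
        return it->second;                              // cache hit: no network call
    }
    uint64_t size = fetch_over_rpc();                   // one RPC round-trip on a miss
    cache.emplace(std::move(key), size);
    return size;
}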

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: RPC Allocation Size Logic Fix

Overview

PR #262 implements a fix for RPC inference when the Metal backend is involved, addressing the allocation size calculation logic in the RPC system. The changes are contained within the GGML RPC subsystem (ggml-rpc.h and ggml-rpc.cpp) and do not modify core inference functions.

Analysis Results

Performance Metrics: No performance data was available for the specified version comparison, indicating either an incomplete analysis pipeline run or changes too localized to produce measurable differences in the core inference path.

Code Changes Scope: The modifications are limited to:

  • RPC protocol version bump (breaking change requiring client/server sync)
  • Enhanced allocation size request structure to include source tensors (see the sketch after this list)
  • Null pointer safety improvements in tensor serialization
  • Expanded allocation logic for specific operations (GGML_OP_FLASH_ATTN_EXT, GGML_OP_MUL_MAT_ID)
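
A minimal sketch of what the enlarged request could look like, assuming the rpc_tensor wire struct (internal to ggml-rpc.cpp) and the GGML_MAX_SRC constant from ggml.h; the struct and field names below are illustrative, not the actual wire format.

#include "ggml.h"  // GGML_MAX_SRC

// Illustrative request layout: the tensor to size plus its sources, so the
// server can apply op-specific sizing without chasing client-side pointers.
// rpc_tensor is the fixed-size wire representation defined in ggml-rpc.cpp.
struct rpc_msg_get_alloc_size_req {
    rpc_tensor tensor;              // tensor whose allocation size is requested
    rpc_tensor srcs[GGML_MAX_SRC];  // source tensors, now serialized with the request
};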

Core Function Impact: The changes do not affect primary inference functions (llama_decode, llama_encode, llama_tokenize) or other performance-critical components identified in the project structure. The modifications are isolated to RPC backend allocation logic.

Network and Memory Impact: The fix introduces additional RPC message overhead by serializing source tensors (GGML_MAX_SRC * sizeof(rpc_tensor) per allocation request) and increases server-side memory allocation. However, this overhead only affects distributed inference scenarios using RPC backends.
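
For illustration, a hedged sketch of how those sources might be packed null-safely into such a request. serialize_tensor() is a helper in ggml-rpc.cpp; the helper name and the convention of treating a zeroed rpc_tensor as "no source" are assumptions of this sketch.

#include <cstring>  // std::memset

// Hypothetical helper that packs tensor->src[] into the request null-safely;
// assumes ggml_tensor from ggml.h and the request sketch above. A null source
// becomes a zeroed slot rather than being dereferenced.
static void pack_srcs(rpc_msg_get_alloc_size_req & req, const ggml_tensor * tensor) {
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (tensor->src[i] != nullptr) {
            req.srcs[i] = serialize_tensor(tensor->src[i]);
        } else {
            std::memset(&req.srcs[i], 0, sizeof(req.srcs[i]));
        }
    }
}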

Correctness Benefits: The implementation addresses a fundamental issue where allocation size calculations were insufficient for certain tensor operations, particularly affecting Metal backend compatibility. The fix prevents potential allocation failures that could cause crashes or incorrect results in distributed inference setups.

Binary Impact: Changes affect RPC-enabled binaries (llama-cli, rpc-server) when used with distributed inference configurations. Standard local inference remains unaffected.

The changes represent a targeted correctness fix with minimal performance impact on typical usage patterns. The modifications improve system reliability for distributed inference scenarios while maintaining compatibility with existing local inference workflows.
