
UPSTREAM PR #17937: common : refactor common_sampler + grammar logic changes#523

Open
loci-dev wants to merge 1 commit into main from upstream-PR17937-branch_ggml-org-gg/common-refactor

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17937

from #17004

Extracting some refactoring portions from #17004 to make the review easier:

  • Simplify the management of llama objects (samplers, contexts, model) and make it safer
  • The common_init_result now also owns the sampler chains constructed during common_init_from_params()
  • The sampler chains of common_init_result are constructed before the model and the context - we will need this for #17004 in order to optionally pass the samplers during the construction of the context
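The ownership and construction order described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual llama.cpp code: the `llama_sampler`/`llama_model`/`llama_context` structs here are empty stand-ins, and `init_result`/`init_from_params` are hypothetical names mirroring `common_init_result`/`common_init_from_params()`.

```cpp
#include <memory>
#include <vector>

// Empty stand-ins for the real llama.cpp types, used only to
// illustrate ownership and construction order.
struct llama_sampler {};
struct llama_model   {};
struct llama_context {};

// Sketch of a common_init_result-like holder: the result owns the
// sampler chains in addition to the model and the context.
struct init_result {
    std::vector<std::unique_ptr<llama_sampler>> samplers; // one per sequence
    std::unique_ptr<llama_model>   model;
    std::unique_ptr<llama_context> context;
};

init_result init_from_params(int n_seq_max) {
    init_result res;
    // 1) samplers are built FIRST, so they can optionally be passed
    //    to the context during its construction (needed for #17004)
    for (int i = 0; i < n_seq_max; ++i) {
        res.samplers.push_back(std::make_unique<llama_sampler>());
    }
    // 2) then the model, 3) then the context
    res.model   = std::make_unique<llama_model>();
    res.context = std::make_unique<llama_context>();
    return res;
}
```

Because everything is held in owning smart pointers, the destructor of the result frees the samplers, context, and model without manual cleanup.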

ref ggml-org/llama.cpp#17750 (comment)

Another change related to the grammar logic (the explanation is in the referenced comment):

  • No longer maintain a separate sampler chain for the grammar
  • Merge the grammar into the main common_sampler chain
  • The grammar is now always applied first to the raw logits, before the rest of the samplers
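The "grammar first" ordering can be pictured with a toy chain. This is a simplified stand-in, not the real implementation: llama.cpp builds the chain with `llama_sampler_chain_init()`/`llama_sampler_chain_add()`, while here a plain vector of named samplers shows the intended ordering.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a sampler chain entry.
struct sampler { std::string name; };
using sampler_chain = std::vector<sampler>;

// Build the chain with the grammar as the FIRST element, so it is
// applied to the raw logits before top-k, top-p, temperature, etc.
sampler_chain make_chain(bool has_grammar) {
    sampler_chain chain;
    if (has_grammar) {
        chain.push_back({"grammar"}); // always index 0 when present
    }
    chain.push_back({"top_k"});
    chain.push_back({"top_p"});
    chain.push_back({"temp"});
    chain.push_back({"dist"});
    return chain;
}
```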

The main reason for this change is to make the integration of #17004 compatible with grammar usage and to simplify the handling of the grammar when it is present. The main concern is that this will likely hurt performance when grammar sampling is involved, since we no longer do the "rejection sampling" trick. I think it's better to put effort into optimizing grammar performance in general, so that we don't need the trick at all.

@loci-review

loci-review bot commented Dec 11, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp
Versions: 5ee9b3ed (target) vs 835b8c3a (baseline)
Scope: 26 files modified (+367/-290 lines)


Overview

This PR refactors the common initialization and sampling subsystems, introducing PIMPL-based resource management and merging grammar samplers into the main chain. The changes affect utility functions and initialization paths while leaving core inference operations unchanged.


Key Findings

Performance-Critical Functions

common_sampler_accept (common/sampling.cpp:324-347)

  • Per-call throughput cost increased by 154 ns (102 ns → 256 ns)
  • Response time increased by 169 ns (367 ns → 536 ns)
  • Changed from two direct function calls to iterating over all samplers in the chain (typically 8-12 samplers)
  • The function now calls llama_sampler_chain_get() and llama_sampler_accept() for each sampler instead of accepting the entire chain at once
  • This function is invoked for every generated token during inference
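The new accept path can be sketched as follows. This is a self-contained mock, not the llama.cpp source: the real loop uses `llama_sampler_chain_n()`, `llama_sampler_chain_get()`, and `llama_sampler_accept()`, whereas here a stub `sampler` struct stands in to show why per-token cost now scales with chain length.

```cpp
#include <vector>

using llama_token = int;

// Stub sampler that only counts accepted tokens, standing in for
// llama_sampler + llama_sampler_accept().
struct sampler {
    int n_accepted = 0;
    void accept(llama_token /*token*/) { ++n_accepted; }
};

// Sketch of the new common_sampler_accept: instead of two direct calls
// (grammar + chain), iterate over every sampler in the chain and accept
// the token on each one. Typical chains hold 8-12 samplers, and this
// runs once per generated token.
void sampler_accept(std::vector<sampler> & chain, llama_token token) {
    for (auto & s : chain) {
        s.accept(token);
    }
}
```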

~common_init_result (common/common.h:673)

  • Response time increased by 2,987 ns (1,035 ns → 4,023 ns)
  • The destructor now cleans up a vector of samplers (one per sequence, sized by n_seq_max)
  • Each sampler destruction involves freeing the llama_sampler chain and internal state
  • With typical n_seq_max=1, overhead is approximately 120 ns per additional sampler
  • This occurs once per initialization lifecycle, not in the inference hot path

STL Container Operations

  • std::vector<char>::begin: per-call time increased by 89 ns (62 ns → 151 ns)
  • std::deque<long>::back: per-call time increased by 103 ns (80 ns → 183 ns)
  • std::thread::joinable: per-call time increased by 112 ns (83 ns → 195 ns)
  • These changes suggest debug mode or reduced optimization flags in the build configuration rather than code modifications

Inference Impact

Core inference functions remain unchanged:

  • llama_decode: no modifications detected
  • llama_encode: no modifications detected
  • llama_tokenize: no modifications detected
  • ggml_mul_mat and attention mechanisms: no modifications detected

Tokens per second impact: The sampling refactoring adds 154 ns per token to the generation pipeline through common_sampler_accept. For a model generating 100 tokens per second, this amounts to roughly 15 microseconds of additional overhead per second, which is negligible compared to typical inference times. The core tensor operations and decode functions that dominate inference time (typically milliseconds per token) are unaffected. Based on the reference that a 2 ms slower llama_decode reduces tokens per second by 7%, the 154 ns sampling overhead scales linearly to approximately 0.0005% of generation throughput.

Power Consumption

Utility binaries show modest increases:

  • llama-gguf-split: +755 nJ (+2.43%)
  • llama-tokenize: +723 nJ (+2.42%)
  • llama-quantize: +796 nJ (+2.35%)
  • llama-bench: +773 nJ (+1.64%)

Inference binaries show minimal impact:

  • llama-run: +870 nJ (+0.40%)
  • llama-cvector-generator: +631 nJ (+0.25%)
  • llama-tts: +501 nJ (+0.20%)

Core libraries unchanged:

  • libggml-base.so, libggml-cpu.so, libggml.so, libllama.so: 0% change

The power consumption increases correlate with the additional sampler management overhead in utility operations. The core inference libraries show no power consumption change, confirming that tensor operations and model execution remain unaffected.

Code Changes

The refactoring implements three main changes:

  1. PIMPL pattern for common_init_result: Encapsulates model, context, and samplers behind an implementation pointer. Samplers are now allocated upfront for all sequences (sized by n_seq_max) during initialization rather than on-demand.

  2. Grammar sampler integration: Previously maintained as a separate sampler (grmr), grammar is now the first element in the unified sampler chain. This eliminates the rejection sampling mechanism that would resample with grammar-first if the initial token was invalid.

  3. Sampler acceptance loop: Changed from accepting the grammar and chain separately to iterating through all samplers in the chain, with special handling for the grammar sampler at index 0.
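The PIMPL arrangement from point 1 can be sketched like this. The layout is illustrative (stub types, and the real `common_init_result` lives in common/common.h with its implementation in common/common.cpp); it shows only the pattern: the header exposes an opaque implementation pointer, and the destructor must be defined where the implementation type is complete.

```cpp
#include <memory>
#include <vector>

// --- header side (sketch of common/common.h) ---
struct common_init_result {
    common_init_result();
    ~common_init_result(); // defined out of line, where impl is complete
    struct impl;
    std::unique_ptr<impl> pimpl; // opaque implementation pointer
};

// --- implementation side (sketch of common/common.cpp) ---
// Empty stand-ins for the real llama.cpp types.
struct llama_model_stub   {};
struct llama_context_stub {};
struct llama_sampler_stub {};

struct common_init_result::impl {
    std::unique_ptr<llama_model_stub>   model;
    std::unique_ptr<llama_context_stub> context;
    // samplers allocated upfront for all sequences (sized by n_seq_max)
    std::vector<std::unique_ptr<llama_sampler_stub>> samplers;
};

common_init_result::common_init_result() : pimpl(new impl()) {}
common_init_result::~common_init_result() = default;
```

Defaulting the destructor in the implementation file is what lets `std::unique_ptr<impl>` work with an incomplete type in the header; it is also where the per-sampler cleanup cost measured in ~common_init_result is incurred.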

The absolute time increases are measured in nanoseconds and microseconds, representing minimal impact on overall inference performance which operates in millisecond timescales.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 45e0e28 to e9472cd Compare December 15, 2025 02:47
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e02e9be to 9f1f66d Compare December 19, 2025 11:08