
UPSTREAM PR #17937: common : refactor common_sampler + grammar logic changes#523

Open
loci-dev wants to merge 1 commit into main from upstream-PR17937-branch_ggml-org-gg/common-refactor

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17937

from #17004

Extracting some refactoring portions from #17004 to make the review easier:

  • Simplify the management of llama objects (samplers, contexts, model) and make it safer
  • The common_init_result now also owns the sampler chains constructed during common_init_from_params()
  • The sampler chains of common_init_result are constructed before the model and the context - we will need this for #17004 in order to optionally pass the samplers during the construction of the context
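The ownership and construction order described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the actual llama.cpp code: the `llama_sampler`/`llama_model`/`llama_context` structs here are empty stand-ins, and `init_result`/`init_from_params` are hypothetical names mirroring `common_init_result`/`common_init_from_params()`.

```cpp
#include <memory>
#include <vector>

// Empty stand-ins for the real llama.cpp types, used only to
// illustrate ownership and construction order.
struct llama_sampler {};
struct llama_model   {};
struct llama_context {};

// Sketch of a common_init_result-like holder: the result owns the
// sampler chains in addition to the model and the context.
struct init_result {
    std::vector<std::unique_ptr<llama_sampler>> samplers; // one per sequence
    std::unique_ptr<llama_model>   model;
    std::unique_ptr<llama_context> context;
};

init_result init_from_params(int n_seq_max) {
    init_result res;
    // 1) samplers are built FIRST, so they can optionally be passed
    //    to the context during its construction (needed for #17004)
    for (int i = 0; i < n_seq_max; ++i) {
        res.samplers.push_back(std::make_unique<llama_sampler>());
    }
    // 2) then the model, 3) then the context
    res.model   = std::make_unique<llama_model>();
    res.context = std::make_unique<llama_context>();
    return res;
}
```

Because everything is held in owning smart pointers, the destructor of the result frees the samplers, context, and model without manual cleanup.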

ref ggml-org/llama.cpp#17750 (comment)

Another change related to the grammar logic (the explanation is in the referenced comment):

  • No longer maintain a separate sampler chain for the grammar
  • Merge the grammar into the main common_sampler chain
  • The grammar is now always applied first to the raw logits, before the rest of the samplers
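The "grammar first" ordering can be pictured with a toy chain. This is a simplified stand-in, not the real implementation: llama.cpp builds the chain with `llama_sampler_chain_init()`/`llama_sampler_chain_add()`, while here a plain vector of named samplers shows the intended ordering.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for a sampler chain entry.
struct sampler { std::string name; };
using sampler_chain = std::vector<sampler>;

// Build the chain with the grammar as the FIRST element, so it is
// applied to the raw logits before top-k, top-p, temperature, etc.
sampler_chain make_chain(bool has_grammar) {
    sampler_chain chain;
    if (has_grammar) {
        chain.push_back({"grammar"}); // always index 0 when present
    }
    chain.push_back({"top_k"});
    chain.push_back({"top_p"});
    chain.push_back({"temp"});
    chain.push_back({"dist"});
    return chain;
}
```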

The main reason for this change is to make the integration of #17004 compatible with grammar usage and to simplify the handling of the grammar when it is present. The main concern is that this will likely hurt performance when grammar sampling is involved, since we no longer do the "rejection sampling" trick. I think it's better to put effort into optimizing grammar performance in general, so that we don't need the trick at all.

@loci-review

loci-review bot commented Dec 11, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary

Project: llama.cpp
Versions: 5ee9b3ed (target) vs 835b8c3a (baseline)
Scope: 26 files modified (+367/-290 lines)


Overview

This PR refactors the common initialization and sampling subsystems, introducing PIMPL-based resource management and merging grammar samplers into the main chain. The changes affect utility functions and initialization paths while leaving core inference operations unchanged.


Key Findings

Performance-Critical Functions

common_sampler_accept (common/sampling.cpp:324-347)

  • Per-call throughput cost increased by 154 ns (102 ns → 256 ns)
  • Response time increased by 169 ns (367 ns → 536 ns)
  • Changed from two direct function calls to iterating over all samplers in the chain (typically 8-12 samplers)
  • The function now calls llama_sampler_chain_get() and llama_sampler_accept() for each sampler instead of accepting the entire chain at once
  • This function is invoked for every generated token during inference
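The new accept path can be sketched as follows. This is a self-contained mock, not the llama.cpp source: the real loop uses `llama_sampler_chain_n()`, `llama_sampler_chain_get()`, and `llama_sampler_accept()`, whereas here a stub `sampler` struct stands in to show why per-token cost now scales with chain length.

```cpp
#include <vector>

using llama_token = int;

// Stub sampler that only counts accepted tokens, standing in for
// llama_sampler + llama_sampler_accept().
struct sampler {
    int n_accepted = 0;
    void accept(llama_token /*token*/) { ++n_accepted; }
};

// Sketch of the new common_sampler_accept: instead of two direct calls
// (grammar + chain), iterate over every sampler in the chain and accept
// the token on each one. Typical chains hold 8-12 samplers, and this
// runs once per generated token.
void sampler_accept(std::vector<sampler> & chain, llama_token token) {
    for (auto & s : chain) {
        s.accept(token);
    }
}
```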

~common_init_result (common/common.h:673)

  • Response time increased by 2,987 ns (1,035 ns → 4,023 ns)
  • The destructor now cleans up a vector of samplers (one per sequence, sized by n_seq_max)
  • Each sampler destruction involves freeing the llama_sampler chain and internal state
  • With typical n_seq_max=1, overhead is approximately 120 ns per additional sampler
  • This occurs once per initialization lifecycle, not in the inference hot path

STL Container Operations

  • std::vector<char>::begin: per-call time increased by 89 ns (62 ns → 151 ns)
  • std::deque<long>::back: per-call time increased by 103 ns (80 ns → 183 ns)
  • std::thread::joinable: per-call time increased by 112 ns (83 ns → 195 ns)
  • These changes suggest debug mode or reduced optimization flags in the build configuration rather than code modifications

Inference Impact

Core inference functions remain unchanged:

  • llama_decode: no modifications detected
  • llama_encode: no modifications detected
  • llama_tokenize: no modifications detected
  • ggml_mul_mat and attention mechanisms: no modifications detected

Tokens per second impact: The sampling refactoring adds 154 ns per token to the generation pipeline through common_sampler_accept. For a model generating 100 tokens per second, this amounts to roughly 15 microseconds of additional overhead per second, which is negligible compared to typical inference times. The core tensor operations and decode functions that dominate inference time (typically milliseconds per token) are unaffected. Based on the reference that a 2 ms slower llama_decode reduces tokens per second by 7%, the 154 ns sampling overhead scales linearly to approximately 0.0005% of generation throughput.

Power Consumption

Utility binaries show modest increases:

  • llama-gguf-split: +755 nJ (+2.43%)
  • llama-tokenize: +723 nJ (+2.42%)
  • llama-quantize: +796 nJ (+2.35%)
  • llama-bench: +773 nJ (+1.64%)

Inference binaries show minimal impact:

  • llama-run: +870 nJ (+0.40%)
  • llama-cvector-generator: +631 nJ (+0.25%)
  • llama-tts: +501 nJ (+0.20%)

Core libraries unchanged:

  • libggml-base.so, libggml-cpu.so, libggml.so, libllama.so: 0% change

The power consumption increases correlate with the additional sampler management overhead in utility operations. The core inference libraries show no power consumption change, confirming that tensor operations and model execution remain unaffected.

Code Changes

The refactoring implements three main changes:

  1. PIMPL pattern for common_init_result: Encapsulates model, context, and samplers behind an implementation pointer. Samplers are now allocated upfront for all sequences (sized by n_seq_max) during initialization rather than on-demand.

  2. Grammar sampler integration: Previously maintained as a separate sampler (grmr), grammar is now the first element in the unified sampler chain. This eliminates the rejection sampling mechanism that would resample with grammar-first if the initial token was invalid.

  3. Sampler acceptance loop: Changed from accepting the grammar and chain separately to iterating through all samplers in the chain, with special handling for the grammar sampler at index 0.
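The PIMPL arrangement from point 1 can be sketched like this. The layout is illustrative (stub types, and the real `common_init_result` lives in common/common.h with its implementation in common/common.cpp); it shows only the pattern: the header exposes an opaque implementation pointer, and the destructor must be defined where the implementation type is complete.

```cpp
#include <memory>
#include <vector>

// --- header side (sketch of common/common.h) ---
struct common_init_result {
    common_init_result();
    ~common_init_result(); // defined out of line, where impl is complete
    struct impl;
    std::unique_ptr<impl> pimpl; // opaque implementation pointer
};

// --- implementation side (sketch of common/common.cpp) ---
// Empty stand-ins for the real llama.cpp types.
struct llama_model_stub   {};
struct llama_context_stub {};
struct llama_sampler_stub {};

struct common_init_result::impl {
    std::unique_ptr<llama_model_stub>   model;
    std::unique_ptr<llama_context_stub> context;
    // samplers allocated upfront for all sequences (sized by n_seq_max)
    std::vector<std::unique_ptr<llama_sampler_stub>> samplers;
};

common_init_result::common_init_result() : pimpl(new impl()) {}
common_init_result::~common_init_result() = default;
```

Defaulting the destructor in the implementation file is what lets `std::unique_ptr<impl>` work with an incomplete type in the header; it is also where the per-sampler cleanup cost measured in ~common_init_result is incurred.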

The absolute time increases are measured in nanoseconds and microseconds, representing minimal impact on overall inference performance which operates in millisecond timescales.

@loci-dev loci-dev force-pushed the main branch 27 times, most recently from 45e0e28 to e9472cd Compare December 15, 2025 02:47
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e02e9be to 9f1f66d Compare December 19, 2025 11:08