UPSTREAM PR #17937: common : refactor common_sampler + grammar logic changes#523
UPSTREAM PR #17937: common : refactor common_sampler + grammar logic changes#523
Conversation
|
Explore the complete analysis inside the Version Insights Performance Analysis SummaryProject: llama.cpp OverviewThis PR refactors the common initialization and sampling subsystems, introducing PIMPL-based resource management and merging grammar samplers into the main chain. The changes affect utility functions and initialization paths while leaving core inference operations unchanged. Key FindingsPerformance-Critical Functionscommon_sampler_accept (common/sampling.cpp:324-347)
~common_init_result (common/common.h:673)
STL Container Operations
Inference ImpactCore inference functions remain unchanged:
Tokens per second impact: The sampling refactoring adds 154 ns per token to the generation pipeline through Power ConsumptionUtility binaries show modest increases:
Inference binaries show minimal impact:
Core libraries unchanged:
The power consumption increases correlate with the additional sampler management overhead in utility operations. The core inference libraries show no power consumption change, confirming that tensor operations and model execution remain unaffected. Code ChangesThe refactoring implements three main changes:
The absolute time increases are measured in nanoseconds and microseconds, representing minimal impact on overall inference performance which operates in millisecond timescales. |
45e0e28 to
e9472cd
Compare
e02e9be to
9f1f66d
Compare
Mirrored from ggml-org/llama.cpp#17937
from #17004
Extracting some refactoring portions from #17004 to make the review easier:
common_init_resultnow also owns the sampler chains constructed duringcommon_init_from_params()common_init_resultare constructed before the model and the context - we will need this for #17004 in order to optionally pass the samplers during the construction of the contextref ggml-org/llama.cpp#17750 (comment)
Another change related to the grammar logic (the explanation is in the referenced comment):
common_samplerchainThe main reason for this change is to make the integration of #17004 compatible with grammar usage and to simplify the logic for handling the grammar when it is present. The main concern is that this will likely hurt the performance when grammar sampling is involved, since we no longer do the "rejection sampling" trick. I think it's better to put effort to optimize the performance of the grammar in general so we don't need to do the trick at all.