UPSTREAM PR #19261: spec : fix the check-rate logic of ngram-simple #1133
Conversation
Overview

Analysis of 115,469 functions across 15 binaries reveals minimal performance impact from the speculative decoding refactoring. Modified: 80 functions (0.07%); new: 44; removed: 55; unchanged: 115,290.

Function Analysis

common_speculative_state_ngram_simple::draft() (both binaries): intentional optimization adding check-rate throttling. Throughput time increased +79% (+61 ns) due to the added counter logic, but response time improved -1% (-284 ns to -294 ns) by reducing the frequency of the expensive O(n²) pattern matching. Net positive optimization.

std::match_results::_M_establish_failed_match (llama-tts): throughput time improved -53% (-81 ns), response time -4% (-81 ns). Benefits from the std::regex::optimize flag adopted in the codebase, which improves regex automaton efficiency.

Standard library functions (various): mixed compiler optimization effects. Improvements include std::_Rb_tree::end() (-75% throughput), json_sax_dom_callback_parser::null() (-75% throughput), and std::_Function_handler::_M_invoke (-46% throughput). Regressions include std::vector::end() (+307% throughput, +183 ns), nlohmann::json::get() (+307% throughput, +183 ns), and std::make_shared (+207% throughput, +130 ns). All changes are in non-critical paths (initialization, text preprocessing, configuration loading) with absolute impacts under 200 ns.

Additional Findings

Zero impact on performance-critical inference paths: matrix operations (70-90% of inference time), attention mechanisms, the KV cache, quantization kernels, and all GPU backends remain unchanged. Changes are isolated to CPU-side speculative decoding utilities in the common library. No GPU/ML operations affected.

🔎 Full breakdown: Loci Inspector.
Note
Source pull request: ggml-org/llama.cpp#19261
fix #19231
For the spec-simple method, we don't need to keep track of the last length to rate-limit the generations. We can simply use an incremental counter. This makes the speculator work with "Regenerate" on the last message and with branching the conversation from earlier messages.