UPSTREAM PR #17713: common : add parser for ministral/mistral 3#409
Conversation
Explore the complete analysis inside the Version Insights

Performance Review Summary - PR #409

Overview

This PR introduces a PEG parser framework for Ministral/Mistral 3 model output parsing, adding 5362 lines across 27 files. The implementation provides streaming-aware parsing with AST generation, Unicode handling, and GBNF grammar generation capabilities.

Key Findings

Performance-Critical Areas Impact

Chat Parsing Infrastructure:

STL Iterator Functions:

Inference Performance Impact

Tokens Per Second:

Power Consumption Analysis

Binary-Level Impact:

The power consumption increases correlate with extended execution time in chat parsing components. The overhead stems from PEG parser construction (1500-3000 ns), runtime parsing with AST generation (1800-3600 ns), and mapper traversal (500-1000 ns). These binaries utilize chat parsing functionality, accumulating overhead across multiple parse operations during execution.
force-pushed dba8180 to 8654e36
force-pushed 53cc80f to 765aff0
Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #409

Analysis Overview

PR #409 introduces a new chat template parser for Ministral 3 models, adding 119 lines to

Key Findings

Performance-Critical Functions Impact: The analysis identified performance changes in STL container accessor functions rather than core inference functions. The most significant changes occurred in:

These functions are STL template instantiations used for container iteration in chat processing, not in the core inference path. The changes appear to stem from compiler optimization differences affecting template code generation rather than algorithmic modifications in PR #409.

Inference Performance Impact: No changes were detected in core inference functions.

Power Consumption Analysis: Three binaries show measurable increases:

These increases correlate with the STL accessor regressions rather than the PR's new functionality. The new Ministral 3 parser code path is not yet active in the analyzed binaries since the PR remains unmerged.

Code Changes Analysis: PR #409 adds

The implementation introduces JSON object manipulation and string operations during chat initialization. The code follows existing patterns in the codebase for other model-specific chat handlers (llama_3_x, lfm2, magistral).

Systemic Pattern: The STL iterator performance changes affect 8 of the top 10 functions by response time change, suggesting a build configuration or compiler version difference between the analyzed versions rather than code-level changes. The pattern is consistent across vector, tree, and JSON container types, indicating a toolchain-level issue independent of PR #409's functional changes.
force-pushed 4ba0a8d to 4587bfa
force-pushed 765aff0 to 86cb434
Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #409

Project: llama.cpp | PR: #409 - Ministral 3 Parser

Implementation Summary

This PR adds PEG parser support for Ministral 3 Reasoning/Instruct models through a new 109-line function.

Key Findings

Most-Impacted Functions

Chat Template Processing (Non-Inference Path):

Note: All impacted functions are in chat template processing, message transformation, or test infrastructure. None are in the core inference pipeline.

Inference Performance Impact

Tokens Per Second: No Impact

The core inference functions show no changes:
The observed regressions occur in chat template initialization code that executes once per chat session, not per token. The initialization overhead of 25,000-35,000 ns is amortized across the entire chat session. For a typical chat generating 100 tokens, this adds 250-350 ns per token, which is negligible compared to the typical per-token inference time of 10-50 ms on CPU. Tokens per second remains unchanged because the per-token inference path is unaffected by these changes. Power Consumption AnalysisBinary-Level Impact: Three binaries show measurable power consumption increases:
Core Libraries: No Impact
The power consumption increases are confined to application binaries that use chat template processing. The core inference libraries show no power consumption change, confirming that the inference pipeline is unaffected.
force-pushed a9fcc24 to ea62cd5
Mirrored from ggml-org/llama.cpp#17713
Note
This is intended to be a functional demonstration of the PEG parser implemented in #17136, and as such depends on that PR. I have squashed the commits down to make it easy to view the parser implementation for Ministral 3 Reasoning/Instruct and Mistral Large 3 Instruct.
Parser implementation for Ministral 3 Reasoning/Instruct and Mistral Large 3 Instruct. It deviates from the previous Mistral outputs by generating tool calls in the form:
[TOOL_CALLS]tool_name[ARGS]{"arg1": ... }...

Features
- reasoning_content [1]
- Maps system and assistant messages containing reasoning_content into {"type": "thinking", "thinking": "..."} content blocks the chat template expects. #17700
- tool_choice = auto and tool_choice = required (with thinking).
- response_format with thinking.

Keeping this as a draft until #17136 gets the stamp of approval.
Footnotes
Currently only reasoning_format = auto/deepseek is supported. I was unaware the reasoning format is not exposed during chat param init, but this is easy to address.