UPSTREAM PR #17713: common : add parser for ministral/mistral 3 #409

Open

loci-dev wants to merge 5 commits into main from upstream-PR17713-branch_aldehir-ministral-3

Conversation

@loci-dev loci-dev commented Dec 3, 2025

Mirrored from ggml-org/llama.cpp#17713

Note

This is intended to be a functional demonstration of the PEG parser implemented in #17136 and, as such, depends on that PR. I have squashed the commits to make it easy to view the parser implementation for Ministral 3 Reasoning/Instruct and Mistral Large 3 Instruct.

Parser implementation for Ministral 3 Reasoning/Instruct and Mistral Large 3 Instruct. It deviates from previous Mistral output formats by generating tool calls in the form: [TOOL_CALLS]tool_name[ARGS]{"arg1": ... }...
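The new output format can be illustrated with a minimal sketch. Note this is a regex-based toy for readability only; the PR itself implements a streaming PEG parser in C++, and the helper name here is hypothetical:

```python
import json
import re

# Hedged sketch: a regex-based illustration of the [TOOL_CALLS]name[ARGS]{...}
# format described above. The actual PR uses a PEG parser, not regexes.
TOOL_CALL_RE = re.compile(
    r"\[TOOL_CALLS\](\w+)\[ARGS\](\{.*?\})(?=\[TOOL_CALLS\]|$)", re.S
)

def extract_tool_calls(text):
    """Return (tool_name, parsed_args) pairs from a model output string."""
    return [(m.group(1), json.loads(m.group(2))) for m in TOOL_CALL_RE.finditer(text)]

calls = extract_tool_calls('[TOOL_CALLS]get_weather[ARGS]{"city": "Paris"}')
print(calls)  # [('get_weather', {'city': 'Paris'})]
```

The lookahead lets the sketch also split parallel tool calls, since each argument object ends either at the next [TOOL_CALLS] marker or at the end of the output.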

Features

  • Extracts reasoning to reasoning_content¹
  • Formats system and assistant messages containing reasoning_content into the {"type": "thinking", "thinking": "..."} content blocks the chat template expects (#17700)
  • Supports tool calling for both tool_choice = auto and tool_choice = required (with thinking).
  • Supports parallel tool calls
  • Supports response_format with thinking.
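The second feature above, reformatting messages into structured content blocks, can be sketched as follows. This is an illustration only, not the PR's C++ implementation; the "type": "text" block shape is an assumption, and only the "thinking" block shape comes from the PR description:

```python
# Hedged sketch: rewrite an assistant message carrying reasoning_content
# into the structured content blocks the Mistral 3 chat template expects.
# The {"type": "text", ...} shape is an assumed companion block, not
# confirmed by the PR text.
def to_thinking_blocks(message):
    content = []
    reasoning = message.get("reasoning_content")
    if reasoning:
        content.append({"type": "thinking", "thinking": reasoning})
    if message.get("content"):
        content.append({"type": "text", "text": message["content"]})
    return {"role": message["role"], "content": content}

msg = {"role": "assistant", "reasoning_content": "step 1...", "content": "Answer."}
print(to_thinking_blocks(msg))
```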

Keeping this as a draft until #17136 gets the stamp of approval.

Footnotes

  1. Currently only reasoning_format = auto/deepseek is supported. I was unaware the reasoning format is not exposed during chat param init, but this is easy to address.

loci-review bot commented Dec 3, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary - PR #409

Overview

This PR introduces a PEG parser framework for Ministral/Mistral 3 model output parsing, adding 5362 lines across 27 files. The implementation provides streaming-aware parsing with AST generation, Unicode handling, and GBNF grammar generation capabilities.

Key Findings

Performance-Critical Areas Impact

Chat Parsing Infrastructure:

  • common_chat_msg_parser constructor shows +3923 ns response time increase, attributed to PEG parser initialization overhead including builder instantiation, rule creation for tool definitions, and parser serialization
  • Destructor exhibits +1395 ns increase due to cleanup of PEG arena structures containing vectors, unordered maps, and shared pointers for JSON schemas
  • The parsing path now includes format dispatch, parser deserialization, AST construction, and mapper traversal adding 1800-3600 ns per message parse operation

STL Iterator Functions:

  • Multiple iterator operations across llama-run, llama-tts, and llama-cvector-generator show +133-135 ns increases
  • Functions affected include std::_Rb_tree_iterator, std::__detail::_Node_iterator_base, and vector iterators
  • Regression appears related to compilation unit expansion from new template-heavy includes affecting inlining decisions

Inference Performance Impact

Tokens Per Second:
No direct impact on inference throughput. The affected functions (common_chat_msg_parser, chat parsing utilities) operate outside the core inference path. Functions responsible for tokenization and inference (llama_decode, llama_encode, llama_tokenize) show no modifications or performance changes. The parsing overhead affects only chat message formatting and tool call extraction, which occurs before or after inference execution.

Power Consumption Analysis

Binary-Level Impact:

  • llama-run: +28,277 nJ increase (+14.7%)
  • llama-cvector-generator: +30,591 nJ increase (+13.9%)
  • llama-tts: +30,480 nJ increase (+13.6%)

The power consumption increases correlate with extended execution time in chat parsing components. The overhead stems from PEG parser construction (1500-3000 ns), runtime parsing with AST generation (1800-3600 ns), and mapper traversal (500-1000 ns). These binaries utilize chat parsing functionality, accumulating overhead across multiple parse operations during execution.

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from dba8180 to 8654e36 Compare December 3, 2025 21:08
@loci-dev loci-dev force-pushed the upstream-PR17713-branch_aldehir-ministral-3 branch from 53cc80f to 765aff0 Compare December 4, 2025 02:14
loci-review bot commented Dec 4, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #409

Analysis Overview

PR #409 introduces a new chat template parser for Ministral 3 models, adding 119 lines to common/chat.cpp. The performance analysis compared version 091801bf (target) against f050349e (baseline) across the llama.cpp codebase.

Key Findings

Performance-Critical Functions Impact:

The analysis identified performance changes in STL container accessor functions rather than core inference functions. The most significant changes occurred in:

  • std::vector::begin variants: +132 ns throughput increase across multiple binaries (llama-tts, llama-run, llama-cvector-generator)
  • std::vector::end and std::_Rb_tree::end: +24 ns throughput increase
  • nlohmann::basic_json::back: +129 ns throughput increase in llama-run

These functions are STL template instantiations used for container iteration in chat processing, not in the core inference path. The changes appear to stem from compiler optimization differences affecting template code generation rather than algorithmic modifications in PR #409.

Inference Performance Impact:

No changes were detected in core inference functions (llama_decode, llama_encode, llama_tokenize). The PR modifies only chat template processing logic, which executes during initialization before token generation begins. Therefore, tokens per second during inference remains unaffected. The added message transformation and parser construction (310 microseconds typical cost) occurs once per chat session, not per token.

Power Consumption Analysis:

Three binaries show measurable increases:

  • llama-run: +1466 nanojoules (+0.67%)
  • llama-tts: +1628 nanojoules (+0.64%)
  • llama-cvector-generator: +1511 nanojoules (+0.61%)

These increases correlate with the STL accessor regressions rather than the PR's new functionality. The new Ministral 3 parser code path is not yet active in the analyzed binaries since the PR remains unmerged.

Code Changes Analysis:

PR #409 adds common_chat_params_init_mistral_3 function implementing:

  • Message structure transformation (reasoning content extraction)
  • PEG parser construction for tool calls and response formats
  • Grammar building for constrained generation

The implementation introduces JSON object manipulation and string operations during chat initialization. The code follows existing patterns in the codebase for other model-specific chat handlers (llama_3_x, lfm2, magistral).

Systemic Pattern:

The STL iterator performance changes affect 8 of the top 10 functions by response time change, suggesting a build configuration or compiler version difference between the analyzed versions rather than code-level changes. The pattern is consistent across vector, tree, and JSON container types, indicating a toolchain-level issue independent of PR #409's functional changes.

@loci-dev loci-dev force-pushed the main branch 15 times, most recently from 4ba0a8d to 4587bfa Compare December 5, 2025 11:09
@loci-dev loci-dev force-pushed the upstream-PR17713-branch_aldehir-ministral-3 branch from 765aff0 to 86cb434 Compare December 7, 2025 00:49
loci-review bot commented Dec 7, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #409

Project: llama.cpp | PR: #409 - Ministral 3 Parser Implementation
Comparison: Version 4a173264 vs 038b543f


Summary

This PR adds PEG parser support for Ministral 3 Reasoning/Instruct models through a new 109-line function common_chat_params_init_ministral_3 in common/chat.cpp. The implementation introduces message transformation logic, nested lambda-based parser construction, and grammar building for tool calling and reasoning extraction. The changes are isolated to chat template processing and do not affect core inference functions. Performance regressions are observed in STL container operations and lambda invocations, but these occur in initialization code paths rather than per-token inference loops.


Key Findings

Most-Impacted Functions

Chat Template Processing (Non-Inference Path):

  1. __invoke_impl (llama-cvector-generator) - Lambda invocation in PEG parser construction

    • Response time increase: 21,635 ns (49 ns → 21,684 ns)
    • Located in Minja template engine context builtin processing
    • Caused by nested lambda complexity in build_chat_peg_native_parser with multiple builder method calls
    • Executes during chat initialization, not per-token inference
  2. end (llama-run, std::map) - Iterator accessor for std::map<std::string, httplib::FormData>

    • Throughput increase: 135 ns (60 ns → 195 ns)
    • Response time increase: 115 ns (80 ns → 195 ns)
    • Leaf function with no callees, regression entirely in instruction execution
    • Used in HTTP form data processing during server request handling
  3. end (llama-cvector-generator, std::vector) - Iterator accessor for std::vector<nlohmann::json*>

    • Throughput increase: 135 ns (60 ns → 195 ns)
    • Response time increase: 114 ns (81 ns → 195 ns)
    • Identical regression pattern to std::map::end, suggesting systematic compiler or build configuration change
    • Used in JSON array iteration during message transformation
  4. begin (llama-tts, std::vector) - Iterator accessor for vector of shared pointer pairs

    • Throughput increase: 24 ns (38 ns → 62 ns)
    • Response time increase: 31 ns (52 ns → 83 ns)
    • Used in Minja expression processing during template rendering
  5. operator== (llama-cvector-generator, nlohmann::json) - JSON iterator comparison

    • Throughput increase: 117 ns (189 ns → 306 ns)
    • Response time increase: 139 ns (4,378 ns → 4,517 ns)
    • Used in JSON object lookup operations during message transformation

Note: All impacted functions are in chat template processing, message transformation, or test infrastructure. None are in the core inference pipeline.

Inference Performance Impact

Tokens Per Second: No Impact

The core inference functions show no changes:

  • llama_decode - Not modified, no performance change
  • llama_encode - Not modified, no performance change
  • llama_tokenize - Not modified, no performance change
  • llama_batch_init - Not modified, no performance change
  • ggml_backend_graph_compute - Not modified, no performance change

The observed regressions occur in chat template initialization code that executes once per chat session, not per token. The initialization overhead of 25,000-35,000 ns is amortized across the entire chat session. For a typical chat generating 100 tokens, this adds 250-350 ns per token, which is negligible compared to the typical per-token inference time of 10-50 ms on CPU.
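The amortization arithmetic quoted above can be checked directly; the figures below are taken from the paragraph, with 10 ms used as the lower bound of the stated per-token inference time:

```python
# Sanity check of the amortization figures quoted above.
init_overhead_ns = (25_000, 35_000)   # one-time chat-init cost range
tokens = 100                          # tokens in a typical session
per_token_inference_ns = 10e6         # 10 ms lower bound from the text

for overhead in init_overhead_ns:
    amortized = overhead / tokens
    print(f"{amortized:.0f} ns/token, "
          f"{100 * amortized / per_token_inference_ns:.4f}% of per-token time")
```

This confirms the 250-350 ns/token figure, i.e. well under a hundredth of a percent of per-token inference time.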

Tokens per second remains unchanged because the per-token inference path is unaffected by these changes.

Power Consumption Analysis

Binary-Level Impact:

Three binaries show measurable power consumption increases:

  1. llama-tts: +1,810 nJ (+0.714%)

    • Base: 253,455 nJ → Target: 255,265 nJ
    • Driven by test code execution invoking new template processing logic
    • STL iterator operations contribute cumulative overhead
  2. llama-run: +1,546 nJ (+0.706%)

    • Base: 218,840 nJ → Target: 220,386 nJ
    • Driven by std::map::end regression (135 ns throughput increase)
    • HTTP form data processing overhead in server request handling
  3. llama-cvector-generator: +1,473 nJ (+0.591%)

    • Base: 249,347 nJ → Target: 250,820 nJ
    • Driven by std::vector::end regression (135 ns throughput increase) and JSON iterator operations (117 ns throughput increase)
    • Test infrastructure executing new PEG parser test cases

Core Libraries: No Impact

  • libllama.so: -0.000% (no change)
  • libggml.so: 0.000% (no change)
  • libggml-cpu.so: 0.000% (no change)

The power consumption increases are confined to application binaries that use chat template processing. The core inference libraries show no power consumption change, confirming that the inference pipeline is unaffected.

@loci-dev loci-dev force-pushed the main branch 24 times, most recently from a9fcc24 to ea62cd5 Compare December 10, 2025 00:37