Enable Reasoning via Guided Enforcement #262
This implements the full llguidance integration enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure

Key components:
- params.grammar field in SamplingParams for RPC serialization GuidanceState
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via build_tool_call_lark_grammar()

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided structured_outputs/response_format
Expand SpecialTokens usage to cover EOS uses across the codebase, including the chat template. This gates access to the EOS tokens through a single common API, providing an interdiction point to add or remove them per model or family as required.
- Replace manual EOS token extraction logic with centralized SpecialTokens::new() and idiomatic accessors
- Eliminate EosTokenId enum and related complex serialization logic in favor of direct Vec<u32>
- Update all callers to use SpecialTokens for tool start/end token IDs
- Remove stop_token_ids from SamplingParams and related logic (now handled via SpecialTokens)
- Simplify tokenizer config by replacing EosTokenEntry with Option<String>
- Add comprehensive SpecialTokens API with category-based accessors, ID/string sets, and search methods
Improve the binary example to be a handy extractor for models which developers can use to update special_tokens.rs quickly. Add tags extracted from Qwen3.5 0.8B
Narrow the Common category search specifically to find string dups of actually special tokens (handle "aftermarket" models/merges). Add and test Llama4 and Qwen3.5 MoE
This PR introduces a new `reasoning_effort` parameter to control reasoning block generation in the chat completion API, matching OpenAI's reasoning API behavior.

- **API Extension**: Added `reasoning_effort` field to `ChatCompletionRequest` accepting "none", "low", "medium", or "high" values (case-insensitive)
- **New Module**: Created `src/utils/reasoning.rs` with:
  - `ReasoningEffort` enum with `from_str` deserialization
  - `ThinkingGrammarBuilder` for reasoning block grammar construction
  - `thinking_grammar_with_reasoning_block()` generating Lark grammar patterns
  - `build_reasoning_grammar()` for composing reasoning blocks with base grammars
- **Integration**: Updated `compose_grammars()` in `src/utils/guidance.rs` to accept and apply reasoning effort levels
- **Schema Sanitization**: Enhanced `sanitize_schema_for_llguidance()` in `src/tools/schema.rs` to strip null from required field types, ensuring grammars enforce field presence for tool parameters
- **Special Token Helpers**: Added `reasoning_start_ids()`, `reasoning_end_ids()`, and `reasoning_tokens()` methods to `SpecialTokens` for robust token detection
- **Comprehensive Tests**: Added 11 new tests covering:
  - Reasoning effort parsing and validation
  - Thinking grammar builder functionality
  - Schema null-stripping for required/optional fields
  - Grammar composition permutations with reasoning
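The case-insensitive parsing of `reasoning_effort` described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/utils/reasoning.rs`: the variant set mirrors the four accepted strings, while the error type and the `wants_reasoning_block` helper are assumptions made for the example.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the `ReasoningEffort` enum named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    None,
    Low,
    Medium,
    High,
}

impl FromStr for ReasoningEffort {
    // Assumption: a plain String error; the real module may use its own error type.
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Lowercase first so "HIGH", "High" and "high" all parse the same way.
        match s.to_ascii_lowercase().as_str() {
            "none" => Ok(Self::None),
            "low" => Ok(Self::Low),
            "medium" => Ok(Self::Medium),
            "high" => Ok(Self::High),
            other => Err(format!("unknown reasoning_effort: {other:?}")),
        }
    }
}

impl ReasoningEffort {
    /// Assumed helper: only `None` skips composing a reasoning block into the grammar.
    pub fn wants_reasoning_block(self) -> bool {
        self != Self::None
    }
}
```

A request handler would typically call `value.parse::<ReasoningEffort>()` and reject the request on `Err`.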
This change updates the ToolGrammarBuilder to correctly use pad token IDs for XML tool call termination when building Lark grammars for models that use XML-style tool calling (e.g., Qwen3-Coder/3.5).

The XML format requires closing markers for `</function>` and `</parameter>` tags. When the tokenizer lacks special closing tags, the model can run on generating forever, as XML is not a finite stateless grammar; see guidance-ai/llguidance/issues/306. Use pad tokens as "magic" terminating markers embedded into the grammar and recognizable by the tokenizer/llg mask as output a model cannot normally emit in its text (normally masked to 0.0 logprob).

Anchor the XML function/parameter generation like we bound tool-call and text. This is easier on the model than forcing JSON parsing (qwen3), especially in conjunction with forcing it to `<think>` if it's not trained to do so. Mechanically, we modify the chat template to add special pad tags after the closing tags for function and parameter, and inject those into the grammar template submitted to the model as tool-choice.

Call path:
src/tools/schema.rs:361 build_xml_with_anchors(pad_ids)
├─ Uses pad_ids[0] as `</function>` anchor
└─ Uses pad_ids[1] as `</parameter>` anchor

Grammar structure:
- start: ( text | tool_call )+ eos?
- tool_call: <[tool_start_id]> tool_content <[tool_end_id]>
- tool_0: "<function=fetch_url_via_curl>" param_0_0 ... "</function>" <[pad_id_0]>
- param_0_0: "<parameter=url>" value_0_0 ... "</parameter>" <[pad_id_1]>
- ...

The pad tokens serve as finite termination points for the XML parser, allowing the Lark grammar to generate valid, parseable tool calls without requiring explicit special closing tags in the tokenizer vocabulary.
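As an illustration of the grammar structure outlined above, the following sketch emits the `tool_N` and `param_N_M` rules for one tool with the two pad-token anchors. The function name, its signature and the pad IDs are hypothetical, and the `value_N_M` rules are left undefined here; this is not the `build_xml_with_anchors` implementation.

```rust
/// Emit Lark-style rules for one XML tool, anchored with pad token IDs.
/// `pad_ids[0]` follows `</function>`, `pad_ids[1]` follows `</parameter>`.
/// Hypothetical helper for illustration only.
pub fn build_xml_tool_rules(
    tool_index: usize,
    function_name: &str,
    params: &[&str],
    pad_ids: [u32; 2],
) -> String {
    let mut out = String::new();

    // tool_N: "<function=NAME>" param_N_0 ... "</function>" <[pad_id_0]>
    let param_refs: Vec<String> = (0..params.len())
        .map(|p| format!("param_{tool_index}_{p}"))
        .collect();
    out.push_str(&format!(
        "tool_{tool_index}: \"<function={function_name}>\" {} \"</function>\" <[{}]>\n",
        param_refs.join(" "),
        pad_ids[0]
    ));

    // param_N_M: "<parameter=NAME>" value_N_M "</parameter>" <[pad_id_1]>
    for (p, name) in params.iter().enumerate() {
        out.push_str(&format!(
            "param_{tool_index}_{p}: \"<parameter={name}>\" value_{tool_index}_{p} \"</parameter>\" <[{}]>\n",
            pad_ids[1]
        ));
    }
    out
}
```

Because each closing tag is followed by a token the model cannot emit as ordinary text, the parser has an unambiguous point at which the parameter or function ends.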
|
Here are some review comments: I'm refactoring to resolve these issues. |
|
Enable tool grammar is not working. This is the log before refactoring: After refactor: It can create the guidance state but fails to generate a tool call; the client receives plain text. Model: Qwen3-30B-A3B-Instruct-2507. So we'd better remove the guided decoding for tool call grammar. |
|
@sempervictus I have removed the tool grammar since it is not working. It passed all constraint cases except this one: "Emit a search tool call". Are you able to run a last test? In terms of guided decoding, I think it works and we're ready to merge it. |
|
I think once this one is merged, you can submit another PR to address the tool grammar issue. |
2b02393 to
e95183b
|
I removed the XML binary anchors and am trying \n stubs, but worst case the jinja allows the actual arguments to parameters to be JSON. Testing presently w/ the updated precision PR. Sorry, probably haven't read the giant doc, but we also need to excise the stupid instruction from the jinja telling it to use the XML format before giving it the template |
omg, you brought your old branch back. We can't merge it in its current form because of the above-mentioned issues; I fixed them in the previous commits and you're simply reverting them.😂 |
|
I suggest merging the refactored version from #263 with a note such as

The refactored version resolves several issues, including:

- SpecialTokens::Category::Eos is a heuristic bucket of "end-like" tokens. That is useful for discovery and tooling, but it is not the same thing as the model's actual semantic eos_token_id. For example, <|im_end|>, <|eot_id|>, <|header_end|>, and <|eom_id|> may be structural delimiters, not global EOS, unless the model config explicitly defines them as EOS. Guided decoding only needs the real EOS IDs when it adds optional EOS termination to a grammar. Those IDs should come from resolved model config / generation config, not from the heuristic list in special_tokens.rs.
- The sample loop is not using the standard llguidance flow cleanly. Standard flow is: compute_mask from current matcher state, apply mask, sample once from masked logits, consume_token on the chosen token. Your code does that first, but then adds a second validation/resample loop afterward. That second pass is redundant if the first mask was correct, and it hides state/mask bugs instead of fixing them.
- The fallback re-sample path is batch-wide, not per-row. When one token fails validation, it rebuilds a full-batch tensor and calls sample_with_strategy() across the whole batch, then only keeps re_sampled[seq_idx]. That is nonstandard, wastes work, and can change RNG advancement for unrelated rows in the batch. A standard fallback would re-sample only the affected row, or better, remove the fallback entirely.
- The rollback/prefix-cache changes were unnecessary for guided decoding and unsafe for existing logic. RollbackSnapshot and the rollback path added state into scheduler/block_manager/prefix_cache, but the postprocess check was effectively dead and the rollback math/cache mutation could corrupt normal prefix-cache behavior. I removed that path entirely.
- Existing non-guided behavior had regressed. OpenAI stop was not applied, stop_token_ids matching had been removed, and engine tool token handling had drifted from the model-specific ToolConfig path.
- Grammar composition had standard-mismatch bugs: merged grammars used repeated alternation +, single-EOS formatting was wrong, and regex/literal helpers silently dropped non-ASCII input.
- Guided decoding setup only lived in the OpenAI handler.
- GuidanceState still carries unused speculative/rollback/cache machinery that the current sampling path does not use. That is not standard guided decoding; it is leftover complexity.

I also refactored your Markdown file since it was too heavy (~2,500 lines), which isn't suitable for documentation and tends to cause problems for agents. Large files like this can overwhelm agents' memory with hundreds of thousands of tokens. |
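The standard flow named in this review (compute the mask from the matcher state, apply it, sample once, then consume the chosen token) can be sketched as below. `ToyMatcher` is a hypothetical stand-in that only accepts one fixed token sequence; it is not the llguidance API, and greedy argmax replaces the real sampling strategy so the example stays deterministic.

```rust
/// Toy stand-in for a grammar matcher (NOT the llguidance API):
/// it allows exactly one token at each position of a fixed sequence.
pub struct ToyMatcher {
    expected: Vec<u32>,
    pos: usize,
}

impl ToyMatcher {
    pub fn new(expected: Vec<u32>) -> Self {
        Self { expected, pos: 0 }
    }

    /// Mask of tokens allowed in the current state (true = allowed).
    pub fn compute_mask(&self, vocab_size: usize) -> Vec<bool> {
        let mut mask = vec![false; vocab_size];
        if let Some(&tok) = self.expected.get(self.pos) {
            if let Some(slot) = mask.get_mut(tok as usize) {
                *slot = true;
            }
        }
        mask
    }

    /// Advance the matcher state; errors if the token is not allowed here.
    pub fn consume_token(&mut self, tok: u32) -> Result<(), String> {
        match self.expected.get(self.pos) {
            Some(&want) if want == tok => {
                self.pos += 1;
                Ok(())
            }
            _ => Err(format!("token {tok} not allowed at position {}", self.pos)),
        }
    }
}

/// One guided decoding step for a single row: mask, sample once, commit.
pub fn guided_step(matcher: &mut ToyMatcher, logits: &mut [f32]) -> Result<u32, String> {
    // 1. Compute the mask from the current matcher state.
    let mask = matcher.compute_mask(logits.len());
    // 2. Apply it: disallowed tokens can never be sampled.
    for (logit, allowed) in logits.iter_mut().zip(mask.iter()) {
        if !*allowed {
            *logit = f32::NEG_INFINITY;
        }
    }
    // 3. Sample exactly once (greedy argmax over the masked logits).
    let mut best: Option<(usize, f32)> = None;
    for (idx, &logit) in logits.iter().enumerate() {
        let better = match best {
            Some((_, top)) => logit > top,
            None => logit > f32::NEG_INFINITY,
        };
        if better {
            best = Some((idx, logit));
        }
    }
    let (idx, _) = best.ok_or_else(|| "mask allows no token".to_string())?;
    let token = idx as u32;
    // 4. Commit the chosen token; with a correct mask this cannot fail.
    matcher.consume_token(token)?;
    Ok(token)
}
```

With a correct mask, step 4 succeeds by construction, which is why a second validate-and-resample pass after it adds no safety and only obscures mask or state bugs.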
22a2664 to
7f70675
|
Done sir done! 😄 I'll get the tools piece into another PR in the morning after i run some coders through it to see if they can pass something that breaks the XML (JSON has been solid since day 1) |
|
Got an issue in the last audit: Opencode with Qwen3.5-27B-FP8 reports this (prompts: /init and then "find bugs in the guided decoding") The server log: Meanwhile, the current main works well with it. |
|
Right now we are at #263 level where I thought the tools path is inert. I'm afk for the evening but in the tools branch I do not see those with the flag enabled or disabled. Will push that commit in a few hours once at keyboard. |
Another bug here: We should not apply that if a constraint grammar is not provided. |
The revision of scheduler.rs for tool calling is also buggy; I reverted that together with the fix for the above. It works now. You need to force-push #263 here again. |
|
Another bug: once a guided decoding request has been processed, it can affect normal requests, making them unable to handle tool calls properly. The state management for guided decoding needs a double check. |
Root cause: The old gate was effectively “if the runner has an llguidance factory, enter the guided path”, which is a batch-level condition, not a per-request one. So when a constrained request and a normal tool-call request were decoded in the same batch, the whole batch went through the llguidance masking/commit flow. |
Fixed that in #263 by adding per-sequence guided decoding. I think after final alignment with #263, @sempervictus you'd better do a last test especially for tool calling, guided decoding or mixed normal decoding and guided one. If they all passed, we can merge it. I think I have fixed almost all vulnerable points. |
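The difference between a batch-level gate and a per-sequence gate can be sketched as follows. The function name and types are illustrative assumptions, not the code in #263: the point is only that membership in the guided set is checked per row, so unguided rows in a mixed batch never enter the masking/commit path.

```rust
use std::collections::HashSet;

/// Split batch row indices into (guided, unguided) by sequence ID.
/// Hypothetical helper: only rows whose seq_id carries a constraint
/// should go through the llguidance mask/commit flow.
pub fn partition_batch(
    seq_ids: &[usize],
    guided_seq_ids: &HashSet<usize>,
) -> (Vec<usize>, Vec<usize>) {
    let mut guided_rows = Vec::new();
    let mut plain_rows = Vec::new();
    for (row, seq_id) in seq_ids.iter().enumerate() {
        if guided_seq_ids.contains(seq_id) {
            guided_rows.push(row);
        } else {
            plain_rows.push(row);
        }
    }
    (guided_rows, plain_rows)
}
```

A batch-level condition such as "the runner has a factory" would instead route every row down the guided path, which is the failure mode described above.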
|
If you're still awake, I actually had a question about this piece (removed in your branch): if we don't have full seq rollback but the constraint+SamplingParams mask ends up producing something not viable for the constraint itself, this chunk was supposed to "fix" it token by token. I have a faster version of this which has early abort on the check, but I left this in specifically so we could verify whether a correction needed to happen, so we'd know that something bears looking into. Is the logic wrong, or did you cull this because it's a performance hog?
- self.commit_guided_tokens(&seq_ids, &tokens, guided_seq_ids);
+ // Clone guided_seq_ids for commit_guided_tokens and the following block
+ let guided_seq_ids_for_commit = guided_seq_ids.clone();
+ self.commit_guided_tokens(&seq_ids, &tokens, guided_seq_ids_for_commit);
+
+ if let Some(ref guided_seq_ids) = guided_seq_ids {
+     let mut guidance_states = self.guidance_states.write();
+     let mut guidance_failed = self.guidance_failed.write();
+     for (seq_idx, seq_id) in seq_ids.iter().enumerate() {
+         if !guided_seq_ids.contains(seq_id) || guidance_failed.contains(seq_id) {
+             continue;
+         }
+
+         if let Some(state) = guidance_states.get_mut(seq_id) {
+             if state.is_finished() {
+                 continue;
+             }
+
+             let token = tokens[seq_idx];
+             if let Err(err) = state.commit_token(token) {
+                 if guidance_failed.insert(*seq_id) {
+                     crate::log_warn!(
+                         "[Seq {}] Failed to commit guided token {}: {}. Disabling constraints for this sequence.",
+                         seq_id,
+                         token,
+                         err
+                     );
+                 }
+                 let _ = guidance_states.remove(seq_id);
+             }
+         }
+     }
+ }
pulling it out of the commit tree but figure since i have it in front of me i would ask 😄 |
The post-sample validation was intentional, but I didn’t remove that behavior. I moved it into commit_guided_tokens(...), which is still called immediately after sampling. |
|
ah! thanks 348 test cases - only the user-supplied regex failing. having that handled, out for a few hours. |
Have you pushed the latest changes from #263 here? There are more bug fixes. |
|
Pulled them in yesterday and will rebase the branch again now - am getting all the grammar tests to pass as you're way better at the engine stuff, so just using 263 state as a base and collapsing as much as I can into guidance.rs and schema to avoid pollution of server and such. Sorry it's a bit hectic this week w gtc - live in a rather remote area, it's a whole thing to get to civilization and conferences :). The good news is all the lark logic works after moving the typed ones back to their types from llg, so we aren't having to process those inline, which simplified the string composition logic. The @ refs I haven't been able to get working, but all 11 permutation inputs and their outputs are being built and compiling to Matcher FSMs in the tests (takes a while, there are 348 variants in the current set but it can be thousands when you feed the tester lots of tool options and grammar |
I've done a last audit in #263; simply force-push #263 here and I will merge this PR. |
|
ah! sorry! pushed before i switched to this tab. 1s, will rebase on 263 only and separate this to another PR |
6f95821 to
6d8daae
|
All set sir, we're aligned. Will open another one for review with the various grammar options and i'll include the example in that branch while its in draft so we (and rather importantly the LLMs) can see all of the resulting grammars and compile them. |
|
Thanks! |
* Implement Constrained Generation via LLGuidance
* Support Qwen3.5 Dense models on Metal (#258)
* Utilize SpecialTokens Idiomatic Accessor for EOS
* Idiomatic SpecialTokens Access Pattern
* More SpecialTokens, Improve Example/Binary
* SpecialTokens Strings for Llama4 and Qwen3.5 MoE
* ToolConfig Population w/ SpecialTokens
* Drop ToolFormat
* Lead The Horse to Water, Make Him <Think>
* Tier Reasoning Effort
* Anchor XML Tool-Grammar With SpecialTokens Pads
* Cargo fmt
* Refactor guided decoding
* Update docs
* Typo fix
* Remove tool grammar & fix slow first token response for sync request
* Strip guided-decoding’s leftover tool grammar surface
* Fix incorrect guidance application
* Revert changes for scheduler.rs (tool call related)
* Apply per-sequence guided decoding
* Remove redundancy
* Fix corner case
* Permit empty tool call result

---------

Co-authored-by: RageLtMan <rageltman [at] sempervictus>
Co-authored-by: Guoqing Bao <topon@outlook.com>
This should more or less complete the LLG work by including reasoning API support a la OpenAI's.
Overview
The guided inference system supports multiple reasoning effort levels through the `ReasoningEffort` enum. Each level implements a different reasoning strategy optimized for specific use cases:
- None
- Low
- Medium
- High
- ChainOfThought

Reasoning Effort Levels
ReasoningEffort::Low - Fast Thinking
Implements "Fast Thinking" with tight length constraints (~150 chars max). This reduces hallucination risk by limiting the generation space.

ReasoningEffort::Medium - Standard CoT
Implements Wei et al. (2022) baseline with sentence-based termination. Allows multiple steps but enforces sentence boundaries.

ReasoningEffort::High - Adversarial Analysis
Implements Cheng & Su (2025) adversarial critique pattern. Forces explicit error checking before final output.

ReasoningEffort::ChainOfThought - CoVe Pattern
Combines Madaan et al. (2024) Chain-of-Verification with adversarial self-correction. Maximum accuracy for complex/fact-sensitive tasks.
API Reference
ReasoningEffort Enum
Location: src/utils/reasoning.rs:16-39

ThinkingGrammarBuilder
Location: src/utils/reasoning.rs:143-178
Methods:
- ThinkingGrammarBuilder::new(start_id, end_id, effort) - Create new builder
- ThinkingGrammarBuilder::from_string(start_id, end_id) - Create from string IDs
- build() - Generate Lark grammar string
- build_grammar() - Generate TopLevelGrammar

build_reasoning_grammar()
Location: src/utils/reasoning.rs:183-218
Wraps a base composer with reasoning blocks when reasoning effort is enabled.
Includes #232 and #260
sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant SpecialTokens
    participant LLGFactory
    participant Matcher
    participant TokenParser
    participant EarleyParser
    participant Lexer
    participant TokTrie
    participant Sampler
    participant LogitsProcessor
    participant Model

    User->>API: Request with constraint (regex/json_schema/lark/llguidance)

    Note over User,API: Phase 1: Request Setup and Grammar Building
    API->>SpecialTokens: SpecialTokens::new(&tokenizer)
    SpecialTokens-->>API: Return EOS, BOS, TOOL token IDs
    API->>Pipeline: build_llg_factory(tokenizer)
    Pipeline->>LLGFactory: toktrie_hf_tokenizers::ByteTokenizer::from_tokenizer(tokenizer)
    LLGFactory->>TokTrie: Create token trie from tokenizer vocabulary
    TokTrie-->>LLGFactory: Return TokEnv with trie
    LLGFactory->>LLGFactory: ParserFactory::new_simple(&env)
    LLGFactory-->>Pipeline: Return Arc<ParserFactory>
    Pipeline->>Pipeline: llg_grammar_from_constraint(&request.constraint)
    Pipeline->>Matcher: constraint_from_llg_grammar(&factory, grm)
    Matcher->>Matcher: factory.create_parser(grm)
    Matcher->>TokenParser: Create with grammar_init
    TokenParser->>EarleyParser: Build CGrammar from grammar
    TokenParser->>Lexer: Build LexerSpec from grammar
    Lexer->>TokTrie: Precompute large lexemes if needed
    TokTrie-->>Lexer: Return optimized lexeme sets

    Note over User,Matcher: Phase 2: Prompt Processing (if needed)
    User->>API: Optional: process_prompt(prompt_tokens)
    API->>TokenParser: process_prompt(prompt_tokens)
    TokenParser->>TokenParser: tokenize_bytes_marker(&prompt_bytes)
    TokenParser->>TokenParser: process_prompt() returns new prompt

    Note over User,Matcher: Phase 3: Inference Loop
    loop for each token generation
        Model->>Model: Forward pass on input tokens
        Model-->>Pipeline: Return logits tensor
        Pipeline->>Sampler: sample_sequence(logits, seq, ...)

        Note over Sampler: Two-stage sampling with llguidance
        Sampler->>LogitsProcessor: Apply llguidance constraint
        LogitsProcessor->>TokenParser: compute_mask()
        TokenParser->>TokenParser: compute_mask_inner()
        TokenParser->>EarleyParser: run_speculative("compute_mask")
        EarleyParser->>EarleyParser: trie_started("compute_mask")
        EarleyParser->>EarleyParser: compute_bias()
        EarleyParser->>Lexer: compute_bias() with token_prefix

        Note over Lexer,TokTrie: Lexical Scope Analysis
        Lexer->>TokTrie: Walk token trie for allowed lexemes
        TokTrie-->>Lexer: Return SimpleVob bit mask
        Lexer->>EarleyParser: Return mask to TokenParser
        TokenParser->>TokenParser: cache mask for fast-forward
        TokenParser-->>LogitsProcessor: Return SimpleVob mask
        LogitsProcessor->>LogitsProcessor: Check if sampled token is allowed
        LogitsProcessor->>Sampler: Apply logit biasing
        alt Token is allowed
            Sampler->>Sampler: No biasing needed
        else Token is not allowed
            Sampler->>Sampler: Set invalid tokens to -f32::INFINITY
            Sampler->>Sampler: Re-sample with biased logits
        end

        Sampler->>TokenParser: consume_token(sampled_token)
        TokenParser->>TokenParser: apply_token(sampled_token)
        TokenParser->>TokenParser: llm_tokens.push(sampled_token)
        TokenParser->>TokenParser: llm_bytes.extend(token_bytes)
        TokenParser->>EarleyParser: parser.apply_token(token_bytes, token_id)
        EarleyParser->>Lexer: advance lexer state
        Lexer->>Lexer: Update lexer_stack with new state
        Lexer->>EarleyParser: Return backtrack count
        alt Backtrack needed
            EarleyParser->>EarleyParser: rollback(backtrack_bytes)
            EarleyParser->>EarleyParser: Update llm_tokens and llm_bytes
        end
        TokenParser->>TokenParser: check_stop()
        TokenParser-->>Sampler: Return CommitResult

        Note over Sampler: Phase 4: Fast-Forward (if enabled)
        Sampler->>TokenParser: compute_ff_tokens()
        TokenParser->>TokenParser: ff_tokens()
        TokenParser->>TokTrie: Tokenize forced bytes
        TokTrie-->>TokenParser: Return fast-forward tokens
        alt Fast-forward tokens available
            TokenParser->>TokenParser: consume_ff_tokens()
            loop for each ff_token
                TokenParser->>TokenParser: consume_token(ff_token)
                TokenParser->>TokenParser: llm_tokens.push(ff_token)
                TokenParser->>TokenParser: llm_bytes.extend(ff_token_bytes)
            end
        end

        Note over Sampler: Phase 5: Speculative Decoding (if enabled)
        Model->>Model: Draft model forward pass
        Model-->>Pipeline: Return draft logits
        Pipeline->>Sampler: sample_target_sequence_speculative()
        Sampler->>TokenParser: rollback(n_toks)
        TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
        EarleyParser->>Lexer: pop lexer states
        Lexer-->>TokenParser: Return rollback result
        Sampler->>Sampler: Sample draft tokens
        Sampler->>TokenParser: validate_tokens(draft_tokens)
        TokenParser->>TokenParser: consume_token(draft_token)
        alt Draft token accepted
            TokenParser->>TokenParser: Continue with next draft
        else Draft token rejected
            TokenParser->>TokenParser: Accept partial draft
            TokenParser->>TokenParser: Rollback to last valid state
        end
    end

    Note over User,Matcher: Phase 6: Token Geometry and Binary Data State
    TokTrie->>TokTrie: Token encoding (8:24 bit split)
    TokTrie->>TokTrie: node.bits = (token_id << 8) | byte
    TokTrie->>TokTrie: node.bits2 = (subtree_size << 10) | num_parents
    TokTrie->>SimpleVob: Bit mask storage
    SimpleVob->>SimpleVob: data: Vec<u32> (32 tokens per word)
    SimpleVob->>SimpleVob: allow_token(tok): data[tok>>5] |= 1 << (tok&31)

    Note over User,Matcher: Phase 7: Rollback and Verification
    TokenParser->>TokenParser: validate_tokens(tokens)
    TokenParser->>EarleyParser: validate_tokens_raw(tokens)
    EarleyParser->>Lexer: Check if tokens match current lexer state
    Lexer-->>TokenParser: Return number of valid tokens
    TokenParser->>TokenParser: rollback(n_tokens)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    TokenParser->>TokenParser: llm_tokens.truncate(new_len)
    TokenParser->>TokenParser: llm_bytes.truncate(new_len)

    Note over User,Matcher: Phase 8: Response Generation
    Pipeline->>API: Return completion with tokens
    API->>User: Stream or return final response
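The bit-mask layout noted in Phase 6 of the diagram (32 tokens per `u32` word, with `data[tok>>5] |= 1 << (tok&31)`) can be sketched as below. `BitMask` is a hypothetical type written for this example; it follows the arithmetic quoted in the diagram but is not the `SimpleVob` implementation.

```rust
/// Minimal token bit mask: one bit per token ID, 32 tokens per u32 word.
/// Hypothetical type for illustration; not the SimpleVob implementation.
pub struct BitMask {
    data: Vec<u32>,
}

impl BitMask {
    /// Allocate enough words to hold `vocab_size` bits, all cleared.
    pub fn new(vocab_size: usize) -> Self {
        Self {
            data: vec![0; (vocab_size + 31) / 32],
        }
    }

    /// Mark a token as allowed. Panics if `tok` is outside the allocated range.
    pub fn allow_token(&mut self, tok: u32) {
        // Word index is tok / 32, bit index is tok % 32.
        self.data[(tok >> 5) as usize] |= 1u32 << (tok & 31);
    }

    /// Check whether a token is allowed; out-of-range tokens are never allowed.
    pub fn is_allowed(&self, tok: u32) -> bool {
        self.data
            .get((tok >> 5) as usize)
            .map_or(false, |word| (word >> (tok & 31)) & 1 == 1)
    }
}
```

Packing the mask this way keeps it at one bit per vocabulary entry, so checking a sampled token is a single shift and mask.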