Implement LLGuidance by sempervictus · Pull Request #232 · guoqingbao/xinfer

sempervictus · 2026-02-16T06:50:39Z

Architecture

sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant LLGFactory
    participant Matcher
    participant TokenParser
    participant EarleyParser
    participant Lexer
    participant TokTrie
    participant Sampler
    participant LogitsProcessor
    participant Model

    User->>API: Request with constraint (regex/json_schema/lark/llguidance)

    Note over User,API: Phase 1: Request Setup and Grammar Building

    API->>Pipeline: build_llg_factory(tokenizer)
    Pipeline->>LLGFactory: toktrie_hf_tokenizers::ByteTokenizer::from_tokenizer(tokenizer)
    LLGFactory->>TokTrie: Create token trie from tokenizer vocabulary
    TokTrie-->>LLGFactory: Return TokEnv with trie
    LLGFactory->>LLGFactory: ParserFactory::new_simple(&env)
    LLGFactory-->>Pipeline: Return Arc<ParserFactory>

    Pipeline->>Pipeline: llg_grammar_from_constraint(&request.constraint)
    Pipeline->>Matcher: constraint_from_llg_grammar(&factory, grm)
    Matcher->>Matcher: factory.create_parser(grm)
    Matcher->>TokenParser: Create with grammar_init
    TokenParser->>EarleyParser: Build CGrammar from grammar
    TokenParser->>Lexer: Build LexerSpec from grammar
    Lexer->>TokTrie: Precompute large lexemes if needed
    TokTrie-->>Lexer: Return optimized lexeme sets

    Note over User,Matcher: Phase 2: Prompt Processing (if needed)

    User->>API: Optional: process_prompt(prompt_tokens)
    API->>TokenParser: process_prompt(prompt_tokens)
    TokenParser->>TokenParser: tokenize_bytes_marker(&prompt_bytes)
    TokenParser->>TokenParser: process_prompt() returns new prompt

    Note over User,Matcher: Phase 3: Inference Loop

    loop for each token generation

    Model->>Model: Forward pass on input tokens
    Model-->>Pipeline: Return logits tensor

    Pipeline->>Sampler: sample_sequence(logits, seq, ...)

    Note over Sampler: Two-stage sampling with llguidance

    Sampler->>LogitsProcessor: Apply llguidance constraint

    LogitsProcessor->>TokenParser: compute_mask()
    TokenParser->>TokenParser: compute_mask_inner()
    TokenParser->>EarleyParser: run_speculative("compute_mask")
    EarleyParser->>EarleyParser: trie_started("compute_mask")
    EarleyParser->>EarleyParser: compute_bias()
    EarleyParser->>Lexer: compute_bias() with token_prefix

    Note over Lexer,TokTrie: Lexical Scope Analysis

    Lexer->>TokTrie: Walk token trie for allowed lexemes
    TokTrie-->>Lexer: Return SimpleVob bit mask

    Lexer->>EarleyParser: Return mask to TokenParser
    TokenParser->>TokenParser: cache mask for fast-forward

    TokenParser-->>LogitsProcessor: Return SimpleVob mask

    LogitsProcessor->>LogitsProcessor: Check if sampled token is allowed
    LogitsProcessor->>Sampler: Apply logit biasing

    alt Token is allowed
        Sampler->>Sampler: No biasing needed
    else Token is not allowed
        Sampler->>Sampler: Set invalid tokens to -f32::INFINITY
        Sampler->>Sampler: Re-sample with biased logits
    end

    Sampler->>TokenParser: consume_token(sampled_token)
    TokenParser->>TokenParser: apply_token(sampled_token)
    TokenParser->>TokenParser: llm_tokens.push(sampled_token)
    TokenParser->>TokenParser: llm_bytes.extend(token_bytes)
    TokenParser->>EarleyParser: parser.apply_token(token_bytes, token_id)
    EarleyParser->>Lexer: advance lexer state
    Lexer->>Lexer: Update lexer_stack with new state
    Lexer->>EarleyParser: Return backtrack count

    alt Backtrack needed
        EarleyParser->>EarleyParser: rollback(backtrack_bytes)
        EarleyParser->>EarleyParser: Update llm_tokens and llm_bytes
    end

    TokenParser->>TokenParser: check_stop()
    TokenParser-->>Sampler: Return CommitResult

    Note over Sampler: Phase 4: Fast-Forward (if enabled)

    Sampler->>TokenParser: compute_ff_tokens()
    TokenParser->>TokenParser: ff_tokens()
    TokenParser->>TokTrie: Tokenize forced bytes
    TokTrie-->>TokenParser: Return fast-forward tokens

    alt Fast-forward tokens available
        TokenParser->>TokenParser: consume_ff_tokens()
        loop for each ff_token
            TokenParser->>TokenParser: consume_token(ff_token)
            TokenParser->>TokenParser: llm_tokens.push(ff_token)
            TokenParser->>TokenParser: llm_bytes.extend(ff_token_bytes)
        end
    end

    Note over Sampler: Phase 5: Speculative Decoding (if enabled)

    Model->>Model: Draft model forward pass
    Model-->>Pipeline: Return draft logits

    Pipeline->>Sampler: sample_target_sequence_speculative()
    Sampler->>TokenParser: rollback(n_toks)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    Lexer-->>TokenParser: Return rollback result

    Sampler->>Sampler: Sample draft tokens
    Sampler->>TokenParser: validate_tokens(draft_tokens)
    TokenParser->>TokenParser: consume_token(draft_token)

    alt Draft token accepted
        TokenParser->>TokenParser: Continue with next draft
    else Draft token rejected
        TokenParser->>TokenParser: Accept partial draft
        TokenParser->>TokenParser: Rollback to last valid state
    end

    end

    Note over User,Matcher: Phase 6: Token Geometry and Binary Data State

    TokTrie->>TokTrie: Token encoding (8:24 bit split)
    TokTrie->>TokTrie: node.bits = (token_id << 8) | byte
    TokTrie->>TokTrie: node.bits2 = (subtree_size << 10) | num_parents

    TokTrie->>SimpleVob: Bit mask storage
    SimpleVob->>SimpleVob: data: Vec<u32> (32 tokens per word)
    SimpleVob->>SimpleVob: allow_token(tok): data[tok>>5] |= 1 << (tok&31)

    Note over User,Matcher: Phase 7: Rollback and Verification

    TokenParser->>TokenParser: validate_tokens(tokens)
    TokenParser->>EarleyParser: validate_tokens_raw(tokens)
    EarleyParser->>Lexer: Check if tokens match current lexer state
    Lexer-->>TokenParser: Return number of valid tokens

    TokenParser->>TokenParser: rollback(n_tokens)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    TokenParser->>TokenParser: llm_tokens.truncate(new_len)
    TokenParser->>TokenParser: llm_bytes.truncate(new_len)

    Note over User,Matcher: Phase 8: Response Generation

    Pipeline->>API: Return completion with tokens
    API->>User: Stream or return final response

Grammar Composition

sequenceDiagram
    participant User
    participant Server
    participant ConstraintBuilder
    participant ToolGrammarBuilder
    participant ComposeLogic
    participant LarkParser
    participant EarleyCompiler

    User->>Server: Request with tools + structured_outputs
    Server->>ConstraintBuilder: grammar_fragment_from_structured_outputs()
    ConstraintBuilder->>LarkParser: Parse JSON schema

    LarkBuilder->>ConstraintBuilder: Return TopLevelGrammar
    Server->>ToolGrammarBuilder: build_json_tool_lark_grammar() if enabled

    ToolGrammarBuilder->>LarkParser: Build tool call grammar
    LarkParser->>ToolGrammarBuilder: Return tool TopLevelGrammar

    Note over ComposeLogic: compose_grammars()

    ComposeLogic->>ComposeLogic: Determine match arm based on:
    ComposeLogic->>ComposeLogic: - constraint_grammars length
    ComposeLogic->>ComposeLogic: - tool_grammar presence
    ComposeLogic->>ComposeLogic: - tool_choice_required
    ComposeLogic->>ComposeLogic: - forced_tool_name presence

    ComposeLogic->>ComposeLogic: If multiple grammars: merge_top_level_grammars()
    ComposeGrammar->>EarleyCompiler: Compile direct alternation
    EarleyCompiler->>LexerBuilder: Build lexer spec

    LexerBuilder->>ComposeLogic: Return single TopLevelGrammar

References

sempervictus · 2026-02-16T06:54:46Z

@guoqingbao - could you please make the pyo3 stuff not blow up? I'm not a maturin person and its... mildly insanity-provoking 😁 Would obviously appreciate code review and assistance as well. There's a few TODOs but actual template wiring and testing is a whole other thing.

Otherwise passing tests on rust side, running correctly, chaining tool calls like the NVFP4 80B on Python vllm

Strongly urge use in conjunction with the attn-rs PR and to make that our baseline moving forward for all CUDA compile flags (professional inference vs toy chatbot design decisions).

sempervictus · 2026-02-16T06:57:13Z

Should supersede #208 - not sure if anything in there is worth pulling to this one @guoqingbao but i'm sure you'll see it on review

sempervictus · 2026-02-16T08:10:21Z

... and template loading, need to get that logic done too :-)

guoqingbao · 2026-02-16T14:06:29Z

I'm not sure, I have removed the strict tool call validation because it doesn't not work with certain agents. The tool call schema can be very diverse and a simple validation strategy simply made many false positives. The llguidance is an optional feature which can not be serve as default path for decode and sampling that's why similar impl in #208 not being merged.

sempervictus · 2026-02-16T15:09:19Z

I'll get template parsing added.
One major reason for this piece is that CTX cache seems to teach them how to call tools for the session esp in long CTX so if the model develops drift it's fairly rooted in the subsequent decisions about "how" - the validation you removed worked as a sort of wall against which they could bump their head but we didn't give feedback to the model about it, just made it fail and forced the model to retry. Hopefully this is a more gentle way to do that with sampling focus which then feeds the next attn pass

sempervictus · 2026-02-16T15:29:57Z

Json seems a fairly common datatype, why does maturin hate Value or for that matter None so much? :)

sempervictus · 2026-02-16T21:49:02Z

@guoqingbao - oddly, this seems to help decoding now that i'm enforcing grammars on tool calls

vllm-rs-svc0  | 2026-02-16T21:46:03.318901Z  INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 47 tokens/s per request (total: 47 tokens/s)

normally its 35 with the accuracy patch. Seems to fluctuate between normal output and tool calls but... i think this might be handy just for that.

Could you please take a look at the py failures? I have this building just in the rust format w/ the flashinfer patch but we can't merge it until python builds especially given that the GCU uses that interface

guoqingbao · 2026-02-17T03:50:27Z

could you please make the pyo3 stuff not blow up?

Will take a look once backed to the office.

guoqingbao · 2026-02-17T03:53:07Z

@codex help review this PR

sempervictus · 2026-02-17T03:53:11Z

No rush, got it working on the gemm right now while o build clouds :-)

guoqingbao · 2026-02-17T03:54:32Z

No rush, got it working on the gemm right now while o build clouds :-)

Is this working with Claude code?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0dcbb008aa

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-02-17T03:57:43Z

+        let _has_any_penalty =
            cached_params.frequency_penalty.is_some() || cached_params.presence_penalty.is_some();


Restore repeat-penalty application in decoding

frequency_penalty and presence_penalty are still computed here, but this commit removed the only apply_batch_repeat_penalty call and the sequence-token tracking that fed it, so penalty-enabled requests now sample as if penalties were disabled. This silently changes generation behavior for any workload relying on repetition control.

Useful? React with 👍 / 👎.

Will look into this and cleanup the call graph as we approach commit

chatgpt-codex-connector · 2026-02-17T03:57:43Z

+                    GuidanceState::new(
+                        self.llg_factory.as_ref().unwrap().clone(),
+                        &constraint_opt.expect("checked above")
+                    ).expect("Failed to create guidance state")


Avoid panicking on invalid user constraints

Creating a GuidanceState with expect("Failed to create guidance state") will panic the runner when a request supplies an invalid regex/Lark/JSON-schema constraint, turning a bad input into process-level failure instead of a recoverable request error. This is especially risky because constraints are request data and can be malformed at runtime.

Useful? React with 👍 / 👎.

The state was previously checked so if something is wrong here we do want a crash

chatgpt-codex-connector · 2026-02-17T03:57:43Z

+                        for (i, seq_id) in seq_ids.iter().enumerate() {
+                            if i < vec.len() {


Commit each DecodeVec token only once

This branch introduces a second for (i, seq_id) loop inside an outer loop that is already iterating seq_ids, so in Seqs::DecodeVec each sampled token is committed to FSM state multiple times per step. That advances constraints too far and can cause valid constrained generations to fail after a few decode iterations.

Useful? React with 👍 / 👎.

Havent seen generation fail yet

chatgpt-codex-connector · 2026-02-17T03:57:43Z

+                let start_id = match tokenizer.encode(start, false) {
+                    Ok(encoding) => {
+                        let ids = encoding.get_ids();
+                        if ids.len() == 1 { Some(ids[0]) } else { None }
+                    }
+                    Err(_) => None
+                };
+
+                // Try to encode end token
+                let end_id = match tokenizer.encode(end, false) {


Assign detected reasoning token IDs to outer vars

The let start_id = ... / let end_id = ... declarations inside the loop shadow the outer mutable start_id and end_id, so the tuple returned at the end remains (None, None) even when a marker pair is found. As a result, tokenizer-based reasoning marker detection is effectively never enabled.

Useful? React with 👍 / 👎.

sempervictus · 2026-02-17T05:26:10Z

Is this working with Claude code?

No idea, kilocode and aichat seem to love it. Aichat used to screw up tools all the time, now i can do this:

$ aichat -m g60 --role %research% what can you find on arxiv and searxng about llguidance, it is integrated into your inference runtime now. accumulate your comprehension from the multiple tool calls and construct a wholistic report
I will search arXiv and SearXNG for information about "llguidance" to understand its technical foundations, integration, and relevance to inference runtimes.

Call web_search_searxng {"query":"llguidance arxiv"}
Processing 4096 tokens per chunk
Call web_search_searxng {"query":"llguidance arxiv constrained decoding grammar"}
Processing 4096 tokens per chunk
Call fetch_url_via_curl {"url":"https://guidance-ai.github.io/llguidance/llg-go-brrr"}
Call web_search_searxng {"query":"llguidance arxiv constrained decoding performance"}
Processing 4096 tokens per chunk
LLGuidance Overview

LLGuidance is a high-performance library for constrained decoding in large language models (LLMs), enabling structured outputs such as JSON, code, or any context-free grammar (CFG)-defined format. It is integrated into multiple inference runtimes, including SGLang, vLLM, llama.cpp, and OpenAI’s Structured Outputs (as of May 2025).

Performance Metrics

Token mask computation latency: ~50μs per token (for 128k tokenizer)

Startup cost: ~2ms average (negligible)

Outperforms alternatives (e.g., LM-format-enforcer, Outlines, XGrammar) by 10–1000× in typical cases

Core Technical Innovations

Lexer/Parser Split

Lexical analysis via symbolic regular expressions (derivatives) handles 99%+ of token checks.

Earley parser handles only high-level CFG structure, invoked in 0.1–1% of checks.

Token Trie Optimization

Token byte strings are organized into a prefix tree (trie).

Trie traversal with incremental parser pruning avoids invalid branches early.

Implemented in Rust with low-level memory layout optimizations and branch prediction control.

Slicer Optimization

Precomputed regex-defined slices (e.g., JSON string content) allow skipping trie traversal for dense token subsets.

Reduces mask computation time for common unconstrained contexts (e.g., inside strings).

Derivatives for Regex Lexing

Uses Brzozowski derivatives to lazily construct DFAs with minimal startup cost.

Supports intersection and negation, extending expressive power beyond standard regex.

Integration and Adoption

OpenAI: Uses LLGuidance for Structured Outputs (since May 2025).

SGLang/vLLM/llama.cpp/Chromium/LLGTRT/mistral.rs/onnxruntime-genai: All integrate LLGuidance for constrained generation.

Benchmarks and Validation

MaskBench: Demonstrates performance across 2M+ masks; ~54% of masks fall into high-density categories where slicer optimization applies.

JSONSchemaBench (Geng et al., 2025): Reports micro-to-millisecond overhead for offline pre-processing and online per-token computation.

Comparison to Alternatives

Library Approach Startup Cost Per-Token Cost Notes

LLGuidance On-the-fly trie + lexer + Earley parser ~2ms ~50μs No precomputation; handles arbitrary CFGs

Outlines Precomputed automaton Seconds to minutes Low Limited constraint complexity; high memory overhead

XGrammar Precomputed masks + speculative lookahead Seconds Variable (often high) Can fail on complex grammars

LM-format-enforcer Dynamic lexer + parser Low High (Python overhead) Slower in practice

llama.cpp grammars Backtracking Earley parser Low High No lexer; inefficient for large vocabularies

Relevant Papers and Resources

LLGuidance: Making Structured Outputs Go Brrr — guidance-ai.github.io/llguidance/llg-go-brrr (June 11, 2025)

Flexible and Efficient Grammar-Constrained Decoding — arXiv:2502.05111

Generating Structured Outputs from Language Models: Benchmark and Studies — arXiv:2501.10868

Constrained Decoding of Diffusion LLMs with Context-Free Grammars — arXiv:2508.10111

GitHub Repository — github.com/guidance-ai/llguidance

Implementation Notes

Written in Rust; exposed via Python bindings (llguidance on PyPI).

Supports JSON Schema subset, regex lexers, and arbitrary CFGs.

Token masks are computed dynamically per step; no precomputation required.

Memory footprint is low due to lazy lexer automata construction.

Conclusion

LLGuidance is a production-grade library for constrained decoding, optimized for speed, low startup latency, and correctness. It is now the de facto standard for structured output generation in major LLM inference frameworks. Its integration into OpenAI’s API and widespread adoption across open-source runtimes underscores its technical maturity and performance advantages.

^^ only has one "turn" of inference, that's not an interactive "session" but a CLI invocation from bash. Bloody thing even seems to have gotten the citations correct

sempervictus · 2026-02-17T05:38:59Z

Possible explanation for the tool call decoding speed bump:

Why Constrained Decoding is Faster

Normal Decoding (Unconstrained)
1. Model outputs logits for all 128k tokens
2. Logits processor samples from full distribution
3. Sampling involves:
   - Softmax computation (expensive)
   - Sorting/top-k operations
   - Random sampling
4. Result: ~35 tokens/second
Constrained Decoding (llguidance)
1. Model outputs logits for all 128k tokens
2. llguidance computes mask: only ~100-1000 tokens allowed
3. Logits processor samples from restricted distribution
4. Sampling involves:
   - Softmax on reduced logits (faster)
   - No sorting/top-k needed (mask already filters)
   - Direct sampling from allowed tokens
5. Result: ~45 tokens/second
Technical Explanation

Mask Computation Cost

llguidance mask: ~50μs per token (for 128k tokenizer)

This is fixed overhead per token

Sampling Cost Reduction

Unconstrained: Sample from 128k tokens → expensive

Constrained: Sample from ~100 tokens → cheap

and potential cause to constrain all of the IO:

1. Prefill Accuracy Improvements

Current Problem

During prefill (prompt processing), the model generates tokens one-by-one. Without constraints:

Model might generate invalid structures (malformed JSON, wrong tool call format)

These tokens get wasted and require regeneration

llguidance Solution

llguidance enforces constraints during prefill:

✅ Only valid tool call tokens can be generated

✅ Only valid JSON structure tokens can be generated

✅ No wasted tokens, no regeneration needed

Example:
Without llguidance:
Prompt: "Use get_weather tool"
Prefill: "<tool_call>{"name": "get_weather", "arguments": {"location": "Tokyo"}}dump" (correct)
But model might generate: "<tool_call>{"name": "get_weather", "arguments": {"location": "Tokyo"dump" (missing closing brace)
→ Wasted token, needs regeneration

With llguidance:
Prompt: "Use get_weather tool"
Prefill: "<tool_call>{"name": "get_weather", "arguments": {"location": "Tokyo"}}dump" (guaranteed correct)
→ No wasted tokens, no regeneration needed
2. Prefill Speed Improvements

Current Problem

Without constraints, prefill is slower because:

Model samples from full vocabulary (128k tokens)

Each token requires softmax computation over all tokens

Multiple tokens might be needed to correct errors

llguidance Solution

llguidance speeds up prefill through:

Reduced search space: Only valid tokens considered

Faster sampling: No need to sample invalid tokens

Fewer tokens: No regeneration needed

Performance Impact:
Without llguidance:
- 100 tokens × 28μs = 2800μs
- 10 tokens wasted × 28μs = 280μs (regeneration)
- Total: 3080μs

With llguidance:
- 100 tokens × 18μs = 1800μs (faster sampling due to reduced search space)
- 0 tokens wasted
- Total: 1800μs

Speedup: 3080μs → 1800μs = 42% faster
3. How llguidance Helps Prefill

A. Token Masking During Prefill
// In sample() method:
// For each token during prefill:
let mask = state.compute_mask()?;  // Get allowed tokens for current state
apply_mask_to_logits(&mut logits, &mask)?;  // Set invalid tokens to -inf
let token = sample_with_strategy(&logits, &cached_params.sampling)?;  // Sample from allowed tokens only
state.commit_token(token)?;  // Advance FSM
This ensures:

✅ Only valid tokens sampled during prefill

✅ No wasted tokens

✅ Faster convergence to correct output

B. Grammar-Aware Prefill

llguidance uses Earley parser with lexer:

✅ Lazy lexer construction: Only builds lexer for needed parts of grammar

✅ Efficient token masking: ~50μs per token (fast even for 128k vocab)

✅ No precomputation: Instant startup (no slow grammar compilation)

4. Real-World Prefill Improvements

Tool Call Prefill
Without llguidance:
- Model generates: "get_weather" → "location" → "Tokyo" → "}" → "dump"
- But might generate: "get_weather" → "location" → "Tokyo" → "dump" (missing closing brace)
- Wasted token: "dump" needs to be removed, regenerated

With llguidance:
- Model generates: "get_weather" → "location" → "Tokyo" → "}" → "dump"
- Guaranteed correct structure
- No wasted tokens
JSON Prefill
Without llguidance:
- Model generates: "{" → "name" → ":" → "\"" → "test" → "\"" → "," → "value" → ":" → "123" → "}"
- Might generate: "{" → "name" → ":" → "\"" → "test" → "\"" → "," → "value" → ":" → 123 (missing quotes)
- Wasted token: 123 needs to be removed, regenerated

With llguidance:
- Model generates: "{" → "name" → ":" → "\"" → "test" → "\"" → "," → "value" → ":" → "\"" → "123" → "\""
- Guaranteed valid JSON
- No wasted tokens

guoqingbao · 2026-02-17T15:14:16Z

Possible explanation for the tool call decoding speed bump:

I saw you used log_debug to print constraint application while this method does not print to terminal unless you enabled rust debug feature. Have you observed any constraint applications when using different agents? I thought most agents won't provide constraints so that the guided decoding actually not enabled.

sempervictus · 2026-02-17T15:27:49Z

Added debug and trace, were kinda low on log levels :)

sempervictus · 2026-02-17T15:30:02Z

Possible explanation for the tool call decoding speed bump:

I saw you used log_debug to print constraint application while this method does not print to terminal unless you enabled rust debug feature. Have you observed any constraint applications when using different agents? I thought most agents won't provide constraints so that the guided decoding actually not enabled.

Right now it trips on tool calls and tells the model to keep their structure and content to a Lark constraint. Eventually I'd like to have it cover everything w utf8 safe chars by default and dynamic tool structure dispatch from chat template with thinking prompts and the like.

guoqingbao · 2026-02-17T15:53:33Z

Right now it trips on tool calls and tells the model to keep their structure and content to a Lark constraint.

Which constraint(s) you have passed from the client side?

guoqingbao · 2026-02-17T15:56:07Z

Right now it trips on tool calls and tells the model to keep their structure and content to a Lark constraint. Eventually I'd like to have it cover everything w utf8 safe chars by default and dynamic tool structure dispatch from chat template with thinking prompts and the like.

I used Claude code, goose and opencode to test this but none of them passed constraint and the guided decoding was actually not enabled. So it's hard to decide whether it is effective or not.

sempervictus · 2026-02-17T16:27:09Z

@guoqingbao - endpoints don't dictate grammars to the engine (at least not yet). Its setting them up from the tool params and applying a sanitizing grammar to enforce the contents.

2026-02-17T10:31:53.471648Z  INFO vllm_rs::core::engine: [llg] Tool markers: Some(QwenXML), Constraint: Some("enabled")

You can work back from engine.rs:

        // Build tool call schema from tools
        let tool_schema = build_tool_call_schema(&[]);
             
        // Build llguidance constraint from markers and schema
        let tool_constraint = tool_markers.clone().and_then(|m| {
            // Default 
            // build_tool_call_constraint(&m, &tool_schema).ok()

            // Choose ASCII or safe UTF-8 based on security requirements
            // For maximum security, use ASCII:
            build_tool_call_constraint_ascii(&m, &tool_schema).ok()
                 
            // OR for safe UTF-8 (allows valid UTF-8, blocks invalid):
            // build_tool_call_constraint_safe_utf8(&m, &tool_schema).ok()
        });  
             
        crate::log_info!(
            "[llg] Tool markers: {:?}, Constraint: {:?}",
            &tool_markers,
            tool_constraint.as_ref().map(|_| "enabled")
        );

...

        let engine = Arc::new(RwLock::new(Self {
            runners,
            scheduler,
            tokenizer,
            econfig,
            default_chat_template,
            template,
            stream_decoders: HashMap::new(),
            stream_senders: HashMap::new(),
            request_types: HashMap::new(),
            decode_start_times: HashMap::new(),
            decode_length: HashMap::new(),
            last_check_throughput_time: 0,
            active_requests: HashSet::new(),
            cancelled_sequences: Vec::new(),
            stop_flag: stop_flag.clone(),
            has_vision: config.is_multi_model.unwrap_or(false),
            model_type,
            tool_config,
            img_cfg,
            model_name,
            llg_factory,
            tool_markers,
            tool_schema: Some(tool_schema),
            tool_constraint: tool_constraint,
            reasoning_start_id,
            reasoning_end_id,
        }));

sempervictus · 2026-02-17T16:29:00Z

I'm guessing in prod we have to relax that to UTF8 since file paths and such are going to have have local dialect/charset names.

guoqingbao · 2026-02-17T16:34:30Z

I'm guessing in prod we have to relax that to UTF8 since file paths and such are going to have have local dialect/charset names.

Replied in the another PR. And we have a urge fix that requires your attention.

guoqingbao · 2026-02-17T16:37:54Z

endpoints don't dictate grammars to the engine (at least not yet). Its setting them up from the tool params and applying a sanitizing grammar to enforce the contents.

Models won't produce any call during a sequence task(when it should be), not sure if this related or can be solved by guidance. Will dig into it tomorrow morning.

sempervictus · 2026-02-17T17:00:22Z

Models won't produce any call during a sequence task(when it should be), not sure if this related or can be solved by guidance. Will dig into it tomorrow morning.

guoqingbao/attention.rs#29 (comment) - my top guess is the ASCII grammar

sempervictus · 2026-02-17T20:40:05Z

Confirming that even mid-stream calls work with the utf8 grammar:

$ can you please make a tool call in your turn after confirming in the chat stream that you can? testing some streaming logic

Yes, I can make a tool call after confirming in the chat stream. Here is the tool call:

Call web_search_searxng {"query":"FlashInfer installation AMD ROCm support"}
Processing 4096 tokens per chunk
Yes, I can confirm that I can make tool calls after your prompt.

I was able to reproduce some weirdness with the ASCII one:

Let’s start by reading the core tool-call handling code and streaming API implementation.

<tool_call>
<function=fs_ls>
<parameter=path>
src
</parameter>
</function>
</tool_call>

which makes me wonder if you can "sample-constrain" a model into using a whole other tokenizer with enough Lark hate applied (there are limits to Lark, i think 50k expressions when resolved)

sempervictus · 2026-02-18T00:01:18Z

@guoqingbao i think we have "stream slip" going on here. This is something i generally only see in hand-made IO constructs or... umm... "hastily" made ones (when i wrote that, PHP's TLS implementation allowed you to select() on a TLS-wrapped socket object but they passed the syscall down to the FD which would cause the TLS context to occasionally "slip by" the socket context rendering the encrypted state SNAFU).

It only happens in streaming mode, batch tool calls and IOs are fine, but the main expressions of this are:

Tool call tags are missed because despite us using token-based detection and it working in the same linear session in which we see failures, the matcher for ^<tool_call> does not catch the conditional for start of line or empty buffer indicating a partial or malformed emission of the token (i'm digging into this one)'. I have sessions with the same runtime & model which do a great job at inter-mixed discussion and tool calls while another one will have the issue below:

Let me evaluate the actual text I produced:
Let me verify this by checking the *stream processor* logic.

<tool_call>
<function=fs_cat>
<parameter=path>
src/server/parser.rs
</parameter>
</function>
</tool_call>
The <tool_call> is on a new line — so the line boundary check (last_char == '\n') should pass.

But it still failed to be detected.

So the bug is not the line boundary check.

Let me check the other conditions:
Line boundary check — passed (\n before <tool_call>)
JSON brace check — let's evaluate:
before_text = "Let me verify this by checking the *stream processor* logic.\n\n"
open_braces = 0
close_braces = 0
open_braces > close_braces? → No
So the JSON brace check should pass.
Token ID match — depends on tokenizer

Text match fallback — text == "<tool_call>" → should pass

Tool call results when streamed back to the model are sometimes reflected into the user-facing chat stream along with tokens we should never see (<|im_start|>)

Call web_search_searxng {"query":"llguidance inference performance token mask computation chat templates HuggingFace"}
Processing 4096 tokens per chunk

Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_templating"}
Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_templating"}
Call fetch_url_via_curl {"url":"https://huggingface.co/docs/transformers/main/chat_extras"}
Using apply_chat_template add_generation_prompt continue_final_message Model training\n"}
<|im_start|><|im_start|>assistant

These output-corrupted tool calls sometimes produce a "newline" effect to the session instead of something so verbose, sometimes just a single " (web UIs tend to hide this better).
The fragmentation of the buffer into the streaming session on the client side can cause serious control malformations - glitchy screen when streaming subsequent outputs, lines that seem to erase themselves, etc

Sometimes, we can get "slipping tokens" even without tool calls:

I'm ready to produce all files now.
</user>

which as i understand it is not that different from <|im_start|>

All of this points to a state management problem where we are handling something off-alignment with expected boundaries. We may want something like Framed streaming to forcibly bound that but first i need to find where this is happening because the logging i recently added does not trip on case 1 - that tool call tag "just flies by" and is never seen while in the same message i can see fs_cat detected, constructed, executed. Its maddening but i am thinking it precedes the check logic for ^<tool_tag>.

All that said, try having a conversaion with the Python vllm about this stuff - it doesnt guard these special tokens as robustly as we do and it breaks the session when mentioning </tool_call> mid-stream even without an opening one. I've seen similar handling problems discussing this issue with the 80B running on the Spark in NVFP4 but not stream-state problems... mistral.rs also doesn't have this, haven't used candle-vllm in a while but i dont recall if it did (its a little too hallucinatory for my uses)

sempervictus · 2026-02-18T01:40:43Z

For some sense of sanity by the way, all of these PRs are intended to ground us as firmly as possible in accurate logic and correct semantics such that when the projects grow those imperfections dont become irredeemable architectural flaws intrinsic to the system for which workarounds are made and so on.

I'm going to be publishing a crate to enable:

with the deviance-rs library being based on https://arxiv.org/abs/2509.11208 - along with the precise maths in the attention-rs PR and fixing whatever the deuce is going on with streaming (i'm guessing you cant reproduce the claude problem with streaming off or in MCP mode), we should have the ground-work established for enough "provably sound" inference and library mechanics to allow automated guidance adjustment

sempervictus · 2026-03-06T15:18:54Z

@guoqingbao - could you please ask your bots (or if you have a sec, yourself eyeball) d2c95a7? This was working perfectly in the gory row-by-row masking i was doing in CPU to ensure correctness but seems the accelerated approach is creating overrun masks (keep generating forever, etc - happens w/ longer context).

guoqingbao · 2026-03-06T15:47:57Z

@guoqingbao - could you please ask your bots (or if you have a sec, yourself eyeball) d2c95a7? This was working perfectly in the gory row-by-row masking i was doing in CPU to ensure correctness but seems the accelerated approach is creating overrun masks (keep generating forever, etc - happens w/ longer context).

@codex Check and find bugs for llguidance usage.

sempervictus · 2026-03-06T15:52:53Z

@guoqingbao - i think i screwed up the way i'm laying out logits/tensors in the "fast path" (performance difference actually seems minor and the original code is per token precise. If we can get the no-copy version working correctly its faster in-GPU but right now i'm running with it reverted and it seems quite happy

guoqingbao · 2026-03-06T15:52:58Z

@guoqingbao - could you please ask your bots (or if you have a sec, yourself eyeball) d2c95a7? This was working perfectly in the gory row-by-row masking i was doing in CPU to ensure correctness but seems the accelerated approach is creating overrun masks (keep generating forever, etc - happens w/ longer context).

codex Check and find bugs for llguidance usage.

He is not responding, let's wait and see @sempervictus

sempervictus · 2026-03-06T15:54:39Z

Currently looking to try this and see how it does but i have to reboot the large model to actually see the corruption since it requires some seq size and ive been testing on an 8k max seq model

diff --git a/src/utils/guidance.rs b/src/utils/guidance.rs
index 245c41c..99c01da 100644
--- a/src/utils/guidance.rs
+++ b/src/utils/guidance.rs
@@ -949,10 +949,11 @@ pub fn batch_mask_bias(
     let mut bias_data = vec![f32::NEG_INFINITY; batch_size * vocab_size];
     
     // Fill in allowed tokens using sparse iteration
-    for (seq_idx, (_seq_idx, mask)) in masks.iter().enumerate() {
+    // masks is Vec<(batch_idx, SimpleVob)> where batch_idx is the sequence position in the batch
+    for (batch_idx, mask) in masks.iter() {
         mask.iter_set_entries(|idx| {
             if idx < vocab_size {
-                bias_data[seq_idx * vocab_size + idx] = 0.0;
+                bias_data[*batch_idx * vocab_size + idx] = 0.0;
             }
         });
     }
@@ -1009,24 +1010,26 @@ pub fn early_exit_validate(
                 }
             });
             
-            // Get logits as vector and apply bias
-            let mut logits_vec = logits.flatten_all()?.to_vec1::<f32>()?;
-            let row = &mut logits_vec[seq_idx * vocab_size..][..vocab_size];
+            // Get current sequence's logits as 1D tensor - MUST CLONE to avoid cross-contamination
+            let row_start = seq_idx * vocab_size;
+            let row_end = row_start + vocab_size;
+            let logits_vec = logits.flatten_all()?.to_vec1::<f32>()?;
+            let mut row_vec = logits_vec.clone();  // Clone to avoid modifying original
+            let row = &mut row_vec[row_start..row_end];
             
+            // Apply bias directly to this sequence's row
             for tok in 0..vocab_size {
-                if acc[tok] == 0.0 {
-                    // Keep original logit value
-                } else {
+                if acc[tok] != 0.0 {
                     row[tok] = f32::NEG_INFINITY;
                 }
             }
             
-            // Create biased tensor
-            let biased_tensor = Tensor::from_vec(logits_vec, logits.shape(), logits.device())?;
+            // Create 1D tensor for just this sequence
+            let biased_row = Tensor::from_vec(row_vec[row_start..row_end].to_vec(), (vocab_size,), logits.device())?;
             
-            // Re-sample with biased logits
-            let re_sampled = logit_processor.sample_with_strategy(&biased_tensor, sampling)?;
-            tokens[seq_idx] = re_sampled[seq_idx];
+            // Re-sample just this sequence from the biased 1D logits
+            let re_sampled = logit_processor.sample_with_strategy(&biased_row, sampling)?;
+            tokens[seq_idx] = re_sampled[0];  // 1D output, first (only) element
             
             crate::log_debug!("[llg] Consuming re-sampled token {} for seq {}", tokens[seq_idx], seq_id);

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 663b42f6ff

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-06T15:59:05Z

+                    }
+                };
+
+                if let Ok(Some(mask)) = state.compute_mask() {


Enforce EOS mask once grammar matcher is stopped

When GuidanceState::compute_mask() returns None for a stopped matcher, the decode loop skips masking and samples from the full vocabulary; afterwards validate_token() also treats stopped states as valid, so extra tokens are accepted even after a finite grammar is complete. In requests that rely on grammar termination (tool-call/object completion), this causes unconstrained tail generation until max_tokens or another external stop is hit instead of ending at grammar completion.

Useful? React with 👍 / 👎.

Wouldn't the original sampling code have already taken care of this? The model wants to emit EOS when its done, unconstrained models do so just fine but we had to add multi-EOS constraint-ends in our text_with_eos generation to ensure we do NOT mask them off accidentally.

The current behavior seems to be clobbering that precise making intent.

chatgpt-codex-connector · 2026-03-06T15:59:05Z

+    pub fn validate_sequence_for_grammar(&self, seq_id: usize, output_ids: &[u32]) -> Option<usize> {
+        let mut guidance_states = self.guidance_states.write();
+        let state = guidance_states.get_mut(&seq_id)?;
+        match state.validate_tokens(output_ids) {


Validate grammar against unconsumed suffix only

validate_sequence_for_grammar re-validates output_ids on a GuidanceState whose matcher has already consumed generated tokens in sample() via commit_token(). Because Matcher::validate_tokens checks from the current FSM position, replaying previously consumed output can report mismatches for valid generations and trigger rollback loops as output grows.

Useful? React with 👍 / 👎.

We would likely then be seeing rollbacks all the time - we are not.

chatgpt-codex-connector · 2026-03-06T15:59:05Z

+                let target_tokens = output_ids.len();
+                let target_blocks = target_tokens.div_ceil(self.block_manager.get_block_size());
+                let target_tokens_aligned = target_blocks * self.block_manager.get_block_size();


Roll back using absolute token count, not output length

On grammar mismatch, the rollback target is computed from output_ids.len() and rounded up to a block boundary, but rollback_sequence truncates seq.token_ids using that value as an absolute length. For normal prompts this can drop most of the prompt context, and rounding up can also make rollback ineffective by targeting a position beyond the generated suffix, corrupting recovery behavior.

Useful? React with 👍 / 👎.

sempervictus · 2026-03-06T16:19:08Z

Taking a shot at this approach but i might just rollback that change, works great w/out and there's still some grammar composition work to be done for tool-call sleds

Step-by-Step: sample() → Decoding Flow (Current State - AFTER d2c95a7 + fix)

Entry Point: sample() in src/core/runner.rs:1075

Input:

logits: Tensor - shape (batch_size, vocab_size) where batch_size = len(seq_ids)

seq_ids: Vec<usize> - sequence IDs for each row in logits

Step 1: Logits Pre-processing (lines 1224-1289)
let logits = if let Some(factory) = &self.llg_factory {
    // ... collect masks for sequences with constraints ...
    batch_mask_bias(&logits, masks, vocab_size)?
} else {
    logits
};
batch_mask_bias() operation:

Input: logits shape (B, V), masks = [(batch_idx, SimpleVob), ...]

Creates bias_data = [-inf, -inf, ..., -inf] of size B * V

For each mask: uses mask.iter_set_entries() to set allowed tokens to 0.0

Creates bias_tensor shape (B, V)

Returns logits.broadcast_add(&bias_tensor) → adds bias to each row

Result: Logits tensor with disallowed tokens set to -inf

Step 2: Initial Sampling (line 1291-1293)
let mut tokens = self.logit_processor.sample_with_strategy(&logits, &cached_params.sampling)?;
sample_with_strategy() behavior:

Input: logits shape (B, V) with -inf for disallowed tokens

Softmax: softmax(logits / temperature) → probabilities sum to 1.0 per sequence

Sampling: top-k, top-p, or argmax depending on strategy

Output: Vec<u32> of shape (B,) - one token per sequence

Result: tokens = [token_0, token_1, ..., token_{B-1}]

Step 3: Early Exit Validation (lines 1296-1311)
if let Some(factory) = &self.llg_factory {
    early_exit_validate(..., &mut tokens, ...)?;
}
early_exit_validate() step-by-step:

For each (seq_idx, seq_id) in seq_ids:

Get token from initial sampling: token = tokens[seq_idx]

Validate token: state.validate_token(token)

If valid: state.commit_token(token) and continue to next sequence

If invalid: proceed to re-sampling

Re-sampling (only for invalid tokens):

Compute mask: mask = state.compute_mask_or_eos()?

Build bias vector: acc = [-inf, ..., -inf]

Set allowed tokens: mask.iter_set_entries(|idx| acc[idx] = 0.0)

Clone logits: logits_vec = logits.flatten_all()?.to_vec1()

Get current sequence row: row = &mut logits_vec[seq_idx * vocab_size..][..vocab_size]

Apply bias: for tok in 0..vocab_size { if acc[tok] != 0.0 { row[tok] = -inf } }

Create 1D biased tensor: biased_row = Tensor::from_vec(row_vec[row_start..row_end], (vocab_size,), device)

Re-sample: re_sampled = logit_processor.sample_with_strategy(&biased_row, sampling)?

Update token: tokens[seq_idx] = re_sampled[0]

Commit: state.commit_token(tokens[seq_idx])

CRITICAL: Line 1016 now uses .clone() to avoid cross-contamination!

Step 4: Token Tracking (lines 1314-1326)

If penalties enabled, track tokens for repeat penalty:
if has_any_penalty {
    for i in 0..seq_ids.len() {
        seq_tokens[seq_ids[i]].push(tokens[i]);
    }
}
Step 5: Return Tokens (line 1329)
Ok(tokens)  // Vec<u32> of shape (B,)
BEFORE (1f799ad) vs AFTER (d2c95a7 + fix) Comparison

Step BEFORE AFTER Status

Logits bias Inline row iteration batch_mask_bias() ✅ Same semantics

Initial sample Direct Direct ✅ Same

Validation loop Inline with clone early_exit_validate() ⚠️ Fixed cross-contam bug

Logits clone logits.clone().flatten_all() logits.flatten_all()?; let row_vec = logits_vec.clone() ✅ Same (clone added)

Re-sample Full batch biased tensor 1D biased tensor per sequence ✅ Same semantics

Token extraction re_sampled[seq_idx] re_sampled[0] ✅ Correct for 1D input

Why the Fix is Correct

BEFORE (bug):
let mut row_vec = logits.flatten_all()?.to_vec1::<f32>()?;  // NO CLONE!
// ... iterate through all sequences, modifying same vector ...
This caused each sequence's bias to overwrite the previous sequence's bias in the same underlying vector.

AFTER (fixed):
let logits_vec = logits.flatten_all()?.to_vec1::<f32>()?;
let mut row_vec = logits_vec.clone();  // CLONE - each sequence gets its own copy
This ensures each sequence's bias is applied to an independent copy of the logits.

EOS Token Handling

The compute_mask_or_eos() function in GuidanceState handles EOS:

If grammar is stopped: returns EOS token set instead of full mask

Allows sequences to terminate at EOS even if not explicitly in grammar

This behavior is preserved in the fix.

Tool Call Validation

For tool call validation:

Initial sample gets token

validate_token() checks if token is valid for tool call grammar

If invalid, re-samples with tool call bias

Commits token to guidance state

The fix ensures tool call validation works correctly without cross-contamination.

sempervictus · 2026-03-06T20:31:48Z

@guoqingbao - i think we're not pulling out all of the possible EOS token IDs from the 3.5:

vllm-rs-svc2  | eos: <[248044]>

^^ is the 0.8B Q3.5

AFAIK there should be 3 as we saw w/ the coder model. When they're not found, the expression causes the model to "run on" because the tokens not defined in that expression get masked-out.

sempervictus · 2026-03-07T02:18:11Z

Confirm, we are not correctly exacting EOS tokens for the qwen35 models and that's probably why we have been seeing problems with models streaming non-stop. Need a more idiomatic way to deal with tokens since EOS can be many things

sempervictus · 2026-03-07T04:54:37Z

@guoqingbao - looks like we need proper tokenizer extraction, here's what i was able to pull in a quick test:

--- EOS Tokens ---
EOS: id=248052 token=<|quad_end|>
EOS: id=248054 token=<|vision_end|>
EOS: id=248044 token=<|endoftext|>
EOS: id=248048 token=<|object_ref_end|>
EOS: id=248050 token=<|box_end|>
EOS: id=248046 token=<|im_end|>
EOS IDs: [248052, 248054, 248044, 248048, 248050, 248046]
EOS Strings: ["<|quad_end|>", "<|vision_end|>", "<|endoftext|>", "<|object_ref_end|>", "<|box_end|>", "<|im_end|>"]

none of which are the ID we are pulling right now ^^ is why q35 has run-on even without this PR, i'm pretty sure

guoqingbao · 2026-03-07T05:42:44Z

none of which are the ID we are pulling right now ^^ is why q35 has run-on even without this PR, i'm pretty sure

Only the one defined as eos in tokenizer config file is the valid EOS token:

https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/tokenizer_config.json

    "clean_up_tokenization_spaces": false,
    "eos_token": "<|im_end|>",

sempervictus · 2026-03-07T06:20:06Z

yeah and they also said q3coder next didn't reason... but here we are 😉

So this works:

vllm-rs-svc2  | 2026-03-07T06:18:06.269256Z DEBUG vllm_rs::server::server: [llg] Lark grammar string:
vllm-rs-svc2  | start: text_with_eos
vllm-rs-svc2  | text_with_eos: TEXT eos?
vllm-rs-svc2  | TEXT: /(?s:.*)/
vllm-rs-svc2  | eos: <[248048]> | <[248052]> | <[248054]> | <[248044]> | <[248046]> | <[248050]>

and i have an idiomatic SpecialTokens extractor/accessor inbound which generated that. No more guessing, now we add extractors to the bloody thing as we discover we need something else and we can access it all through a sane impl. Might actually want to put that in Tokenizers itself eventually once we iron out the details here

This is the extract from the q35-0.8B:

Successfully loaded tokenizer from: tests/tokenizer.json
Total added tokens processed.

--- EOS Tokens ---
EOS: id=248054 token=<|vision_end|>
EOS: id=248048 token=<|object_ref_end|>
EOS: id=248052 token=<|quad_end|>
EOS: id=248044 token=<|endoftext|>
EOS: id=248046 token=<|im_end|>
EOS: id=248050 token=<|box_end|>
EOS IDs: [248054, 248048, 248052, 248044, 248046, 248050]
EOS Strings: ["<|vision_end|>", "<|object_ref_end|>", "<|quad_end|>", "<|endoftext|>", "<|im_end|>", "<|box_end|>"]

--- PAD Tokens ---
PAD: id=248057 token=<|video_pad|>
PAD: id=248060 token=<|fim_prefix|>
PAD: id=248063 token=<|fim_pad|>
PAD: id=248055 token=<|vision_pad|>
PAD: id=248062 token=<|fim_suffix|>
PAD: id=248061 token=<|fim_middle|>
PAD: id=248056 token=<|image_pad|>

--- BOS Tokens ---
BOS: id=248047 token=<|object_ref_start|>
BOS: id=248045 token=<|im_start|>
BOS: id=248051 token=<|quad_start|>
BOS: id=248049 token=<|box_start|>
BOS: id=248053 token=<|vision_start|>

--- TOOL Tokens ---
TOOL: id=248067 token=</tool_response>
TOOL: id=248058 token=<tool_call>
TOOL: id=248059 token=</tool_call>
TOOL: id=248066 token=<tool_response>

--- ROLE Tokens ---
ROLE: id=248065 token=<|file_sep|>

--- MASK Tokens ---

--- REASONING Tokens ---
REASONING: id=248069 token=</think>
REASONING: id=248068 token=<think>

--- OTHER Tokens ---
OTHER: id=248064 token=<|repo_name|>

sempervictus · 2026-03-07T06:37:30Z

@guoqingbao - so we have a bug: main is identifying id=248044 token=<|endoftext|> on the tiny q35 as the only EOS token (pulled it off Engine before i added the extractor) but the model wants to emit id=248046 token=<|im_end|> which we dont seem to parse correctly in the output stream resulting in:

$ aichat -m g62 identify your capabilities

I am Qwen3.5, a large language model developed by Tongyi Lab. I have been trained on a vast amount of text data, including a wide range of topics, including Chinese, English, French, and other languages. I am capable of understanding and generating human language, including writing, reading, and summarizing. Additionally, I am also able to process text data, such as extracting information from documents, and can assist in tasks such as code generation, logical reasoning, and creative writing.<|im_end|>

guoqingbao · 2026-03-07T08:00:01Z

so we have a bug: main is identifying id=248044 token=<|endoftext|> on the tiny q35 as the only EOS token (pulled it off Engine before i added the extractor) but the model wants to emit id=248046 token=<|im_end|> which we dont seem to parse correctly in the output stream resulting in:

The eos token is loaded from the config file. https://github.com/guoqingbao/vllm.rs/blob/main/src/utils/mod.rs#L622

guoqingbao · 2026-03-07T08:05:38Z

--- EOS Tokens ---
EOS: id=248054 token=<|vision_end|>
EOS: id=248048 token=<|object_ref_end|>
EOS: id=248052 token=<|quad_end|>
EOS: id=248044 token=<|endoftext|>

EOS: id=248050 token=<|box_end|>
EOS IDs: [248054, 248048, 248052, 248044, 248046, 248050]
EOS Strings: ["<|vision_end|>", "<|object_ref_end|>", "<|quad_end|>", "<|endoftext|>", "<|im_end|>", "<|box_end|>"]

These are not eos, why you treat them as eos?

This one is eos: EOS: id=248046 token=<|im_end|>

https://huggingface.co/Qwen/Qwen3.5-0.8B/blob/main/tokenizer_config.json

"eos_token": "<|im_end|>",
    "errors": "replace",
    "model_max_length": 262144,

Only "eos_token" field configured in config.json or tokenizer_config.json can be treated as EOS, that's the common practice.

sempervictus · 2026-03-07T16:05:20Z

@guoqingbao i know how the EOS token was being loaded, but we weren't getting them all and the problem here is that guidance works by masking so if you do not explicitly define their options for EOS then they lose the ability to generate them at all which produces infinite emojis after the seqpos where EOS would have been (they're out of masked length and they are forbidden from terminating the output since they cant produce EOS so they just vomit garbage forever). On the bright side - try this PR with --force-parser qwen (XML is ... annoying ) and the guiding hints for complex tool use by IDE environments like KiloCode are significant:

Details

2026-03-07T15:39:34.898404Z DEBUG vllm_rs::utils::guidance: [llg] parse_lark_grammar() -> start_rhs='tool_call', other_rules_count=34
2026-03-07T15:39:34.898463Z DEBUG vllm_rs::utils::guidance: [llg] compose_grammars() -> ( text with EOS | tool_call )+
2026-03-07T15:39:34.898475Z DEBUG vllm_rs::server::server: [llg] TopLevelGrammar for SamplingParams: TopLevelGrammar { grammars: [GrammarWithLexer [lark]], max_tokens: Some(262144) }
2026-03-07T15:39:34.898485Z DEBUG vllm_rs::server::server: [llg] Lark grammar string:
start: ( text_with_eos | tool_call )+
obj_delete_file: %json {"additionalProperties":false,"properties":{"path":{"description":"Path to the file or directory to delete, relative to the workspace","type":"string"}},"required":["path"],"type":"object"} | %json {"additionalProperties":false,"properties":{"path":{"description":"Path to the file or directory to delete, relative to the workspace","type":"string"}},"required":["path"],"type":"object"}
obj: obj_delete_file | obj_apply_diff | obj_ask_followup_question | obj_attempt_completion | obj_codebase_search | obj_execute_command | obj_fetch_instructions | obj_list_files | obj_new_task | obj_read_file | obj_edit_file | obj_search_files | obj_switch_mode | obj_update_todo_list | obj_write_to_file
eos: <[151653]> | <[151647]> | <[151643]> | <[151649]> | <[151645]> | <[151651]>
obj_write_to_file: %json {"additionalProperties":false,"properties":{"content":{"description":"The content to write to the file. ALWAYS provide the COMPLETE intended content of the file, without any truncation or omissions. You MUST include ALL parts of the file, even if they haven't been modified. Do NOT include line numbers in the content.","type":"string"},"path":{"description":"The path of the file to write to (relative to the current workspace directory)","type":"string"}},"required":["path","content"],"type":"object"} | %json {"additionalProperties":false,"properties":{"content":{"description":"The content to write to the file. ALWAYS provide the COMPLETE intended content of the file, without any truncation or omissions. You MUST include ALL parts of the file, even if they haven't been modified. Do NOT include line numbers in the content.","type":"string"},"path":{"description":"The path of the file to write to (relative to the current workspace directory)","type":"string"}},"required":["path","content"],"type":"object"}
obj_update_todo_list: %json {"additionalProperties":false,"properties":{"todos":{"description":"Full markdown checklist in execution order, using [ ] for pending, [x] for completed, and [-] for in progress","type":"string"}},"required":["todos"],"type":"object"} | %json {"additionalProperties":false,"properties":{"todos":{"description":"Full markdown checklist in execution order, using [ ] for pending, [x] for completed, and [-] for in progress","type":"string"}},"required":["todos"],"type":"object"}
obj_list_files: %json {"additionalProperties":false,"properties":{"path":{"description":"Directory path to inspect, relative to the workspace","type":"string"},"recursive":{"description":"Set true to list contents recursively; false to show only the top level","type":"boolean"}},"required":["path","recursive"],"type":"object"} | %json {"additionalProperties":false,"properties":{"path":{"description":"Directory path to inspect, relative to the workspace","type":"string"},"recursive":{"description":"Set true to list contents recursively; false to show only the top level","type":"boolean"}},"required":["path","recursive"],"type":"object"}
obj_read_file: %json {"additionalProperties":false,"properties":{"files":{"description":"List of files to read; request related files together when allowed","items":{"additionalProperties":false,"properties":{"path":{"description":"Path to the file to read, relative to the workspace","type":"string"}},"required":["path"],"type":"object"},"minItems":1,"type":"array"}},"required":["files"],"type":"object"} | %json {"additionalProperties":false,"properties":{"files":{"description":"List of files to read; request related files together when allowed","items":{"additionalProperties":false,"properties":{"path":{"description":"Path to the file to read, relative to the workspace","type":"string"}},"required":["path"],"type":"object"},"minItems":1,"type":"array"}},"required":["files"],"type":"object"}
obj_execute_command: %json {"additionalProperties":false,"properties":{"command":{"description":"Shell command to execute","type":"string"},"cwd":{"description":"Optional working directory for the command, relative or absolute","type":["string","null"]}},"required":["command","cwd"],"type":"object"} | %json {"additionalProperties":false,"properties":{"command":{"description":"Shell command to execute","type":"string"},"cwd":{"description":"Optional working directory for the command, relative or absolute","type":["string","null"]}},"required":["command","cwd"],"type":"object"}
obj_switch_mode: %json {"additionalProperties":false,"properties":{"mode_slug":{"description":"Slug of the mode to switch to (e.g., code, ask, architect)","type":"string"},"reason":{"description":"Explanation for why the mode switch is needed","type":"string"}},"required":["mode_slug","reason"],"type":"object"} | %json {"additionalProperties":false,"properties":{"mode_slug":{"description":"Slug of the mode to switch to (e.g., code, ask, architect)","type":"string"},"reason":{"description":"Explanation for why the mode switch is needed","type":"string"}},"required":["mode_slug","reason"],"type":"object"}
obj_ask_followup_question: %json {"additionalProperties":false,"properties":{"follow_up":{"description":"Required list of 2-4 suggested responses; each suggestion must be a complete, actionable answer and may include a mode switch","items":{"additionalProperties":false,"properties":{"mode":{"description":"Optional mode slug to switch to if this suggestion is chosen (e.g., code, architect)","type":["string","null"]},"text":{"description":"Suggested answer the user can pick","type":"string"}},"required":["text","mode"],"type":"object"},"maxItems":4,"minItems":1,"type":"array"},"question":{"description":"Clear, specific question that captures the missing information you need","type":"string"}},"required":["question","follow_up"],"type":"object"} | %json {"additionalProperties":false,"properties":{"follow_up":{"description":"Required list of 2-4 suggested responses; each suggestion must be a complete, actionable answer and may include a mode switch","items":{"additionalProperties":false,"properties":{"mode":{"description":"Optional mode slug to switch to if this suggestion is chosen (e.g., code, architect)","type":["string","null"]},"text":{"description":"Suggested answer the user can pick","type":"string"}},"required":["text","mode"],"type":"object"},"maxItems":4,"minItems":1,"type":"array"},"question":{"description":"Clear, specific question that captures the missing information you need","type":"string"}},"required":["question","follow_up"],"type":"object"}
text_with_eos: TEXT eos?
obj_edit_file: %json {"additionalProperties":false,"properties":{"expected_replacements":{"description":"Number of replacements expected. Defaults to 1 if not specified. Use when you want to replace multiple occurrences of the same text.","minimum":1,"type":"number"},"file_path":{"description":"The path to the file to modify or create. You can use either a relative path in the workspace or an absolute path. If an absolute path is provided, it will be preserved as is.","type":"string"},"new_string":{"description":"The exact literal text to replace old_string with. When creating a new file (old_string is empty), this becomes the file content.","type":"string"},"old_string":{"description":"The exact literal text to replace (must match the file contents exactly, including all whitespace and indentation). For single replacements (default), include at least 3 lines of context BEFORE and AFTER the target text. Use empty string to create a new file.","type":"string"}},"required":["file_path","old_string","new_string","expected_replacements"],"type":"object"} | %json {"additionalProperties":false,"properties":{"expected_replacements":{"description":"Number of replacements expected. Defaults to 1 if not specified. Use when you want to replace multiple occurrences of the same text.","minimum":1,"type":"number"},"file_path":{"description":"The path to the file to modify or create. You can use either a relative path in the workspace or an absolute path. If an absolute path is provided, it will be preserved as is.","type":"string"},"new_string":{"description":"The exact literal text to replace old_string with. When creating a new file (old_string is empty), this becomes the file content.","type":"string"},"old_string":{"description":"The exact literal text to replace (must match the file contents exactly, including all whitespace and indentation). For single replacements (default), include at least 3 lines of context BEFORE and AFTER the target text. Use empty string to create a new file.","type":"string"}},"required":["file_path","old_string","new_string","expected_replacements"],"type":"object"}
obj_new_task: %json {"additionalProperties":false,"properties":{"message":{"description":"Initial user instructions or context for the new task","type":"string"},"mode":{"description":"Slug of the mode to begin the new task in (e.g., code, debug, architect)","type":"string"},"todos":{"description":"Optional initial todo list written as a markdown checklist; required when the workspace mandates todos","type":["string","null"]}},"required":["mode","message","todos"],"type":"object"} | %json {"additionalProperties":false,"properties":{"message":{"description":"Initial user instructions or context for the new task","type":"string"},"mode":{"description":"Slug of the mode to begin the new task in (e.g., code, debug, architect)","type":"string"},"todos":{"description":"Optional initial todo list written as a markdown checklist; required when the workspace mandates todos","type":["string","null"]}},"required":["mode","message","todos"],"type":"object"}
obj_attempt_completion: %json {"additionalProperties":false,"properties":{"result":{"description":"Final result message to deliver to the user once the task is complete","type":"string"}},"required":["result"],"type":"object"} | %json {"additionalProperties":false,"properties":{"result":{"description":"Final result message to deliver to the user once the task is complete","type":"string"}},"required":["result"],"type":"object"}
obj_search_files: %json {"additionalProperties":false,"properties":{"file_pattern":{"description":"Optional glob to limit which files are searched (e.g., *.ts)","type":["string","null"]},"path":{"description":"Directory to search recursively, relative to the workspace","type":"string"},"regex":{"description":"Rust-compatible regular expression pattern to match","type":"string"}},"required":["path","regex","file_pattern"],"type":"object"} | %json {"additionalProperties":false,"properties":{"file_pattern":{"description":"Optional glob to limit which files are searched (e.g., *.ts)","type":["string","null"]},"path":{"description":"Directory to search recursively, relative to the workspace","type":"string"},"regex":{"description":"Rust-compatible regular expression pattern to match","type":"string"}},"required":["path","regex","file_pattern"],"type":"object"}
tool_call: <[151657]> tool_obj <[151658]>
obj_codebase_search: %json {"additionalProperties":false,"properties":{"path":{"description":"Optional subdirectory (relative to the workspace) to limit the search scope","type":["string","null"]},"query":{"description":"Meaning-based search query describing the information you need","type":"string"}},"required":["query","path"],"type":"object"} | %json {"additionalProperties":false,"properties":{"path":{"description":"Optional subdirectory (relative to the workspace) to limit the search scope","type":["string","null"]},"query":{"description":"Meaning-based search query describing the information you need","type":"string"}},"required":["query","path"],"type":"object"}
tool_obj: %json {"type":"object","properties":{"name":{"type":"string"},"arguments":{"type":"object"}},"required":["name","arguments"]}
TEXT: /(?s:.*)/
obj_fetch_instructions: %json {"additionalProperties":false,"properties":{"task":{"description":"Task identifier to fetch instructions for","enum":["create_mcp_server","create_mode"],"type":"string"}},"required":["task"],"type":"object"} | %json {"additionalProperties":false,"properties":{"task":{"description":"Task identifier to fetch instructions for","enum":["create_mcp_server","create_mode"],"type":"string"}},"required":["task"],"type":"object"}
json_array: "[" obj ("," obj)* "]"
obj_apply_diff: %json {"additionalProperties":false,"properties":{"diff":{"description":"A string containing one or more search/replace blocks defining the changes. The ':start_line:' is required and indicates the starting line number of the original content. You must not add a start line for the replacement content. Each block must follow this format:\n<<<<<<< SEARCH\n:start_line:[line_number]\n-------\n[exact content to find]\n=======\n[new content to replace with]\n>>>>>>> REPLACE","type":"string"},"path":{"description":"The path of the file to modify, relative to the current workspace directory.","type":"string"}},"required":["path","diff"],"type":"object"} | %json {"additionalProperties":false,"properties":{"diff":{"description":"A string containing one or more search/replace blocks defining the changes. The ':start_line:' is required and indicates the starting line number of the original content. You must not add a start line for the replacement content. Each block must follow this format:\n<<<<<<< SEARCH\n:start_line:[line_number]\n-------\n[exact content to find]\n=======\n[new content to replace with]\n>>>>>>> REPLACE","type":"string"},"path":{"description":"The path of the file to modify, relative to the current workspace directory.","type":"string"}},"required":["path","diff"],"type":"object"}
2026-03-07T15:39:35.315629Z  WARN vllm_rs::core::engine: [Stream] New request [Seq_id 179, 541541 tokens] received! (session_id: None)

The above allowed Q3Coder to use KiloCode sub-tasks which i've never seen it do before (and no i dont allow vsix' to randomly update from god knows where so its not a change in kilocode):

but without this line - eos: <[151653]> | <[151647]> | <[151643]> | <[151649]> | <[151645]> | <[151651]> the TEXT expression (alternative to calling a tool being "talk to me") results in an eternal spam of emoji or the last few words it wanted to say before producing EOS. We basically can't take guesses anymore at what constitutes EOS nor can we trust what model authors intended for the model to do (Qwen series is awesome but they have a massive amount of unexpected behavior historically) but actually have to account for all invariants of what it might do when generating constraints for a production use-case.

The problem with these things from an expression-grammar perspective is that they aren't part of /.*/ ... there is no regular expression, json constraint, or Lark modality for expressing non-character content so without explicit definition in an | or ? block, it will run until the context window expires/sequence eviction kills it.

The tool for extracting

--- EOS Tokens ---
EOS: id=248046 token=<|im_end|>
EOS: id=248054 token=<|vision_end|>
EOS: id=248052 token=<|quad_end|>
EOS: id=248048 token=<|object_ref_end|>
EOS: id=248044 token=<|endoftext|>
EOS: id=248050 token=<|box_end|>
EOS IDs: [248046, 248054, 248052, 248048, 248044, 248050]
EOS Strings: ["<|im_end|>", "<|vision_end|>", "<|quad_end|>", "<|object_ref_end|>", "<|endoftext|>", "<|box_end|>"]

--- PAD Tokens ---
PAD: id=248055 token=<|vision_pad|>
PAD: id=248061 token=<|fim_middle|>
PAD: id=248062 token=<|fim_suffix|>
PAD: id=248056 token=<|image_pad|>
PAD: id=248057 token=<|video_pad|>
PAD: id=248060 token=<|fim_prefix|>
PAD: id=248063 token=<|fim_pad|>

--- BOS Tokens ---
BOS: id=248053 token=<|vision_start|>
BOS: id=248051 token=<|quad_start|>
BOS: id=248045 token=<|im_start|>
BOS: id=248047 token=<|object_ref_start|>
BOS: id=248049 token=<|box_start|>

--- TOOL Tokens ---
TOOL: id=248066 token=<tool_response>
TOOL: id=248067 token=</tool_response>
TOOL: id=248058 token=<tool_call>
TOOL: id=248059 token=</tool_call>

--- ROLE Tokens ---
ROLE: id=248065 token=<|file_sep|>

--- MASK Tokens ---

--- REASONING Tokens ---
REASONING: id=248069 token=</think>
REASONING: id=248068 token=<think>

--- OTHER Tokens ---
OTHER: id=248064 token=<|repo_name|>

should have been in the last commit but looks like i accidentally omitted adding that examples/ subtree for the extractor so will get it pushed up in a separate PR for the SpecialTokens EOS work (plan to do the same for every token type we have to manually define and then have model-appropriate extractors do the heavy lifting instead of the default map i'm using right now - gemma and mistral "other" sections are interesting).

This PR, aside from what i suspect might be an architectural limitation w/ XML tags, should be ready for your and you bots' QA at which point we can either merge it first or i can pare back the extensive tests added after i get your 👍 . Strongly urge trying this with coding agents on the openai service instead of the claude one - claude supports native grammar generation in their API but so do we now and the auto tool gen thing is apparently extremely useful to the agents.

sempervictus · 2026-03-07T17:16:51Z

BTW - tool_response, eh? Are we leaving an entire potential path of inference IO on the table by not using these to channelize multi-tool use?

...
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" %}
        {%- set content = render_content(message.content, false)|trim %}
        {%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
            {%- set ns.multi_step_tool = false %}
            {%- set ns.last_query_index = index %}
        {%- endif %}
    {%- endif %}

i had an idiomatic jinja template parser in the works when i first started on this PR. Might be worth revisiting as well as generating parsers+per-token output validators out of the generated constraints since we know which seq got which constraint.

sempervictus · 2026-03-09T13:39:26Z

@guoqingbao - in its "default state" this PR won't do anything bad to the code, user has to enable the grammar/constraint and tool generation flags. It is a lot of work though and it'll keep running into merge conflicts. It also works incredibly well in JSON mode (XML is inherently a problem because its not a stateless grammar but this work supports the forced parser option and that works great). I'll fix whatever changed last night but any chance we could get this in sooner than later so i'm not chasing rebases every morning? 😄

This implements the full llguidance integration enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints. Architecture: - TopLevelGrammar serialized via rmp_serde across RPC boundaries - Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher - Inline correction via logits masking during sampling - Post-process correction via rollback on validation failure Key components: - params.grammar field in SamplingParams for RPC serialization GuidanceState - GuidanceState::new() with Matcher state management - GuidanceState::reset() for proper state cleanup - Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops - guidance_failed/guidance_mismatch sets cleared on rollback - Vocab size validation in build_llg_factory() - Lark grammar generation from tools via build_tool_call_lark_grammar() CLI flags: - --enable-tool-grammar: Auto-build LLG grammar from MCP tools - --allow-constraint-api: Accept client-provided structured_outputs/response_format

sempervictus · 2026-03-09T14:16:56Z

That was stupid - i manually cherry-picked that commit and GH somehow didn't "get it" - should be all set now

sempervictus · 2026-03-09T14:22:35Z

@guoqingbao one question remains in the logic which may be irrelevant actually but figure i'll ask: does the math care whether we bias logits for user-supplied or default params first and then apply our mask atop the biased logits or do we mask first and then apply penalties/temps/etc? My thinking here is that the mask should be applied to the already constrained/biased logits since it eliminates "remaining possibilities" vs application over mask but i'm not 100% clear on whether it actually matters. LlamaCPP masks before parameterized biasing but i've been using this non-stop for days and it seems to work great (no overruns, nada - i do have #260 in my branch as well as the precision flags). I did try theirs but i'm finding we now handle tool calling better than ... anything else i've tried. Only problem we still have are client-supplied tool-choices being wrong 😁 - i've run into a few cases where the agent supplies "helpful hints" in their tool-call options which parse out to "guided nonsense" but that's generally immature/novice projects or the ever-present AI workslop stuff where some agent thought it was more clever than spec.

sempervictus · 2026-03-11T21:43:43Z

Replaced by #262

sempervictus marked this pull request as draft February 16, 2026 06:54

sempervictus mentioned this pull request Feb 16, 2026

Support mamba prefix eviction #231

Merged

chatgpt-codex-connector Bot reviewed Feb 17, 2026

View reviewed changes

sempervictus mentioned this pull request Feb 17, 2026

Allow decalring FI source and commit hash in env guoqingbao/attention.rs#29

Open

chatgpt-codex-connector Bot reviewed Mar 6, 2026

View reviewed changes

sempervictus changed the title ~~Implement LLGuidance Foundation~~ Implement LLGuidance Mar 7, 2026

sempervictus mentioned this pull request Mar 7, 2026

Idiomatic Special Tokenization #259

Open

4 tasks

sempervictus mentioned this pull request Mar 9, 2026

Idiomatic Structural Handling for Special Tokens via Macro-Driven Dispatch in Token Library Struct #260

Closed

RageLtMan and others added 2 commits March 9, 2026 10:12

Support Qwen3.5 Dense models on Metal (guoqingbao#258)

9a1ab79

sempervictus force-pushed the tools/strict_validation_and_guidance branch from c4d367b to 9a1ab79 Compare March 9, 2026 14:13

Merge branch 'main' into tools/strict_validation_and_guidance

5ba7343

sempervictus mentioned this pull request Mar 11, 2026

Enable Reasoning via Guided Enforcement #262

Merged

sempervictus closed this Mar 11, 2026

		let _has_any_penalty =
		cached_params.frequency_penalty.is_some() \|\| cached_params.presence_penalty.is_some();

		for (i, seq_id) in seq_ids.iter().enumerate() {
		if i < vec.len() {

Conversation

sempervictus commented Feb 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Architecture

Grammar Composition

References

Uh oh!

sempervictus commented Feb 16, 2026

Uh oh!

sempervictus commented Feb 16, 2026

Uh oh!

sempervictus commented Feb 16, 2026

Uh oh!

guoqingbao commented Feb 16, 2026

Uh oh!

sempervictus commented Feb 16, 2026

Uh oh!

sempervictus commented Feb 16, 2026

Uh oh!

sempervictus commented Feb 16, 2026

Uh oh!

guoqingbao commented Feb 17, 2026

Uh oh!

guoqingbao commented Feb 17, 2026

Uh oh!

sempervictus commented Feb 17, 2026

Uh oh!

guoqingbao commented Feb 17, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

sempervictus Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

sempervictus Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

sempervictus Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

sempervictus commented Feb 17, 2026

Uh oh!

sempervictus commented Feb 17, 2026

Why Constrained Decoding is Faster

Normal Decoding (Unconstrained)

Constrained Decoding (llguidance)

Technical Explanation

Mask Computation Cost

Sampling Cost Reduction

1. Prefill Accuracy Improvements

Current Problem

llguidance Solution

2. Prefill Speed Improvements

Current Problem

llguidance Solution

3. How llguidance Helps Prefill

A. Token Masking During Prefill

B. Grammar-Aware Prefill

4. Real-World Prefill Improvements

Tool Call Prefill

JSON Prefill

Uh oh!

guoqingbao commented Feb 17, 2026

Uh oh!

sempervictus commented Feb 17, 2026

Uh oh!

sempervictus commented Feb 16, 2026 •

edited

Loading

sempervictus commented Feb 17, 2026 •

edited

Loading

sempervictus commented Feb 18, 2026 •

edited

Loading

guoqingbao commented Mar 6, 2026 •

edited

Loading

Step-by-Step: sample() → Decoding Flow (Current State - AFTER `d2c95a7` + fix)

Entry Point: `sample()` in `src/core/runner.rs:1075`

BEFORE (`1f799ad`) vs AFTER (`d2c95a7` + fix) Comparison