Implement LLGuidance #232
|
@guoqingbao - could you please make the pyo3 stuff not blow up? I'm not a maturin person and it's... mildly insanity-provoking 😁 Would obviously appreciate code review and assistance as well. There are a few TODOs, but actual template wiring and testing is a whole other thing. Otherwise it's passing tests on the Rust side, running correctly, and chaining tool calls like the NVFP4 80B on Python. Strongly urge use in conjunction with the |
|
Should supersede #208 - not sure if anything in there is worth pulling to this one @guoqingbao but i'm sure you'll see it on review |
|
... and template loading, need to get that logic done too :-) |
|
I'm not sure. I have removed the strict tool call validation because it doesn't work with certain agents. Tool call schemas can be very diverse, and a simple validation strategy produced many false positives. llguidance is an optional feature that cannot serve as the default path for decode and sampling; that's why the similar implementation in #208 was not merged. |
|
I'll get template parsing added. |
|
Json seems a fairly common datatype, why does maturin hate |
|
@guoqingbao - oddly, this seems to help decoding now that i'm enforcing grammars on tool calls:
vllm-rs-svc0 | 2026-02-16T21:46:03.318901Z INFO vllm_rs::core::engine: Decoding: 1 active request(s) [Seq: [3]], avg. 47 tokens/s per request (total: 47 tokens/s)
Normally it's 35 with the accuracy patch. It seems to fluctuate between normal output and tool calls but... i think this might be handy just for that. Could you please take a look at the py failures? I have this building just in the Rust format w/ the flashinfer patch, but we can't merge it until Python builds, especially given that the GCU uses that interface |
Will take a look once I'm back at the office. |
|
@codex help review this PR |
|
No rush, got it working on the gemm right now while I build clouds :-) |
Is this working with Claude code? |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0dcbb008aa
| let _has_any_penalty = | ||
| cached_params.frequency_penalty.is_some() || cached_params.presence_penalty.is_some(); |
Restore repeat-penalty application in decoding
frequency_penalty and presence_penalty are still computed here, but this commit removed the only apply_batch_repeat_penalty call and the sequence-token tracking that fed it, so penalty-enabled requests now sample as if penalties were disabled. This silently changes generation behavior for any workload relying on repetition control.
Will look into this and clean up the call graph as we approach commit
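For reference while the call graph is being cleaned up, here is a minimal sketch of a per-sequence penalty pass. It assumes the OpenAI-style definition (subtract `count * frequency_penalty` plus a flat `presence_penalty` for every token that has already been generated). The name `apply_repeat_penalties` and its signature are hypothetical; this is not the repo's `apply_batch_repeat_penalty`.

```rust
use std::collections::HashMap;

/// Hypothetical helper, not the repo's `apply_batch_repeat_penalty`.
/// Applies OpenAI-style penalties to one sequence's logits in place.
pub fn apply_repeat_penalties(
    logits: &mut [f32],
    generated: &[u32],
    frequency_penalty: f32,
    presence_penalty: f32,
) {
    // Count how often each token id has been generated so far.
    let mut counts: HashMap<u32, u32> = HashMap::new();
    for &tok in generated {
        *counts.entry(tok).or_insert(0) += 1;
    }
    for (tok, count) in counts {
        // Ids outside the vocabulary are ignored.
        if let Some(logit) = logits.get_mut(tok as usize) {
            *logit -= count as f32 * frequency_penalty + presence_penalty;
        }
    }
}
```

The point of the sketch is only that the token history has to be tracked somewhere for the penalties to have any effect; where that tracking lives in the engine is a separate design decision.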
| GuidanceState::new( | ||
| self.llg_factory.as_ref().unwrap().clone(), | ||
| &constraint_opt.expect("checked above") | ||
| ).expect("Failed to create guidance state") |
Avoid panicking on invalid user constraints
Creating a GuidanceState with expect("Failed to create guidance state") will panic the runner when a request supplies an invalid regex/Lark/JSON-schema constraint, turning a bad input into process-level failure instead of a recoverable request error. This is especially risky because constraints are request data and can be malformed at runtime.
The state was previously checked so if something is wrong here we do want a crash
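If the earlier check ever stops covering every failure mode, the alternative the review describes could look roughly like the sketch below. `GuidanceState` here is a stand-in type with a made-up constructor (it only rejects an empty grammar), and `make_state` is a hypothetical helper; neither is the PR's actual code.

```rust
/// Stand-in for the real guidance state; only the error path matters here.
#[derive(Debug)]
pub struct GuidanceState;

impl GuidanceState {
    /// Hypothetical constructor: rejects an empty grammar to simulate a
    /// constraint that fails to compile.
    pub fn new(grammar: &str) -> Result<Self, String> {
        if grammar.trim().is_empty() {
            Err("empty grammar".to_string())
        } else {
            Ok(GuidanceState)
        }
    }
}

/// Returns Ok(None) when no constraint was requested, Ok(Some(..)) when it
/// compiled, and Err(..) so the caller can fail one request, not the runner.
pub fn make_state(constraint: Option<&str>) -> Result<Option<GuidanceState>, String> {
    constraint
        .map(|c| GuidanceState::new(c).map_err(|e| format!("invalid constraint: {e}")))
        .transpose()
}
```

Whether a crash or a per-request error is preferable depends on how much the upstream check can be trusted, which is the disagreement in this thread.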
| for (i, seq_id) in seq_ids.iter().enumerate() { | ||
| if i < vec.len() { |
Commit each DecodeVec token only once
This branch introduces a second for (i, seq_id) loop inside an outer loop that is already iterating seq_ids, so in Seqs::DecodeVec each sampled token is committed to FSM state multiple times per step. That advances constraints too far and can cause valid constrained generations to fail after a few decode iterations.
Haven't seen generation fail yet
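A minimal sketch of the single-commit shape the review is asking for, pairing each sequence id with its sampled token in one pass. `ToyState` and `commit_step` are hypothetical stand-ins (the toy state only records tokens); the real matcher and `Seqs::DecodeVec` handling are not reproduced here.

```rust
use std::collections::HashMap;

/// Toy stand-in for a per-sequence matcher: it only records committed tokens.
#[derive(Default)]
pub struct ToyState {
    pub committed: Vec<u32>,
}

/// Commits exactly one sampled token per sequence per decode step.
/// `seq_ids[i]` owns `tokens[i]`; pairs beyond the shorter slice are ignored.
pub fn commit_step(states: &mut HashMap<usize, ToyState>, seq_ids: &[usize], tokens: &[u32]) {
    for (seq_id, &tok) in seq_ids.iter().zip(tokens) {
        if let Some(state) = states.get_mut(seq_id) {
            state.committed.push(tok);
        }
    }
}
```

With a single `zip`, there is no inner loop that could revisit the same sequence, so a state can only advance by one token per step.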
| let start_id = match tokenizer.encode(start, false) { | ||
| Ok(encoding) => { | ||
| let ids = encoding.get_ids(); | ||
| if ids.len() == 1 { Some(ids[0]) } else { None } | ||
| } | ||
| Err(_) => None | ||
| }; | ||
|
|
||
| // Try to encode end token | ||
| let end_id = match tokenizer.encode(end, false) { |
Assign detected reasoning token IDs to outer vars
The let start_id = ... / let end_id = ... declarations inside the loop shadow the outer mutable start_id and end_id, so the tuple returned at the end remains (None, None) even when a marker pair is found. As a result, tokenizer-based reasoning marker detection is effectively never enabled.
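The shadowing issue can be shown in isolation. In the sketch below the inner results use different names and are assigned to the outer variables rather than re-declared with `let`. The `encode` closure is a stand-in for the tokenizer call, and `detect_marker_ids` is a hypothetical name, not the function in the PR.

```rust
/// Returns the single-token ids for the first marker pair where both markers
/// encode to exactly one token. `encode` stands in for the tokenizer call.
pub fn detect_marker_ids<F>(pairs: &[(&str, &str)], encode: F) -> (Option<u32>, Option<u32>)
where
    F: Fn(&str) -> Option<Vec<u32>>,
{
    let mut start_id = None;
    let mut end_id = None;
    for &(start, end) in pairs {
        // Inner bindings use different names so they cannot shadow the outer ones.
        let s = encode(start).filter(|ids| ids.len() == 1).map(|ids| ids[0]);
        let e = encode(end).filter(|ids| ids.len() == 1).map(|ids| ids[0]);
        if s.is_some() && e.is_some() {
            start_id = s; // assign, do not re-declare with `let`
            end_id = e;
            break;
        }
    }
    (start_id, end_id)
}
```

Had the loop body used `let start_id = ...`, the outer `start_id` would stay `None` no matter what the tokenizer returned, which matches the symptom described above.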
No idea, kilocode and aichat seem to love it. Aichat used to screw up tools all the time, now i can do this:
^^ only has one "turn" of inference, that's not an interactive "session" but a CLI invocation from |
|
Possible explanation for the tool call decoding speed bump:
and potential cause to constrain all of the IO:
|
I saw you used log_debug to print constraint application, but that method does not print to the terminal unless the Rust debug feature is enabled. Have you observed any constraint applications when using different agents? I thought most agents won't provide constraints, so guided decoding is not actually enabled. |
|
Added debug and trace, we were kinda low on log levels :) |
Right now it trips on tool calls and tells the model to keep their structure and content to a Lark constraint. Eventually I'd like to have it cover everything w utf8 safe chars by default and dynamic tool structure dispatch from chat template with thinking prompts and the like. |
Which constraint(s) did you pass from the client side? |
I used Claude Code, goose and opencode to test this, but none of them passed a constraint, so guided decoding was not actually enabled. That makes it hard to decide whether it is effective or not. |
|
@guoqingbao - endpoints don't dictate grammars to the engine (at least not yet). It's setting them up from the tool params and applying a sanitizing grammar to enforce the contents. You can work back from engine.rs:
// Build tool call schema from tools
let tool_schema = build_tool_call_schema(&[]);
// Build llguidance constraint from markers and schema
let tool_constraint = tool_markers.clone().and_then(|m| {
// Default
// build_tool_call_constraint(&m, &tool_schema).ok()
// Choose ASCII or safe UTF-8 based on security requirements
// For maximum security, use ASCII:
build_tool_call_constraint_ascii(&m, &tool_schema).ok()
// OR for safe UTF-8 (allows valid UTF-8, blocks invalid):
// build_tool_call_constraint_safe_utf8(&m, &tool_schema).ok()
});
crate::log_info!(
"[llg] Tool markers: {:?}, Constraint: {:?}",
&tool_markers,
tool_constraint.as_ref().map(|_| "enabled")
);
...
let engine = Arc::new(RwLock::new(Self {
runners,
scheduler,
tokenizer,
econfig,
default_chat_template,
template,
stream_decoders: HashMap::new(),
stream_senders: HashMap::new(),
request_types: HashMap::new(),
decode_start_times: HashMap::new(),
decode_length: HashMap::new(),
last_check_throughput_time: 0,
active_requests: HashSet::new(),
cancelled_sequences: Vec::new(),
stop_flag: stop_flag.clone(),
has_vision: config.is_multi_model.unwrap_or(false),
model_type,
tool_config,
img_cfg,
model_name,
llg_factory,
tool_markers,
tool_schema: Some(tool_schema),
tool_constraint: tool_constraint,
reasoning_start_id,
reasoning_end_id,
})); |
|
I'm guessing in prod we have to relax that to UTF-8 since file paths and such are going to have local dialect/charset names. |
Replied in the other PR. We also have an urgent fix that requires your attention. |
Models won't produce any call during a sequence task (when they should); not sure whether this is related or can be solved by guidance. Will dig into it tomorrow morning. |
guoqingbao/attention.rs#29 (comment) - my top guess is the ASCII grammar |
|
Confirming that even mid-stream calls work with the utf8 grammar:
I was able to reproduce some weirdness with the ASCII one:
which makes me wonder if you can "sample-constrain" a model into using a whole other tokenizer with enough Lark hate applied (there are limits to Lark, i think 50k expressions when resolved) |
|
@guoqingbao i think we have "stream slip" going on here. This is something i generally only see in hand-made IO constructs or... umm... "hastily" made ones (when i wrote that, PHP's TLS implementation allowed you to It only happens in streaming mode, batch tool calls and IOs are fine, but the main expressions of this are:
Sometimes, we can get "slipping tokens" even without tool calls:
which as i understand it is not that different from All of this points to a state management problem where we are handling something off-alignment with expected boundaries. We may want something like Framed streaming to forcibly bound that, but first i need to find where this is happening because the logging i recently added does not trip on case 1 - that tool call tag "just flies by" and is never seen while in the same message i can see All that said, try having a conversation with the Python |
|
For some sense of sanity by the way, all of these PRs are intended to ground us as firmly as possible in accurate logic and correct semantics, such that when the projects grow those imperfections don't become irredeemable architectural flaws intrinsic to the system for which workarounds are made and so on. I'm going to be publishing a crate to enable: |
|
@guoqingbao - could you please ask your bots (or if you have a sec, yourself eyeball) d2c95a7? This was working perfectly in the gory row-by-row masking i was doing in CPU to ensure correctness but seems the accelerated approach is creating overrun masks (keep generating forever, etc - happens w/ longer context). |
@codex Check and find bugs for llguidance usage. |
|
@guoqingbao - i think i screwed up the way i'm laying out logits/tensors in the "fast path" (the performance difference actually seems minor and the original code is per-token precise). If we can get the no-copy version working correctly it's faster in-GPU, but right now i'm running with it reverted and it seems quite happy |
He is not responding, let's wait and see @sempervictus |
|
Currently looking to try this and see how it does, but i have to reboot the large model to actually see the corruption since it requires some seq size and i've been testing on an 8k max seq model:
diff --git a/src/utils/guidance.rs b/src/utils/guidance.rs
index 245c41c..99c01da 100644
--- a/src/utils/guidance.rs
+++ b/src/utils/guidance.rs
@@ -949,10 +949,11 @@ pub fn batch_mask_bias(
let mut bias_data = vec![f32::NEG_INFINITY; batch_size * vocab_size];
// Fill in allowed tokens using sparse iteration
- for (seq_idx, (_seq_idx, mask)) in masks.iter().enumerate() {
+ // masks is Vec<(batch_idx, SimpleVob)> where batch_idx is the sequence position in the batch
+ for (batch_idx, mask) in masks.iter() {
mask.iter_set_entries(|idx| {
if idx < vocab_size {
- bias_data[seq_idx * vocab_size + idx] = 0.0;
+ bias_data[*batch_idx * vocab_size + idx] = 0.0;
}
});
}
@@ -1009,24 +1010,26 @@ pub fn early_exit_validate(
}
});
- // Get logits as vector and apply bias
- let mut logits_vec = logits.flatten_all()?.to_vec1::<f32>()?;
- let row = &mut logits_vec[seq_idx * vocab_size..][..vocab_size];
+ // Get current sequence's logits as 1D tensor - MUST CLONE to avoid cross-contamination
+ let row_start = seq_idx * vocab_size;
+ let row_end = row_start + vocab_size;
+ let logits_vec = logits.flatten_all()?.to_vec1::<f32>()?;
+ let mut row_vec = logits_vec.clone(); // Clone to avoid modifying original
+ let row = &mut row_vec[row_start..row_end];
+ // Apply bias directly to this sequence's row
for tok in 0..vocab_size {
- if acc[tok] == 0.0 {
- // Keep original logit value
- } else {
+ if acc[tok] != 0.0 {
row[tok] = f32::NEG_INFINITY;
}
}
- // Create biased tensor
- let biased_tensor = Tensor::from_vec(logits_vec, logits.shape(), logits.device())?;
+ // Create 1D tensor for just this sequence
+ let biased_row = Tensor::from_vec(row_vec[row_start..row_end].to_vec(), (vocab_size,), logits.device())?;
- // Re-sample with biased logits
- let re_sampled = logit_processor.sample_with_strategy(&biased_tensor, sampling)?;
- tokens[seq_idx] = re_sampled[seq_idx];
+ // Re-sample just this sequence from the biased 1D logits
+ let re_sampled = logit_processor.sample_with_strategy(&biased_row, sampling)?;
+ tokens[seq_idx] = re_sampled[0]; // 1D output, first (only) element
crate::log_debug!("[llg] Consuming re-sampled token {} for seq {}", tokens[seq_idx], seq_id);
|
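To make the indexing in the first hunk easier to check, here is a self-contained sketch of a batch bias builder keyed on `batch_idx`. A plain `Vec<usize>` of allowed token ids stands in for `SimpleVob`, and the function is a simplified illustration rather than the code in `guidance.rs`. One deliberate difference from the diff, labeled as an assumption: rows without a mask are left at 0.0 (unconstrained) instead of starting from all `-inf`.

```rust
/// Builds a flat `[batch_size * vocab_size]` additive bias.
/// Each mask is `(batch_idx, allowed_token_ids)`; a Vec<usize> stands in
/// for `SimpleVob`. Rows without a mask stay at 0.0 (assumption: such rows
/// are unconstrained).
pub fn batch_mask_bias(
    masks: &[(usize, Vec<usize>)],
    batch_size: usize,
    vocab_size: usize,
) -> Vec<f32> {
    let mut bias = vec![0.0f32; batch_size * vocab_size];
    for (batch_idx, allowed) in masks {
        if *batch_idx >= batch_size {
            continue; // ignore masks that point outside the batch
        }
        let row = &mut bias[*batch_idx * vocab_size..(*batch_idx + 1) * vocab_size];
        // Forbid everything in this row, then re-enable the allowed ids.
        row.fill(f32::NEG_INFINITY);
        for &tok in allowed {
            if tok < vocab_size {
                row[tok] = 0.0;
            }
        }
    }
    bias
}
```

Indexing by the stored `batch_idx` rather than by the enumeration position is what keeps a sparse `masks` list from writing one sequence's mask into another sequence's row.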
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 663b42f6ff
| } | ||
| }; | ||
|
|
||
| if let Ok(Some(mask)) = state.compute_mask() { |
Enforce EOS mask once grammar matcher is stopped
When GuidanceState::compute_mask() returns None for a stopped matcher, the decode loop skips masking and samples from the full vocabulary; afterwards validate_token() also treats stopped states as valid, so extra tokens are accepted even after a finite grammar is complete. In requests that rely on grammar termination (tool-call/object completion), this causes unconstrained tail generation until max_tokens or another external stop is hit instead of ending at grammar completion.
Wouldn't the original sampling code have already taken care of this? The model wants to emit EOS when it's done; unconstrained models do so just fine, but we had to add multi-EOS constraint-ends in our text_with_eos generation to ensure we do NOT mask them off accidentally.
The current behavior seems to be clobbering that precise masking intent.
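One way to read the review's suggestion, sketched under assumptions: while the matcher is live its mask passes through untouched (so an `eos?` alternative in the grammar keeps working), and only once the matcher has stopped is an EOS-only mask substituted. `mask_for_step` and the `Vec<bool>` mask type are hypothetical simplifications, not the PR's API.

```rust
/// Chooses the allowed-token mask for one decode step.
/// `grammar_mask` is what the matcher returned (None once it has stopped).
pub fn mask_for_step(
    stopped: bool,
    grammar_mask: Option<Vec<bool>>,
    eos_ids: &[u32],
    vocab_size: usize,
) -> Option<Vec<bool>> {
    if stopped {
        // Grammar is complete: permit only the configured EOS ids.
        let mut mask = vec![false; vocab_size];
        for &id in eos_ids {
            if let Some(slot) = mask.get_mut(id as usize) {
                *slot = true;
            }
        }
        Some(mask)
    } else {
        // Still matching: pass the matcher's mask through untouched.
        grammar_mask
    }
}
```

This leaves the masking intent of `text_with_eos` alone and only changes what happens after the grammar has nothing left to match.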
| pub fn validate_sequence_for_grammar(&self, seq_id: usize, output_ids: &[u32]) -> Option<usize> { | ||
| let mut guidance_states = self.guidance_states.write(); | ||
| let state = guidance_states.get_mut(&seq_id)?; | ||
| match state.validate_tokens(output_ids) { |
Validate grammar against unconsumed suffix only
validate_sequence_for_grammar re-validates output_ids on a GuidanceState whose matcher has already consumed generated tokens in sample() via commit_token(). Because Matcher::validate_tokens checks from the current FSM position, replaying previously consumed output can report mismatches for valid generations and trigger rollback loops as output grows.
We would likely then be seeing rollbacks all the time - we are not.
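Whether or not rollbacks are showing up in practice, the bookkeeping the review describes is small. The sketch below tracks how many output tokens have already been consumed and hands back only the remainder for validation. `SuffixTracker` is a hypothetical helper and is not part of `GuidanceState`.

```rust
/// Tracks how many output tokens the matcher has already consumed.
#[derive(Default)]
pub struct SuffixTracker {
    consumed: usize,
}

impl SuffixTracker {
    /// Returns the part of `output_ids` that has not been consumed yet.
    pub fn unconsumed<'a>(&self, output_ids: &'a [u32]) -> &'a [u32] {
        let start = self.consumed.min(output_ids.len());
        &output_ids[start..]
    }

    /// Records that `n` more tokens were committed to the matcher.
    pub fn mark_consumed(&mut self, n: usize) {
        self.consumed += n;
    }
}
```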
| let target_tokens = output_ids.len(); | ||
| let target_blocks = target_tokens.div_ceil(self.block_manager.get_block_size()); | ||
| let target_tokens_aligned = target_blocks * self.block_manager.get_block_size(); |
Roll back using absolute token count, not output length
On grammar mismatch, the rollback target is computed from output_ids.len() and rounded up to a block boundary, but rollback_sequence truncates seq.token_ids using that value as an absolute length. For normal prompts this can drop most of the prompt context, and rounding up can also make rollback ineffective by targeting a position beyond the generated suffix, corrupting recovery behavior.
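The arithmetic behind this comment, as a small sketch. It assumes the sequence is laid out as prompt tokens followed by generated tokens and that validation reports how many generated tokens were accepted; both function names are hypothetical.

```rust
/// Absolute sequence length to keep after a grammar mismatch.
/// `valid_output` is how many generated tokens passed validation.
pub fn rollback_target(prompt_len: usize, output_len: usize, valid_output: usize) -> usize {
    // Never keep more generated tokens than actually exist.
    prompt_len + valid_output.min(output_len)
}

/// Number of cache blocks needed to hold `tokens` tokens.
pub fn blocks_needed(tokens: usize, block_size: usize) -> usize {
    tokens.div_ceil(block_size)
}
```

For example, with a 100-token prompt, 10 generated tokens and 4 of them valid, the truncation length is 104 tokens, which needs 7 blocks of 16. The block count is rounded up, but the token length itself is not.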
|
Taking a shot at this approach but i might just roll back that change; it works great w/out and there's still some grammar composition work to be done for tool-call sleds
|
|
@guoqingbao - i think we're not pulling out all of the possible EOS token IDs from the 3.5: ^^ is the 0.8B Q3.5 AFAIK there should be 3 as we saw w/ the coder model. When they're not found, the expression causes the model to "run on" because the tokens not defined in that expression get masked-out. |
|
Confirmed: we are not correctly extracting EOS tokens for the qwen35 models, and that's probably why we have been seeing problems with models streaming non-stop. We need a more idiomatic way to deal with tokens since EOS can be many things |
|
@guoqingbao - looks like we need proper tokenizer extraction, here's what i was able to pull in a quick test:
--- EOS Tokens ---
EOS: id=248052 token=<|quad_end|>
EOS: id=248054 token=<|vision_end|>
EOS: id=248044 token=<|endoftext|>
EOS: id=248048 token=<|object_ref_end|>
EOS: id=248050 token=<|box_end|>
EOS: id=248046 token=<|im_end|>
EOS IDs: [248052, 248054, 248044, 248048, 248050, 248046]
EOS Strings: ["<|quad_end|>", "<|vision_end|>", "<|endoftext|>", "<|object_ref_end|>", "<|box_end|>", "<|im_end|>"]
none of which are the ID we are pulling right now. ^^ is why q35 has run-on even without this PR, i'm pretty sure |
Only the one defined as eos in tokenizer config file is the valid EOS token: https://huggingface.co/Qwen/Qwen3.5-35B-A3B/blob/main/tokenizer_config.json |
|
yeah and they also said q3coder next didn't reason... but here we are 😉 So this works:
vllm-rs-svc2 | 2026-03-07T06:18:06.269256Z DEBUG vllm_rs::server::server: [llg] Lark grammar string:
vllm-rs-svc2 | start: text_with_eos
vllm-rs-svc2 | text_with_eos: TEXT eos?
vllm-rs-svc2 | TEXT: /(?s:.*)/
vllm-rs-svc2 | eos: <[248048]> | <[248052]> | <[248054]> | <[248044]> | <[248046]> | <[248050]>
and i have an idiomatic SpecialTokens extractor/accessor inbound which generated that. No more guessing, now we add extractors to the bloody thing as we discover we need something else and we can access it all through a sane impl. Might actually want to put that in Tokenizers itself eventually once we iron out the details here. This is the extract from the q35-0.8B:
Successfully loaded tokenizer from: tests/tokenizer.json
Total added tokens processed.
--- EOS Tokens ---
EOS: id=248054 token=<|vision_end|>
EOS: id=248048 token=<|object_ref_end|>
EOS: id=248052 token=<|quad_end|>
EOS: id=248044 token=<|endoftext|>
EOS: id=248046 token=<|im_end|>
EOS: id=248050 token=<|box_end|>
EOS IDs: [248054, 248048, 248052, 248044, 248046, 248050]
EOS Strings: ["<|vision_end|>", "<|object_ref_end|>", "<|quad_end|>", "<|endoftext|>", "<|im_end|>", "<|box_end|>"]
--- PAD Tokens ---
PAD: id=248057 token=<|video_pad|>
PAD: id=248060 token=<|fim_prefix|>
PAD: id=248063 token=<|fim_pad|>
PAD: id=248055 token=<|vision_pad|>
PAD: id=248062 token=<|fim_suffix|>
PAD: id=248061 token=<|fim_middle|>
PAD: id=248056 token=<|image_pad|>
--- BOS Tokens ---
BOS: id=248047 token=<|object_ref_start|>
BOS: id=248045 token=<|im_start|>
BOS: id=248051 token=<|quad_start|>
BOS: id=248049 token=<|box_start|>
BOS: id=248053 token=<|vision_start|>
--- TOOL Tokens ---
TOOL: id=248067 token=</tool_response>
TOOL: id=248058 token=<tool_call>
TOOL: id=248059 token=</tool_call>
TOOL: id=248066 token=<tool_response>
--- ROLE Tokens ---
ROLE: id=248065 token=<|file_sep|>
--- MASK Tokens ---
--- REASONING Tokens ---
REASONING: id=248069 token=</think>
REASONING: id=248068 token=<think>
--- OTHER Tokens ---
OTHER: id=248064 token=<|repo_name|> |
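The `eos` rule in the grammar above is just an alternation of `<[id]>` token references, so rendering it from a list of ids is mechanical. A minimal sketch follows; `build_eos_rule` is a hypothetical name, and which ids belong in the list is exactly what is under discussion in this thread.

```rust
/// Renders a Lark `eos` rule from token ids, e.g. `eos: <[1]> | <[2]>`.
/// Returns None for an empty list, since an empty alternation is not a rule.
pub fn build_eos_rule(eos_ids: &[u32]) -> Option<String> {
    if eos_ids.is_empty() {
        return None;
    }
    let alts: Vec<String> = eos_ids.iter().map(|id| format!("<[{id}]>")).collect();
    Some(format!("eos: {}", alts.join(" | ")))
}
```

Passing only the single configured `eos_token` id or the wider extracted set changes the rule's contents, not its shape.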
|
@guoqingbao - so we have a bug:
|
The eos token is loaded from the config file. https://github.com/guoqingbao/vllm.rs/blob/main/src/utils/mod.rs#L622 |
These are not EOS tokens; why do you treat them as EOS? This one is EOS: EOS: id=248046 token=<|im_end|> https://huggingface.co/Qwen/Qwen3.5-0.8B/blob/main/tokenizer_config.json Only the "eos_token" field configured in config.json or tokenizer_config.json can be treated as EOS; that's the common practice. |
|
@guoqingbao i know how the EOS token was being loaded, but we weren't getting them all. The problem here is that guidance works by masking, so if you do not explicitly define the models' options for EOS they lose the ability to generate them at all. That produces infinite emojis after the seqpos where EOS would have been (they're out of masked length and they are forbidden from terminating the output since they can't produce EOS, so they just vomit garbage forever). On the bright side - try this PR with
2026-03-07T15:39:34.898404Z DEBUG vllm_rs::utils::guidance: [llg] parse_lark_grammar() -> start_rhs='tool_call', other_rules_count=34
2026-03-07T15:39:34.898463Z DEBUG vllm_rs::utils::guidance: [llg] compose_grammars() -> ( text with EOS | tool_call )+
2026-03-07T15:39:34.898475Z DEBUG vllm_rs::server::server: [llg] TopLevelGrammar for SamplingParams: TopLevelGrammar { grammars: [GrammarWithLexer [lark]], max_tokens: Some(262144) }
2026-03-07T15:39:34.898485Z DEBUG vllm_rs::server::server: [llg] Lark grammar string:
start: ( text_with_eos | tool_call )+
obj_delete_file: %json {"additionalProperties":false,"properties":{"path":{"description":"Path to the file or directory to delete, relative to the workspace","type":"string"}},"required":["path"],"type":"object"} | %json {"additionalProperties":false,"properties":{"path":{"description":"Path to the file or directory to delete, relative to the workspace","type":"string"}},"required":["path"],"type":"object"}
obj: obj_delete_file | obj_apply_diff | obj_ask_followup_question | obj_attempt_completion | obj_codebase_search | obj_execute_command | obj_fetch_instructions | obj_list_files | obj_new_task | obj_read_file | obj_edit_file | obj_search_files | obj_switch_mode | obj_update_todo_list | obj_write_to_file
eos: <[151653]> | <[151647]> | <[151643]> | <[151649]> | <[151645]> | <[151651]>
obj_write_to_file: %json {"additionalProperties":false,"properties":{"content":{"description":"The content to write to the file. ALWAYS provide the COMPLETE intended content of the file, without any truncation or omissions. You MUST include ALL parts of the file, even if they haven't been modified. Do NOT include line numbers in the content.","type":"string"},"path":{"description":"The path of the file to write to (relative to the current workspace directory)","type":"string"}},"required":["path","content"],"type":"object"} | %json {"additionalProperties":false,"properties":{"content":{"description":"The content to write to the file. ALWAYS provide the COMPLETE intended content of the file, without any truncation or omissions. You MUST include ALL parts of the file, even if they haven't been modified. Do NOT include line numbers in the content.","type":"string"},"path":{"description":"The path of the file to write to (relative to the current workspace directory)","type":"string"}},"required":["path","content"],"type":"object"}
obj_update_todo_list: %json {"additionalProperties":false,"properties":{"todos":{"description":"Full markdown checklist in execution order, using [ ] for pending, [x] for completed, and [-] for in progress","type":"string"}},"required":["todos"],"type":"object"} | %json {"additionalProperties":false,"properties":{"todos":{"description":"Full markdown checklist in execution order, using [ ] for pending, [x] for completed, and [-] for in progress","type":"string"}},"required":["todos"],"type":"object"}
obj_list_files: %json {"additionalProperties":false,"properties":{"path":{"description":"Directory path to inspect, relative to the workspace","type":"string"},"recursive":{"description":"Set true to list contents recursively; false to show only the top level","type":"boolean"}},"required":["path","recursive"],"type":"object"} | %json {"additionalProperties":false,"properties":{"path":{"description":"Directory path to inspect, relative to the workspace","type":"string"},"recursive":{"description":"Set true to list contents recursively; false to show only the top level","type":"boolean"}},"required":["path","recursive"],"type":"object"}
obj_read_file: %json {"additionalProperties":false,"properties":{"files":{"description":"List of files to read; request related files together when allowed","items":{"additionalProperties":false,"properties":{"path":{"description":"Path to the file to read, relative to the workspace","type":"string"}},"required":["path"],"type":"object"},"minItems":1,"type":"array"}},"required":["files"],"type":"object"} | %json {"additionalProperties":false,"properties":{"files":{"description":"List of files to read; request related files together when allowed","items":{"additionalProperties":false,"properties":{"path":{"description":"Path to the file to read, relative to the workspace","type":"string"}},"required":["path"],"type":"object"},"minItems":1,"type":"array"}},"required":["files"],"type":"object"}
obj_execute_command: %json {"additionalProperties":false,"properties":{"command":{"description":"Shell command to execute","type":"string"},"cwd":{"description":"Optional working directory for the command, relative or absolute","type":["string","null"]}},"required":["command","cwd"],"type":"object"} | %json {"additionalProperties":false,"properties":{"command":{"description":"Shell command to execute","type":"string"},"cwd":{"description":"Optional working directory for the command, relative or absolute","type":["string","null"]}},"required":["command","cwd"],"type":"object"}
obj_switch_mode: %json {"additionalProperties":false,"properties":{"mode_slug":{"description":"Slug of the mode to switch to (e.g., code, ask, architect)","type":"string"},"reason":{"description":"Explanation for why the mode switch is needed","type":"string"}},"required":["mode_slug","reason"],"type":"object"} | %json {"additionalProperties":false,"properties":{"mode_slug":{"description":"Slug of the mode to switch to (e.g., code, ask, architect)","type":"string"},"reason":{"description":"Explanation for why the mode switch is needed","type":"string"}},"required":["mode_slug","reason"],"type":"object"}
obj_ask_followup_question: %json {"additionalProperties":false,"properties":{"follow_up":{"description":"Required list of 2-4 suggested responses; each suggestion must be a complete, actionable answer and may include a mode switch","items":{"additionalProperties":false,"properties":{"mode":{"description":"Optional mode slug to switch to if this suggestion is chosen (e.g., code, architect)","type":["string","null"]},"text":{"description":"Suggested answer the user can pick","type":"string"}},"required":["text","mode"],"type":"object"},"maxItems":4,"minItems":1,"type":"array"},"question":{"description":"Clear, specific question that captures the missing information you need","type":"string"}},"required":["question","follow_up"],"type":"object"} | %json {"additionalProperties":false,"properties":{"follow_up":{"description":"Required list of 2-4 suggested responses; each suggestion must be a complete, actionable answer and may include a mode switch","items":{"additionalProperties":false,"properties":{"mode":{"description":"Optional mode slug to switch to if this suggestion is chosen (e.g., code, architect)","type":["string","null"]},"text":{"description":"Suggested answer the user can pick","type":"string"}},"required":["text","mode"],"type":"object"},"maxItems":4,"minItems":1,"type":"array"},"question":{"description":"Clear, specific question that captures the missing information you need","type":"string"}},"required":["question","follow_up"],"type":"object"}
text_with_eos: TEXT eos?
obj_edit_file: %json {"additionalProperties":false,"properties":{"expected_replacements":{"description":"Number of replacements expected. Defaults to 1 if not specified. Use when you want to replace multiple occurrences of the same text.","minimum":1,"type":"number"},"file_path":{"description":"The path to the file to modify or create. You can use either a relative path in the workspace or an absolute path. If an absolute path is provided, it will be preserved as is.","type":"string"},"new_string":{"description":"The exact literal text to replace old_string with. When creating a new file (old_string is empty), this becomes the file content.","type":"string"},"old_string":{"description":"The exact literal text to replace (must match the file contents exactly, including all whitespace and indentation). For single replacements (default), include at least 3 lines of context BEFORE and AFTER the target text. Use empty string to create a new file.","type":"string"}},"required":["file_path","old_string","new_string","expected_replacements"],"type":"object"} | %json {"additionalProperties":false,"properties":{"expected_replacements":{"description":"Number of replacements expected. Defaults to 1 if not specified. Use when you want to replace multiple occurrences of the same text.","minimum":1,"type":"number"},"file_path":{"description":"The path to the file to modify or create. You can use either a relative path in the workspace or an absolute path. If an absolute path is provided, it will be preserved as is.","type":"string"},"new_string":{"description":"The exact literal text to replace old_string with. When creating a new file (old_string is empty), this becomes the file content.","type":"string"},"old_string":{"description":"The exact literal text to replace (must match the file contents exactly, including all whitespace and indentation). For single replacements (default), include at least 3 lines of context BEFORE and AFTER the target text. 
Use empty string to create a new file.","type":"string"}},"required":["file_path","old_string","new_string","expected_replacements"],"type":"object"}
obj_new_task: %json {"additionalProperties":false,"properties":{"message":{"description":"Initial user instructions or context for the new task","type":"string"},"mode":{"description":"Slug of the mode to begin the new task in (e.g., code, debug, architect)","type":"string"},"todos":{"description":"Optional initial todo list written as a markdown checklist; required when the workspace mandates todos","type":["string","null"]}},"required":["mode","message","todos"],"type":"object"} | %json {"additionalProperties":false,"properties":{"message":{"description":"Initial user instructions or context for the new task","type":"string"},"mode":{"description":"Slug of the mode to begin the new task in (e.g., code, debug, architect)","type":"string"},"todos":{"description":"Optional initial todo list written as a markdown checklist; required when the workspace mandates todos","type":["string","null"]}},"required":["mode","message","todos"],"type":"object"}
obj_attempt_completion: %json {"additionalProperties":false,"properties":{"result":{"description":"Final result message to deliver to the user once the task is complete","type":"string"}},"required":["result"],"type":"object"} | %json {"additionalProperties":false,"properties":{"result":{"description":"Final result message to deliver to the user once the task is complete","type":"string"}},"required":["result"],"type":"object"}
obj_search_files: %json {"additionalProperties":false,"properties":{"file_pattern":{"description":"Optional glob to limit which files are searched (e.g., *.ts)","type":["string","null"]},"path":{"description":"Directory to search recursively, relative to the workspace","type":"string"},"regex":{"description":"Rust-compatible regular expression pattern to match","type":"string"}},"required":["path","regex","file_pattern"],"type":"object"} | %json {"additionalProperties":false,"properties":{"file_pattern":{"description":"Optional glob to limit which files are searched (e.g., *.ts)","type":["string","null"]},"path":{"description":"Directory to search recursively, relative to the workspace","type":"string"},"regex":{"description":"Rust-compatible regular expression pattern to match","type":"string"}},"required":["path","regex","file_pattern"],"type":"object"}
tool_call: <[151657]> tool_obj <[151658]>
obj_codebase_search: %json {"additionalProperties":false,"properties":{"path":{"description":"Optional subdirectory (relative to the workspace) to limit the search scope","type":["string","null"]},"query":{"description":"Meaning-based search query describing the information you need","type":"string"}},"required":["query","path"],"type":"object"} | %json {"additionalProperties":false,"properties":{"path":{"description":"Optional subdirectory (relative to the workspace) to limit the search scope","type":["string","null"]},"query":{"description":"Meaning-based search query describing the information you need","type":"string"}},"required":["query","path"],"type":"object"}
tool_obj: %json {"type":"object","properties":{"name":{"type":"string"},"arguments":{"type":"object"}},"required":["name","arguments"]}
TEXT: /(?s:.*)/
obj_fetch_instructions: %json {"additionalProperties":false,"properties":{"task":{"description":"Task identifier to fetch instructions for","enum":["create_mcp_server","create_mode"],"type":"string"}},"required":["task"],"type":"object"} | %json {"additionalProperties":false,"properties":{"task":{"description":"Task identifier to fetch instructions for","enum":["create_mcp_server","create_mode"],"type":"string"}},"required":["task"],"type":"object"}
json_array: "[" obj ("," obj)* "]"
obj_apply_diff: %json {"additionalProperties":false,"properties":{"diff":{"description":"A string containing one or more search/replace blocks defining the changes. The ':start_line:' is required and indicates the starting line number of the original content. You must not add a start line for the replacement content. Each block must follow this format:\n<<<<<<< SEARCH\n:start_line:[line_number]\n-------\n[exact content to find]\n=======\n[new content to replace with]\n>>>>>>> REPLACE","type":"string"},"path":{"description":"The path of the file to modify, relative to the current workspace directory.","type":"string"}},"required":["path","diff"],"type":"object"} | %json {"additionalProperties":false,"properties":{"diff":{"description":"A string containing one or more search/replace blocks defining the changes. The ':start_line:' is required and indicates the starting line number of the original content. You must not add a start line for the replacement content. Each block must follow this format:\n<<<<<<< SEARCH\n:start_line:[line_number]\n-------\n[exact content to find]\n=======\n[new content to replace with]\n>>>>>>> REPLACE","type":"string"},"path":{"description":"The path of the file to modify, relative to the current workspace directory.","type":"string"}},"required":["path","diff"],"type":"object"}
2026-03-07T15:39:35.315629Z WARN vllm_rs::core::engine: [Stream] New request [Seq_id 179, 541541 tokens] received! (session_id: None)
The above allowed Q3Coder to use KiloCode sub-tasks, which i've never seen it do before (and no, i don't allow vsix' to randomly update from god knows where, so it's not a change in kilocode): but without this line - The problem with these things from an expression-grammar perspective is that they aren't part of The tool for extracting
should have been in the last commit, but it looks like i accidentally omitted adding that examples/ subtree for the extractor, so i will get it pushed up in a separate PR for the SpecialTokens EOS work (plan to do the same for every token type we have to manually define and then have model-appropriate extractors do the heavy lifting instead of the default map i'm using right now - gemma and mistral "other" sections are interesting). This PR, aside from what i suspect might be an architectural limitation w/ XML tags, should be ready for your and your bots' QA, at which point we can either merge it first or i can pare back the extensive tests added after i get your 👍 . Strongly urge trying this with coding agents on the openai service instead of the claude one - claude supports native grammar generation in their API but so do we now, and the auto tool gen thing is apparently extremely useful to the agents. |
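For readers following the grammar dump above, here is a minimal sketch of how per-tool rules of that shape could be emitted. It is not the PR's `build_tool_call_lark_grammar()`; `ToolSpec`, `rule_suffix`, and `build_tool_rules` are hypothetical names, and the schema is passed through as raw JSON text. Only the `tool_call` and `obj_*` rules are produced, so the output is a fragment rather than a complete grammar, and each schema is emitted once instead of on both sides of a `|`.

```rust
/// Hypothetical tool description: a name plus its JSON Schema as raw text.
pub struct ToolSpec<'a> {
    pub name: &'a str,
    pub schema_json: &'a str,
}

/// Map a tool name onto a rule-name suffix (lowercase ASCII letters,
/// digits, and '_'); every other character becomes '_'.
fn rule_suffix(name: &str) -> String {
    name.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() {
                c.to_ascii_lowercase()
            } else {
                '_'
            }
        })
        .collect()
}

/// Emit a `tool_call` rule that wraps `tool_obj` in the open/close special
/// token ids, followed by one `obj_<tool>: %json <schema>` rule per tool.
pub fn build_tool_rules(tools: &[ToolSpec<'_>], open_id: u32, close_id: u32) -> String {
    let mut out = format!("tool_call: <[{open_id}]> tool_obj <[{close_id}]>\n");
    for tool in tools {
        out.push_str(&format!(
            "obj_{}: %json {}\n",
            rule_suffix(tool.name),
            tool.schema_json
        ));
    }
    out
}
```

The token ids (151657 and 151658 in the dump) are model-specific, so the sketch takes them as parameters rather than hard-coding them.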
|
BTW - ...
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" %}
{%- set content = render_content(message.content, false)|trim %}
{%- if not(content.startswith('<tool_response>') and content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endif %}
i had an idiomatic jinja template parser in the works when i first started on this PR. Might be worth revisiting, as well as generating parsers + per-token output validators out of the generated constraints, since we know which seq got which constraint. |
|
@guoqingbao - in its "default state" this PR won't do anything bad to the code; the user has to enable the grammar/constraint and tool generation flags. It is a lot of work though, and it'll keep running into merge conflicts. It also works incredibly well in JSON mode (XML is inherently a problem because it's not a stateless grammar, but this work supports the forced parser option and that works great). I'll fix whatever changed last night, but any chance we could get this in sooner rather than later so i'm not chasing rebases every morning? 😄 |
This implements the full llguidance integration, enabling grammar-constrained inference for structured outputs, tool calling, and custom constraints.

Architecture:
- TopLevelGrammar serialized via rmp_serde across RPC boundaries
- Grammar flows: Server → params.grammar → Runner → GuidanceState → Matcher
- Inline correction via logits masking during sampling
- Post-process correction via rollback on validation failure

Key components:
- params.grammar field in SamplingParams for RPC serialization GuidanceState
- GuidanceState::new() with Matcher state management
- GuidanceState::reset() for proper state cleanup
- Rollback counter (MAX_ROLLBACK_ATTEMPTS=3) preventing infinite loops
- guidance_failed/guidance_mismatch sets cleared on rollback
- Vocab size validation in build_llg_factory()
- Lark grammar generation from tools via build_tool_call_lark_grammar()

CLI flags:
- --enable-tool-grammar: Auto-build LLG grammar from MCP tools
- --allow-constraint-api: Accept client-provided structured_outputs/response_format
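The two correction paths named in the commit message (inline logits masking and bounded rollback) can be sketched in isolation. This is a simplified illustration, not the PR's `GuidanceState`: `TokenMask`, `apply_mask`, `argmax`, and `RollbackBudget` are hypothetical names, and greedy argmax stands in for the real sampler. The bit layout (32 tokens per `u32` word) and the limit of 3 follow the figures quoted in this thread.

```rust
/// Bit set over token ids, 32 tokens per `u32` word.
pub struct TokenMask {
    words: Vec<u32>,
}

impl TokenMask {
    pub fn new(vocab_size: usize) -> Self {
        Self { words: vec![0; (vocab_size + 31) / 32] }
    }

    /// Mark `tok` as allowed: word index is tok / 32, bit index is tok % 32.
    pub fn allow(&mut self, tok: usize) {
        self.words[tok >> 5] |= 1u32 << (tok & 31);
    }

    pub fn is_allowed(&self, tok: usize) -> bool {
        self.words
            .get(tok >> 5)
            .map_or(false, |w| w & (1u32 << (tok & 31)) != 0)
    }
}

/// Inline correction: every token the grammar disallows gets -inf,
/// so it can never be chosen by the sampling step that follows.
pub fn apply_mask(logits: &mut [f32], mask: &TokenMask) {
    for (tok, logit) in logits.iter_mut().enumerate() {
        if !mask.is_allowed(tok) {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick among the finite logits; `None` if everything was masked.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| l.is_finite())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}

pub const MAX_ROLLBACK_ATTEMPTS: u32 = 3;

/// Post-process correction guard: allows a bounded number of rollbacks so a
/// sequence that keeps failing validation cannot loop forever.
pub struct RollbackBudget {
    attempts: u32,
}

impl RollbackBudget {
    pub fn new() -> Self {
        Self { attempts: 0 }
    }

    /// Returns true and counts the attempt while budget remains.
    pub fn try_rollback(&mut self) -> bool {
        if self.attempts >= MAX_ROLLBACK_ATTEMPTS {
            return false;
        }
        self.attempts += 1;
        true
    }
}
```

Once `try_rollback` returns false, the caller would have to stop retrying and fall back to whatever failure handling the engine defines; that part is outside this sketch.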
c4d367b to
9a1ab79
Compare
|
That was stupid - i manually cherry-picked that commit and GH somehow didn't "get it" - should be all set now |
|
@guoqingbao one question remains in the logic which may be irrelevant actually but figure i'll ask: does the math care whether we bias logits for user-supplied or default params first and then apply our mask atop the biased logits or do we mask first and then apply penalties/temps/etc? My thinking here is that the mask should be applied to the already constrained/biased logits since it eliminates "remaining possibilities" vs application over mask but i'm not 100% clear on whether it actually matters. LlamaCPP masks before parameterized biasing but i've been using this non-stop for days and it seems to work great (no overruns, nada - i do have #260 in my branch as well as the precision flags). I did try theirs but i'm finding we now handle tool calling better than ... anything else i've tried. Only problem we still have are client-supplied tool-choices being wrong 😁 - i've run into a few cases where the agent supplies "helpful hints" in their tool-call options which parse out to "guided nonsense" but that's generally immature/novice projects or the ever-present AI workslop stuff where some agent thought it was more clever than spec. |
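On the ordering question above, a small worked example may help. It does not say what vllm.rs or LlamaCPP actually do. It only shows, under simplified assumptions, that elementwise transforms such as temperature scaling give the same result on either side of a `-inf` mask, while a rank-based filter such as top-k does not. All function names here are hypothetical, and `keep_top_k` is a naive threshold version that may keep extra entries on ties.

```rust
/// Grammar mask: set every disallowed position to -inf.
pub fn mask_disallowed(logits: &mut [f32], allowed: &[bool]) {
    for (l, ok) in logits.iter_mut().zip(allowed) {
        if !*ok {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Elementwise temperature scaling (assumes t > 0).
pub fn scale_temperature(logits: &mut [f32], t: f32) {
    for l in logits.iter_mut() {
        *l /= t;
    }
}

/// Keep the k largest logits and set the rest to -inf.
pub fn keep_top_k(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return;
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

/// Indices that can still be sampled (finite logits).
pub fn survivors(logits: &[f32]) -> Vec<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| l.is_finite())
        .map(|(i, _)| i)
        .collect()
}
```

With logits `[3.0, 2.0, 1.0, 0.5]`, a mask that allows only indices 2 and 3, and `k = 2`: applying top-k first keeps indices 0 and 1, which the mask then removes, leaving nothing to sample. Masking first leaves indices 2 and 3, and top-k keeps both. So if any rank- or mass-based filter (top-k, top-p) is in the chain, masking before it is the order that cannot strand the sampler with an empty set.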
|
Replaced by #262 |


Architecture
sequenceDiagram
    participant User
    participant API
    participant Pipeline
    participant LLGFactory
    participant Matcher
    participant TokenParser
    participant EarleyParser
    participant Lexer
    participant TokTrie
    participant Sampler
    participant LogitsProcessor
    participant Model

    User->>API: Request with constraint (regex/json_schema/lark/llguidance)

    Note over User,API: Phase 1: Request Setup and Grammar Building
    API->>Pipeline: build_llg_factory(tokenizer)
    Pipeline->>LLGFactory: toktrie_hf_tokenizers::ByteTokenizer::from_tokenizer(tokenizer)
    LLGFactory->>TokTrie: Create token trie from tokenizer vocabulary
    TokTrie-->>LLGFactory: Return TokEnv with trie
    LLGFactory->>LLGFactory: ParserFactory::new_simple(&env)
    LLGFactory-->>Pipeline: Return Arc<ParserFactory>
    Pipeline->>Pipeline: llg_grammar_from_constraint(&request.constraint)
    Pipeline->>Matcher: constraint_from_llg_grammar(&factory, grm)
    Matcher->>Matcher: factory.create_parser(grm)
    Matcher->>TokenParser: Create with grammar_init
    TokenParser->>EarleyParser: Build CGrammar from grammar
    TokenParser->>Lexer: Build LexerSpec from grammar
    Lexer->>TokTrie: Precompute large lexemes if needed
    TokTrie-->>Lexer: Return optimized lexeme sets

    Note over User,Matcher: Phase 2: Prompt Processing (if needed)
    User->>API: Optional: process_prompt(prompt_tokens)
    API->>TokenParser: process_prompt(prompt_tokens)
    TokenParser->>TokenParser: tokenize_bytes_marker(&prompt_bytes)
    TokenParser->>TokenParser: process_prompt() returns new prompt

    Note over User,Matcher: Phase 3: Inference Loop
    loop for each token generation
        Model->>Model: Forward pass on input tokens
        Model-->>Pipeline: Return logits tensor
        Pipeline->>Sampler: sample_sequence(logits, seq, ...)

        Note over Sampler: Two-stage sampling with llguidance
        Sampler->>LogitsProcessor: Apply llguidance constraint
        LogitsProcessor->>TokenParser: compute_mask()
        TokenParser->>TokenParser: compute_mask_inner()
        TokenParser->>EarleyParser: run_speculative("compute_mask")
        EarleyParser->>EarleyParser: trie_started("compute_mask")
        EarleyParser->>EarleyParser: compute_bias()
        EarleyParser->>Lexer: compute_bias() with token_prefix

        Note over Lexer,TokTrie: Lexical Scope Analysis
        Lexer->>TokTrie: Walk token trie for allowed lexemes
        TokTrie-->>Lexer: Return SimpleVob bit mask
        Lexer->>EarleyParser: Return mask to TokenParser
        TokenParser->>TokenParser: cache mask for fast-forward
        TokenParser-->>LogitsProcessor: Return SimpleVob mask
        LogitsProcessor->>LogitsProcessor: Check if sampled token is allowed
        LogitsProcessor->>Sampler: Apply logit biasing
        alt Token is allowed
            Sampler->>Sampler: No biasing needed
        else Token is not allowed
            Sampler->>Sampler: Set invalid tokens to -f32::INFINITY
            Sampler->>Sampler: Re-sample with biased logits
        end

        Sampler->>TokenParser: consume_token(sampled_token)
        TokenParser->>TokenParser: apply_token(sampled_token)
        TokenParser->>TokenParser: llm_tokens.push(sampled_token)
        TokenParser->>TokenParser: llm_bytes.extend(token_bytes)
        TokenParser->>EarleyParser: parser.apply_token(token_bytes, token_id)
        EarleyParser->>Lexer: advance lexer state
        Lexer->>Lexer: Update lexer_stack with new state
        Lexer->>EarleyParser: Return backtrack count
        alt Backtrack needed
            EarleyParser->>EarleyParser: rollback(backtrack_bytes)
            EarleyParser->>EarleyParser: Update llm_tokens and llm_bytes
        end
        TokenParser->>TokenParser: check_stop()
        TokenParser-->>Sampler: Return CommitResult

        Note over Sampler: Phase 4: Fast-Forward (if enabled)
        Sampler->>TokenParser: compute_ff_tokens()
        TokenParser->>TokenParser: ff_tokens()
        TokenParser->>TokTrie: Tokenize forced bytes
        TokTrie-->>TokenParser: Return fast-forward tokens
        alt Fast-forward tokens available
            TokenParser->>TokenParser: consume_ff_tokens()
            loop for each ff_token
                TokenParser->>TokenParser: consume_token(ff_token)
                TokenParser->>TokenParser: llm_tokens.push(ff_token)
                TokenParser->>TokenParser: llm_bytes.extend(ff_token_bytes)
            end
        end

        Note over Sampler: Phase 5: Speculative Decoding (if enabled)
        Model->>Model: Draft model forward pass
        Model-->>Pipeline: Return draft logits
        Pipeline->>Sampler: sample_target_sequence_speculative()
        Sampler->>TokenParser: rollback(n_toks)
        TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
        EarleyParser->>Lexer: pop lexer states
        Lexer-->>TokenParser: Return rollback result
        Sampler->>Sampler: Sample draft tokens
        Sampler->>TokenParser: validate_tokens(draft_tokens)
        TokenParser->>TokenParser: consume_token(draft_token)
        alt Draft token accepted
            TokenParser->>TokenParser: Continue with next draft
        else Draft token rejected
            TokenParser->>TokenParser: Accept partial draft
            TokenParser->>TokenParser: Rollback to last valid state
        end
    end

    Note over User,Matcher: Phase 6: Token Geometry and Binary Data State
    TokTrie->>TokTrie: Token encoding (8:24 bit split)
    TokTrie->>TokTrie: node.bits = (token_id << 8) | byte
    TokTrie->>TokTrie: node.bits2 = (subtree_size << 10) | num_parents
    TokTrie->>SimpleVob: Bit mask storage
    SimpleVob->>SimpleVob: data: Vec<u32> (32 tokens per word)
    SimpleVob->>SimpleVob: allow_token(tok): data[tok>>5] |= 1 << (tok&31)

    Note over User,Matcher: Phase 7: Rollback and Verification
    TokenParser->>TokenParser: validate_tokens(tokens)
    TokenParser->>EarleyParser: validate_tokens_raw(tokens)
    EarleyParser->>Lexer: Check if tokens match current lexer state
    Lexer-->>TokenParser: Return number of valid tokens
    TokenParser->>TokenParser: rollback(n_tokens)
    TokenParser->>EarleyParser: parser.rollback(bytes_to_drop)
    EarleyParser->>Lexer: pop lexer states
    TokenParser->>TokenParser: llm_tokens.truncate(new_len)
    TokenParser->>TokenParser: llm_bytes.truncate(new_len)

    Note over User,Matcher: Phase 8: Response Generation
    Pipeline->>API: Return completion with tokens
    API->>User: Stream or return final response

Grammar Composition
sequenceDiagram
    participant User
    participant Server
    participant ConstraintBuilder
    participant ToolGrammarBuilder
    participant ComposeLogic
    participant LarkParser
    participant EarleyCompiler

    User->>Server: Request with tools + structured_outputs
    Server->>ConstraintBuilder: grammar_fragment_from_structured_outputs()
    ConstraintBuilder->>LarkParser: Parse JSON schema
    LarkParser->>ConstraintBuilder: Return TopLevelGrammar
    Server->>ToolGrammarBuilder: build_json_tool_lark_grammar() if enabled
    ToolGrammarBuilder->>LarkParser: Build tool call grammar
    LarkParser->>ToolGrammarBuilder: Return tool TopLevelGrammar

    Note over ComposeLogic: compose_grammars()
    ComposeLogic->>ComposeLogic: Determine match arm based on:
    ComposeLogic->>ComposeLogic: - constraint_grammars length
    ComposeLogic->>ComposeLogic: - tool_grammar presence
    ComposeLogic->>ComposeLogic: - tool_choice_required
    ComposeLogic->>ComposeLogic: - forced_tool_name presence
    ComposeLogic->>ComposeLogic: If multiple grammars: merge_top_level_grammars()
    ComposeLogic->>EarleyCompiler: Compile direct alternation
    EarleyCompiler->>LexerBuilder: Build lexer spec
    LexerBuilder->>ComposeLogic: Return single TopLevelGrammar

References