Enforce guided model output for tool calling by guoqingbao · Pull Request #208 · guoqingbao/xinfer

guoqingbao · 2026-01-29T04:13:54Z

No description provided.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24b24916c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-01-29T04:19:04Z

    #[serde(default)]
    pub mcp_mode: Option<bool>,
+    #[serde(skip)]
+    pub constraint: Option<Constraint>,


Preserve constraints when serializing SamplingParams

In process/runner mode (multi-GPU or nccl), SamplingParams is serialized inside Sequence for MessageType::RunPrefill/RunDecode. With #[serde(skip)] on constraint, the JSON schema never reaches the runner, so guided decoding is silently disabled for tool calls even when the server sets a constraint. This breaks the new “enforce guided output” behavior whenever use_runner is true, because the masks are never applied in the runner process.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-01-29T04:19:04Z

+                        GuidanceState::new(factory.clone(), constraint)
+                            .expect("Failed to create guidance state")


Avoid panicking on invalid tool schema constraints

The guided decoding path unconditionally expects a GuidanceState when a constraint is present. If the tool schema JSON is unsupported by llguidance (or malformed), GuidanceState::new returns Err and this expect will panic, crashing the request (and likely the runner). This is user-input dependent because tool schemas are supplied by API clients, so a single incompatible schema can take down the worker instead of gracefully falling back to unguided decoding.

Useful? React with 👍 / 👎.

An error of this nature should probably be returned into the conversation stream sanitized of internal structural data but containing the relevant constraint being violated by the message content to permit participants to remedy their structure and continue the back-and-forth cadence of discussion instead of having to pollute the KV cache with errant state to resume.

sempervictus · 2026-01-29T17:36:33Z

Hmm, SamplingParams seems to need a bit more love:

115.3    Compiling vllm-rs v0.8.11 (/vllm.rs)
117.4 error[E0063]: missing field `constraint` in initializer of `SamplingParams`
117.4    --> src/py/mod.rs:371:9
117.4     |
117.4 371 |         Self {
117.4     |         ^^^^ missing `constraint`
117.4 
117.4 error[E0063]: missing field `constraint` in initializer of `SamplingParams`
117.4    --> src/py/mod.rs:389:9
117.4     |
117.4 389 |         Self {
117.4     |         ^^^^ missing `constraint`
117.4 
120.2 For more information about this error, try `rustc --explain E0063`.
120.2 error: could not compile `vllm-rs` (lib) due to 2 previous errors

Same effect on FA and non-FA build targets.

guoqingbao · 2026-01-30T00:21:20Z

Hmm, SamplingParams seems to need a bit more love:

This PR currently is not in working state. I found a more viable way in dealing with tool call. llguidance is not necessary for tool calling, it's an addictive feature for users to control structured output.

sempervictus · 2026-01-30T01:29:16Z

From what i've read in the last day it seems like an potential way to help govern issues like #169 though for tool calls specifically i was looking at nom or similar approaches in the byte stream to try and pattern-match tool calls for either being the first set of characters in a message or being the first on a line within a message per what seem the most "conventional" calling semantics (unless inside a code or thinking block in which case all bytes should be shown including the tags).

guoqingbao · 2026-01-30T02:12:58Z

From what i've read in the last day it seems like an potential way to help govern issues like #169 though for tool calls specifically i was looking at nom or similar approaches in the byte stream to try and pattern-match tool calls for either being the first set of characters in a message or being the first on a line within a message per what seem the most "conventional" calling semantics (unless inside a code or thinking block in which case all bytes should be shown including the tags).

The community best practice is writing model-specific parsers.

sempervictus · 2026-01-30T02:37:05Z

The community best practice is writing model-specific parsers.

which raises the question of "what is the proper domain-specific-language for the task?" since stream parsing is really the crux of the problem where buffering and such can impact the suspension of disbelief

guoqingbao · 2026-01-30T13:32:53Z

The community best practice is writing model-specific parsers.

which raises the question of "what is the proper domain-specific-language for the task?" since stream parsing is really the crux of the problem where buffering and such can impact the suspension of disbelief

Implemented in #210

sempervictus · 2026-02-02T01:04:20Z

@guoqingbao - what're your thoughts on the remainder of this effort? Articles like this one strongly suggest that with repetitive data, long-context, and more or less "real world workloads," generation of masks to force output into a proper format helps significantly.

guoqingbao · 2026-02-02T02:09:26Z

@guoqingbao - what're your thoughts on the remainder of this effort? Articles like this one strongly suggest that with repetitive data, long-context, and more or less "real world workloads," generation of masks to force output into a proper format helps significantly.

The guided decoding feature can be somewhat addictive as a way to force the model to produce output in a given format. However, this is not always useful for tool calling. Most frameworks treat guided decoding as an optional feature, allowing users to provide a specific schema in the request that the model can follow. This approach is typically more effective for generating structured outputs such as Excel files, PowerPoint presentations, and other formatted content, rather than for tool calling.

State-of-the-art models are trained to generate tool call formats natively and generally work well with the chat templates provided by model creators. Forcing models to adhere to a specific tool call schema using llguidance or similar libraries can actually degrade output quality and disrupt tool-calling behavior.

In my tests with Qwen3 models, the results with guided decoding were not positive.

sempervictus · 2026-02-02T02:29:38Z

Apologies, i dont mean necessarily specific for tool calling but as a feature which could be leveraged for structured document builds, API communication, etc as well as IO semantic changes such as hedl which might be extremely handy if it actually can make significant dents in the token counts being fed into the KV cache since that reduces the space over which attention has to be computed in the long-run.

Happy to pick this up when i have a few cycles to experiment, i imagine the decoding speed and cutlass/flashinfer code are far more beneficial

guoqingbao · 2026-02-02T03:21:38Z

Happy to pick this up when i have a few cycles to experiment, i imagine the decoding speed and cutlass/flashinfer code are far more beneficial

Will add this feature for sure, but current priority is flashinfer.

sempervictus · 2026-02-02T05:03:19Z

Will add this feature for sure, but current priority is flashinfer.

sempervictus · 2026-02-11T09:27:26Z

Well, now we know why other runtimes use this 😭 - running the 235 w/ flashinfer i just got:

But first step is creating the adapter file with proper includes and function implementations.
</think>


Error: Failed to call chat-completions api

Caused by:
    0: Tool call 'fs_create' have non-JSON arguments '{"path":"/attention.rs/src/kernels/src/flashmla_sm120_adapter.cu","contents":"#include <cuda_runtime.h>\n#include <cuda_bf16.h>\n#include <vector>\n\n#ifdef USE_FLASHMLA_SM120\n    #include \"smxx/get_mla_metadata.cuh\"\n    #include \"smxx/mla_combine.cuh\"\n    #include \"sm120/prefill/dense/fmha_cutlass_fwd_sm120.cu\"\n    #if defined(SM_90_PASS)\n        // SM90-specific includes not used for SM120\n        static_assert(false, \"SM_90_PASS should not be set for SM120 adapter\");\n    #else\n        // Standard prefill interface for non-Hopper Blackwell variants\n        #define FLASHMLA_SM120_PREFILL_FWD_KERNELS \\\n            FMHA_FWD_TMA_WARPSPECIALIZED(64, 64) \\\n            FMHA_FWD_TMA_WARPSPECIALIZED(32, 64) \\\n            FMHA_FWD_TMA_WARPSPECIALIZED(32, 32)\n    #endif\n#endif\n\nextern \"C\" {\nvoid flashmla_append_kv_cache(\n    void* k_data_ptr,\n    void* v_data_ptr,\n    void* new_k_ptr,\n    void* new_v_ptr,\n    int32_t* paged_kv_indices,\n    int32_t* paged_kv_indptr,\n    int32_t* paged_kv_last_len,\n    int32_t batch_size);\nv}\nopt/code/Rust/ML/Inference/attention.rs/src/kernels/src/flashmla_sm100_adapter.cu"'
    1: EOF while parsing an object at line 1 column 1180

guoqingbao · 2026-02-11T09:32:41Z

Well, now we know why other runtimes use this 😭 - running the 235 w/ flashinfer i just got:

Is vLLM can working with this model for this task?

sempervictus · 2026-02-11T09:34:44Z

Even at NVFP4 this would be too large for the device - it shares host and VRAM but the 80B model always produces #221 think tags and makes tool calls like a machine gun accurately and with scoping parameters when it can... its impressive (like "llama4 goes on ice to run it" impressive)

guoqingbao · 2026-02-11T09:35:14Z

But first step is creating the adapter file with proper includes and function implementations.
</think>

This may also related to the think tag, not sure why certain models don't output the start flag.

sempervictus · 2026-03-01T00:37:03Z

@guoqingbao - now that i have some "lessons learned" docs for the coder, i'm having it take a pass at integrating this over #232 since you've already implemented most of what i still need to add re complex grammar generation. There's some code i need to pull from #232 but i think the way to go will be to base constraint code on this work since we've proven the integration can function on the skinny version already.

sempervictus · 2026-03-01T06:59:29Z

Making good progress, got to the part where LLMs break again - Sequences aren't mutable most of the time because they're in refs and the API change to enact that would be brutal 😄 so going try and take a few to wrap this part by hand but i'm hoping to have everything you started here done

guoqingbao · 2026-03-16T02:28:05Z

Finished on #262

Enforce guided model output for tool calling

24b2491

chatgpt-codex-connector Bot reviewed Jan 29, 2026

View reviewed changes

guoqingbao marked this pull request as draft January 29, 2026 04:37

guoqingbao added 5 commits January 29, 2026 07:02

Improve tool call handling

b008a66

Resolve llguidance issue

74c9ce3

Fix claude code tool calling

88ae722

Fix build

ed586a8

Use llguidance upon requests

3b8cb1f

This was referenced Feb 13, 2026

Support Qwen3.5 and Qwen3-Next models #220

Merged

Add Qwen3-Next Coder parser fallback and adjust routing to qwen_coder #228

Closed

Implement LLGuidance #232

Closed

sempervictus mentioned this pull request Mar 11, 2026

Enable Reasoning via Guided Enforcement #262

Merged

guoqingbao closed this Mar 16, 2026

		GuidanceState::new(factory.clone(), constraint)
		.expect("Failed to create guidance state")

Conversation

guoqingbao commented Jan 29, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

sempervictus Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

sempervictus commented Jan 29, 2026

Uh oh!

guoqingbao commented Jan 30, 2026

Uh oh!

sempervictus commented Jan 30, 2026

Uh oh!

guoqingbao commented Jan 30, 2026

Uh oh!

sempervictus commented Jan 30, 2026

Uh oh!

guoqingbao commented Jan 30, 2026

Uh oh!

sempervictus commented Feb 2, 2026

Uh oh!

guoqingbao commented Feb 2, 2026

Uh oh!

sempervictus commented Feb 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

guoqingbao commented Feb 2, 2026

Uh oh!

sempervictus commented Feb 2, 2026

Uh oh!

sempervictus commented Feb 11, 2026

Uh oh!

guoqingbao commented Feb 11, 2026

Uh oh!

sempervictus commented Feb 11, 2026

Uh oh!

guoqingbao commented Feb 11, 2026

Uh oh!

sempervictus commented Mar 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sempervictus commented Mar 1, 2026

Uh oh!

guoqingbao commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sempervictus commented Feb 2, 2026 •

edited

Loading

sempervictus commented Mar 1, 2026 •

edited

Loading