Skip to content

Enforce guided model output for tool calling#208

Closed
guoqingbao wants to merge 6 commits into
mainfrom
guided_tool
Closed

Enforce guided model output for tool calling#208
guoqingbao wants to merge 6 commits into
mainfrom
guided_tool

Conversation

@guoqingbao
Copy link
Copy Markdown
Owner

No description provided.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24b24916c6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/utils/config.rs
Comment on lines 442 to +445
#[serde(default)]
pub mcp_mode: Option<bool>,
#[serde(skip)]
pub constraint: Option<Constraint>,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve constraints when serializing SamplingParams

In process/runner mode (multi-GPU or nccl), SamplingParams is serialized inside Sequence for MessageType::RunPrefill/RunDecode. With #[serde(skip)] on constraint, the JSON schema never reaches the runner, so guided decoding is silently disabled for tool calls even when the server sets a constraint. This breaks the new “enforce guided output” behavior whenever use_runner is true, because the masks are never applied in the runner process.

Useful? React with 👍 / 👎.

Comment thread src/core/runner.rs Outdated
Comment on lines +724 to +725
GuidanceState::new(factory.clone(), constraint)
.expect("Failed to create guidance state")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid panicking on invalid tool schema constraints

The guided decoding path unconditionally expects a GuidanceState when a constraint is present. If the tool schema JSON is unsupported by llguidance (or malformed), GuidanceState::new returns Err and this expect will panic, crashing the request (and likely the runner). This is user-input dependent because tool schemas are supplied by API clients, so a single incompatible schema can take down the worker instead of gracefully falling back to unguided decoding.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An error of this nature should probably be returned into the conversation stream sanitized of internal structural data but containing the relevant constraint being violated by the message content to permit participants to remedy their structure and continue the back-and-forth cadence of discussion instead of having to pollute the KV cache with errant state to resume.

@guoqingbao guoqingbao marked this pull request as draft January 29, 2026 04:37
@sempervictus
Copy link
Copy Markdown
Contributor

Hmm, SamplingParams seems to need a bit more love:

115.3    Compiling vllm-rs v0.8.11 (/vllm.rs)
117.4 error[E0063]: missing field `constraint` in initializer of `SamplingParams`
117.4    --> src/py/mod.rs:371:9
117.4     |
117.4 371 |         Self {
117.4     |         ^^^^ missing `constraint`
117.4 
117.4 error[E0063]: missing field `constraint` in initializer of `SamplingParams`
117.4    --> src/py/mod.rs:389:9
117.4     |
117.4 389 |         Self {
117.4     |         ^^^^ missing `constraint`
117.4 
120.2 For more information about this error, try `rustc --explain E0063`.
120.2 error: could not compile `vllm-rs` (lib) due to 2 previous errors

Same effect on FA and non-FA build targets.

@guoqingbao
Copy link
Copy Markdown
Owner Author

Hmm, SamplingParams seems to need a bit more love:

This PR currently is not in working state. I found a more viable way in dealing with tool call. llguidance is not necessary for tool calling, it's an addictive feature for users to control structured output.

@sempervictus
Copy link
Copy Markdown
Contributor

From what i've read in the last day it seems like an potential way to help govern issues like #169 though for tool calls specifically i was looking at nom or similar approaches in the byte stream to try and pattern-match tool calls for either being the first set of characters in a message or being the first on a line within a message per what seem the most "conventional" calling semantics (unless inside a code or thinking block in which case all bytes should be shown including the tags).

@guoqingbao
Copy link
Copy Markdown
Owner Author

From what i've read in the last day it seems like an potential way to help govern issues like #169 though for tool calls specifically i was looking at nom or similar approaches in the byte stream to try and pattern-match tool calls for either being the first set of characters in a message or being the first on a line within a message per what seem the most "conventional" calling semantics (unless inside a code or thinking block in which case all bytes should be shown including the tags).

The community best practice is writing model-specific parsers.

@sempervictus
Copy link
Copy Markdown
Contributor

The community best practice is writing model-specific parsers.

which raises the question of "what is the proper domain-specific-language for the task?" since stream parsing is really the crux of the problem where buffering and such can impact the suspension of disbelief

@guoqingbao
Copy link
Copy Markdown
Owner Author

The community best practice is writing model-specific parsers.

which raises the question of "what is the proper domain-specific-language for the task?" since stream parsing is really the crux of the problem where buffering and such can impact the suspension of disbelief

Implemented in #210

@sempervictus
Copy link
Copy Markdown
Contributor

@guoqingbao - what're your thoughts on the remainder of this effort? Articles like this one strongly suggest that with repetitive data, long-context, and more or less "real world workloads," generation of masks to force output into a proper format helps significantly.

@guoqingbao
Copy link
Copy Markdown
Owner Author

@guoqingbao - what're your thoughts on the remainder of this effort? Articles like this one strongly suggest that with repetitive data, long-context, and more or less "real world workloads," generation of masks to force output into a proper format helps significantly.

The guided decoding feature can be somewhat addictive as a way to force the model to produce output in a given format. However, this is not always useful for tool calling. Most frameworks treat guided decoding as an optional feature, allowing users to provide a specific schema in the request that the model can follow. This approach is typically more effective for generating structured outputs such as Excel files, PowerPoint presentations, and other formatted content, rather than for tool calling.

State-of-the-art models are trained to generate tool call formats natively and generally work well with the chat templates provided by model creators. Forcing models to adhere to a specific tool call schema using llguidance or similar libraries can actually degrade output quality and disrupt tool-calling behavior.

In my tests with Qwen3 models, the results with guided decoding were not positive.

@sempervictus
Copy link
Copy Markdown
Contributor

sempervictus commented Feb 2, 2026

Apologies, i dont mean necessarily specific for tool calling but as a feature which could be leveraged for structured document builds, API communication, etc as well as IO semantic changes such as hedl which might be extremely handy if it actually can make significant dents in the token counts being fed into the KV cache since that reduces the space over which attention has to be computed in the long-run.

Happy to pick this up when i have a few cycles to experiment, i imagine the decoding speed and cutlass/flashinfer code are far more beneficial

@guoqingbao
Copy link
Copy Markdown
Owner Author

Happy to pick this up when i have a few cycles to experiment, i imagine the decoding speed and cutlass/flashinfer code are far more beneficial

Will add this feature for sure, but current priority is flashinfer.

@sempervictus
Copy link
Copy Markdown
Contributor

Will add this feature for sure, but current priority is flashinfer.

image

@sempervictus
Copy link
Copy Markdown
Contributor

Well, now we know why other runtimes use this 😭 - running the 235 w/ flashinfer i just got:

But first step is creating the adapter file with proper includes and function implementations.
</think>


Error: Failed to call chat-completions api

Caused by:
    0: Tool call 'fs_create' have non-JSON arguments '{"path":"/attention.rs/src/kernels/src/flashmla_sm120_adapter.cu","contents":"#include <cuda_runtime.h>\n#include <cuda_bf16.h>\n#include <vector>\n\n#ifdef USE_FLASHMLA_SM120\n    #include \"smxx/get_mla_metadata.cuh\"\n    #include \"smxx/mla_combine.cuh\"\n    #include \"sm120/prefill/dense/fmha_cutlass_fwd_sm120.cu\"\n    #if defined(SM_90_PASS)\n        // SM90-specific includes not used for SM120\n        static_assert(false, \"SM_90_PASS should not be set for SM120 adapter\");\n    #else\n        // Standard prefill interface for non-Hopper Blackwell variants\n        #define FLASHMLA_SM120_PREFILL_FWD_KERNELS \\\n            FMHA_FWD_TMA_WARPSPECIALIZED(64, 64) \\\n            FMHA_FWD_TMA_WARPSPECIALIZED(32, 64) \\\n            FMHA_FWD_TMA_WARPSPECIALIZED(32, 32)\n    #endif\n#endif\n\nextern \"C\" {\nvoid flashmla_append_kv_cache(\n    void* k_data_ptr,\n    void* v_data_ptr,\n    void* new_k_ptr,\n    void* new_v_ptr,\n    int32_t* paged_kv_indices,\n    int32_t* paged_kv_indptr,\n    int32_t* paged_kv_last_len,\n    int32_t batch_size);\nv}\nopt/code/Rust/ML/Inference/attention.rs/src/kernels/src/flashmla_sm100_adapter.cu"'
    1: EOF while parsing an object at line 1 column 1180

@guoqingbao
Copy link
Copy Markdown
Owner Author

Well, now we know why other runtimes use this 😭 - running the 235 w/ flashinfer i just got:

Is vLLM can working with this model for this task?

@sempervictus
Copy link
Copy Markdown
Contributor

Even at NVFP4 this would be too large for the device - it shares host and VRAM but the 80B model always produces #221 think tags and makes tool calls like a machine gun accurately and with scoping parameters when it can... its impressive (like "llama4 goes on ice to run it" impressive)

@guoqingbao
Copy link
Copy Markdown
Owner Author

But first step is creating the adapter file with proper includes and function implementations.
</think>

This may also related to the think tag, not sure why certain models don't output the start flag.

@sempervictus
Copy link
Copy Markdown
Contributor

sempervictus commented Mar 1, 2026

@guoqingbao - now that i have some "lessons learned" docs for the coder, i'm having it take a pass at integrating this over #232 since you've already implemented most of what i still need to add re complex grammar generation. There's some code i need to pull from #232 but i think the way to go will be to base constraint code on this work since we've proven the integration can function on the skinny version already.

@sempervictus
Copy link
Copy Markdown
Contributor

Making good progress, got to the part where LLMs break again - Sequences aren't mutable most of the time because they're in refs and the API change to enact that would be brutal 😄 so going try and take a few to wrap this part by hand but i'm hoping to have everything you started here done

@guoqingbao
Copy link
Copy Markdown
Owner Author

Finished on #262

@guoqingbao guoqingbao closed this Mar 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants