Enforce guided model output for tool calling#208
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24b24916c6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| #[serde(default)] | ||
| pub mcp_mode: Option<bool>, | ||
| #[serde(skip)] | ||
| pub constraint: Option<Constraint>, |
There was a problem hiding this comment.
Preserve constraints when serializing SamplingParams
In process/runner mode (multi-GPU or nccl), SamplingParams is serialized inside Sequence for MessageType::RunPrefill/RunDecode. With #[serde(skip)] on constraint, the JSON schema never reaches the runner, so guided decoding is silently disabled for tool calls even when the server sets a constraint. This breaks the new “enforce guided output” behavior whenever use_runner is true, because the masks are never applied in the runner process.
Useful? React with 👍 / 👎.
| GuidanceState::new(factory.clone(), constraint) | ||
| .expect("Failed to create guidance state") |
There was a problem hiding this comment.
Avoid panicking on invalid tool schema constraints
The guided decoding path unconditionally expects a GuidanceState when a constraint is present. If the tool schema JSON is unsupported by llguidance (or malformed), GuidanceState::new returns Err and this expect will panic, crashing the request (and likely the runner). This is user-input dependent because tool schemas are supplied by API clients, so a single incompatible schema can take down the worker instead of gracefully falling back to unguided decoding.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
An error of this nature should probably be returned into the conversation stream sanitized of internal structural data but containing the relevant constraint being violated by the message content to permit participants to remedy their structure and continue the back-and-forth cadence of discussion instead of having to pollute the KV cache with errant state to resume.
|
Hmm, 115.3 Compiling vllm-rs v0.8.11 (/vllm.rs)
117.4 error[E0063]: missing field `constraint` in initializer of `SamplingParams`
117.4 --> src/py/mod.rs:371:9
117.4 |
117.4 371 | Self {
117.4 | ^^^^ missing `constraint`
117.4
117.4 error[E0063]: missing field `constraint` in initializer of `SamplingParams`
117.4 --> src/py/mod.rs:389:9
117.4 |
117.4 389 | Self {
117.4 | ^^^^ missing `constraint`
117.4
120.2 For more information about this error, try `rustc --explain E0063`.
120.2 error: could not compile `vllm-rs` (lib) due to 2 previous errorsSame effect on FA and non-FA build targets. |
This PR currently is not in working state. I found a more viable way in dealing with tool call. llguidance is not necessary for tool calling, it's an addictive feature for users to control structured output. |
|
From what i've read in the last day it seems like an potential way to help govern issues like #169 though for tool calls specifically i was looking at |
The community best practice is writing model-specific parsers. |
which raises the question of "what is the proper domain-specific-language for the task?" since stream parsing is really the crux of the problem where buffering and such can impact the suspension of disbelief |
Implemented in #210 |
|
@guoqingbao - what're your thoughts on the remainder of this effort? Articles like this one strongly suggest that with repetitive data, long-context, and more or less "real world workloads," generation of masks to force output into a proper format helps significantly. |
The guided decoding feature can be somewhat addictive as a way to force the model to produce output in a given format. However, this is not always useful for tool calling. Most frameworks treat guided decoding as an optional feature, allowing users to provide a specific schema in the request that the model can follow. This approach is typically more effective for generating structured outputs such as Excel files, PowerPoint presentations, and other formatted content, rather than for tool calling. State-of-the-art models are trained to generate tool call formats natively and generally work well with the chat templates provided by model creators. Forcing models to adhere to a specific tool call schema using llguidance or similar libraries can actually degrade output quality and disrupt tool-calling behavior. In my tests with Qwen3 models, the results with guided decoding were not positive. |
|
Apologies, i dont mean necessarily specific for tool calling but as a feature which could be leveraged for structured document builds, API communication, etc as well as IO semantic changes such as hedl which might be extremely handy if it actually can make significant dents in the token counts being fed into the KV cache since that reduces the space over which attention has to be computed in the long-run. Happy to pick this up when i have a few cycles to experiment, i imagine the decoding speed and cutlass/flashinfer code are far more beneficial |
Will add this feature for sure, but current priority is flashinfer. |
|
Well, now we know why other runtimes use this 😭 - running the 235 w/ flashinfer i just got: But first step is creating the adapter file with proper includes and function implementations.
</think>
Error: Failed to call chat-completions api
Caused by:
0: Tool call 'fs_create' have non-JSON arguments '{"path":"/attention.rs/src/kernels/src/flashmla_sm120_adapter.cu","contents":"#include <cuda_runtime.h>\n#include <cuda_bf16.h>\n#include <vector>\n\n#ifdef USE_FLASHMLA_SM120\n #include \"smxx/get_mla_metadata.cuh\"\n #include \"smxx/mla_combine.cuh\"\n #include \"sm120/prefill/dense/fmha_cutlass_fwd_sm120.cu\"\n #if defined(SM_90_PASS)\n // SM90-specific includes not used for SM120\n static_assert(false, \"SM_90_PASS should not be set for SM120 adapter\");\n #else\n // Standard prefill interface for non-Hopper Blackwell variants\n #define FLASHMLA_SM120_PREFILL_FWD_KERNELS \\\n FMHA_FWD_TMA_WARPSPECIALIZED(64, 64) \\\n FMHA_FWD_TMA_WARPSPECIALIZED(32, 64) \\\n FMHA_FWD_TMA_WARPSPECIALIZED(32, 32)\n #endif\n#endif\n\nextern \"C\" {\nvoid flashmla_append_kv_cache(\n void* k_data_ptr,\n void* v_data_ptr,\n void* new_k_ptr,\n void* new_v_ptr,\n int32_t* paged_kv_indices,\n int32_t* paged_kv_indptr,\n int32_t* paged_kv_last_len,\n int32_t batch_size);\nv}\nopt/code/Rust/ML/Inference/attention.rs/src/kernels/src/flashmla_sm100_adapter.cu"'
1: EOF while parsing an object at line 1 column 1180 |
Is vLLM can working with this model for this task? |
|
Even at NVFP4 this would be too large for the device - it shares host and VRAM but the 80B model always produces #221 think tags and makes tool calls like a machine gun accurately and with scoping parameters when it can... its impressive (like "llama4 goes on ice to run it" impressive) |
This may also related to the think tag, not sure why certain models don't output the start flag. |
|
@guoqingbao - now that i have some "lessons learned" docs for the coder, i'm having it take a pass at integrating this over #232 since you've already implemented most of what i still need to add re complex grammar generation. There's some code i need to pull from #232 but i think the way to go will be to base constraint code on this work since we've proven the integration can function on the skinny version already. |
|
Making good progress, got to the part where LLMs break again - Sequences aren't mutable most of the time because they're in refs and the API change to enact that would be brutal 😄 so going try and take a few to wrap this part by hand but i'm hoping to have everything you started here done |
|
Finished on #262 |

No description provided.