[Responses API] Structured output + reasoning via structural tag embedding #35904
will-deines wants to merge 8 commits into vllm-project:main
Conversation
Code Review
This pull request significantly enhances the structured output capabilities of the Responses API, particularly for reasoning models, by embedding content constraints within structural tags, introducing json_object format support, fixing a streaming bug related to Pydantic model serialization, and improving robustness with reasoning channel tags. A security review found no vulnerabilities. The changes are well-designed, thoroughly tested, and well-documented, with no high or critical severity issues identified by the code review.
This pull request has merge conflicts that must be resolved before it can be merged.
…d tags

Extend prepare_structured_tag() to be the single authority for all generation constraints in GPT-OSS Harmony models: channel structure, tool enforcement, argument validation, and content constraints.

tool_choice=required support:
- New from_function_tool_to_tag() and tag_with_function_tools() helpers
- prepare_structured_tag() extended with tool_choice, function_tools params
- Channel blocking: omit <|channel|>final trigger to force tool calls
- Remove NotImplementedError for non-auto tool_choice in Harmony path

Absorbed from upstream PR vllm-project#35904 (structured output + reasoning):
- Content constraint embedding in <|channel|>final tag
- _constraint_to_content_format() and _extract_response_format_schema()
- struct_out is None branch (reasoning tags always applied)
- inject_response_formats() for Harmony cookbook compliance
- json_object format handling (was silently ignored)
- Streaming .model_dump() alias bug fix

Signed-off-by: Will Deines <will@garr.io>
…dding

When a user requests JSON schema enforcement (text.format.type=json_schema) with a reasoning model (GPT-OSS), the grammar constraint was never scoped to the final output channel. This caused grammar bitmasks to be applied from token 0, clobbering reasoning output.

Fix by embedding content constraints (json_schema, json_object, regex, grammar, choice) inside the structural tag's <|channel|>final region using xgrammar's native TriggeredTagsFormat support. This ensures grammar enforcement only applies within the final output region, not during reasoning.

Also:
- Handle text.format.type=json_object (was silently ignored)
- Fix streaming + json_schema alias bug (.model_dump() dropped schema alias)
- Apply reasoning channel tags even when no structured output is requested

Signed-off-by: Will Deines <will@garr.io>
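The embedding this commit describes can be pictured as a triggered-tags payload along these lines. This is a hand-written sketch: the key names approximate xgrammar's structural-tag format and are not copied from vLLM, so the exact shape may differ.

```python
# Sketch of a structural tag that leaves the analysis (reasoning) channel
# unconstrained while scoping a JSON schema to the final channel only.
# Key names are illustrative approximations of xgrammar's format.
structural_tag = {
    "type": "triggered_tags",
    "triggers": ["<|channel|>"],
    "tags": [
        {
            # Reasoning channel: any text allowed, no grammar bitmask.
            "begin": "<|channel|>analysis<|message|>",
            "content": {"type": "any_text"},
            "end": "<|end|>",
        },
        {
            # Final channel: grammar enforcement starts only here.
            "begin": "<|channel|>final<|message|>",
            "content": {
                "type": "json_schema",
                "json_schema": {"type": "object"},
            },
            "end": "<|return|>",
        },
    ],
}
```

The point of the shape is that the schema constraint lives inside one tag, so the decoder only masks tokens between that tag's begin and end markers.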
…nstraints

When creating a new StructuredOutputsParams with the structural_tag, use dataclasses.replace() to clear content constraint fields while preserving user-specified options like disable_any_whitespace, disable_fallback, disable_additional_properties, and whitespace_pattern.

Signed-off-by: Will Deines <will@garr.io>
… message

Per the Harmony cookbook, structured output requires both grammar enforcement (structural tags) and prompt guidance (a # Response Formats section in the developer message). This injects the response format schema into the developer message when json_schema is requested, creating a developer message even without custom tools if needed.

Signed-off-by: Will Deines <will@garr.io>
Handled by .git/info/exclude on feature branches, force-added on production/garrio-release. Signed-off-by: Will Deines <will@garr.io>
StructuredOutputsParams.json can be str | dict | None but the return type is dict | None. Parse the string case with json.loads so mypy is satisfied and string schemas work correctly at runtime. Signed-off-by: Will Deines <will@garr.io>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4fa603e0a4
```python
if self.use_harmony:
    if request.stream:
        context = StreamingHarmonyContext(messages, available_tools)
    else:
        context = HarmonyContext(messages, available_tools)
```
Move structural-tag setup out of the non-Harmony branch
For GPT-OSS Responses requests, self.use_harmony is always true (vllm/entrypoints/openai/responses/serving.py:278), so execution takes this branch and never reaches the new reasoning_parser.prepare_structured_tag(...) logic under else. In the main target scenario of this change, text.format=json_schema/json_object therefore stays as a normal guided-decoding constraint on sampling_params instead of being embedded into <|channel|>final, so reasoning tokens are still constrained from token 0 exactly as before.
Fixed in a22f008. Dedented the structural tag block out of the else branch so it now runs unconditionally after context selection, for both Harmony and non-Harmony paths.
```python
dev_instructions = request.instructions
if response_format_schema is not None:
    dev_instructions = inject_response_formats(
        dev_instructions, response_format_schema
    )
```
Preserve injected schema when system instructions are enabled
The new schema guidance is only appended to dev_instructions here, but get_developer_message() drops instructions whenever VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS is set (vllm/entrypoints/openai/parser/harmony_utils.py:127). _construct_harmony_system_input_message() still sends the unmodified request.instructions to get_system_message() (serving.py:1211-1218), so in that deployment mode the # Response Formats section never reaches either prompt. GPT-OSS users running with system-instruction mode therefore lose the prompt-side schema guidance this patch is supposed to add.
Fixed in a22f008. Added a separate response_format_section parameter to get_developer_message() so the schema is passed independently of instructions. When VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS is enabled, user instructions are routed to the system message as before, but the # Response Formats section still reaches the developer message. Added TestGetDeveloperMessageResponseFormats with 4 tests covering both modes.
…em-instructions mode

Fix two bugs identified in PR review:

1. The structural tag setup block was nested inside the `else` branch of `if self.use_harmony:`, making it unreachable for GPT-OSS (the primary target). Dedent the block so it runs unconditionally after context selection.
2. The `# Response Formats` schema section was lost when VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS was enabled, because get_developer_message() dropped all instructions in that mode. Add a separate response_format_section parameter so the schema is always included in the developer message regardless of the system-instructions flag.

Signed-off-by: Will Deines <will@garr.io>
…structured-output Signed-off-by: Will Deines <will@garr.io>
Summary

- Scope grammar constraints to the final channel: when a user requests JSON schema enforcement (`text.format.type=json_schema`) with a GPT-OSS reasoning model, the grammar constraint is now scoped to the `<|channel|>final` region via xgrammar's `TriggeredTagsFormat`. Previously, grammar bitmasks were applied from token 0, clobbering reasoning output.
- Support the `json_object` format: `text.format.type=json_object` was silently ignored in the Responses API. It now produces `StructuredOutputsParams(json_object=True)`, matching chat completions behavior.
- Fix streaming serialization: removed a `.model_dump()` in the streaming path that dropped the `schema` → `schema_` Pydantic alias, causing `ResponseCreatedEvent` deserialization failures.
- Apply reasoning channel tags even when no structured output is requested (the `struct_out is None` branch).
- Inject `# Response Formats` into the developer message: per the Harmony cookbook, structured output requires both grammar enforcement (structural tags) and prompt guidance (a `# Response Formats` section in the developer message telling the model what schema to produce). When `json_schema` is requested, the schema is now automatically injected into the Harmony developer message, creating one even if no custom tools are present.
- Review fix: the structural tag setup block was nested inside the `else` branch of `if self.use_harmony:`, making it dead code for GPT-OSS models (the primary target). Dedented so it runs unconditionally after context selection.
- Review fix: the `# Response Formats` section was lost when `VLLM_GPT_OSS_HARMONY_SYSTEM_INSTRUCTIONS` was enabled, because `get_developer_message()` dropped all instructions in that mode. Added a separate `response_format_section` parameter so the schema always reaches the developer message independently of instructions.

Approach
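For reference, a request that exercises the json_schema path described above might look like the following. The body shape follows the OpenAI Responses API; the model name and schema are placeholders.

```python
# Illustrative Responses API request body using text.format with a JSON
# schema; field layout follows the OpenAI Responses API, values are
# placeholders for this PR's target scenario (a GPT-OSS reasoning model).
request_body = {
    "model": "openai/gpt-oss-20b",
    "input": "Extract the city from: 'I live in Paris.'",
    "text": {
        "format": {
            "type": "json_schema",
            "name": "city_extraction",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    },
}
```

With this PR, the schema above is enforced only inside `<|channel|>final`, so the model's reasoning channel is left unconstrained.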
Rather than modifying `StructuredOutputsParams` to allow multiple simultaneous constraint types (which would require deep changes to validation, backends, and dispatch), we embed the content constraint inside the structural tag's `<|channel|>final` tag.

xgrammar's `TagFormat.content` field already accepts a discriminated union of `JSONSchemaFormat`, `GrammarFormat`, `RegexFormat`, etc. (defined in `xgrammar/structural_tag.py`). The infrastructure to "apply JSON schema grammar only within the `<|channel|>final` region" already exists; we just wire it up from the Responses API.

This means:

- `StructuredOutputsParams` keeps its existing mutual-exclusivity invariant (one constraint type)
- The single constraint set is `structural_tag`, which internally contains both reasoning channel enforcement AND the content constraint scoped to the final channel
- User options (`disable_any_whitespace`, `disable_fallback`, etc.) are preserved via `dataclasses.replace()`

Prompt guidance via `# Response Formats`

The Harmony cookbook is explicit that structured output requires two complementary mechanisms:

1. Prompt guidance: a `# Response Formats` section in the developer message telling the model what schema to follow
2. Grammar enforcement at decode time

The cookbook states: "This prompt alone will, however, only influence the model's behavior but doesn't guarantee the full adherence to the schema." Grammar enforcement is the complement. This PR implements both sides: structural tags handle grammar enforcement (path 2), and `inject_response_formats()` handles prompt guidance (path 1).

Per the cookbook's role specification, `# Response Formats` belongs in the developer message (which holds instructions, function tools, and output format schemas), not the system message. When `json_schema` is requested but no custom tools are present, we now create a developer message specifically for the response format section.

Decisions We Made That Can Be Debated
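The content formats that can be embedded in the final tag can be sketched as plain dicts. The key names below are assumptions modeled on xgrammar's structural_tag module, not copied from it.

```python
# Illustrative content-format dicts for the <|channel|>final tag, one per
# constraint kind the PR mentions (json_schema, regex, grammar). Exact key
# names are assumptions.
json_schema_fmt = {"type": "json_schema", "json_schema": {"type": "object"}}
regex_fmt = {"type": "regex", "pattern": r"\d{4}-\d{2}-\d{2}"}
grammar_fmt = {"type": "grammar", "grammar": 'root ::= "yes" | "no"'}

# Each dict is a member of a discriminated union keyed on "type".
for fmt in (json_schema_fmt, regex_fmt, grammar_fmt):
    assert "type" in fmt
```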
1. Embed constraint inside structural tag vs. allow multiple constraint types on `StructuredOutputsParams`

What we chose: When a reasoning parser is active and a content constraint (json_schema, regex, grammar, choice) is present, we convert the content constraint into an xgrammar `content` format dict, embed it in the `<|channel|>final` tag within the structural tag, then clear the original constraint fields. The final `StructuredOutputsParams` has only `structural_tag` set.

Alternative: Modify `StructuredOutputsParams` to support multiple simultaneous constraint types (e.g. `structural_tag` + `json`). This would avoid the mid-pipeline mutation pattern where we clear fields after embedding them, but requires changes to validation logic, backend dispatch in `StructuredOutputManager`, and every guided decoding backend's understanding of what "one constraint" means.

Why we chose this: xgrammar's `TagFormat.content` field already supports this composition natively; the infrastructure exists and is tested. The mutual-exclusivity invariant on `StructuredOutputsParams` is load-bearing across the entire structured output stack, and relaxing it has a large blast radius.

What reviewers might disagree with: The mid-pipeline mutation (clearing `json`/`regex`/etc. after embedding) means `StructuredOutputsParams` no longer reflects what the user originally requested. If downstream code inspects these fields (e.g., for logging, metrics, or error messages), it will see `None` instead of the original constraint. An alternative could be to construct a fresh `StructuredOutputsParams(structural_tag=...)` rather than mutating via `dataclasses.replace()`.

2. Fix `text.format` path rather than redirecting users to `structured_outputs` field

What we chose: We fix the standard OpenAI `text.format` path so that `json_schema`, `json_object`, and streaming all work correctly. Users can use either `text.format` (OpenAI-compatible) or the vLLM-specific `structured_outputs` field (#33709).

Alternative: Only support structured output through the vLLM-specific `structured_outputs` field and treat `text.format` as a passthrough/echo-only field (the status quo before this PR, where `json_object` was silently ignored).

Context: This is an area of active debate. In #33709, @yeqcharlotte and @chaunceyjiang questioned why structured output wasn't going through `text.format` instead of a separate field. In #33381, @chaunceyjiang argued vLLM-specific extensions should go through the OpenResponses extension mechanism. Meanwhile, @alecsolder defended the separate field for cross-provider reusability and separation of concerns. In #19097, `vllm_`-prefixed types were proposed but the RFC was auto-closed without implementation.

Why we chose this: Users coming from the OpenAI SDK will naturally use `text.format.type=json_schema`; it should just work. The `structured_outputs` field is additive for vLLM-specific capabilities (grammar, regex, choice) that `text.format` can't express. Fixing both paths costs little and prevents user confusion.

3. Remove `.model_dump()` vs. add `by_alias=True` for streaming alias bug

What we chose: Remove the `.model_dump()` call in the streaming path and pass the `ResponsesResponse` Pydantic object directly to `ResponseCreatedEvent`, matching how `ResponseCompletedEvent` already works. This is the approach from #34611.

Alternative: Keep `.model_dump()` but add `by_alias=True` so Pydantic serializes `schema_` as `"schema"`. This is the approach from #26356, which has community confirmation that it works.

Why we chose this: Removing the unnecessary dict round-trip eliminates the entire class of alias bugs rather than patching one instance. This is consistent with @qandrew's own #26185 which previously removed a `.model_dump()` call on the `ResponseCompletedEvent` path for the same category of issue. The `by_alias=True` approach is fragile: any future alias field would break again if someone forgets the flag.

4. Apply reasoning channel tags even when no structured output is requested

What we chose: When `struct_out is None` and a reasoning parser is active, we now create a `StructuredOutputsParams(structural_tag=...)` with just the reasoning channel tags. Previously, the `prepare_structured_tag()` block was only entered when `struct_out` was already a `StructuredOutputsParams` instance.

Alternative: Keep the existing behavior where reasoning channel tags are only applied when the user explicitly requests some form of structured output.

Why we chose this: Without structural tags, GPT-OSS models emit raw Harmony format (`<|channel|>analysis<|message|>...`) that the reasoning parser must post-hoc parse. With structural tags, xgrammar enforces the channel structure at decode time, which is more robust and enables future optimizations. This also means the reasoning parser's `is_reasoning_end` state machine (which has had multi-turn bugs per #34454) is supplemented by grammar-level enforcement.

What reviewers might disagree with: This changes default behavior for all GPT-OSS requests that don't request structured output. If a model produces valid output without structural tags but would be over-constrained with them, this could cause regressions. We don't have e2e validation of this path yet.
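The alias pitfall behind decision 3 can be reproduced without Pydantic. The toy class below mimics how a Python field name (`schema_`) and its wire alias (`"schema"`) diverge when an object is round-tripped through a plain dict; it is a stand-in for the real Pydantic behavior, not vLLM code.

```python
class ToyModel:
    # Toy stand-in for a Pydantic model: schema_ is the Python attribute,
    # "schema" is the name consumers expect on the wire.
    ALIASES = {"schema_": "schema"}

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def model_dump(self, by_alias: bool = False) -> dict:
        # Without by_alias, the Python-side name leaks into the payload.
        return {
            (self.ALIASES.get(k, k) if by_alias else k): v
            for k, v in self.__dict__.items()
        }


m = ToyModel(schema_={"type": "object"})
assert "schema" not in m.model_dump()            # the streaming bug
assert "schema" in m.model_dump(by_alias=True)   # the patch-one-instance fix
```

Passing the object itself, with no dict round-trip, sidesteps the flag entirely, which is why this PR removes the `.model_dump()` call instead.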
5. `json_object` mapped to `{"type": "object"}` in structural tag content

What we chose: In `_constraint_to_content_format()`, `json_object=True` is converted to `{"type": "json_schema", "json_schema": {"type": "object"}}` for embedding in the structural tag.

Alternative: Map it to a dedicated `json_object` content format type if xgrammar supports one, or skip embedding entirely and let the existing `json_object` handling in the structured output backend handle it outside the structural tag.

Why we chose this: xgrammar's `TagFormat.content` expects one of its known format types (json_schema, regex, grammar, etc.). `{"type": "object"}` is the minimal JSON schema that enforces "output must be a JSON object", semantically equivalent to `json_object` mode. This ensures the constraint is properly scoped to the `<|channel|>final` region for reasoning models rather than being applied globally.

6. Adding
`final_content_format` parameter to the base class `prepare_structured_tag()`

What we chose: We added `final_content_format: dict | None = None` as an optional parameter on `ReasoningParser.prepare_structured_tag()` in the base class, with a default of `None` that preserves backward compatibility.

Alternative: Only add the parameter on `GPTOSSReasoningParser` and handle the dispatch in `serving.py` with a type check or capability flag. Or create a separate method like `prepare_structured_tag_with_constraint()`.

Why we chose this: The base class change is backward-compatible (default `None`; existing implementations don't need changes). The concept of "scope this content constraint to the model's final output region" is generic; it's not GPT-OSS-specific. Other reasoning models (Qwen3, DeepSeek-R1, future models) with structural tag support would benefit from the same interface. Keeping it on the base class establishes a clean contract.

What reviewers might disagree with: This couples content constraint format knowledge (the xgrammar dict format) to the reasoning parser interface. If vLLM ever supports a non-xgrammar structured output backend, this dict format may not apply. A more abstract interface (e.g., passing `StructuredOutputsParams` directly) might be more future-proof.

Related Issues, PRs, and RFCs
Directly Addressed by This PR
- `schema` field becomes `None` in streaming with json_schema: the `schema_`/`schema` alias bug in streaming. The `.model_dump()` removal in this PR fixes it.
- Removes `.model_dump()` in the streaming path. We adopt this approach.
- An alternative fix (`by_alias=True`). We prefer #34611's approach (pass objects directly).
- The `text` type response_format received a `type: "text"` passthrough. Our `json_object` handling follows the same pattern.
- The `json_object` gap in the Responses API could produce similar errors. Our Step 1 prevents this.

Foundation This PR Builds On
- `structured_outputs` for responses API: added the `structured_outputs` field to ResponsesRequest. Our work builds on this.
- The `to_sampling_params()` infrastructure on ResponsesRequest.
- `Parser`/`ParserManager` and the structural tag preparation block we're extending.

Related PRs (same problem space)
- A `reasoning_ended` gate that prevents structural tag bitmasks from being applied during reasoning. Our approach inherently avoids this problem: the grammar handles reasoning/content boundaries internally via triggers, so the external `reasoning_ended` gate doesn't interfere.
- `test_gptoss_structural_tags.py` follows the consolidated layout.
- Extends `prepare_structured_tag()` with `tool_choice` + `function_tools` params, building on the `final_content_format` infrastructure introduced here.

Related RFCs
- The `structured_outputs` field. We reuse `StructuredOutputsParams` rather than creating new types.
- The `structured_outputs` field (already merged in #33709); we also fix the standard `text.format` path. No new protocol extensions.
- `structured_outputs` as instance field on ResponsesRequest: promotes `structured_outputs` from a local var to a field for tool parser mutation. Compatible with our changes; we support both the field path and the `text.format` path.

Changes
- `vllm/entrypoints/openai/responses/protocol.py`: `json_object` handling in `to_sampling_params()`
- `vllm/entrypoints/openai/parser/harmony_utils.py`: `inject_response_formats()` helper; add `response_format_section` param to `get_developer_message()` so schema is preserved independently of instructions (fixes system-instructions mode)
- `vllm/entrypoints/openai/responses/serving.py`: `_extract_response_format_schema()` and `_constraint_to_content_format()` helpers; inject response formats into developer message; dedent structural tag block out of `else` branch so it runs for both Harmony and non-Harmony paths; split response format from instructions at developer message call site; fix streaming `.model_dump()`
- `vllm/reasoning/abs_reasoning_parsers.py`: `final_content_format` param to `prepare_structured_tag()` base class
- `vllm/reasoning/gptoss_reasoning_parser.py`: `final_content_format`; append `<…` (truncated in original)
- `tests/entrypoints/openai/responses/test_structured_output.py`: `_constraint_to_content_format` tests
- `tests/v1/structured_output/test_gptoss_structural_tags.py`
- `tests/entrypoints/openai/responses/test_sampling_params.py`: `json_object` test
- `tests/entrypoints/openai/responses/test_response_formats.py`: `_extract_response_format_schema()` tests
- `tests/entrypoints/openai/parser/test_harmony_utils.py`: `TestInjectResponseFormats`, tests for `inject_response_formats()`; add `TestGetDeveloperMessageResponseFormats`, tests for `response_format_section` param behavior with/without system-instructions mode

Test plan
pytest tests/entrypoints/openai/responses/test_structured_output.py tests/v1/structured_output/test_gptoss_structural_tags.py tests/entrypoints/openai/responses/test_sampling_params.py tests/entrypoints/openai/responses/test_response_formats.py tests/entrypoints/openai/parser/test_harmony_utils.py::TestInjectResponseFormats tests/entrypoints/openai/parser/test_harmony_utils.py::TestGetDeveloperMessageResponseFormats -v

- `TestGetDeveloperMessageResponseFormats`: verifies the response format is preserved/dropped correctly with and without system-instructions mode

Out of Scope (follow-ups)
- `serving_chat.py` never calls `prepare_structured_tag()`. The same fix pattern applies but is a separate PR targeting the chat completions path.
- `strict` field forwarding from `ResponseFormatTextJSONSchemaConfig`: low priority, vLLM always enforces strictly.
- Whether `structured_outputs` should go through the extension mechanism.
- `structured_outputs` as instance field (#33249): promotes `structured_outputs` from a local variable to a field for tool parser mutation. Compatible with our changes but an independent concern.