feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal) #35701
Implements the @grs proposal for stateless multi-turn Responses API conversations without server-side storage, using the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem` as the state carrier.

**How it works:**

1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier item — no in-memory store touched

**No breaking changes.** The existing `previous_response_id` + store-enabled path is unchanged. The new path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same 64-char hex value on all nodes so tokens validate across replicas.

Files changed:

- `vllm/entrypoints/openai/responses/state.py` (new) — serialise / deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response` field + mutual-exclusion validator on `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution; thread `prev_messages` through `_make_request*`; inject state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) — 14 unit tests

Closes #26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
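The signed-blob format from step 2 can be sketched as follows. This is a minimal sketch, not the PR's actual implementation: it assumes plain JSON-serialisable message dicts, whereas the real `serialize_state` / `deserialize_state` in `state.py` operate on Harmony messages; the carrier-id helpers mirror the `rs_state_<req_id[-12:]>` convention described below.

```python
import base64
import hashlib
import hmac
import json
import os

# Per-process random key unless VLLM_RESPONSES_STATE_SIGNING_KEY is set
# (the PR warns when falling back to a random per-process key).
SIGNING_KEY = bytes.fromhex(
    os.environ.get("VLLM_RESPONSES_STATE_SIGNING_KEY", os.urandom(32).hex()))


def serialize_state(messages: list) -> str:
    """Pack a message history into vllm:1:<base64(json)>:<hmac-sha256-hex>."""
    blob = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(SIGNING_KEY, blob.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{blob}:{sig}"


def deserialize_state(token: str) -> list:
    """Verify the HMAC and recover the message history; reject tampering."""
    prefix, version, blob, sig = token.split(":")
    if (prefix, version) != ("vllm", "1"):
        raise ValueError("not a vllm v1 state carrier")
    expected = hmac.new(SIGNING_KEY, blob.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state carrier signature mismatch")
    return json.loads(base64.b64decode(blob))


def make_carrier_id(req_id: str) -> str:
    """Synthetic carrier item id, per the rs_state_<req_id[-12:]> convention."""
    return f"rs_state_{req_id[-12:]}"


def is_state_carrier(item_id: str) -> bool:
    return item_id.startswith("rs_state_")
```

Round-tripping a history through `serialize_state` and `deserialize_state` returns it unchanged, while any edit to the base64 payload invalidates the signature; the colon is safe as a separator because it appears in neither base64 nor hex output.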
Documentation preview: https://vllm--35701.org.readthedocs.build/en/35701/
Code Review
This pull request introduces a significant and well-designed feature for stateless multi-turn conversations in the Responses API, effectively addressing memory leak and multi-node deployment issues. The implementation, including the new state serialization module and protocol extensions, is robust and well-tested. My review identifies one high-severity issue in the new error handling for the cancel_responses endpoint in stateless mode, where the logic for task-not-found cases is incorrect and the returned HTTP status code is misleading. I've provided a detailed suggestion to improve the correctness and clarity of the API behavior in this scenario.
```python
if not self.enable_store:
    # Stateless mode: cancel in-flight tasks by ID only (no stored response).
    if task := self.background_tasks.get(response_id):
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            logger.exception(
                "Background task for %s was cancelled (stateless mode)",
                response_id,
            )
    return self.create_error_response(
        err_type="invalid_request_error",
        message=(
            "Response cancellation in stateless mode cancelled any "
            "in-flight task, but no stored response object is available. "
            "Enable VLLM_ENABLE_RESPONSES_API_STORE=1 for full cancel "
            "support."
        ),
        status_code=HTTPStatus.NOT_IMPLEMENTED,
    )
```
The current implementation for cancelling responses in stateless mode has a few issues:

- **Incorrect error on not found:** If `response_id` does not correspond to an active background task, the function returns a 501 Not Implemented error. It should return a 404 Not Found, consistent with the stateful path's behavior.
- **Misleading status code:** Returning `HTTPStatus.NOT_IMPLEMENTED` (501) is misleading when a cancellation is successfully initiated. A 400 Bad Request would be more appropriate to signal that this is a client error (using a feature not fully supported in stateless mode), while the message can clarify that the cancellation was attempted.
- **Logging:** `logger.exception` is used for `asyncio.CancelledError`. Since cancellation is the expected outcome of this operation, `logger.info` or `logger.debug` would be more suitable.
I suggest refactoring this block to handle these cases correctly.
```python
if not self.enable_store:
    # Stateless mode: cancel in-flight tasks by ID only (no stored response).
    task = self.background_tasks.get(response_id)
    if not task:
        # Mimic the stateful path's behavior of returning a 404 if the
        # response is not in the store.
        return self._make_not_found_error(response_id)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        logger.info(
            "Background task for %s was cancelled (stateless mode)",
            response_id,
        )
    # The task was cancelled, but we cannot return a full response object
    # as required by the API spec because no state is stored.
    # Return an error response indicating this limitation.
    return self.create_error_response(
        err_type="invalid_request_error",
        message=(
            "Response cancellation was initiated for the in-flight task, "
            "but a full response object cannot be returned in stateless mode. "
            "Enable VLLM_ENABLE_RESPONSES_API_STORE=1 for full cancel "
            "support."
        ),
        status_code=HTTPStatus.BAD_REQUEST,
    )
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 010a69cea9
```python
            for m in prev_messages
        ]
    else:
        prev_msgs = self.msg_store[prev_response.id]
```
Guard Harmony fallback when stateless carrier is missing
In Harmony mode, if a client sends `previous_response` but the prior response does not include the synthetic state carrier (for example they omitted `include=['reasoning.encrypted_content']`), `_extract_state_from_response` returns `None` and this branch falls back to `self.msg_store[prev_response.id]`. With store disabled, that key is absent and raises `KeyError`, and `create_responses` does not catch `KeyError`, so this becomes a 500 instead of a user-actionable 400. Please validate that `previous_response` actually contains a carrier before indexing `msg_store` (or use a safe lookup and an explicit error).
```python
if self.previous_response is not None and self.store:
    # Stateless path: store is meaningless (and would cause confusion).
    # Mirror the silent-disable behavior in create_responses for store=True.
    self.store = False
```
Preserve background/store invariant after validation
This `mode='after'` validator mutates `self.store` to `False` when `previous_response` is set, but `background` is validated earlier in a `mode='before'` validator against the original input. A request like `background=true` with `previous_response` (and the default `store=true`) can therefore pass validation and end up as `background=true, store=false`, which then bypasses the store-disabled guards and creates a background response that cannot be retrieved or cancelled when store is off (those endpoints now return 501). Re-validate `background` after this mutation, or avoid mutating `store` here.
Summary
Implements the @grs proposal for stateless multi-turn Responses API conversations — enabling multi-turn without server-side storage by serialising Harmony message history into the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem`.

Addresses: RFC #26934 | Related: #33089 #34738
Motivation
vLLM's Responses API currently stores conversation state in three in-process Python dicts (`response_store`, `msg_store`, `event_store`), commented `# HACK` / `# FIXME` in source. These are disabled by default (`VLLM_ENABLE_RESPONSES_API_STORE=0`) because they leak memory and break multi-node deployments, disabling `previous_response_id` multi-turn for most users (issues [Feature]: Support multi-turn conversation for OpenAI Response API #33089, Fix memory leak in Responses API store #34738).

RFC #26934 (@qandrew, Meta) calls for decoupling storage from the inference engine. @grs proposed reusing the existing `encrypted_content` field on `ReasoningItem` objects — no new protocol extensions needed.

Design
Wire format:
`vllm:1:<base64(json(messages))>:<hmac-sha256-hex>`

`VLLM_RESPONSES_STATE_SIGNING_KEY` (64-char hex) enables multi-node deployments; a random key is generated per-process if unset (with a warning).

The state carrier item is a synthetic `ReasoningItem` with `id="rs_state_<req_id[-12:]>"`. It is invisible to the LLM — filtered out by `_construct_single_message_from_response_item()` in `utils.py`.

Files Changed
- `vllm/entrypoints/openai/responses/state.py` — `serialize_state` / `deserialize_state` / `is_state_carrier` / HMAC helpers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response: ResponsesResponse | None` to `ResponsesRequest`; `model_validator` for mutual exclusion with `previous_response_id`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution in `create_responses`; thread `prev_messages` through `_make_request` / `_make_request_with_harmony` / `_construct_input_messages_with_harmony`; inject state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — `_construct_single_message_from_response_item` returns `None` for state-carrier items; `construct_chat_messages_with_tool_call` filters `None` entries
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) — 14 unit tests

Test Results
All tests are pure-Python unit tests (no GPU needed).
Usage Example
Turn 1 — no store needed:
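The original snippet did not survive extraction; a plausible turn-1 request per the opt-in described above. The model name and URL are placeholders, and the HTTP call is commented out because it needs a running vLLM server.

```python
payload_turn1 = {
    "model": "openai/gpt-oss-20b",               # placeholder model name
    "input": "What is the capital of France?",
    "store": False,                              # no server-side storage
    "include": ["reasoning.encrypted_content"],  # request the state carrier
}
# import httpx
# resp1 = httpx.post("http://localhost:8000/v1/responses", json=payload_turn1).json()
```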
Turn 2 — pass back the full response:
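Again reconstructed (the original block was lost in extraction): turn 2 echoes the full turn-1 response object back via the vLLM-specific `previous_response` field rather than `previous_response_id`.

```python
resp1 = {"id": "resp_example", "output": []}  # stand-in for turn 1's full JSON
payload_turn2 = {
    "model": "openai/gpt-oss-20b",
    "input": "And what is its population?",
    "store": False,
    "include": ["reasoning.encrypted_content"],
    "previous_response": resp1,  # full response object, carrier included
}
```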
Multi-node deployment:
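A sketch of the multi-node setup (the original snippet is missing; the serve command line is illustrative): generate one 64-char hex key and export the same value on every replica so carriers signed by one node verify on the others.

```shell
# 32 random bytes -> 64 hex chars, the length the PR expects.
export VLLM_RESPONSES_STATE_SIGNING_KEY="$(openssl rand -hex 32)"
echo "${#VLLM_RESPONSES_STATE_SIGNING_KEY}"   # 64
# then start each replica with the variable set, e.g.:
# vllm serve openai/gpt-oss-20b
```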
Backward Compatibility
- The `previous_response_id` + `VLLM_ENABLE_RESPONSES_API_STORE=1` path is completely unchanged.
- The new path is explicit opt-in: `store=false` + `include=["reasoning.encrypted_content"]` + `previous_response`.
- `previous_response_id` without store now returns a helpful 400 error pointing users to the stateless path.

Known Limitations / Follow-up
- Non-streaming only: the carrier is injected in `responses_full_generator`. Streaming (`responses_stream_generator`) needs `ResponseOutputItemAddedEvent` + `ResponseOutputItemDoneEvent` events for the carrier before `ResponseCompletedEvent`. Left as a follow-up to keep this PR reviewable.
- The blob is signed, not encrypted, despite the `encrypted_content` usage (the field name is from the OpenAI spec; vLLM reuses it for opaque state). Actual encryption is out of scope per the RFC.
- The `msg_store` / `response_store` hacks are not removed. This PR adds the stateless path without deleting the existing store-enabled path. Cleanup can follow once stateless is validated.

Reviewer Notes
Key design decision open for discussion: single state carrier vs. per-item state. The current implementation appends one synthetic `ReasoningItem` carrying the full Harmony history. @alecsolder raised in #26934 that tool-call metadata may not be expressible in per-reasoning-item form; this design sidesteps that by carrying the complete history in one item.

cc @qandrew @grs @WoosukKwon @njhill @Simon-Laux