feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934) #35903
will-deines wants to merge 13 commits into vllm-project:main
Conversation
Code Review
This pull request introduces a well-designed stateless multi-turn conversation mechanism for the Responses API. By using an HMAC-signed state carrier in the encrypted_content field, it effectively addresses the memory leak and multi-node deployment issues of the previous in-memory store. The implementation is robust, with new logic for state serialization, deserialization, and validation. The changes are well-tested with a comprehensive suite of new unit tests covering protocol validation, error paths, and state carrier logic. I have one suggestion to improve the robustness of message store access.
        for m in prev_messages
    ]
else:
    prev_msgs = self.msg_store[prev_response.id]
Direct dictionary access self.msg_store[prev_response.id] can raise a KeyError if the ID is not found, leading to an unhandled exception and a 500 Internal Server Error. This could happen if response_store and msg_store become inconsistent.
For improved robustness and to align with the non-Harmony path in _make_request which uses .get(), consider changing this to self.msg_store.get(prev_response.id) and then handling the potential None value for prev_msgs in the subsequent logic to prevent a TypeError.
Good catch — switched to .get() with an explicit ValueError when the ID is missing. This gives a clear error message instead of an unhandled KeyError. Fixed in 0d68f9d.
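In sketch form, the guarded lookup the reply describes might look like this (names mirror the diff excerpt above; this is an illustration, not the committed code):

```python
class ServingResponsesSketch:
    """Minimal stand-in for the serving class; only msg_store matters here."""

    def __init__(self) -> None:
        self.msg_store: dict[str, list[dict]] = {}

    def lookup_prev_messages(self, prev_response_id: str) -> list[dict]:
        # .get() instead of [] — a missing ID becomes an explicit, catchable
        # ValueError rather than an unhandled KeyError (-> 500).
        prev_msgs = self.msg_store.get(prev_response_id)
        if prev_msgs is None:
            raise ValueError(
                f"Previous response {prev_response_id} not found in message "
                "store; response_store and msg_store may be inconsistent."
            )
        return prev_msgs
```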
Force-pushed from bddeedd to 589f29a
This pull request has merge conflicts that must be resolved before it can be merged.
…-project#26934)

Implements the @grs proposal for stateless multi-turn Responses API conversations without server-side storage, using the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem` as the state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier item — no in-memory store touched

**No breaking changes.** Existing `previous_response_id` + store-enabled path is unchanged. New path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same 64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise / deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response` field + mutual-exclusion validator on `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution; thread `prev_messages` through `_make_request*`; inject state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) — 14 unit tests

Closes vllm-project#26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Will Deines <will@garr.io>
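The signed-blob format in step 2 can be sketched with the standard library alone. This is a minimal illustration of the `vllm:1:<base64(json)>:<hmac-sha256>` layout; `pack_state` / `unpack_state` are illustrative names, not the PR's actual helpers:

```python
import base64
import hashlib
import hmac
import json

PREFIX, VERSION = "vllm", "1"

def pack_state(messages: list[dict], key: bytes) -> str:
    """Serialise messages into 'vllm:1:<base64(json)>:<hmac-sha256-hex>'."""
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{PREFIX}:{VERSION}:{payload}:{sig}"

def unpack_state(carrier: str, key: bytes) -> list[dict]:
    """Verify the HMAC and recover the message list; raise on any mismatch."""
    # Base64 and hex alphabets contain no ':', so a plain split is safe.
    prefix, version, payload, sig = carrier.split(":")
    if (prefix, version) != (PREFIX, VERSION):
        raise ValueError("not a vLLM state carrier")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):  # constant-time comparison
        raise ValueError("HMAC mismatch: tampered blob or wrong signing key")
    return json.loads(base64.b64decode(payload))
```

A carrier signed on one replica validates on another only if both derive the same key, which is why the env var must be shared across nodes.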
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task, consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found and cancelled — cancellation succeeded, but no stored response object can be returned; 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError, since cancellation is the expected outcome of this call path

Update test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Will Deines <will@garr.io>
…arrier guard, background invariant)
Fixes three issues found in review:
1. cancel_responses stateless mode (gemini-code-assist P1):
- Return 404 (not 501) for unknown response_id — consistent with stateful path
- Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
cancelled but no stored response is available; 501 was misleading
- Use logger.info (not logger.exception) for expected CancelledError
2. Missing carrier guard in create_responses (codex P1):
- When previous_response has no state carrier and store is disabled,
return 400 with a clear message instead of falling through to
msg_store[id] KeyError → 500
3. background/store invariant in protocol validator (codex P2):
- Reject background=True + previous_response at validation time rather
than silently producing an unretrievable background response
Tests:
- Add test_cancel_without_store_active_task_returns_400: covers the
success branch of the cancel fix; uses await asyncio.sleep(0) to
start the task before cancelling (Python 3.12: unstarted tasks
cancelled before first await never run their body)
- Add test_previous_response_without_carrier_returns_400: regression
for the KeyError → 500 bug
- Add test_background_with_previous_response_raises: regression for
the background/store invariant
- Remove test_no_previous_response_preserves_store_true: passed
regardless of our code (no new path exercised)
- Remove test_full_stateless_roundtrip: duplicate of
test_build_and_extract_roundtrip
- Rename test_prev_messages_used_over_empty_msg_store →
test_construct_input_messages_prepends_prev_msg (accurate name)
All 31 tests pass.
Signed-off-by: Will Deines <will@garr.io>
…ey validation
Gemini critical — non-Harmony state carrier missing the assistant turn:
carrier_messages was only the input messages, omitting the assistant
response just generated. The next turn would see history without the
last assistant message. Fix: append
construct_chat_messages_with_tool_call(response.output) so the
carrier contains the full turn (input + response).
Codex P1 — carrierless previous_response check gated on enable_store:
The guard 'if prev_messages_from_state is None and not self.enable_store'
was too narrow. previous_response always means stateless path; a
server restart with enable_store=True and empty msg_store would still
KeyError. Fix: drop the 'not self.enable_store' condition.
Codex P2 — any-length hex key accepted as signing key:
bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.
Tests:
- test_previous_response_without_carrier_store_enabled_returns_400:
regression for P1 (store=True path also returns 400, not KeyError)
- test_short_key_raises: regression for P2 (4-byte key raises)
Run pre-commit --all-files; apply linter reformatting.
Signed-off-by: Will Deines <will@garr.io>
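The key-length check from the Codex P2 fix above can be sketched as follows (the env-var name is from this PR; `load_signing_key` and the random fallback behaviour are illustrative, not the committed code):

```python
import os

MIN_KEY_BYTES = 32  # HMAC-SHA256 wants >= 32 bytes, i.e. >= 64 hex chars

def load_signing_key(env_var: str = "VLLM_RESPONSES_STATE_SIGNING_KEY") -> bytes:
    raw = os.environ.get(env_var)
    if raw is None:
        # Per-process fallback: works single-node, breaks across restarts
        # and replicas (the PR logs a warning in this case).
        return os.urandom(MIN_KEY_BYTES)
    key_bytes = bytes.fromhex(raw)  # raises ValueError on non-hex input
    if len(key_bytes) < MIN_KEY_BYTES:
        # bytes.fromhex('aa') is a 1-byte key — far too weak for HMAC.
        raise ValueError(
            f"{env_var} must be at least {2 * MIN_KEY_BYTES} hex chars "
            f"({MIN_KEY_BYTES} bytes); got {len(key_bytes)} bytes"
        )
    return key_bytes
```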
Signed-off-by: Will Deines <will@garr.io>
…t state Signed-off-by: Will Deines <will@garr.io>
Signed-off-by: Will Deines <will@garr.io>
Force-pushed from 589f29a to f5f030f
The non-streaming path caught GenerationError and called self._convert_generation_error_to_response() which doesn't exist on OpenAIServingResponses or its base class. The base class only has _convert_generation_error_to_streaming_response() for the streaming path. Non-streaming GenerationErrors are handled by the global FastAPI exception handler registered in api_server.py, so the explicit catch is unnecessary. Signed-off-by: Will Deines <will@garr.io>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8a5379112
carrier_messages = messages + construct_chat_messages_with_tool_call(
    response.output
Stop duplicating the last assistant turn in stateless prompts
When store=false + previous_response is used on non-Harmony models, this branch serializes messages and response.output into the carrier. On the next turn _make_request() passes that deserialized list as prev_msg, but construct_input_messages() still appends prev_response.output again, so the previous assistant/tool output is injected twice. Every stateless follow-up on the simple/parsable path will therefore send a different conversation than the original one, which can change model behavior and break tool-call continuation.
Good catch — fixed in 7c35d29. _make_request() now passes prev_response_output=None when prev_messages is not None (i.e. the stateless path), since the carrier already contains the full history including the assistant turn. Added regression tests in TestNoDuplicateAssistantTurn confirming no duplication.
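The branching the reply describes can be sketched as an illustrative helper (the real logic lives inside `_make_request`):

```python
def resolve_prev_context(prev_messages, prev_response_output):
    """Pick the history source for the next turn without duplicating output.

    Stateless path: prev_messages (deserialised from the carrier) already
    contains the last assistant turn, so the stored output is suppressed.
    Stateful path: history starts empty and prev_response_output is
    appended later by construct_input_messages.
    """
    if prev_messages is not None:
        return prev_messages, None  # carrier wins; drop prev_response_output
    return [], prev_response_output

# Stateless: the carrier history is used and the output is not re-injected.
msgs, out = resolve_prev_context([{"role": "assistant", "content": "hi"}], ["item"])
assert out is None
```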
if prev_response is not None and request.previous_response is not None:
    prev_messages_from_state = self._extract_state_from_response(prev_response)
Handle invalid state carriers as request errors
_extract_state_from_response() raises ValueError for HMAC mismatches or malformed carriers, but this call now happens before the preprocessing try/except. If a client retries after a restart, lands on a different replica, or sends a tampered blob, create_responses() will propagate an uncaught exception to FastAPI and return a 500 instead of the intended 4xx validation error. Because cross-instance mismatches are expected unless VLLM_RESPONSES_STATE_SIGNING_KEY is shared, this will surface as a production-facing failure mode for normal stateless deployments.
Good catch — fixed in 7c35d29. Wrapped _extract_state_from_response() in try/except ValueError returning a 400 with a message about tampering and VLLM_RESPONSES_STATE_SIGNING_KEY consistency. Added regression test in TestTamperedCarrierReturns400.
if len(item.summary) == 1:
    reasoning_content = item.summary[0].text
elif item.content and len(item.content) == 1:
Preserve reasoning content when summary is also present
This changes the existing precedence so a reasoning item with both content and a one-entry summary now uses the summary text instead of the actual content. That is a regression for providers/models that populate both fields: downstream chat reconstruction loses the full reasoning text, even though the repo’s existing tests document that content should win when both are present.
Good catch — fixed in 7c35d29. Restored content-first precedence: item.content is checked before item.summary, with summary used only as a fallback (with warning). The existing TestReasoningItemContentPriority tests now pass again.
- Fix duplicate assistant turn: pass prev_response_output=None when using stateless path (prev_messages already contains the output)
- Handle ValueError from _extract_state_from_response: wrap in try/except returning 400 on HMAC mismatch instead of unhandled 500
- Fix content/summary precedence: check item.content before item.summary in reasoning items, restoring intended content-first behavior

Signed-off-by: Will Deines <will@garr.io>
Force-pushed from b56e429 to 87b43c6
Signed-off-by: Will Deines <will@garr.io>
The stateless encrypted_content carrier was using Pydantic's model_dump()/model_validate() to serialize/deserialize Harmony messages, but this produces dicts incompatible with the library's typed constructors. Switch to to_dict()/from_dict() so messages roundtrip correctly and remain renderable for completion. Signed-off-by: Will Deines <will@garr.io>
Related Issues, PRs, and RFCs
Directly Addressed by This PR

- `previous_response_id` requires `VLLM_ENABLE_RESPONSES_API_STORE=1`; this PR provides a store-free alternative
- `encrypted_content` avoids server-side storage entirely, eliminating the memory leak root cause rather than bounding it

Related RFCs

- `encrypted_content` field from the OpenAI spec (no new protocol extensions), aligning with the conservative extension policy
- `encrypted_content` and `previous_response` are both existing OpenAI spec fields

Companion PR
Decisions We Made That Can Be Debated
1. HMAC signing vs. actual encryption of conversation history

What we chose: The state carrier uses HMAC-SHA256 for tamper detection. The conversation history is base64-encoded but not encrypted — anyone who intercepts the `encrypted_content` field can decode and read it.

Alternative: Use authenticated encryption (AES-GCM or similar) so the history is both tamper-proof and confidential. The field is called `encrypted_content`, after all.

Why we chose this: The field name comes from the OpenAI spec, not from us — we use it as an opaque signed blob consistent with the spec's intent. Real encryption adds key management complexity (key rotation, IV generation, padding) that isn't justified for the initial implementation. The content travels over TLS between client and server, so in-transit confidentiality is already handled. The RFC (#26934) scoped encryption as out of scope for the initial implementation.

What reviewers might disagree with: Users may assume `encrypted_content` means encrypted. If the response is logged, cached, or stored client-side, the conversation history is readable. A follow-up could add optional encryption behind a flag.

2. Single carrier (full Harmony history) vs. per-item state
What we chose: One synthetic `ReasoningItem` carries the full Harmony message history as a single signed blob. The carrier is appended to the response output and filtered out in `utils.py` before messages reach the LLM.

Alternative: Attach state to each reasoning item individually — each `ReasoningItem` carries its own context, and the history is reconstructed by collecting all items.

Why we chose this: Tool-call metadata (raised by @alecsolder in #26934) — tool call results and assistant metadata aren't expressible in per-item form but are captured in the Harmony message list. A single carrier is also simpler to implement, verify, and debug. The full message list is what the model actually needs on the next turn.

What reviewers might disagree with: The carrier grows linearly with conversation length. For very long conversations, this could become large. A per-item approach or delta-based encoding could be more efficient, but adds significant complexity.

3. Reuse `encrypted_content` field vs. a new vLLM-specific field

What we chose: We reuse the existing `encrypted_content` field on `ResponseReasoningItem` from the OpenAI spec, and the existing `previous_response` field on the request. Zero new wire protocol fields.

Alternative: Add a vLLM-specific field (e.g., `vllm_state_carrier`) or use the OpenResponses extension mechanism proposed in #33381.

Why we chose this: RFCs #32850 and #33381 both emphasize conservative extensions and alignment with existing specs. `encrypted_content` is already defined as an opaque blob for platform use — our usage is consistent with that intent. Clients using the standard OpenAI SDK can use this feature without any SDK modifications. @DanielMe's proposal in #32850 supports allowing extensions "when there is a documented need" following existing patterns.

4. Per-process random key (default) vs. requiring explicit key configuration
What we chose: When `VLLM_RESPONSES_STATE_SIGNING_KEY` is not set, a per-process random key is generated with a warning. This means stateless multi-turn works out of the box for single-node dev, but carriers are invalidated on restart and incompatible across nodes.

Alternative: Require the env var and fail hard if it's not set — forcing users to think about key management upfront.

Why we chose this: The zero-config default matches vLLM's general philosophy — things should work out of the box for the common single-node case. The warning makes the limitation visible. Production deployments with multi-node or restart requirements will naturally need to set the key, and the error messages guide them there.

What reviewers might disagree with: Silent key generation could lead to hard-to-debug issues in production (e.g., a rolling restart invalidates all in-flight carriers). A louder failure mode might be safer.

5. New `previous_response` field (full object) vs. extending `previous_response_id` for stateless

What we chose: A new vLLM-specific `previous_response` field that accepts the full `ResponsesResponse` object, with a `model_validator` enforcing mutual exclusion with `previous_response_id`. This is a protocol extension — the OpenAI spec only defines `previous_response_id` (a string for server-side lookup).

Alternative: Overload `previous_response_id` with a special sentinel or encoded value, or embed the carrier in a separate field alongside `previous_response_id`.

Why we chose this: `previous_response_id` implies a server-side lookup by design — overloading it for stateless use would be confusing. A separate field makes the two paths self-documenting: `previous_response_id` means "look it up in the store," `previous_response` means "here's the full response, no store needed." The mutual exclusion validator makes it impossible to mix the two.

What reviewers might disagree with: This is a vLLM-specific extension to the wire protocol, which #32850 and #33381 argue should be conservative. However, the field is purely additive (existing clients are unaffected) and the alternative — overloading `previous_response_id` — would be more confusing. If OpenResponses or the OpenAI spec adds a similar field in the future, we can align with it.

6. Stateless path as additive (keep the store) vs. replacing it

What we chose: The stateless path is purely additive. The existing `previous_response_id` + `VLLM_ENABLE_RESPONSES_API_STORE=1` path is completely unchanged. Both paths coexist.

Alternative: Remove the in-memory store entirely and force all multi-turn through the stateless path.

Why we chose this: RFC #26934 identifies two user personas — production deployments that want stateless operation, and researchers/small-scale users who want an all-in-one server with state. Removing the store would break the second persona. The companion PR (#35874) makes the store pluggable for production use cases that need server-side state (e.g., `background` mode, `retrieve_responses`).

What reviewers might disagree with: Keeping both paths means more code to maintain and test. If the stateless path proves sufficient for most use cases, the store path could be deprecated in a future PR.
Summary
Implements stateless multi-turn Responses API conversations without server-side storage, using the existing `encrypted_content` field on `ResponseReasoningItem` as the state carrier. Proposed by @grs in #26934.

The three in-process dicts (`response_store`, `msg_store`, `event_store`) marked `# HACK` / `# FIXME` in `serving.py` are disabled by default because they leak memory, lose state on restart, and are incompatible with multi-node deployments. This PR provides a production-ready alternative for the multi-turn use case.

Design

- Wire format: `vllm:1:<base64(json(messages))>:<hmac-sha256-hex>`
- `VLLM_RESPONSES_STATE_SIGNING_KEY` (64-char hex) enables multi-node / restart-safe operation. Without it, a per-process random key is generated with a warning.
- `ReasoningItem` is filtered out in `utils.py` before messages reach the LLM — invisible to the model.

Files Changed
- `vllm/entrypoints/openai/responses/state.py` (new) — `serialize_state` / `deserialize_state` / `is_state_carrier` / HMAC helpers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response: ResponsesResponse | None` to `ResponsesRequest`; mutual-exclusion `model_validator` with `previous_response_id`; reject `background=True` with `previous_response`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution in `create_responses`; thread `prev_messages` through `_make_request` / `_make_request_with_harmony` / `_construct_input_messages_with_harmony`; inject state carrier in `responses_full_generator`; try/except `ValueError` around state extraction, returning 400 on HMAC mismatch; 400 guard when carrier missing from `previous_response`; 501 on `retrieve_responses` when store disabled; 404/400 on `cancel_responses` when store disabled; avoid duplicate assistant turn on stateless path by passing `prev_response_output=None` when `prev_messages` is set
- `vllm/entrypoints/openai/responses/utils.py` — `_construct_single_message_from_response_item` returns `None` for state-carrier items; filter `None` in `construct_chat_messages_with_tool_call`; fix content/summary precedence regression (content-first, summary as fallback with warning)
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new)
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new)

Test Results
All pure-Python unit tests — no GPU required.
Usage
Turn 1:
Turn 2:
Multi-node / restart-safe:
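The code blocks for the usage steps above were stripped in this rendering; the request shapes can be sketched from the PR description as follows (the model name and the exact embedding of `previous_response` in the request body are assumptions — the PR specifies passing the full response object):

```python
import json

# Turn 1: opt in to the stateless path — no server-side storage, and ask
# for the encrypted_content state carrier in the response output.
turn1 = {
    "model": "my-model",  # assumption: whatever model the server serves
    "input": "What is the capital of France?",
    "store": False,
    "include": ["reasoning.encrypted_content"],
}

# The turn-1 response output ends with a synthetic ReasoningItem whose
# encrypted_content is the signed blob "vllm:1:<base64(json)>:<hmac-sha256>".
turn1_response = {"id": "resp_1", "output": ["... includes the state carrier ..."]}

# Turn 2: send the full previous response object (not just its id); the
# server verifies the carrier and rebuilds the history — no store touched.
turn2 = {
    "model": "my-model",
    "input": "And its population?",
    "store": False,
    "include": ["reasoning.encrypted_content"],
    "previous_response": turn1_response,
}
print(json.dumps(turn2, indent=2))
```

For multi-node or restart-safe operation, set the same `VLLM_RESPONSES_STATE_SIGNING_KEY` (64-char hex, e.g. from `openssl rand -hex 32`) on every replica before starting the servers.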
Backward Compatibility
No breaking changes. The existing `previous_response_id` + `VLLM_ENABLE_RESPONSES_API_STORE=1` path is completely unchanged. The new path requires explicit opt-in (`store=false` + `include=["reasoning.encrypted_content"]` + `previous_response` on turn 2). `previous_response_id` with store disabled now returns a helpful 400 pointing users to the stateless path.

Test plan
- `pytest tests/entrypoints/openai/responses/test_state.py tests/entrypoints/openai/responses/test_serving_stateless.py -v`
- `VLLM_RESPONSES_STATE_SIGNING_KEY` enables cross-node carrier compatibility

cc @qandrew @WoosukKwon @njhill @DanielMe @chaunceyjiang