feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934)#35740
Conversation
Implements the @grs proposal for stateless multi-turn Responses API conversations without server-side storage, using the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem` as the state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (the full response object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier item; no in-memory store is touched

**No breaking changes.** The existing `previous_response_id` + store-enabled path is unchanged. The new path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same 64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new): serialise / deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py`: add `previous_response` field + mutual-exclusion validator on `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py`: stateless prev-response resolution; thread `prev_messages` through `_make_request*`; inject state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when the store is disabled
- `vllm/entrypoints/openai/responses/utils.py`: skip state-carrier `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py`: register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new): 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new): 14 unit tests

Closes #26934 (partial: non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
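The signed-blob wire format described above can be sketched with only the standard library. This is an illustrative stand-in, not the PR's actual code: the function names mirror the helpers listed for `state.py`, but the bodies here are a minimal reconstruction from the documented format `vllm:1:<base64(json)>:<hmac-sha256>`.

```python
import base64
import hashlib
import hmac
import json
import secrets

# Stand-in for VLLM_RESPONSES_STATE_SIGNING_KEY (per-process random by default).
KEY = secrets.token_bytes(32)


def serialize_state(messages: list[dict], key: bytes = KEY) -> str:
    """Pack a message history into vllm:1:<base64(json)>:<hmac-sha256-hex>."""
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{payload}:{sig}"


def deserialize_state(carrier: str, key: bytes = KEY) -> list[dict]:
    """Verify the HMAC (timing-safe) and recover the message history."""
    # Standard base64 and hex contain no ':', so a plain split is unambiguous.
    prefix, version, payload, sig = carrier.split(":")
    if (prefix, version) != ("vllm", "1"):
        raise ValueError("not a vLLM state carrier")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the signature via timing differences.
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state carrier signature mismatch")
    return json.loads(base64.b64decode(payload))


history = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]
assert deserialize_state(serialize_state(history)) == history
```

Any bit flip in the payload or signature fails `compare_digest`, so a tampered carrier is rejected before deserialisation.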
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task, consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found and cancelled; cancellation succeeded, but no stored response object can be returned, so 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError, since cancellation is the expected outcome of this call path

Update test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arrier guard, background invariant)
Fixes three issues found in review:
1. cancel_responses stateless mode (gemini-code-assist P1):
- Return 404 (not 501) for unknown response_id — consistent with stateful path
- Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
cancelled but no stored response is available; 501 was misleading
- Use logger.info (not logger.exception) for expected CancelledError
2. Missing carrier guard in create_responses (codex P1):
- When previous_response has no state carrier and store is disabled,
return 400 with a clear message instead of falling through to
msg_store[id] KeyError → 500
3. background/store invariant in protocol validator (codex P2):
- Reject background=True + previous_response at validation time rather
than silently producing an unretrievable background response
Tests:
- Add test_cancel_without_store_active_task_returns_400: covers the
success branch of the cancel fix; uses await asyncio.sleep(0) to
start the task before cancelling (Python 3.12: unstarted tasks
cancelled before first await never run their body)
- Add test_previous_response_without_carrier_returns_400: regression
for the KeyError → 500 bug
- Add test_background_with_previous_response_raises: regression for
the background/store invariant
- Remove test_no_previous_response_preserves_store_true: passed
regardless of our code (no new path exercised)
- Remove test_full_stateless_roundtrip: duplicate of
test_build_and_extract_roundtrip
- Rename test_prev_messages_used_over_empty_msg_store →
test_construct_input_messages_prepends_prev_msg (accurate name)
All 31 tests pass.
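The `await asyncio.sleep(0)` trick used in the new cancel test can be demonstrated standalone: a task cancelled before its first await never runs its body, while a single zero-length yield lets it start. This is a self-contained illustration, not test code from the PR.

```python
import asyncio


async def main() -> list[str]:
    ran: list[str] = []

    async def body():
        ran.append("started")
        await asyncio.sleep(10)

    task = asyncio.create_task(body())
    # Without this yield, cancel() would land before the task body ever runs,
    # and the test could never exercise the "active task cancelled" branch.
    await asyncio.sleep(0)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass  # cancellation is the expected outcome here
    return ran


print(asyncio.run(main()))  # ['started']
```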
👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a significant and well-designed feature for stateless multi-turn conversations in the Responses API, using a signed state carrier in the encrypted_content field. The implementation is thorough, with comprehensive new tests and careful consideration for security aspects like using hmac.compare_digest to prevent timing attacks. The changes are well-structured across new and existing modules.
I found one critical issue in the implementation for the non-Harmony code path, where the state carrier was being generated without including the assistant's response from the current turn. This would cause the conversation history to be incomplete for subsequent turns. I've provided a detailed comment and a suggested fix for this.
Hi @will-deines, the pre-commit checks have failed. Please run:

    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d1be94bcb
…ey validation
Gemini critical — non-Harmony state carrier missing the assistant turn:
carrier_messages was only the input messages, omitting the assistant
response just generated. The next turn would see history without the
last assistant message. Fix: append
construct_chat_messages_with_tool_call(response.output) so the
carrier contains the full turn (input + response).
Codex P1 — carrierless previous_response check gated on enable_store:
The guard 'if prev_messages_from_state is None and not self.enable_store'
was too narrow. previous_response always means stateless path; a
server restart with enable_store=True and empty msg_store would still
KeyError. Fix: drop the 'not self.enable_store' condition.
Codex P2 — any-length hex key accepted as signing key:
bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.
Tests:
- test_previous_response_without_carrier_store_enabled_returns_400:
regression for P1 (store=True path also returns 400, not KeyError)
- test_short_key_raises: regression for P2 (4-byte key raises)
Run pre-commit --all-files; apply linter reformatting.
Signed-off-by: Will Deines <will@garr.io>
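The Codex P2 key-length check above can be sketched as a standalone loader. Only the env var name comes from the PR; the function name and structure here are illustrative.

```python
import os


def load_signing_key() -> bytes:
    """Illustrative loader enforcing the >= 32-byte (64 hex chars) key rule."""
    raw = os.environ.get("VLLM_RESPONSES_STATE_SIGNING_KEY", "")
    if not raw:
        raise ValueError("signing key not configured")
    try:
        key = bytes.fromhex(raw)
    except ValueError:
        raise ValueError("signing key must be a hex string") from None
    # Codex P2: bytes.fromhex("aa") is a single byte, far too weak for HMAC.
    if len(key) < 32:
        raise ValueError("signing key must be at least 64 hex chars (32 bytes)")
    return key


os.environ["VLLM_RESPONSES_STATE_SIGNING_KEY"] = "aa" * 32  # 64 hex chars
print(len(load_signing_key()))  # 32
```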
# history from the encrypted_content state carrier embedded in the response
# output, so no server-side store is required.
# Cannot be set together with previous_response_id.
previous_response: "ResponsesResponse | None" = None
Is this field one of the standard fields in the Open Response API?
No, previous_response is a vLLM extension — the standard OpenAI API only has previous_response_id (a string ID that requires server-side storage).
This extension is needed because previous_response_id depends on vLLM's in-process store (the response_store / msg_store / event_store dicts marked # HACK / # FIXME), which leaks memory (#34738), loses state on restart, and is incompatible with production multi-node deployments (RFC #26934, raised by @qandrew at Meta).
The approach follows @grs's proposal in #26934: the client sends back the full response object; vLLM extracts the signed state from the existing encrypted_content field on a ReasoningItem — so the wire-format response itself uses zero new fields.
This is consistent with vLLM's existing extension policy (RFC #32850) — the Responses API already ships non-standard request fields (top_k, request_id, priority, etc.), and the Open Responses spec explicitly allows implementation extensions.
Mutual exclusion with previous_response_id is enforced via a Pydantic model_validator.
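The mutual-exclusion rule (and the later background/store invariant) can be sketched without Pydantic. The PR uses a Pydantic `model_validator` on `ResponsesRequest`; this is a dependency-free analogue of the same checks with illustrative names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ResponsesRequestSketch:
    """Plain-Python stand-in for the Pydantic validator on ResponsesRequest."""

    previous_response_id: str | None = None
    previous_response: dict[str, Any] | None = None  # stands in for ResponsesResponse
    background: bool = False

    def __post_init__(self) -> None:
        # The two continuation mechanisms are mutually exclusive.
        if self.previous_response_id is not None and self.previous_response is not None:
            raise ValueError(
                "previous_response and previous_response_id are mutually exclusive"
            )
        # Background mode needs the store to retrieve the result later, so it
        # cannot be combined with the stateless previous_response path.
        if self.background and self.previous_response is not None:
            raise ValueError(
                "background=True cannot be combined with previous_response"
            )
```

Rejecting bad combinations at validation time returns a 422/400 to the client instead of silently producing an unretrievable background response.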
Extract the three in-memory dicts (response_store, msg_store, response_store_lock) from OpenAIServingResponses into a pluggable ResponseStore ABC with an InMemoryResponseStore default. Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified class name to swap in their own backend (Redis, Postgres, etc.) without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with an internal asyncio.Lock (removes the external response_store_lock)
- Add create_response_store() factory reading the env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934 (pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).

Signed-off-by: Will Deines <will@garr.io>
Related Issues, PRs, and RFCs
Directly Addressed by This PR
- `previous_response_id` requires `VLLM_ENABLE_RESPONSES_API_STORE=1`; this PR provides a store-free alternative
- `encrypted_content` avoids server-side storage entirely, eliminating the memory-leak root cause rather than bounding it

Related RFCs
- Uses the `encrypted_content` field from the OpenAI spec (no new protocol extensions), aligning with the conservative extension policy
- `encrypted_content` and `previous_response` are both existing OpenAI spec fields

Companion PR
Decisions We Made That Can Be Debated
1. HMAC signing vs. actual encryption of conversation history
**What we chose:** The state carrier uses HMAC-SHA256 for tamper detection. The conversation history is base64-encoded but not encrypted; anyone who intercepts the `encrypted_content` field can decode and read it.

**Alternative:** Use authenticated encryption (AES-GCM or similar) so the history is both tamper-proof and confidential. The field is called `encrypted_content`, after all.

**Why we chose this:** The field name comes from the OpenAI spec, not from us; we use it as an opaque signed blob consistent with the spec's intent. Real encryption adds key-management complexity (key rotation, IV generation, padding) that isn't justified for the initial implementation. The content travels over TLS between client and server, so in-transit confidentiality is already handled. The RFC (#26934) scoped encryption as out of scope for the initial implementation.

**What reviewers might disagree with:** Users may assume `encrypted_content` means encrypted. If the response is logged, cached, or stored client-side, the conversation history is readable. A follow-up could add optional encryption behind a flag.

2. Single carrier (full Harmony history) vs. per-item state
**What we chose:** One synthetic `ReasoningItem` carries the full Harmony message history as a single signed blob. The carrier is appended to the response output and filtered out in `utils.py` before messages reach the LLM.

**Alternative:** Attach state to each reasoning item individually; each `ReasoningItem` carries its own context, and the history is reconstructed by collecting all items.

**Why we chose this:** Tool-call metadata (raised by @alecsolder in #26934): tool-call results and assistant metadata aren't expressible in per-item form but are captured in the Harmony message list. A single carrier is also simpler to implement, verify, and debug. The full message list is what the model actually needs on the next turn.

**What reviewers might disagree with:** The carrier grows linearly with conversation length. For very long conversations, this could become large. A per-item approach or delta-based encoding could be more efficient, but adds significant complexity.
3. Reuse `encrypted_content` field vs. a new vLLM-specific field

**What we chose:** We reuse the existing `encrypted_content` field on `ResponseReasoningItem` from the OpenAI spec, and the existing `previous_response` field on the request. Zero new wire-protocol fields.

**Alternative:** Add a vLLM-specific field (e.g., `vllm_state_carrier`) or use the OpenResponses extension mechanism proposed in #33381.

**Why we chose this:** RFCs #32850 and #33381 both emphasize conservative extensions and alignment with existing specs. `encrypted_content` is already defined as an opaque blob for platform use; our usage is consistent with that intent. Clients using the standard OpenAI SDK can use this feature without any SDK modifications. @DanielMe's proposal in #32850 supports allowing extensions "when there is a documented need" following existing patterns.

4. Per-process random key (default) vs. requiring explicit key configuration
**What we chose:** When `VLLM_RESPONSES_STATE_SIGNING_KEY` is not set, a per-process random key is generated with a warning. This means stateless multi-turn works out of the box for single-node dev, but carriers are invalidated on restart and incompatible across nodes.

**Alternative:** Require the env var and fail hard if it's not set, forcing users to think about key management upfront.

**Why we chose this:** The zero-config default matches vLLM's general philosophy: things should work out of the box for the common single-node case. The warning makes the limitation visible. Production deployments with multi-node or restart requirements will naturally need to set the key, and the error messages guide them there.

**What reviewers might disagree with:** Silent key generation could lead to hard-to-debug issues in production (e.g., a rolling restart invalidates all in-flight carriers). A louder failure mode might be safer.
5. New `previous_response` field (full object) vs. extending `previous_response_id` for stateless

**What we chose:** A new vLLM-specific `previous_response` field that accepts the full `ResponsesResponse` object, with a `model_validator` enforcing mutual exclusion with `previous_response_id`. This is a protocol extension; the OpenAI spec only defines `previous_response_id` (a string for server-side lookup).

**Alternative:** Overload `previous_response_id` with a special sentinel or encoded value, or embed the carrier in a separate field alongside `previous_response_id`.

**Why we chose this:** `previous_response_id` implies a server-side lookup by design; overloading it for stateless use would be confusing. A separate field makes the two paths self-documenting: `previous_response_id` means "look it up in the store," `previous_response` means "here's the full response, no store needed." The mutual-exclusion validator makes it impossible to mix the two.

**What reviewers might disagree with:** This is a vLLM-specific extension to the wire protocol, which #32850 and #33381 argue should be conservative. However, the field is purely additive (existing clients are unaffected) and the alternative, overloading `previous_response_id`, would be more confusing. If OpenResponses or the OpenAI spec adds a similar field in the future, we can align with it.

6. Stateless path as additive (keep the store) vs. replacing it
**What we chose:** The stateless path is purely additive. The existing `previous_response_id` + `VLLM_ENABLE_RESPONSES_API_STORE=1` path is completely unchanged. Both paths coexist.

**Alternative:** Remove the in-memory store entirely and force all multi-turn through the stateless path.

**Why we chose this:** RFC #26934 identifies two user personas: production deployments that want stateless operation, and researchers/small-scale users who want an all-in-one server with state. Removing the store would break the second persona. The companion PR (#35874) makes the store pluggable for production use cases that need server-side state (e.g., `background` mode, `retrieve_responses`).

**What reviewers might disagree with:** Keeping both paths means more code to maintain and test. If the stateless path proves sufficient for most use cases, the store path could be deprecated in a future PR.
Summary
Implements stateless multi-turn Responses API conversations without server-side storage, using the existing `encrypted_content` field on `ResponseReasoningItem` as the state carrier. Proposed by @grs in #26934.

The three in-process dicts (`response_store`, `msg_store`, `event_store`) marked `# HACK` / `# FIXME` in `serving.py` are disabled by default because they leak memory, lose state on restart, and are incompatible with multi-node deployments. This PR provides a production-ready alternative for the multi-turn use case.

Design
Wire format: `vllm:1:<base64(json(messages))>:<hmac-sha256-hex>`

- `VLLM_RESPONSES_STATE_SIGNING_KEY` (64-char hex) enables multi-node / restart-safe operation. Without it, a per-process random key is generated with a warning.
- The synthetic `ReasoningItem` is filtered out in `utils.py` before messages reach the LLM; it is invisible to the model.

Files Changed
- `vllm/entrypoints/openai/responses/state.py` (new): `serialize_state` / `deserialize_state` / `is_state_carrier` / HMAC helpers
- `vllm/entrypoints/openai/responses/protocol.py`: add `previous_response: ResponsesResponse | None` to `ResponsesRequest`; mutual-exclusion `model_validator` with `previous_response_id`; reject `background=True` with `previous_response`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py`: stateless resolution in `create_responses`; thread `prev_messages` through `_make_request` / `_make_request_with_harmony` / `_construct_input_messages_with_harmony`; inject state carrier in `responses_full_generator`; 400 guard when carrier missing from `previous_response`; 501 on `retrieve_responses` when store disabled; 404/400 on `cancel_responses` when store disabled
- `vllm/envs.py`: `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `vllm/entrypoints/openai/responses/utils.py`: `_construct_single_message_from_response_item` returns `None` for state-carrier items; filter `None` in `construct_chat_messages_with_tool_call`
- `tests/entrypoints/openai/responses/test_state.py` (new)
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new)

Test Results
All pure-Python unit tests — no GPU required.
Usage
Turn 1:
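The original snippet did not survive the page capture; a minimal sketch of the turn-1 request body (model name and input are placeholders), built as a plain dict rather than a live API call:

```python
# Turn 1: opt in to the stateless state carrier. POSTed to /v1/responses.
turn1 = {
    "model": "my-model",  # placeholder
    "input": "What is the capital of France?",
    "store": False,  # do not use the server-side store
    "include": ["reasoning.encrypted_content"],  # ask for the carrier
}

assert turn1["store"] is False
```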
Turn 2:
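Again sketched as a plain payload dict (the original snippet was lost); `turn1_response` is a placeholder for the full JSON body returned by turn 1:

```python
# Turn 2: send back the FULL turn-1 response via the vLLM extension field
# `previous_response` (not `previous_response_id`). The server extracts and
# verifies the history from the carrier item inside it.
turn1_response = {"id": "resp_abc", "output": ["..."]}  # placeholder

turn2 = {
    "model": "my-model",  # placeholder
    "input": "And its population?",
    "store": False,
    "include": ["reasoning.encrypted_content"],
    "previous_response": turn1_response,
}

assert turn2["previous_response"] is turn1_response
```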
Multi-node / restart-safe:
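A sketch of the key setup (the original snippet was lost; `openssl rand` is just one way to produce 64 hex chars):

```shell
# Generate one key (64 hex chars = 32 bytes):
KEY="$(openssl rand -hex 32)"
# Export the SAME value on every replica before starting vLLM, so carriers
# signed on one node validate on all the others and survive restarts:
export VLLM_RESPONSES_STATE_SIGNING_KEY="$KEY"
```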
Backward Compatibility
No breaking changes. The existing `previous_response_id` + `VLLM_ENABLE_RESPONSES_API_STORE=1` path is completely unchanged. The new path requires explicit opt-in (`store=false` + `include=["reasoning.encrypted_content"]` + `previous_response` on turn 2). Using `previous_response_id` with the store disabled now returns a helpful 400 pointing users to the stateless path.

Test plan
- `pytest tests/entrypoints/openai/responses/test_state.py tests/entrypoints/openai/responses/test_serving_stateless.py -v`
- `VLLM_RESPONSES_STATE_SIGNING_KEY` enables cross-node carrier compatibility

cc @qandrew @WoosukKwon @njhill @DanielMe @chaunceyjiang