feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal) #35701
Implements the @grs proposal for stateless multi-turn Responses API conversations without server-side storage, using the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem` as the state carrier.

**How it works:**

1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier item — no in-memory store touched

**No breaking changes.** The existing `previous_response_id` + store-enabled path is unchanged. The new path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same 64-char hex value on all nodes so tokens validate across replicas.

Files changed:

- `vllm/entrypoints/openai/responses/state.py` (new) — serialise / deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response` field + mutual-exclusion validator on `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution; thread `prev_messages` through `_make_request*`; inject state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) — 14 unit tests

Closes #26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
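The signed-blob format from step 2 can be sketched as follows. This is a minimal sketch, not the PR's actual implementation: it assumes plain JSON-serialisable message dicts, whereas the real `serialize_state` / `deserialize_state` in `state.py` operate on Harmony messages; the carrier-id helpers mirror the `rs_state_<req_id[-12:]>` convention described below.

```python
import base64
import hashlib
import hmac
import json
import os

# Per-process random key unless VLLM_RESPONSES_STATE_SIGNING_KEY is set
# (the PR warns when falling back to a random per-process key).
SIGNING_KEY = bytes.fromhex(
    os.environ.get("VLLM_RESPONSES_STATE_SIGNING_KEY", os.urandom(32).hex()))


def serialize_state(messages: list) -> str:
    """Pack a message history into vllm:1:<base64(json)>:<hmac-sha256-hex>."""
    blob = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(SIGNING_KEY, blob.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{blob}:{sig}"


def deserialize_state(token: str) -> list:
    """Verify the HMAC and recover the message history; reject tampering."""
    prefix, version, blob, sig = token.split(":")
    if (prefix, version) != ("vllm", "1"):
        raise ValueError("not a vllm v1 state carrier")
    expected = hmac.new(SIGNING_KEY, blob.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state carrier signature mismatch")
    return json.loads(base64.b64decode(blob))


def make_carrier_id(req_id: str) -> str:
    """Synthetic carrier item id, per the rs_state_<req_id[-12:]> convention."""
    return f"rs_state_{req_id[-12:]}"


def is_state_carrier(item_id: str) -> bool:
    return item_id.startswith("rs_state_")
```

Round-tripping a history through `serialize_state` and `deserialize_state` returns it unchanged, while any edit to the base64 payload invalidates the signature; the colon is safe as a separator because it appears in neither base64 nor hex output.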
Documentation preview: https://vllm--35701.org.readthedocs.build/en/35701/
Code Review
This pull request introduces a significant and well-designed feature for stateless multi-turn conversations in the Responses API, effectively addressing memory leak and multi-node deployment issues. The implementation, including the new state serialization module and protocol extensions, is robust and well-tested. My review identifies one high-severity issue in the new error handling for the cancel_responses endpoint in stateless mode, where the logic for task-not-found cases is incorrect and the returned HTTP status code is misleading. I've provided a detailed suggestion to improve the correctness and clarity of the API behavior in this scenario.
```python
if not self.enable_store:
    # Stateless mode: cancel in-flight tasks by ID only (no stored response).
    if task := self.background_tasks.get(response_id):
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            logger.exception(
                "Background task for %s was cancelled (stateless mode)",
                response_id,
            )
    return self.create_error_response(
        err_type="invalid_request_error",
        message=(
            "Response cancellation in stateless mode cancelled any "
            "in-flight task, but no stored response object is available. "
            "Enable VLLM_ENABLE_RESPONSES_API_STORE=1 for full cancel "
            "support."
        ),
        status_code=HTTPStatus.NOT_IMPLEMENTED,
    )
```
The current implementation for cancelling responses in stateless mode has a few issues:

- **Incorrect error on not found:** If `response_id` does not correspond to an active background task, the function returns a 501 Not Implemented error. It should return a 404 Not Found, consistent with the stateful path's behavior.
- **Misleading status code:** Returning `HTTPStatus.NOT_IMPLEMENTED` (501) is misleading when a cancellation is successfully initiated. A 400 Bad Request would be more appropriate to signal that this is a client error (using a feature not fully supported in stateless mode), while the message can clarify that the cancellation was attempted.
- **Logging:** `logger.exception` is used for `asyncio.CancelledError`. Since cancellation is the expected outcome of this operation, `logger.info` or `logger.debug` would be more suitable.
I suggest refactoring this block to handle these cases correctly.
```python
if not self.enable_store:
    # Stateless mode: cancel in-flight tasks by ID only (no stored response).
    task = self.background_tasks.get(response_id)
    if not task:
        # Mimic the stateful path's behavior of returning a 404 if the
        # response is not in the store.
        return self._make_not_found_error(response_id)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        logger.info(
            "Background task for %s was cancelled (stateless mode)",
            response_id,
        )
    # The task was cancelled, but we cannot return a full response object
    # as required by the API spec because no state is stored.
    # Return an error response indicating this limitation.
    return self.create_error_response(
        err_type="invalid_request_error",
        message=(
            "Response cancellation was initiated for the in-flight task, "
            "but a full response object cannot be returned in stateless mode. "
            "Enable VLLM_ENABLE_RESPONSES_API_STORE=1 for full cancel "
            "support."
        ),
        status_code=HTTPStatus.BAD_REQUEST,
    )
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 010a69cea9
```python
            for m in prev_messages
        ]
    else:
        prev_msgs = self.msg_store[prev_response.id]
```
Guard Harmony fallback when stateless carrier is missing
In Harmony mode, if a client sends `previous_response` but the prior response does not include the synthetic state carrier (for example they omitted `include=['reasoning.encrypted_content']`), `_extract_state_from_response` returns `None` and this branch falls back to `self.msg_store[prev_response.id]`. With store disabled, that key is absent and raises `KeyError`, and `create_responses` does not catch `KeyError`, so this becomes a 500 instead of a user-actionable 400. Please validate that `previous_response` actually contains a carrier before indexing `msg_store` (or use a safe lookup and an explicit error).
```python
if self.previous_response is not None and self.store:
    # Stateless path: store is meaningless (and would cause confusion).
    # Mirror the silent-disable behavior in create_responses for store=True.
    self.store = False
```
Preserve background/store invariant after validation
This `mode='after'` validator mutates `self.store` to `False` when `previous_response` is set, but `background` is validated earlier in a `mode='before'` validator against the original input. A request like `background=true` with `previous_response` (and the default `store=true`) can therefore pass validation and end up as `background=true, store=false`, which then bypasses the store-disabled guards and creates a background response that cannot be retrieved or cancelled when store is off (those endpoints now return 501). Re-validate `background` after this mutation, or avoid mutating `store` here.
Summary
Implements the @grs proposal for stateless multi-turn Responses API conversations — enabling multi-turn without server-side storage by serialising Harmony message history into the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem`.

Addresses: RFC #26934 | Related: #33089 #34738
Motivation
vLLM's Responses API currently stores conversation state in three in-process Python dicts (`response_store`, `msg_store`, `event_store`), commented `# HACK` / `# FIXME` in source. These are disabled by default (`VLLM_ENABLE_RESPONSES_API_STORE=0`) because they leak memory and break multi-node deployments, disabling `previous_response_id` multi-turn for most users (issues [Feature]: Support multi-turn conversation for OpenAI Response API #33089, Fix memory leak in Responses API store #34738).

RFC #26934 (@qandrew, Meta) calls for decoupling storage from the inference engine. @grs proposed reusing the existing `encrypted_content` field on `ReasoningItem` objects — no new protocol extensions needed.

Design
Wire format:
`vllm:1:<base64(json(messages))>:<hmac-sha256-hex>`

`VLLM_RESPONSES_STATE_SIGNING_KEY` (64-char hex) enables multi-node deployments; a random key is generated per-process if unset (with a warning).

The state carrier item is a synthetic `ReasoningItem` with `id="rs_state_<req_id[-12:]>"`. It is invisible to the LLM — filtered out by `_construct_single_message_from_response_item()` in `utils.py`.

Files Changed
- `vllm/entrypoints/openai/responses/state.py` — `serialize_state` / `deserialize_state` / `is_state_carrier` / HMAC helpers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response: ResponsesResponse | None` to `ResponsesRequest`; `model_validator` for mutual exclusion with `previous_response_id`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution in `create_responses`; thread `prev_messages` through `_make_request` / `_make_request_with_harmony` / `_construct_input_messages_with_harmony`; inject state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — `_construct_single_message_from_response_item` returns `None` for state-carrier items; `construct_chat_messages_with_tool_call` filters `None` entries
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) — 14 unit tests

Test Results
All tests are pure-Python unit tests (no GPU needed).
Usage Example
Turn 1 — no store needed:
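The original snippet did not survive extraction; a plausible turn-1 request per the opt-in described above. The model name and URL are placeholders, and the HTTP call is commented out because it needs a running vLLM server.

```python
payload_turn1 = {
    "model": "openai/gpt-oss-20b",               # placeholder model name
    "input": "What is the capital of France?",
    "store": False,                              # no server-side storage
    "include": ["reasoning.encrypted_content"],  # request the state carrier
}
# import httpx
# resp1 = httpx.post("http://localhost:8000/v1/responses", json=payload_turn1).json()
```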
Turn 2 — pass back the full response:
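Again reconstructed (the original block was lost in extraction): turn 2 echoes the full turn-1 response object back via the vLLM-specific `previous_response` field rather than `previous_response_id`.

```python
resp1 = {"id": "resp_example", "output": []}  # stand-in for turn 1's full JSON
payload_turn2 = {
    "model": "openai/gpt-oss-20b",
    "input": "And what is its population?",
    "store": False,
    "include": ["reasoning.encrypted_content"],
    "previous_response": resp1,  # full response object, carrier included
}
```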
Multi-node deployment:
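A sketch of the multi-node setup (the original snippet is missing; the serve command line is illustrative): generate one 64-char hex key and export the same value on every replica so carriers signed by one node verify on the others.

```shell
# 32 random bytes -> 64 hex chars, the length the PR expects.
export VLLM_RESPONSES_STATE_SIGNING_KEY="$(openssl rand -hex 32)"
echo "${#VLLM_RESPONSES_STATE_SIGNING_KEY}"   # 64
# then start each replica with the variable set, e.g.:
# vllm serve openai/gpt-oss-20b
```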
Backward Compatibility
- The `previous_response_id` + `VLLM_ENABLE_RESPONSES_API_STORE=1` path is completely unchanged.
- The new path is explicit opt-in: `store=false` + `include=["reasoning.encrypted_content"]` + `previous_response`.
- `previous_response_id` without store now returns a helpful 400 error pointing users to the stateless path.

Known Limitations / Follow-up
- Non-streaming only: the carrier is injected in `responses_full_generator`. Streaming (`responses_stream_generator`) needs `ResponseOutputItemAddedEvent` + `ResponseOutputItemDoneEvent` events for the carrier before `ResponseCompletedEvent`. Left as a follow-up to keep this PR reviewable.
- The blob is signed, not encrypted, despite the `encrypted_content` usage (the field name is from the OpenAI spec; vLLM reuses it for opaque state). Actual encryption is out of scope per the RFC.
- The `msg_store` / `response_store` hacks are not removed. This PR adds the stateless path without deleting the existing store-enabled path. Cleanup can follow once stateless is validated.

Reviewer Notes
Key design decision open for discussion: single state carrier vs. per-item state. The current implementation appends one synthetic `ReasoningItem` carrying the full Harmony history. @alecsolder raised in #26934 that tool-call metadata may not be expressible in per-reasoning-item form; this design sidesteps that by carrying the complete history in one item.

cc @qandrew @grs @WoosukKwon @njhill @Simon-Laux