
feat(responses): stateless multi-turn via encrypted_content (RFC #26934 + @grs proposal)#35701

Closed
will-deines wants to merge 3 commits into vllm-project:main from will-deines:feat/stateless-responses-encrypted-content

Conversation

@will-deines

Summary

Implements the @grs proposal for stateless multi-turn Responses API conversations — enabling multi-turn without server-side storage by serialising Harmony message history into the standard OpenAI encrypted_content field on a synthetic ResponseReasoningItem.

Addresses: RFC #26934 | Related: #33089 #34738


Motivation

vLLM's Responses API currently stores conversation state in three in-process Python dicts (response_store, msg_store, event_store), commented # HACK / # FIXME in source. These are disabled by default (VLLM_ENABLE_RESPONSES_API_STORE=0) because they:

  • grow without bound over the process lifetime (a memory leak), and
  • are per-process, so state is invisible to other replicas in multi-node deployments.

RFC #26934 (@qandrew, Meta) calls for decoupling storage from the inference engine. @grs proposed reusing the existing encrypted_content field on ReasoningItem objects — no new protocol extensions needed.


Design

Turn 1:  client → vLLM  { store: false, include: ["reasoning.encrypted_content"], input: "..." }
         vLLM  → client  { output: [...real items..., ReasoningItem(encrypted_content="vllm:1:<b64>:<hmac>")] }

Turn 2:  client → vLLM  { store: false, previous_response: <full Turn 1 response>, input: "..." }
         vLLM extracts encrypted_content, verifies HMAC, deserialises Harmony history → uses as context

Wire format: vllm:1:<base64(json(messages))>:<hmac-sha256-hex>

  • Not encrypted (content is base64 JSON) — HMAC is for tamper detection only
  • VLLM_RESPONSES_STATE_SIGNING_KEY (64-char hex) enables multi-node deployments; random key is generated per-process if unset (with a warning)

State carrier item is a synthetic ReasoningItem with id="rs_state_<req_id[-12:]>". It is invisible to the LLM — filtered out by _construct_single_message_from_response_item() in utils.py.
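The wire format above can be sketched in a few lines. This is a simplified illustration, not the PR's actual state.py; the function names match the PR description but the bodies are assumed:

```python
import base64
import hashlib
import hmac
import json
import os

# Hypothetical stand-in for VLLM_RESPONSES_STATE_SIGNING_KEY: a random
# per-process key, as the PR describes when the env var is unset.
SIGNING_KEY = os.urandom(32)


def serialize_state(messages: list[dict], key: bytes = SIGNING_KEY) -> str:
    """Pack message history into the vllm:1:<b64>:<hmac> wire format."""
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{payload}:{sig}"


def deserialize_state(token: str, key: bytes = SIGNING_KEY) -> list[dict]:
    """Verify the HMAC and recover the history; raise on tampering."""
    prefix, version, payload, sig = token.split(":")
    if (prefix, version) != ("vllm", "1"):
        raise ValueError("unrecognized state token")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state token failed HMAC verification")
    return json.loads(base64.b64decode(payload))
```

Note that base64 and hex never contain ":", so the four-way split is unambiguous, and compare_digest keeps the signature check constant-time.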


Files Changed

  • vllm/entrypoints/openai/responses/state.py (NEW): serialize_state / deserialize_state / is_state_carrier / HMAC helpers
  • vllm/entrypoints/openai/responses/protocol.py: add previous_response: ResponsesResponse | None to ResponsesRequest; model_validator for mutual exclusion with previous_response_id; model_rebuild() for the forward ref
  • vllm/entrypoints/openai/responses/serving.py: stateless prev-response resolution in create_responses; thread prev_messages through _make_request / _make_request_with_harmony / _construct_input_messages_with_harmony; inject the state carrier in responses_full_generator; 501 guards on retrieve_responses / cancel_responses when store is disabled
  • vllm/entrypoints/openai/responses/utils.py: _construct_single_message_from_response_item returns None for state-carrier items; construct_chat_messages_with_tool_call filters None entries
  • vllm/envs.py: register VLLM_RESPONSES_STATE_SIGNING_KEY
  • tests/entrypoints/openai/responses/test_state.py (NEW): 16 unit tests (round-trip, tamper detection, cross-key incompatibility, bad hex key, random key caching)
  • tests/entrypoints/openai/responses/test_serving_stateless.py (NEW): 14 unit tests (protocol validation, state carrier helpers, storeless error paths, utils skipping, prev_messages override)
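The utils.py filtering step can be sketched roughly as follows; is_state_carrier is named in the files list above, but this body and the rs_state_ prefix check are an assumed simplification, not the PR's actual code:

```python
STATE_ID_PREFIX = "rs_state_"  # prefix described in the Design section


def is_state_carrier(item: dict) -> bool:
    """A synthetic reasoning item whose id starts with rs_state_ carries
    serialized history and must never be shown to the model."""
    return (
        item.get("type") == "reasoning"
        and item.get("id", "").startswith(STATE_ID_PREFIX)
    )


def construct_chat_messages(items: list[dict]) -> list[dict]:
    """Drop state carriers, keep all real output items (a stand-in for the
    real _construct_single_message_from_response_item / filter pipeline)."""
    return [item for item in items if not is_state_carrier(item)]
```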

Test Results

tests/entrypoints/openai/responses/test_state.py             16/16 passed
tests/entrypoints/openai/responses/test_serving_stateless.py 14/14 passed

All tests are pure-Python unit tests (no GPU needed).


Usage Example

Turn 1 — no store needed:

resp1 = client.responses.create(
    model="gpt-4o",
    input="My name is Alice",
    store=False,
    include=["reasoning.encrypted_content"],
)
# resp1.output contains [..., ReasoningItem(encrypted_content="vllm:1:...")]

Turn 2 — pass back the full response:

resp2 = client.responses.create(
    model="gpt-4o",
    input="What is my name?",
    store=False,
    previous_response=resp1,          # full object, not previous_response_id
    include=["reasoning.encrypted_content"],
)
# → "Your name is Alice."

Multi-node deployment:

export VLLM_RESPONSES_STATE_SIGNING_KEY="$(openssl rand -hex 32)"
# Set the same value on all vLLM nodes so tokens validate across replicas

Backward Compatibility

  • No breaking changes. Existing previous_response_id + VLLM_ENABLE_RESPONSES_API_STORE=1 path is completely unchanged.
  • New stateless path requires explicit opt-in via store=false + include=["reasoning.encrypted_content"] + previous_response.
  • previous_response_id without store now returns a helpful 400 error pointing users to the stateless path.

Known Limitations / Follow-up

  • Streaming not yet covered. The state carrier is only injected in responses_full_generator. Streaming (responses_stream_generator) needs ResponseOutputItemAddedEvent + ResponseOutputItemDoneEvent events for the carrier before ResponseCompletedEvent. Left as a follow-up to keep this PR reviewable.
  • No encryption. Content is signed (HMAC) but not encrypted. This mirrors existing encrypted_content usage (the field name is from the OpenAI spec; vLLM reuses it for opaque state). Actual encryption is out of scope per the RFC.
  • In-memory msg_store / response_store hacks not removed. This PR adds the stateless path without deleting the existing store-enabled path. Cleanup can follow once stateless is validated.

Reviewer Notes

Key design decision open for discussion: single state carrier vs. per-item state. The current implementation appends one synthetic ReasoningItem carrying the full Harmony history. @alecsolder raised in #26934 that tool-call metadata may not be expressible in per-reasoning-item form; this design sidesteps that by carrying the complete history in one item.

cc @qandrew @grs @WoosukKwon @njhill @Simon-Laux

garrio-1 and others added 3 commits February 26, 2026 15:04
Implements the @grs proposal for stateless multi-turn Responses API
conversations without server-side storage, using the standard OpenAI
`encrypted_content` field on a synthetic `ResponseReasoningItem` as the
state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob
   (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic
   `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response
   object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier
   item — no in-memory store touched

**No breaking changes.** Existing `previous_response_id` + store-enabled
path is unchanged. New path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same
64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise /
  deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add
  `previous_response` field + mutual-exclusion validator on
  `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response
  resolution; thread `prev_messages` through `_make_request*`; inject
  state carrier in `responses_full_generator`; 501 guards on
  `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier
  `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) —
  14 unit tests

Closes #26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mergify bot commented Mar 2, 2026

Documentation preview: https://vllm--35701.org.readthedocs.build/en/35701/

@mergify mergify bot added documentation Improvements or additions to documentation frontend labels Mar 2, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed feature for stateless multi-turn conversations in the Responses API, effectively addressing memory leak and multi-node deployment issues. The implementation, including the new state serialization module and protocol extensions, is robust and well-tested. My review identifies one high-severity issue in the new error handling for the cancel_responses endpoint in stateless mode, where the logic for task-not-found cases is incorrect and the returned HTTP status code is misleading. I've provided a detailed suggestion to improve the correctness and clarity of the API behavior in this scenario.

Comment on lines +1319 to +1339
    if not self.enable_store:
        # Stateless mode: cancel in-flight tasks by ID only (no stored response).
        if task := self.background_tasks.get(response_id):
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                logger.exception(
                    "Background task for %s was cancelled (stateless mode)",
                    response_id,
                )
        return self.create_error_response(
            err_type="invalid_request_error",
            message=(
                "Response cancellation in stateless mode cancelled any "
                "in-flight task, but no stored response object is available. "
                "Enable VLLM_ENABLE_RESPONSES_API_STORE=1 for full cancel "
                "support."
            ),
            status_code=HTTPStatus.NOT_IMPLEMENTED,
        )

Severity: high

The current implementation for cancelling responses in stateless mode has a few issues:

  1. Incorrect Error on Not Found: If response_id does not correspond to an active background task, the function returns a 501 Not Implemented error. It should return a 404 Not Found, consistent with the stateful path's behavior.
  2. Misleading Status Code: Returning HTTPStatus.NOT_IMPLEMENTED (501) is misleading when a cancellation is successfully initiated. A 400 Bad Request would be more appropriate to signal that this is a client error (using a feature not fully supported in stateless mode), while the message can clarify that the cancellation was attempted.
  3. Logging: logger.exception is used for asyncio.CancelledError. Since cancellation is the expected outcome of this operation, logger.info or logger.debug would be more suitable.

I suggest refactoring this block to handle these cases correctly.

        if not self.enable_store:
            # Stateless mode: cancel in-flight tasks by ID only (no stored response).
            task = self.background_tasks.get(response_id)
            if not task:
                # Mimic the stateful path's behavior of returning a 404 if the
                # response is not in the store.
                return self._make_not_found_error(response_id)

            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                logger.info(
                    "Background task for %s was cancelled (stateless mode)",
                    response_id,
                )

            # The task was cancelled, but we cannot return a full response object
            # as required by the API spec because no state is stored.
            # Return an error response indicating this limitation.
            return self.create_error_response(
                err_type="invalid_request_error",
                message=(
                    "Response cancellation was initiated for the in-flight task, "
                    "but a full response object cannot be returned in stateless mode. "
                    "Enable VLLM_ENABLE_RESPONSES_API_STORE=1 for full cancel "
                    "support."
                ),
                status_code=HTTPStatus.BAD_REQUEST,
            )

@will-deines will-deines deleted the feat/stateless-responses-encrypted-content branch March 2, 2026 03:14

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 010a69cea9


            for m in prev_messages
        ]
    else:
        prev_msgs = self.msg_store[prev_response.id]

P1: Guard Harmony fallback when stateless carrier is missing

In Harmony mode, if a client sends previous_response but the prior response does not include the synthetic state carrier (for example they omitted include=['reasoning.encrypted_content']), _extract_state_from_response returns None and this branch falls back to self.msg_store[prev_response.id]. With store disabled, that key is absent and raises KeyError, and create_responses does not catch KeyError, so this becomes a 500 instead of a user-actionable 400. Please validate that previous_response actually contains a carrier before indexing msg_store (or use a safe lookup and explicit error).
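A defensive version of that fallback might look like the following sketch; the function and parameter names are assumed for illustration, not taken from the PR's actual serving.py:

```python
def resolve_prev_messages(prev_response_id, carrier_messages, msg_store):
    """Prefer the stateless carrier; fall back to the in-memory store only
    when the entry actually exists, so a missing carrier surfaces as a
    user-actionable error rather than an unhandled KeyError (HTTP 500)."""
    if carrier_messages is not None:
        return carrier_messages
    msgs = msg_store.get(prev_response_id)  # safe lookup, no KeyError
    if msgs is None:
        raise ValueError(
            "previous_response contains no state carrier and no stored "
            "messages; resend with include=['reasoning.encrypted_content']"
        )
    return msgs
```

The caller would translate the ValueError into a 400 response instead of letting a KeyError propagate as a 500.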


Comment on lines +396 to +399
    if self.previous_response is not None and self.store:
        # Stateless path: store is meaningless (and would cause confusion).
        # Mirror the silent-disable behavior in create_responses for store=True.
        self.store = False

P2: Preserve background/store invariant after validation

This mode='after' validator mutates self.store to False when previous_response is set, but background is validated earlier in mode='before' against the original input. A request like background=true with previous_response (and default store=true) can therefore pass validation and end up as background=true, store=false, which then bypasses store-disabled guards and creates a background response that cannot be retrieved/cancelled when store is off (those endpoints now return 501). Re-validate background after this mutation or avoid mutating store here.
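One way to restore the invariant is to re-check background after the store override, sketched here as a plain function (the real code uses a pydantic model_validator; all names and the exact error message are assumed):

```python
def finalize_request(previous_response, store: bool, background: bool) -> bool:
    """Apply the stateless store override, then re-validate background so a
    request cannot end up background=True with store=False, which would be
    unretrievable/uncancellable when the store is disabled."""
    if previous_response is not None and store:
        store = False  # stateless path: the store is meaningless here
    if background and not store:
        raise ValueError(
            "background=true requires store=true and is incompatible with "
            "the stateless previous_response path"
        )
    return store
```

The alternative the comment mentions (not mutating store at all and rejecting the combination outright) avoids the ordering problem entirely.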

