
feat(responses): stateless multi-turn via encrypted_content state carrier (RFC #26934)#35740

Closed
will-deines wants to merge 6 commits intovllm-project:mainfrom
will-deines:feat/stateless-responses-encrypted-content

Conversation

@will-deines will-deines commented Mar 2, 2026

Related Issues, PRs, and RFCs

Directly Addressed by This PR

| # | Title | Status | How This PR Relates |
| --- | --- | --- | --- |
| #26934 | [RFC] Separating State & Providing Flexibility for serving ResponsesAPI | Open | Primary motivation — implements the stateless inference path requested by @qandrew (Meta) and proposed by @grs |
| #33089 | [Feature] Support multi-turn conversation for OpenAI Response API | Open | Direct fix — users with OpenCode/Codex CLI agents fail on turn 2 because `previous_response_id` requires `VLLM_ENABLE_RESPONSES_API_STORE=1`; this PR provides a store-free alternative |
| #34738 | Fix memory leak in Responses API store (LRU eviction) | Open | Superseded — stateless `encrypted_content` avoids server-side storage entirely, eliminating the memory leak's root cause rather than bounding it |

Related RFCs

| # | Title | Status | Design Decision |
| --- | --- | --- | --- |
| #32850 | [RFC] Clarify policy for Open Responses API extensions in vLLM | Open | We reuse the existing `encrypted_content` field from the OpenAI spec (no new response-side protocol extensions), aligning with the conservative extension policy |
| #33381 | [RFC] Align with the openresponses.org spec | Open | The response wire format adds zero new fields: `encrypted_content` is an existing OpenAI spec field. The only addition is the purely additive `previous_response` request field (see decision 5 below) |

Companion PR

| # | Title | Status | Relevance |
| --- | --- | --- | --- |
| #35874 | feat(responses): pluggable ResponseStore abstraction | Open | Depends on this PR — extracts the in-memory store into a pluggable ABC; together they serve RFC #26934's two personas (stateless for production, pluggable store for researchers/small-scale) |

Decisions We Made That Can Be Debated

1. HMAC signing vs. actual encryption of conversation history

What we chose: The state carrier uses HMAC-SHA256 for tamper detection. The conversation history is base64-encoded but not encrypted — anyone who intercepts the encrypted_content field can decode and read it.

Alternative: Use authenticated encryption (AES-GCM or similar) so the history is both tamper-proof and confidential. The field is called encrypted_content after all.

Why we chose this: The field name comes from the OpenAI spec, not from us — we use it as an opaque signed blob consistent with the spec's intent. Real encryption adds key management complexity (key rotation, IV generation, padding) that isn't justified for the initial implementation. The content travels over TLS between client and server, so in-transit confidentiality is already handled. The RFC (#26934) scoped encryption as out-of-scope for the initial implementation.

What reviewers might disagree with: Users may assume encrypted_content means encrypted. If the response is logged, cached, or stored client-side, the conversation history is readable. A follow-up could add optional encryption behind a flag.
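To make this caveat concrete: the payload segment of a carrier is plain base64, readable by anyone holding the string. A stdlib-only sketch (the carrier value below is fabricated for illustration; the trailing HMAC tag is a dummy, since reading requires no verification):

```python
import base64
import json

# Fabricated carrier in the PR's vllm:1:<base64(json(messages))>:<hmac> shape.
history = [{"role": "user", "content": "My name is Alice"}]
carrier = ("vllm:1:"
           + base64.b64encode(json.dumps(history).encode()).decode()
           + ":" + "0" * 64)

# No key is needed to *read* the history; the HMAC only detects tampering:
payload = carrier.split(":")[2]
print(json.loads(base64.b64decode(payload)))
# → [{'role': 'user', 'content': 'My name is Alice'}]
```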

2. Single carrier (full Harmony history) vs. per-item state

What we chose: One synthetic ReasoningItem carries the full Harmony message history as a single signed blob. The carrier is appended to the response output and filtered out in utils.py before messages reach the LLM.

Alternative: Attach state to each reasoning item individually — each ReasoningItem carries its own context, and the history is reconstructed by collecting all items.

Why we chose this: Tool-call metadata (raised by @alecsolder in #26934) — tool call results and assistant metadata aren't expressible in per-item form but are captured in the Harmony message list. A single carrier is also simpler to implement, verify, and debug. The full message list is what the model actually needs on the next turn.

What reviewers might disagree with: The carrier grows linearly with conversation length. For very long conversations, this could become large. A per-item approach or delta-based encoding could be more efficient, but adds significant complexity.

3. Reuse encrypted_content field vs. a new vLLM-specific field

What we chose: We reuse the existing encrypted_content field on ResponseReasoningItem from the OpenAI spec, and the existing previous_response field on the request. Zero new wire protocol fields.

Alternative: Add a vLLM-specific field (e.g., vllm_state_carrier) or use the OpenResponses extension mechanism proposed in #33381.

Why we chose this: RFCs #32850 and #33381 both emphasize conservative extensions and alignment with existing specs. encrypted_content is already defined as an opaque blob for platform use — our usage is consistent with that intent. Clients using the standard OpenAI SDK can use this feature without any SDK modifications. @DanielMe's proposal in #32850 supports allowing extensions "when there is a documented need" following existing patterns.

4. Per-process random key (default) vs. requiring explicit key configuration

What we chose: When VLLM_RESPONSES_STATE_SIGNING_KEY is not set, a per-process random key is generated with a warning. This means stateless multi-turn works out of the box for single-node dev, but carriers are invalidated on restart and incompatible across nodes.

Alternative: Require the env var and fail hard if it's not set — forcing users to think about key management upfront.

Why we chose this: The zero-config default matches vLLM's general philosophy — things should work out of the box for the common single-node case. The warning makes the limitation visible. Production deployments with multi-node or restart requirements will naturally need to set the key, and the error messages guide them there.

What reviewers might disagree with: Silent key generation could lead to hard-to-debug issues in production (e.g., a rolling restart invalidates all in-flight carriers). A louder failure mode might be safer.

5. New previous_response field (full object) vs. extending previous_response_id for stateless

What we chose: A new vLLM-specific previous_response field that accepts the full ResponsesResponse object, with a model_validator enforcing mutual exclusion with previous_response_id. This is a protocol extension — the OpenAI spec only defines previous_response_id (a string for server-side lookup).

Alternative: Overload previous_response_id with a special sentinel or encoded value, or embed the carrier in a separate field alongside previous_response_id.

Why we chose this: previous_response_id implies a server-side lookup by design — overloading it for stateless use would be confusing. A separate field makes the two paths self-documenting: previous_response_id means "look it up in the store," previous_response means "here's the full response, no store needed." The mutual exclusion validator makes it impossible to mix the two.

What reviewers might disagree with: This is a vLLM-specific extension to the wire protocol, which #32850 and #33381 argue should be conservative. However, the field is purely additive (existing clients are unaffected) and the alternative — overloading previous_response_id — would be more confusing. If OpenResponses or the OpenAI spec adds a similar field in the future, we can align with it.

6. Stateless path as additive (keep the store) vs. replacing it

What we chose: The stateless path is purely additive. The existing previous_response_id + VLLM_ENABLE_RESPONSES_API_STORE=1 path is completely unchanged. Both paths coexist.

Alternative: Remove the in-memory store entirely and force all multi-turn through the stateless path.

Why we chose this: RFC #26934 identifies two user personas — production deployments that want stateless operation, and researchers/small-scale users who want an all-in-one server with state. Removing the store would break the second persona. The companion PR (#35874) makes the store pluggable for production use cases that need server-side state (e.g., background mode, retrieve_responses).

What reviewers might disagree with: Keeping both paths means more code to maintain and test. If the stateless path proves sufficient for most use cases, the store path could be deprecated in a future PR.


Summary

Implements stateless multi-turn Responses API conversations without server-side storage, using the existing encrypted_content field on ResponseReasoningItem as the state carrier. Proposed by @grs in #26934.

The three in-process dicts (response_store, msg_store, event_store) marked # HACK / # FIXME in serving.py are disabled by default because they leak memory, lose state on restart, and are incompatible with multi-node deployments. This PR provides a production-ready alternative for the multi-turn use case.


Design

```
Turn 1:  client → vLLM   { store: false, include: ["reasoning.encrypted_content"], input: "..." }
         vLLM   → client { output: [...real items..., ReasoningItem(encrypted_content="vllm:1:<b64>:<hmac>")] }

Turn 2:  client → vLLM   { store: false, previous_response: <full Turn 1 response>, input: "..." }
         vLLM extracts encrypted_content → verifies HMAC → deserialises Harmony history → no store touched
```

Wire format: `vllm:1:<base64(json(messages))>:<hmac-sha256-hex>`

- Content is signed, not encrypted (HMAC-SHA256). The field name comes from the OpenAI spec; vLLM uses it as an opaque signed blob for tamper detection, consistent with the spec's intent.
- `VLLM_RESPONSES_STATE_SIGNING_KEY` (64-char hex) enables multi-node / restart-safe operation. Without it, a per-process random key is generated with a warning.
- The state carrier `ReasoningItem` is filtered out in `utils.py` before messages reach the LLM — invisible to the model.
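The wire format round-trips with just the standard library. This is an illustrative sketch: the function names mirror the PR's `serialize_state` / `deserialize_state`, but the bodies are assumptions, not the actual implementation:

```python
import base64
import hashlib
import hmac
import json

def serialize_state(messages: list[dict], key: bytes) -> str:
    """Pack the Harmony history into vllm:1:<base64(json)>:<hmac-sha256-hex>."""
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{payload}:{tag}"

def deserialize_state(carrier: str, key: bytes) -> list[dict]:
    """Verify the HMAC and recover the message list; raise on tampering."""
    scheme, version, payload, tag = carrier.split(":")  # base64 never contains ':'
    if scheme != "vllm" or version != "1":
        raise ValueError("unrecognized state carrier format")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison, as in the PR, to resist timing attacks.
    if not hmac.compare_digest(expected, tag):
        raise ValueError("state carrier failed HMAC verification")
    return json.loads(base64.b64decode(payload))

key = b"k" * 32
history = [{"role": "user", "content": "My name is Alice"}]
assert deserialize_state(serialize_state(history, key), key) == history
```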

Files Changed

| File | Change |
| --- | --- |
| `vllm/entrypoints/openai/responses/state.py` | NEW — `serialize_state` / `deserialize_state` / `is_state_carrier` / HMAC helpers |
| `vllm/entrypoints/openai/responses/protocol.py` | Add `previous_response: ResponsesResponse \| None` to `ResponsesRequest`; mutual-exclusion `model_validator` with `previous_response_id`; reject `background=True` with `previous_response`; `model_rebuild()` for forward ref |
| `vllm/entrypoints/openai/responses/serving.py` | Stateless prev-response resolution in `create_responses`; thread `prev_messages` through `_make_request` / `_make_request_with_harmony` / `_construct_input_messages_with_harmony`; inject state carrier in `responses_full_generator`; 400 guard when carrier missing from `previous_response`; 501 on `retrieve_responses` when store disabled; 404/400 on `cancel_responses` when store disabled |
| `vllm/entrypoints/openai/responses/utils.py` | `_construct_single_message_from_response_item` returns `None` for state-carrier items; filter `None` in `construct_chat_messages_with_tool_call` |
| `vllm/envs.py` | Register `VLLM_RESPONSES_STATE_SIGNING_KEY` |
| `tests/entrypoints/openai/responses/test_state.py` | NEW — 16 unit tests (round-trip, tamper detection, cross-key incompatibility, invalid hex, random key caching) |
| `tests/entrypoints/openai/responses/test_serving_stateless.py` | NEW — 15 unit tests (protocol validation, state carrier helpers, all error paths, utils skipping, cancel success path) |
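The `utils.py` change follows a return-None-then-filter pattern so the synthetic carrier never reaches the model. A simplified illustration (the item shapes and the `rs_state_` id prefix follow the example output later in this description; this is not the PR's code):

```python
def is_state_carrier(item: dict) -> bool:
    """Heuristic from the example output: carriers are reasoning items with an rs_state_ id."""
    return (item.get("type") == "reasoning"
            and str(item.get("id", "")).startswith("rs_state_"))

def construct_chat_messages(output_items: list[dict]) -> list[dict]:
    """Convert response output items to chat messages, dropping carrier items."""
    messages = []
    for item in output_items:
        if is_state_carrier(item):
            continue  # synthetic state carrier: invisible to the model
        messages.append({"role": "assistant", "content": item.get("content", "")})
    return messages

out = [
    {"type": "message", "id": "msg_1", "content": "Hello Alice"},
    {"type": "reasoning", "id": "rs_state_abc", "content": "vllm:1:..."},
]
assert construct_chat_messages(out) == [{"role": "assistant", "content": "Hello Alice"}]
```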

Test Results

```
tests/entrypoints/openai/responses/test_state.py             16/16 passed
tests/entrypoints/openai/responses/test_serving_stateless.py 15/15 passed
```

All pure-Python unit tests — no GPU required.


Usage

Turn 1:

```python
resp1 = client.responses.create(
    model="...",
    input="My name is Alice",
    store=False,
    include=["reasoning.encrypted_content"],
)
# resp1.output[-1] == ReasoningItem(encrypted_content="vllm:1:...", id="rs_state_...")
```

Turn 2:

```python
resp2 = client.responses.create(
    model="...",
    input="What is my name?",
    store=False,
    previous_response=resp1,   # full response object, not previous_response_id
    include=["reasoning.encrypted_content"],
)
# → "Your name is Alice."
```

Multi-node / restart-safe:

```shell
export VLLM_RESPONSES_STATE_SIGNING_KEY="$(openssl rand -hex 32)"
# same value across all vLLM nodes
```

Backward Compatibility

No breaking changes. The existing previous_response_id + VLLM_ENABLE_RESPONSES_API_STORE=1 path is completely unchanged. The new path requires explicit opt-in (store=false + include=["reasoning.encrypted_content"] + previous_response on turn 2).

previous_response_id with store disabled now returns a helpful 400 pointing users to the stateless path.


Test plan

  • Unit tests pass: pytest tests/entrypoints/openai/responses/test_state.py tests/entrypoints/openai/responses/test_serving_stateless.py -v
  • Pre-commit passes on all changed files
  • E2e with GPT-OSS model: verify stateless multi-turn produces correct conversation continuity
  • E2e with multi-node: verify VLLM_RESPONSES_STATE_SIGNING_KEY enables cross-node carrier compatibility

cc @qandrew @WoosukKwon @njhill @DanielMe @chaunceyjiang

garrio-1 and others added 3 commits March 2, 2026 07:00
Implements the @grs proposal for stateless multi-turn Responses API
conversations without server-side storage, using the standard OpenAI
`encrypted_content` field on a synthetic `ResponseReasoningItem` as the
state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob
   (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic
   `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response
   object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier
   item — no in-memory store touched

**No breaking changes.** Existing `previous_response_id` + store-enabled
path is unchanged. New path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same
64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise /
  deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add
  `previous_response` field + mutual-exclusion validator on
  `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response
  resolution; thread `prev_messages` through `_make_request*`; inject
  state carrier in `responses_full_generator`; 501 guards on
  `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier
  `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) —
  14 unit tests

Closes #26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task,
  consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found
  and cancelled — cancellation succeeded, but no stored response object
  can be returned; 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError,
  since cancellation is the expected outcome of this call path

Update test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arrier guard, background invariant)

Fixes three issues found in review:

1. cancel_responses stateless mode (gemini-code-assist P1):
   - Return 404 (not 501) for unknown response_id — consistent with stateful path
   - Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
     cancelled but no stored response is available; 501 was misleading
   - Use logger.info (not logger.exception) for expected CancelledError

2. Missing carrier guard in create_responses (codex P1):
   - When previous_response has no state carrier and store is disabled,
     return 400 with a clear message instead of falling through to
     msg_store[id] KeyError → 500

3. background/store invariant in protocol validator (codex P2):
   - Reject background=True + previous_response at validation time rather
     than silently producing an unretrievable background response

Tests:
   - Add test_cancel_without_store_active_task_returns_400: covers the
     success branch of the cancel fix; uses await asyncio.sleep(0) to
     start the task before cancelling (Python 3.12: unstarted tasks
     cancelled before first await never run their body)
   - Add test_previous_response_without_carrier_returns_400: regression
     for the KeyError → 500 bug
   - Add test_background_with_previous_response_raises: regression for
     the background/store invariant
   - Remove test_no_previous_response_preserves_store_true: passed
     regardless of our code (no new path exercised)
   - Remove test_full_stateless_roundtrip: duplicate of
     test_build_and_extract_roundtrip
   - Rename test_prev_messages_used_over_empty_msg_store →
     test_construct_input_messages_prepends_prev_msg (accurate name)

All 31 tests pass.
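The cancellation subtlety noted in that commit message can be demonstrated in a few lines: a task cancelled before its first chance to run never executes its body, so a test must yield once (`await asyncio.sleep(0)`) before cancelling. A standalone sketch, not the PR's test code:

```python
import asyncio

async def main() -> tuple[bool, bool]:
    started: list[bool] = []

    async def work() -> None:
        started.append(True)  # records that the body actually ran
        await asyncio.sleep(10)

    # Cancel before the event loop ever schedules the task: body never runs.
    t1 = asyncio.create_task(work())
    t1.cancel()
    try:
        await t1
    except asyncio.CancelledError:
        pass
    ran_unscheduled = bool(started)

    started.clear()
    # Yield control once so the task starts, then cancel mid-await.
    t2 = asyncio.create_task(work())
    await asyncio.sleep(0)
    t2.cancel()
    try:
        await t2
    except asyncio.CancelledError:
        pass
    ran_after_yield = bool(started)
    return ran_unscheduled, ran_after_yield

print(asyncio.run(main()))  # → (False, True)
```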
github-actions bot commented Mar 2, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the frontend label Mar 2, 2026
@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces a significant and well-designed feature for stateless multi-turn conversations in the Responses API, using a signed state carrier in the encrypted_content field. The implementation is thorough, with comprehensive new tests and careful consideration for security aspects like using hmac.compare_digest to prevent timing attacks. The changes are well-structured across new and existing modules.

I found one critical issue in the implementation for the non-Harmony code path, where the state carrier was being generated without including the assistant's response from the current turn. This would cause the conversation history to be incomplete for subsequent turns. I've provided a detailed comment and a suggested fix for this.

@mergify
Copy link
Copy Markdown

mergify bot commented Mar 2, 2026

Hi @will-deines, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d1be94bcb


…ey validation

Gemini critical — non-Harmony state carrier missing the assistant turn:
  carrier_messages was only the input messages, omitting the assistant
  response just generated. The next turn would see history without the
  last assistant message. Fix: append
  construct_chat_messages_with_tool_call(response.output) so the
  carrier contains the full turn (input + response).

Codex P1 — carrierless previous_response check gated on enable_store:
  The guard 'if prev_messages_from_state is None and not self.enable_store'
  was too narrow. previous_response always means stateless path; a
  server restart with enable_store=True and empty msg_store would still
  KeyError. Fix: drop the 'not self.enable_store' condition.

Codex P2 — any-length hex key accepted as signing key:
  bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
  len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.

Tests:
  - test_previous_response_without_carrier_store_enabled_returns_400:
    regression for P1 (store=True path also returns 400, not KeyError)
  - test_short_key_raises: regression for P2 (4-byte key raises)

Run pre-commit --all-files; apply linter reformatting.

Signed-off-by: Will Deines <will@garr.io>

```python
# history from the encrypted_content state carrier embedded in the response
# output, so no server-side store is required.
# Cannot be set together with previous_response_id.
previous_response: "ResponsesResponse | None" = None
```
Collaborator:
Is this field one of the standard fields in the Open Response API?

Author:

No, previous_response is a vLLM extension — the standard OpenAI API only has previous_response_id (a string ID that requires server-side storage).

This extension is needed because previous_response_id depends on vLLM's in-process store (the response_store / msg_store / event_store dicts marked # HACK / # FIXME), which leaks memory (#34738), loses state on restart, and is incompatible with production multi-node deployments (RFC #26934, raised by @qandrew at Meta).

The approach follows @grs's proposal in #26934: the client sends back the full response object; vLLM extracts the signed state from the existing encrypted_content field on a ReasoningItem — so the wire-format response itself uses zero new fields.

This is consistent with vLLM's existing extension policy (RFC #32850) — the Responses API already ships non-standard request fields (top_k, request_id, priority, etc.), and the Open Responses spec explicitly allows implementation extensions.

Mutual exclusion with previous_response_id is enforced via a Pydantic model_validator.

@will-deines will-deines closed this Mar 3, 2026
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 3, 2026
Extract the three in-memory dicts (response_store, msg_store,
response_store_lock) from OpenAIServingResponses into a pluggable
ResponseStore ABC with an InMemoryResponseStore default.

Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified
class name to swap in their own backend (Redis, Postgres, etc.)
without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with
  internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934
(pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 4, 2026
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 4, 2026
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 12, 2026
will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 18, 2026