
feat(responses): pluggable ResponseStore abstraction#35900

Closed
will-deines wants to merge 7 commits into vllm-project:main from will-deines:feat/pluggable-response-store

Conversation

@will-deines

Recreated from #35874, which was closed when the fork was temporarily made private.

Depends on #35740 — review that first. Diff will clean up automatically once it merges.

Summary

Extracts the in-memory response/message dicts from OpenAIServingResponses into a pluggable ResponseStore ABC, addressing RFC #26934.

  • New ResponseStore ABC (vllm/entrypoints/openai/responses/store.py) with 5 async methods for get/put responses, get/put messages, and atomic status transitions
  • InMemoryResponseStore default implementation wrapping current dict behavior with internal asyncio.Lock (removes external response_store_lock)
  • VLLM_RESPONSES_STORE_BACKEND env var — point to a fully-qualified class name (e.g. mypackage.redis_store.RedisResponseStore) to swap backends
  • Refactored serving.py — ~15 call sites updated from direct dict access to self.store.* async calls
  • Atomic conditional writes (put_response(..., unless_status="cancelled")) replace get-check-put-under-lock patterns
  • Atomic status transitions (update_response_status(..., allowed_current_statuses=...)) replace get-check-mutate patterns

What stays local (not pluggable)

  • event_store and background_tasks — these use asyncio.Event/asyncio.Task which are inherently process-local

Decisions we made

This section explains the key design choices and how they trace back to the RFCs and discussions.

1. Pluggable ABC + factory, not a simple on/off toggle

RFC #26934 originally proposed VLLM_ENABLE_RESPONSES_API_STORE as a boolean toggle — either vLLM owns state or it doesn't. We chose a pluggable ABC with a factory instead because the RFC discussion converged on the principle that "vLLM should not be opinionated on whether it owns the state management layer" (@qandrew, comment). A toggle only gives two options; an ABC lets each deployment choose the backend that fits its topology — in-memory for dev, Redis/Postgres/DynamoDB for production — without vLLM shipping or maintaining any of those external integrations.

2. Env var with fully-qualified class name (VLLM_RESPONSES_STORE_BACKEND)

RFC #32850 discusses vLLM's policy for Responses API extensions. The agreed direction is to "allow extensions when there is a documented need" and follow existing vLLM patterns (@DanielMe's proposal). vLLM already uses this pattern for other pluggable components (e.g. resolve_obj_by_qualname). An env var keeps the interface zero-config for the common case (in-memory default) while enabling production deployments to swap backends without forking.
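The factory pattern this describes can be sketched as follows. The env var name is from this PR; the factory and default-class names are assumptions, and plain `importlib` is shown where vLLM itself would use its existing `resolve_obj_by_qualname` helper.

```python
# Sketch of a qualname-based backend factory. The env var name is from the
# PR; create_response_store and InMemoryResponseStore are assumed names.
import importlib
import os


class InMemoryResponseStore:
    """Placeholder for the default backend; the real one wraps the
    current dicts behind an internal asyncio.Lock."""


def create_response_store():
    """Instantiate the backend named by VLLM_RESPONSES_STORE_BACKEND,
    falling back to the in-memory default when unset."""
    qualname = os.environ.get("VLLM_RESPONSES_STORE_BACKEND")
    if not qualname:
        return InMemoryResponseStore()
    module_name, _, class_name = qualname.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()
```

The zero-config property falls out naturally: with the env var unset, nothing is imported and the in-memory default is used; setting it to `mypackage.redis_store.RedisResponseStore` swaps the backend without patching vLLM.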

3. Atomic operations in the store interface, not external locks

The original serving.py used an external response_store_lock around get-check-put sequences on plain dicts. This pattern cannot work with external stores — a Redis or Postgres backend can't share a Python asyncio.Lock. We pushed atomicity into the ABC itself:

  • put_response(..., unless_status="cancelled") — conditional write that skips if the response is already cancelled (replaces the cancel-race-condition guard)
  • update_response_status(..., allowed_current_statuses=...) — atomic CAS-style status transition (replaces get-check-mutate patterns)

This lets external implementations use their native atomicity primitives (Redis SETNX, Postgres UPDATE ... WHERE status IN (...), DynamoDB conditional expressions).
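For the in-memory default, "atomic" simply means holding the internal lock across the whole read-check-write. A minimal sketch, assuming the class internals described above (the `_responses` dict and `status` attribute are assumptions):

```python
# Sketch of the two atomic operations in the in-memory default.
# Internals (_responses, _lock, response.status) are assumptions.
import asyncio


class InMemoryResponseStore:
    def __init__(self):
        self._responses: dict[str, object] = {}
        self._lock = asyncio.Lock()

    async def put_response(self, response_id, response, *,
                           unless_status=None) -> bool:
        async with self._lock:
            existing = self._responses.get(response_id)
            if (unless_status is not None and existing is not None
                    and getattr(existing, "status", None) == unless_status):
                return False  # e.g. already cancelled: skip the write
            self._responses[response_id] = response
            return True

    async def update_response_status(self, response_id, new_status, *,
                                     allowed_current_statuses=None) -> bool:
        async with self._lock:
            response = self._responses.get(response_id)
            if response is None:
                return False
            if (allowed_current_statuses is not None
                    and response.status not in allowed_current_statuses):
                return False  # CAS failed: another caller transitioned first
            response.status = new_status
            return True
```

An external backend implements the same contract with its native primitive instead of a Python lock, which is exactly why the atomicity has to live behind the interface.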

4. Pluggable store supersedes LRU eviction (#34738)

PR #34738 proposed OrderedDict + LRU eviction to bound memory. @qandrew's response questioned the production motivation and pointed to #26934 as the longer-term direction. @lisperz agreed that in-memory storage "doesn't survive restarts and doesn't scale across multiple API server instances." Rather than shipping a bounded-but-still-in-memory solution, the pluggable store lets users choose their own eviction/persistence/replication strategy. The in-memory default intentionally ships without LRU — adding it later as a separate concern is straightforward if needed.

5. event_store and background_tasks stay process-local

These use asyncio.Event and asyncio.Task which are inherently bound to a single event loop. Externalizing them would require a fundamentally different design (distributed task queues, pub/sub for event notification). The RFC's priority is decoupling response/message persistence — the process-local coordination primitives are a separate concern for a follow-up if needed.

6. No new wire protocol fields

RFCs #32850 and #33381 both emphasize alignment with the OpenResponses spec and conservative extensions. This PR introduces zero new fields on the request or response wire format. The VLLM_RESPONSES_STORE_BACKEND env var is a server-side configuration knob, not a protocol extension. Clients are completely unaware of which backend is in use.

7. Complementary to the stateless path (#35740), not a replacement

RFC #26934 identifies two user personas: large-scale production (stateless, external state management) and researchers/small-scale (stateful, all-in-one). PR #35740 serves persona 1 with the encrypted_content state carrier. This PR serves both personas by making the store-backed path production-ready: persona 2 gets the in-memory default that just works, and persona 1 can either go fully stateless (#35740) or plug in their own persistent store for features that need server-side state (e.g. background mode, retrieve_responses).

8. Concurrency ownership moves into the store

The external response_store_lock in serving.py was a code smell — it coupled the serving layer to the implementation detail that the store was a plain dict. Each store implementation now owns its own concurrency strategy: InMemoryResponseStore uses an asyncio.Lock, a Redis implementation might use optimistic locking or WATCH/MULTI, a Postgres implementation might use row-level locks. This is a cleaner separation of concerns.

Related Issues & RFCs

| # | Title | Relevance |
|---|-------|-----------|
| #26934 | [RFC] Separating State & Providing Flexibility for serving ResponsesAPI | Primary motivation — proposes decoupling state management from vLLM; pluggable backends for production |
| #35740 | feat(responses): stateless multi-turn via encrypted_content state carrier | Predecessor PR — implements the stateless path; this PR adds the pluggable interface for the store-backed path |
| #33089 | [Feature] Support multi-turn conversation for OpenAI Response API | Direct fix — previous_response_id requires a store; a pluggable backend enables production-grade multi-turn |
| #34738 | Fix memory leak in Responses API store (LRU eviction) | Superseded — LRU eviction is a band-aid; the pluggable store lets users choose their own eviction/persistence strategy |
| #32850 | [RFC] Clarify policy for Open Responses API extensions in vLLM | Extension policy — the VLLM_RESPONSES_STORE_BACKEND env var is a vLLM extension; aligns with existing extension precedent |
| #33381 | [RFC] Align with the openresponses.org spec | Conformance — pluggable store supports standard state formats; no new wire protocol fields |

Context from RFC #26934 (Meta)

The RFC argues that vLLM should not be opinionated about storage — users need to plug in their own backend depending on deployment topology (single-node dev, multi-node production, managed service). The current in-memory dicts (response_store, msg_store), marked # HACK / # FIXME in serving.py, have three problems:

  • Leak memory unboundedly
  • Lose all state on restart
  • Are incompatible with multi-node / load-balanced deployments

This PR provides the interface layer. vLLM ships only the in-memory default — users bring their own Redis/Postgres/DynamoDB implementation by setting one env var. No external dependencies added.
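Setting that env var is the entire integration surface. In a deployment this might look like the following; the module path and class are hypothetical placeholders for code you write yourself, and only the env var name comes from this PR:

```shell
# Hypothetical: mypackage.redis_store.RedisResponseStore is your own
# ResponseStore subclass, importable on the server's PYTHONPATH.
# Only the env var name (VLLM_RESPONSES_STORE_BACKEND) is from this PR.
export VLLM_RESPONSES_STORE_BACKEND=mypackage.redis_store.RedisResponseStore
# then start the server as usual, e.g.:
# vllm serve <model>
```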

Why not just LRU eviction (#34738)?

LRU eviction (proposed in #34738) is a band-aid that still loses state on restart and doesn't work across nodes. The pluggable store lets users choose their own eviction/persistence/replication strategy appropriate to their deployment.

Test plan

  • pytest tests/entrypoints/openai/responses/test_store.py -v — unit + integration tests for store ABC, InMemoryResponseStore, factory, and serving.py integration (cancel/retrieve)
  • pytest tests/entrypoints/openai/responses/test_serving_stateless.py -v — existing stateless tests pass with refactored store
  • pre-commit run --all-files — linting passes
  • E2e tests against running server (local_test/ harness, gitignored)

garrio-1 and others added 7 commits March 2, 2026 07:00
…-project#26934)

Implements the @grs proposal for stateless multi-turn Responses API
conversations without server-side storage, using the standard OpenAI
`encrypted_content` field on a synthetic `ResponseReasoningItem` as the
state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob
   (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic
   `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response
   object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier
   item — no in-memory store touched

**No breaking changes.** Existing `previous_response_id` + store-enabled
path is unchanged. New path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same
64-char hex value on all nodes so tokens validate across replicas.
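The token format named above (`vllm:1:<base64(json)>:<hmac-sha256>`) can be sketched in a few lines. These helper names are hypothetical and the real serialisation of Harmony messages is more involved; this only illustrates the sign/verify shape:

```python
# Illustrative sketch of the signed state-carrier token format
# vllm:1:<base64(json)>:<hmac-sha256>. Function names are assumptions.
import base64
import hashlib
import hmac
import json


def build_state_token(messages: list, key: bytes) -> str:
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{payload}:{sig}"


def verify_state_token(token: str, key: bytes) -> list:
    prefix, version, payload, sig = token.split(":")
    if prefix != "vllm" or version != "1":
        raise ValueError("unrecognized state carrier")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state carrier signature mismatch")
    return json.loads(base64.b64decode(payload))
```

Because the signature is keyed, any replica holding the same `VLLM_RESPONSES_STATE_SIGNING_KEY` can verify a token minted by another replica, which is what makes the path multi-node safe.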

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise /
  deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add
  `previous_response` field + mutual-exclusion validator on
  `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response
  resolution; thread `prev_messages` through `_make_request*`; inject
  state carrier in `responses_full_generator`; 501 guards on
  `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier
  `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) —
  14 unit tests

Closes vllm-project#26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task,
  consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found
  and cancelled — cancellation succeeded, but no stored response object
  can be returned; 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError,
  since cancellation is the expected outcome of this call path

Update test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arrier guard, background invariant)

Fixes three issues found in review:

1. cancel_responses stateless mode (gemini-code-assist P1):
   - Return 404 (not 501) for unknown response_id — consistent with stateful path
   - Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
     cancelled but no stored response is available; 501 was misleading
   - Use logger.info (not logger.exception) for expected CancelledError

2. Missing carrier guard in create_responses (codex P1):
   - When previous_response has no state carrier and store is disabled,
     return 400 with a clear message instead of falling through to
     msg_store[id] KeyError → 500

3. background/store invariant in protocol validator (codex P2):
   - Reject background=True + previous_response at validation time rather
     than silently producing an unretrievable background response

Tests:
   - Add test_cancel_without_store_active_task_returns_400: covers the
     success branch of the cancel fix; uses await asyncio.sleep(0) to
     start the task before cancelling (Python 3.12: unstarted tasks
     cancelled before first await never run their body)
   - Add test_previous_response_without_carrier_returns_400: regression
     for the KeyError → 500 bug
   - Add test_background_with_previous_response_raises: regression for
     the background/store invariant
   - Remove test_no_previous_response_preserves_store_true: passed
     regardless of our code (no new path exercised)
   - Remove test_full_stateless_roundtrip: duplicate of
     test_build_and_extract_roundtrip
   - Rename test_prev_messages_used_over_empty_msg_store →
     test_construct_input_messages_prepends_prev_msg (accurate name)

All 31 tests pass.
…ey validation

Gemini critical — non-Harmony state carrier missing the assistant turn:
  carrier_messages was only the input messages, omitting the assistant
  response just generated. The next turn would see history without the
  last assistant message. Fix: append
  construct_chat_messages_with_tool_call(response.output) so the
  carrier contains the full turn (input + response).

Codex P1 — carrierless previous_response check gated on enable_store:
  The guard 'if prev_messages_from_state is None and not self.enable_store'
  was too narrow. previous_response always means stateless path; a
  server restart with enable_store=True and empty msg_store would still
  KeyError. Fix: drop the 'not self.enable_store' condition.

Codex P2 — any-length hex key accepted as signing key:
  bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
  len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.
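The fix described here is a small guard at key-parse time; a sketch under the assumption of a hypothetical helper name:

```python
# Sketch of the key-length check described above: require >= 32 bytes
# (64 hex chars) for the HMAC key. The function name is an assumption.
def parse_signing_key(hex_key: str) -> bytes:
    key_bytes = bytes.fromhex(hex_key)
    if len(key_bytes) < 32:
        raise ValueError(
            "VLLM_RESPONSES_STATE_SIGNING_KEY must be at least "
            "64 hex chars (32 bytes)")
    return key_bytes
```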

Tests:
  - test_previous_response_without_carrier_store_enabled_returns_400:
    regression for P1 (store=True path also returns 400, not KeyError)
  - test_short_key_raises: regression for P2 (4-byte key raises)

Run pre-commit --all-files; apply linter reformatting.

Signed-off-by: Will Deines <will@garr.io>
Extract the three in-memory dicts (response_store, msg_store,
response_store_lock) from OpenAIServingResponses into a pluggable
ResponseStore ABC with an InMemoryResponseStore default.

Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified
class name to swap in their own backend (Redis, Postgres, etc.)
without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with
  internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934
(pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).
get_messages/put_messages were accessing self._messages without the
internal asyncio.Lock, inconsistent with how self._responses is
protected. Adds lock acquisition for both methods to prevent race
conditions in concurrent access.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-designed feature: a pluggable ResponseStore abstraction, which correctly separates state management from the serving logic, and adds a stateless multi-turn conversation capability using a signed state carrier. This greatly improves the production-readiness of vLLM's Responses API. However, a critical prompt injection vulnerability has been identified. In the non-Harmony path, the server uses the unsigned output field from the client-provided previous_response even with a verified state carrier, allowing an attacker to spoof assistant messages in the conversation history. The suggested fix is to ensure the unsigned output is only used in the store-backed (stateful) path where previous_response is retrieved from the server's own store. Additionally, an area for improvement was noted in the deserialization logic to enhance its robustness.

request_input=request.input,
prev_msg=self.msg_store.get(prev_response.id) if prev_response else None,
prev_msg=prev_msg,
prev_response_output=prev_response.output if prev_response else None,

security-high high

In stateless multi-turn mode, the conversation history is reconstructed using both the signed state carrier (prev_msg) and the unsigned output field of the previous_response object. Since the output field is provided by the client and not protected by the HMAC, an attacker can modify it to inject arbitrary assistant messages into the prompt. Additionally, because the signed history already contains the previous turn's output, this results in the previous output being duplicated in the prompt.

To remediate this, the prev_response_output should be set to None when prev_messages (the stateless history) is provided, as the history is already complete.

Suggested change
prev_response_output=prev_response.output if prev_response else None,
prev_response_output=prev_response.output if prev_response and prev_messages is None else None,

Comment on lines +1163 to +1166
prev_msgs = [
OpenAIHarmonyMessage.model_validate(m) if isinstance(m, dict) else m
for m in prev_messages
]

high

The list comprehension to deserialize prev_messages has a branch else m which could allow non-dict elements into prev_msgs. This could lead to a TypeError later if an element is not an OpenAIHarmonyMessage instance. It's safer to rely on OpenAIHarmonyMessage.model_validate(m) for all elements and let it raise a ValidationError for non-dict items. This makes the code more robust against malformed state, even if such state is correctly signed.

                prev_msgs = [OpenAIHarmonyMessage.model_validate(m) for m in prev_messages]


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ac6fe63bb


Comment on lines +660 to 661
prev_msg=prev_msg,
prev_response_output=prev_response.output if prev_response else None,

P1 Badge Avoid duplicating prior output in stateless follow-up turns

When previous_response is used, prev_msg is populated from the state carrier, and that carrier is built from the full history (messages plus the prior turn’s output in responses_full_generator). Passing prev_response_output again here re-appends the same assistant output a second time, so each non-Harmony stateless turn repeats prior outputs, bloats context, and can skew model behavior.


create_response_store,
)

self.store = create_response_store()

P2 Badge Gate custom store initialization behind enable_store

create_response_store() is called unconditionally during handler init, so a configured VLLM_RESPONSES_STORE_BACKEND is imported/instantiated even when VLLM_ENABLE_RESPONSES_API_STORE=0. This can fail server startup (or create unnecessary external connections) in deployments that intentionally run stateless mode, because the custom backend is never needed on those code paths.

