feat(responses): pluggable ResponseStore abstraction #35874
will-deines wants to merge 8 commits into vllm-project:main
Conversation
Implements the @grs proposal for stateless multi-turn Responses API conversations without server-side storage, using the standard OpenAI `encrypted_content` field on a synthetic `ResponseReasoningItem` as the state carrier.

**How it works:**
1. The client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (the full response object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier item — no in-memory store is touched

**No breaking changes.** The existing `previous_response_id` + store-enabled path is unchanged. The new path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same 64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise / deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add `previous_response` field + mutual-exclusion validator on `ResponsesRequest`; `model_rebuild()` for the forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response resolution; thread `prev_messages` through `_make_request*`; inject the state carrier in `responses_full_generator`; 501 guards on `retrieve_responses` / `cancel_responses` when the store is disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) — 14 unit tests

Closes #26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
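The carrier token format described above (`vllm:1:<base64(json)>:<hmac-sha256>`) can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual `state.py`; the function names and JSON payload shape are assumptions:

```python
import base64
import hashlib
import hmac
import json

PREFIX = "vllm"
VERSION = "1"


def serialize_state(messages: list, key: bytes) -> str:
    """Pack the message history into a signed vllm:1:<base64(json)>:<hmac> token."""
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    mac = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{PREFIX}:{VERSION}:{payload}:{mac}"


def deserialize_state(token: str, key: bytes) -> list:
    """Verify the HMAC and recover the message history; raise on tampering."""
    prefix, version, payload, mac = token.split(":")
    if prefix != PREFIX or version != VERSION:
        raise ValueError("unrecognized state-carrier token")
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):  # constant-time comparison
        raise ValueError("state-carrier signature mismatch")
    return json.loads(base64.b64decode(payload))
```

Because base64 and hex never contain `:`, the four-way split is unambiguous; a token signed with a different key fails verification, which is what makes the carrier safe to round-trip through an untrusted client.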
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task, consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found and cancelled — cancellation succeeded, but no stored response object can be returned; 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError, since cancellation is the expected outcome of this call path

Update the test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arrier guard, background invariant)
Fixes three issues found in review:
1. cancel_responses stateless mode (gemini-code-assist P1):
- Return 404 (not 501) for unknown response_id — consistent with stateful path
- Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
cancelled but no stored response is available; 501 was misleading
- Use logger.info (not logger.exception) for expected CancelledError
2. Missing carrier guard in create_responses (codex P1):
- When previous_response has no state carrier and store is disabled,
return 400 with a clear message instead of falling through to
msg_store[id] KeyError → 500
3. background/store invariant in protocol validator (codex P2):
- Reject background=True + previous_response at validation time rather
than silently producing an unretrievable background response
Tests:
- Add test_cancel_without_store_active_task_returns_400: covers the
success branch of the cancel fix; uses await asyncio.sleep(0) to
start the task before cancelling (Python 3.12: unstarted tasks
cancelled before first await never run their body)
- Add test_previous_response_without_carrier_returns_400: regression
for the KeyError → 500 bug
- Add test_background_with_previous_response_raises: regression for
the background/store invariant
- Remove test_no_previous_response_preserves_store_true: passed
regardless of our code (no new path exercised)
- Remove test_full_stateless_roundtrip: duplicate of
test_build_and_extract_roundtrip
- Rename test_prev_messages_used_over_empty_msg_store →
test_construct_input_messages_prepends_prev_msg (accurate name)
All 31 tests pass.
…ey validation
Gemini critical — non-Harmony state carrier missing the assistant turn:
carrier_messages was only the input messages, omitting the assistant
response just generated. The next turn would see history without the
last assistant message. Fix: append
construct_chat_messages_with_tool_call(response.output) so the
carrier contains the full turn (input + response).
Codex P1 — carrierless previous_response check gated on enable_store:
The guard 'if prev_messages_from_state is None and not self.enable_store'
was too narrow. previous_response always means stateless path; a
server restart with enable_store=True and empty msg_store would still
KeyError. Fix: drop the 'not self.enable_store' condition.
Codex P2 — any-length hex key accepted as signing key:
bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.
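The length check described for Codex P2 might look like the following sketch; `load_signing_key` is an illustrative name, not the PR's actual helper:

```python
import binascii

MIN_KEY_BYTES = 32  # 256-bit HMAC-SHA256 key, i.e. 64 hex characters


def load_signing_key(hex_value: str) -> bytes:
    """Parse the hex signing key and reject keys too short for HMAC-SHA256."""
    key_bytes = bytes.fromhex(hex_value)
    if len(key_bytes) < MIN_KEY_BYTES:
        raise ValueError(
            f"signing key must be at least {MIN_KEY_BYTES} bytes "
            f"({MIN_KEY_BYTES * 2} hex chars), got {len(key_bytes)}"
        )
    return key_bytes
```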
Tests:
- test_previous_response_without_carrier_store_enabled_returns_400:
regression for P1 (store=True path also returns 400, not KeyError)
- test_short_key_raises: regression for P2 (4-byte key raises)
Run pre-commit --all-files; apply linter reformatting.
Signed-off-by: Will Deines <will@garr.io>
Extract the in-memory state (`response_store`, `msg_store`, `response_store_lock`) from `OpenAIServingResponses` into a pluggable `ResponseStore` ABC with an `InMemoryResponseStore` default. Users can point `VLLM_RESPONSES_STORE_BACKEND` to a fully-qualified class name to swap in their own backend (Redis, Postgres, etc.) without patching vLLM.

- Add `ResponseStore` ABC with 5 abstract methods + a `close()` hook
- Add `InMemoryResponseStore` wrapping the current dict behavior with an internal `asyncio.Lock` (removes the external `response_store_lock`)
- Add a `create_response_store()` factory reading the env var
- Refactor ~15 call sites in `serving.py` to use `self.store.*`
- Add the `VLLM_RESPONSES_STORE_BACKEND` env var to `envs.py`
- Update the test helper to use `InMemoryResponseStore`
- Add unit + integration tests for the store and its serving interactions

Follows up on #35740 (stateless multi-turn). Addresses RFC #26934 (pluggable state backends) and supersedes #34738 (LRU eviction).
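A rough sketch of the abstraction described above. Method names follow the PR text, but the signatures, the `SimpleNamespace` stand-in for response objects, and the demo are illustrative assumptions rather than the PR's actual code:

```python
import asyncio
from abc import ABC, abstractmethod
from types import SimpleNamespace
from typing import Any, Optional


class ResponseStore(ABC):
    """Pluggable persistence for Responses API responses and message histories."""

    @abstractmethod
    async def get_response(self, response_id: str) -> Optional[Any]: ...

    @abstractmethod
    async def put_response(self, response_id: str, response: Any,
                           unless_status: Optional[str] = None) -> bool: ...

    @abstractmethod
    async def get_messages(self, response_id: str) -> Optional[list]: ...

    @abstractmethod
    async def put_messages(self, response_id: str, messages: list) -> None: ...

    @abstractmethod
    async def update_response_status(self, response_id: str, new_status: str,
                                     allowed_current_statuses: tuple = ()) -> bool: ...

    async def close(self) -> None:
        """Optional cleanup hook for backends that hold connections."""


class InMemoryResponseStore(ResponseStore):
    """Default backend: plain dicts guarded by one internal asyncio.Lock."""

    def __init__(self) -> None:
        self._responses: dict = {}
        self._messages: dict = {}
        self._lock = asyncio.Lock()

    async def get_response(self, response_id):
        async with self._lock:
            return self._responses.get(response_id)

    async def put_response(self, response_id, response, unless_status=None):
        async with self._lock:
            existing = self._responses.get(response_id)
            if unless_status and existing is not None \
                    and existing.status == unless_status:
                return False  # conditional write skipped (e.g. already cancelled)
            self._responses[response_id] = response
            return True

    async def get_messages(self, response_id):
        async with self._lock:
            return self._messages.get(response_id)

    async def put_messages(self, response_id, messages):
        async with self._lock:
            self._messages[response_id] = messages

    async def update_response_status(self, response_id, new_status,
                                     allowed_current_statuses=()):
        async with self._lock:
            resp = self._responses.get(response_id)
            if resp is None or (allowed_current_statuses and
                                resp.status not in allowed_current_statuses):
                return False
            resp.status = new_status  # CAS-style transition under the lock
            return True


async def demo() -> tuple:
    store = InMemoryResponseStore()
    await store.put_response("r1", SimpleNamespace(status="in_progress"))
    cancelled = await store.update_response_status(
        "r1", "cancelled", allowed_current_statuses=("in_progress", "queued"))
    # A late completion write must not clobber the cancellation:
    clobbered = await store.put_response(
        "r1", SimpleNamespace(status="completed"), unless_status="cancelled")
    return cancelled, clobbered
```

Running `asyncio.run(demo())` returns `(True, False)`: the status transition succeeds from an allowed state, and the late write is refused once the response is cancelled — the same race the external lock used to guard.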
Code Review
This pull request introduces a well-designed pluggable ResponseStore abstraction, refactoring the in-memory response handling into a more robust and extensible system. The changes significantly improve thread safety by replacing direct dictionary access with atomic async methods, mitigating potential race conditions. Additionally, the implementation of a stateless multi-turn conversation feature using a signed state carrier is a valuable enhancement. The code is well-structured and accompanied by thorough tests. I've identified one critical issue related to a potential race condition in the InMemoryResponseStore that should be addressed.
get_messages/put_messages were accessing self._messages without the internal asyncio.Lock, inconsistent with how self._responses is protected. Adds lock acquisition for both methods to prevent race conditions in concurrent access.
Gemini's critical finding has been addressed in 7ac6fe6.
Summary
Extracts the in-memory response/message dicts from `OpenAIServingResponses` into a pluggable `ResponseStore` ABC, addressing RFC #26934.

- `ResponseStore` ABC (`vllm/entrypoints/openai/responses/store.py`) with 5 async methods for get/put responses, get/put messages, and atomic status transitions
- `InMemoryResponseStore` default implementation wrapping the current dict behavior with an internal `asyncio.Lock` (removes the external `response_store_lock`)
- `VLLM_RESPONSES_STORE_BACKEND` env var — point it to a fully-qualified class name (e.g. `mypackage.redis_store.RedisResponseStore`) to swap backends
- `serving.py` — ~15 call sites updated from direct dict access to `self.store.*` async calls
- Conditional writes (`put_response(..., unless_status="cancelled")`) replace get-check-put-under-lock patterns
- Atomic status transitions (`update_response_status(..., allowed_current_statuses=...)`) replace get-check-mutate patterns

What stays local (not pluggable): `event_store` and `background_tasks` — these use `asyncio.Event` / `asyncio.Task`, which are inherently process-local.

Decisions we made
This section explains the key design choices and how they trace back to the RFCs and discussions.
1. Pluggable ABC + factory, not a simple on/off toggle
RFC #26934 originally proposed `VLLM_ENABLE_RESPONSES_API_STORE` as a boolean toggle — either vLLM owns state or it doesn't. We chose a pluggable ABC with a factory instead because the RFC discussion converged on the principle that "vLLM should not be opinionated on whether it owns the state management layer" (@qandrew, comment). A toggle only gives two options; an ABC lets each deployment choose the backend that fits its topology — in-memory for dev, Redis/Postgres/DynamoDB for production — without vLLM shipping or maintaining any of those external integrations.

2. Env var with fully-qualified class name (`VLLM_RESPONSES_STORE_BACKEND`)

RFC #32850 discusses vLLM's policy for Responses API extensions. The agreed direction is to "allow extensions when there is a documented need" and follow existing vLLM patterns (@DanielMe's proposal). vLLM already uses this pattern for other pluggable components (e.g. `resolve_obj_by_qualname`). An env var keeps the interface zero-config for the common case (in-memory default) while enabling production deployments to swap backends without forking.

3. Atomic operations in the store interface, not external locks
The original `serving.py` used an external `response_store_lock` around get-check-put sequences on plain dicts. This pattern cannot work with external stores — a Redis or Postgres backend can't share a Python `asyncio.Lock`. We pushed atomicity into the ABC itself:

- `put_response(..., unless_status="cancelled")` — conditional write that skips if the response is already cancelled (replaces the cancel-race-condition guard)
- `update_response_status(..., allowed_current_statuses=...)` — atomic CAS-style status transition (replaces get-check-mutate patterns)

This lets external implementations use their native atomicity primitives (Redis `SETNX`, Postgres `UPDATE ... WHERE status IN (...)`, DynamoDB conditional expressions).

4. Pluggable store supersedes LRU eviction (#34738)
PR #34738 proposed `OrderedDict` + LRU eviction to bound memory. @qandrew's response questioned the production motivation and pointed to #26934 as the longer-term direction. @lisperz agreed that in-memory storage "doesn't survive restarts and doesn't scale across multiple API server instances." Rather than shipping a bounded-but-still-in-memory solution, the pluggable store lets users choose their own eviction/persistence/replication strategy. The in-memory default intentionally ships without LRU — adding it later as a separate concern is straightforward if needed.

5. `event_store` and `background_tasks` stay process-local

These use `asyncio.Event` and `asyncio.Task`, which are inherently bound to a single event loop. Externalizing them would require a fundamentally different design (distributed task queues, pub/sub for event notification). The RFC's priority is decoupling response/message persistence — the process-local coordination primitives are a separate concern for a follow-up if needed.

6. No new wire protocol fields
RFCs #32850 and #33381 both emphasize alignment with the OpenResponses spec and conservative extensions. This PR introduces zero new fields on the request or response wire format. The `VLLM_RESPONSES_STORE_BACKEND` env var is a server-side configuration knob, not a protocol extension. Clients are completely unaware of which backend is in use.

7. Complementary to the stateless path (#35740), not a replacement

RFC #26934 identifies two user personas: large-scale production (stateless, external state management) and researchers/small-scale (stateful, all-in-one). PR #35740 serves persona 1 with the `encrypted_content` state carrier. This PR serves both personas by making the store-backed path production-ready: persona 2 gets the in-memory default that just works, and persona 1 can either go fully stateless (#35740) or plug in their own persistent store for features that need server-side state (e.g. `background` mode, `retrieve_responses`).

8. Concurrency ownership moves into the store
The external `response_store_lock` in `serving.py` was a code smell — it coupled the serving layer to the implementation detail that the store was a plain dict. Each store implementation now owns its own concurrency strategy: `InMemoryResponseStore` uses an `asyncio.Lock`, a Redis implementation might use optimistic locking or `WATCH`/`MULTI`, a Postgres implementation might use row-level locks. This is a cleaner separation of concerns.

Related Issues & RFCs

- `previous_response_id` requires the store; a pluggable backend enables production-grade multi-turn
- The `VLLM_RESPONSES_STORE_BACKEND` env var is a vLLM extension; it aligns with existing extension precedent

Context from RFC #26934 (Meta)
The RFC argues that vLLM should not be opinionated about storage — users need to plug in their own backend depending on deployment topology (single-node dev, multi-node production, managed service). The current in-memory dicts (`response_store`, `msg_store`) are marked `# HACK` / `# FIXME` in `serving.py`.

This PR provides the interface layer. vLLM ships only the in-memory default — users bring their own Redis/Postgres/DynamoDB implementation by setting one env var. No external dependencies added.
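The one-env-var swap can be sketched with a small factory. `create_response_store` follows the name given in the PR description, but the body and the stand-in default class here are illustrative assumptions:

```python
import importlib
import os


class InMemoryResponseStore:
    """Stand-in for the default backend shipped with vLLM."""


def create_response_store(env_var: str = "VLLM_RESPONSES_STORE_BACKEND"):
    """Instantiate the backend named by a fully-qualified class path,
    falling back to the in-memory default when the env var is unset."""
    qualname = os.environ.get(env_var)
    if not qualname:
        return InMemoryResponseStore()  # zero-config default
    module_name, _, class_name = qualname.rpartition(".")
    if not module_name:
        raise ValueError(f"{env_var} must be a fully-qualified class name")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()
```

For example, `VLLM_RESPONSES_STORE_BACKEND=mypackage.redis_store.RedisResponseStore` would import `mypackage.redis_store` and instantiate `RedisResponseStore`, with no vLLM-side changes.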
Why not just LRU eviction (#34738)?
LRU eviction (proposed in #34738) is a band-aid that still loses state on restart and doesn't work across nodes. The pluggable store lets users choose their own eviction/persistence/replication strategy appropriate to their deployment.
Test plan
- `pytest tests/entrypoints/openai/responses/test_store.py -v` — unit + integration tests for the store ABC, `InMemoryResponseStore`, the factory, and `serving.py` integration (cancel/retrieve)
- `pytest tests/entrypoints/openai/responses/test_serving_stateless.py -v` — existing stateless tests pass with the refactored store
- `pre-commit run --all-files` — linting passes