
feat(responses): pluggable ResponseStore abstraction#35874

Closed
will-deines wants to merge 8 commits into vllm-project:main from will-deines:feat/pluggable-response-store

Conversation


@will-deines will-deines commented Mar 3, 2026

Depends on #35740 — review that first. Diff will clean up automatically once it merges.

Summary

Extracts the in-memory response/message dicts from OpenAIServingResponses into a pluggable ResponseStore ABC, addressing RFC #26934.

  • New ResponseStore ABC (vllm/entrypoints/openai/responses/store.py) with 5 async methods for get/put responses, get/put messages, and atomic status transitions
  • InMemoryResponseStore default implementation wrapping current dict behavior with internal asyncio.Lock (removes external response_store_lock)
  • VLLM_RESPONSES_STORE_BACKEND env var — point to a fully-qualified class name (e.g. mypackage.redis_store.RedisResponseStore) to swap backends
  • Refactored serving.py — ~15 call sites updated from direct dict access to self.store.* async calls
  • Atomic conditional writes (put_response(..., unless_status="cancelled")) replace get-check-put-under-lock patterns
  • Atomic status transitions (update_response_status(..., allowed_current_statuses=...)) replace get-check-mutate patterns

What stays local (not pluggable)

  • event_store and background_tasks — these use asyncio.Event/asyncio.Task which are inherently process-local

Decisions we made

This section explains the key design choices and how they trace back to the RFCs and discussions.

1. Pluggable ABC + factory, not a simple on/off toggle

RFC #26934 originally proposed VLLM_ENABLE_RESPONSES_API_STORE as a boolean toggle — either vLLM owns state or it doesn't. We chose a pluggable ABC with a factory instead because the RFC discussion converged on the principle that "vLLM should not be opinionated on whether it owns the state management layer" (@qandrew, comment). A toggle only gives two options; an ABC lets each deployment choose the backend that fits its topology — in-memory for dev, Redis/Postgres/DynamoDB for production — without vLLM shipping or maintaining any of those external integrations.

2. Env var with fully-qualified class name (VLLM_RESPONSES_STORE_BACKEND)

RFC #32850 discusses vLLM's policy for Responses API extensions. The agreed direction is to "allow extensions when there is a documented need" and follow existing vLLM patterns (@DanielMe's proposal). vLLM already uses this pattern for other pluggable components (e.g. resolve_obj_by_qualname). An env var keeps the interface zero-config for the common case (in-memory default) while enabling production deployments to swap backends without forking.
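A minimal sketch of such a factory, assuming the qualname-resolution pattern the PR references (the real helper lives in vLLM; it is stubbed here with importlib so the example is self-contained, and InMemoryResponseStore is a placeholder):

```python
# Hypothetical factory sketch; names mirror the PR description.
import importlib
import os


def resolve_obj_by_qualname(qualname: str):
    """Import `pkg.module.ClassName` and return the class object."""
    module_name, _, attr = qualname.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class InMemoryResponseStore:
    """Placeholder for the zero-config default backend."""


def create_response_store():
    backend = os.environ.get("VLLM_RESPONSES_STORE_BACKEND")
    if not backend:
        return InMemoryResponseStore()  # common case: no configuration
    return resolve_obj_by_qualname(backend)()
```

Setting `VLLM_RESPONSES_STORE_BACKEND=mypackage.redis_store.RedisResponseStore` would then instantiate that class instead of the in-memory default.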

3. Atomic operations in the store interface, not external locks

The original serving.py used an external response_store_lock around get-check-put sequences on plain dicts. This pattern cannot work with external stores — a Redis or Postgres backend can't share a Python asyncio.Lock. We pushed atomicity into the ABC itself:

  • put_response(..., unless_status="cancelled") — conditional write that skips if the response is already cancelled (replaces the cancel-race-condition guard)
  • update_response_status(..., allowed_current_statuses=...) — atomic CAS-style status transition (replaces get-check-mutate patterns)

This lets external implementations use their native atomicity primitives (Redis SETNX, Postgres UPDATE ... WHERE status IN (...), DynamoDB conditional expressions).

4. Pluggable store supersedes LRU eviction (#34738)

PR #34738 proposed OrderedDict + LRU eviction to bound memory. @qandrew's response questioned the production motivation and pointed to #26934 as the longer-term direction. @lisperz agreed that in-memory storage "doesn't survive restarts and doesn't scale across multiple API server instances." Rather than shipping a bounded-but-still-in-memory solution, the pluggable store lets users choose their own eviction/persistence/replication strategy. The in-memory default intentionally ships without LRU — adding it later as a separate concern is straightforward if needed.

5. event_store and background_tasks stay process-local

These use asyncio.Event and asyncio.Task which are inherently bound to a single event loop. Externalizing them would require a fundamentally different design (distributed task queues, pub/sub for event notification). The RFC's priority is decoupling response/message persistence — the process-local coordination primitives are a separate concern for a follow-up if needed.

6. No new wire protocol fields

RFCs #32850 and #33381 both emphasize alignment with the OpenResponses spec and conservative extensions. This PR introduces zero new fields on the request or response wire format. The VLLM_RESPONSES_STORE_BACKEND env var is a server-side configuration knob, not a protocol extension. Clients are completely unaware of which backend is in use.

7. Complementary to the stateless path (#35740), not a replacement

RFC #26934 identifies two user personas: large-scale production (stateless, external state management) and researchers/small-scale (stateful, all-in-one). PR #35740 serves persona 1 with the encrypted_content state carrier. This PR serves both personas by making the store-backed path production-ready: persona 2 gets the in-memory default that just works, and persona 1 can either go fully stateless (#35740) or plug in their own persistent store for features that need server-side state (e.g. background mode, retrieve_responses).

8. Concurrency ownership moves into the store

The external response_store_lock in serving.py was a code smell — it coupled the serving layer to the implementation detail that the store was a plain dict. Each store implementation now owns its own concurrency strategy: InMemoryResponseStore uses an asyncio.Lock, a Redis implementation might use optimistic locking or WATCH/MULTI, a Postgres implementation might use row-level locks. This is a cleaner separation of concerns.

Related Issues & RFCs

| # | Title | Relevance |
|---|---|---|
| #26934 | [RFC] Separating State & Providing Flexibility for serving ResponsesAPI | Primary motivation — proposes decoupling state management from vLLM; pluggable backends for production |
| #35740 | feat(responses): stateless multi-turn via encrypted_content state carrier | Predecessor PR — implements stateless path; this PR adds the pluggable interface for the store-backed path |
| #33089 | [Feature] Support multi-turn conversation for OpenAI Response API | Direct fix — previous_response_id requires store; pluggable backend enables production-grade multi-turn |
| #34738 | Fix memory leak in Responses API store (LRU eviction) | Superseded — LRU eviction is a band-aid; pluggable store lets users choose their own eviction/persistence strategy |
| #32850 | [RFC] Clarify policy for Open Responses API extensions in vLLM | Extension policy — VLLM_RESPONSES_STORE_BACKEND env var is a vLLM extension; aligns with existing extension precedent |
| #33381 | [RFC] Align with the openresponses.org spec | Conformance — pluggable store supports standard state formats; no new wire protocol fields |

Context from RFC #26934 (Meta)

The RFC argues that vLLM should not be opinionated about storage — users need to plug in their own backend depending on deployment topology (single-node dev, multi-node production, managed service). The current in-memory dicts (response_store, msg_store), marked # HACK / # FIXME in serving.py:

  • Leak memory unboundedly
  • Lose all state on restart
  • Are incompatible with multi-node / load-balanced deployments

This PR provides the interface layer. vLLM ships only the in-memory default — users bring their own Redis/Postgres/DynamoDB implementation by setting one env var. No external dependencies added.

Why not just LRU eviction (#34738)?

LRU eviction (proposed in #34738) is a band-aid that still loses state on restart and doesn't work across nodes. The pluggable store lets users choose their own eviction/persistence/replication strategy appropriate to their deployment.

Test plan

  • pytest tests/entrypoints/openai/responses/test_store.py -v — unit + integration tests for store ABC, InMemoryResponseStore, factory, and serving.py integration (cancel/retrieve)
  • pytest tests/entrypoints/openai/responses/test_serving_stateless.py -v — existing stateless tests pass with refactored store
  • pre-commit run --all-files — linting passes
  • End-to-end tests against a running server (local_test/ harness, gitignored)

garrio-1 and others added 6 commits March 2, 2026 07:00
Implements the @grs proposal for stateless multi-turn Responses API
conversations without server-side storage, using the standard OpenAI
`encrypted_content` field on a synthetic `ResponseReasoningItem` as the
state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob
   (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic
   `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response
   object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier
   item — no in-memory store touched
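The `vllm:1:<base64(json)>:<hmac-sha256>` format described above can be sketched as follows; the function names are illustrative, not vLLM's actual API:

```python
# Illustrative sketch of the signed state-carrier token format.
import base64
import hashlib
import hmac
import json

SIGNING_KEY = bytes.fromhex("ab" * 32)  # 64 hex chars, per the env var


def build_state_carrier(messages: list[dict]) -> str:
    payload = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"vllm:1:{payload}:{sig}"


def extract_state_carrier(token: str) -> list[dict]:
    # base64 and hex never contain ':', so a plain split is safe
    prefix, version, payload, sig = token.split(":")
    if prefix != "vllm" or version != "1":
        raise ValueError("unrecognised carrier format")
    expected = hmac.new(SIGNING_KEY, payload.encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("carrier signature mismatch")
    return json.loads(base64.b64decode(payload))
```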

**No breaking changes.** Existing `previous_response_id` + store-enabled
path is unchanged. New path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same
64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise /
  deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add
  `previous_response` field + mutual-exclusion validator on
  `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response
  resolution; thread `prev_messages` through `_make_request*`; inject
  state carrier in `responses_full_generator`; 501 guards on
  `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier
  `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) —
  14 unit tests

Closes #26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task,
  consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found
  and cancelled — cancellation succeeded, but no stored response object
  can be returned; 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError,
  since cancellation is the expected outcome of this call path

Update test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arrier guard, background invariant)

Fixes three issues found in review:

1. cancel_responses stateless mode (gemini-code-assist P1):
   - Return 404 (not 501) for unknown response_id — consistent with stateful path
   - Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
     cancelled but no stored response is available; 501 was misleading
   - Use logger.info (not logger.exception) for expected CancelledError

2. Missing carrier guard in create_responses (codex P1):
   - When previous_response has no state carrier and store is disabled,
     return 400 with a clear message instead of falling through to
     msg_store[id] KeyError → 500

3. background/store invariant in protocol validator (codex P2):
   - Reject background=True + previous_response at validation time rather
     than silently producing an unretrievable background response

Tests:
   - Add test_cancel_without_store_active_task_returns_400: covers the
     success branch of the cancel fix; uses await asyncio.sleep(0) to
     start the task before cancelling (Python 3.12: unstarted tasks
     cancelled before first await never run their body)
   - Add test_previous_response_without_carrier_returns_400: regression
     for the KeyError → 500 bug
   - Add test_background_with_previous_response_raises: regression for
     the background/store invariant
   - Remove test_no_previous_response_preserves_store_true: passed
     regardless of our code (no new path exercised)
   - Remove test_full_stateless_roundtrip: duplicate of
     test_build_and_extract_roundtrip
   - Rename test_prev_messages_used_over_empty_msg_store →
     test_construct_input_messages_prepends_prev_msg (accurate name)

All 31 tests pass.
…ey validation

Gemini critical — non-Harmony state carrier missing the assistant turn:
  carrier_messages was only the input messages, omitting the assistant
  response just generated. The next turn would see history without the
  last assistant message. Fix: append
  construct_chat_messages_with_tool_call(response.output) so the
  carrier contains the full turn (input + response).

Codex P1 — carrierless previous_response check gated on enable_store:
  The guard 'if prev_messages_from_state is None and not self.enable_store'
  was too narrow. previous_response always means stateless path; a
  server restart with enable_store=True and empty msg_store would still
  KeyError. Fix: drop the 'not self.enable_store' condition.

Codex P2 — any-length hex key accepted as signing key:
  bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
  len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.
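The fix described here amounts to a length check on the decoded key; a sketch with illustrative names (not the actual vLLM code):

```python
# Reject hex keys shorter than 32 bytes instead of silently
# accepting a weak HMAC key (e.g. bytes.fromhex("aa") -> 1 byte).
MIN_KEY_BYTES = 32  # 64 hex characters


def load_signing_key(hex_value: str) -> bytes:
    key = bytes.fromhex(hex_value)
    if len(key) < MIN_KEY_BYTES:
        raise ValueError(
            f"signing key must be at least {MIN_KEY_BYTES} bytes "
            f"({MIN_KEY_BYTES * 2} hex chars); got {len(key)}")
    return key
```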

Tests:
  - test_previous_response_without_carrier_store_enabled_returns_400:
    regression for P1 (store=True path also returns 400, not KeyError)
  - test_short_key_raises: regression for P2 (4-byte key raises)

Run pre-commit --all-files; apply linter reformatting.

Signed-off-by: Will Deines <will@garr.io>
Extract the three in-memory dicts (response_store, msg_store,
response_store_lock) from OpenAIServingResponses into a pluggable
ResponseStore ABC with an InMemoryResponseStore default.

Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified
class name to swap in their own backend (Redis, Postgres, etc.)
without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with
  internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on #35740 (stateless multi-turn). Addresses RFC #26934
(pluggable state backends) and supersedes #34738 (LRU eviction).
@mergify mergify bot added the frontend label Mar 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed pluggable ResponseStore abstraction, refactoring the in-memory response handling into a more robust and extensible system. The changes significantly improve thread safety by replacing direct dictionary access with atomic async methods, mitigating potential race conditions. Additionally, the implementation of a stateless multi-turn conversation feature using a signed state carrier is a valuable enhancement. The code is well-structured and accompanied by thorough tests. I've identified one critical issue related to a potential race condition in the InMemoryResponseStore that should be addressed.

get_messages/put_messages were accessing self._messages without the
internal asyncio.Lock, inconsistent with how self._responses is
protected. Adds lock acquisition for both methods to prevent race
conditions in concurrent access.
@will-deines
Author

Gemini's critical finding has been addressed in 7ac6fe6: get_messages and put_messages now acquire self._lock before accessing self._messages, consistent with how self._responses is protected.
