
feat(responses): pluggable ResponseStore abstraction#35905

Open
will-deines wants to merge 14 commits into vllm-project:main from will-deines:feat/pluggable-response-store

Conversation

@will-deines

@will-deines will-deines commented Mar 3, 2026

Recreated from #35874, which was closed when the fork was temporarily made private.

Depends on #35903 — review that first. Diff will clean up automatically once it merges.

Summary

Extracts the in-memory response/message dicts from OpenAIServingResponses into a pluggable ResponseStore ABC, addressing RFC #26934.

  • New ResponseStore ABC (vllm/entrypoints/openai/responses/store.py) with 5 async methods for get/put responses, get/put messages, and atomic status transitions
  • InMemoryResponseStore default implementation wrapping current dict behavior with internal asyncio.Lock (removes external response_store_lock)
  • VLLM_RESPONSES_STORE_BACKEND env var — point to a fully-qualified class name (e.g. mypackage.redis_store.RedisResponseStore) to swap backends
  • Refactored serving.py — ~15 call sites updated from direct dict access to self.store.* async calls
  • Atomic conditional writes (put_response(..., unless_status="cancelled")) replace get-check-put-under-lock patterns
  • Atomic status transitions (update_response_status(..., allowed_current_statuses=...)) replace get-check-mutate patterns
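
A minimal sketch of what such an ABC could look like (signatures and the exact method set are illustrative; the real interface lives in vllm/entrypoints/openai/responses/store.py):

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class ResponseStore(ABC):
    """Illustrative sketch of a pluggable response store interface."""

    @abstractmethod
    async def get_response(self, response_id: str) -> Optional[Any]: ...

    @abstractmethod
    async def put_response(
        self, response_id: str, response: Any, unless_status: Optional[str] = None
    ) -> bool:
        """Store a response; skip the write (and return False) when the
        currently stored status equals `unless_status`."""

    @abstractmethod
    async def get_messages(self, response_id: str) -> Optional[list]: ...

    @abstractmethod
    async def put_messages(self, response_id: str, messages: list) -> None: ...

    @abstractmethod
    async def update_response_status(
        self, response_id: str, new_status: str,
        allowed_current_statuses: set[str],
    ) -> Optional[Any]:
        """CAS-style transition: apply only if the current status is allowed."""

    async def close(self) -> None:
        """Optional cleanup hook for external backends (connections, pools)."""
```

An external backend subclasses this and implements each method against its own storage primitives; the serving layer only ever talks to the abstract interface.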

What stays local (not pluggable)

  • event_store and background_tasks — these use asyncio.Event/asyncio.Task which are inherently process-local. retrieve_responses gracefully handles the cross-replica case where a response exists in the shared store but has no local event buffer (see review follow-ups below).

Decisions we made

This section explains the key design choices and how they trace back to the RFCs and discussions.

1. Pluggable ABC + factory, not a simple on/off toggle

RFC #26934 originally proposed VLLM_ENABLE_RESPONSES_API_STORE as a boolean toggle — either vLLM owns state or it doesn't. We chose a pluggable ABC with a factory instead because the RFC discussion converged on the principle that "vLLM should not be opinionated on whether it owns the state management layer" (@qandrew, comment). A toggle only gives two options; an ABC lets each deployment choose the backend that fits its topology — in-memory for dev, Redis/Postgres/DynamoDB for production — without vLLM shipping or maintaining any of those external integrations.

2. Env var with fully-qualified class name (VLLM_RESPONSES_STORE_BACKEND)

RFC #32850 discusses vLLM's policy for Responses API extensions. The agreed direction is to "allow extensions when there is a documented need" and follow existing vLLM patterns (@DanielMe's proposal). vLLM already uses this pattern for other pluggable components (e.g. resolve_obj_by_qualname). An env var keeps the interface zero-config for the common case (in-memory default) while enabling production deployments to swap backends without forking.
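
A sketch of how such a factory might resolve the env var (the factory name and env var come from this PR; the `default_cls` parameter and the plain-importlib resolution shown here are illustrative stand-ins for vLLM's own qualname resolver):

```python
import importlib
import os


def create_response_store(default_cls=dict):
    """Resolve VLLM_RESPONSES_STORE_BACKEND to a store class.

    The env var holds a fully-qualified class name such as
    "mypackage.redis_store.RedisResponseStore"; when unset, the
    in-memory default is used.
    """
    qualname = os.environ.get("VLLM_RESPONSES_STORE_BACKEND")
    if not qualname:
        return default_cls()
    module_name, _, class_name = qualname.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()
```

This keeps the common case zero-config while letting a deployment swap in any importable class without patching vLLM.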

3. Atomic operations in the store interface, not external locks

The original serving.py used an external response_store_lock around get-check-put sequences on plain dicts. This pattern cannot work with external stores — a Redis or Postgres backend can't share a Python asyncio.Lock. We pushed atomicity into the ABC itself:

  • put_response(..., unless_status="cancelled") — conditional write that skips if the response is already cancelled (replaces the cancel-race-condition guard)
  • update_response_status(..., allowed_current_statuses=...) — atomic CAS-style status transition (replaces get-check-mutate patterns)

This lets external implementations use their native atomicity primitives (Redis SETNX, Postgres UPDATE ... WHERE status IN (...), DynamoDB conditional expressions).
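
A sketch of how the in-memory default can honor these atomic semantics under a single internal lock (illustrative: responses are plain dicts with a "status" key here, whereas the real implementation operates on response objects):

```python
import asyncio


class InMemoryStoreSketch:
    """Illustrative in-memory store with atomic conditional operations."""

    def __init__(self):
        self._responses: dict[str, dict] = {}
        self._lock = asyncio.Lock()

    async def put_response(self, response_id, response, unless_status=None):
        async with self._lock:
            current = self._responses.get(response_id)
            # Conditional write: unless_status="cancelled" keeps a cancelled
            # response cancelled even if a late finalization result arrives.
            if current is not None and current.get("status") == unless_status:
                return False
            self._responses[response_id] = response
            return True

    async def update_response_status(self, response_id, new_status,
                                     allowed_current_statuses):
        async with self._lock:
            current = self._responses.get(response_id)
            # CAS-style transition: only flip status from an allowed state.
            if current is None or current["status"] not in allowed_current_statuses:
                return None
            current["status"] = new_status
            return current
```

The same get-check-act sequence happens entirely inside the store, so an external backend is free to replace the lock with whatever native primitive gives it the same guarantee.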

4. Pluggable store supersedes LRU eviction (#34738)

PR #34738 proposed OrderedDict + LRU eviction to bound memory. @qandrew's response questioned the production motivation and pointed to #26934 as the longer-term direction. @lisperz agreed that in-memory storage "doesn't survive restarts and doesn't scale across multiple API server instances." Rather than shipping a bounded-but-still-in-memory solution, the pluggable store lets users choose their own eviction/persistence/replication strategy. The in-memory default intentionally ships without LRU — adding it later as a separate concern is straightforward if needed.

5. event_store and background_tasks stay process-local

These use asyncio.Event and asyncio.Task which are inherently bound to a single event loop. Externalizing them would require a fundamentally different design (distributed task queues, pub/sub for event notification). The RFC's priority is decoupling response/message persistence.

For the cross-replica implications this creates:

  • Streaming retrieval: retrieve_responses with stream=True now checks for a local event buffer before delegating to the stream generator. If the response was processed on another replica and has reached a terminal state (completed/failed/cancelled), the full response is returned as a non-streaming fallback. If it's still in_progress on another replica, a 400 error directs the client to use non-streaming retrieval instead.
  • Cancel: cancel_responses marks the store entry as "cancelled" via CAS, but can only cancel process-local tasks. A response being processed on another replica will continue running until finalization, where put_response(..., unless_status="cancelled") discards the result — store state is consistent, just compute is wasted. True distributed cancellation (pub/sub, task queue) is deferred.
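
The streaming-retrieval decision above can be sketched as a small pure function (names are hypothetical; the real logic is inlined in retrieve_responses):

```python
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def resolve_stream_retrieval(response: dict, event_store: dict,
                             response_id: str) -> str:
    """Decide how to serve GET .../{id}?stream=true across replicas.

    Returns "stream", "full_response", or "error_400" depending on whether
    this replica holds the local event buffer for the response.
    """
    if response_id in event_store:
        # Local buffer exists: replay SSE events as before.
        return "stream"
    if response["status"] in TERMINAL_STATUSES:
        # Processed on another replica but finished: non-streaming fallback.
        return "full_response"
    # Still running on another replica: events can't be relayed from here.
    return "error_400"
```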

6. No new wire protocol fields

RFCs #32850 and #33381 both emphasize alignment with the OpenResponses spec and conservative extensions. This PR introduces zero new fields on the request or response wire format. The VLLM_RESPONSES_STORE_BACKEND env var is a server-side configuration knob, not a protocol extension. Clients are completely unaware of which backend is in use.

7. Complementary to the stateless path (#35903), not a replacement

RFC #26934 identifies two user personas: large-scale production (stateless, external state management) and researchers/small-scale (stateful, all-in-one). PR #35903 serves persona 1 with the encrypted_content state carrier. This PR serves both personas by making the store-backed path production-ready: persona 2 gets the in-memory default that just works, and persona 1 can either go fully stateless (#35903) or plug in their own persistent store for features that need server-side state (e.g. background mode, retrieve_responses).

8. Concurrency ownership moves into the store

The external response_store_lock in serving.py was a code smell — it coupled the serving layer to the implementation detail that the store was a plain dict. Each store implementation now owns its own concurrency strategy: InMemoryResponseStore uses an asyncio.Lock, a Redis implementation might use optimistic locking or WATCH/MULTI, a Postgres implementation might use row-level locks. This is a cleaner separation of concerns.

Review follow-ups

Fixes for issues identified during review:

Fix 1: Double-include of assistant output on stateless non-Harmony path

On the stateless path, prev_messages (deserialized from the state carrier) already contains the assistant's output. But construct_input_messages() was also called with prev_response_output=prev_response.output, which extracted ResponseOutputMessage items and appended them again as assistant messages. Fixed by only passing prev_response_output on the store-backed path (when prev_messages is None).

Fix 2: Tampered state carrier returns 400 instead of 500

_extract_state_from_response() calls deserialize_state() which raises ValueError on HMAC mismatch. This call was before the try/except ValueError block that handles template/conversion errors. A tampered previous_response caused an uncaught 500. Fixed by wrapping the extraction in its own try/except ValueError that returns a clear 400 with an integrity-validation message.
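
A simplified sketch of the carrier round-trip behind this fix, following the `vllm:1:<base64(json)>:<hmac-sha256>` format from the commit messages (the real helpers live in state.py; the exact signatures here are illustrative):

```python
import base64
import hashlib
import hmac
import json

_FORMAT_VERSION = "vllm:1"


def serialize_state(messages, key: bytes) -> str:
    """Pack message history into a signed state carrier string."""
    payload_b64 = base64.b64encode(json.dumps(messages).encode()).decode()
    sig = hmac.new(key, payload_b64.encode(), hashlib.sha256).hexdigest()
    return f"{_FORMAT_VERSION}:{payload_b64}:{sig}"


def deserialize_state(carrier: str, key: bytes):
    """Verify and unpack a carrier; raises ValueError on HMAC mismatch.

    The serving layer wraps this call in try/except ValueError and maps
    the failure to a 400 invalid_request_error.
    """
    # Strip the known version prefix, then rsplit payload from signature.
    payload_b64, sig = carrier[len(f"{_FORMAT_VERSION}:"):].rsplit(":", 1)
    expected = hmac.new(key, payload_b64.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("state carrier failed integrity validation")
    return json.loads(base64.b64decode(payload_b64))
```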

Fix 3: Cross-replica cancel limitation documented

With a shared ResponseStore across replicas, cancel_responses marks the store entry as "cancelled" via CAS but can only cancel process-local tasks. Documented this limitation with a code comment near the background_tasks lookup.

Fix 4: Streaming retrieval without local event buffer

retrieve_responses with stream=True delegated to responses_background_stream_generator which raises if the response_id is not in the process-local event_store. With a shared external store, responses processed on another replica exist in the store but not in event_store. Fixed by checking event_store first: terminal responses fall back to returning the full response; in-progress responses return a 400 directing the client to use non-streaming retrieval.

Related Issues & RFCs

| # | Title | Relevance |
|---|-------|-----------|
| #26934 | [RFC] Separating State & Providing Flexibility for serving Responses API | Primary motivation — proposes decoupling state management from vLLM; pluggable backends for production |
| #35903 | feat(responses): stateless multi-turn via encrypted_content state carrier | Predecessor PR — implements stateless path; this PR adds the pluggable interface for the store-backed path |
| #33089 | [Feature] Support multi-turn conversation for OpenAI Response API | Direct fix — previous_response_id requires store; pluggable backend enables production-grade multi-turn |
| #34738 | Fix memory leak in Responses API store (LRU eviction) | Superseded — LRU eviction is a band-aid; pluggable store lets users choose their own eviction/persistence strategy |
| #32850 | [RFC] Clarify policy for Open Responses API extensions in vLLM | Extension policy — VLLM_RESPONSES_STORE_BACKEND env var is a vLLM extension; aligns with existing extension precedent |
| #33381 | [RFC] Align with the openresponses.org spec | Conformance — pluggable store supports standard state formats; no new wire protocol fields |

Context from RFC #26934 (Meta)

The RFC argues that vLLM should not be opinionated about storage — users need to plug in their own backend depending on deployment topology (single-node dev, multi-node production, managed service). The current in-memory dicts (response_store, msg_store), marked # HACK / # FIXME in serving.py:

  • Leak memory unboundedly
  • Lose all state on restart
  • Are incompatible with multi-node / load-balanced deployments

This PR provides the interface layer. vLLM ships only the in-memory default — users bring their own Redis/Postgres/DynamoDB implementation by setting one env var. No external dependencies added.

Why not just LRU eviction (#34738)?

LRU eviction (proposed in #34738) is a band-aid that still loses state on restart and doesn't work across nodes. The pluggable store lets users choose their own eviction/persistence/replication strategy appropriate to their deployment.

Note: This PR depends on #35903 (stateless multi-turn). The diff against main also includes state.py, protocol.py, utils.py, test_state.py, and test_serving_stateless.py from that dependency — those files are described in #35903's body.

Test plan

  • pytest tests/entrypoints/openai/responses/test_store.py -v — unit + integration tests for store ABC, InMemoryResponseStore, factory, and serving.py integration (cancel/retrieve/streaming fallback)
  • pytest tests/entrypoints/openai/responses/test_serving_stateless.py -v — stateless tests including double-include prevention, tampered carrier → 400, and existing error paths
  • pytest tests/entrypoints/openai/responses/test_state.py -v — state carrier serialization/deserialization round-trips and HMAC validation
  • pre-commit run --all-files — linting passes
  • E2e tests against running server (local_test/ harness, gitignored)

@mergify mergify bot added the frontend label Mar 3, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a high-quality pull request that introduces a well-designed, pluggable ResponseStore abstraction, significantly improving the state management for the Responses API. The refactoring to an async store with atomic operations is a major enhancement over the previous dictionary-based approach with external locks. The implementation of the stateless multi-turn conversation feature using a signed state carrier is also well-executed and secure. The accompanying tests are thorough and cover critical paths, including error handling and security considerations. I have one suggestion to improve the robustness of the state carrier parsing logic to make it more resilient to future format changes.

Comment on lines +119 to +127

    # Expected: "vllm:1:<payload_b64>:<sig>"
    # Split into exactly 4 parts on the first 3 colons.
    parts = encrypted_content.split(":", 3)
    if len(parts) != 4:
        raise ValueError(
            "Malformed vLLM state carrier: expected "
            f"'{_FORMAT_VERSION}:<payload>:<sig>', got {encrypted_content!r}"
        )
    _, _, payload_b64, sig = parts
Contributor


high

The current parsing logic for the state carrier string relies on split(":", 3) and assumes that _FORMAT_VERSION contains exactly one colon. This could be fragile if the format version string changes in the future (e.g., to "vllm:2:alpha"), which would cause incorrect parsing of the payload and signature. Using rsplit to separate the signature from the right would be more robust against such changes.

Suggested change — before:

    # Expected: "vllm:1:<payload_b64>:<sig>"
    # Split into exactly 4 parts on the first 3 colons.
    parts = encrypted_content.split(":", 3)
    if len(parts) != 4:
        raise ValueError(
            "Malformed vLLM state carrier: expected "
            f"'{_FORMAT_VERSION}:<payload>:<sig>', got {encrypted_content!r}"
        )
    _, _, payload_b64, sig = parts

After:

    # Expected: "vllm:1:<payload_b64>:<sig>"
    # Strip prefix and robustly split payload from signature from the right.
    try:
        payload_b64, sig = encrypted_content[len(f"{_FORMAT_VERSION}:"):].rsplit(":", 1)
    except ValueError as exc:
        raise ValueError(
            "Malformed vLLM state carrier: expected "
            f"'{_FORMAT_VERSION}:<payload>:<sig>', got {encrypted_content!r}"
        ) from exc

Author


Agreed — changed to strip the known _FORMAT_VERSION: prefix first, then rsplit(":", 1) to split payload from signature. This is robust regardless of how many colons appear in the version string. Fixed in 52a0732.

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 3, 2026
@will-deines will-deines force-pushed the feat/pluggable-response-store branch from 71c3e26 to 08bd735 on March 4, 2026 20:12
@mergify

mergify bot commented Mar 5, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @will-deines.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 5, 2026
garrio-1 and others added 10 commits March 18, 2026 09:50
…-project#26934)

Implements the @grs proposal for stateless multi-turn Responses API
conversations without server-side storage, using the standard OpenAI
`encrypted_content` field on a synthetic `ResponseReasoningItem` as the
state carrier.

**How it works:**
1. Client sets `store=false` + `include=["reasoning.encrypted_content"]`
2. vLLM serialises the Harmony message history into a signed blob
   (`vllm:1:<base64(json)>:<hmac-sha256>`) and appends it as a synthetic
   `ReasoningItem` to the response output
3. On the next turn the client passes `previous_response` (full response
   object) instead of `previous_response_id`
4. vLLM extracts, verifies, and deserialises the history from the carrier
   item — no in-memory store touched

**No breaking changes.** Existing `previous_response_id` + store-enabled
path is unchanged. New path requires explicit opt-in.

**Multi-node safe:** set `VLLM_RESPONSES_STATE_SIGNING_KEY` to the same
64-char hex value on all nodes so tokens validate across replicas.

Files changed:
- `vllm/entrypoints/openai/responses/state.py` (new) — serialise /
  deserialise / HMAC-verify state carriers
- `vllm/entrypoints/openai/responses/protocol.py` — add
  `previous_response` field + mutual-exclusion validator on
  `ResponsesRequest`; `model_rebuild()` for forward ref
- `vllm/entrypoints/openai/responses/serving.py` — stateless prev-response
  resolution; thread `prev_messages` through `_make_request*`; inject
  state carrier in `responses_full_generator`; 501 guards on
  `retrieve_responses` / `cancel_responses` when store disabled
- `vllm/entrypoints/openai/responses/utils.py` — skip state-carrier
  `ReasoningItem`s when reconstructing chat messages
- `vllm/envs.py` — register `VLLM_RESPONSES_STATE_SIGNING_KEY`
- `tests/entrypoints/openai/responses/test_state.py` (new) — 16 unit tests
- `tests/entrypoints/openai/responses/test_serving_stateless.py` (new) —
  14 unit tests

Closes vllm-project#26934 (partial — non-streaming only; streaming carrier TBD)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Will Deines <will@garr.io>
…400 on success, info log

Per code review feedback:
- Return 404 (not 501) when response_id has no matching background task,
  consistent with the stateful path's _make_not_found_error behavior
- Return 400 BAD_REQUEST (not 501 NOT_IMPLEMENTED) when a task is found
  and cancelled — cancellation succeeded, but no stored response object
  can be returned; 501 was misleading
- Use logger.info instead of logger.exception for asyncio.CancelledError,
  since cancellation is the expected outcome of this call path

Update test to assert 404 for the unknown-id case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Will Deines <will@garr.io>
…arrier guard, background invariant)

Fixes three issues found in review:

1. cancel_responses stateless mode (gemini-code-assist P1):
   - Return 404 (not 501) for unknown response_id — consistent with stateful path
   - Return 400 BAD_REQUEST (not 501) on successful cancellation — task was
     cancelled but no stored response is available; 501 was misleading
   - Use logger.info (not logger.exception) for expected CancelledError

2. Missing carrier guard in create_responses (codex P1):
   - When previous_response has no state carrier and store is disabled,
     return 400 with a clear message instead of falling through to
     msg_store[id] KeyError → 500

3. background/store invariant in protocol validator (codex P2):
   - Reject background=True + previous_response at validation time rather
     than silently producing an unretrievable background response

Tests:
   - Add test_cancel_without_store_active_task_returns_400: covers the
     success branch of the cancel fix; uses await asyncio.sleep(0) to
     start the task before cancelling (Python 3.12: unstarted tasks
     cancelled before first await never run their body)
   - Add test_previous_response_without_carrier_returns_400: regression
     for the KeyError → 500 bug
   - Add test_background_with_previous_response_raises: regression for
     the background/store invariant
   - Remove test_no_previous_response_preserves_store_true: passed
     regardless of our code (no new path exercised)
   - Remove test_full_stateless_roundtrip: duplicate of
     test_build_and_extract_roundtrip
   - Rename test_prev_messages_used_over_empty_msg_store →
     test_construct_input_messages_prepends_prev_msg (accurate name)

All 31 tests pass.

Signed-off-by: Will Deines <will@garr.io>
…ey validation

Gemini critical — non-Harmony state carrier missing the assistant turn:
  carrier_messages was only the input messages, omitting the assistant
  response just generated. The next turn would see history without the
  last assistant message. Fix: append
  construct_chat_messages_with_tool_call(response.output) so the
  carrier contains the full turn (input + response).

Codex P1 — carrierless previous_response check gated on enable_store:
  The guard 'if prev_messages_from_state is None and not self.enable_store'
  was too narrow. previous_response always means stateless path; a
  server restart with enable_store=True and empty msg_store would still
  KeyError. Fix: drop the 'not self.enable_store' condition.

Codex P2 — any-length hex key accepted as signing key:
  bytes.fromhex('aa') produces 1 byte — a weak HMAC key. Fix: enforce
  len(key_bytes) >= 32 (64 hex chars) and raise ValueError if too short.

Tests:
  - test_previous_response_without_carrier_store_enabled_returns_400:
    regression for P1 (store=True path also returns 400, not KeyError)
  - test_short_key_raises: regression for P2 (4-byte key raises)

Run pre-commit --all-files; apply linter reformatting.

Signed-off-by: Will Deines <will@garr.io>
Extract the three in-memory dicts (response_store, msg_store,
response_store_lock) from OpenAIServingResponses into a pluggable
ResponseStore ABC with an InMemoryResponseStore default.

Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified
class name to swap in their own backend (Redis, Postgres, etc.)
without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with
  internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934
(pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).

Signed-off-by: Will Deines <will@garr.io>
get_messages/put_messages were accessing self._messages without the
internal asyncio.Lock, inconsistent with how self._responses is
protected. Adds lock acquisition for both methods to prevent race
conditions in concurrent access.

Signed-off-by: Will Deines <will@garr.io>
Signed-off-by: Will Deines <will@garr.io>
Signed-off-by: Will Deines <will@garr.io>
Handled by .git/info/exclude on feature branches, force-added on
production/garrio-release.

Signed-off-by: Will Deines <will@garr.io>
@will-deines will-deines force-pushed the feat/pluggable-response-store branch from 08bd735 to d154088 on March 18, 2026 13:59
@mergify mergify bot removed the needs-rebase label Mar 18, 2026
@mergify

mergify bot commented Mar 18, 2026

Hi @will-deines, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

    if stream:
        return self.responses_background_stream_generator(
            response_id,
            starting_after,

P2: Reject streamed retrievals that lack a local event buffer

Once responses are loaded from a shared ResponseStore, GET /v1/responses/{id}?stream=true can legitimately hit a different replica or one that has restarted. This branch still hands back responses_background_stream_generator(), but that generator immediately requires the process-local event_store, so persisted background responses become unretrievable as SSE even though the response row exists in the external store.


Comment on lines 665 to 666

    prev_msg=prev_msg,
    prev_response_output=prev_response.output if prev_response else None,

P1: Avoid double-including prior output in stateless chat turns

On the non-Harmony path, prev_messages already contains the full prior transcript because responses_full_generator() serialized messages + construct_chat_messages_with_tool_call(response.output) into the state carrier. Passing prev_response.output again here makes construct_input_messages() append the previous assistant output a second time, so every follow-up request using previous_response on non-Harmony models sends a duplicated last turn back to the model.


Author


Fixed in 9a34479. _make_request now passes prev_response_output=prev_response.output if prev_response and prev_messages is None else None — when prev_messages is not None we're on the stateless path and the carrier already has the full history, so we skip the redundant append.

Tests: TestDoubleIncludePrevention in test_serving_stateless.py verifies the assistant message appears exactly once on the stateless path and that the store-backed path still works.

Comment on lines +395 to +396

    if prev_response is not None and request.previous_response is not None:
        prev_messages_from_state = self._extract_state_from_response(prev_response)

P2: Handle invalid state carriers as a request error

_extract_state_from_response() raises ValueError when the client sends a malformed or HMAC-mismatched previous_response, but this call happens before the preprocessing try/except. In create_responses() that means bad user input escapes as an unhandled exception instead of an invalid_request_error, so a tampered stateless continuation currently turns into a 500.


Author


Fixed in 9a34479. The _extract_state_from_response() call is now wrapped in its own try/except ValueError that returns a 400 with an integrity-validation message, before falling through to the existing try/except block for template/conversion errors.

Test: TestTamperedCarrierReturns400 in test_serving_stateless.py builds a valid carrier, corrupts the HMAC, and asserts the result is a 400 mentioning "integrity".

Comment on lines +1472 to +1475

    response = await self.store.update_response_status(
        response_id,
        "cancelled",
        allowed_current_statuses={"queued", "in_progress"},

P1: Don't report remote background jobs as cancelled prematurely

With a shared custom ResponseStore and more than one API replica, this CAS can run on a replica that did not create the background job. The status is flipped to cancelled in the shared store before we check the process-local background_tasks, so a cross-replica /cancel call returns a cancelled response even though the real task on the original replica keeps running. Because finalization later uses unless_status='cancelled', the store continues to claim the job was stopped while generation/tool execution still completes.


Author


Acknowledged — this is a real limitation but by design. Added a code comment in 9a34479 near the background_tasks lookup documenting the behavior:

With a shared ResponseStore across replicas, the CAS above marks the response as "cancelled" in the store, but the actual background task may be running on a different replica. We can only cancel process-local tasks here; the remote replica will discard its result at finalization via put_response(..., unless_status="cancelled").

Store state is consistent (the CAS + unless_status guard at finalization ensures a cancelled response stays cancelled), but compute is wasted on the remote replica. True distributed cancellation would require pub/sub or a task queue, which is explicitly deferred — event_store and background_tasks are process-local by design (see Decision #5 in the PR body).

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 18, 2026
…ampered carrier, cross-replica streaming)

- Fix double-include of assistant output on stateless non-Harmony path
  by only passing prev_response_output when prev_messages is None
- Catch ValueError from _extract_state_from_response before the
  try/except block so tampered HMAC returns 400 instead of 500
- Add event_store pre-check in retrieve_responses for streaming:
  terminal responses fall back to full response, in-progress ones
  return 400 directing to non-streaming retrieval
- Document cross-replica cancel limitation near background_tasks lookup
@mergify

mergify bot commented Mar 18, 2026

Hi @will-deines, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@will-deines
Author

Re: the streaming retrieval issue from the Codex review — fixed in 9a34479.

retrieve_responses with stream=True now checks for a local event buffer before delegating to the stream generator:

  • Terminal state (completed/failed/cancelled) with no local event_store entry: falls back to returning the full response (the useful data is already there, just not as SSE).
  • In-progress with no local event_store entry: returns a 400 directing the client to use non-streaming retrieval (the events are being produced on another replica and we can't relay them).

Tests: test_stream_retrieve_completed_without_event_buffer_returns_response and test_stream_retrieve_in_progress_without_event_buffer_returns_400 in test_store.py.

…ampered carrier, cross-replica streaming)

- Fix double-include of assistant output on stateless non-Harmony path
  by only passing prev_response_output when prev_messages is None
- Catch ValueError from _extract_state_from_response before the
  try/except block so tampered HMAC returns 400 instead of 500
- Add event_store pre-check in retrieve_responses for streaming:
  terminal responses fall back to full response, in-progress ones
  return 400 directing to non-streaming retrieval
- Document cross-replica cancel limitation near background_tasks lookup

Signed-off-by: Will Deines <will@garr.io>
@will-deines will-deines force-pushed the feat/pluggable-response-store branch from 9a34479 to 5eb7202 on March 18, 2026 18:31
@mergify

mergify bot commented Mar 18, 2026

Hi @will-deines, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Will Deines <will@garr.io>
@mergify

mergify bot commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @will-deines.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 23, 2026
…onse-store

# Conflicts:
#	vllm/entrypoints/openai/responses/serving.py

Signed-off-by: Will Deines <will@garr.io>
Collaborator

@chaunceyjiang chaunceyjiang left a comment


Thanks~ @will-deines

This is a great contribution. However, we’re very sorry that the vLLM team is currently pausing the merging of PRs related to the stateful Responses API. The reason is that the related processing logic is being moved to a new repository, where it will be handled by a subproject.

For more detailed discussions and updates, you can join #feat-responses.

@mergify

mergify bot commented Mar 25, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @will-deines.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 25, 2026