
Fix memory leak in Responses API store#34738

Open
lisperz wants to merge 5 commits into vllm-project:main from lisperz:fix/responses-store-memory-leak

Conversation

@lisperz
Contributor

@lisperz lisperz commented Feb 17, 2026

Noticed the FIXME comments in serving.py about the response store growing unbounded. This fixes it by using OrderedDict with LRU eviction.

What changed:

  • Switched from dict to OrderedDict for response_store, msg_store, and event_store
  • Added VLLM_RESPONSES_API_STORE_MAX_SIZE env var (defaults to 10k entries)
  • Oldest entries get evicted when the limit is hit
  • Added DELETE /v1/responses/{response_id} endpoint
  • Recently accessed responses are moved to the end so they don't get evicted

Testing:
Added unit tests that cover the eviction logic. All existing tests still pass.

This should fix the memory leak warning that shows up when you enable the store.
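The eviction scheme described above can be sketched as a small standalone class. This is a simplified, hypothetical version for illustration only; the actual change lives in serving.py and also evicts the matching `msg_store` and `event_store` entries:

```python
from collections import OrderedDict

# Hypothetical, simplified sketch of the LRU eviction described in this PR;
# names like max_size mirror VLLM_RESPONSES_API_STORE_MAX_SIZE but the real
# implementation in serving.py differs in detail.
class LRUResponseStore:
    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self._store: OrderedDict[str, object] = OrderedDict()

    def put(self, response_id: str, response: object) -> None:
        self._store[response_id] = response
        self._store.move_to_end(response_id)  # newest entries live at the end
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the oldest entry

    def get(self, response_id: str):
        if response_id in self._store:
            self._store.move_to_end(response_id)  # refresh recency on access
            return self._store[response_id]
        return None

    def delete(self, response_id: str) -> bool:
        # Manual cleanup, analogous to DELETE /v1/responses/{response_id}
        return self._store.pop(response_id, None) is not None
```

Because `get` calls `move_to_end`, recently accessed responses sit at the tail of the `OrderedDict` and `popitem(last=False)` always removes the least recently used entry.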

Signed-off-by: lisperz <zhuchen200245@163.com>
@mergify

mergify bot commented Feb 17, 2026

Documentation preview: https://vllm--34738.org.readthedocs.build/en/34738/

@mergify mergify bot added the documentation (Improvements or additions to documentation), frontend, and v1 labels Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an LRU cache mechanism to fix a memory leak in the Responses API store, which is a great improvement. The implementation using OrderedDict is correct, and the addition of unit tests and a DELETE endpoint is thorough. However, I've found a critical issue where background tasks for evicted responses are not being cancelled. This could lead to the task re-inserting the response into the store upon completion, negating the effect of the eviction and re-introducing the memory growth issue. I've provided a suggestion to fix this by cancelling the task upon eviction.

Comment on lines +1291 to +1295
        while len(self.response_store) > self.store_max_size:
            evicted_id, _ = self.response_store.popitem(last=False)
            self.msg_store.pop(evicted_id, None)
            self.event_store.pop(evicted_id, None)
            logger.debug("Evicted response %s from store (LRU)", evicted_id)
Contributor


critical

When a response is evicted from the store, its associated background task is not being cancelled. This is a critical issue because the task will continue to run and, upon completion, may re-add the response to the store. This would bypass the store_max_size limit, effectively re-introducing the memory leak this PR aims to fix. The task also continues to consume resources unnecessarily.

To fix this, you should also remove the task from background_tasks and cancel it when its corresponding response is evicted.

        while len(self.response_store) > self.store_max_size:
            evicted_id, _ = self.response_store.popitem(last=False)
            self.msg_store.pop(evicted_id, None)
            self.event_store.pop(evicted_id, None)
            if task := self.background_tasks.pop(evicted_id, None):
                task.cancel()
            logger.debug("Evicted response %s from store (LRU)", evicted_id)
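The reviewer's concern can be demonstrated with a small self-contained asyncio sketch (hypothetical names mirroring the suggestion above, not vLLM's actual code): a background task that writes its result back into the store will re-insert an evicted response unless it is cancelled at eviction time.

```python
import asyncio
from collections import OrderedDict

# Minimal, hypothetical demonstration of why eviction must also cancel the
# background task: otherwise the task re-adds the response on completion.
async def main() -> None:
    response_store: OrderedDict[str, str] = OrderedDict()
    background_tasks: dict[str, asyncio.Task] = {}

    async def generate(response_id: str) -> None:
        await asyncio.sleep(0.05)              # simulate generation work
        response_store[response_id] = "done"   # write-back on completion

    background_tasks["resp_1"] = asyncio.create_task(generate("resp_1"))
    await asyncio.sleep(0)                     # let the task start running

    # Evict and cancel, as the review suggests:
    response_store.pop("resp_1", None)
    if task := background_tasks.pop("resp_1", None):
        task.cancel()

    await asyncio.sleep(0.1)                   # time the task would have needed
    assert "resp_1" not in response_store      # no write-back after cancel

asyncio.run(main())
```

Without the `task.cancel()` call, the sleep completes and the final assertion fails, which is exactly the leak-reintroducing path the review describes.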

@lisperz lisperz force-pushed the fix/responses-store-memory-leak branch from d5b5918 to 66d7d4b Compare February 17, 2026 19:40
When VLLM_ENABLE_RESPONSES_API_STORE is enabled, the response_store,
msg_store, and event_store dicts grow without bound. This was flagged
in FIXME comments but never addressed.

Changes:
- Use OrderedDict instead of dict for LRU eviction
- Add VLLM_RESPONSES_API_STORE_MAX_SIZE env var (default 10k)
- Evict oldest entries when limit is reached
- Add DELETE endpoint for manual cleanup
- Move accessed entries to end on retrieve

Tested with 7 unit tests covering eviction logic.

Signed-off-by: lisperz <zhuchen200245@163.com>
@lisperz lisperz force-pushed the fix/responses-store-memory-leak branch from 23b7ac6 to 78fe5a7 Compare February 17, 2026 19:49
@qandrew
Contributor

qandrew commented Feb 18, 2026

Hi, thanks for putting this together! I'm curious what the motivation for this PR is; is there a production use case that needs it?

Longer term we should consider offloading state management of the Responses API to a third-party database (see #26934); would you have thoughts on this?

@lisperz
Contributor Author

lisperz commented Feb 18, 2026

Hi, thanks for putting this together! I'm curious what the motivation for this PR is; is there a production use case that needs it?

Longer term we should consider offloading state management of the Responses API to a third-party database (see #26934); would you have thoughts on this?

Good question. The main use case I see is OpenAI Agents SDK compatibility. The Responses API relies on previous_response_id for multi-turn conversations and background mode for async agent tasks. Both require the store to function. As more users adopt the Agents SDK pattern with self-hosted models, this becomes a real need. That said, I agree with #26934 that in-memory storage is the wrong long-term answer — it doesn't survive restarts and doesn't scale across multiple API server instances. This PR was meant as a short-term fix to prevent OOM for users who have already enabled the store.

Contributor

@qandrew qandrew left a comment


@lisperz thanks for the comments! I don't think this PR as it currently stands helps much; I would recommend implementing offloadable storage along the lines of the design in #26934 if you have the bandwidth :)

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 3, 2026
Extract the three in-memory dicts (response_store, msg_store,
response_store_lock) from OpenAIServingResponses into a pluggable
ResponseStore ABC with an InMemoryResponseStore default.

Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified
class name to swap in their own backend (Redis, Postgres, etc.)
without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with
  internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934
(pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).
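The pluggable design the commit message describes can be sketched as an ABC plus a factory that reads the backend env var. This is a hypothetical outline only; the method names and signatures are illustrative, not vLLM's actual interface, though the env var name comes from the commit message above:

```python
import importlib
import os
from abc import ABC, abstractmethod
from typing import Any

# Hypothetical sketch of a pluggable response store; vLLM's real ABC has
# five abstract methods and different signatures.
class ResponseStore(ABC):
    @abstractmethod
    async def store_response(self, response_id: str, response: Any) -> None: ...

    @abstractmethod
    async def get_response(self, response_id: str) -> Any | None: ...

    @abstractmethod
    async def delete_response(self, response_id: str) -> bool: ...

    async def close(self) -> None:
        """Optional cleanup hook (e.g. closing a Redis connection)."""

class InMemoryResponseStore(ResponseStore):
    """Default backend wrapping a plain dict, as in the current behavior."""

    def __init__(self) -> None:
        self._responses: dict[str, Any] = {}

    async def store_response(self, response_id: str, response: Any) -> None:
        self._responses[response_id] = response

    async def get_response(self, response_id: str) -> Any | None:
        return self._responses.get(response_id)

    async def delete_response(self, response_id: str) -> bool:
        return self._responses.pop(response_id, None) is not None

def create_response_store() -> ResponseStore:
    # VLLM_RESPONSES_STORE_BACKEND holds a fully-qualified class name,
    # e.g. "my_pkg.stores.RedisResponseStore"; empty means in-memory.
    path = os.environ.get("VLLM_RESPONSES_STORE_BACKEND")
    if not path:
        return InMemoryResponseStore()
    module_name, _, class_name = path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()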