
Fix memory leak in Responses API store#34738

Open
lisperz wants to merge 5 commits into vllm-project:main from lisperz:fix/responses-store-memory-leak

Conversation

@lisperz
Contributor

@lisperz lisperz commented Feb 17, 2026

Noticed the FIXME comments in serving.py about the response store growing unbounded. This fixes it by using OrderedDict with LRU eviction.

What changed:

  • Switched from dict to OrderedDict for response_store, msg_store, and event_store
  • Added VLLM_RESPONSES_API_STORE_MAX_SIZE env var (defaults to 10k entries)
  • Oldest entries get evicted when the limit is hit
  • Added DELETE /v1/responses/{response_id} endpoint
  • Recently accessed responses are moved to the end so they don't get evicted

Testing:
Added unit tests that cover the eviction logic. All existing tests still pass.

This should fix the memory leak warning that shows up when you enable the store.
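The eviction scheme described above can be sketched as a small standalone class. This is a simplified, hypothetical version for illustration only; the actual change lives in serving.py and also evicts the matching `msg_store` and `event_store` entries:

```python
from collections import OrderedDict

# Hypothetical, simplified sketch of the LRU eviction described in this PR;
# names like max_size mirror VLLM_RESPONSES_API_STORE_MAX_SIZE but the real
# implementation in serving.py differs in detail.
class LRUResponseStore:
    def __init__(self, max_size: int = 10_000):
        self.max_size = max_size
        self._store: OrderedDict[str, object] = OrderedDict()

    def put(self, response_id: str, response: object) -> None:
        self._store[response_id] = response
        self._store.move_to_end(response_id)  # newest entries live at the end
        while len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the oldest entry

    def get(self, response_id: str):
        if response_id in self._store:
            self._store.move_to_end(response_id)  # refresh recency on access
            return self._store[response_id]
        return None

    def delete(self, response_id: str) -> bool:
        # Manual cleanup, analogous to DELETE /v1/responses/{response_id}
        return self._store.pop(response_id, None) is not None
```

Because `get` calls `move_to_end`, recently accessed responses sit at the tail of the `OrderedDict` and `popitem(last=False)` always removes the least recently used entry.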

Signed-off-by: lisperz <zhuchen200245@163.com>
@mergify

mergify bot commented Feb 17, 2026

Documentation preview: https://vllm--34738.org.readthedocs.build/en/34738/

@mergify mergify bot added the documentation (Improvements or additions to documentation), frontend, and v1 labels Feb 17, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an LRU cache mechanism to fix a memory leak in the Responses API store, which is a great improvement. The implementation using OrderedDict is correct, and the addition of unit tests and a DELETE endpoint is thorough. However, I've found a critical issue where background tasks for evicted responses are not being cancelled. This could lead to the task re-inserting the response into the store upon completion, negating the effect of the eviction and re-introducing the memory growth issue. I've provided a suggestion to fix this by cancelling the task upon eviction.

Comment on lines +1291 to +1295
        while len(self.response_store) > self.store_max_size:
            evicted_id, _ = self.response_store.popitem(last=False)
            self.msg_store.pop(evicted_id, None)
            self.event_store.pop(evicted_id, None)
            logger.debug("Evicted response %s from store (LRU)", evicted_id)
Contributor


critical

When a response is evicted from the store, its associated background task is not being cancelled. This is a critical issue because the task will continue to run and, upon completion, may re-add the response to the store. This would bypass the store_max_size limit, effectively re-introducing the memory leak this PR aims to fix. The task also continues to consume resources unnecessarily.

To fix this, you should also remove the task from background_tasks and cancel it when its corresponding response is evicted.

        while len(self.response_store) > self.store_max_size:
            evicted_id, _ = self.response_store.popitem(last=False)
            self.msg_store.pop(evicted_id, None)
            self.event_store.pop(evicted_id, None)
            if task := self.background_tasks.pop(evicted_id, None):
                task.cancel()
            logger.debug("Evicted response %s from store (LRU)", evicted_id)
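The reviewer's concern can be demonstrated with a small self-contained asyncio sketch (hypothetical names mirroring the suggestion above, not vLLM's actual code): a background task that writes its result back into the store will re-insert an evicted response unless it is cancelled at eviction time.

```python
import asyncio
from collections import OrderedDict

# Minimal, hypothetical demonstration of why eviction must also cancel the
# background task: otherwise the task re-adds the response on completion.
async def main() -> None:
    response_store: OrderedDict[str, str] = OrderedDict()
    background_tasks: dict[str, asyncio.Task] = {}

    async def generate(response_id: str) -> None:
        await asyncio.sleep(0.05)              # simulate generation work
        response_store[response_id] = "done"   # write-back on completion

    background_tasks["resp_1"] = asyncio.create_task(generate("resp_1"))
    await asyncio.sleep(0)                     # let the task start running

    # Evict and cancel, as the review suggests:
    response_store.pop("resp_1", None)
    if task := background_tasks.pop("resp_1", None):
        task.cancel()

    await asyncio.sleep(0.1)                   # time the task would have needed
    assert "resp_1" not in response_store      # no write-back after cancel

asyncio.run(main())
```

Without the `task.cancel()` call, the sleep completes and the final assertion fails, which is exactly the leak-reintroducing path the review describes.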

@lisperz lisperz force-pushed the fix/responses-store-memory-leak branch from d5b5918 to 66d7d4b Compare February 17, 2026 19:40
When VLLM_ENABLE_RESPONSES_API_STORE is enabled, the response_store,
msg_store, and event_store dicts grow without bound. This was flagged
in FIXME comments but never addressed.

Changes:
- Use OrderedDict instead of dict for LRU eviction
- Add VLLM_RESPONSES_API_STORE_MAX_SIZE env var (default 10k)
- Evict oldest entries when limit is reached
- Add DELETE endpoint for manual cleanup
- Move accessed entries to end on retrieve

Tested with 7 unit tests covering eviction logic.

Signed-off-by: lisperz <zhuchen200245@163.com>
@lisperz lisperz force-pushed the fix/responses-store-memory-leak branch from 23b7ac6 to 78fe5a7 Compare February 17, 2026 19:49
@qandrew
Contributor

qandrew commented Feb 18, 2026

Hi, thanks for putting this together! I'm curious what the motivation for this PR is; is there a production use case that needs it?

Longer term we should consider offloading state management of the Responses API to a third-party database (see #26934); would you have thoughts on this?

@lisperz
Contributor Author

lisperz commented Feb 18, 2026

Hi, thanks for putting this together! I'm curious what the motivation for this PR is; is there a production use case that needs it?

Longer term we should consider offloading state management of the Responses API to a third-party database (see #26934); would you have thoughts on this?

Good question. The main use case I see is OpenAI Agents SDK compatibility. The Responses API relies on previous_response_id for multi-turn conversations and background mode for async agent tasks. Both require the store to function. As more users adopt the Agents SDK pattern with self-hosted models, this becomes a real need. That said, I agree with #26934 that in-memory storage is the wrong long-term answer — it doesn't survive restarts and doesn't scale across multiple API server instances. This PR was meant as a short-term fix to prevent OOM for users who have already enabled the store.

Contributor

@qandrew qandrew left a comment


@lisperz thanks for the comments! I don't think this PR as it currently stands helps much; I would recommend implementing offloadable storage along the lines of the design in #26934 if you have the bandwidth :)

will-deines pushed a commit to will-deines/vllm that referenced this pull request Mar 3, 2026
Extract the three in-memory dicts (response_store, msg_store,
response_store_lock) from OpenAIServingResponses into a pluggable
ResponseStore ABC with an InMemoryResponseStore default.

Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified
class name to swap in their own backend (Redis, Postgres, etc.)
without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with
  internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934
(pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).
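The pluggable design the commit message describes can be sketched as an ABC plus a factory that reads the backend env var. This is a hypothetical outline only; the method names and signatures are illustrative, not vLLM's actual interface, though the env var name comes from the commit message above:

```python
import importlib
import os
from abc import ABC, abstractmethod
from typing import Any

# Hypothetical sketch of a pluggable response store; vLLM's real ABC has
# five abstract methods and different signatures.
class ResponseStore(ABC):
    @abstractmethod
    async def store_response(self, response_id: str, response: Any) -> None: ...

    @abstractmethod
    async def get_response(self, response_id: str) -> Any | None: ...

    @abstractmethod
    async def delete_response(self, response_id: str) -> bool: ...

    async def close(self) -> None:
        """Optional cleanup hook (e.g. closing a Redis connection)."""

class InMemoryResponseStore(ResponseStore):
    """Default backend wrapping a plain dict, as in the current behavior."""

    def __init__(self) -> None:
        self._responses: dict[str, Any] = {}

    async def store_response(self, response_id: str, response: Any) -> None:
        self._responses[response_id] = response

    async def get_response(self, response_id: str) -> Any | None:
        return self._responses.get(response_id)

    async def delete_response(self, response_id: str) -> bool:
        return self._responses.pop(response_id, None) is not None

def create_response_store() -> ResponseStore:
    # VLLM_RESPONSES_STORE_BACKEND holds a fully-qualified class name,
    # e.g. "my_pkg.stores.RedisResponseStore"; empty means in-memory.
    path = os.environ.get("VLLM_RESPONSES_STORE_BACKEND")
    if not path:
        return InMemoryResponseStore()
    module_name, _, class_name = path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()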