Fix memory leak in Responses API store #34738
lisperz wants to merge 5 commits into vllm-project:main from
Conversation
Signed-off-by: lisperz <zhuchen200245@163.com>
Documentation preview: https://vllm--34738.org.readthedocs.build/en/34738/
Code Review
This pull request introduces an LRU cache mechanism to fix a memory leak in the Responses API store, which is a great improvement. The implementation using OrderedDict is correct, and the addition of unit tests and a DELETE endpoint is thorough. However, I've found a critical issue where background tasks for evicted responses are not being cancelled. This could lead to the task re-inserting the response into the store upon completion, negating the effect of the eviction and re-introducing the memory growth issue. I've provided a suggestion to fix this by cancelling the task upon eviction.
```python
while len(self.response_store) > self.store_max_size:
    evicted_id, _ = self.response_store.popitem(last=False)
    self.msg_store.pop(evicted_id, None)
    self.event_store.pop(evicted_id, None)
    logger.debug("Evicted response %s from store (LRU)", evicted_id)
```
When a response is evicted from the store, its associated background task is not being cancelled. This is a critical issue because the task will continue to run and, upon completion, may re-add the response to the store. This would bypass the store_max_size limit, effectively re-introducing the memory leak this PR aims to fix. The task also continues to consume resources unnecessarily.
To fix this, you should also remove the task from background_tasks and cancel it when its corresponding response is evicted.
```python
while len(self.response_store) > self.store_max_size:
    evicted_id, _ = self.response_store.popitem(last=False)
    self.msg_store.pop(evicted_id, None)
    self.event_store.pop(evicted_id, None)
    if task := self.background_tasks.pop(evicted_id, None):
        task.cancel()
    logger.debug("Evicted response %s from store (LRU)", evicted_id)
```
When VLLM_ENABLE_RESPONSES_API_STORE is enabled, the response_store, msg_store, and event_store dicts grow without bound. This was flagged in FIXME comments but never addressed.

Changes:
- Use OrderedDict instead of dict for LRU eviction
- Add VLLM_RESPONSES_API_STORE_MAX_SIZE env var (default 10k)
- Evict oldest entries when limit is reached
- Add DELETE endpoint for manual cleanup
- Move accessed entries to end on retrieve

Tested with 7 unit tests covering eviction logic.

Signed-off-by: lisperz <zhuchen200245@163.com>
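The "move accessed entries to end on retrieve" behavior is what makes the eviction order LRU rather than FIFO. A minimal illustration with a bare `OrderedDict` (the `retrieve` helper here is hypothetical, not vLLM's actual method):

```python
from collections import OrderedDict

store = OrderedDict()
for rid in ("resp_a", "resp_b", "resp_c"):
    store[rid] = {"id": rid}


def retrieve(store: OrderedDict, response_id: str):
    """Return a stored response and mark it as most recently used."""
    resp = store.get(response_id)
    if resp is not None:
        # Refresh recency: popitem(last=False) now skips this entry.
        store.move_to_end(response_id)
    return resp


retrieve(store, "resp_a")
# After the retrieve, "resp_a" is newest; "resp_b" is evicted first.
```

Without the `move_to_end` call, a hot response that was inserted early would be evicted even while clients are actively reading it.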
Hi, thanks for putting this together! Curious what the motivation for this PR is: is there a production use case that needs it? Longer term, we should consider offloading state management of the Responses API to a third-party database (see #26934). Would you have thoughts on this?
Good question. The main use case I see is OpenAI Agents SDK compatibility. The Responses API relies on
Extract the three in-memory dicts (response_store, msg_store, response_store_lock) from OpenAIServingResponses into a pluggable ResponseStore ABC with an InMemoryResponseStore default. Users can point VLLM_RESPONSES_STORE_BACKEND to a fully-qualified class name to swap in their own backend (Redis, Postgres, etc.) without patching vLLM.

- Add ResponseStore ABC with 5 abstract methods + close() hook
- Add InMemoryResponseStore wrapping current dict behavior with internal asyncio.Lock (removes external response_store_lock)
- Add create_response_store() factory reading env var
- Refactor ~15 call sites in serving.py to use self.store.*
- Add VLLM_RESPONSES_STORE_BACKEND env var to envs.py
- Update test helper to use InMemoryResponseStore
- Add unit + integration tests for store and serving interactions

Follows up on vllm-project#35740 (stateless multi-turn). Addresses RFC vllm-project#26934 (pluggable state backends) and supersedes vllm-project#34738 (LRU eviction).

Signed-off-by: Will Deines <will@garr.io>
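The pluggable-backend design described above can be sketched as follows. This is a hedged illustration, not the PR's actual code: the abstract method names, the response shape, and the two-method ABC (the real one has five methods) are assumptions; only the `VLLM_RESPONSES_STORE_BACKEND` env var name and the `create_response_store()` / `InMemoryResponseStore` names come from the commit message.

```python
import importlib
import os
from abc import ABC, abstractmethod


class ResponseStore(ABC):
    """Abstract storage backend for Responses API state."""

    @abstractmethod
    async def store_response(self, response: dict) -> None: ...

    @abstractmethod
    async def get_response(self, response_id: str) -> dict | None: ...

    async def close(self) -> None:
        """Optional cleanup hook (e.g. close DB connections)."""


class InMemoryResponseStore(ResponseStore):
    """Default backend: keeps responses in a plain dict."""

    def __init__(self) -> None:
        self._responses: dict[str, dict] = {}

    async def store_response(self, response: dict) -> None:
        self._responses[response["id"]] = response

    async def get_response(self, response_id: str) -> dict | None:
        return self._responses.get(response_id)


def create_response_store() -> ResponseStore:
    """Instantiate the backend named by VLLM_RESPONSES_STORE_BACKEND,
    e.g. "mypkg.stores.RedisResponseStore"; fall back to in-memory."""
    path = os.environ.get("VLLM_RESPONSES_STORE_BACKEND")
    if not path:
        return InMemoryResponseStore()
    module_name, _, class_name = path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls()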
Noticed the FIXME comments in serving.py about the response store growing unbounded. This fixes it by using OrderedDict with LRU eviction.
What changed:
- Switched response_store to an OrderedDict and evict the oldest entries once VLLM_RESPONSES_API_STORE_MAX_SIZE (default 10k) is exceeded
- Accessed entries are moved to the end on retrieval, so eviction is LRU
- Added a DELETE endpoint for manual cleanup
Testing:
Added unit tests that cover the eviction logic. All existing tests still pass.
This should fix the memory leak warning that shows up when you enable the store.
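The eviction property the unit tests cover can be sketched as a pair of pytest-style checks over a bare `OrderedDict` (the helper and test names here are hypothetical, not the PR's actual test suite):

```python
from collections import OrderedDict


def evict_oldest(store: OrderedDict, max_size: int) -> list[str]:
    """Pop entries from the front (oldest) until the store fits max_size."""
    evicted = []
    while len(store) > max_size:
        rid, _ = store.popitem(last=False)
        evicted.append(rid)
    return evicted


def test_evicts_oldest_first():
    store = OrderedDict((f"resp_{i}", {"id": f"resp_{i}"}) for i in range(5))
    assert evict_oldest(store, 3) == ["resp_0", "resp_1"]
    assert list(store) == ["resp_2", "resp_3", "resp_4"]


def test_retrieve_refreshes_recency():
    store = OrderedDict((f"resp_{i}", {"id": f"resp_{i}"}) for i in range(3))
    store.move_to_end("resp_0")  # simulate a retrieve of resp_0
    assert evict_oldest(store, 2) == ["resp_1"]
```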