fix(cache): eliminate TOCTOU race in InMemoryEntityCache.cacheNewEntry#4
Closed
Conversation
…value atomically within merge

Replace the separate getIfPresent() pre-read with an AtomicReference captured inside the merge callback. This closes the window where another thread could replace the byId entry between the read and the merge, causing the subsequent byName cleanup to operate on stale data.

Analogous fixes to vLLM MooncakeConnector PR #36014:
- Capture displaced entry atomically (mirrors "set expire_time before insertion")
- Remove the independent pre-read (mirrors "remove unnecessary defensive copy")
- Clarify local-map comment in getOrLoadResolvedEntities

Also add two targeted tests:
- testGetAndRefreshIfNeededCleansUpOldByNameOnRename: validates basic rename cleanup
- testConcurrentRenameDoesNotLeaveStaleByNameEntries: validates consistent state under concurrent refresh with interleaved thread timing

Co-authored-by: machov <43248948+machov@users.noreply.github.com>
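A rough sketch of the capture-inside-merge pattern the commit describes. This is illustrative only, not the actual Polaris code: the map types, the `Entry` record, and all method names here are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicReference;

class EntityCacheSketch {
    // Hypothetical immutable cache entry; stands in for the real entity type.
    record Entry(long id, String name, int version) {}

    final ConcurrentMap<Long, Entry> byId = new ConcurrentHashMap<>();
    final ConcurrentMap<String, Entry> byName = new ConcurrentHashMap<>();

    void cacheNewEntry(Entry newEntry) {
        // Capture whatever entry we displace *inside* the merge callback,
        // so the read and the replacement are one atomic step per key.
        // A separate getIfPresent()-style pre-read would leave a window
        // where another thread swaps the entry between read and merge.
        AtomicReference<Entry> displaced = new AtomicReference<>();
        byId.merge(newEntry.id(), newEntry, (oldEntry, incoming) -> {
            displaced.set(oldEntry);
            // Keep the newer version; merge runs atomically for this key.
            return incoming.version() >= oldEntry.version() ? incoming : oldEntry;
        });

        Entry old = displaced.get();
        if (old != null && !old.name().equals(newEntry.name())) {
            // Clean up the stale name mapping, but only if it still points
            // at the displaced entry (avoid clobbering a concurrent writer).
            byName.remove(old.name(), old);
        }
        byName.put(newEntry.name(), newEntry);
    }
}
```

Because `ConcurrentHashMap.merge` invokes the remapping function atomically for the key, the value stored into the `AtomicReference` is exactly the entry that was displaced, never a stale snapshot.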
Copilot AI changed the title
[WIP] Implement review-driven race-condition fixes for GPU memory → fix(cache): eliminate TOCTOU race in InMemoryEntityCache.cacheNewEntry
Mar 10, 2026
Co-authored-by: machov <43248948+machov@users.noreply.github.com>
cacheNewEntry() had a read-then-merge race: getIfPresent() captured the old entry before byId.merge(), so a concurrent update between those two calls left a stale value used for byName cleanup, producing orphaned or missing name→entity mappings. Partial fix for apache/polaris#761; see also upstream rewrite apache/polaris#1339.

InMemoryEntityCache.java
- AtomicReference set inside the merge() remapping function, making the read and the displacement atomic with respect to each other
- Removed the getIfPresent() pre-read

InMemoryEntityCacheTest.java
- testGetAndRefreshIfNeededCleansUpOldByNameOnRename: single-threaded; stale byName entry is cleaned up after a rename
- testConcurrentRenameDoesNotLeaveStaleByNameEntries: concurrent; two threads racing to refresh the same entity with different renamed versions; verifies no stale name entries survive and byId converges to the newest version

Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)

Original prompt
Create a new branch off main in machov/polaris and open a PR implementing the review-driven race-condition fixes described in vLLM MooncakeConnector PR vllm-project/vllm#36014:
- Ensure all mutations of reqs_to_recv (or the equivalent receive-tracking structure) happen on a single asyncio loop / owner context to avoid concurrent modification.
- Set expire_time before inserting pull metas into the tracking dict to avoid immediate timeout.

Apply analogous changes in this repository where applicable. Include clear commit messages and a PR description summarizing how the refactor addresses race conditions and prevents stranded GPU memory.
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: explain the concepts related to this PR. i got a lot of comments, make a plan to fix comments
Assistant: ## What this PR is about (concepts)
1) Mooncake KV transfer + “D-side” receive lifecycle
In vLLM’s distributed KV transfer, a consumer (“D-side”, decoder side) can request KV blocks from a remote engine (prefill side / producer side). Those requests are tracked as “requests to receive”, and completion is reported back so the scheduler can free/advance state.
When a request gets stuck in a “waiting for remote KVS” state (your title mentions
WAITING_FOR_REMOTE_KVS), the request never completes, and the system can end up holding onto GPU memory (HBM) associated with the pending receive state / blocks.2) HBM “leak” here is really “resources stranded because completion never arrives”
This isn’t necessarily a classic leak (lost pointer), but a liveness bug:
So your fix introduces timeouts / abort semantics: after
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT, force the request into a “finished (failed)” state so cleanup can happen.3) Async concurrency and race conditions (key review theme)
You’re in an
asyncio+ background task environment:asyncio.create_task(...)process_pulling_result(...)and mutate tracking structuresfetch_finished_recving_reqs()iterates those structuresIf you iterate a dict while another task mutates it, you can hit:
RuntimeError: dictionary changed size during iteration

A big conceptual goal in the comments: make all mutations happen on one "owner" loop/thread (your receiver loop), or otherwise guard iteration/mutation carefully.
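The chat discusses this hazard in Python, but the same iterate-while-mutate failure mode exists in Java, the language of the cache fix this PR actually makes. A minimal, hypothetical illustration of both the failure and the iterate-a-snapshot workaround:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

class IterateWhileMutate {
    // Mutating the map through the map itself (not the iterator) while
    // iterating trips HashMap's fail-fast check, the Java analogue of
    // Python's "dictionary changed size during iteration".
    static boolean mutateDuringIterationThrows(Map<String, Long> reqs) {
        try {
            for (Map.Entry<String, Long> e : reqs.entrySet()) {
                if (e.getValue() == 0L) {
                    reqs.remove(e.getKey()); // structural modification mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Workaround: iterate over a snapshot of the entries, mutate the real map.
    // (The reviewers' stronger fix is to avoid the race entirely by giving the
    // structure a single owner, so no snapshot is needed.)
    static void removeExpired(Map<String, Long> reqs) {
        for (Map.Entry<String, Long> e : new ArrayList<>(reqs.entrySet())) {
            if (e.getValue() == 0L) {
                reqs.remove(e.getKey());
            }
        }
    }
}
```

Note the snapshot copy is exactly the kind of defensive copying the "don't copy after moving to one loop" review comment says becomes unnecessary under single ownership.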
The specific review comments you got (all of them)
You currently have 5 PR review comments on file vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py:

1. Race while iterating nested dict: iterate over copies of items() to avoid "dict changed size during iteration" (bot comment).
items()to avoid “dict changed size during iteration” (bot comment).Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
2. Race: expire_time set after insertion: if fetch_finished_recving_reqs runs before expire_time is set, it may see expire_time=0.0 and time out immediately; set expire_time before adding to the dict (bot comment).
Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
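The underlying rule, fully initialize an object's expiry before it becomes reachable from a shared structure, applies in Java just as well. A hypothetical sketch (all names invented for illustration):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ExpiryBeforeInsert {
    // Immutable record: expireTimeMillis is fixed before the object can
    // ever be seen by another thread, so a concurrent reaper can never
    // observe an uninitialized (0 / "already expired") deadline.
    record PendingRecv(String reqId, long expireTimeMillis) {}

    final ConcurrentMap<String, PendingRecv> reqsToRecv = new ConcurrentHashMap<>();

    // Buggy ordering (for contrast): insert first, set expiry afterwards;
    // a reader between the two steps sees expireTime == 0 and times the
    // request out immediately.
    //
    // Safe ordering: compute the deadline first, then publish.
    void track(String reqId, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        reqsToRecv.put(reqId, new PendingRecv(reqId, deadline)); // fully initialized on publish
    }

    boolean isExpired(String reqId, long nowMillis) {
        PendingRecv p = reqsToRecv.get(reqId);
        return p != null && nowMillis > p.expireTimeMillis();
    }
}
```

Making the entry immutable turns the review comment's ordering rule into something the type system enforces: there is simply no way to publish the entry with the deadline unset.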
3. Architectural fix: move self.reqs_to_recv.update(...) inside _start_load_kv() so all operations on self.reqs_to_recv happen in the same self.receiver_loop, avoiding races (dtcccc).
Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
After moving, don’t copy: if you move updates/iteration to same loop, you shouldn’t need the defensive copying (dtcccc).
Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
5. Also handle finished_recving_reqs there: suggestion that the same area should handle finished receive requests too (dtcccc).
Plan to address comments (step-by-step, minimal rework + aligns with reviewers)
Step 0 — Decide the concurrency model (this is the core)
Adopt dtcccc’s direction: single owner =
receiver_loopfor:self.reqs_to_recvreqs_to_recv)If you do this, you can remove most “iterate over a copy” hacks because you’re no longer concurrently mutating from other loops.
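A Java analogue of this single-owner model: confine every access to the tracking map to one single-threaded executor, so cross-thread iteration/mutation races become impossible by construction and no defensive copies are needed. Names are illustrative, not from either codebase:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SingleOwnerTracker implements AutoCloseable {
    // Owned exclusively by `owner`; never touched from any other thread,
    // so a plain HashMap is safe and no snapshot copies are needed.
    private final Map<String, Long> reqsToRecv = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    // Mutations are submitted to the owner thread (the analogue of moving
    // reqs_to_recv.update(...) into the receiver loop).
    Future<?> track(String reqId, long deadlineMillis) {
        return owner.submit(() -> reqsToRecv.put(reqId, deadlineMillis));
    }

    // Iteration also runs on the owner thread; it can never interleave
    // with a mutation, because the executor runs tasks one at a time.
    Future<Integer> countExpired(long nowMillis) {
        return owner.submit(() -> {
            int n = 0;
            for (long deadline : reqsToRecv.values()) {
                if (nowMillis > deadline) n++;
            }
            return n;
        });
    }

    @Override
    public void close() {
        owner.shutdown();
    }
}
```

The trade-off mirrors the asyncio version: callers get a `Future` instead of an immediate answer, but the map needs no locks and no copies.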
Step 1 — Move tracking update into _start_load_kv()

Right now you do: start_load_kv() updates `self.reqs_to_recv...

This pull request was created from Copilot chat.