fix(cache): eliminate TOCTOU race in InMemoryEntityCache.cacheNewEntry#4
Closed
Conversation
…value atomically within merge

Replace the separate getIfPresent() pre-read with an AtomicReference captured inside the merge callback. This closes the window where another thread could replace the byId entry between the read and the merge, causing the subsequent byName cleanup to operate on stale data.

Analogous fixes to vLLM MooncakeConnector PR #36014:
- Capture displaced entry atomically (mirrors "set expire_time before insertion")
- Remove the independent pre-read (mirrors "remove unnecessary defensive copy")
- Clarify local-map comment in getOrLoadResolvedEntities

Also add two targeted tests:
- testGetAndRefreshIfNeededCleansUpOldByNameOnRename: validates basic rename cleanup
- testConcurrentRenameDoesNotLeaveStaleByNameEntries: validates consistent state under concurrent refresh with interleaved thread timing

Co-authored-by: machov <43248948+machov@users.noreply.github.com>
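A rough sketch of the capture-inside-merge pattern the commit describes. This is illustrative only, not the actual Polaris code: the map types, the `Entry` record, and all method names here are assumptions.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicReference;

class EntityCacheSketch {
    // Hypothetical immutable cache entry; stands in for the real entity type.
    record Entry(long id, String name, int version) {}

    final ConcurrentMap<Long, Entry> byId = new ConcurrentHashMap<>();
    final ConcurrentMap<String, Entry> byName = new ConcurrentHashMap<>();

    void cacheNewEntry(Entry newEntry) {
        // Capture whatever entry we displace *inside* the merge callback,
        // so the read and the replacement are one atomic step per key.
        // A separate getIfPresent()-style pre-read would leave a window
        // where another thread swaps the entry between read and merge.
        AtomicReference<Entry> displaced = new AtomicReference<>();
        byId.merge(newEntry.id(), newEntry, (oldEntry, incoming) -> {
            displaced.set(oldEntry);
            // Keep the newer version; merge runs atomically for this key.
            return incoming.version() >= oldEntry.version() ? incoming : oldEntry;
        });

        Entry old = displaced.get();
        if (old != null && !old.name().equals(newEntry.name())) {
            // Clean up the stale name mapping, but only if it still points
            // at the displaced entry (avoid clobbering a concurrent writer).
            byName.remove(old.name(), old);
        }
        byName.put(newEntry.name(), newEntry);
    }
}
```

Because `ConcurrentHashMap.merge` invokes the remapping function atomically for the key, the value stored into the `AtomicReference` is exactly the entry that was displaced, never a stale snapshot.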
Copilot AI changed the title
[WIP] Implement review-driven race-condition fixes for GPU memory → fix(cache): eliminate TOCTOU race in InMemoryEntityCache.cacheNewEntry
Mar 10, 2026
Co-authored-by: machov <43248948+machov@users.noreply.github.com>
cacheNewEntry() had a read-then-merge race: getIfPresent() captured the old entry before byId.merge(), so a concurrent update between those two calls left a stale value used for byName cleanup, producing orphaned or missing name→entity mappings. Partial fix for apache/polaris#761; see also upstream rewrite apache/polaris#1339.

InMemoryEntityCache.java
- AtomicReference set inside the merge() remapping function, making the read and the displacement atomic with respect to each other
- Removed the getIfPresent() pre-read

InMemoryEntityCacheTest.java
- testGetAndRefreshIfNeededCleansUpOldByNameOnRename: single-threaded; stale byName entry is cleaned up after a rename
- testConcurrentRenameDoesNotLeaveStaleByNameEntries: concurrent; two threads racing to refresh the same entity with different renamed versions; verifies no stale name entries survive and byId converges to the newest version

Checklist
- CHANGELOG.md (if needed)
- site/content/in-dev/unreleased (if needed)

Original prompt
Create a new branch off main in machov/polaris and open a PR implementing the review-driven race-condition fixes described in vLLM MooncakeConnector PR vllm-project/vllm#36014:
- Ensure all mutations of reqs_to_recv (or the equivalent receive-tracking structure) happen on a single asyncio loop / owner context to avoid concurrent modification.
- Set expire_time before inserting pull metas into the tracking dict to avoid immediate timeout.

Apply analogous changes in this repository where applicable. Include clear commit messages and a PR description summarizing how the refactor addresses race conditions and prevents stranded GPU memory.
The following is the prior conversation context from the user's chat exploration (may be truncated):
User: explain the concepts related to this PR. i got a lot of comments, make a plan to fix comments
Assistant: ## What this PR is about (concepts)
1) Mooncake KV transfer + “D-side” receive lifecycle
In vLLM’s distributed KV transfer, a consumer (“D-side”, decoder side) can request KV blocks from a remote engine (prefill side / producer side). Those requests are tracked as “requests to receive”, and completion is reported back so the scheduler can free/advance state.
When a request gets stuck in a “waiting for remote KVS” state (your title mentions
WAITING_FOR_REMOTE_KVS), the request never completes, and the system can end up holding onto GPU memory (HBM) associated with the pending receive state / blocks.2) HBM “leak” here is really “resources stranded because completion never arrives”
This isn’t necessarily a classic leak (lost pointer), but a liveness bug:
So your fix introduces timeouts / abort semantics: after
VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT, force the request into a “finished (failed)” state so cleanup can happen.3) Async concurrency and race conditions (key review theme)
You’re in an
asyncio+ background task environment:asyncio.create_task(...)process_pulling_result(...)and mutate tracking structuresfetch_finished_recving_reqs()iterates those structuresIf you iterate a dict while another task mutates it, you can hit:
RuntimeError: dictionary changed size during iteration

A big conceptual goal in the comments: make all mutations happen on one "owner" loop/thread (your receiver loop), or otherwise guard iteration/mutation carefully.
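The chat discusses this hazard in Python, but the same iterate-while-mutate failure mode exists in Java, the language of the cache fix this PR actually makes. A minimal, hypothetical illustration of both the failure and the iterate-a-snapshot workaround:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

class IterateWhileMutate {
    // Mutating the map through the map itself (not the iterator) while
    // iterating trips HashMap's fail-fast check, the Java analogue of
    // Python's "dictionary changed size during iteration".
    static boolean mutateDuringIterationThrows(Map<String, Long> reqs) {
        try {
            for (Map.Entry<String, Long> e : reqs.entrySet()) {
                if (e.getValue() == 0L) {
                    reqs.remove(e.getKey()); // structural modification mid-iteration
                }
            }
            return false;
        } catch (ConcurrentModificationException ex) {
            return true;
        }
    }

    // Workaround: iterate over a snapshot of the entries, mutate the real map.
    // (The reviewers' stronger fix is to avoid the race entirely by giving the
    // structure a single owner, so no snapshot is needed.)
    static void removeExpired(Map<String, Long> reqs) {
        for (Map.Entry<String, Long> e : new ArrayList<>(reqs.entrySet())) {
            if (e.getValue() == 0L) {
                reqs.remove(e.getKey());
            }
        }
    }
}
```

Note the snapshot copy is exactly the kind of defensive copying the "don't copy after moving to one loop" review comment says becomes unnecessary under single ownership.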
The specific review comments you got (all of them)
You currently have 5 PR review comments on file vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py:

1. Race while iterating nested dict: iterate over copies of items() to avoid "dict changed size during iteration" (bot comment).
items()to avoid “dict changed size during iteration” (bot comment).Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
2. Race: expire_time set after insertion: if fetch_finished_recving_reqs runs before expire_time is set, it may see expire_time=0.0 and time out immediately; set expire_time before adding to the dict (bot comment).
Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
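The underlying rule, fully initialize an object's expiry before it becomes reachable from a shared structure, applies in Java just as well. A hypothetical sketch (all names invented for illustration):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class ExpiryBeforeInsert {
    // Immutable record: expireTimeMillis is fixed before the object can
    // ever be seen by another thread, so a concurrent reaper can never
    // observe an uninitialized (0 / "already expired") deadline.
    record PendingRecv(String reqId, long expireTimeMillis) {}

    final ConcurrentMap<String, PendingRecv> reqsToRecv = new ConcurrentHashMap<>();

    // Buggy ordering (for contrast): insert first, set expiry afterwards;
    // a reader between the two steps sees expireTime == 0 and times the
    // request out immediately.
    //
    // Safe ordering: compute the deadline first, then publish.
    void track(String reqId, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        reqsToRecv.put(reqId, new PendingRecv(reqId, deadline)); // fully initialized on publish
    }

    boolean isExpired(String reqId, long nowMillis) {
        PendingRecv p = reqsToRecv.get(reqId);
        return p != null && nowMillis > p.expireTimeMillis();
    }
}
```

Making the entry immutable turns the review comment's ordering rule into something the type system enforces: there is simply no way to publish the entry with the deadline unset.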
3. Architectural fix: move self.reqs_to_recv.update(...) inside _start_load_kv() so all operations on self.reqs_to_recv happen in the same self.receiver_loop, avoiding races (dtcccc).
Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
After moving, don’t copy: if you move updates/iteration to same loop, you shouldn’t need the defensive copying (dtcccc).
Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)
5. Also handle finished_recving_reqs there: suggestion that the same area should handle finished receive requests too (dtcccc).
Plan to address comments (step-by-step, minimal rework + aligns with reviewers)
Step 0 — Decide the concurrency model (this is the core)
Adopt dtcccc’s direction: single owner =
receiver_loopfor:self.reqs_to_recvreqs_to_recv)If you do this, you can remove most “iterate over a copy” hacks because you’re no longer concurrently mutating from other loops.
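A Java analogue of this single-owner model: confine every access to the tracking map to one single-threaded executor, so cross-thread iteration/mutation races become impossible by construction and no defensive copies are needed. Names are illustrative, not from either codebase:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SingleOwnerTracker implements AutoCloseable {
    // Owned exclusively by `owner`; never touched from any other thread,
    // so a plain HashMap is safe and no snapshot copies are needed.
    private final Map<String, Long> reqsToRecv = new HashMap<>();
    private final ExecutorService owner = Executors.newSingleThreadExecutor();

    // Mutations are submitted to the owner thread (the analogue of moving
    // reqs_to_recv.update(...) into the receiver loop).
    Future<?> track(String reqId, long deadlineMillis) {
        return owner.submit(() -> reqsToRecv.put(reqId, deadlineMillis));
    }

    // Iteration also runs on the owner thread; it can never interleave
    // with a mutation, because the executor runs tasks one at a time.
    Future<Integer> countExpired(long nowMillis) {
        return owner.submit(() -> {
            int n = 0;
            for (long deadline : reqsToRecv.values()) {
                if (nowMillis > deadline) n++;
            }
            return n;
        });
    }

    @Override
    public void close() {
        owner.shutdown();
    }
}
```

The trade-off mirrors the asyncio version: callers get a `Future` instead of an immediate answer, but the map needs no locks and no copies.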
Step 1 — Move tracking update into _start_load_kv()

Right now you do: start_load_kv() updates `self.reqs_to_recv...

This pull request was created from Copilot chat.