
fix(cache): eliminate TOCTOU race in InMemoryEntityCache.cacheNewEntry#4

Closed
Copilot wants to merge 3 commits into main from copilot/implement-race-condition-fixes

Conversation


Copilot AI commented Mar 10, 2026

cacheNewEntry() had a read-then-merge race: getIfPresent() captured the old entry before byId.merge(), so a concurrent update between those two calls left a stale value used for byName cleanup — producing orphaned or missing name→entity mappings. Partial fix for apache/polaris#761; see also upstream rewrite apache/polaris#1339.

InMemoryEntityCache.java

  • Capture the displaced entry via AtomicReference set inside the merge() remapping function, making the read and the displacement atomic with respect to each other
  • Remove the now-redundant getIfPresent() pre-read
// Before — TOCTOU window
ResolvedPolarisEntity oldCacheEntry = this.byId.getIfPresent(id); // stale after concurrent merge
this.byId.asMap().merge(id, cacheEntry, (old, nw) -> isNewer(nw, old) ? nw : old);
this.byName.remove(oldNameKey, oldCacheEntry); // operates on stale data

// After — atomic capture
AtomicReference<ResolvedPolarisEntity> capturedOld = new AtomicReference<>();
this.byId.asMap().merge(id, cacheEntry, (old, nw) -> {
    capturedOld.set(old);          // captured under the map's internal lock
    return isNewer(nw, old) ? nw : old;
});
this.byName.remove(oldNameKey, capturedOld.get()); // consistent with merge outcome

InMemoryEntityCacheTest.java

  • testGetAndRefreshIfNeededCleansUpOldByNameOnRename — single-threaded: stale byName entry is cleaned up after a rename
  • testConcurrentRenameDoesNotLeaveStaleByNameEntries — concurrent: two threads racing to refresh the same entity with different renamed versions; verifies no stale name entries survive and byId converges to the newest version
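The actual tests are Java, but the atomic-capture idea and the concurrent-rename scenario they exercise can be sketched in a small language-agnostic analog. In this hypothetical Python model (names like `ToyCache`, `cache_new_entry` are illustrative, not from the Polaris codebase), a single lock plays the role of the map's internal lock inside `merge()`: the old entry is read and displaced under that lock, so the `by_name` cleanup always sees the entry that was actually replaced.

```python
# Hypothetical analog of the fix; not the actual Polaris implementation.
import threading

class ToyCache:
    def __init__(self):
        self._lock = threading.Lock()   # stands in for the map's internal lock
        self.by_id = {}                 # id -> (version, name)
        self.by_name = {}               # name -> id

    def cache_new_entry(self, entity_id, version, name):
        with self._lock:                # read + displacement are atomic here
            old = self.by_id.get(entity_id)
            if old is None or version > old[0]:
                self.by_id[entity_id] = (version, name)
            winner = self.by_id[entity_id]
            # clean up the displaced name mapping, consistent with the outcome
            if old is not None and old[1] != winner[1]:
                if self.by_name.get(old[1]) == entity_id:
                    del self.by_name[old[1]]
            self.by_name[winner[1]] = entity_id

cache = ToyCache()
cache.cache_new_entry(1, 1, "orig")

# Two threads race to refresh the same entity with different renamed versions.
threads = [
    threading.Thread(target=cache.cache_new_entry, args=(1, v, f"renamed-v{v}"))
    for v in (2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Regardless of interleaving: byId converges to the newest version and no
# stale name entry survives.
assert cache.by_id[1] == (3, "renamed-v3")
assert cache.by_name == {"renamed-v3": 1}
```

With the pre-lock `getIfPresent()`-style read (reading `old` outside the lock), the losing thread could delete the winner's fresh name mapping, which is exactly the orphaned/missing-mapping symptom the PR fixes.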

Checklist

  • 🛡️ Don't disclose security issues! (contact security@apache.org)
  • 🔗 Clearly explained why the changes are needed, or linked related issues
  • 🧪 Added/updated tests with good coverage, or manually tested (and explained how)
  • 💡 Added comments for complex logic
  • 🧾 Updated CHANGELOG.md (if needed)
  • 📚 Updated documentation in site/content/in-dev/unreleased (if needed)
Original prompt

Create a new branch off main in machov/polaris and open a PR implementing the review-driven race-condition fixes described in vLLM MooncakeConnector PR vllm-project/vllm#36014:

  • Ensure all mutations/iteration of reqs_to_recv (or equivalent receive-tracking structure) happen on a single asyncio loop / owner context to avoid concurrent modification.
  • Set expire_time before inserting pull metas into tracking dict to avoid immediate timeout.
  • Remove unnecessary defensive copying once ownership is enforced.
  • Centralize handling of finished/failed receive requests alongside timeout cleanup.
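The second bullet, setting `expire_time` before insertion, can be sketched as follows. This is a hedged illustration: `ReqMeta`, `reqs_to_recv`, and `TIMEOUT_S` are stand-in names, not vLLM's actual types.

```python
# Illustrative sketch of the "set expire_time before insertion" fix.
import time
from dataclasses import dataclass

TIMEOUT_S = 30.0  # stand-in for the abort timeout

@dataclass
class ReqMeta:
    expire_time: float = 0.0

reqs_to_recv: dict[str, ReqMeta] = {}

# Buggy order: if a timeout sweep runs between the insert and the assignment,
# it sees expire_time == 0.0 and expires the request immediately.
#   reqs_to_recv[req_id] = meta
#   meta.expire_time = time.monotonic() + TIMEOUT_S

# Fixed order: the meta is fully initialized before it becomes visible.
def track(req_id: str) -> ReqMeta:
    meta = ReqMeta(expire_time=time.monotonic() + TIMEOUT_S)
    reqs_to_recv[req_id] = meta
    return meta

meta = track("req-1")
assert meta.expire_time > time.monotonic()
```

The same publish-fully-initialized-state discipline is what the Java fix applies to the cache entry: nothing becomes observable to other threads until it is consistent.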

Apply analogous changes in this repository where applicable. Include clear commit messages and a PR description summarizing how the refactor addresses race conditions and prevents stranded GPU memory.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: explain the concepts related to this PR. i got a lot of comments, make a plan to fix comments

Assistant: ## What this PR is about (concepts)

1) Mooncake KV transfer + “D-side” receive lifecycle

In vLLM’s distributed KV transfer, a consumer (“D-side”, decoder side) can request KV blocks from a remote engine (prefill side / producer side). Those requests are tracked as “requests to receive”, and completion is reported back so the scheduler can free/advance state.

When a request gets stuck in a “waiting for remote KVS” state (your title mentions WAITING_FOR_REMOTE_KVS), the request never completes, and the system can end up holding onto GPU memory (HBM) associated with the pending receive state / blocks.

2) HBM “leak” here is really “resources stranded because completion never arrives”

This isn’t necessarily a classic leak (lost pointer), but a liveness bug:

  • request is queued/tracked
  • remote never responds (or the transfer task fails)
  • the request is never marked “finished receiving”
  • scheduler never gets the “done/failed” signal to clean up
  • GPU memory stays reserved/stranded

So your fix introduces timeouts / abort semantics: after VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT, force the request into a “finished (failed)” state so cleanup can happen.
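The timeout/abort semantics can be sketched like this; the names (`ABORT_TIMEOUT_S`, `finished_failed`, `sweep_expired`) are illustrative, not vLLM's actual API.

```python
# Hedged sketch of timeout-based cleanup for stuck receives.
import time

ABORT_TIMEOUT_S = 30.0
reqs_to_recv = {}        # req_id -> expire deadline (monotonic seconds)
finished_failed = set()  # reported back so the scheduler can free blocks

def track(req_id):
    # deadline is set at insertion time (see the ordering fix above)
    reqs_to_recv[req_id] = time.monotonic() + ABORT_TIMEOUT_S

def sweep_expired(now=None):
    now = time.monotonic() if now is None else now
    for req_id, deadline in list(reqs_to_recv.items()):  # copy: we mutate below
        if now >= deadline:
            del reqs_to_recv[req_id]
            finished_failed.add(req_id)  # forces "finished (failed)" cleanup

track("stuck")
sweep_expired(now=time.monotonic() + ABORT_TIMEOUT_S + 1)
assert "stuck" in finished_failed and not reqs_to_recv
```

The key point is liveness: even if the remote never answers, every tracked request eventually reaches a terminal state, so the scheduler always gets a signal to release the reserved GPU memory.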

3) Async concurrency and race conditions (key review theme)

You’re in an asyncio + background task environment:

  • you start receive tasks with asyncio.create_task(...)
  • those tasks can call process_pulling_result(...) and mutate tracking structures
  • meanwhile fetch_finished_recving_reqs() iterates those structures

If you iterate a dict while another task mutates it, you can hit:

  • RuntimeError: dictionary changed size during iteration
  • or more subtle logic races (e.g., timing out a request before it’s fully initialized)

A big conceptual goal in the comments: make all mutations happen on one “owner” loop/thread (your receiver loop), or otherwise guard iteration/mutation carefully.
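The "dict changed size during iteration" failure mode is easy to reproduce. This standalone snippet triggers it deterministically in a single thread; in the connector, the mutation would come from a concurrent task instead:

```python
# Minimal reproduction of the iteration race described above.
d = {"a": 1, "b": 2}
msg = ""
try:
    for k in d:
        d["c"] = 3  # mutation while iterating the same dict
except RuntimeError as e:
    msg = str(e)    # "dictionary changed size during iteration"
assert "changed size" in msg
```

Iterating over a snapshot (`list(d.items())`) suppresses the exception, but it only papers over logic races; the single-owner model removes the concurrent mutation itself.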


The specific review comments you got (all of them)

You currently have 5 PR review comments on file vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py:

  1. Race while iterating nested dict: iterate over copies of items() to avoid “dict changed size during iteration” (bot comment).
    Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)

  2. Race: expire_time set after insertion: if fetch_finished_recving_reqs runs before expire_time is set, it may see expire_time=0.0 and time out immediately; set expire_time before adding to the dict (bot comment).
    Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)

  3. Architectural fix: move self.reqs_to_recv.update(...) inside _start_load_kv() so all operations on self.reqs_to_recv happen in the same self.receiver_loop, avoiding races (dtcccc).
    Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)

  4. After moving, don’t copy: if you move updates/iteration to same loop, you shouldn’t need the defensive copying (dtcccc).
    Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)

  5. Also handle finished_recving_reqs there: suggestion that the same area should handle finished receive requests too (dtcccc).
    Ref: fix(mooncake): resolve HBM leak from stuck WAITING_FOR_REMOTE_KVS requests vllm-project/vllm#36014 (comment)


Plan to address comments (step-by-step, minimal rework + aligns with reviewers)

Step 0 — Decide the concurrency model (this is the core)

Adopt dtcccc’s direction: single owner = receiver_loop for:

  • mutating self.reqs_to_recv
  • expiring/removing entries
  • marking finished receive requests (or at least the part that depends on reqs_to_recv)

If you do this, you can remove most “iterate over a copy” hacks because you’re no longer concurrently mutating from other loops.
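The single-owner model can be sketched with `asyncio.run_coroutine_threadsafe`, which marshals work onto the owner loop. This is a hedged illustration; `start_load_kv`, `submit_from_other_thread`, and the dict shape are stand-ins for the connector's actual structures.

```python
# Sketch of the single-owner pattern: all mutations of reqs_to_recv happen
# on one asyncio loop, so iteration there needs no defensive copies.
import asyncio

reqs_to_recv = {}  # owned exclusively by the receiver loop

async def start_load_kv(req_id):
    # runs on the owner loop, so plain dict mutation is safe
    reqs_to_recv[req_id] = "pulling"

def submit_from_other_thread(loop, req_id):
    # other threads never touch the dict; they hand work to the owner loop
    fut = asyncio.run_coroutine_threadsafe(start_load_kv(req_id), loop)
    fut.result(timeout=5)  # block this worker thread until the owner ran it

async def main():
    loop = asyncio.get_running_loop()
    # simulate a foreign thread submitting a receive request
    await asyncio.to_thread(submit_from_other_thread, loop, "req-1")
    return dict(reqs_to_recv)

result = asyncio.run(main())
assert result == {"req-1": "pulling"}
```

Once every mutation is funneled through the owner loop like this, the "iterate over a copy" workarounds and defensive copies become dead weight and can be deleted, which is exactly dtcccc's comments 3 and 4.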

Step 1 — Move tracking update into _start_load_kv()

Right now you do:

  • start_load_kv() updates `self.reqs_to_recv...

This pull request was created from Copilot chat.



…alue atomically within merge

Replace the separate getIfPresent() pre-read with an AtomicReference captured
inside the merge callback. This closes the window where another thread could
replace the byId entry between the read and the merge, causing the subsequent
byName cleanup to operate on stale data.

Analogous fixes to vLLM MooncakeConnector PR #36014:
- Capture displaced entry atomically (mirrors "set expire_time before insertion")
- Remove the independent pre-read (mirrors "remove unnecessary defensive copy")
- Clarify local-map comment in getOrLoadResolvedEntities

Also add two targeted tests:
- testGetAndRefreshIfNeededCleansUpOldByNameOnRename: validates basic rename cleanup
- testConcurrentRenameDoesNotLeaveStaleByNameEntries: validates consistent state
  under concurrent refresh with interleaved thread timing

Co-authored-by: machov <43248948+machov@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement review-driven race-condition fixes for GPU memory" to "fix(cache): eliminate TOCTOU race in InMemoryEntityCache.cacheNewEntry" on Mar 10, 2026
@machov machov closed this Mar 10, 2026