fix(memory): resolve UMA concurrent profiling race condition and page cache leak #35929
EmilHaase wants to merge 1 commit into vllm-project:main
Hi @EmilHaase, the pre-commit checks have failed. Please run:
    uv pip install pre-commit
    pre-commit install
    pre-commit run --all-files
Then, commit the changes and push to your branch.
Code Review
This pull request introduces two important fixes for memory management on Unified Memory Architecture (UMA) systems. The first fix prevents a race condition during concurrent memory profiling by correctly handling non-torch memory increases on UMA hardware. The second fix addresses a page cache leak after loading .safetensors files, which previously led to inaccurate KV cache budget calculations. My review focuses on ensuring the robustness of these fixes. I've identified a potential resource leak in the page cache flushing logic and suggested a more robust implementation to prevent file descriptor leaks.
10bbb50 to da70316
…cache leak Signed-off-by: Emil Haase <emil@tomst.com>
da70316 to c09098a
xref #35356
I think this is still needed.
Purpose
This PR resolves a critical two-part memory starvation and profiling race condition that causes AssertionError: Error in memory profiling or negative KV cache allocations during concurrent vLLM deployments on Unified Memory Architecture (UMA) hardware (e.g., DGX Spark).
1. Guarding Non-Torch Memory Profiling (Fix A)
non_torch_increase = diff_from_create.non_torch_memory derives overhead by observing the drop in global memory. Because UMA shares system and GPU RAM, multiple vLLM engines launching concurrently "steal" global memory drops from each other: Engine A's memory initialization registers as Engine B's overhead.
The fix sets result.non_torch_increase = 0 for known UMA architectures (detected via compute capabilities (8, 7), (11, 0), and (12, 1)). This gates the subtraction and prevents an engine from counting other engines' simultaneous initializations as its own local overhead.
2. OS Page Cache Flushes for Weight Loading (Fix B)
MemorySnapshot relies on psutil.virtual_memory().available. During model instantiation, safetensors loading populates the Linux page cache, which lowers the globally available system RAM. vLLM wrongly attributes the missing system RAM to internal CUDA allocations, permanently shrinking the KV cache budget.
The fix evicts .safetensors files from the page cache immediately after iteration via os.posix_fadvise(..., os.POSIX_FADV_DONTNEED). This pushes the weights out of system RAM and reclaims the free memory before the baseline snapshots are taken.
Test Plan
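The eviction described in Fix B can be exercised in isolation, which is a convenient first check before any concurrent launch. The following is a minimal sketch, not the code in this PR: the helper name drop_file_from_page_cache is hypothetical, and it assumes a platform where os.posix_fadvise exists (Linux and most Unix systems). The advice is a hint to the kernel, and dirty pages are not dropped, which is acceptable for weight files that are only read.

```python
import os


def drop_file_from_page_cache(path: str) -> bool:
    """Hypothetical helper: ask the kernel to evict `path` from the page cache.

    Returns True if the advice was issued, False if the platform does not
    provide os.posix_fadvise (for example macOS or Windows).
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0 and length=0 mean "the whole file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        # Always close the descriptor, even if the advice call raises,
        # so that repeated calls cannot leak file descriptors.
        os.close(fd)
    return True
```

Calling this helper on each weight file after it has been read, and comparing psutil.virtual_memory().available before and after, gives a rough indication of how much page cache the weights were holding.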
Qwen/Qwen3.5-2B
--gpu-memory-utilization flags on an idle UMA system to intentionally trigger the concurrent memory snapshot race condition.
Test Result
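The before/after behaviour reported in this section can be reproduced on paper with a toy model of the profiling arithmetic. This is a deliberately simplified sketch, not vLLM's actual accounting: the function name, its parameters, and the GiB figures are all illustrative assumptions. It only shows how inferring non-torch overhead from a global memory drop lets one engine's allocations shrink another engine's KV cache budget, and how zeroing that term (Fix A) removes the effect.

```python
def kv_cache_budget_gib(requested_gib, own_torch_gib,
                        observed_global_drop_gib, uma_guard):
    """Toy model (illustrative only) of the KV cache budget calculation.

    Without the guard, any part of the global memory drop that is not
    explained by this engine's own torch allocations is treated as this
    engine's non-torch overhead.
    """
    if uma_guard:
        non_torch = 0  # Fix A: ignore the global drop on UMA hardware
    else:
        non_torch = observed_global_drop_gib - own_torch_gib
    return requested_gib - own_torch_gib - non_torch


# Illustrative numbers: engine B allocates 4 GiB itself while engine A
# allocates 30 GiB at the same time, so B observes a 34 GiB global drop.
drop = 4 + 30
unguarded = kv_cache_budget_gib(32, 4, drop, uma_guard=False)
guarded = kv_cache_budget_gib(32, 4, drop, uma_guard=True)
print(unguarded, guarded)
```

In this model the unguarded budget goes negative because engine A's 30 GiB is charged to engine B, while the guarded budget is unaffected. The trade-off is that with the guard any genuine non-torch overhead of the engine itself is also ignored.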
Before the patch: Concurrent engines cannibalize each other's memory drops during profiling. For the smaller models tested, the engines manage to launch but with severely starved and inaccurate KV cache allocations (missing tens of gigabytes of expected capacity). With larger models, the same arithmetic flaw pushes the KV cache calculation into the negative, resulting in an immediate hard crash during engine initialization: AssertionError: Error in memory profiling.
After the patch: All three engines correctly isolate their profiling math, flush the OS page cache, and successfully saturate the 128GB UMA pool without overlapping math or starvation (INFO: Application startup complete.).
Related Issues