fix(memory): resolve UMA concurrent profiling race condition and page cache leak#35929

Open
EmilHaase wants to merge 1 commit into vllm-project:main from EmilHaase:fix-uma-memory-profiling

Conversation


@EmilHaase EmilHaase commented Mar 3, 2026

Purpose

This PR resolves a critical two-part memory starvation and profiling race condition that causes AssertionError: Error in memory profiling or negative KV cache allocations during concurrent vLLM deployments on Unified Memory Architecture (UMA) hardware (e.g., DGX Spark).

1. Guarding Non-Torch Memory Profiling (Fix A)

  • The Flaw: The calculation non_torch_increase = diff_from_create.non_torch_memory derives overhead by observing the global memory drop. Because UMA shares system and GPU RAM, multiple vLLM engines launching concurrently "steal" global memory drops from each other. Engine A's memory initialization registers as Engine B's overhead.
  • The Fix: Hardcode result.non_torch_increase = 0 for known UMA architectures (detected via compute capabilities (8, 7), (11, 0), (12, 1)). This gates the subtraction and prevents engines from compounding simultaneous initializations as internal local overhead.
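For illustration, the gating logic can be sketched roughly as follows. This is a minimal standalone sketch, not the actual vLLM code: the function and variable names (`is_uma_device`, `guarded_non_torch_increase`) are hypothetical, and only the compute-capability tuples come from this PR's description.

```python
# Sketch of Fix A: on UMA hardware, a drop in globally available memory
# cannot be safely attributed to this process (a concurrently launching
# engine may have caused it), so non-torch overhead is reported as 0.

# Compute capabilities the PR treats as UMA (e.g. DGX Spark class devices).
UMA_COMPUTE_CAPABILITIES = {(8, 7), (11, 0), (12, 1)}


def is_uma_device(compute_capability: tuple) -> bool:
    """Return True if the device's compute capability marks it as UMA."""
    return tuple(compute_capability) in UMA_COMPUTE_CAPABILITIES


def guarded_non_torch_increase(measured_non_torch_bytes: int,
                               compute_capability: tuple) -> int:
    """Gate the non-torch memory measurement on UMA architectures.

    On discrete-GPU systems the measured global drop is trustworthy; on
    UMA it may include another engine's initialization, so zero it out.
    """
    if is_uma_device(compute_capability):
        return 0
    return measured_non_torch_bytes
```

With this guard, two engines initializing simultaneously on a UMA board no longer count each other's allocations as local overhead, at the cost of not measuring genuine local non-torch overhead on those devices.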

2. OS Page Cache Flushes for Weight Loading (Fix B)

  • The Flaw: On UMA systems, MemorySnapshot leverages psutil.virtual_memory().available. During model instantiation, safetensor loading populates the Linux OS page cache, dropping the globally available system RAM. vLLM falsely attributes the missing system RAM to internal CUDA allocations, permanently burning out the KV Cache budget.
  • The Fix: Instruct the Linux kernel to drop the .safetensors files from the page cache immediately after iteration via os.posix_fadvise(..., os.POSIX_FADV_DONTNEED). This forces the weights out of system RAM, reclaiming the free memory prior to establishing baseline snapshots.
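The page-cache flush can be sketched as below. This is an assumed shape (the helper name `drop_from_page_cache` is hypothetical and the real change lives in vLLM's weight-loading path), but the kernel advice call itself, `os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)`, is the standard way to evict a whole file from the Linux page cache. The `try`/`finally` guards the file descriptor, addressing the leak concern raised in review below.

```python
import os


def drop_from_page_cache(path: str) -> None:
    """Advise the kernel to evict a file's pages from the OS page cache.

    Intended for use right after iterating a .safetensors file, so the
    cached weight pages stop depressing psutil.virtual_memory().available
    and being misread as CUDA overhead during memory profiling.
    """
    # posix_fadvise is POSIX-only (absent on Windows); fail soft there.
    if not hasattr(os, "posix_fadvise"):
        return
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 covers the entire file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)  # always release the descriptor, even on error
```

Note that `POSIX_FADV_DONTNEED` is advisory: dirty pages must be written back before they can be dropped, but for read-only weight files the pages are clean and typically evicted immediately.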

Test Plan

  • Hardware: NVIDIA DGX Spark (128GB UMA)
  • Model: Qwen/Qwen3.5-2B
  • Execution: Launch 3 concurrent vLLM engines simultaneously with staggered --gpu-memory-utilization flags on an idle UMA system to intentionally trigger the concurrent memory snapshot race condition.
```shell
vllm serve Qwen/Qwen3.5-2B --port 8000 --gpu-memory-utilization 0.35 &
vllm serve Qwen/Qwen3.5-2B --port 8001 --gpu-memory-utilization 0.30 &
vllm serve Qwen/Qwen3.5-2B --port 8002 --gpu-memory-utilization 0.20 &
```

Test Result

Before the patch: Concurrent engines cannibalize each other's memory drops during profiling. For the smaller models tested, the engines manage to launch, but with severely starved and inaccurate KV cache allocations (missing tens of gigabytes of expected capacity). With larger models, the same flaw pushes the KV cache calculation negative, causing an immediate hard crash during engine initialization: AssertionError: Error in memory profiling.

After the patch:
All three engines correctly isolate their profiling measurements, flush the OS page cache, and successfully saturate the 128GB UMA pool without cross-engine interference or starvation:

  • Engine 1 (0.35 util): Available KV cache memory: 36.64 GiB
  • Engine 2 (0.30 util): Available KV cache memory: 30.55 GiB
  • Engine 3 (0.20 util): Available KV cache memory: 18.39 GiB
  • All instances successfully booted HTTP servers (INFO: Application startup complete.).

Related Issues

@EmilHaase EmilHaase requested a review from 22quinn as a code owner March 3, 2026 23:48
@github-actions

github-actions bot commented Mar 3, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Mar 3, 2026

Hi @EmilHaase, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important fixes for memory management on Unified Memory Architecture (UMA) systems. The first fix prevents a race condition during concurrent memory profiling by correctly handling non-torch memory increases on UMA hardware. The second fix addresses a page cache leak after loading .safetensors files, which previously led to inaccurate KV cache budget calculations. My review focuses on ensuring the robustness of these fixes. I've identified a potential resource leak in the page cache flushing logic and suggested a more robust implementation to prevent file descriptor leaks.

@EmilHaase EmilHaase force-pushed the fix-uma-memory-profiling branch from 10bbb50 to da70316 on March 3, 2026 23:59

…cache leak

Signed-off-by: Emil Haase <emil@tomst.com>

@EmilHaase EmilHaase force-pushed the fix-uma-memory-profiling branch from da70316 to c09098a on March 4, 2026 00:13
@ehfd
Contributor

ehfd commented Mar 8, 2026

xref #35356

@ehfd
Contributor

ehfd commented Mar 8, 2026

Note: In v0.17.0:
Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).

@ehfd
Contributor

ehfd commented Mar 12, 2026

I think this is still needed.



Development

Successfully merging this pull request may close these issues.

[Bug]: UMA Memory Profiling Misattributes OS Page Cache and Fails in Concurrent Deployments
