fix(memory): resolve UMA concurrent profiling race condition and page cache leak#35929

Open
EmilHaase wants to merge 1 commit into vllm-project:main from EmilHaase:fix-uma-memory-profiling

Conversation


@EmilHaase EmilHaase commented Mar 3, 2026

Purpose

This PR resolves a critical two-part memory starvation and profiling race condition that causes AssertionError: Error in memory profiling or negative KV cache allocations during concurrent vLLM deployments on Unified Memory Architecture (UMA) hardware (e.g., DGX Spark).

1. Guarding Non-Torch Memory Profiling (Fix A)

  • The Flaw: The calculation non_torch_increase = diff_from_create.non_torch_memory derives overhead by observing the global memory drop. Because UMA shares system and GPU RAM, multiple vLLM engines launching concurrently "steal" global memory drops from each other. Engine A's memory initialization registers as Engine B's overhead.
  • The Fix: Hardcode result.non_torch_increase = 0 for known UMA architectures (detected via compute capabilities (8, 7), (11, 0), (12, 1)). This gates the subtraction and prevents engines from compounding simultaneous initializations as internal local overhead.
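For illustration, the gating logic can be sketched roughly as follows. This is a minimal standalone sketch, not the actual vLLM code: the function and variable names (`is_uma_device`, `guarded_non_torch_increase`) are hypothetical, and only the compute-capability tuples come from this PR's description.

```python
# Sketch of Fix A: on UMA hardware, a drop in globally available memory
# cannot be safely attributed to this process (a concurrently launching
# engine may have caused it), so non-torch overhead is reported as 0.

# Compute capabilities the PR treats as UMA (e.g. DGX Spark class devices).
UMA_COMPUTE_CAPABILITIES = {(8, 7), (11, 0), (12, 1)}


def is_uma_device(compute_capability: tuple) -> bool:
    """Return True if the device's compute capability marks it as UMA."""
    return tuple(compute_capability) in UMA_COMPUTE_CAPABILITIES


def guarded_non_torch_increase(measured_non_torch_bytes: int,
                               compute_capability: tuple) -> int:
    """Gate the non-torch memory measurement on UMA architectures.

    On discrete-GPU systems the measured global drop is trustworthy; on
    UMA it may include another engine's initialization, so zero it out.
    """
    if is_uma_device(compute_capability):
        return 0
    return measured_non_torch_bytes
```

With this guard, two engines initializing simultaneously on a UMA board no longer count each other's allocations as local overhead, at the cost of not measuring genuine local non-torch overhead on those devices.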

2. OS Page Cache Flushes for Weight Loading (Fix B)

  • The Flaw: On UMA systems, MemorySnapshot leverages psutil.virtual_memory().available. During model instantiation, safetensor loading populates the Linux OS page cache, dropping the globally available system RAM. vLLM falsely attributes the missing system RAM to internal CUDA allocations, permanently burning out the KV Cache budget.
  • The Fix: Instruct the Linux kernel to drop the .safetensors files from the page cache immediately after iteration via os.posix_fadvise(..., os.POSIX_FADV_DONTNEED). This forces the weights out of system RAM, reclaiming the free memory prior to establishing baseline snapshots.
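The page-cache flush can be sketched as below. This is an assumed shape (the helper name `drop_from_page_cache` is hypothetical and the real change lives in vLLM's weight-loading path), but the kernel advice call itself, `os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)`, is the standard way to evict a whole file from the Linux page cache. The `try`/`finally` guards the file descriptor, addressing the leak concern raised in review below.

```python
import os


def drop_from_page_cache(path: str) -> None:
    """Advise the kernel to evict a file's pages from the OS page cache.

    Intended for use right after iterating a .safetensors file, so the
    cached weight pages stop depressing psutil.virtual_memory().available
    and being misread as CUDA overhead during memory profiling.
    """
    # posix_fadvise is POSIX-only (absent on Windows); fail soft there.
    if not hasattr(os, "posix_fadvise"):
        return
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 covers the entire file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)  # always release the descriptor, even on error
```

Note that `POSIX_FADV_DONTNEED` is advisory: dirty pages must be written back before they can be dropped, but for read-only weight files the pages are clean and typically evicted immediately.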

Test Plan

  • Hardware: NVIDIA DGX Spark (128GB UMA)
  • Model: Qwen/Qwen3.5-2B
  • Execution: Launch 3 concurrent vLLM engines simultaneously with staggered --gpu-memory-utilization flags on an idle UMA system to intentionally trigger the concurrent memory snapshot race condition.
```shell
vllm serve Qwen/Qwen3.5-2B --port 8000 --gpu-memory-utilization 0.35 &
vllm serve Qwen/Qwen3.5-2B --port 8001 --gpu-memory-utilization 0.30 &
vllm serve Qwen/Qwen3.5-2B --port 8002 --gpu-memory-utilization 0.20 &
```

Test Result

Before the patch: Concurrent engines cannibalize each other's memory drops during profiling. For the smaller models tested, the engines manage to launch, but with severely starved and inaccurate KV cache allocations (missing tens of gigabytes of expected capacity). With larger models, the same flaw pushes the KV cache calculation negative, causing an immediate hard crash during engine initialization: AssertionError: Error in memory profiling.

After the patch:
All three engines correctly isolate their profiling measurements, flush the OS page cache, and successfully saturate the 128GB UMA pool without cross-engine interference or starvation:

  • Engine 1 (0.35 util): Available KV cache memory: 36.64 GiB
  • Engine 2 (0.30 util): Available KV cache memory: 30.55 GiB
  • Engine 3 (0.20 util): Available KV cache memory: 18.39 GiB
  • All instances successfully booted HTTP servers (INFO: Application startup complete.).

Related Issues

@EmilHaase EmilHaase requested a review from 22quinn as a code owner March 3, 2026 23:48
@github-actions

github-actions bot commented Mar 3, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify

mergify bot commented Mar 3, 2026

Hi @EmilHaase, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
```shell
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint
```

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces two important fixes for memory management on Unified Memory Architecture (UMA) systems. The first fix prevents a race condition during concurrent memory profiling by correctly handling non-torch memory increases on UMA hardware. The second fix addresses a page cache leak after loading .safetensors files, which previously led to inaccurate KV cache budget calculations. My review focuses on ensuring the robustness of these fixes. I've identified a potential resource leak in the page cache flushing logic and suggested a more robust implementation to prevent file descriptor leaks.

@EmilHaase EmilHaase force-pushed the fix-uma-memory-profiling branch from 10bbb50 to da70316 on March 3, 2026 23:59

…cache leak

Signed-off-by: Emil Haase <emil@tomst.com>

@EmilHaase EmilHaase force-pushed the fix-uma-memory-profiling branch from da70316 to c09098a on March 4, 2026 00:13
@ehfd
Contributor

ehfd commented Mar 8, 2026

xref #35356

@ehfd
Contributor

ehfd commented Mar 8, 2026

Note: In v0.17.0:
Weight Offloading V2 with Prefetching: The weight offloader now hides onloading latency via prefetching (#29941), plus selective CPU weight offloading (#34535) and CPU offloading without pinned memory doubling (#32993).

@ehfd
Contributor

ehfd commented Mar 12, 2026

I think this is still needed.



Development

Successfully merging this pull request may close these issues.

[Bug]: UMA Memory Profiling Misattributes OS Page Cache and Fails in Concurrent Deployments
