[Feature] Support CPU offloading without PyTorch pinned memory that leads to doubled allocation (#32993)
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically, and you can ask your reviewers to trigger select CI tests on top of those. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
This pull request introduces a valuable workaround for the excessive memory allocation issue with PyTorch's pin_memory during CPU offloading. The approach of using mmap and cudaHostRegister for non-pinned tensors is well-implemented and controlled via new environment variables. The related code changes across different files are consistent and logical. However, I've identified a critical issue in the C++/CUDA implementation where an error condition is not handled, which could lead to crashes.
Hi @wzhao18, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed pre-commit hooks will run automatically.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
Comment `@cursor review` or `bugbot run` to trigger another review on this PR.
That's an interesting finding. I think we can directly remove PyTorch `pin_memory` in the offloading code path; we don't need PyTorch's caching behavior for the offloading functionality.
```diff
-// This function assumes that `cpu_tensor` is a CPU tensor allocated with pinned
-// memory, and that UVA (Unified Virtual Addressing) is enabled.
+// This function assumes that `cpu_tensor` is a CPU tensor,
+// and that UVA (Unified Virtual Addressing) is enabled.
 torch::Tensor get_cuda_view_from_cpu_tensor(torch::Tensor& cpu_tensor) {
```
It seems to mix two functionalities in one function. I'd prefer to separate them: keep the original `get_cuda_view_from_cpu_tensor` untouched, and add another `alloc_pinned_cpu_tensor_and_get_cuda_view(num_bytes)` function.
Thanks for reviewing. I personally find it cleaner to keep a unified function that "creates a CUDA view from a CPU tensor". The only difference is whether the CPU tensor is already allocated with pinned memory; in both cases, the returned CUDA view keeps a reference to the CPU buffer.
Let me know if you insist on having separate functions. I can update it that way.
csrc/cuda_view.cu
Outdated
```cpp
cudaError_t err =
    cudaHostRegister(host_ptr, aligned_size, cudaHostRegisterDefault);
if (err != cudaSuccess) {
  munmap(host_ptr, aligned_size);
  AT_ERROR("cudaHostRegister failed: ", cudaGetErrorString(err));
}
```
Is cudaHostAlloc enough? Do we need to use system mmap and then cudaHostRegister?
Will try cudaHostAlloc and get back here.
Have changed implementation to use cudaHostAlloc instead of system mmap. Thanks for the suggestion.
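For context, the earlier mmap-based path had to round the registration size up to a page boundary before calling cudaHostRegister. The alignment arithmetic can be sketched in a standalone way (this is an illustration, not the PR's actual C++ code; `page_aligned_size` is a hypothetical helper name):

```python
import mmap

def page_aligned_size(nbytes: int, page_size: int = mmap.PAGESIZE) -> int:
    """Round nbytes up to a multiple of the system page size, as a
    cudaHostRegister-over-mmap path would need. Standalone sketch only."""
    return (nbytes + page_size - 1) // page_size * page_size

# e.g. with 4096-byte pages, a 5000-byte request rounds up to 8192 bytes
print(page_aligned_size(5000, 4096))  # 8192
```

Switching to cudaHostAlloc sidesteps this bookkeeping, since the CUDA runtime hands back page-locked memory directly.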
youkaichao
left a comment
This direction looks good to me!
This pull request has merge conflicts that must be resolved before it can be merged.
@youkaichao Thanks for the review. I have updated the PR based on your feedback. I am still weighing whether we should split the function into two as discussed above. Besides that, is there anything else missing before this is ready to merge?
@youkaichao @mgoin The failing CI tests seem to be flaky; the quantized model test fails on main too. Could you rerun the other failed tests? Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
… loading Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
…nsfer Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
(vllm-project#34543) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
…eads to doubled allocation (vllm-project#32993) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Athrael Soju <athrael.soju@gmail.com>
Hi @wzhao18, thanks for this work! IIUC, with the change to `get_cuda_view_from_cpu_tensor`, non-pinned tensors are now handled by allocating a new pinned buffer and copying into it. I find this a bit confusing. The name says "view", which implies aliased memory (no copy), but for non-pinned tensors it silently copies into a new pinned buffer. The caller would reasonably expect that mutations to the original CPU tensor are visible through the "view": that's true for pinned tensors but false for non-pinned ones. Would it make more sense to rename this function, or to separate the handling of pinned and non-pinned tensors into two functions? The latter would be better for backward compatibility.
Hi @minosfuture. Thanks for pointing out this concern. I agree this could lead to different behaviors (aliasing or non-aliasing) of `get_cuda_view_from_cpu_tensor` depending on whether the input tensor is pinned.
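The aliasing concern raised above can be illustrated with plain Python buffers (an analogy only, not vLLM code: a true view tracks mutations of the underlying buffer, while a copy silently diverges, which is exactly the pinned vs. non-pinned behavior difference being discussed):

```python
buf = bytearray(b"weights")

view = memoryview(buf)  # aliases buf: later writes to buf are visible
copy = bytes(buf)       # snapshot: later writes to buf are NOT visible

buf[0:1] = b"W"         # mutate the original buffer

print(bytes(view))  # b'Weights'  -> the view sees the mutation
print(copy)         # b'weights'  -> the copy does not
```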
Just for people where the environment variables don't work: the variable names were renamed during the PR review, so check the current code for the final names.
@ehfd Thanks for pointing this out. The flag names were updated during PR review. I will update the description to reflect this change. |
@wzhao18 Question: does `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY` also apply to the no-prefetch offloader? Moreover, what is the behavior when `VLLM_WEIGHT_OFFLOADING_DISABLE_UVA` is set, given that the backend itself is UVA-based?
The no-prefetch offloader is a different backend; the two env vars apply only within the UVA offloader, so they have no effect there. Disabling UVA in the UVA backend is admittedly confusing: it essentially loads the weights explicitly rather than automatically through UVA. The naming is a legacy problem, but that code path serves as a fallback for systems without UVA capability.
…h offloader

The prefetch CPU offloader unconditionally allocates pinned memory, ignoring the VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY env var. On unified-memory systems like GH200, where pinned memory eats into the GPU memory pool, this causes OOM during model loading. The UVA offloader already checks this env var (added in vllm-project#32993). Apply the same check to the three places in the prefetch backend that allocate or assert pinned memory:

- _offload_to_cpu_internal(): initial CPU storage allocation
- _update_cpu_storage_from_param(): re-pinning after weight processing
- start_onload_to_static(): pinned memory assertion

Fixes vllm-project#37672

Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com>
Purpose
Background:
The current PyTorch `pin_memory` implementation rounds allocations up to the next power of 2 (see Issue #150517, PR #171662). In the worst case, this results in 2x the expected CPU memory usage. For instance, running `nvidia/Kimi-K2-Thinking-NVFP4` on a system with one GB300 GPU and 500 GB of CPU memory results in OOM, even though sufficient memory is available.

Script to demonstrate the issue: allocating 257 GB of pinned memory leads to a 512 GB CPU allocation.
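The arithmetic behind that worst case can be sketched as follows (a hypothetical illustration of the rounding behavior described above, not the actual PyTorch allocator code):

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of 2 that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

GiB = 1 << 30
requested = 257 * GiB                   # just past the 256 GiB boundary
allocated = next_power_of_2(requested)  # rounded up by the allocator

print(f"requested: {requested // GiB} GiB")  # 257 GiB
print(f"allocated: {allocated // GiB} GiB")  # 512 GiB
print(f"overhead:  {allocated / requested:.2f}x")
```

Requesting just past a power-of-2 boundary is the pathological case: the overhead approaches 2x, which is why 257 GB of pinned weights can blow past a 500 GB host.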
Changes:
This PR provides a workaround that avoids PyTorch pinned memory when using UVA-based (Unified Virtual Addressing) CPU offloading. In addition, it fixes various issues on the non-UVA offloading code path (explicit transfers between CPU and GPU).
There is a PyTorch PR that aims to address the doubled pinned-memory usage; this PR provides a temporary workaround in the meantime. The workaround is enabled through two environment variables:

- `VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY`: disable PyTorch pinned memory in CPU offloading
- `VLLM_WEIGHT_OFFLOADING_DISABLE_UVA`: disable UVA in CPU offloading

When neither env var is set, the original offloading logic (PyTorch pinned memory + UVA) is used.
Test Plan
Added testing for the env-var enabled code paths in `tests/basic_correctness/test_cpu_offload.py`.

Also tested `nvidia/Kimi-K2-Thinking-NVFP4` on one GB300 GPU with 500 GB of CPU memory. On current main:

```
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/Kimi-K2-Thinking-NVFP4 \
  --trust-remote-code \
  --cpu-offload-gb 350
```

Test Result
With the changes, the following command will work.

```
VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY=1 \
python3 -m vllm.entrypoints.openai.api_server \
  --model nvidia/Kimi-K2-Thinking-NVFP4 \
  --trust-remote-code \
  --cpu-offload-gb 350
```

Essential Elements of an Effective PR Description Checklist

Update `supported_models.md` and `examples` for a new model.