[Feature] Support CPU Offloading without Pytorch Pinned Memory that leads to doubled allocation #32993
Changes from all commits
Changes to the CPU offload test:

```diff
@@ -1,10 +1,29 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

 import pytest

 from ..utils import compare_two_settings


-def test_cpu_offload():
+@pytest.mark.parametrize("disable_pin_memory", [False, True])
+@pytest.mark.parametrize("disable_uva", [False, True])
+def test_cpu_offload(disable_pin_memory, disable_uva):
+    env_vars = {
+        "VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY": str(int(disable_pin_memory)),
+        "VLLM_WEIGHT_OFFLOADING_DISABLE_UVA": str(int(disable_uva)),
+    }
+
+    args = ["--cpu-offload-gb", "1"]
+
+    # cuda graph only works with UVA offloading
+    if disable_uva:
+        args.append("--enforce-eager")
+
     compare_two_settings(
-        "hmellor/tiny-random-LlamaForCausalLM", [], ["--cpu-offload-gb", "1"]
+        model="hmellor/tiny-random-LlamaForCausalLM",
+        arg1=[],
+        arg2=args,
+        env1=None,
+        env2=env_vars,
     )
```
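For readers less familiar with pytest, stacking two `parametrize` decorators as the diff above does runs the test once for every combination of the two flags. A quick standalone sketch of the resulting matrix:

```python
import itertools

# The two stacked @pytest.mark.parametrize decorators take the cartesian
# product of their value lists, so the test runs four times.
combos = list(itertools.product([False, True], repeat=2))
print(combos)
# [(False, False), (False, True), (True, False), (True, True)]
```

The `(disable_pin_memory=False, disable_uva=False)` case reproduces the original, unparametrized test.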
Changes to `envs.py`:

```diff
@@ -230,6 +230,8 @@
 VLLM_USE_V2_MODEL_RUNNER: bool = False
 VLLM_LOG_MODEL_INSPECTION: bool = False
 VLLM_DEBUG_MFU_METRICS: bool = False
+VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY: bool = False
+VLLM_WEIGHT_OFFLOADING_DISABLE_UVA: bool = False
 VLLM_DISABLE_LOG_LOGO: bool = False
 VLLM_LORA_DISABLE_PDL: bool = False

@@ -1542,6 +1544,14 @@ def _get_or_set_default() -> str:
 "VLLM_DEBUG_MFU_METRICS": lambda: bool(
     int(os.getenv("VLLM_DEBUG_MFU_METRICS", "0"))
 ),
+# Disable using pytorch's pin memory for CPU offloading.
+"VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY": lambda: bool(
+    int(os.getenv("VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY", "0"))
+),
+# Disable using UVA (Unified Virtual Addressing) for CPU offloading.
+"VLLM_WEIGHT_OFFLOADING_DISABLE_UVA": lambda: bool(
+    int(os.getenv("VLLM_WEIGHT_OFFLOADING_DISABLE_UVA", "0"))
+),
 # Disable logging of vLLM logo at server startup time.
 "VLLM_DISABLE_LOG_LOGO": lambda: bool(int(os.getenv("VLLM_DISABLE_LOG_LOGO", "0"))),
 # Disable PDL for LoRA, as enabling PDL with LoRA on SM100 causes
```

Comment on lines +1547 to +1554:

Member: nit: Since we also do kv offloading, it might be better to explicitly say weight offloading.

Contributor (Author): Thanks for the review. I will rename them to

Contributor (Author): Updated.
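The registration above follows the usual "0"/"1" boolean env-var pattern used throughout `envs.py`. A minimal standalone sketch (the `env_flag` helper name is mine, not vLLM's):

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    # Mirror of the lambda pattern in envs.py:
    # unset or "0" -> False, "1" -> True.
    return bool(int(os.getenv(name, default)))

os.environ["VLLM_WEIGHT_OFFLOADING_DISABLE_UVA"] = "1"
print(env_flag("VLLM_WEIGHT_OFFLOADING_DISABLE_UVA"))         # True
print(env_flag("VLLM_WEIGHT_OFFLOADING_DISABLE_PIN_MEMORY"))  # False (unset)
```

Note that non-numeric values such as `"true"` would raise `ValueError` under this pattern; only `"0"`/`"1"` are expected.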
Reviewer: It seems to mix two functionalities in one function. I'd prefer to separate them: keep the original `get_cuda_view_from_cpu_tensor` untouched, and have another `alloc_pinned_cpu_tensor_and_get_cuda_view(num_bytes)` function.

Contributor (Author): Thanks for reviewing. I personally find it cleaner to keep a unified function that "creates a CUDA view from a CPU tensor". The only difference is whether the CPU tensor is already allocated with pinned memory. In both cases, the returned CUDA view keeps a reference to the CPU buffer.

Let me know if you insist on having separate functions; I can update it that way.
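To make the two API shapes under discussion concrete, here is a toy, hardware-free sketch. The names mirror the comment thread, but `FakeTensor` and `CudaView` are invented stand-ins; this is not vLLM's actual implementation.

```python
class FakeTensor:
    """Stand-in for a CPU tensor; `pinned` marks page-locked host memory."""
    def __init__(self, data, pinned=False):
        self.data = data
        self.pinned = pinned

class CudaView:
    """Stand-in for a device-side view of pinned host memory.

    As the author notes, the view keeps a reference to the CPU buffer,
    so the host allocation stays alive and nothing is copied.
    """
    def __init__(self, cpu_tensor: FakeTensor):
        assert cpu_tensor.pinned, "UVA mapping requires pinned host memory"
        self.cpu_tensor = cpu_tensor

# Reviewer's proposal: two functions with distinct responsibilities.
def get_cuda_view_from_cpu_tensor(cpu_tensor: FakeTensor) -> CudaView:
    return CudaView(cpu_tensor)

def alloc_pinned_cpu_tensor_and_get_cuda_view(num_bytes: int) -> CudaView:
    return CudaView(FakeTensor(bytearray(num_bytes), pinned=True))

# Author's unified alternative: one entry point that pins on demand.
def get_cuda_view(cpu_tensor: FakeTensor) -> CudaView:
    if not cpu_tensor.pinned:
        cpu_tensor.pinned = True  # stand-in for pinning the host allocation
    return CudaView(cpu_tensor)

view = alloc_pinned_cpu_tensor_and_get_cuda_view(16)
print(len(view.cpu_tensor.data))  # 16
```

Either shape avoids the doubled allocation from the PR title: the device side aliases the pinned host buffer instead of holding a second full copy.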