OffloadingConnector: Add cpu_bytes_to_use configuration #24498
NickLucche merged 1 commit into vllm-project:main
Conversation
vllm/v1/engine/core.py (outdated)

```python
num_cpu_blocks = (int(vllm_config.cache_config.swap_space_bytes) //
                  kv_cache_configs[0].kv_bytes_per_block)
```
There is also a knob called offloaded_block_size in #22595. IIUC, it also impacts the calculation of num_cpu_blocks, right? (i.e., if we have larger CPU blocks, we should have fewer CPU blocks)
In v0, the offloading was part of the core.
My suggestion for v1 is to have the offloading as a connector.
I wanted to follow the convention for connectors, where all of their arguments are actually defined in their kv_connector_extra_config.
However, deriving num_cpu_blocks from some kind of a swap_space parameter requires knowledge of kv_bytes_per_block.
So basically, I need my connector (both scheduler-side and worker-side) to be aware of kv_bytes_per_block.
This requires changing things in core, so I tried to make minimal changes and came up with the approach here:
For the scheduler-side connector, report kv_bytes_per_block by setting the existing V0 field num_cpu_blocks.
For the worker-side connector, pass on kv_cache_configs via the register_kv_caches function (in a follow-up PR).
When the offloading connector gets this num_cpu_blocks (given in GPU block size), it can derive the actual num_cpu_blocks by dividing by block_size_factor.
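The derivation described here can be sketched as follows (a minimal sketch with illustrative names, not the actual connector code):

```python
# Hypothetical helper, not vLLM code: the connector receives num_cpu_blocks
# expressed in GPU-block units and converts it to actual CPU blocks, which
# are block_size_factor times larger than GPU blocks.
def derive_num_cpu_blocks(swap_space_bytes: int,
                          kv_bytes_per_block: int,
                          block_size_factor: int) -> int:
    # num_cpu_blocks as reported via the existing V0 field, in GPU block size
    num_gpu_sized_blocks = swap_space_bytes // kv_bytes_per_block
    # divide by block_size_factor to get the number of larger CPU blocks
    return num_gpu_sized_blocks // block_size_factor
```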
To sum up, I'm trying to make minimal changes to the core.
This results in the actual offloading configuration parameters split between vllm_config.cache_config and kv_connector_extra_config.
I'm good with taking a different approach.
Your thoughts?
Perhaps we should ask other relevant folks on their opinion here?
Yeah. This is a good point. I think at a high level, there should be two parameters that can be configured by users: (1) total_cpu_buffer_size and (2) cpu_buffer_block_size (how many tokens in each CPU block).
For (1), it's also worth thinking whether it's per rank or per vLLM instance (i.e., summed across all ranks). I feel like if it's per rank, probably it will be better to pass it in the KV connector configs, while it makes more sense to have a "global" cache size when it's configured by global configurations like --swap-space.
For (2), I think it should definitely be put into the KV connector config as it's the current CPU-offloading-connector-specific configuration.
To sum up, I feel like putting all the configs into the KV connector config will probably be better and less confusing. WDYT?
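The split proposed above can be illustrated with a small sketch (parameter names follow the comment and are hypothetical, not the final API): a global, per-instance buffer size is divided evenly across ranks, while the block size stays a connector-specific setting.

```python
# Hypothetical sketch: how a per-instance CPU buffer size could translate
# into a per-rank CPU block count.
def cpu_blocks_per_rank(total_cpu_buffer_size: int,
                        bytes_per_cpu_block: int,
                        world_size: int) -> int:
    # total_cpu_buffer_size is per vLLM instance (summed across all ranks),
    # so each rank gets an even share of the global budget
    per_rank_bytes = total_cpu_buffer_size // world_size
    return per_rank_bytes // bytes_per_cpu_block
```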
This pull request has merge conflicts that must be resolved before it can be merged.
I'm planning to change this PR next week to share the raw
Following your comment, I've spent some time thinking about it. The PR is now modified and much leaner.
For passing in KVCacheConfig, we now have an optional
Thanks! This is what I do now: @heheda12345 does this look ok to you? @NickLucche I removed all intrusive changes, so hopefully we're good to go?
Yes, sounds good!
ApostaC
left a comment
Thanks for the update. Removing my previous "request changes"
@ApostaC @heheda12345 Is this PR ready to be merged? Hopefully it can go into v0.14.0.
NickLucche
left a comment
Leaner, thanks for the work @orozery !
```diff
- --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "num_cpu_blocks": 1000}}'
+ --kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"block_size": 64, "cpu_bytes_to_use": 1000000000}}'
```
nit: we should probably follow up with a more human readable way of expressing the value
I'm open for suggestions :)
thinking about unifying with max-model-len format
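One way to express the value more readably, sketched as a hypothetical size parser (not an existing vLLM helper), accepting forms like "1GiB" or "2GB" alongside plain byte counts:

```python
import re

# Hypothetical sketch: parse human-readable sizes into bytes, accepting both
# decimal (kB/MB/GB) and binary (KiB/MiB/GiB) suffixes, case-insensitively.
_UNITS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9,
          "ki": 2**10, "mi": 2**20, "gi": 2**30}

def parse_size(text: str) -> int:
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kmg]i?)?b?", text.strip().lower())
    if m is None:
        raise ValueError(f"invalid size: {text!r}")
    value, unit = m.groups()
    return int(float(value) * _UNITS[unit or ""])
```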
This commit replaces the OffloadingConnector size configuration num_cpu_blocks with cpu_bytes_to_use. This allows for a more intuitive space allocation (per vLLM instance, across workers).

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Head branch was pushed to by a user without write access
@NickLucche rebased
…#24498) Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Tomer Natan <tbarnatan@computelab-frontend-8.nvidia.com>
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)
- Modify import paths due to the refactors vllm-project/vllm#31916 and vllm-project/vllm#32054
- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional arguments but 3 were given` due to vllm-project/vllm#24498
- Skip the async-scheduling tests in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which were never verified (vllm-project/vllm#31998)
- Skip some pooling tests broken by vllm-project/vllm#32148, where vllm also fails: https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4. We will reopen those tests when main2main reaches vllm-project/vllm#32243
- Skip some cases in `tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are broken by vllm-project/vllm#32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: vllm-project/vllm@2f4e654

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
…#24498) Signed-off-by: Or Ozeri <oro@il.ibm.com>
This PR replaces the OffloadingConnector size configuration num_cpu_blocks with cpu_bytes_to_use.
This allows for a more intuitive space allocation (per vLLM instance, across workers).
Note

Modernizes KV offloading configuration and wiring.
- Replaces per-rank `num_cpu_blocks`/`kv_bytes_per_rank` with instance-wide `cpu_bytes_to_use` for `OffloadingConnector` (docs and configs)
- `CPUOffloadingSpec` now derives `num_blocks` from `KVCacheConfig` (page size, tensors, world size) and `block_size`; this requires passing `kv_cache_config` into `OffloadingSpecFactory.create_spec(...)` and `OffloadingConnector`
- `VllmConfig._post_init_kv_transfer_config` sets `cpu_bytes_to_use = kv_offloading_size * (1 << 30)` (no per-rank split); the `LMCache` path is unchanged
- Updates tests (`MockOffloadingSpec`, unit/integration offloading tests)

Written by Cursor Bugbot for commit 8208b41. This will update automatically on new commits.
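The GiB-to-bytes conversion noted in the summary can be sketched as follows (a minimal sketch assuming the names from the summary, not the exact vLLM code):

```python
# Sketch of the conversion attributed to VllmConfig._post_init_kv_transfer_config:
# kv_offloading_size is given in GiB and yields an instance-wide byte budget.
def cpu_bytes_to_use(kv_offloading_size: int) -> int:
    # no per-rank split is applied at this point; the connector sees the
    # total budget for the whole vLLM instance
    return kv_offloading_size * (1 << 30)
```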