[Core][NIXL] Support HMA+NixlConnector #31802
NickLucche wants to merge 75 commits into vllm-project:main from
Conversation
Documentation preview: https://vllm--31802.org.readthedocs.build/en/31802/

This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces support for HMA (Hybrid Memory Allocator) with the NixlConnector, which is a significant feature. The changes span CI configurations, low-level CUDA kernels, and core scheduling logic. The PR also includes support for the Turing architecture in Marlin kernels, along with various refactorings and bug fixes. While the changes are extensive and well-structured, I've identified several critical and high-severity issues. These mainly consist of temporary workarounds and FIXME comments that need to be addressed before merging, as well as the removal of important safety checks (static_assert) in CUDA kernels, which could lead to memory corruption.
csrc/moe/marlin_moe_wna16/marlin_template.h (1021-1025)
The static_assert that validates shared memory layout has been removed. This check is important to prevent potential memory corruption in the CUDA kernel. Was its removal intentional? If so, could you please explain why it's no longer needed? If not, it should be restored.
csrc/quantization/gptq_marlin/marlin_template.h (876-880)
Similar to the other Marlin template, the static_assert for shared memory layout validation has been removed here. This could lead to memory corruption. Please restore it or provide a reason for its removal.
vllm/config/model.py (1126-1128)
This temporary change to disable mm_prefix_lm should be removed before merging, as indicated by the comment.
vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (80-82)
This temporary workaround should be removed before merging, as indicated by the comment.
vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (212-214)
This temporary workaround to return None should be removed before merging.
vllm/distributed/kv_transfer/kv_connector/v1/lmcache_connector.py (324-334)
This temporary workaround for request_finished_all_groups should be addressed and the comment removed before merging.
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (682)
Please address this FIXME regarding handling a tuple of blocks.
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (2195)
Please address this FIXME about marking blocks per group in the error handling path.
vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py (2264)
Please address this FIXME regarding flattening the tuple in _logical_to_kernel_block_ids.
vllm/v1/core/sched/scheduler.py (1623-1624)
Please address these FIXME comments regarding HMA and environment variable changes.
vllm/v1/core/sched/scheduler.py (1627)
Please address this FIXME about handling blocks across layers.
vllm/v1/core/sched/scheduler.py (1638)
Please address this FIXME about per-group caching.
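To illustrate what "per-group caching" means here: with the hybrid KV cache manager, each KV-cache group (e.g. full-attention layers vs. sliding-window layers) can have a different set of cached blocks, so a single flat cache is no longer sufficient. The sketch below is a hypothetical illustration of that bookkeeping, not the vLLM scheduler's actual data structures:

```python
from collections import defaultdict


class PerGroupBlockCache:
    """Tracks cached block IDs separately for each KV-cache group.

    Purely illustrative: the class and method names are assumptions,
    not part of the vLLM API.
    """

    def __init__(self):
        # group_id -> set of cached block IDs
        self._cached = defaultdict(set)

    def mark_cached(self, group_id, block_ids):
        self._cached[group_id].update(block_ids)

    def is_cached(self, group_id, block_id):
        return block_id in self._cached[group_id]


cache = PerGroupBlockCache()
cache.mark_cached(0, [1, 2])
# Block 2 is cached for group 0 but not for group 1.
print(cache.is_cached(0, 2), cache.is_cached(1, 2))  # True False
```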
Closing in favor of #32204.
This PR is based on an early version of #30166, so the diff is a mess. I will clean it up and rebase it ASAP, and provide a more accurate description of the PR then.
UPDATE: check out #32204 for the updated PR
Overview
Currently, connectors cannot take full advantage of models that employ hybrid attention (full attention + sliding-window attention): they treat all layers as full attention, because the Hybrid KV Cache Manager is disabled.
This PR enables NixlConnector to work with the HMA, drastically reducing the number of bytes/regions moved per transfer for SWA+FA models, while laying the groundwork for state-based models (Mamba, etc.).
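A back-of-envelope sketch of why this helps: without the hybrid manager, every layer's KV cache holds the full sequence, whereas with it, sliding-window layers only keep their window. The numbers below (layer counts, head dimensions, window size) are illustrative assumptions, not measurements from this PR:

```python
def kv_bytes(tokens_per_layer, head_dim=128, num_kv_heads=8, dtype_bytes=2):
    """Total KV-cache bytes for a list of per-layer token counts.

    The 2x factor accounts for separate K and V caches.
    All parameters are illustrative assumptions.
    """
    return sum(2 * t * head_dim * num_kv_heads * dtype_bytes
               for t in tokens_per_layer)


seq_len, window, layers = 8192, 1024, 32
fa_layers = layers // 4  # assume 1 in 4 layers is full attention

# Treating every layer as full attention: all layers store seq_len tokens.
all_fa = kv_bytes([seq_len] * layers)

# Hybrid-aware: SWA layers only keep the last `window` tokens.
hybrid = kv_bytes([seq_len] * fa_layers
                  + [window] * (layers - fa_layers))

print(f"transfer-size reduction: {1 - hybrid / all_fa:.1%}")
```

With these assumed proportions the hybrid layout moves roughly a third of the bytes, which matches the "drastic reduction" claim in spirit, though the real savings depend on the model's layer mix and sequence length.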
Example of the former:
Test with
Enable experimental HMA support with `--no-disable-hybrid-kv-cache-manager`.
lm-eval results:
TODO
cc working with @heheda12345 @KuntaiDu @ivanium