[CPU] V1 support for the CPU backend#16441
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from fa7a156 to 88b96c4
Nice, can you also update the V1 User Guide to reflect this support?

@WoosukKwon @Isotr0py @mgoin can you take a look at this as well?
Isotr0py left a comment
Added some initial comments, PTAL!
```diff
- return "TORCH_SDPA"
+ return "TORCH_SDPA_VLLM_V1"
```
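The diff above changes the attention backend class name the CPU platform reports. A minimal sketch of the idea (the function name and flag here are hypothetical, not vLLM's actual platform code): the backend name gains a `_VLLM_V1` suffix when the V1 engine is in use.

```python
# Toy sketch of the backend-name selection above. The function name and
# the use_v1 flag are assumptions for illustration, not vLLM's real API.
def get_attn_backend_cls_name(use_v1: bool) -> str:
    if use_v1:
        # V1 engine path: SDPA backend with V1-compatible interfaces.
        return "TORCH_SDPA_VLLM_V1"
    # Legacy V0 path keeps the old name.
    return "TORCH_SDPA"

print(get_attn_backend_cls_name(True))   # TORCH_SDPA_VLLM_V1
print(get_attn_backend_cls_name(False))  # TORCH_SDPA
```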
vllm/v1/worker/cpu_model_runner.py
Outdated
I think we will also need a ModelRunnerBase class for v1 to avoid directly inheriting from the GPU runner.
Absolutely, though maybe that should be done once the GPU runner becomes stable. At least for now, the CPU V1 runner can fully reuse the GPU runner with limited additional changes.
I think we should put the v1 sdpa attention backend at vllm/v1/attention/backends to avoid coupling.
We should add a new AttentionImpl for v1 instead of using the legacy ones, because prefix caching and chunked prefill are always enabled by default.
Yes, chunked_prefill is always enabled when building metadata for v1. Right now I think the legacy impl is enough to enable v1 for CPU, and perhaps we can provide a new unified CPU attn impl covering multiple attn features in the future.
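For readers unfamiliar with the term: chunked prefill processes a long prompt in fixed-size slices rather than one large prefill step, which is why the metadata builder must always handle partial prompts. A toy sketch of the scheduling idea (assumed semantics for illustration, not vLLM's actual scheduler):

```python
# Toy illustration of chunked prefill: split a prompt of prompt_len
# tokens into (start, end) chunks of at most chunk_size tokens each.
# This is a conceptual sketch, not vLLM's scheduler code.
def chunk_prefill(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    starts = range(0, prompt_len, chunk_size)
    return [(s, min(s + chunk_size, prompt_len)) for s in starts]

print(chunk_prefill(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```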
> perhaps we can provide a new unified CPU attn impl for multiple attn features in the future.
Sure! In fact, I also expect Flex Attention to be supported on the CPU backend: #16078. Perhaps we can switch from SDPA to FlexAttn once it lands.
```diff
@@ -177,7 +177,7 @@ def build(self, seq_lens, query_lens, cuda_graph_pad_size, batch_size):
     seq_lens_tensor=seq_lens_tensor,
     max_query_len=max_query_len,
     max_kv_len=max_kv_len,
```
why does this file require changes?
```python
# For chunked prefill only
max_query_len: Optional[int] = None
max_kv_len: Optional[int] = None
query_start_loc: Optional[torch.Tensor] = None
```
why does this file require changes?
It is a naming conflict. The V1 model runner uses query_start_loc specifically for logits indexing, so it contains all tokens in a batch. But in torch_sdpa, query_start_loc contains prefill tokens only, so it is renamed to prefill_query_start_loc.
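The naming conflict can be made concrete with a toy example (plain Python, not vLLM's actual code): both tensors are cumulative start offsets, but computed over different token sets.

```python
from itertools import accumulate

# A batch with two decode requests (1 new token each) followed by two
# prefill requests (5 and 3 new tokens). Values are illustrative.
query_lens = [1, 1, 5, 3]

# V1 model runner meaning: cumulative offsets over ALL tokens in the
# batch, used for logits indexing.
query_start_loc = [0] + list(accumulate(query_lens))
print(query_start_loc)  # [0, 1, 2, 7, 10]

# Legacy torch_sdpa meaning: offsets over the PREFILL tokens only,
# hence the rename to prefill_query_start_loc.
prefill_lens = query_lens[2:]  # just the prefill requests
prefill_query_start_loc = [0] + list(accumulate(prefill_lens))
print(prefill_query_start_loc)  # [0, 5, 8]
```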
I would try to keep the renaming inside torch_sdpa instead of making the global change.
```python
                         num_kv_heads, head_size)

    @staticmethod
    def swap_blocks(
```
I don't think swap_blocks or copy_blocks are needed for V1.
NickLucche left a comment
Thanks for the contribution!
vllm/v1/worker/cpu_model_runner.py
Outdated
This is a bit hacky. Have you considered just duplicating GPUModelRunner while cutting out all unsupported features or unnecessary device/cpu double tensors?
We have done something similar for TPU https://github.com/vllm-project/vllm/blob/main/vllm/v1/worker/tpu_model_runner.py.
It's uglier to the eyes but it's easier to control until things are more stable.
Exactly, it is a bit hacky. We actually did it that way in V0, and it worked well to cut out some hardcoded features unsupported on CPU, such as CUDAGraph and FlashAttnBackend. But eventually the V0 CPU runner diverged from the GPU runner and couldn't leverage some new features and refactors directly.
Compared with V0, the V1 model runner has better abstractions, like unified input data management, vLLM compilation, and the attention backend builder and reorder, etc. These abstractions allow the CPU model runner to be more compatible with the GPU one. Moreover, I think the V1 model runner has become relatively stable, as almost no changes were required when rebasing this PR.
So I would prefer to fully reuse the GPU model runner for now, even with some hackiness. If some day we have to decouple them, I think it will not be very difficult.
While I definitely see your point and the strengths of leveraging a hierarchical structure, I am not 100% sure there won't be recurring hiccups where the runner breaks as an unintended consequence of some other unrelated PR.
Still, I am not against this implementation; perhaps we could do a better job at writing platform-aware code in other parts of the codebase.
E.g. (not for this PR) abstracting away those cpu/device tensors in a base class so that we don't have to resort to this workaround here.
vllm/platforms/cpu.py
Outdated
Yes, I think cascade attention is only supported in the flash-attn backend, so I disable it here.
I noticed support_sleep_mode has become a platform attribute, so I removed the check here.
Signed-off-by: jiang.li <jiang1.li@intel.com>
Hi @simon-mo, all required tests are green. We enabled CPU V1 as the default, and all CPU tests compatible with the V1 engine passed: https://buildkite.com/vllm/ci/builds/21328#0197361e-3946-4915-953d-7a8baeae1137 I think this PR is ready to merge, please take a look, thanks :)
resolve #16056
Supports all features listed in the CPU doc except FP8 KV cache.
Changes
- `CPUWorker` and `CPUModelRunner`, derived from `Worker` and `GPUModelRunner` to reduce code duplication.
- `TorchSDPABackend` with compatible interfaces for `GPUModelRunner`, such as `reorder_batch` and `build`.
- `GPUModelRunner` changes to avoid importing flash-attn explicitly and using the default `nccl` dist backend.
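The derivation strategy in the first bullet can be sketched as follows. This is a deliberately simplified illustration of the subclassing pattern (the class bodies and method names here are hypothetical, not vLLM's actual runner code): the CPU runner inherits all of the GPU runner's logic and overrides only the device-specific pieces, such as the attention backend choice.

```python
# Toy sketch of the CPUModelRunner-derives-from-GPUModelRunner pattern.
# Class internals are illustrative assumptions, not vLLM's real code.
class GPUModelRunner:
    def __init__(self, device: str = "cuda"):
        self.device = device

    def select_attn_backend(self) -> str:
        # GPU default in V1 is a flash-attn-based backend.
        return "FLASH_ATTN_VLLM_V1"


class CPUModelRunner(GPUModelRunner):
    def __init__(self):
        # Reuse all of the GPU runner's logic, pinned to the CPU device.
        super().__init__(device="cpu")

    def select_attn_backend(self) -> str:
        # Override only the device-specific part: SDPA instead of flash-attn.
        return "TORCH_SDPA_VLLM_V1"


runner = CPUModelRunner()
print(runner.device)                 # cpu
print(runner.select_attn_backend())  # TORCH_SDPA_VLLM_V1
```

The trade-off discussed in the review thread above is exactly this: inheritance keeps the CPU runner in sync with GPU-side refactors for free, at the cost of occasional breakage from unrelated GPU-runner changes.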