[Model] Enable Inference Support for the New Baichuan-M1 Model #12251
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM
@youkaichao @zhuohan123 @DarkLight1337 @WoosukKwon
Signed-off-by: dangshunya <[email protected]>
ping @youkaichao @DarkLight1337 @njhill @comaniac @zhuohan123 @WoosukKwon @alexm-redhat
The model itself LGTM, but I'm not so sure about the custom KV cache. Is anyone else familiar with this part of the code?
Regarding the SWA, can we minimize the code change for now by adopting #10584? Meanwhile, we will work on refactoring the memory manager in #11382 by @heheda12345.
Because the KV cache used by ordinary layers and SWA layers is inconsistent (we have 2 KV heads in normal attention but 8 KV heads in SWA), we cannot simply treat them the same way as in #10584; instead, we need to calculate the memory usage for each separately.
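To make the head-count mismatch concrete, here is a minimal sketch (not the actual vLLM memory profiler; layer counts and head dimensions below are illustrative, not taken from the model config) of why the KV cache budget has to be summed per layer rather than derived from a single uniform layer shape:

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    num_kv_heads: int      # e.g. 2 for full attention, 8 for SWA
    head_dim: int
    dtype_bytes: int = 2   # fp16/bf16


def kv_bytes_per_block(layers: list[LayerSpec], block_size: int) -> int:
    """Sum the per-layer KV cache cost instead of assuming all layers share one shape."""
    total = 0
    for layer in layers:
        # 2x for the key and value tensors of each layer
        total += 2 * block_size * layer.num_kv_heads * layer.head_dim * layer.dtype_bytes
    return total


# Illustrative mix: some full-attention layers with 2 KV heads,
# some sliding-window layers with 8 KV heads.
layers = [LayerSpec(num_kv_heads=2, head_dim=128)] * 20 + \
         [LayerSpec(num_kv_heads=8, head_dim=128)] * 20
print(kv_bytes_per_block(layers, block_size=16))
```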
For the vLLM v1 engine, you can support normal attention with different hidden sizes by extending this function: vllm/vllm/v1/core/kv_cache_utils.py (line 410 in bf21481).
Then you can try #10584 in v1 to support the mix of normal attention and SWA. If that works, we can raise an error asking users to enable the v1 engine if they try to run this model without vLLM v1.
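As a rough sketch of the direction being suggested (the real extension point is the function referenced above in vllm/v1/core/kv_cache_utils.py, whose actual signature may differ; the helper and spec tuple below are hypothetical), the core idea is to group layers by their KV cache spec instead of assuming one uniform spec for every attention layer:

```python
from collections import defaultdict


def group_layers_by_kv_spec(kv_specs: dict[str, tuple]) -> dict[tuple, list[str]]:
    """Map each distinct KV cache spec to the layer names that use it."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for layer_name, spec in kv_specs.items():
        groups[spec].append(layer_name)
    return groups


# Hypothetical Baichuan-M1-like layout: full attention with 2 KV heads and
# sliding-window attention with 8 KV heads land in two separate groups, each
# of which then needs its own page-size and memory accounting.
specs = {
    "layers.0.attn": ("full", 2, 128),
    "layers.1.attn": ("swa", 8, 128),
}
print(group_layers_by_kv_spec(specs))
```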
Over the past few days, we attempted to extend this function and modified our model's forward process, successfully running our model on vLLM v1 in simple cases. However, we encountered new challenges when max_num_batched_tokens > 2048 that appear difficult to resolve. In such cases, chunked prefill and cascade attention come into effect, and our model has a sliding_window_size, which is not supported by cascade attention. Additionally, similar to Mamba, our model is stateful: we use tokens from previous time steps to smooth our key and value during attention computation, and we treat the prefill and decode phases differently. In summary, supporting our model in v1 seems considerably more complicated than in v0, so v0 might be the better choice for now.
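For illustration only, one way a "smooth key/value with previous tokens" step can be expressed is a short causal mixing window over the key sequence; the exact smoothing used by Baichuan-M1 may differ, but the sketch shows why prefill (whole sequence available) and decode (one new token plus cached state) need separate code paths:

```python
import torch
import torch.nn.functional as F


def smooth_keys_prefill(k: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Prefill path. k: [seq_len, heads, head_dim]; weight: [window] mixing coefficients."""
    window = weight.shape[0]
    # Left-pad along the sequence dim so each position only mixes with earlier tokens.
    k_padded = F.pad(k, (0, 0, 0, 0, window - 1, 0))
    out = torch.zeros_like(k)
    for i in range(window):
        # weight[-1] multiplies the current token, earlier weights the preceding ones.
        out += weight[i] * k_padded[i: i + k.shape[0]]
    return out


def smooth_keys_decode(k_new: torch.Tensor, k_prev: torch.Tensor,
                       weight: torch.Tensor) -> torch.Tensor:
    """Decode path: mix the single new key with cached keys from previous steps."""
    window = weight.shape[0]
    hist = torch.cat([k_prev, k_new.unsqueeze(0)], dim=0)[-window:]
    return (weight[-hist.shape[0]:, None, None] * hist).sum(dim=0)
```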
This pull request adds the necessary support to the vLLM framework for the Baichuan-M1 model.
HuggingFace pages:
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct
The Baichuan-M1 (M stands for medicine) model is a medically enhanced general-purpose large model, designed to deliver exceptional performance in healthcare applications while maintaining strong general capabilities. This update ensures that vLLM can seamlessly handle inference for the Baichuan-M1 model, providing both compatibility and optimal performance for a wide range of natural language processing tasks, especially in the medical domain.
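As a quick smoke test once this PR is merged, the model should be runnable through vLLM's standard offline LLM API. The model name below comes from the HuggingFace pages linked above; trust_remote_code is assumed to be needed for the custom code shipped with the checkpoint.

```python
from vllm import LLM, SamplingParams

# Load the instruct checkpoint; trust_remote_code may be required for the
# custom tokenizer/config code bundled with the model repository.
llm = LLM(model="baichuan-inc/Baichuan-M1-14B-Instruct", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["What are the common symptoms of iron deficiency?"], params)
print(outputs[0].outputs[0].text)
```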