[Model] Enable Inference Support for the New Baichuan-M1 Model #12251
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
LGTM
@youkaichao @zhuohan123 @DarkLight1337 @WoosukKwon
Signed-off-by: dangshunya <[email protected]>
ping @youkaichao @DarkLight1337 @njhill @comaniac @zhuohan123 @WoosukKwon @alexm-redhat
The model itself LGTM, but I'm not so sure about the custom KV cache. Is anyone else familiar with this part of the code?
Regarding the SWA, can we minimize the code change for now by adopting #10584? Meanwhile, we will work on refactoring the memory manager in #11382 by @heheda12345.
Because the KV cache used by ordinary layers and SWA layers is inconsistent (we have 2 KV heads in normal attention but 8 KV heads in SWA), we cannot simply treat them the same way as in #10584; instead, we need to calculate the memory usage for each separately.
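To make the head-count mismatch concrete, here is a minimal sketch (not the actual vLLM memory profiler; layer counts and head dimensions below are illustrative, not taken from the model config) of why the KV cache budget has to be summed per layer rather than derived from a single uniform layer shape:

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    num_kv_heads: int      # e.g. 2 for full attention, 8 for SWA
    head_dim: int
    dtype_bytes: int = 2   # fp16/bf16


def kv_bytes_per_block(layers: list[LayerSpec], block_size: int) -> int:
    """Sum the per-layer KV cache cost instead of assuming all layers share one shape."""
    total = 0
    for layer in layers:
        # 2x for the key and value tensors of each layer
        total += 2 * block_size * layer.num_kv_heads * layer.head_dim * layer.dtype_bytes
    return total


# Illustrative mix: some full-attention layers with 2 KV heads,
# some sliding-window layers with 8 KV heads.
layers = [LayerSpec(num_kv_heads=2, head_dim=128)] * 20 + \
         [LayerSpec(num_kv_heads=8, head_dim=128)] * 20
print(kv_bytes_per_block(layers, block_size=16))
```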
For the vLLM v1 engine, you can support normal attention with different hidden sizes by extending this function: vllm/vllm/v1/core/kv_cache_utils.py (line 410 in bf21481).
Then you can try #10584 in v1 to support the mix of normal attention and SWA. If that works, we can raise an error asking users to enable the v1 engine if they try to run this model without vLLM v1.
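As a rough sketch of the direction being suggested (the real extension point is the function referenced above in vllm/v1/core/kv_cache_utils.py, whose actual signature may differ; the helper and spec tuple below are hypothetical), the core idea is to group layers by their KV cache spec instead of assuming one uniform spec for every attention layer:

```python
from collections import defaultdict


def group_layers_by_kv_spec(kv_specs: dict[str, tuple]) -> dict[tuple, list[str]]:
    """Map each distinct KV cache spec to the layer names that use it."""
    groups: dict[tuple, list[str]] = defaultdict(list)
    for layer_name, spec in kv_specs.items():
        groups[spec].append(layer_name)
    return groups


# Hypothetical Baichuan-M1-like layout: full attention with 2 KV heads and
# sliding-window attention with 8 KV heads land in two separate groups, each
# of which then needs its own page-size and memory accounting.
specs = {
    "layers.0.attn": ("full", 2, 128),
    "layers.1.attn": ("swa", 8, 128),
}
print(group_layers_by_kv_spec(specs))
```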
Over the past few days, we attempted to extend this function and modified our model's forward process, successfully running our model on vLLM v1 in simple cases. However, we encountered new challenges when max_num_batched_tokens > 2048 that appear difficult to resolve. In such cases, chunked prefill and cascade attention come into effect, and our model has a sliding_window_size, which is not supported by cascade attention. Additionally, similar to Mamba, our model is stateful: we use tokens from previous time steps to smooth our key and value during attention computation, and we treat the prefill and decode phases differently. In summary, supporting our model in v1 seems considerably more complicated than in v0, so v0 might be the better choice for now.
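For illustration only, one way a "smooth key/value with previous tokens" step can be expressed is a short causal mixing window over the key sequence; the exact smoothing used by Baichuan-M1 may differ, but the sketch shows why prefill (whole sequence available) and decode (one new token plus cached state) need separate code paths:

```python
import torch
import torch.nn.functional as F


def smooth_keys_prefill(k: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Prefill path. k: [seq_len, heads, head_dim]; weight: [window] mixing coefficients."""
    window = weight.shape[0]
    # Left-pad along the sequence dim so each position only mixes with earlier tokens.
    k_padded = F.pad(k, (0, 0, 0, 0, window - 1, 0))
    out = torch.zeros_like(k)
    for i in range(window):
        # weight[-1] multiplies the current token, earlier weights the preceding ones.
        out += weight[i] * k_padded[i: i + k.shape[0]]
    return out


def smooth_keys_decode(k_new: torch.Tensor, k_prev: torch.Tensor,
                       weight: torch.Tensor) -> torch.Tensor:
    """Decode path: mix the single new key with cached keys from previous steps."""
    window = weight.shape[0]
    hist = torch.cat([k_prev, k_new.unsqueeze(0)], dim=0)[-window:]
    return (weight[-hist.shape[0]:, None, None] * hist).sum(dim=0)
```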
This pull request adds the necessary support to the vLLM framework for the Baichuan-M1 model.
HuggingFace pages:
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base
https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Instruct
The Baichuan-M1 (M stands for medicine) model is a medically enhanced general-purpose large model, designed to deliver exceptional performance in healthcare applications while maintaining strong general capabilities. This update ensures that vLLM can seamlessly handle inference for the Baichuan-M1 model, providing both compatibility and optimal performance for a wide range of natural language processing tasks, especially in the medical domain.
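As a quick smoke test once this PR is merged, the model should be runnable through vLLM's standard offline LLM API. The model name below comes from the HuggingFace pages linked above; trust_remote_code is assumed to be needed for the custom code shipped with the checkpoint.

```python
from vllm import LLM, SamplingParams

# Load the instruct checkpoint; trust_remote_code may be required for the
# custom tokenizer/config code bundled with the model repository.
llm = LLM(model="baichuan-inc/Baichuan-M1-14B-Instruct", trust_remote_code=True)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["What are the common symptoms of iron deficiency?"], params)
print(outputs[0].outputs[0].text)
```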