
[Model] Add LongCat-Flash #3833

Merged
wangxiyuan merged 13 commits into vllm-project:main from chuyuelin:longcat_flash
Dec 31, 2025

Conversation

@chuyuelin
Contributor

@chuyuelin chuyuelin commented Oct 28, 2025

What this PR does / why we need it?

Add LongCat-Flash support.

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

CI passed

@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message that fulfills the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the LongCat-Flash model. The changes span across attention mechanisms, Mixture-of-Experts (MoE) layers, quantization configurations, and the model runner to accommodate the specific architecture of this new model. My review focuses on ensuring the changes are robust and maintainable. I've identified a critical issue related to a side effect on a shared configuration object, which could lead to unpredictable behavior, and a high-severity issue regarding brittle string parsing for layer indexing.

Comment thread vllm_ascend/models/layers/mla.py (outdated)
Comment on lines 85 to 90

```python
if hf_config.model_type == "longcat_flash":
    self.debug_layer_idx = int(self.prefix.split(".")[2])
    hf_config.first_k_dense_replace = 0
else:
    self.debug_layer_idx = int(self.prefix.split(".")[-2])
self.first_k_dense_replace = hf_config.first_k_dense_replace
```
Contributor

critical

Modifying the shared hf_config object directly is a dangerous side effect that can lead to unpredictable behavior in other parts of the application. Configuration objects should be treated as immutable within model layers.

To fix this, you should set the self.first_k_dense_replace attribute based on the condition, without altering hf_config.

Suggested change

Before:

```python
if hf_config.model_type == "longcat_flash":
    self.debug_layer_idx = int(self.prefix.split(".")[2])
    hf_config.first_k_dense_replace = 0
else:
    self.debug_layer_idx = int(self.prefix.split(".")[-2])
self.first_k_dense_replace = hf_config.first_k_dense_replace
```

After:

```python
if hf_config.model_type == "longcat_flash":
    self.debug_layer_idx = int(self.prefix.split(".")[2])
    self.first_k_dense_replace = 0
else:
    self.debug_layer_idx = int(self.prefix.split(".")[-2])
    self.first_k_dense_replace = hf_config.first_k_dense_replace
```
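To illustrate why mutating a shared config object is risky, here is a minimal standalone sketch (class and attribute names mirror the reviewed code, but `SimpleNamespace` stands in for the real HuggingFace config object):

```python
from types import SimpleNamespace

# Hypothetical stand-in for a HuggingFace config shared by many layers.
hf_config = SimpleNamespace(model_type="longcat_flash", first_k_dense_replace=1)

class MutatingLayer:
    def __init__(self, cfg):
        # BAD: writes through to the shared object...
        if cfg.model_type == "longcat_flash":
            cfg.first_k_dense_replace = 0
        self.first_k_dense_replace = cfg.first_k_dense_replace

class ReadingLayer:
    def __init__(self, cfg):
        # ...so any layer constructed afterwards silently sees the new value.
        self.first_k_dense_replace = cfg.first_k_dense_replace

a = MutatingLayer(hf_config)
b = ReadingLayer(hf_config)
print(b.first_k_dense_replace)  # 0 — the original value 1 is gone for everyone
```

Setting only `self.first_k_dense_replace`, as in the suggested change, keeps the override local to the layer that needs it.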

Comment thread vllm_ascend/models/layers/mla.py Outdated
).enable_shared_expert_dp
self.debug_layer_idx = int(self.prefix.split(".")[-2])
if hf_config.model_type == "longcat_flash":
self.debug_layer_idx = int(self.prefix.split(".")[2])
Contributor

high

Using a hardcoded index [2] to parse the layer index from the prefix string is brittle and assumes a fixed prefix structure (e.g., model.layers.{idx}.<...>). This can easily break if the model's naming convention or the prefix structure changes. A more robust approach would be to use regular expressions or a more structured method to extract the layer index, which would make the code more resilient to future changes.
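A minimal sketch of the regex-based approach suggested above (the helper name `layer_index_from_prefix` is hypothetical, not part of the PR):

```python
import re

def layer_index_from_prefix(prefix: str) -> int:
    """Extract the layer index from a module prefix such as
    'model.layers.3.self_attn', regardless of how deeply the
    prefix is nested before or after the 'layers.N' segment."""
    match = re.search(r"\blayers\.(\d+)\b", prefix)
    if match is None:
        raise ValueError(f"no layer index found in prefix: {prefix!r}")
    return int(match.group(1))

print(layer_index_from_prefix("model.layers.3.self_attn"))      # 3
print(layer_index_from_prefix("longcat.model.layers.12.mla"))   # 12
```

Unlike `prefix.split(".")[2]`, this keeps working if components are added or removed around the `layers.N` segment, and it fails loudly with a clear error instead of an `IndexError` or a silently wrong index.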

Signed-off-by: chuyuelin <923822139@qq.com>
@Angazenn Angazenn added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 29, 2025
Comment thread vllm_ascend/ops/fused_moe/experts_selector.py Outdated
Comment thread vllm_ascend/models/layers/mla.py Outdated
@github-actions
Contributor

github-actions bot commented Nov 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

…flash

# Conflicts:
#	vllm_ascend/worker/model_runner_v1.py
Signed-off-by: chuyuelin <923822139@qq.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

# Conflicts:
#	vllm_ascend/ops/fused_moe/experts_selector.py
#	vllm_ascend/ops/fused_moe/fused_moe.py
#	vllm_ascend/ops/rotary_embedding.py
#	vllm_ascend/quantization/w8a8.py
#	vllm_ascend/quantization/w8a8_dynamic.py
#	vllm_ascend/worker/model_runner_v1.py
@chuyuelin chuyuelin force-pushed the longcat_flash branch 2 times, most recently from 83db2dd to 291dd0a on December 25, 2025 03:17
Signed-off-by: chuyuelin <923822139@qq.com>
@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@chuyuelin chuyuelin force-pushed the longcat_flash branch 2 times, most recently from 309571e to c7d7174 on December 30, 2025 03:21
# Conflicts:
#	vllm_ascend/attention/mla_v1.py
#	vllm_ascend/ops/fused_moe/fused_moe.py

Signed-off-by: chuyuelin <923822139@qq.com>
Signed-off-by: chuyuelin <923822139@qq.com>
@wangxiyuan wangxiyuan merged commit d07d8a4 into vllm-project:main Dec 31, 2025
19 checks passed
@chuyuelin chuyuelin deleted the longcat_flash branch January 4, 2026 01:25
wjunLu pushed a commit to wjunLu/vllm-ascend that referenced this pull request Jan 4, 2026
### What this PR does / why we need it?
Add LongCat-Flash support.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed

- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: chuyuelin <923822139@qq.com>
Co-authored-by: chuyuelin <chuyuelin1@huawei.com>
Signed-off-by: wjunLu <wjunlu217@gmail.com>
Rozwel-dx pushed a commit to Rozwel-dx/vllm-ascend that referenced this pull request Jan 8, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026

Labels

module:ops, module:quantization, ready (read for review), ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants