[Feature]Supports DSv3.1 PD separation and C8 quantization #7222
zzzzwwjj merged 16 commits into vllm-project:main
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates DSv3.1 C8 quantization, a significant enhancement for running large language models on Ascend NPUs. The changes span the attention mechanism, distributed KV cache management, and model loading infrastructure, ensuring that quantized models can leverage specialized hardware operations for improved efficiency and a reduced memory footprint without compromising accuracy.
Code Review
This pull request introduces support for DSv3.1 C8 quantization, specifically for attention layers and KV cache. The changes are comprehensive, integrating the new quantization method across various components including attention implementation, KV transfer mechanisms, weight loading utilities, and quantization configuration. The implementation includes new data structures, conditional logic for new NPU operators, and specific weight processing for quantized tensors. The approach seems well-structured, ensuring that the new quantization is applied dynamically and correctly without impacting existing functionalities. The code adheres to the specified requirements for enabling and managing this new feature.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
/gemini review
Code Review
This pull request introduces support for C8 quantization and prefill/decode separation, which involves extensive changes across attention mechanisms, KV cache transfer, and quantization configurations. The core logic for quantized attention is implemented in vllm_ascend/attention/mla_v1.py, utilizing a new version of the fused attention kernel. A new quantization method, AscendFAQuantAttentionMethod, is also added. While the changes are comprehensive, I've identified a couple of critical issues. One is a potential ValueError due to an unsafe string-to-integer conversion. The other is a logic error where a decode-specific quantization scale is incorrectly passed to the prefill function, while the decode function call is missing this necessary argument. Addressing these issues is crucial for the correctness of the new quantization feature.
```python
prefill_preprocess_res.value,
kv_cache,
attn_metadata,
decode_preprocess_res.dequant_scale_q_nope,
```
There seems to be a mis-wiring of the `dequant_scale_q_nope` argument. This scale is generated during decode preprocessing and should be used in the decode path, not the prefill path.

- This argument should be removed from the `_forward_prefill` call.
- It should be passed to the `_forward_decode` call around line 1655, which is currently missing it. The `_forward_decode` function signature has been updated to accept it, but the call site has not been updated.
```python
id = "".join(re.findall(r"\.(\d+)\.", layer_name))
if int(id) in quant_config.kvcache_quant_layers:
    fa_quant_layer = True
```
If `re.findall` returns an empty list, `id` will be an empty string and `int(id)` will raise a `ValueError`. This can happen if `layer_name` does not contain a number between dots. It's safer to check that a valid ID was found before converting to an integer. Also, using `id` as a variable name shadows the built-in function.
Suggested change:

```python
layer_id_str = "".join(re.findall(r"\.(\d+)\.", layer_name))
if layer_id_str.isdigit():
    if int(layer_id_str) in quant_config.kvcache_quant_layers:
        fa_quant_layer = True
```
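As a minimal, self-contained illustration of the failure mode and the suggested guard (`kvcache_quant_layers` and `is_fa_quant_layer` are hypothetical stand-ins here, not the actual config object):

```python
import re

# Hypothetical stand-in for quant_config.kvcache_quant_layers.
kvcache_quant_layers = [0, 1, 2]

def is_fa_quant_layer(layer_name: str) -> bool:
    # Extract the numeric layer index, e.g. "model.layers.2.self_attn" -> "2".
    layer_id_str = "".join(re.findall(r"\.(\d+)\.", layer_name))
    # Guard against names with no ".<digits>." segment, where int("") would raise.
    if layer_id_str.isdigit():
        return int(layer_id_str) in kvcache_quant_layers
    return False

print(is_fa_quant_layer("model.layers.2.self_attn"))  # True
print(is_fa_quant_layer("lm_head"))                   # False, no ValueError
```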
```python
)

# Load cache data into buffers
torch_npu.atb.npu_paged_cache_load(
```
The atb operator may report an error in A5.
The atb operator needs to be migrated to the vllm-ascend custom ops and adapted to A5.
```python
actual_seq_qlen=actual_seq_lengths,
workspace=graph_params.workspaces.get(num_tokens),
out=[attn_output, softmax_lse],
)
```
Maybe v1 can be replaced with v2 here.
```python
# ordering expected by graph parameter update logic in attention backends.
mamba_layers: dict[str, MambaBase] = {}

def dtype_to_bytes(dtype: torch.dtype) -> int:
```
Please move this function to `utils`.
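For context, a minimal sketch of what such a helper could look like in `utils`. This illustration is keyed by dtype name and the mapping below is an assumption, not the actual implementation; the real function takes a `torch.dtype`, where `torch.tensor([], dtype=dtype).element_size()` would give the same answer:

```python
# Illustrative mapping from dtype name to per-element size in bytes;
# the helper under review takes a torch.dtype instead of a string.
_DTYPE_BYTES = {
    "int8": 1,
    "float16": 2,
    "bfloat16": 2,
    "float32": 4,
    "int32": 4,
}

def dtype_to_bytes(dtype_name: str) -> int:
    """Return the number of bytes one element of the given dtype occupies."""
    try:
        return _DTYPE_BYTES[dtype_name]
    except KeyError:
        raise ValueError(f"unsupported dtype: {dtype_name}") from None

print(dtype_to_bytes("bfloat16"))  # 2
```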
```python
self.kv_send_layer_thread.send_queue.put(layer_send_task)
self.current_layer += 1

def trans_nd_to_nz(self, cache_tensor: torch.Tensor, layer_group_idx: int):
```
Please add a docstring for this function, and maybe it should also live in `utils`?
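For readers unfamiliar with the ND/NZ distinction, here is an illustrative NumPy sketch of the kind of tiling such a conversion performs. This assumes the common Ascend FRACTAL_NZ layout with 16x16 fractal blocks, i.e. `(m, n) -> (n/n0, m/m0, m0, n0)`; the actual `trans_nd_to_nz` operates on torch cache tensors on the NPU and may differ in shape handling:

```python
import numpy as np

# Sketch only: converts a 2-D ND tensor into an NZ-style tiled layout,
# assuming 16x16 fractal tiles (an assumption, not the verified kernel layout).
def trans_nd_to_nz(x: np.ndarray, m0: int = 16, n0: int = 16) -> np.ndarray:
    m, n = x.shape
    assert m % m0 == 0 and n % n0 == 0, "dims must be tile-aligned"
    # (m, n) -> (m1, m0, n1, n0) -> (n1, m1, m0, n0)
    return x.reshape(m // m0, m0, n // n0, n0).transpose(2, 0, 1, 3)

x = np.arange(32 * 32).reshape(32, 32)
nz = trans_nd_to_nz(x)
print(nz.shape)  # (2, 2, 16, 16)
```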
```python
)

# Load cache data into buffers
torch_npu.atb.npu_paged_cache_load(
```
The atb operator needs to be migrated to the vllm-ascend custom ops and adapted to A5.
```python
)
buffer_list.append(self.k_buffer)
buffer_list.append(self.v_buffer)
if self.enable_kv_quant:
```
Since future GQA models will also use C8 quantization, the buffer can be allocated based on the input dtype, without the need for additional branches.
Add a TODO and refactor it in the next PR.
```python
for group_remote_block_id, group_local_block_id in zip(
    grouped_remote_block_ids, grouped_local_block_ids

# kv cache quantization scenario
if self.enable_kv_quant and send_task.k_quant_cache is not None:
```
This branch behaves similarly to branches where pd_head_ratio > 1. Can these be simplified into the same code segment?
```python
@@ -2678,80 +2666,73 @@ def _allocate_kv_cache_tensors(self, kv_cache_config: KVCacheConfig)
# For deepseek mla, we need to split the cache tensor according to the nope head dim
# and rope head dim.
if self.model_config.use_mla:
```
@MengqingCao I think we should refactor the initialize_kvcache_tensors function to reduce unnecessary branching.
```python
self.query_lens: torch.Tensor | None = None
self.cpu_slot_mapping = None
self.sampling_done_event: torch.npu.Event | None = None
self.kvbytes = {}
```
These optional parameters can be removed.
…to fa_0313 # Conflicts: # vllm_ascend/quantization/modelslim_config.py # vllm_ascend/worker/model_runner_v1.py
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
…to fa_0313 # Conflicts: # vllm_ascend/patch/__init__.py # vllm_ascend/worker/model_runner_v1.py
…to fa_0313 # Conflicts: # vllm_ascend/worker/model_runner_v1.py
```python
setattr(layer, name, torch.nn.Module())
params_dict = {}
dtype = torch.get_default_dtype()
layer.num_kv_heads = 1
```
The `layer.num_kv_heads` parameter does not need to be assigned.
Co-authored-by: kunpengW-code 1289706727@qq.com
Co-authored-by: linsheng1 1950916997@qq.com
What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8 supports only the PD separation scenario. C8 refers to quantizing the KV cache to int8, which aims to reduce the GPU memory usage of the KV cache and improve the inference throughput.
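For intuition, the static int8 KV cache quantization that C8 refers to can be sketched as a scale-based round-trip. This is an illustration only, with a made-up per-tensor scale; the real path uses calibration-derived scales and fused NPU kernels:

```python
import numpy as np

# Illustrative static int8 (C8-style) KV cache quantization round-trip.
# The scale here is a hypothetical calibration result, not a real model value.
def quantize_kv(kv: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(kv / scale), -128, 127).astype(np.int8)

def dequantize_kv(kv_q: np.ndarray, scale: float) -> np.ndarray:
    return kv_q.astype(np.float32) * scale

kv = np.array([0.5, -1.25, 3.0], dtype=np.float32)
scale = 0.05
kv_q = quantize_kv(kv, scale)        # int8 cache: 4x smaller than float32
kv_hat = dequantize_kv(kv_q, scale)  # approximate reconstruction at read time
```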
Constraints:

1. Only the PD separation mode can be used, and MooncakeLayerwiseConnector can be used to run the model.
2. Currently, only the activation value supports dynamic quantization; the KV cache supports static quantization. C8 quantization with MTP is not supported.

You can use ModelSlim for quantization. The quantization procedure is as follows:
```bash
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
```
Does this PR introduce any user-facing change?
no
How was this patch tested?

- vLLM version: v0.17.0
- vLLM main: vllm-project/vllm@4034c3d