[Feature] add DFlash Support#8118
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive support for DFlash, a parallel speculative decoding algorithm, within the vllm-ascend framework. It involves significant modifications to the attention mechanism to support bidirectional attention, the addition of a dedicated DFlash proposer, and the integration of a new Triton kernel for efficient input processing. These changes enable vllm-ascend to leverage DFlash for improved inference performance on Ascend NPUs, aligning with recent upstream vLLM developments.
Code Review
This pull request implements DFlash speculative decoding for Ascend, introducing the AscendDflashProposer, a Triton kernel for input expansion, and patches for DFlashQwen3Model to enable fused KV precomputation. It also updates the attention mechanism to support non-causal sequences. Feedback was provided to refactor the attention forward pass to eliminate code duplication and improve maintainability.
Suggested PR Title:
[Attention][Feature] Implement DFlash speculative decoding support

Suggested PR Summary:
### What this PR does / why we need it?
This PR implements DFlash speculative decoding for Ascend, introducing the `AscendDflashProposer`, a Triton kernel for input expansion, and patches for `DFlashQwen3Model` to enable fused KV precomputation. It also updates the attention mechanism to support non-causal sequences. Feedback was provided to refactor the attention forward pass to eliminate code duplication and improve maintainability.
### Does this PR introduce _any_ user-facing change?
Yes, it adds 'dflash' as a speculative decoding method.
### How was this patch tested?
The changes were integrated into the speculative decoding framework.

```python
if not attn_metadata.causal:
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        query=query,
        key=key,
        value=value,
        block_table=block_table,
        input_layout="TND",
        block_size=block_size,
        actual_seq_lengths=attn_metadata.actual_seq_lengths_q,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        num_key_value_heads=self.num_kv_heads,
        num_heads=self.num_heads,
        scale=self.scale,
        sparse_mode=0,
    )
else:
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        query=query,
        key=key,
        value=value,
        atten_mask=attn_metadata.attn_mask,
        block_table=block_table,
        input_layout="TND",
        block_size=block_size,
        actual_seq_lengths=attn_metadata.actual_seq_lengths_q,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        num_key_value_heads=self.num_kv_heads,
        num_heads=self.num_heads,
        scale=self.scale,
        sparse_mode=3,
    )
```
There is significant code duplication between the `if` and `else` blocks. This makes the code harder to maintain, as changes to the arguments of `torch_npu.npu_fused_infer_attention_score` must be applied in two places, increasing the risk of introducing bugs.
To improve maintainability, you can refactor the common arguments into a dictionary.
```python
common_args = {
    "query": query,
    "key": key,
    "value": value,
    "block_table": block_table,
    "input_layout": "TND",
    "block_size": block_size,
    "actual_seq_lengths": attn_metadata.actual_seq_lengths_q,
    "actual_seq_lengths_kv": actual_seq_lengths_kv,
    "num_key_value_heads": self.num_kv_heads,
    "num_heads": self.num_heads,
    "scale": self.scale,
}
if not attn_metadata.causal:
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        **common_args,
        sparse_mode=0,
    )
else:
    attn_output, _ = torch_npu.npu_fused_infer_attention_score(
        **common_args,
        atten_mask=attn_metadata.attn_mask,
        sparse_mode=3,
    )
```
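As a self-contained illustration of the suggested pattern, here is a minimal sketch with a hypothetical stub in place of `torch_npu.npu_fused_infer_attention_score` (the real kernel requires Ascend hardware, and the tensor arguments are reduced to placeholder strings):

```python
# Stub standing in for torch_npu.npu_fused_infer_attention_score;
# it simply echoes the keyword arguments it would have received.
def fused_infer_attention_score_stub(**kwargs):
    return kwargs, None

def run_attention(causal: bool, attn_mask=None):
    # Arguments shared by both branches live in one dict.
    common_args = {
        "query": "q",
        "key": "k",
        "value": "v",
        "input_layout": "TND",
    }
    if not causal:
        out, _ = fused_infer_attention_score_stub(**common_args, sparse_mode=0)
    else:
        out, _ = fused_infer_attention_score_stub(
            **common_args, atten_mask=attn_mask, sparse_mode=3
        )
    return out

non_causal_kwargs = run_attention(causal=False)
causal_kwargs = run_attention(causal=True, attn_mask="mask")
```

Only `sparse_mode` and the optional `atten_mask` differ between the two calls, so any future change to the shared arguments is made in one place.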
This pull request has conflicts, please resolve those before we can evaluate the pull request.
**This PR is inherited from PR-[7162](vllm-project#7162) and supports the latest vllm-ascend main. The old version is closed.**

### Purpose

**We first supported DFlash on Ascend-NPU and will continue to maintain it.**

> DFlash ("[DFlash: Block Diffusion for Flash Speculative Decoding](https://arxiv.org/abs/2602.06036)") is a parallel speculative decoding algorithm that generates multiple candidate tokens at once through a diffusion process.

Main changes:

- Corresponds to the official vLLM support merged in PR-[36847](vllm-project/vllm#36847).
- Add a DFlash proposer implementation on the basis of `SpecDecodeBaseProposer`.
- Modify the attention backend and add a bidirectional attention branch.
- Modify `model_runner_v1` to support calling the DFlash module.

### Quick Start

[!Attention!]
As of April 10, vllm-ascend is not compatible with the vLLM version that supports DFlash, so a cherry-pick is required:

`cd vllm`
`git checkout -b new-branch v0.19.0`
`git cherry-pick dc14cbf0c06e8a124bdf0c03e8e267feef60887e`

[Weights] Use the official DFlash [weights](https://huggingface.co/collections/z-lab/dflash).

[Config] `--speculative-config '{"num_speculative_tokens": 8, "method":"dflash","model":"weight_path","enforce_eager": true}'`

### Test Results

#### Acceptance rate

Verified against the SGLang (GPU) and vLLM (GPU) versions of Qwen3-8B-DFlash-b16 on the GSM8K dataset. 

T=0, Draft Tokens = 16, Max Tokens = 2048

| Batch Size | Framework | Mean Acceptance Length |
|-----|-----|-----|
| 4 | SGLang | 6.07 |
| 4 | vLLM | 6.08 |
| 4 | vLLM-Ascend | 6.05 |
| 8 | SGLang | 6.07 |
| 8 | vLLM | 6.08 |
| 8 | vLLM-Ascend | 6.06 |
| 16 | SGLang | 6.08 |
| 16 | vLLM | 6.08 |
| 16 | vLLM-Ascend | 6.08 |
| 32 | SGLang | 6.08 |
| 32 | vLLM | 6.09 |
| 32 | vLLM-Ascend | 6.08 |

#### Performance

Qwen3-8B, DP1/TP1, GSM8K inputs repeated to input length 3.5K, output length 1.5K, 400 samples, batch_size 16, temperature 0.

| Method | Graph Mode | Spec Num | Mean Acceptance Length | TPOT (ms) | Output Token Throughput (token/s) |
|-----|-----|-----|-----|-----|-----|
| Eagle3 | FULL_DECODE_ONLY | 3 | 2.81 | 16.4 | 943.60 (baseline) |
| Eagle3 | FULL_DECODE_ONLY | 8 | 3.60 | 19.5 | 795.34 (↓15.7%) |
| DFlash | PIECEWISE | 8 | 5.25 | 12.4 | 1248.93 (↑32.4%) |

#### Accuracy

Qwen3-8B, DP1/TP1, output length 3.5K, 300 samples, batch_size 16, temperature 0.

| Method | Graph Mode | Spec Num | Dataset | Accuracy (%) |
|-----|-----|-----|-----|-----|
| Eagle3 | FULL_DECODE_ONLY | 3 | gsm8k | 84.67 |
| DFlash | PIECEWISE | 8 | gsm8k | 85.00 |

### Next Plan

- Support FULL_DECODE_ONLY.
- Support Qwen3.5.
- The NPU Triton multi-core path is faulty; currently a single core processes all requests, which needs to be improved.
- Operator optimization: the FIA operator currently supports at most 16 entries in the TND layout, so the maximum spec_num is 15. This limit can be bypassed, but at a performance cost.
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.19.0

Signed-off-by: chenaoxuan <cax1165@163.com>
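The `--speculative-config` value shown in the Quick Start is a JSON string, so it can also be generated programmatically. A small sketch building that exact flag (`weight_path` is the placeholder from the PR description, not a real path):

```python
import json

# Speculative-decoding settings from the Quick Start section.
spec_config = {
    "num_speculative_tokens": 8,
    "method": "dflash",
    "model": "weight_path",  # placeholder for the DFlash draft-model weights
    "enforce_eager": True,
}

# Serialize to the JSON string passed on the command line.
flag_value = json.dumps(spec_config)
cli_flag = f"--speculative-config '{flag_value}'"
print(cli_flag)
```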