[Feat] Refactor rejection sampler#4975
Conversation
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
This pull request refactors the rejection sampler by removing the custom AscendRejectionSampler and using the upstream vllm.RejectionSampler with monkey-patching for Ascend-specific functions. This is a good change that reduces code duplication and aligns with the upstream implementation. The code is well-refactored into smaller, more readable methods like _sample and _bookkeeping_sync. My main concern is a potential issue with the log probability calculation for speculative decoding. The logic for handling logprobs seems incomplete, which might lead to incorrect results when logprobs are requested.
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
…mpler-rebase # Conflicts: # vllm_ascend/worker/model_runner_v1.py
| list[int], | ||
| ]: | ||
| num_reqs = self.input_batch.num_reqs | ||
| discard_sampled_tokens_req_indices = np.nonzero( |
There was a problem hiding this comment.
can some attrs like discard_requests_mask be deleted in prepare_inputs?
There was a problem hiding this comment.
we need to implement vllm-project/vllm#28597 first.
|
This pull request has conflicts, please resolve those before we can evaluate the pull request. |
…mpler-rebase # Conflicts: # vllm_ascend/worker/model_runner_v1.py
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
| @@ -33,3 +33,4 @@ | |||
| import vllm_ascend.patch.worker.patch_qwen3_vl # noqa | |||
There was a problem hiding this comment.
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality: He has completed 80+reviews since April. 2025, include #3232 (comment), #4822 (comment), #4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution: #1229 #1979 #4359 #4878 - Community Involvement: He lead the #1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include #4868 (comment), #2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - #3334 - #3420 - #3015 co-author: - #3495 - #4868 - Community Involvement: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](#2867) and [rejection sampler](#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](#4345 (comment)), [issuecomment-3540994801](#4161 (comment)), and [discussion_r2492593988](#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both vLLM and vLLM Ascend codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer #1568 #2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack #2913 #3350 - Quality Contribution: #1568 #2602 #2913 #3350 - Community Involvement: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it? Currently, we are using `AscendRejctionSampler` that extends from `RejctionSampler` in spec decoding. `AscendRejctionSampler` override `forward` of `RejctionSampler`, only aming to replace `rejection_sample` func. This causes a lot of code of `RejctionSampler` cannot be reused, for example: - vllm-project/vllm#19482 - vllm-project/vllm#26060 - vllm-project/vllm#29223 #### Proposed Change: - Delete `AscendRejctionSampler` and use `RejctionSampler` directly in model runner. - Patch `RejctionSampler.expand_batch_to_tokens` and `RejctionSampler.rejection_sample`, maybe a better way is to make them as custom ops. - Modify `NPUModelRunner` following vllm-project/vllm#26060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - [x] test logits processor for spec decoding - [x] test logprobs for spec decoding - [x] test logprobs for spec decoding + async shcheduling (test with vllm-project#4893) - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com>
…t#5152) I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality: He has completed 80+reviews since April. 2025, include vllm-project#3232 (comment), vllm-project#4822 (comment), vllm-project#4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution: vllm-project#1229 vllm-project#1979 vllm-project#4359 vllm-project#4878 - Community Involvement: He lead the vllm-project#1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include vllm-project#4868 (comment), vllm-project#2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - vllm-project#3334 - vllm-project#3420 - vllm-project#3015 co-author: - vllm-project#3495 - vllm-project#4868 - Community Involvement: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](vllm-project#2867) and [rejection sampler](vllm-project#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](vllm-project#4345 (comment)), [issuecomment-3540994801](vllm-project#4161 (comment)), and [discussion_r2492593988](vllm-project#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both vLLM and vLLM Ascend codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer vllm-project#1568 vllm-project#2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack vllm-project#2913 vllm-project#3350 - Quality Contribution: vllm-project#1568 vllm-project#2602 vllm-project#2913 vllm-project#3350 - Community Involvement: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it? Currently, we are using `AscendRejctionSampler` that extends from `RejctionSampler` in spec decoding. `AscendRejctionSampler` override `forward` of `RejctionSampler`, only aming to replace `rejection_sample` func. This causes a lot of code of `RejctionSampler` cannot be reused, for example: - vllm-project/vllm#19482 - vllm-project/vllm#26060 - vllm-project/vllm#29223 #### Proposed Change: - Delete `AscendRejctionSampler` and use `RejctionSampler` directly in model runner. - Patch `RejctionSampler.expand_batch_to_tokens` and `RejctionSampler.rejection_sample`, maybe a better way is to make them as custom ops. - Modify `NPUModelRunner` following vllm-project/vllm#26060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - [x] test logits processor for spec decoding - [x] test logprobs for spec decoding - [x] test logprobs for spec decoding + async shcheduling (test with vllm-project#4893) - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…t#5152) I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality: He has completed 80+reviews since April. 2025, include vllm-project#3232 (comment), vllm-project#4822 (comment), vllm-project#4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution: vllm-project#1229 vllm-project#1979 vllm-project#4359 vllm-project#4878 - Community Involvement: He lead the vllm-project#1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include vllm-project#4868 (comment), vllm-project#2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - vllm-project#3334 - vllm-project#3420 - vllm-project#3015 co-author: - vllm-project#3495 - vllm-project#4868 - Community Involvement: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](vllm-project#2867) and [rejection sampler](vllm-project#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](vllm-project#4345 (comment)), [issuecomment-3540994801](vllm-project#4161 (comment)), and [discussion_r2492593988](vllm-project#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both vLLM and vLLM Ascend codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer vllm-project#1568 vllm-project#2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack vllm-project#2913 vllm-project#3350 - Quality Contribution: vllm-project#1568 vllm-project#2602 vllm-project#2913 vllm-project#3350 - Community Involvement: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it? Currently, we are using `AscendRejctionSampler` that extends from `RejctionSampler` in spec decoding. `AscendRejctionSampler` override `forward` of `RejctionSampler`, only aming to replace `rejection_sample` func. This causes a lot of code of `RejctionSampler` cannot be reused, for example: - vllm-project/vllm#19482 - vllm-project/vllm#26060 - vllm-project/vllm#29223 #### Proposed Change: - Delete `AscendRejctionSampler` and use `RejctionSampler` directly in model runner. - Patch `RejctionSampler.expand_batch_to_tokens` and `RejctionSampler.rejection_sample`, maybe a better way is to make them as custom ops. - Modify `NPUModelRunner` following vllm-project/vllm#26060 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - [x] test logits processor for spec decoding - [x] test logprobs for spec decoding - [x] test logprobs for spec decoding + async shcheduling (test with vllm-project#4893) - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e --------- Signed-off-by: realliujiaxu <realliujiaxu@163.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
…t#5152) I'd like to nominate @zzzzwwjj @realliujiaxu @LCAIZJ to join vLLM Ascend committer team. @zzzzwwjj --- - Review Quality: He has completed 80+reviews since April. 2025, include vllm-project#3232 (comment), vllm-project#4822 (comment), vllm-project#4768 (comment) high quality review. - Sustained Contributions 15+ Valuable bug fix and refactor is very good. https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Azzzzwwjj+is%3Aclosed+review%3Aapproved Continuous optimization of code architecture https://github.com/vllm-project/vllm-ascend/pulls?q=author%3Azzzzwwjj+is%3Amerged - Quality Contribution: vllm-project#1229 vllm-project#1979 vllm-project#4359 vllm-project#4878 - Community Involvement: He lead the vllm-project#1147, to refactor AscendFusedMoE at the first time. He shared topics about large-scale distributed inference and reinforcement learning on vLLM-Ascend meetup on August 2nd. @realliujiaxu --- - Review Quality: He has completed about [40+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3Arealliujiaxu+-author%3Arealliujiaxu+) since September, include vllm-project#4868 (comment), vllm-project#2275 (comment). - Sustained Contributions He has completed (17 commits)[https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged], continuously optimizing the performance of the MoE model. - Quality Contribution: Contributed the Flash Comm1 feature to the community, supporting both eager and aclgraph execution modes, while compatible with multiple MoE models including DeepSeek and GLM4.5. - vllm-project#3334 - vllm-project#3420 - vllm-project#3015 co-author: - vllm-project#3495 - vllm-project#4868 - Community Involvement: 1. Completed two major refactors, enabling vllm-ascend to evolve more rapidly and robustly: [Linear module](vllm-project#2867) and [rejection sampler](vllm-project#4975) 2. [fixed 8 bugs](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3Arealliujiaxu+is%3Amerged+bugfix+) in graph mode, spec decoding and async scheduling. @LCAIZJ --- - Review Quality: He's been the go-to reviewer for virtually all PD disaggregation and KV Pool related PRs, having completed [30+ reviews](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+commenter%3ALCAIZJ+is%3Aopen+-author%3ALCAIZJ+) since May 2025. Notable examples include [discussion_r2553887360](vllm-project#4345 (comment)), [issuecomment-3540994801](vllm-project#4161 (comment)), and [discussion_r2492593988](vllm-project#3981 (comment)), all demonstrating thorough and insightful feedback. - Sustained and Quality Contributions: His contributions reflect a strong grasp of both vLLM and vLLM Ascend codebases, particularly in prefill-decode disaggregation and KV pool areas ([7 PRs merged](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+)). Prefill-Decode Disaggregation: Delivered KV transfer functionality using Mooncake TransferEngine and enabled layerwise KV transfer vllm-project#1568 vllm-project#2602 KV Pool: Developed the foundational KV Pool infrastructure and migrated it to the latest ADXL stack vllm-project#2913 vllm-project#3350 - Quality Contribution: vllm-project#1568 vllm-project#2602 vllm-project#2913 vllm-project#3350 - Community Involvement: He actively responds to [community issues](https://github.com/vllm-project/vllm-ascend/issues?q=is%3Aissue%20commenter%3ALCAIZJ%20is%3Aopen%20-author%3ALCAIZJ), continuously monitors functionality and accuracy issues related to PD disaggregation and KV Pool, and proactively delivers [bug fixes](https://github.com/vllm-project/vllm-ascend/pulls?q=is%3Apr+author%3ALCAIZJ+is%3Amerged+bugfix). - vLLM version: v0.12.0 - vLLM main: vllm-project/vllm@ad32e3e Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
What this PR does / why we need it?
Currently, we are using
AscendRejctionSamplerthat extends fromRejctionSamplerin spec decoding.AscendRejctionSampleroverrideforwardofRejctionSampler, only aming to replacerejection_samplefunc. Thiscauses a lot of code of
RejctionSamplercannot be reused, for example:Proposed Change:
AscendRejctionSamplerand useRejctionSamplerdirectly in model runner.RejctionSampler.expand_batch_to_tokensandRejctionSampler.rejection_sample, maybe a better way is to make them as custom ops.NPUModelRunnerfollowing [V1][spec decode] return logprobs for spec decoding vllm#26060Does this PR introduce any user-facing change?
No
How was this patch tested?
test logits processor for spec decoding
test logprobs for spec decoding
test logprobs for spec decoding + async shcheduling (test with [Feat] Support async_scheduler and disable_padded_drafter_batch in eagle #4893)
vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e