[Feat] Support async_scheduler and disable_padded_drafter_batch in eagle#4893
MengqingCao merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request refactors eagle_proposer.py to support async_scheduler and disable_padded_drafter_batch, aligning it with recent changes in vLLM. The changes are extensive, involving new methods for preparing inputs and handling attention metadata. I've found a couple of critical issues that would cause runtime errors due to undefined variables. Please address these to ensure the new logic works as intended.
```python
# Currently we do not use pcp. This is used to adapt the pcp branch.
self.pcp_size = 0
self.backup_next_token_ids = CpuGpuBuffer(
    max_batch_size,
    dtype=torch.int32,
    pin_memory=is_pin_memory_available(),
    device=device,
    with_numpy=True,
)
```
The variable `max_batch_size` is used here without being defined, which will cause a `NameError` at runtime. It should be initialized from `vllm_config.scheduler_config.max_num_seqs` before being used.
Suggested change:

```diff
+ max_batch_size = vllm_config.scheduler_config.max_num_seqs
  # Currently we do not use pcp. This is used to adapt the pcp branch.
  self.pcp_size = 0
  self.backup_next_token_ids = CpuGpuBuffer(
      max_batch_size,
      dtype=torch.int32,
      pin_memory=is_pin_memory_available(),
      device=device,
      with_numpy=True,
  )
```
```diff
@@ -549,29 +435,28 @@ def _propose(
         # Compute the slot mapping.
         block_numbers = (clamped_positions_cpu // self.block_size)
```
The variable `clamped_positions_cpu` is used here but is not defined in this scope. This seems to be a leftover from refactoring and will cause a `NameError`. You probably meant to use `clamped_positions`, which is defined a few lines above and is on the correct device for the subsequent GPU operations.
Suggested change:

```diff
- block_numbers = (clamped_positions_cpu // self.block_size)
+ block_numbers = (clamped_positions // self.block_size)
```
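For context, the slot-mapping arithmetic this hunk feeds into can be sketched in isolation: a position is split into a logical block index and an in-block offset, then translated through the block table into a physical KV-cache slot. The `block_table` values below are made up for illustration.

```python
def compute_slot_mapping(positions, block_table, block_size):
    """Map token positions to physical KV-cache slots.

    slot = physical_block_id * block_size + offset_within_block
    """
    slots = []
    for pos in positions:
        block_number = pos // block_size   # logical block index
        block_offset = pos % block_size    # offset inside that block
        physical_block = block_table[block_number]
        slots.append(physical_block * block_size + block_offset)
    return slots

print(compute_slot_mapping([0, 1, 128, 130], block_table=[7, 3], block_size=128))
# [896, 897, 384, 386]
```

This is why the positions tensor must live on the same device as the block table: the division and gather both run on the GPU in the real code.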
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
`vllm_ascend/attention/utils.py` (Outdated)
```python
is_only_prefill=self.is_only_prefill,
graph_pad_size=-1,  # It should be -1 when not run in fullgraph mode.
num_input_tokens=num_actual_tokens,
cos=self.cos,
```
Check whether cos/sin need to be sliced.
Checked. In the current scenario, cos/sin will only be None, so this won't cause any errors. Slicing has been added for subsequent scenarios.
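A hedged sketch of the slicing guard described in this reply; the function name and the list-based caches are illustrative stand-ins for the real tensor caches.

```python
def maybe_slice_cos_sin(cos, sin, num_actual_tokens):
    # Current scenario: the caches are None, so there is nothing to slice
    # and the guard is a no-op.
    if cos is None or sin is None:
        return cos, sin
    # Subsequent scenarios: trim the precomputed rotary caches to the
    # actual token count of this batch.
    return cos[:num_actual_tokens], sin[:num_actual_tokens]

print(maybe_slice_cos_sin(None, None, 4))            # (None, None)
print(maybe_slice_cos_sin([1, 2, 3], [4, 5, 6], 2))  # ([1, 2], [4, 5])
```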
This pull request has conflicts, please resolve those before we can evaluate the pull request.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```diff
  if self.speculative_config:
      use_padded_batch_for_eagle = self.speculative_config and \
-         self.speculative_config.method == "mtp" and \
+         self.speculative_config.method in ("mtp", "eagle", "eagle3") and \
```
Use `self.speculative_config.use_eagle()` instead.
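The suggested helper roughly behaves like the sketch below. The exact method set covered by vLLM's `SpeculativeConfig.use_eagle()` varies by version, so the membership tuple here is an assumption mirroring the diff above, and the class is a stand-in.

```python
class SpeculativeConfig:
    """Illustrative stand-in for vLLM's SpeculativeConfig."""

    def __init__(self, method):
        self.method = method

    def use_eagle(self):
        # Assumption: mtp runs on the eagle code path, matching the
        # membership test in the diff above.
        return self.method in ("mtp", "eagle", "eagle3")

print(SpeculativeConfig("eagle3").use_eagle())  # True
print(SpeculativeConfig("ngram").use_eagle())   # False
```

Centralizing the check in one method keeps call sites from drifting when a new eagle-style method is added.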
### What this PR does / why we need it?

Currently, we are using `AscendRejctionSampler`, which extends `RejctionSampler`, in spec decoding. `AscendRejctionSampler` overrides `forward` of `RejctionSampler` only to replace the `rejection_sample` function. This prevents a lot of `RejctionSampler` code from being reused, for example:

- vllm-project/vllm#19482
- vllm-project/vllm#26060
- vllm-project/vllm#29223

#### Proposed Change:

- Delete `AscendRejctionSampler` and use `RejctionSampler` directly in the model runner.
- Patch `RejctionSampler.expand_batch_to_tokens` and `RejctionSampler.rejection_sample`; a better approach might be to make them custom ops.
- Modify `NPUModelRunner` following vllm-project/vllm#26060

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- [x] test logits processor for spec decoding
- [x] test logprobs for spec decoding
- [x] test logprobs for spec decoding + async scheduling (tested with #4893)

- vLLM version: v0.12.0
- vLLM main: vllm-project/vllm@ad32e3e

Signed-off-by: realliujiaxu <realliujiaxu@163.com>
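The patching approach in the proposed change can be sketched as follows. The class and method bodies are illustrative stand-ins, not the real sampler: the point is that replacing a single method on the upstream class lets `forward` and the rest of the class be reused, unlike subclassing and overriding `forward` wholesale.

```python
class RejectionSampler:
    """Stand-in for the upstream sampler."""

    def rejection_sample(self, draft, target):
        return "cuda-kernel-path"

    def forward(self, draft, target):
        # Lots of reusable logic lives here in the real class
        # (logprobs, logits processors, ...).
        return self.rejection_sample(draft, target)

def npu_rejection_sample(self, draft, target):
    # NPU-specific replacement for the one method that differs.
    return "npu-kernel-path"

# Patch only that method; forward() and everything else stays shared.
RejectionSampler.rejection_sample = npu_rejection_sample

print(RejectionSampler().forward(None, None))  # npu-kernel-path
```

Registering the replacements as custom ops, as the description suggests, would achieve the same substitution without module-level monkey-patching.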
```python
)

# Create an LLM.
llm = LLM(
```
I just noticed you're using `LLM` in a test case. We must clean up NPU HBM when not using `VllmRunner`. Please modify this case to use `VllmRunner`, or clean up by hand if you keep it as is.
Please refer to https://github.com/vllm-project/vllm-ascend/blob/main/tests/e2e/conftest.py#L81
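The cleanup pattern being requested can be sketched as a context manager; `NpuRunner` and its internals below are hypothetical stand-ins for `VllmRunner`, whose actual teardown also releases device memory.

```python
import gc

class NpuRunner:
    """Illustrative stand-in for VllmRunner's cleanup behavior."""

    def __init__(self, model):
        self.llm = f"LLM({model})"  # placeholder for the real engine object

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Drop the engine and force collection so device memory (NPU HBM)
        # is released even if the test body raises.
        del self.llm
        gc.collect()
        return False

with NpuRunner("some-model") as runner:
    pass  # test body goes here
```

The `with` block guarantees teardown runs on every exit path, which is what a bare `LLM(...)` in a test does not.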
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: drslark <slarksblood@qq.com> Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
What this PR does / why we need it?
We refactored eagle_proposer.py to adapt it to the framework of eagle.py in vLLM v0.12.0, in order to support the logic of the padded drafter batch and the async scheduler.
Does this PR introduce any user-facing change?
How was this patch tested?