[main] add pd transfer for ascend scheduler #2753
wangxiyuan merged 24 commits into vllm-project:main
Conversation
Signed-off-by: CaranLic <740821011@qq.com>
Code Review
This pull request introduces a 'pd transfer' feature for the Ascend scheduler, aiming to optimize performance by separating prefill and decode phases. The changes involve adding an enable_pd_transfer configuration, implementing a two-phase state machine ('prefill', 'decode') in the AscendScheduler, and adding unit tests to validate the new functionality. My review has identified a few areas for improvement. The phase transition logic currently only allows a one-way switch from 'prefill' to 'decode', which may be suboptimal for dynamic workloads. Additionally, the new unit test could be made more robust by avoiding manual manipulation of the scheduler's internal state. Finally, there's a minor point on configuration access that could be improved to make the code more robust against configuration errors.
```python
def test_scheduler_with_pd_transfer(self):
    scheduler = self.create_scheduler()
    scheduler.phase = "prefill"
```
Manually setting the internal state scheduler.phase makes this test brittle and less representative of real usage. If the initialization logic in AscendScheduler.__init__ changes, this test would not catch the regression. A better approach is to initialize the scheduler with enable_pd_transfer=True in its configuration, which would correctly set the initial phase.
To achieve this, you could modify the create_scheduler helper method to accept configuration overrides. For example:
```python
def create_scheduler(
        self,
        mock_compute_encoder_budget,
        scheduler_config_override: Optional[Dict[str, Any]] = None):
    # ... existing setup ...
    scheduler_config = SchedulerConfig(
        # ...
    )
    if scheduler_config_override:
        for key, value in scheduler_config_override.items():
            setattr(scheduler_config, key, value)
    # ... rest of the function ...
```

Then, the test can be updated to:

```python
scheduler = self.create_scheduler(
    scheduler_config_override={"enable_pd_transfer": True})
self.assertEqual(scheduler.phase, "prefill")
```

This change would make the test more robust and would also verify the scheduler's initialization logic.
```python
enable_pd_transfer = getattr(self.scheduler_config,
                             'enable_pd_transfer',
                             False)
```
Using getattr with a default value for enable_pd_transfer can mask configuration issues. The AscendSchedulerConfig is expected to always have the enable_pd_transfer attribute. If it's missing for some reason (e.g., a typo in the attribute name in the config class), getattr will silently fall back to False, disabling the feature without any warning. This can make debugging difficult.
It's more robust to access the attribute directly. This will ensure that any configuration problem results in a clear AttributeError, which is a fail-fast approach.
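To see how the silent fallback can hide a typo, here is a minimal stand-alone sketch; `SchedulerCfg` is a made-up stand-in for `AscendSchedulerConfig`, not real project code:

```python
class SchedulerCfg:
    """Hypothetical stand-in for AscendSchedulerConfig with a typo'd flag."""
    # Typo: 'transfar' instead of 'transfer', so the real flag never exists.
    enable_pd_transfar = True


cfg = SchedulerCfg()

# getattr silently falls back to False: the feature is disabled, no warning.
masked = getattr(cfg, "enable_pd_transfer", False)

# Direct attribute access fails fast, surfacing the misconfiguration.
try:
    _ = cfg.enable_pd_transfer
    failed_fast = False
except AttributeError:
    failed_fast = True
```

With the typo present, `masked` is `False` and nothing complains, while the direct access raises `AttributeError` immediately.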
```python
enable_pd_transfer = self.scheduler_config.enable_pd_transfer
```

```python
if not self.waiting and not self.running:
    self.phase = "decode"
```
The phase transition from 'prefill' to 'decode' is currently one-way. Once the scheduler enters the 'decode' phase, it never returns to 'prefill'. If new requests arrive while the system is in the 'decode' phase, they will be prefilled and then immediately start decoding, which might not be the most efficient approach for the Ascend hardware this feature is targeting, as it breaks the strict batching of prefill operations.
To improve performance for dynamic workloads, consider adding logic to allow the scheduler to switch back to the 'prefill' phase. For instance, you could add a check at the beginning of the schedule method:
```python
if self.phase == "decode" and not self.running and self.waiting:
    self.phase = "prefill"
```

This would ensure that if the decoding queue is empty and new requests are waiting, the scheduler can switch back to the more efficient batch prefill mode.
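The two-phase flow plus the proposed fallback can be sketched as a toy state machine. `PdTransferScheduler` and its one-request-per-call decode step are illustrative stand-ins, not the real `AscendScheduler`:

```python
class PdTransferScheduler:
    """Toy model of the pd-transfer phases with the decode-to-prefill fallback."""

    def __init__(self):
        self.phase = "prefill"
        self.waiting = []   # requests awaiting prefill
        self.running = []   # requests in the decode loop

    def schedule(self):
        # Proposed fix: fall back to prefill once decode has drained
        # and new requests are queued.
        if self.phase == "decode" and not self.running and self.waiting:
            self.phase = "prefill"
        if self.phase == "prefill":
            # Prefill the whole waiting queue as one batch, then hand
            # the batch over to decode.
            self.running.extend(self.waiting)
            self.waiting.clear()
            if self.running:
                self.phase = "decode"
        elif self.running:
            # Decode: each call finishes one request (a stand-in for
            # stepping tokens until completion).
            self.running.pop()
        return self.phase
```

In this toy model, a fresh batch is prefilled as a unit, decoded to empty, and any requests that arrived in the meantime trigger a new batched prefill instead of being interleaved.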
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: CaranLic <740821011@qq.com>
Enabling this feature will result in differences in the definition of
Codecov Report: ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2753      +/-   ##
==========================================
+ Coverage   74.76%   74.90%   +0.13%
==========================================
  Files         150      152       +2
  Lines       20891    21008     +117
==========================================
+ Hits        15620    15736     +116
- Misses       5271     5272       +1
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
```python
self.min_p = self.min_p_device[:0]
```

```python
MinPLogitsProcessor.__init__ = min_p_logits_processor_init_func
```
Do not patch vLLM; you should create an Ascend build_logitsprocs in npu_input_batch instead.
Changed from patching the MinPLogitsProcessor init function to redefining build_logitsprocs; code in 30571f1.
Signed-off-by: CaranLic <740821011@qq.com>
### What this PR does / why we need it?
For offline scenarios, adjust the scheduling process to prioritize the
prefill phase of all requests, then process the decode phase of all
requests.
### How was this patch tested?
```
max_num_seqs=24,
additional_config={
"ascend_scheduler_config":{
"enabled": True,
"enable_pd_transfer": True,
"decode_max_num_seqs": 24,
"enable_chunked_prefill": False
}
},
```
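The fragment above can also be built programmatically; a stdlib-only sketch, where the `ascend_pd_transfer_config` helper is hypothetical and only the key names come from the config shown above:

```python
def ascend_pd_transfer_config(decode_max_num_seqs: int = 24) -> dict:
    """Build the additional_config fragment used in the test setup above.

    Key names mirror the ascend_scheduler_config block shown in this PR;
    the helper itself is illustrative, not part of vllm-ascend.
    """
    return {
        "ascend_scheduler_config": {
            "enabled": True,
            "enable_pd_transfer": True,
            "decode_max_num_seqs": decode_max_num_seqs,
            "enable_chunked_prefill": False,
        }
    }


config = ascend_pd_transfer_config(decode_max_num_seqs=24)
```

The resulting dict can then be passed as the `additional_config` value when constructing the engine.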
| input | output | num prompts | max_num_seqs | dp | tp | scheduler | tps |
| ------ | ------ | ----------- | ------------ | -- | -- | ----------- | -------------- |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | v1 | 234.06 |
| dapo-math-17K | 2K | 384 | 24 | 2 | 1 | pd transfer | 239.59 (+2.4%) |
| dapo-math-17K | 2K | 384 | 24 | 4 | 1 | v1 | 222.85 |
| dapo-math-17K | 2K | 384 | 24 | 4 | 1 | pd transfer | 225.81 (+1.3%) |
- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@6fb2788
---------
Signed-off-by: CaranLic <740821011@qq.com>
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Signed-off-by: nsdie <yeyifan@huawei.com>