[Feat] proxy delay to remove instances#5934
Conversation
|
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according Contributing and Testing. |
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to delay the removal of proxy instances when there are active requests, by 'tainting' them to prevent new traffic. While this is a valuable feature for graceful node isolation, the current implementation has several critical issues related to state management that could lead to incorrect behavior, such as corrupted state and infinite loops. I've identified and provided suggestions for these issues, including a variable shadowing bug, potential for duplicate entries in tainted lists, and failure to clear state after processing. Addressing these points will make the feature robust and reliable.
0cc2eda to
6934188
Compare
Signed-off-by: yuxinshan <syx_ctyg@126.com>
6934188 to
a573bd8
Compare
…to qwen3next_rebase * 'main' of https://github.com/vllm-project/vllm-ascend: (86 commits) [refactor] refactor excute_model and _dymmy_run method (vllm-project#6043) [Refactor] profiler config optimze (vllm-project#6141) [Graph][Fusion] Add MatmulAllReduceAddRMSNorm graph fusion for npugraph_ex. (vllm-project#6006) [UT]: refactoring 310p ops ut (vllm-project#6296) [Refact.]: refactoring 310p-kv cache allocator, align with main branch (vllm-project#6270) [Misc] Removes unnecessary graph size re-initialization (vllm-project#6280) [Main2Main] Upgrade vllm commit to 0123 (vllm-project#6169) [BugFix] Fix wheel package build workflow (vllm-project#6276) [CI][BugFix] Qwen3-Next nightly test fix. (vllm-project#6247) [Doc] quick fix for vllm-ascend version (vllm-project#6278) [Community] Nominate whx-sjtu as maintainer (vllm-project#6268) [Lint] Fix mypy issue to make CI happy (vllm-project#6272) BugFix: Fix moe_load accumulation error in ACL graph mode (vllm-project#6182) [Patch] Remove the patch of ECExampleConnector (vllm-project#5976) [Bugfix] Fix PP+PCP and PP+flashcomm1 bugs (vllm-project#5416) [Feat] proxy delay to remove instances (vllm-project#5934) [CI] Add workfolw_dispatch for nightly image build (vllm-project#6269) [bugfix][npugraph_ex]fix static kernel uninstall issue (vllm-project#6128) [Doc] 310P Documents update (vllm-project#6246) [Feature] Mooncake connector get remote ptp size (vllm-project#5822) ...
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
### What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not
processing requests.
But sometimes, We need to **isolate** some faulty nodes when a large
number of **requests** are coming in.
So we support to **isolate** faulty nodes by **lowering their priority**
and **deleted** them when the proxy does not process requests.
### Does this PR introduce _any_ user-facing change?
For
`examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py`,
when using `/instances/remove` API to delete the node from the proxy
server:
```txt
curl -X POST http://localhost:9000/instances/remove \
-H "Content-Type: application/json" \
-d '{
"type": "decode",
"instances": "127.0.0.1:8201"
}'
```
There are 2 situations:
* 【New】When the proxy is processing requests, isolate the nodes and
remove them when the proxy is free.
```txt
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}
```
* When the proxy is free, remove the nodes directly.
```txt
{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}
```
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
vllm-project/vllm@11b6af5
Signed-off-by: yuxinshan <syx_ctyg@126.com>
What this PR does / why we need it?
For the proxy, we should remove instances when the proxy are not processing requests.
But sometimes, We need to isolate some faulty nodes when a large number of requests are coming in.
So we support to isolate faulty nodes by lowering their priority and deleted them when the proxy does not process requests.
Does this PR introduce any user-facing change?
For
examples/disaggregated_prefill_v1/load_balance_proxy_server_example.py, when using/instances/removeAPI to delete the node from the proxy server:curl -X POST http://localhost:9000/instances/remove \ -H "Content-Type: application/json" \ -d '{ "type": "decode", "instances": "127.0.0.1:8201" }'There are 2 situations:
{"message": "Instances ['127.0.0.1:8201'] are isolated and waiting to be removed.", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200', '127.0.0.1:8201']}{"message": "remove decode instances: ['127.0.0.1:8201'].", "current_prefill_instances": ['127.0.0.1:8100', '127.0.0.1:8101'], "current_decode_instances": ['127.0.0.1:8200']}How was this patch tested?