[Serve] Prioritize stopping most recently scaled-up replicas during downscaling #52929
Conversation
Since `node_to_running_replicas_of_target_deployment[node_id]` is a set, we don't get any guarantee that it's going to stop replicas in reverse order. This needs a different implementation.
Apologies for the delayed response — you were absolutely right.
node_to_running_replicas_of_target_deployment[node_id] was a set, so relying on reversed(list(...)) didn’t guarantee replica stop order. That was indeed a problem.
To address this, I've updated the implementation to use:
for node_id, _ in reversed( # noqa: C413
sorted(node_to_running_replicas_of_all_deployments.items(), key=key)
):
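To illustrate the ordering issue with a quick, self-contained snippet (the replica IDs here are made up and this is not Serve code, just plain Python):

```python
# Hypothetical replica IDs; shows that a set forgets insertion order.
replicas_added_in_order = ["replica-1", "replica-2", "replica-3"]

as_set = set()
for replica_id in replicas_added_in_order:
    as_set.add(replica_id)

# list(as_set) comes back in hash order, not insertion order, so
# reversed(list(as_set)) is NOT "most recently added first".
print(list(reversed(list(as_set))))  # order is arbitrary across runs
```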
I don't think node_id is chronologically increasing, but I could be wrong; can you look into that? If I'm right, then sorting by node_id will not help with the task you set out to achieve.
Thanks for the review, @abrarsheikh.
You’re right: sorting by node_id doesn’t give chronological order, so this change doesn’t achieve the intended behavior. I’ll close this PR and revisit the downscale logic in a separate patch. Appreciate your time and feedback!
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
Reopening this PR after closing it earlier. The goal of this change is to prefer stopping more recently launched replicas (which are often less optimized) before long-lived ones. This helps preserve well-placed warm replicas and improves scaling behavior. The benefits of more stable and intelligent downscaling justify reopening this PR.
) -> Set[ReplicaID]:
-    """Prioritize replicas running on a node with fewest replicas of
-    all deployments.
+    """Prioritize replicas on nodes with fewest replicas of all deployments
I prefer the old function implementation; the inline comments and variable names were easier to read.
Thanks for the feedback! I've reverted the original variable names and inline comments while keeping the list+reversed logic. Let me know if anything else looks off.
    https://github.com/ray-project/ray/issues/20599.
    """
-    replicas_to_stop = set()
+    replicas_to_stop: List[ReplicaID] = []
why does replicas_to_stop need to be a list?
We need a list to preserve insertion order. Each replica is inserted into `self._running_replicas[deployment_id]` exactly once in `on_replica_running()` when it reaches RUNNING:

    self._running_replicas[deployment_id][replica_id] = node_id

Python 3.7+ dicts preserve that insertion order, so keys are oldest → newest. By iterating with `reversed(list(...))` we get newest → oldest, which lets us stop the most recently launched replica first when multiple replicas live on the same node. We cast the list back to a set before returning so the public API stays unchanged.
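A quick self-contained sketch of that behavior (the IDs are made up; this is not the actual Serve bookkeeping, just plain dicts):

```python
# Hypothetical bookkeeping: replica_id -> node_id, filled as replicas reach RUNNING.
running_replicas = {}
running_replicas["replica-1"] = "node-a"   # oldest
running_replicas["replica-2"] = "node-b"
running_replicas["replica-3"] = "node-a"   # newest

# Plain iteration follows insertion order: oldest -> newest.
assert list(running_replicas) == ["replica-1", "replica-2", "replica-3"]

# reversed(list(...)) walks the same entries newest -> oldest, which is the
# order we want to consider replicas for stopping.
newest_first = list(reversed(list(running_replicas.items())))
assert newest_first[0] == ("replica-3", "node-a")
```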
That makes sense, but why does replicas_to_stop need to be a list?
Just to confirm — are you asking why we keep replicas_to_stop itself as a list even though newest_first_replicas is already ordered newest → oldest?
We need the list while filling it: we append newest → oldest and exit once len == max_num_to_stop; a set would drop that order and break the LIFO guarantee.
Right before returning we cast to set(...), so the caller still gets a Set[ReplicaID].
Yeah exactly, my question is why do we need to have replicas_to_stop be ordered? You seem to just be:

1. initializing an empty list `replicas_to_stop`
2. running `replicas_to_stop.append(pending_launching_recovering_replica)`
3. running `if running_replica not in replicas_to_stop: replicas_to_stop.append(running_replica)`
4. returning `set(replicas_to_stop)`

If replicas_to_stop was a set, you could omit the check in (3)? Perhaps I'm missing something here.
"a set would drop that order and break the LIFO guarantee"

@ktyxx why does replicas_to_stop also need to be in order if LIFO is already guaranteed at the previous step, when iterating over reversed(list(...))?
You’re right: the container returned by _get_replicas_to_stop itself doesn’t need to be ordered.
My previous patch changed it to a list, but that was unnecessary.
I’ve switched replicas_to_stop back to a set and rewritten the selection logic so the LIFO rule is enforced by:
1. taking the per-deployment _running_replicas and reversing it once (newest → oldest), then
2. selecting from each node’s bucket in that order.
Thanks for pointing this out!
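A self-contained sketch of that two-step rule, assuming hypothetical names (`running_replicas`, `node_load`, `pick_replicas_to_stop`); this only illustrates the selection logic described above, not the actual `_get_replicas_to_stop` implementation:

```python
from collections import defaultdict
from typing import Dict, List, Set


def pick_replicas_to_stop(
    running_replicas: Dict[str, str],  # replica_id -> node_id, insertion-ordered
    node_load: Dict[str, int],         # node_id -> replica count across all deployments
    max_num_to_stop: int,
) -> Set[str]:
    """Stop the most recently launched replicas first, preferring lightly loaded nodes."""
    # Step 1: reverse the insertion-ordered replicas once (newest -> oldest).
    newest_first = list(reversed(list(running_replicas.items())))

    # Bucket the newest-first replicas by node; each bucket stays newest-first.
    replicas_by_node: Dict[str, List[str]] = defaultdict(list)
    for replica_id, node_id in newest_first:
        replicas_by_node[node_id].append(replica_id)

    # Step 2: visit nodes with the fewest replicas of all deployments first
    # (ties keep newest-first order because sorted() is stable), taking
    # replicas from each node's bucket in newest-first order.
    replicas_to_stop: Set[str] = set()
    for node_id in sorted(replicas_by_node, key=lambda n: node_load.get(n, 0)):
        for replica_id in replicas_by_node[node_id]:
            replicas_to_stop.add(replica_id)
            if len(replicas_to_stop) >= max_num_to_stop:
                return replicas_to_stop
    return replicas_to_stop
```

Because the sort is stable and the buckets are built from the newest-first list, nodes with equal load are visited starting with the one holding the newest replica, which matches the tie-breaking behavior this PR is after.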
ordered_running_replicas.reverse()

# Bucket the (newest-first) replicas by node for fast lookup.
replicas_grouped_by_node: Dict[str, List[ReplicaID]] = defaultdict(list)
Seems like we should replace node_to_running_replicas_of_target_deployment with this new replicas_grouped_by_node? Maybe name it something like ordered_running_replicas_of_target_deployment.
Thanks, good catch. I removed the redundant node_to_running_replicas_of_target_deployment check and now use a single source of truth: ordered_running_replicas_of_target_deployment.
@zcin Hi! Just a gentle ping on this PR when you get a chance.
Why are these changes needed?
This PR improves the downscaling behavior in Ray Serve by modifying the logic in `_get_replicas_to_stop()` within the default `DeploymentScheduler`.
Previously, the scheduler selected replicas to stop by traversing the least loaded nodes in ascending order. This often resulted in stopping replicas that had been scheduled earlier and placed optimally using the `_best_fit_node()` strategy.
This led to several drawbacks:
- Long-lived replicas, which were scheduled on best-fit nodes, were removed first, leading to inefficient reuse of resources.
- Recently scaled-up replicas, which were placed on less utilized nodes, were kept longer despite being suboptimal.
- Cold-start overhead increased, as newer replicas were removed before fully warming up.
This PR reverses the node traversal order during downscaling so that more recently added replicas are prioritized for termination, in cases where other conditions (e.g., running state and number of replicas per node) are equal. These newer replicas are typically less optimal in placement and not yet fully warmed up.
Preserving long-lived replicas improves performance stability and reduces unnecessary resource fragmentation.
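As a small sanity check of the intended tie-breaking behavior, reusing the hypothetical `pick_replicas_to_stop` sketch from the review thread above (not the real scheduler API):

```python
# Two replicas on equally loaded nodes; the newer one should be stopped first.
running = {"old-replica": "node-a", "new-replica": "node-b"}  # oldest -> newest
load = {"node-a": 1, "node-b": 1}

assert pick_replicas_to_stop(running, load, max_num_to_stop=1) == {"new-replica"}
```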
Related issue number
N/A
Checks
- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [ ] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(