
Conversation

Contributor

@ktyxx ktyxx commented May 10, 2025

Why are these changes needed?

This PR improves the downscaling behavior in Ray Serve by modifying the logic in _get_replicas_to_stop() in the default DeploymentScheduler.

Previously, the scheduler selected replicas to stop by traversing the least loaded nodes in ascending order. This often resulted in stopping replicas that had been scheduled earlier and placed optimally using the _best_fit_node() strategy.

This led to several drawbacks:

  • Long-lived replicas, which were scheduled on best-fit nodes, were removed first — leading to inefficient reuse of resources.
  • Recently scaled-up replicas, which were placed on less utilized nodes, were kept longer despite being suboptimal.
  • Cold-start overhead increased, as newer replicas were removed before fully warming up.

This PR reverses the node traversal order during downscaling so that more recently added replicas are prioritized for termination, in cases where other conditions (e.g., running state and number of replicas per node) are equal. These newer replicas are typically less optimal in placement and not yet fully warmed up.

Preserving long-lived replicas improves performance stability and reduces unnecessary resource fragmentation.
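
As a rough sketch of the intended policy (illustrative names only, not the actual scheduler code), downscaling should prefer the newest replicas whenever the other tie-breaking conditions are equal:

```python
from typing import Dict, Set


def pick_replicas_to_stop(
    running_replicas: Dict[str, str],  # replica_id -> node_id, oldest first
    max_num_to_stop: int,
) -> Set[str]:
    """Hypothetical sketch: stop the most recently launched replicas first.

    Assumes `running_replicas` preserves scheduling (insertion) order,
    as Python dicts do since 3.7.
    """
    replicas_to_stop: Set[str] = set()
    for replica_id in reversed(list(running_replicas)):  # newest -> oldest
        if len(replicas_to_stop) == max_num_to_stop:
            break
        replicas_to_stop.add(replica_id)
    return replicas_to_stop
```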

Related issue number

N/A

Checks

  • [x] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

@hainesmichaelc hainesmichaelc added the community-contribution label May 12, 2025
@masoudcharkhabi masoudcharkhabi added the serve and stability labels May 12, 2025
@ktyxx ktyxx force-pushed the fix-replica-scale-down-order branch from b081d11 to 02a57df on May 13, 2025 09:15
@akshay-anyscale akshay-anyscale requested a review from a team May 23, 2025 13:20
Contributor

Since node_to_running_replicas_of_target_deployment[node_id] is a set, we don't get any guarantee that it's going to stop replicas in reverse order. This needs a different implementation.

Contributor Author

Apologies for the delayed response — you were absolutely right.

node_to_running_replicas_of_target_deployment[node_id] was a set, so relying on reversed(list(...)) didn’t guarantee replica stop order. That was indeed a problem.

To address this, I've updated the implementation to use:

for node_id, _ in reversed(  # noqa: C413
    sorted(node_to_running_replicas_of_all_deployments.items(), key=key)
):

Contributor

I don't think node_id is chronologically increasing, but I could be wrong; can you look into that? If I'm right, sorting by node_id will not help with the task you set out to achieve.

Contributor Author

Thanks for the review, @abrarsheikh.
You’re right—sorting by node_id doesn’t give chronological order, so this change doesn’t achieve the intended behavior. I’ll close this PR and revisit the down-scale logic in a separate patch. Appreciate your time and feedback!

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Jun 10, 2025
@ktyxx ktyxx force-pushed the fix-replica-scale-down-order branch 2 times, most recently from d9e507b to 277db74 on June 11, 2025 05:18
@github-actions github-actions bot removed the stale label Jun 12, 2025
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Jun 26, 2025
@ktyxx ktyxx closed this Jun 27, 2025
Contributor Author

ktyxx commented Jul 14, 2025

Reopening this PR after closing it due to the use of set(), which doesn't preserve insertion order.

The goal of this change is to prefer stopping more recently launched replicas (which are often less optimized) before long-lived ones. This helps preserve well-placed warm replicas and improves scaling behavior.

By replacing replicas_to_stop with a list and preserving replica order, we avoid non-deterministic stop behavior. The performance impact is negligible, and the change aligns well with the scheduling goals of Serve.

The benefits of more stable and intelligent downscaling justify reopening this PR.

@ktyxx ktyxx reopened this Jul 14, 2025
@ktyxx ktyxx force-pushed the fix-replica-scale-down-order branch from 3cf778b to f7df605 on July 14, 2025 05:46
@github-actions github-actions bot added the unstale label and removed the stale label Jul 14, 2025
 ) -> Set[ReplicaID]:
-    """Prioritize replicas running on a node with fewest replicas of
-    all deployments.
+    """Prioritize replicas on nodes with fewest replicas of all deployments
Contributor

I prefer the old function implementation; the inline comments and variable names were easier to read.

Contributor Author

Thanks for the feedback! I've reverted the original variable names and inline comments while keeping the list+reversed logic. Let me know if anything else looks off.

     https://github.com/ray-project/ray/issues/20599.
     """
-    replicas_to_stop = set()
+    replicas_to_stop: List[ReplicaID] = []
Contributor

why does replicas_to_stop need to be a list?

Contributor Author

We need a list to preserve insertion order. Each replica is inserted into
self._running_replicas[deployment_id] exactly once in
on_replica_running() when it reaches RUNNING:

self._running_replicas[deployment_id][replica_id] = node_id

Python 3.7+ dicts preserve that insertion order, so keys are oldest → newest.
By iterating with reversed(list(...)) we get newest → oldest, which lets us
stop the most recently launched replica first when multiple replicas live on
the same node. We cast the list back to a set before returning so the public
API stays unchanged.
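
A tiny self-contained demonstration of that property (illustrative IDs only):

```python
# Python 3.7+ dicts preserve insertion order, so iterating the reversed
# key list yields newest -> oldest.
running = {}  # replica_id -> node_id, filled as replicas reach RUNNING
running["r1"] = "node-a"  # oldest
running["r2"] = "node-a"
running["r3"] = "node-b"  # newest

assert list(reversed(list(running))) == ["r3", "r2", "r1"]
```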

Contributor

That makes sense, but why does replicas_to_stop need to be a list?

Contributor Author

Just to confirm — are you asking why we keep replicas_to_stop itself as a list even though newest_first_replicas is already ordered newest → oldest?

We need the list while filling it: we append newest → oldest and exit once len == max_num_to_stop; a set would drop that order and break the LIFO guarantee.
Right before returning we cast to set(...), so the caller still gets a Set[ReplicaID].
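
A minimal sketch of that fill pattern (hypothetical helper, not the exact scheduler code):

```python
from typing import List, Set


def select_newest_first(
    newest_first_replicas: List[str], max_num_to_stop: int
) -> Set[str]:
    replicas_to_stop: List[str] = []  # a list, to keep newest -> oldest order
    for replica_id in newest_first_replicas:
        if len(replicas_to_stop) == max_num_to_stop:
            break  # exit as soon as enough replicas are chosen
        if replica_id not in replicas_to_stop:
            replicas_to_stop.append(replica_id)
    # Cast back to a set so the public return type stays Set[ReplicaID].
    return set(replicas_to_stop)
```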

Contributor
@zcin zcin Aug 6, 2025

Yeah exactly, my question is why do we need to have replicas_to_stop be ordered?
You seem to just be

  1. initializing an empty list replicas_to_stop
  2. running replicas_to_stop.append(pending_launching_recovering_replica)
  3. running if running_replica not in replicas_to_stop: replicas_to_stop.append(running_replica)
  4. returning set(replicas_to_stop)

If replicas_to_stop were a set, you could omit the check in (3)? Perhaps I'm missing something here.

Contributor

> a set would drop that order and break the LIFO guarantee

@ktyxx why does replicas_to_stop also need to be in order if LIFO is already guaranteed at the previous step when iterating over reversed(list(...))?

Contributor Author

You’re right—the container returned by _get_replicas_to_stop itself doesn’t need to be ordered.
My previous patch changed it to a list, but that was unnecessary.

I’ve switched replicas_to_stop back to a set and rewritten the selection logic (sketched below) so the LIFO rule is enforced by
1. taking the per-deployment _running_replicas, reversing it once (newest → oldest)
2. selecting from each node’s bucket in that order.

Thanks for pointing this out!
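
Roughly, the new flow looks like this (a sketch with illustrative names, assuming an insertion-ordered replica -> node mapping):

```python
from collections import defaultdict
from typing import Dict, List


def bucket_newest_first(
    running_replicas: Dict[str, str],  # replica_id -> node_id, oldest first
) -> Dict[str, List[str]]:
    ordered_running_replicas = list(running_replicas)
    ordered_running_replicas.reverse()  # newest -> oldest, reversed once

    # Bucket the (newest-first) replicas by node; each node's bucket is
    # then also newest-first, so taking from the front of a bucket stops
    # the most recently launched replica on that node.
    replicas_grouped_by_node: Dict[str, List[str]] = defaultdict(list)
    for replica_id in ordered_running_replicas:
        replicas_grouped_by_node[running_replicas[replica_id]].append(replica_id)
    return replicas_grouped_by_node
```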

ordered_running_replicas.reverse()

# Bucket the (newest-first) replicas by node for fast lookup.
replicas_grouped_by_node: Dict[str, List[ReplicaID]] = defaultdict(list)
Contributor

Seems like we should replace node_to_running_replicas_of_target_deployment with this new replicas_grouped_by_node? Maybe name it something like ordered_running_replicas_of_target_deployment.

Contributor Author

Thanks, good catch. I removed the redundant node_to_running_replicas_of_target_deployment check and now use a single source of truth: ordered_running_replicas_of_target_deployment.

@ktyxx ktyxx force-pushed the fix-replica-scale-down-order branch from 11c2be9 to f8e7e27 on August 27, 2025 02:44
@ktyxx ktyxx force-pushed the fix-replica-scale-down-order branch from f8e7e27 to 63dbed7 on August 27, 2025 02:50
@zcin zcin self-requested a review August 27, 2025 18:49
@ktyxx ktyxx marked this pull request as draft November 17, 2025 01:42
@ktyxx ktyxx marked this pull request as ready for review November 17, 2025 01:46
@abrarsheikh abrarsheikh added the go label Nov 19, 2025
Contributor Author

ktyxx commented Nov 20, 2025

@zcin Hi! Just a gentle ping on this PR when you get a chance.
No rush, just making sure it's on your radar. Thanks!

@zcin zcin merged commit eaf2af4 into ray-project:master Nov 21, 2025
6 checks passed
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…ownscaling (ray-project#52929)
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025