[3/n] [Serve] Defer rank assignment after replica is allocated #58477
Conversation
Signed-off-by: abrar <[email protected]>
…58473)

### Summary

This PR refactors the replica rank system to support multi-dimensional ranking (global, node-level, and local ranks) in preparation for node-local rank tracking. The `ReplicaRank` object now contains three fields instead of being a simple integer, enabling better coordination of replicas across nodes.

### Motivation

Currently, Ray Serve only tracks a single global rank per replica. For advanced use cases like tensor parallelism, model sharding across nodes, and node-aware coordination, we need to track:

- **Global rank**: Replica's rank across all nodes (0 to N-1)
- **Node rank**: Which node the replica is on (0 to M-1)
- **Local rank**: Replica's rank on its specific node (0 to K-1)

This PR lays the groundwork by introducing the expanded `ReplicaRank` schema while keeping existing behavior backward compatible.

### Changes

#### Core Implementation

- **`schema.py`**: Extended `ReplicaRank` to include `node_rank` and `local_rank` fields (currently set to -1 as placeholders)
- **`replica.py`**: Updated replica actors to handle `ReplicaRank` objects
- **`context.py`**: Changed `ReplicaContext.rank` type from `Optional[int]` to `ReplicaRank`

### Current Behavior

- `node_rank` and `local_rank` are set to `-1` (placeholder values); this will change in a future PR
- Global rank assignment and management work as before
- All existing functionality is preserved

### Breaking Changes

Rank is changing from `int` to `ReplicaRank`.

Next PR #58477

---------

Signed-off-by: abrar <[email protected]>
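To make the schema change concrete, here is a minimal sketch of what the expanded rank object could look like. It is an illustrative dataclass based on the field names in the description above, not the actual `schema.py` definition; the defaults mirror the stated `-1` placeholders.

```python
from dataclasses import dataclass


@dataclass
class ReplicaRank:
    # Replica's rank across all nodes (0 to N-1).
    rank: int
    # Index of the node the replica runs on (0 to M-1);
    # -1 is a placeholder until node-rank tracking lands.
    node_rank: int = -1
    # Replica's rank on its specific node (0 to K-1); also -1 for now.
    local_rank: int = -1


r = ReplicaRank(rank=3)
print(r)  # ReplicaRank(rank=3, node_rank=-1, local_rank=-1)
```

Callers that previously compared the rank as an `int` would now read `r.rank`, which is the breaking change the description calls out.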
Bug: Recovery Breaks Deferred Rank Assignment
During replica recovery after controller restart, initialize_and_get_metadata.remote() is called without passing a rank parameter (line 671), but the method signature now requires it. The _assign_rank_callback is never set during recovery since start() isn't called, and the rank assignment logic in check_ready() (lines 743-746) is skipped because _ready_obj_ref is already set by recover(). This causes recovered replicas to initialize without a rank, breaking the deferred rank assignment feature.
python/ray/serve/_private/deployment_state.py#L627-L673
ray/python/ray/serve/_private/deployment_state.py
Lines 627 to 673 in d7131f0
```python
def recover(self) -> bool:
    """Recover replica version from a live replica actor.

    When the controller dies, the deployment state loses the info on the
    version that's running on each individual replica actor, so as part of
    the recovery process, we need to recover the version that is running on
    the replica actor.

    Also confirm that the actor is allocated and initialized before marking
    it as running.

    Returns: False if the replica actor is no longer alive; the actor could
        have been killed in the time between the controller fetching all
        Serve actors in the cluster and the controller trying to recover it.
        Otherwise, return True.
    """
    logger.info(f"Recovering {self.replica_id}.")
    try:
        self._actor_handle = ray.get_actor(
            self._actor_name, namespace=SERVE_NAMESPACE
        )
    except ValueError:
        logger.warning(
            f"Failed to get handle to replica {self._actor_name} "
            "during controller recovery. Marking as dead."
        )
        return False

    try:
        self._placement_group = ray.util.get_placement_group(
            self._actor_name,
        )
    except ValueError:
        # ValueError is raised if the placement group does not exist.
        self._placement_group = None

    # Re-fetch initialization proof
    self._allocated_obj_ref = self._actor_handle.is_allocated.remote()

    # Running actor handle already has all info needed, thus successful
    # starting simply means retrieving replica version hash from actor
    if self._is_cross_language:
        self._ready_obj_ref = self._actor_handle.check_health.remote()
    else:
        self._ready_obj_ref = (
            self._actor_handle.initialize_and_get_metadata.remote()
        )
```
Not true, rank is not required.
This is expected, because we want to fetch the rank from the already-running replica instead of assigning it a new rank during recovery.
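The distinction the reply describes — assign a rank on a fresh start, but reuse the replica's existing rank on recovery — can be sketched as below. All names here are simplified stand-ins for illustration, not Ray Serve's actual actor API.

```python
class FakeReplicaActor:
    """Toy stand-in for a replica actor, to illustrate the two code paths."""

    def __init__(self):
        self.rank = None

    def initialize_and_get_metadata(self, rank=None):
        # Fresh start: the controller passes a rank and the replica adopts it.
        # Recovery: no rank is passed, so the replica keeps the rank it
        # already holds, and the controller reads it back from the metadata.
        if rank is not None:
            self.rank = rank
        return {"rank": self.rank}


actor = FakeReplicaActor()

# Fresh start: the controller assigns rank 0.
meta = actor.initialize_and_get_metadata(rank=0)
assert meta["rank"] == 0

# Recovery after a controller restart: no rank is passed;
# the replica's existing rank is fetched instead of reassigned.
meta = actor.initialize_and_get_metadata()
assert meta["rank"] == 0
```

This is why the `recover()` path above intentionally omits the `rank` argument: the source of truth during recovery is the running replica, not the controller.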
Bug: Broken Rank Assignment in Replica Recovery
During replica recovery after controller restart, initialize_and_get_metadata.remote() is called without passing a rank parameter, but the method signature now requires it. The _assign_rank_callback is never set during recovery since start() isn't called, and the rank assignment logic in check_ready() is skipped because _ready_obj_ref is already set by recover(). This causes recovered replicas to initialize without a rank, breaking the deferred rank assignment system.
python/ray/serve/_private/deployment_state.py#L669-L672
ray/python/ray/serve/_private/deployment_state.py
Lines 669 to 672 in 29ce266
```python
else:
    self._ready_obj_ref = (
        self._actor_handle.initialize_and_get_metadata.remote()
    )
```
zcin
left a comment
What is the impact of this change on replica startup time?

I think this adds

Can we test / measure e2e?
Test application / profile code:

```diff
diff --git a/python/ray/serve/_private/deployment_state.py b/python/ray/serve/_private/deployment_state.py
index 4b96833aff..829e0a84f0 100644
--- a/python/ray/serve/_private/deployment_state.py
+++ b/python/ray/serve/_private/deployment_state.py
@@ -283,6 +283,7 @@ class ActorReplicaWrapper:
         # Outbound deployments polling state
         self._outbound_deployments: Optional[List[DeploymentID]] = None
+        self._replica_start_ts: float = 0.0

     @property
     def replica_id(self) -> str:
@@ -460,7 +461,7 @@ class ActorReplicaWrapper:
         self._deployment_is_cross_language = (
             deployment_info.deployment_config.is_cross_language
         )
-
+        self._replica_start_ts = time.time()
         logger.info(
             f"Starting {self.replica_id}.",
             extra={"log_to_stderr": False},
@@ -790,6 +791,9 @@ class ActorReplicaWrapper:
             )
             return ReplicaStartupStatus.FAILED, repr(e)
+        time_taken = time.time() - self._replica_start_ts
+        logger.info(f"Replica {self._replica_id} started in {time_taken:.2f}s.")
+
         return ReplicaStartupStatus.SUCCEEDED, None
```

From this PR:

```
❯ RAY_SERVE_CONTROL_LOOP_INTERVAL_S=2 serve run raw_config.yaml
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,908 controller 3251490 -- Replica Replica(id='ykzh0lp3', deployment='d_1', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,908 controller 3251490 -- Replica Replica(id='wj1jwbst', deployment='d_1', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,909 controller 3251490 -- Replica Replica(id='cob2l0ts', deployment='d_1', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,909 controller 3251490 -- Replica Replica(id='luh8ykjt', deployment='d_1', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,909 controller 3251490 -- Replica Replica(id='8vsm1wg7', deployment='d_2', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,909 controller 3251490 -- Replica Replica(id='ax8lcckm', deployment='d_2', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,910 controller 3251490 -- Replica Replica(id='rdhkqk0q', deployment='d_2', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,910 controller 3251490 -- Replica Replica(id='888udqd7', deployment='d_2', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,910 controller 3251490 -- Replica Replica(id='15e088r3', deployment='DITest', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,911 controller 3251490 -- Replica Replica(id='4r4b7on4', deployment='DITest', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,911 controller 3251490 -- Replica Replica(id='qq5kzseb', deployment='DITest', app='default') started in 4.08s.
(ServeController pid=3251490) INFO 2025-11-19 00:02:14,911 controller 3251490 -- Replica Replica(id='iffoha2e', deployment='DITest', app='default') started in 4.08s.
```

From master:
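For reference, the timing logic the diff adds can be reproduced standalone. This is a minimal sketch of the same pattern (record a timestamp on start, log the delta on success), not Ray Serve code; `StartupTimer` is a hypothetical helper name.

```python
import time


class StartupTimer:
    """Record a start timestamp and report the elapsed wall-clock time,
    mirroring the _replica_start_ts pattern in the diff above."""

    def __init__(self):
        self._start_ts = 0.0

    def mark_start(self):
        # Equivalent of setting self._replica_start_ts in start().
        self._start_ts = time.time()

    def elapsed(self):
        # Equivalent of time.time() - self._replica_start_ts in check_ready().
        return time.time() - self._start_ts


timer = StartupTimer()
timer.mark_start()
time.sleep(0.05)  # stand-in for replica allocation + initialization
print(f"started in {timer.elapsed():.2f}s")
```

Note that this measures controller-observed wall-clock time, so it includes any control-loop scheduling delay, which is exactly why the interval setting shows up in the numbers discussed below.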
Hmm, that's a pretty significant increase. Is there a way to avoid it?

The +2 additional seconds to start the replica is because I set

The other option I can think of to start the replica in the same controller iteration is to use

without the constant
…roject#58477)

**Summary**

Modified replica rank assignment to defer rank allocation until the replica is actually allocated, rather than assigning it during the startup call. This is needed for adding node-local ranks in the future: to support node rank and node-local rank we need to know the `node_id`, which is only known after the replica is allocated.

**Changes**

- Changed the `start()` method signature to accept `assign_rank_callback` instead of a pre-assigned `rank` parameter
- Rank is now assigned after `_allocated_obj_ref` is resolved, ensuring the replica is allocated before rank assignment
- Pass the rank to the `initialize_and_get_metadata()` method on the replica actor, allowing the rank to be set during initialization
- Updated `ReplicaBase.initialize()` to accept the rank as a parameter and set it along with the internal replica context
- Added a `PENDING_INITIALIZATION` status check to handle cases where `_ready_obj_ref` is not yet set

Next PR ray-project#58479

---------

Signed-off-by: abrar <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
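The deferred-assignment flow described above can be sketched as follows. The class and method names here are simplified stand-ins chosen for illustration, not Ray Serve's actual implementation: `start()` only stores the callback, and the rank is assigned when allocation resolves and the `node_id` is known.

```python
class Controller:
    """Toy rank allocator standing in for the deployment state manager."""

    def __init__(self):
        self._next_rank = 0

    def assign_rank(self, node_id):
        # node_id is only known post-allocation, which is why the
        # assignment must be deferred; node/local ranks would derive
        # from it in the follow-up PRs.
        rank = self._next_rank
        self._next_rank += 1
        return rank


class ReplicaWrapper:
    """Toy stand-in for ActorReplicaWrapper."""

    def __init__(self):
        self._assign_rank_callback = None
        self.rank = None

    def start(self, assign_rank_callback):
        # No rank yet: just remember how to obtain one later.
        self._assign_rank_callback = assign_rank_callback

    def on_allocated(self, node_id):
        # Allocation resolved (the analog of _allocated_obj_ref being
        # ready): now it is safe to assign the rank and pass it into
        # replica initialization.
        self.rank = self._assign_rank_callback(node_id)


controller = Controller()
replica = ReplicaWrapper()
replica.start(controller.assign_rank)
assert replica.rank is None        # deferred: no rank at start time
replica.on_allocated(node_id="node-A")
assert replica.rank == 0           # assigned only after allocation
```

Separating "how to get a rank" (the callback) from "when to get it" (allocation resolution) is what lets recovery skip assignment entirely while fresh starts still receive a rank exactly once.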
Summary

Modified replica rank assignment to defer rank allocation until the replica is actually allocated, rather than assigning it during the startup call. This is needed for adding node-local ranks in the future: to support node rank and node-local rank we need to know the `node_id`, which is only known after the replica is allocated.

Changes

- Changed the `start()` method signature to accept `assign_rank_callback` instead of a pre-assigned `rank` parameter
- Rank is now assigned after `_allocated_obj_ref` is resolved, ensuring the replica is allocated before rank assignment
- Pass the rank to the `initialize_and_get_metadata()` method on the replica actor, allowing the rank to be set during initialization
- Updated `ReplicaBase.initialize()` to accept the rank as a parameter and set it along with the internal replica context
- Added a `PENDING_INITIALIZATION` status check to handle cases where `_ready_obj_ref` is not yet set

Next PR #58479