[rollout] fix: correct heap-based load balancing in AsyncLLMServerManager by hellcatCS · Pull Request #4505 · verl-project/verl

hellcatCS · 2025-12-12T16:52:09Z

What does this PR do?

This PR fixes the load balancing issue in AsyncLLMServerManager where the heap-based server selection was using hash values instead of indices, causing unpredictable server selection order after shuffling.

Problem:

The original implementation used hash(server) as the secondary sort key in the heap
When all servers had the same request count (0), the heap would select the server with the minimum hash value, not the first server in the shuffled list
This resulted in poor load distribution and defeated the purpose of random shuffling

Solution:

Replace hash(server) with explicit indices in the heap structure
Heap now sorts by (request_count, index, server) instead of (request_count, hash, server)
Ensures deterministic selection: when request counts are equal, the server with the lowest index (first in the shuffled list) is always chosen

Example:

Before (❌ Broken):

server_handles = [Server_A, Server_B, Server_C]
random.shuffle(server_handles) # → [Server_C, Server_A, Server_B]
weighted_servers = [[0, hash(s), s] for s in server_handles]

Heap might select Server_A first (min hash), not Server_C!

After (✅ Fixed):

server_handles = [Server_A, Server_B, Server_C]
random.shuffle(server_handles) # → [Server_C, Server_A, Server_B]
weighted_servers = [[0, idx, s] for idx, s in enumerate(server_handles)]

Heap correctly selects Server_C first (idx=0)

gemini-code-assist

Code Review

This pull request correctly fixes a load balancing issue in AsyncLLMServerManager by replacing the unpredictable hash(server) with a deterministic index as a tie-breaker in the server selection heap. The change ensures that servers are selected in a predictable, shuffled order when their request counts are equal. My review includes a suggestion to refactor the heap update logic in _choose_server to be more idiomatic and readable, improving maintainability.

gemini-code-assist · 2025-12-12T16:53:36Z

verl/experimental/agent_loop/agent_loop.py

+        _, _, server = self.weighted_serveres[0]
        self.weighted_serveres[0][0] += 1
        heapq.heapreplace(self.weighted_serveres, self.weighted_serveres[0])
        self.request_id_to_server[request_id] = server


The current implementation for updating the heap is a bit obscure. It modifies the top element of the heap in-place, which violates the heap invariant, and then uses heapq.heapreplace to restore it. While this works, it's not very readable and relies on implementation details of heapq.heapreplace.

A more idiomatic and clearer approach would be to use heapq.heappop followed by heapq.heappush. This makes the intent of the code much more explicit and improves maintainability.

Suggested change

_, _, server = self.weighted_serveres[0]

self.weighted_serveres[0][0] += 1

heapq.heapreplace(self.weighted_serveres, self.weighted_serveres[0])

self.request_id_to_server[request_id] = server

count, idx, server = heapq.heappop(self.weighted_serveres)

heapq.heappush(self.weighted_serveres, [count + 1, idx, server])

self.request_id_to_server[request_id] = server

hellcatCS · 2025-12-13T14:14:50Z

@vermouth1992 @wuxibin89 @PeterSH6 @chenhaiq @zhaochenyang20 @SwordFaith please review the request

…ager (verl-project#4505) ### What does this PR do? This PR fixes the load balancing issue in AsyncLLMServerManager where the heap-based server selection was using hash values instead of indices, causing unpredictable server selection order after shuffling. Problem: The original implementation used hash(server) as the secondary sort key in the heap When all servers had the same request count (0), the heap would select the server with the minimum hash value, not the first server in the shuffled list This resulted in poor load distribution and defeated the purpose of random shuffling Solution: Replace hash(server) with explicit indices in the heap structure Heap now sorts by (request_count, index, server) instead of (request_count, hash, server) Ensures deterministic selection: when request counts are equal, the server with the lowest index (first in the shuffled list) is always chosen Example: # Before (❌ Broken): server_handles = [Server_A, Server_B, Server_C] random.shuffle(server_handles) # → [Server_C, Server_A, Server_B] weighted_servers = [[0, hash(s), s] for s in server_handles] # Heap might select Server_A first (min hash), not Server_C! # After (✅ Fixed): server_handles = [Server_A, Server_B, Server_C] random.shuffle(server_handles) # → [Server_C, Server_A, Server_B] weighted_servers = [[0, idx, s] for idx, s in enumerate(server_handles)] # Heap correctly selects Server_C first (idx=0) Co-authored-by: Aleksandr Semikin <aesemikin@alice-a100.sas.yp-c.yandex.net>

fixed_server_shuffling

4ee3c0e

gemini-code-assist bot reviewed Dec 12, 2025

View reviewed changes

wuxibin89 approved these changes Dec 15, 2025

View reviewed changes

wuxibin89 merged commit 36c2d4a into verl-project:main Dec 15, 2025
74 of 77 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rollout] fix: correct heap-based load balancing in AsyncLLMServerManager#4505

[rollout] fix: correct heap-based load balancing in AsyncLLMServerManager#4505
wuxibin89 merged 1 commit intoverl-project:mainfrom
hellcatCS:fix_server_shuffling

hellcatCS commented Dec 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

hellcatCS commented Dec 13, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hellcatCS commented Dec 12, 2025

What does this PR do?

Before (❌ Broken):

Heap might select Server_A first (min hash), not Server_C!

After (✅ Fixed):

Heap correctly selects Server_C first (idx=0)

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

hellcatCS commented Dec 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hellcatCS commented Dec 13, 2025 •

edited

Loading