
Conversation

@Yicheng-Lu-llll (Member) commented Nov 19, 2025

Description

Previously, `RAY_num_server_call_thread` controlled the gRPC reply thread pool size for all processes (including CoreWorkers), and its default value was tied to the number of CPUs, which could oversubscribe threads in CoreWorkers on large instances. In this PR, we introduce `RAY_core_worker_num_server_call_thread` to separately control CoreWorkers, defaulting to `min(2, max(1, num cpu/4))`, and scope `RAY_num_server_call_thread` to system components (raylet, GCS, etc.) only.

This keeps per-worker reply pools tiny so we can run many workers on the same node without oversubscribing threads; the choice of “2” is based on the microbenchmarks in #58351.
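
For reference, a minimal sketch (not part of the PR) of overriding either knob; it assumes the usual behavior that `RAY_`-prefixed config environment variables are read by Ray's processes at startup, so they must be set before Ray launches anything, as the test script below also does. The values are illustrative only.

```python
import os

# Illustrative overrides (not recommendations). Both variables are read by
# Ray's processes at startup, so set them before ray.init() launches the
# local raylet/GCS and workers, which inherit this environment.
os.environ["RAY_core_worker_num_server_call_thread"] = "4"  # per-CoreWorker reply pool
os.environ["RAY_num_server_call_thread"] = "8"              # raylet / GCS reply pool

import ray

ray.init()
```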

Related issues

Closes #58351

Test

```python
#!/usr/bin/env python3
import os
import subprocess
import sys


def get_thread_count(config_value):
    subprocess.run(["ray", "stop", "-f"], capture_output=True)
    
    env = os.environ.copy()
    if config_value is not None:
        env["RAY_core_worker_num_server_call_thread"] = str(config_value)
    
    test_code = """
import ray
import psutil
import os
import time

@ray.remote
def count_threads():
    return len(psutil.Process(os.getpid()).threads())

ray.init()

# Warm up once to make sure thread pools are instantiated.
ray.get(count_threads.remote())
time.sleep(1)

print(ray.get(count_threads.remote()))
"""
    
    result = subprocess.run(
        [sys.executable, "-c", test_code],
        env=env,
        capture_output=True,
        text=True,
        check=True
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    default_threads = get_thread_count(None)
    with_config_10 = get_thread_count(10)
    subprocess.run(["ray", "stop", "-f"], capture_output=True)
    
    print(f"Default (RAY_core_worker_num_server_call_thread=2): {default_threads} threads")
    print(f"With RAY_core_worker_num_server_call_thread=10: {with_config_10} threads")
```
```shell
# Default (RAY_core_worker_num_server_call_thread=2): 52 threads
# With RAY_core_worker_num_server_call_thread=10: 60 threads
```

By default, this setting creates two threads. After changing it to ten, we typically observe eight additional threads.
(Because of #55215, the exact count may differ, but in most cases the delta is three.)

@Yicheng-Lu-llll force-pushed the set_num_server_call_thread_for_core_worker_to_two branch from 0463c5a to 3d080da on November 19, 2025 04:36
@Yicheng-Lu-llll force-pushed the set_num_server_call_thread_for_core_worker_to_two branch from 3d080da to 5b3bec0 on November 19, 2025 06:08

```diff
-/// The pool size for grpc server call.
+/// The pool size for grpc server call for system components (raylet, GCS, etc.).
 RAY_CONFIG(int64_t,
```
@Yicheng-Lu-llll (Member Author), Nov 19, 2025:

This change does alter the meaning of num_server_call_thread, but I believe that’s exactly what we want.

Contributor:

Let's nix "etc." There is no etc. after raylet and GCS, and this specifically doesn't affect workers, so let's not make it confusing.

Member Author:

Sure!

@Yicheng-Lu-llll marked this pull request as ready for review on November 19, 2025 22:48
@Yicheng-Lu-llll requested a review from a team as a code owner on November 19, 2025 22:48
@Yicheng-Lu-llll (Member Author):

@edoakes Let me know if the current implementation looks good to you. Thank you!

@Yicheng-Lu-llll force-pushed the set_num_server_call_thread_for_core_worker_to_two branch from 4f277aa to 127c325 on November 19, 2025 23:16
@edoakes (Collaborator) commented Nov 19, 2025:

@ZacAttack PTAL

@ray-gardener (bot) added the core (Issues that should be addressed in Ray Core) label on Nov 20, 2025
```cpp
/// reply path is light enough that 2 threads is sufficient.
RAY_CONFIG(int64_t,
           core_worker_num_server_call_thread,
           std::min((int64_t)2,
```
Contributor:

This is min(2, max(1, env/4))... which is a little verbose for a line that only ever resolves to 2 or 1.

Member Author:

I can simplify it to:

```cpp
RAY_CONFIG(int64_t, core_worker_num_server_call_thread,
           std::thread::hardware_concurrency() >= 8 ? 2 : 1);
```

I'll add a tiny comment to avoid the threshold looking "magic". Let me know if this works.

@dayshah (Contributor) left a comment:

One note here: during the benchmarking you measured 100 KB. That is the max for a single object, but a task can have many returns, and the max inlined size across all objects is actually 10 MB, so a single PushTaskReply or PushTaskRequest can be up to 10 MB.

```cpp
// Max number bytes of inlined objects in a task rpc request/response.
RAY_CONFIG(int64_t, task_rpc_inlined_bytes_limit, 10 * 1024 * 1024)
```

None of our microbenchmarks really stress this either, afaik, and it's a little impractical. Tasks also have very high overhead just on launching, so streaming generators are generally going to be better for stressing this.

Also, while we're here reducing sender threads, there's one more interesting thing: when receiving requests, we actually do the copy into the proto request object on the main io_context thread, not the polling thread, so there's a lot more work the polling thread(s) could take on to lessen the burden on the io_context and remove bottlenecks. We should consider the number of request-receiver threads and sender threads in conjunction with each other.
#55904

```diff
 RAY_CONFIG(int64_t, health_check_failure_threshold, 5)

-/// The pool size for grpc server call.
+/// The pool size for grpc server call for system components (raylet, GCS, etc.).
```
Contributor:

Can you update this comment to mention that this is specifically for sending replies


```cpp
/// Set which config this process uses for the global reply thread pool.
/// Call before the first GetServerCallExecutor().
void SetServerCallThreadPoolMode(ServerCallThreadPoolMode mode);
```
Contributor:

Needing to set this at the right time is a little weird.

I guess you could avoid it by passing in the type of server from ServerCallImpl as an arg or template param but that would mean a new template param or arg all the way down the stack and idk if we want that just for this.

Contributor:

If it "helps" we already have this pattern for setting vars for GRPC elsewhere.... Though that is perhaps a weak justification.

Member Author:

Thank you! Let me rethink this.

The reason I went with this approach is that I saw a similar pattern in the core worker code lol, so I followed that and added SetServerCallThreadPoolMode right after it.

Member Author:

OK — I don’t really see a better option here besides (1) “set before using” or (2) “pass the parameter all the way down the stack”.

Given the size of this change and the current setup, we already require InitializeSystemConfig() to be called before GetServerCallExecutor(). We also have service_handler_.WaitUntilInitialized(), so I think it’s quite safe to call SetServerCallThreadPoolMode during core worker init.

Let me know what you think.

Contributor:

Ya that's ok for now, if we use this pattern more can think more about it

@Yicheng-Lu-llll (Member Author) commented Nov 20, 2025:

@dayshah Thank you for the detailed comment, this is super helpful!

Let me check that I fully understand the behavior:

  • By default, Ray treats the return value as a single object, so it must be < max_direct_call_object_size (100KB by default).

  • In contrast, if we use num_returns or streaming generators, multiple return objects can be inlined into the same RPC, and their sizes are accumulated up to task_rpc_inlined_bytes_limit (10MB):

    @ray.remote(num_returns=2)
    def f():
        return a, b
    
    x_ref, y_ref = f.remote()

For streaming generators, my current understanding is:
each yield produces one object that still needs to be < max_direct_call_object_size, and all yielded objects for the same task share the same PushTaskRequest/PushTaskReply, so the total inlined bytes are accumulated and capped by the 10MB task_rpc_inlined_bytes_limit. From the reply-thread perspective, we’re effectively sending one object’s payload at a time (each ≤ 100KB), but over the lifetime of the task the cumulative inline data on that RPC can grow up to ~10MB.

So using streaming essentially gives us high QPS, while each individual send still only handles a ≤100KB chunk at a time.
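
To put numbers on the accumulation, a small sketch using the two limits quoted above (assuming "100KB" means 100 * 1024 bytes, as in the worst-case task below):

```python
# How many ~100 KB inlined objects fit under the 10 MB per-RPC inline budget.
max_direct_call_object_size = 100 * 1024            # per-object inline limit (~100 KB)
task_rpc_inlined_bytes_limit = 10 * 1024 * 1024     # per-RPC inline budget (10 MB)
print(task_rpc_inlined_bytes_limit // max_direct_call_object_size)  # -> 102 objects
```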

I can re-measure the cost of ServerCall::Finish(...) using a multi-return case like:

```python
@ray.remote(num_returns=100)
def f():
    item = b"a" * (100 * 1024)
    return tuple(item for _ in range(100))
```
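
A rough driver loop to exercise this task could look like the sketch below (not from the PR; it assumes `ray.init()` has already run and `f` is defined as above, and the iteration count is arbitrary):

```python
# Sketch only: repeatedly invoke the worst-case task so replies keep flowing.
for _ in range(1_000):
    refs = f.remote()  # num_returns=100, so this is a list of 100 ObjectRefs
    ray.get(refs)      # fetch all 100 returned values (~10 MiB per task)
```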

Please let me know if this works!

@dayshah (Contributor) commented Nov 20, 2025:

> Please let me know if this works!

Yup, num_returns=100 with 100 KB objects would be the worst case. Also want to mention that this probably isn't that practical; most users will only have num_returns=1...

The streaming generator thing actually doesn't matter here, the more I think about it: each yield is a separate ReportGeneratorItemReturnsRequest, so it doesn't really apply here at all, since request writing and request receiving aren't dependent on this thread pool.

@Yicheng-Lu-llll (Member Author):

@dayshah @edoakes I re-ran the Finish() sync-slice timing with the worst-case inline payload (100 returns × 100 KiB ≈ 10 MiB total):

268.272 µs
277.566 µs
280.853 µs
284.206 µs
284.644 µs

Using the method here, even a contrived 10k QPS burst with this worst-case return just needs ~2.x threads to avoid tail latency (roughly time × 10,000). Since async actors default to DEFAULT_MAX_CONCURRENCY_ASYNC = 1000 (which would need ≪1 thread), I’m inclined to stick with 2 threads.
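
Spelling out the arithmetic behind that estimate (a back-of-envelope sketch using the measured slice times above; the 10k QPS burst is the hypothetical load from this thread, not a measured workload):

```python
# Reply work per second, expressed as thread-seconds, gives the number of
# fully busy reply threads needed to keep up with the burst.
per_reply_s = 285e-6       # ~285 microseconds per worst-case Finish() slice (measured above)
burst_qps = 10_000         # contrived burst rate discussed in this thread
print(per_reply_s * burst_qps)  # ~2.85 -> roughly 3 busy threads in the worst case
```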

@dayshah (Contributor) left a comment:

Nice investigation!



```diff
 std::unique_ptr<boost::asio::thread_pool> &_GetServerCallExecutor() {
   static auto thread_pool = std::make_unique<boost::asio::thread_pool>(
-      ::RayConfig::instance().num_server_call_thread());
+      ThreadPoolMode().load(std::memory_order_acquire) ==
```
Contributor:

imo the memory ordering is cool but adds unnecessary overhead when reading for a part that's not perf sensitive at all

@Yicheng-Lu-llll added the go (add ONLY when ready to merge, run all tests) label on Nov 24, 2025
@edoakes merged commit 50703af into ray-project:master on Nov 24, 2025
7 checks passed
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Dec 1, 2025
Reverts:
- 0752886 [core] enable open telemetry by default (ray-project#56432)
- 50703af [core] Limit core worker gRPC reply threads to 2 by default (ray-project#58771)

Testing if these changes caused the aggregator-to-GCS performance regression.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: sampan <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
Successfully merging this pull request may close these issues.

[core] Same actor/task has different number of threads in different environments
