[Core] Too many threads in ray worker #36936
Comments
Any progress on this?
cc @rynewang, is this the same issue you fixed recently?
I believe this problem leads to the "pthread_create: resource temporarily unavailable" error when there are too many parallel tasks, say 32. Both Ray's own logging and the application code (in my case, some OpenMP code) suffer from it. While tasks are running, every ray::xxx process shows a thread count (nTH) of 100+.
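A quick way to reproduce the per-process counts mentioned above is to walk /proc. This is a minimal observation sketch assuming Linux and that Ray workers show "ray::" in their command line (as seen in ps); it is not a Ray API:

```python
# Sketch: count the threads of every Ray-related process on Linux by walking /proc.
import os

def ray_thread_counts():
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmd = f.read().replace(b"\0", b" ").decode(errors="replace").strip()
            if "ray::" not in cmd and "raylet" not in cmd:
                continue
            # One subdirectory per thread under /proc/<pid>/task.
            counts[(int(pid), cmd[:40])] = len(os.listdir(f"/proc/{pid}/task"))
        except OSError:
            continue  # the process exited while we were scanning
    return counts

if __name__ == "__main__":
    counts = ray_thread_counts()
    for (pid, cmd), n in sorted(counts.items(), key=lambda kv: -kv[1]):
        print(f"{pid:>7}  {n:>4} threads  {cmd}")
    print("total threads across Ray processes:", sum(counts.values()))
```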
Since it's Ray 2.4.0, maybe it's #33957. Could you try Ray 2.7.1?
@rynewang There is no actor involved in this repro script:

```python
import ray

ray.init()
```
There was an issue where we created a thread per caller for async actors, and that's been fixed. I think this issue is still not fixed.
The event_engine threads come from gRPC, and there's currently no way to control them (unless we patch gRPC). In reality I think most of them are idle. It is mysterious why we have so many worker.io threads per proc (we expect only 1 per proc). In terms of a fix timeline, as https://discuss.ray.io/t/too-many-threads-in-ray-worker/10881/12?u=sangcho says, we may not prioritize a fix for a while unless there's concrete proof of a performance impact. Regarding the system resource limit, we generally recommend setting a high ulimit.
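For reference, the ulimit in question is the per-user process/thread limit (`ulimit -u`, i.e. RLIMIT_NPROC, which on Linux counts threads as well). A minimal sketch, assuming Linux, that checks and raises the soft limit from Python; the limit is inherited by child processes, so it has to be raised in the shell or process that launches Ray:

```python
# Sketch: check and raise the per-user process/thread limit (ulimit -u) before
# starting Ray. Only the soft limit can be raised without privileges, and only
# up to the hard limit; child processes (the Ray workers) inherit the result.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"RLIMIT_NPROC: soft={soft} hard={hard}")

if soft != hard:
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))
    print("raised soft limit to", hard)
```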
In my case, around 4,000 threads were created. Sometimes I cannot scale up the Ray instance as a result.
I did another investigation today and finally found the source of these threads: ray/src/ray/rpc/server_call.cc, lines 23 to 27 (at baffe07). The default comes from ray/src/ray/common/ray_config_def.h, lines 855 to 858 (at baffe07).
the backtrace of the thread creation is like this #0 __pthread_create_2_1 (newthread=0x555556cb26a0, attr=0x0, start_routine=0x7ffff6cc9c00 <boost_asio_detail_posix_thread_function>, arg=0x7fff70004570) at ./nptl/pthread_create.c:621
#1 0x00007ffff6cd355e in boost::asio::detail::posix_thread::start_thread(boost::asio::detail::posix_thread::func_base*) () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#2 0x00007ffff6cd3a1c in boost::asio::thread_pool::thread_pool(unsigned long) () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#3 0x00007ffff66157d4 in ray::rpc::(anonymous namespace)::_GetServerCallExecutor() () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#4 0x00007ffff6615869 in ray::rpc::GetServerCallExecutor() () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#5 0x00007ffff6338382 in std::_Function_handler<void (ray::Status, std::function<void ()>, std::function<void ()>), ray::rpc::ServerCallImpl<ray::rpc::CoreWorkerServiceHandler, ray::rpc::GetCoreWorkerStatsRequest, ray::rpc::GetCoreWorkerStatsReply, (ray::rpc::AuthType)0>::HandleRequestImpl(bool)::{lambda(ray::Status, std::function<void ()>, std::function<void ()>)#2}>::_M_invoke(std::_Any_data const&, ray::Status&&, std::function<void ()>&&, std::function<void ()>&&) () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#6 0x00007ffff637a8b9 in ray::core::CoreWorker::HandleGetCoreWorkerStats(ray::rpc::GetCoreWorkerStatsRequest, ray::rpc::GetCoreWorkerStatsReply*, std::function<void (ray::Status, std::function<void ()>, std::function<void ()>)>) ()
from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#7 0x00007ffff636fcf4 in ray::rpc::ServerCallImpl<ray::rpc::CoreWorkerServiceHandler, ray::rpc::GetCoreWorkerStatsRequest, ray::rpc::GetCoreWorkerStatsReply, (ray::rpc::AuthType)0>::HandleRequestImpl(bool) ()
from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#8 0x00007ffff6626f5e in EventTracker::RecordExecution(std::function<void ()> const&, std::shared_ptr<StatsHandle>) () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#9 0x00007ffff662034e in std::_Function_handler<void (), instrumented_io_context::post(std::function<void ()>, std::string, long)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#10 0x00007ffff66207c6 in boost::asio::detail::completion_handler<std::function<void ()>, boost::asio::io_context::basic_executor_type<std::allocator<void>, 0ul> >::do_complete(void*, boost::asio::detail::scheduler_operation*, boost::system::error_code const&, unsigned long)
() from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#11 0x00007ffff6cd022b in boost::asio::detail::scheduler::do_run_one(boost::asio::detail::conditionally_enabled_mutex::scoped_lock&, boost::asio::detail::scheduler_thread_info&, boost::system::error_code const&) ()
from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#12 0x00007ffff6cd1ba9 in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#13 0x00007ffff6cd22b2 in boost::asio::io_context::run() () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#14 0x00007ffff6351889 in ray::core::CoreWorker::RunIOService() () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#15 0x00007ffff670fb80 in thread_proxy () from /home/ubuntu/example/.venv/lib/python3.10/site-packages/ray/_raylet.so
#16 0x00007ffff7c94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#17 0x00007ffff7d26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 seems like these threads inherit the To workaround it, you can set the env var I'm not sure if |
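To see which of the sources above dominates in a given worker, one can histogram the thread names of a single Ray worker process. A Linux-only sketch; the PID is whatever ps reports for a ray::IDLE (or ray::<task>) process, and the thread names printed are simply whatever the process sets, not values guaranteed here:

```python
# Sketch: histogram the thread names inside one Ray worker process on Linux,
# to see which kinds of threads (e.g. worker.io, gRPC event engine, the
# server-call thread pool) account for the total. Pass the worker PID as argv[1].
import os
import sys
from collections import Counter

pid = int(sys.argv[1])
names = Counter()
for tid in os.listdir(f"/proc/{pid}/task"):
    try:
        with open(f"/proc/{pid}/task/{tid}/comm") as f:
            names[f.read().strip()] += 1
    except OSError:
        pass  # the thread exited while we were scanning

for name, count in names.most_common():
    print(f"{count:>5}  {name}")
print("total threads:", sum(names.values()))
```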
@jjyao should we close this and open another GH issue to track the addition of a cap?
I recently encountered the same issue. While this may not directly impact performance, there is a problem when the system reaches the user thread limit, which prevents the creation of new threads. This can be especially problematic in environments like a SLURM cluster. In my current setup, I am hitting the thread limit on the nodes of my SLURM cluster, resulting in messages like "bash: fork: retry: Resource temporarily unavailable" when trying to submit a simple job or SSH into a node where my Ray processes are running.
Thanks for the investigation in #36936 (comment)! I think for now users can set … Note to future self: understand if using …
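The exact setting is elided in the comment above. As a general pattern, Ray's internal configs from ray_config_def.h can be overridden by exporting an environment variable named RAY_<config_name> before the Ray processes start; the sketch below uses a deliberately hypothetical placeholder (SOME_CONFIG_NAME), not a real config name:

```python
# Sketch only: RAY_SOME_CONFIG_NAME is a hypothetical placeholder for the (elided)
# config referenced above, not a real Ray setting. Ray internal configs are generally
# overridable via RAY_<config_name> environment variables, but they must be set
# before the Ray processes start (e.g. before `ray start` on each node, or before
# ray.init() in local mode) so the workers inherit them.
import os

os.environ["RAY_SOME_CONFIG_NAME"] = "1"  # placeholder; substitute the real config name

import ray

ray.init()
```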
🤔 @rynewang maybe an enhancement for the future?
Yes please! An enhancement would be welcome. We still get thousands of threads in our Kubernetes pod even with …
I agree; this creates issues with RunPod.
Any progress? In our experience, excessive threads (even if they are rarely scheduled) can lead to significant performance loss for compute-intensive calls...
@FengLi666 Please consider doing a performance comparison of some kind to show the significant performance loss and help prioritize this; otherwise they will just keep telling us to set a high ulimit.
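A minimal benchmark harness along those lines is sketched below (the workload, matrix size, and task count are arbitrary choices, not anything prescribed in this thread): run it once on a default setup and once with the thread-reducing workaround applied, then compare the printed totals.

```python
# Sketch: measure throughput of compute-bound Ray tasks so two configurations
# (default vs. thread-reduction workaround) can be compared. Run this script once
# under each configuration and compare the wall-clock totals.
import time

import numpy as np
import ray

@ray.remote(num_cpus=1)
def burn(size: int = 2000, reps: int = 20) -> float:
    """A CPU-bound task: repeated matrix multiplications."""
    a = np.random.rand(size, size)
    t0 = time.perf_counter()
    for _ in range(reps):
        a = a @ a
        a /= np.linalg.norm(a)  # keep values bounded between iterations
    return time.perf_counter() - t0

if __name__ == "__main__":
    ray.init()
    t0 = time.perf_counter()
    per_task = ray.get([burn.remote() for _ in range(64)])  # 64 tasks is arbitrary
    total = time.perf_counter() - t0
    print(f"total wall clock: {total:.2f}s, "
          f"mean in-task compute: {sum(per_task) / len(per_task):.2f}s")
```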
What happened + What you expected to happen
On an 8-core Linux server, in local mode after calling `ray.init()`, Ray creates 8 (idle) workers, which looks reasonable, but in each of them it also creates many threads, in this case a total of 33 threads per worker.

On a 128-core Linux server: a total of 86 threads in one ray IDLE process. And there are 128 ray IDLE processes, so over 10,000 threads are spawned just by calling `ray.init()`.
.See also https://discuss.ray.io/t/too-many-threads-in-ray-worker/10881
Versions / Dependencies
Reproduction script
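As quoted earlier in the thread, the reproduction is just initializing Ray in local mode:

```python
import ray

ray.init()
```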
Issue Severity
Medium: It is a significant difficulty but I can work around it.