TMCS starts too many processes (and dies) #292

Closed · Fixed by #329
mdbenito opened this issue Feb 28, 2023 · 0 comments
Labels: bug (Something isn't working)
Milestone: v0.6.1

@mdbenito (Collaborator) commented Feb 28, 2023

With code based on v0.5.0 (commit 152ec50), Ray complains with warnings like:

2023-02-28 15:45:28,812	WARNING worker.py:1851 -- WARNING: 274 PYTHON worker processes have been started on node: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506 with address: 192.168.0.52. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).

The Ray issue referenced in the warning suggests this can be caused by nested remote calls that block in `ray.get()`; a sketch of that pattern is below.
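For reference, a minimal hypothetical sketch of that pattern (not the actual TMCS implementation): an outer task blocks in `ray.get()` while waiting for inner tasks, so Ray keeps spawning additional worker processes to run them.

```python
import ray

ray.init()

@ray.remote
def inner(x):
    return x * 2

@ray.remote
def outer(xs):
    # This worker blocks in ray.get() while the inner tasks run,
    # so Ray starts extra workers for them instead of reusing this one.
    return sum(ray.get([inner.remote(x) for x in xs]))

# Many outer tasks, each blocking on its own batch of inner tasks,
# can push the worker process count well beyond the number of CPUs.
results = ray.get([outer.remote(list(range(8))) for _ in range(64)])
```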

On my machine, the data utility notebook ends up spitting this out:

...
2023-02-28 15:58:12,575	WARNING worker.py:1851 -- WARNING: 676 PYTHON worker processes have been started on node: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506 with address: 192.168.0.52. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(raylet) [2023-02-28 15:58:19,519 E 114931 114931] (raylet) node_manager.cc:3097: 2 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506, IP: 192.168.0.52) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.0.52`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-02-28 15:58:21,823	WARNING worker.py:1851 -- WARNING: 736 PYTHON worker processes have been started on node: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506 with address: 192.168.0.52. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(raylet) [2023-02-28 15:59:19,521 E 114931 114931] (raylet) node_manager.cc:3097: 4 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506, IP: 192.168.0.52) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.0.52`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

And after a few more iterations, it dies with:

(raylet) [2023-02-28 16:18:26,974 E 114931 114931] (raylet) worker_pool.cc:524: Some workers of the worker process(953902) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(ShapleyWorker pid=900576) E0228 16:18:22.835699587  902741 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=906401) E0228 16:18:26.290218105  909206 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=906390) E0228 16:18:22.540661878  907569 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=912437) E0228 16:18:25.968255772  914199 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=912395) E0228 16:18:30.959742320  913533 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
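For the record, the mitigations named in the raylet message map to roughly the following (a sketch under the assumption of a local cluster started by the driver, not something I've verified here):

```python
import os

# The raylet message names two environment knobs; they must be set before Ray
# starts (exported in the shell, or set before ray.init() for a local cluster).
os.environ["RAY_memory_usage_threshold"] = "0.95"   # raise the OOM kill threshold
os.environ["RAY_memory_monitor_refresh_ms"] = "0"   # 0 disables worker killing

import ray
ray.init()

# It also suggests reducing task parallelism by requesting more CPUs per task:
@ray.remote(num_cpus=2)
def heavy_task(data):  # hypothetical task, only to illustrate the option
    ...
```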
@mdbenito added the "bug (Something isn't working)" label on Feb 28, 2023
@mdbenito changed the title from "TMCS starts too many processes" to "TMCS starts too many processes (and dies)" on Feb 28, 2023
@mdbenito changed the title from "TMCS starts too many processes (and dies)" to "TMCS starts too many processes" on Feb 28, 2023
@mdbenito changed the title from "TMCS starts too many processes" to "TMCS starts too many processes (and dies)" on Feb 28, 2023
@mdbenito modified the milestones: v0.6.0 → v0.6.1 on Mar 16, 2023