TMCS starts too many processes (and dies) #292

Closed · Fixed by #329
mdbenito opened this issue Feb 28, 2023 · 0 comments
Labels: bug (Something isn't working)
Milestone: v0.6.1

@mdbenito (Collaborator) commented Feb 28, 2023

With code based on v0.5.0 (commit 152ec50), Ray complains with warnings like:

2023-02-28 15:45:28,812	WARNING worker.py:1851 -- WARNING: 274 PYTHON worker processes have been started on node: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506 with address: 192.168.0.52. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).

The Ray issue referenced in the warning suggests this can be caused by nested remote calls that block in `ray.get()`; a sketch of that pattern is below.
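For reference, a minimal hypothetical sketch of that pattern (not the actual TMCS implementation): an outer task blocks in `ray.get()` while waiting for inner tasks, so Ray keeps spawning additional worker processes to run them.

```python
import ray

ray.init()

@ray.remote
def inner(x):
    return x * 2

@ray.remote
def outer(xs):
    # This worker blocks in ray.get() while the inner tasks run,
    # so Ray starts extra workers for them instead of reusing this one.
    return sum(ray.get([inner.remote(x) for x in xs]))

# Many outer tasks, each blocking on its own batch of inner tasks,
# can push the worker process count well beyond the number of CPUs.
results = ray.get([outer.remote(list(range(8))) for _ in range(64)])
```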

On my machine, the data utility notebook ends up spitting this out:

...
2023-02-28 15:58:12,575	WARNING worker.py:1851 -- WARNING: 676 PYTHON worker processes have been started on node: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506 with address: 192.168.0.52. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(raylet) [2023-02-28 15:58:19,519 E 114931 114931] (raylet) node_manager.cc:3097: 2 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506, IP: 192.168.0.52) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.0.52`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
2023-02-28 15:58:21,823	WARNING worker.py:1851 -- WARNING: 736 PYTHON worker processes have been started on node: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506 with address: 192.168.0.52. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(raylet) [2023-02-28 15:59:19,521 E 114931 114931] (raylet) node_manager.cc:3097: 4 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 22020832ef6d4fae8e4bb7b15ae173df3c17d639877d04d413e5c506, IP: 192.168.0.52) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.0.52`
(raylet) 
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.

And after a few more iterations, it dies with:

(raylet) [2023-02-28 16:18:26,974 E 114931 114931] (raylet) worker_pool.cc:524: Some workers of the worker process(953902) have not registered within the timeout. The process is still alive, probably it's hanging during start.
(ShapleyWorker pid=900576) E0228 16:18:22.835699587  902741 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=906401) E0228 16:18:26.290218105  909206 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=906390) E0228 16:18:22.540661878  907569 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=912437) E0228 16:18:25.968255772  914199 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
(ShapleyWorker pid=912395) E0228 16:18:30.959742320  913533 chttp2_transport.cc:2721]             keepalive_ping_end state error: 0 (expect: 1)
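For the record, the mitigations named in the raylet message map to roughly the following (a sketch under the assumption of a local cluster started by the driver, not something I've verified here):

```python
import os

# The raylet message names two environment knobs; they must be set before Ray
# starts (exported in the shell, or set before ray.init() for a local cluster).
os.environ["RAY_memory_usage_threshold"] = "0.95"   # raise the OOM kill threshold
os.environ["RAY_memory_monitor_refresh_ms"] = "0"   # 0 disables worker killing

import ray
ray.init()

# It also suggests reducing task parallelism by requesting more CPUs per task:
@ray.remote(num_cpus=2)
def heavy_task(data):  # hypothetical task, only to illustrate the option
    ...
```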
@mdbenito added the "bug (Something isn't working)" label on Feb 28, 2023
@mdbenito changed the title from "TMCS starts too many processes" to "TMCS starts too many processes (and dies)" on Feb 28, 2023
@mdbenito changed the title from "TMCS starts too many processes (and dies)" to "TMCS starts too many processes" on Feb 28, 2023
@mdbenito changed the title from "TMCS starts too many processes" to "TMCS starts too many processes (and dies)" on Feb 28, 2023
@mdbenito modified the milestones: v0.6.0 → v0.6.1 on Mar 16, 2023