The error I am getting while trying to run the pipeline is:
(orchestrate pid=382, ip=172.17.29.195) 11:01:01 INFO - Cluster resources: {'cpus': 280, 'gpus': 0, 'memory': 2180.0, 'object_store': 652.3036788804457}
(orchestrate pid=382, ip=172.17.29.195) 11:01:01 INFO - Number of workers - 102 with {'num_cpus': 2.3333333333333335, 'max_restarts': -1} each
(orchestrate pid=382, ip=172.17.29.195) Traceback (most recent call last):
(orchestrate pid=382, ip=172.17.29.195) File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=382, ip=172.17.29.195) processors = RayUtils.create_actors(
(orchestrate pid=382, ip=172.17.29.195) File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/ray_utils.py", line 121, in create_actors
(orchestrate pid=382, ip=172.17.29.195) raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=382, ip=172.17.29.195) data_processing.utils.unrecoverable.UnrecoverableException: out of 102 created actors only 100 alive
(orchestrate pid=382, ip=172.17.29.195) 11:03:06 ERROR - Exception during execution out of 102 created actors only 100 alive: None
This problem happens every time I attempt to start a fuzzy dedup pipeline. If I decrease the number of actors to fewer than 100 (e.g. 99), the pipeline runs every time.
OS
Red Hat Enterprise Linux (RHEL)
Python
3.11.x
Are you willing to submit a PR?
Yes I am willing to submit a PR!
As we discussed before, the issue is the amount of resources. The 280 CPUs that you have are not contiguous: actor allocation is based on the number of CPUs per node, because the allocation happens on the individual nodes. Each node also needs resources for its own management, etc. You are leaving 1 CPU per node for this, which is not enough for some of the nodes (we have no control over the resource utilization of Ray itself). I was using 0.85, which gives you 6.8 CPUs per node available for user workload, not 7 as you assume. As a result, some of the nodes do not have 7 CPUs available for actor allocation, because they are running additional DPK things: transform statistics, the orchestrator, etc.
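The per-node packing described above can be sketched in a few lines of Python. This is only an illustration, not DPK or Ray code: the helper names are made up, the node count of 35 is an assumption (280 CPUs divided by the 8 CPUs per node implied by the 0.85 figure), and the number of nodes carrying extra load is hypothetical, since the issue does not state it.

```python
from fractions import Fraction
from math import floor

# Exact fractions avoid float rounding around 7 / 2.3333...
ACTOR_CPUS = "7/3"   # 2.3333333333333335 CPUs per worker, from the log
NODES = 35           # assumption: 280 CPUs / 8 CPUs per node


def actors_that_fit(available_cpus: str, actor_cpus: str) -> int:
    """Whole actors that fit into the CPUs available on a single node."""
    return floor(Fraction(available_cpus) / Fraction(actor_cpus))


def cluster_capacity(per_node_available: list[str], actor_cpus: str) -> int:
    """Actors are packed node by node, so leftover CPUs cannot be pooled."""
    return sum(actors_that_fit(cpus, actor_cpus) for cpus in per_node_available)


# Every node really has 7 CPUs free: 3 actors per node.
ideal = cluster_capacity(["7"] * NODES, ACTOR_CPUS)

# Hypothetical: 5 nodes have only 6.8 CPUs free, so they hold 2 actors each.
mixed = cluster_capacity(["7"] * 30 + ["6.8"] * 5, ACTOR_CPUS)

print(ideal, mixed)
```

Under these assumptions the ideal case leaves room for only three actors beyond the 102 requested, so a few nodes that fall slightly short of 7 free CPUs are enough to leave some actors unscheduled.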
General recommendations based on experience:
Larger nodes are better. I would suggest using at least 16 CPUs and 256 GB of memory per node.
I would suggest not using an explicit number of workers, but rather relying on the computation provided by the project.
If you still want to override the number of workers manually, then assuming you have 16 CPUs per node, you can use roughly 13 CPUs per node for actors. Assuming the actor CPU is 2, that is 6 actors per node. With, for example, 20 nodes, you can use 120 actors. Anything above that is questionable.
To get a full picture, I would suggest removing the exit handler (Revital can help you with this) and then looking at the cluster usage.
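The rule of thumb for a manual worker count can be written as a small check. This is a sketch under the numbers given in the recommendation (20 nodes, roughly 13 usable CPUs per node, 2 CPUs per actor); `max_manual_workers` is a hypothetical helper, not part of DPK.

```python
def max_manual_workers(nodes: int, usable_cpus_per_node: int, actor_cpus: int) -> int:
    """Upper bound on actors when every node is packed independently."""
    # Integer division drops the fractional actor that would not fit on a node.
    return nodes * (usable_cpus_per_node // actor_cpus)


limit = max_manual_workers(nodes=20, usable_cpus_per_node=13, actor_cpus=2)
print(limit)  # 120

requested = 120  # example override; anything above the limit is questionable
if requested > limit:
    print(f"{requested} workers requested, but only about {limit} are likely to fit")
```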
Component
KFP workflows, Library/core, Transforms/universal/fdedup
What happened + What you expected to happen
I am not able to run a KFP pipeline for fuzzy dedup that creates 102 actors with the following configuration:
Reproduction script
Compile a KFP pipeline using this code commit: https://github.com/IBM/data-prep-kit/tree/4941d5bab37a0bdc1e5873ce8e7288483703751f
Upload the pipeline to an OCP cluster. Run the pipeline with the worker config mentioned above: