[Bug] Cannot run KFP pipeline for fuzzy dedup with more than 100 actors #803

cmadam · 2024-11-16T00:19:00Z

Search before asking

I searched the issues and found no similar issues.

Component

KFP workflows, Library/core, Transforms/universal/fdedup

What happened + What you expected to happen

I am not able to run a KFP pipeline for fuzzy dedup that creates 102 actors with the following configuration:

{
  "cpu": 8,
  "image": "us.icr.io/cil15-shared-registry/preprocessing-pipelines/fuzzydedup-ray:0.2.8-v5",
  "imagePullPolicy": "Always",
  "image_pull_secret": "****-***-***-**",
  "max_replicas": 35,
  "memory": 60,
  "min_replicas": 35,
  "replicas": 35
}

The error I am getting while trying to run the pipeline is:

(orchestrate pid=382, ip=172.17.29.195) 11:01:01 INFO - Cluster resources: {'cpus': 280, 'gpus': 0, 'memory': 2180.0, 'object_store': 652.3036788804457}
(orchestrate pid=382, ip=172.17.29.195) 11:01:01 INFO - Number of workers - 102 with {'num_cpus': 2.3333333333333335, 'max_restarts': -1} each
(orchestrate pid=382, ip=172.17.29.195) Traceback (most recent call last):
(orchestrate pid=382, ip=172.17.29.195)   File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/transform_orchestrator.py", line 96, in orchestrate
(orchestrate pid=382, ip=172.17.29.195)     processors = RayUtils.create_actors(
(orchestrate pid=382, ip=172.17.29.195)   File "/home/ray/anaconda3/lib/python3.10/site-packages/data_processing_ray/runtime/ray/ray_utils.py", line 121, in create_actors
(orchestrate pid=382, ip=172.17.29.195)     raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")
(orchestrate pid=382, ip=172.17.29.195) data_processing.utils.unrecoverable.UnrecoverableException: out of 102 created actors only 100 alive
(orchestrate pid=382, ip=172.17.29.195) 11:03:06 ERROR - Exception during execution out of 102 created actors only 100 alive: None

Reproduction script

Compile a KFP pipeline using this code commit: https://github.com/IBM/data-prep-kit/tree/4941d5bab37a0bdc1e5873ce8e7288483703751f

cd transforms/universal/fdedup/kfp_ray
make workflow-venv
source ../../../venv/bin/activate
export FDEDUP_IMAGE_LOCATION=us.icr.io/cil15-shared-registry/preprocessing-pipelines/fuzzydedup-ray:0.2.8-v5
export FDEDUP_IMAGE_PULL_SECRET="****-***-***-**"
python fdedup_wf.py

Upload the pipeline in OCP cluster. Run the pipeline with the worker config mentioned above:

{
  "cpu": 8,
  "image": "us.icr.io/cil15-shared-registry/preprocessing-pipelines/fuzzydedup-ray:0.2.8-v5",
  "imagePullPolicy": "Always",
  "image_pull_secret": "****-***-***-**",
  "max_replicas": 35,
  "memory": 60,
  "min_replicas": 35,
  "replicas": 35
}

Anything else

This problem happens every time I am attempting to start a fuzzy dedup pipeline. If I decrease the number of actors to be less than 100 (e.g. 99), the pipeline runs every time.

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

Yes I am willing to submit a PR!

The text was updated successfully, but these errors were encountered:

blublinsky · 2024-11-18T09:21:58Z

As we discussed before, the issue is the amount of resources. The amount of cpus that you have - 280 is not continious, the actor allocation is based or the amount of cpus per node, as the allocation is happening on the individual nodes. A node itself needs to have resources for its management, etc. You are leaving 1 cpu per node for this, which is not enough for some of nodes (we have no control over resource utilization of Ray itself). I was using .85 which gives you 6.8 cpu per node available for user workload, not 7 as you assume. so as a result some of the nodes do not have 7 CPUs available for actor allocation as they are running additional DPK things - transform statistics, orchestrator, etc.
General recommendations based on experience:

Larger nodes are better, I would suggest using at least 16 CPU, 256 memory per node.
I would suggest not to use explicit number of workers, but rather rely on computations provided by project
If you still want to use manual number of workers overwrite, assuming you have 16 cpus per node you can use roughly 13 cpus per node for actors. Assuming actor CPU is 2, its 6 actors per node. With for example 20 nodes you can use 120 actors. Anything above that is questionable.

To get a full picture, I would suggest to remove an exit handler (Revital can help you with this) and then look at the cluster usage

cmadam added the bug Something isn't working label Nov 16, 2024

blublinsky mentioned this issue Nov 19, 2024

update to actor list limit to fix issue 803 #814

Open

cmadam mentioned this issue Nov 28, 2024

Simplified fix for issue 803 #839

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Cannot run KFP pipeline for fuzzy dedup with more than 100 actors #803

[Bug] Cannot run KFP pipeline for fuzzy dedup with more than 100 actors #803

cmadam commented Nov 16, 2024

blublinsky commented Nov 18, 2024 •

edited

Loading

[Bug] Cannot run KFP pipeline for fuzzy dedup with more than 100 actors #803

[Bug] Cannot run KFP pipeline for fuzzy dedup with more than 100 actors #803

Comments

cmadam commented Nov 16, 2024

Search before asking

Component

What happened + What you expected to happen

Reproduction script

Anything else

OS

Python

Are you willing to submit a PR?

blublinsky commented Nov 18, 2024 • edited Loading

blublinsky commented Nov 18, 2024 •

edited

Loading