
Fix TMCS starts too many processes and dies #329

Merged
merged 47 commits into develop from 292-tmcs-starts-too-many-processes-and-dies
Apr 8, 2023

Conversation

Collaborator

@AnesBenmerzoug AnesBenmerzoug commented Mar 18, 2023

Description

This PR closes #292

It does so by using an abstraction based on concurrent.futures instead of actors.

I first tried to use Ray queues to avoid passing the coordinator to the workers, but they also rely on an actor and it kept dying.
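
For illustration, a minimal sketch of the pattern (not the actual pyDVL code; coordinator_loop and its arguments are made-up names): the coordinator keeps a bounded window of futures in flight and consumes results as they complete, instead of exchanging messages with long-lived actors.

```python
from concurrent.futures import FIRST_COMPLETED, Executor, ThreadPoolExecutor, wait
from typing import Callable, List


def coordinator_loop(
    executor: Executor,
    task: Callable[[], float],
    n_jobs: int,
    n_iterations: int,
) -> List[float]:
    """Keep at most n_jobs futures pending and collect results as they finish."""
    results: List[float] = []
    pending: set = set()
    submitted = 0
    while len(results) < n_iterations:
        # Top up the window of in-flight tasks.
        while len(pending) < n_jobs and submitted < n_iterations:
            pending.add(executor.submit(task))
            submitted += 1
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        results.extend(future.result() for future in done)
    return results


if __name__ == "__main__":
    # Works with any Executor, e.g. the RayExecutor added in this PR
    # or a plain ThreadPoolExecutor.
    with ThreadPoolExecutor(max_workers=2) as ex:
        print(sum(coordinator_loop(ex, lambda: 1.0, n_jobs=2, n_iterations=10)))
```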

Changes

  • Added a RayExecutor class based on concurrent.futures to the parallel package.
  • Used the new concurrent.futures executor abstraction in TMCS.
  • Removed the abstract and Shapley actor modules.
  • Moved TMCS tests to a separate module.
  • Fixed the check for the number of subsets in the Data Utility Learning class.
  • Updated Data Utility Learning notebook.
  • Updated Shapley Basic Spotify notebook.

EDIT More changes:

  • Removed n_concurrent_computations and used n_jobs instead to mean the number of tasks to submit before waiting for results.
  • Replaced n_local_workers in the ParallelConfig class with n_workers and used it to set max_workers in the given Executor.
  • Added a __post_init__ method to ParallelConfig to make sure that n_workers is None when address is set (see the sketch after this list).
  • Added tests specifically for the executor.
  • For the 'sequential' parallel backend, used a ThreadPoolExecutor with max_workers=1.
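
A rough sketch of what these changes amount to (field names follow the list above; the defaults, types and the init_executor helper are assumptions for illustration, not pyDVL's actual API):

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ParallelConfig:
    backend: str = "ray"
    address: Optional[str] = None
    n_workers: Optional[int] = None  # used as max_workers of the Executor

    def __post_init__(self):
        # Connecting to an existing cluster and sizing a local one are
        # mutually exclusive.
        if self.address is not None and self.n_workers is not None:
            raise ValueError("n_workers cannot be set when address is provided")


def init_executor(config: ParallelConfig) -> Executor:
    if config.backend == "sequential":
        # The 'sequential' backend degrades to a single-threaded executor.
        return ThreadPoolExecutor(max_workers=1)
    raise NotImplementedError(f"Backend {config.backend!r} not shown in this sketch")
```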

EDIT 2 More changes:

  • Added an n_cpus_per_job field to ParallelConfig.
  • Added a cancel_futures_on_exit boolean parameter to RayExecutor.

EDIT 3 More changes:

  • Renamed n_workers in ParallelConfig to n_cpus_local to align more closely with its meaning in Ray.
  • Removed n_cpus_per_job from ParallelConfig and passed it instead as an option to the executor's submit method as part of the kwargs parameter; otherwise mypy complains that the method does not have the same signature as the one defined in the base Executor class.
  • Used max_workers in RayExecutor as the maximum number of submitted jobs, taking its value from n_jobs instead of n_workers (which was renamed to n_cpus_local).
  • Added a new variable inside TMCS with a value of 2 * effective_n_jobs to represent the total number of submitted jobs, including the jobs that are running (see the sketch after this list).
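
For reference, a condensed sketch of the resulting shape (simplified and not the actual implementation; it relies on Ray's ObjectRef.future() and omits shutdown and cancellation handling):

```python
from concurrent.futures import Executor, Future
from typing import Any, Callable

import ray


class RayExecutor(Executor):
    """Reduced sketch of the executor described above."""

    def __init__(self, max_workers: int, *, cancel_futures_on_exit: bool = True):
        self._max_workers = max_workers  # maximum number of submitted jobs
        self._cancel_futures_on_exit = cancel_futures_on_exit

    def submit(self, fn: Callable, *args: Any, **kwargs: Any) -> Future:
        # Per-task resources travel inside kwargs so that the signature stays
        # identical to concurrent.futures.Executor.submit (keeps mypy happy).
        n_cpus_per_job = kwargs.pop("n_cpus_per_job", 1)
        remote_fn = ray.remote(num_cpus=n_cpus_per_job)(fn)
        return remote_fn.remote(*args, **kwargs).future()


# Inside TMCS, the submission window is then bounded as:
# total_submitted_jobs = 2 * effective_n_jobs  # running + queued
```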

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, ensured that added boilerplate cells are tagged with "nbsphinx": "hidden"

@AnesBenmerzoug AnesBenmerzoug added this to the v0.7.0 milestone Mar 18, 2023
@AnesBenmerzoug AnesBenmerzoug self-assigned this Mar 18, 2023
It is almost the same as the one from the base Executor class,
but it escapes the start characters because Sphinx complains
about a starting emphasis character without a matching ending character.
@AnesBenmerzoug AnesBenmerzoug marked this pull request as ready for review March 19, 2023 08:22
Collaborator

@mdbenito mdbenito left a comment

Besides the cancelling of tasks, this PR has highlighted our inconsistent (and possibly bogus) use of n_jobs everywhere. I think we need to fix it.

More generally, I think that we are not really using ray as it's supposed to be used. For one, ParallelConfig.n_local_workers is used for num_cpus in ray.init(), which does not have the effect we document: instead, it's the number of CPUs for a "raylet" (which I guess is the number of CPUs available to each node). That's fine when starting a local cluster, but probably not when using an existing one.

What do you think about the idea of setting max_workers in the parallel config (making it a no-op for ray, or maybe a check against the number of nodes available in a running cluster), then using n_jobs as the number of tasks, and setting num_cpus to 1 in the call to ray.remote()?
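
A possible shape for that proposal (purely illustrative; run_permutation and the numbers are placeholders, not pyDVL code):

```python
import ray

ray.init()  # cluster size is whatever the (local or remote) cluster provides

n_jobs = 4  # number of tasks we decide to submit, independent of cluster size


@ray.remote(num_cpus=1)  # each task reserves exactly one logical CPU
def run_permutation(seed: int) -> int:
    return seed  # placeholder for the actual work


results = ray.get([run_permutation.remote(i) for i in range(n_jobs)])
```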

Code review comments on: src/pydvl/value/shapley/truncated.py, src/pydvl/utils/parallel/futures.py
@AnesBenmerzoug
Collaborator Author

@mdbenito let's discuss this during the next meeting.

@AnesBenmerzoug AnesBenmerzoug modified the milestones: v0.7.0, v0.6.1 Apr 3, 2023
Collaborator

@mdbenito mdbenito left a comment

I still think that there are some inconsistencies wrt. max_workers. The number of CPUs available in the cluster is an external factor over which the code has no control. So we must ignore that, in particular in ray.init(), where num_cpus does not refer to the number of CPUs used for a local cluster.

max_workers could then be used as either:

  1. the maximum number of vCPUs to be used by the executor (num_jobs * num_cpus_per_job), or
  2. the maximum number of tasks to be run by the executor, so that effective_cpus_used = max_workers * n_cpus_per_job (see the numeric sketch after this comment).

We need to fix the names once and for all:

  • task = job
  • worker = single-core process = CPU

I find the second one horrible, but that seems to be Ray's convention, right? We don't have to follow it, though: in the ParallelConfig and elsewhere we could use max_cpus instead of max_workers. The question is then what to do when we allow for additional resources like GPUs.
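
To make the two readings concrete (arbitrary numbers):

```python
n_jobs = 4
n_cpus_per_job = 2

# Reading 1: max_workers caps the vCPUs used by the executor.
max_workers_reading_1 = n_jobs * n_cpus_per_job               # 8 vCPUs

# Reading 2: max_workers caps the number of tasks run by the executor,
# and the CPU usage is derived from it.
max_workers_reading_2 = n_jobs                                # 4 tasks
effective_cpus_used = max_workers_reading_2 * n_cpus_per_job  # 8 vCPUs
```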

Code review comments on: src/pydvl/utils/config.py, src/pydvl/utils/parallel/futures/__init__.py, src/pydvl/utils/parallel/futures/ray.py, src/pydvl/value/shapley/truncated.py, tests/utils/conftest.py, src/pydvl/utils/parallel/backend.py
AnesBenmerzoug and others added 2 commits April 5, 2023 14:20
Co-authored-by: Miguel de Benito Delgado <[email protected]>
Co-authored-by: Miguel de Benito Delgado <[email protected]>
@AnesBenmerzoug
Collaborator Author

AnesBenmerzoug commented Apr 5, 2023

@mdbenito I read the Ray documentation and architecture description more thoroughly, and here's what I found:

  • According to this section of their documentation:

    • Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage.
    • Ray doesn’t provide CPU isolation for tasks or actors.
  • According to this other section of their documentation:

    • By default, Ray nodes start with pre-defined CPU, GPU, and memory resources. The quantities of these resources on each node are set to the physical quantities auto-detected by Ray. By default, logical resources are configured by the following rule:
      • Number of logical CPUs (num_cpus): Set to the number of CPUs of the machine/container.
    • Using ray.init() to start a single-node Ray cluster and setting num_cpus will start a Ray node with num_cpus logical CPUs, i.e. num_cpus worker processes.
  • According to yet another section of their documentation:

    • Ray allows specifying a task or actor’s resource requirements (e.g., CPU, GPU, and custom resources). The task or actor will only run on a node if there are enough required resources available to execute the task or actor.

I finally understand this better. Thanks for the link to that document.
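
The logical-resource bookkeeping quoted above can be checked locally with a few lines (assuming a local Ray installation; the numbers are arbitrary):

```python
import ray

# Start a single-node cluster with 4 logical CPUs; Ray will start roughly
# that many worker processes, regardless of the machine's physical cores.
ray.init(num_cpus=4)

print(ray.cluster_resources())    # e.g. {'CPU': 4.0, 'memory': ..., ...}
print(ray.available_resources())  # logical accounting, not physical isolation
```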

Here's what I suggest:

  • Change n_workers in ParallelConfig to n_cpus_local to align more closely with its meaning in Ray.
  • Remove n_cpus_per_job from ParallelConfig and pass it instead as an option to the executor's submit method. I still think this needs more thought, though.
  • Use max_workers in RayExecutor as the maximum number of submitted jobs, taking its value from n_jobs instead of n_workers.
  • Add another argument called queue_size or something similar to TMCS to represent the number of tasks that will be submitted at each iteration. It can default to 2 * effective_n_jobs.

What do you think?

…s kwargs

This is done because mypy complains if we don't have the same signature as the base Executor class
@AnesBenmerzoug AnesBenmerzoug merged commit 1a31aba into develop Apr 8, 2023
@mdbenito mdbenito deleted the 292-tmcs-starts-too-many-processes-and-dies branch May 16, 2023 08:56