
Ray scheduler driver and job api #329

Closed
wants to merge 1 commit into from
Conversation

@msaroufim (Member) commented Nov 2, 2021

To run the distributed PyTorch test

cd torch/torchx/schedulers/test
python -m unittest ray_scheduler_test.py
2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
status: PENDING
status: PENDING
status: RUNNING
status: RUNNING
status: RUNNING
status: RUNNING
status: SUCCEEDED
2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
(CommandActor pid=3575) initializing `gloo` process group
(CommandActor pid=3574) initializing `gloo` process group
(CommandActor pid=3575) successfully initialized process group
(CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
(CommandActor pid=3574) successfully initialized process group
(CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2

To run the distributed PyTorch test using the torchX CLI

# Setup cluster and get a HEAD NODE IP
ray up -y ray_cluster.yaml
pip install torchx[dev]

# Get a job ID from deployed job
torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=aivanou_test utils.binary --entrypoint ray_simple.py

# Use job ID to get logs or job status
torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW

@facebook-github-bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Nov 2, 2021
@cbalioglu (Contributor) left a comment

I left some comments.

Besides those comments, one major thing that is missing right now is tests. I am not entirely sure to what extent we can test all the logic contained in this PR, but we should give it a try. I suggest syncing up with Aliaksandr and Tristan to learn how they test the Slurm, Kubernetes, and Docker schedulers. We should take a similar approach here.


for actor_dict in actors_dict:

bundle = {"CPU": actor_dict["num_cpus"], "GPU": actor_dict["num_gpus"]}
Contributor:

I would advise getting @amogkam's feedback on the rest of ray_driver.py. We should pay particular attention to the edge cases. How do we handle failures? For instance, what happens if an actor or a process group initialization raises an error? How do we communicate the root cause back to the user?

@msaroufim (Member, Author) commented Nov 3, 2021

Chatted briefly with @d4l3k on what an integration test could look like.

What we can do is set up a single-node Ray cluster on the machine where the test is running; that would instantiate a single Ray node with multiple workers, run the test, and tear down the cluster.

For multi-node we can leverage something like ray up and ray down to set up the infrastructure, but that will be a bit more work.

The last option is to manage our own Ray cluster for tests, which would be a hassle.

@cbalioglu (Contributor) commented:
@amogkam is it possible to simulate a Ray cluster on localhost?

@d4l3k requested a review from cbalioglu on November 3, 2021 21:22
@d4l3k (Member) commented Nov 3, 2021

@cbalioglu sorry for that request, accidentally clicked


# On driver.py
logging.debug("Reading actor.json")
actors_dict = load_actor_json('actor.json')
Contributor:

cc @jiaodong to confirm, but Ray Jobs should automatically set a different working dir for each job that is submitted to the cluster. So even if each job saves to a ray-actors.json, the absolute path on the node for this file for each job should be different.
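
For illustration, a hedged sketch of what this looks like through the job submission SDK (the import path has moved between Ray versions and the entrypoint and address below are made up): each submission uploads its own working_dir, so a relative path like actor.json resolves inside a job-specific directory on the cluster.

```
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python ray_driver.py",   # hypothetical entrypoint
    runtime_env={"working_dir": "."},    # uploaded and unpacked per job
)
print(client.get_job_status(job_id))
```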

while len(unfinished) > 0:
finished, unfinished = ray.wait(unfinished)
# If a failure occurs the ObjectRef will be marked as finished.
# Calling ray.get will expose the failure as a RayActorError.
Contributor:

Suggested change
# Calling ray.get will expose the failure as a RayActorError.
# Calling ray.get will then expose the exception. The same exception that is raised in the actor will be raised on the driver. If the actor is terminated or pre-empted, then a `RayActorError` is raised.
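
To make the suggested comment concrete, here is a minimal, hypothetical sketch of how a failure inside an actor surfaces on the driver through the wait/get loop above (not the PR's code):

```
import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:  # hypothetical stand-in for CommandActor
    def run(self):
        raise RuntimeError("process group initialization failed")

ray.init()
unfinished = [Worker.remote().run.remote() for _ in range(2)]
while unfinished:
    finished, unfinished = ray.wait(unfinished)
    for ref in finished:
        try:
            ray.get(ref)  # re-raises the exception raised inside the actor
        except RayActorError:
            # raised instead if the actor was terminated or pre-empted
            raise
```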

@amogkam (Contributor) commented Nov 3, 2021

Hey @cbalioglu @msaroufim, for testing a Ray cluster we have a fake cluster utility that you can run on a single machine: https://docs.ray.io/en/master/fake-autoscaler.html.

You can start a fake multi-node cluster programmatically: https://docs.ray.io/en/master/fake-autoscaler.html#using-ray-cluster-utils-autoscalingcluster

@jiaodong is it possible to submit jobs to the fake cluster?
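
For reference, a minimal sketch of the programmatic fake multi-node cluster from the linked docs (resource numbers are illustrative and the exact constructor arguments may differ between Ray versions):

```
import ray
from ray.cluster_utils import AutoscalingCluster

cluster = AutoscalingCluster(
    head_resources={"CPU": 2},
    worker_node_types={
        "cpu_node": {
            "resources": {"CPU": 4},
            "node_config": {},
            "min_workers": 0,
            "max_workers": 2,
        },
    },
)
cluster.start()
ray.init("auto")
# ... run the scheduler tests against this fake cluster ...
ray.shutdown()
cluster.shutdown()
```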

@jiaodong commented Nov 3, 2021

For testing job submission, as long as we have a running Ray cluster with the dashboard module running on the head node, we can use it for testing. Currently what we do most is start a single-node Ray cluster and submit jobs to the head node.

There's a WIP PR that adds a bunch of things to job submission with tests passing: https://github.com/ray-project/ray/pull/19860/files. You should be able to find some test examples there for {python api, sdk, http endpoint} by looking at python/ray/tests/test_job_manager.py and dashboard/modules/job/tests/test_http_job_server.py.

@msaroufim requested a review from d4l3k on November 4, 2021 04:52
@msaroufim (Member, Author) commented Nov 16, 2021

Adding final todos here:

  • Add type annotations to make pyre happy @msaroufim @cbalioglu - Almost done
  • Figure out why ray_driver.py is not found by the Ray job scheduler; run the test below to see this issue @jiaodong - FIXED
  • See a log message with PyTorch logs in python -m unittest ray_scheduler_test.py - FIXED
  • Cleanup and error messages @cbalioglu - FIXED
  • How to download Ray nightly in dev requirements while keeping internal/external CI happy @d4l3k - FIXED
  • Documentation @msaroufim - maybe it makes sense to do this in a third PR - FIXED
  • Looks like torchX had some new changes I ignored that broke my scheduler, so I will update - FIXED
  • Rebase to collapse all commits into one @msaroufim - Will squash and merge when done

@msaroufim (Member, Author) commented Nov 23, 2021

Feedback from Alex, 11/23

P0

  1. Create a Ray YAML - DONE locally
  2. Run a DDP job - DONE
  3. Rename cluster address to Ray Dashboard address - DONE

P2

  1. Remove the ray start and ray stop commands and replace them with ETCD test fixtures. Set up a RayHeadServer class in the test class to make things more reusable - TBD
  2. Combine the Ray cluster address and the Ray YAML config into a single option - either a host:port or a Ray YAML file. Parse the value as a host/port pair; if that fails, treat it as a file (see the sketch after this list) - DONE
  3. copy_scripts and copy_script_dirs feel redundant - use fsspec instead to work with arbitrary file systems; copy2 fails with symlink and same-file errors - FIXED
  4. Take runtime_env with pip and create a class for Image in torchX (https://docs.ray.io/en/master/ray-job-submission/overview.html#example-setup) and expose it to users via YAML, or ask when requirements.txt will be supported in Ray - for now requirements.txt is supported
  5. Clarify which ports users need to know about - do they only need the dashboard port? Make the dashboard port configurable, do not hardcode it - DONE
  6. Test for the existence of ray_driver and ray_common - DONE, shows up correctly in the logs
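
A minimal sketch of the host/port-vs-file parsing heuristic described in item 2 (the function name and fallback behaviour here are assumptions, not the PR's actual implementation):

```
import os
from typing import Tuple, Union

def parse_cluster_target(value: str) -> Union[Tuple[str, int], str]:
    # Try to interpret the value as host:port first.
    host, sep, port = value.rpartition(":")
    if sep and host and port.isdigit():
        return host, int(port)
    # Otherwise fall back to treating it as a path to a Ray cluster YAML file.
    if os.path.isfile(value):
        return value
    raise ValueError(f"{value!r} is neither host:port nor an existing cluster YAML file")
```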

@d4l3k (Member) commented Dec 6, 2021

@msaroufim can you rebase this diff so the integ tests run?

@msaroufim (Member, Author) commented Dec 8, 2021

Component test

Set up the cluster

ray up cluster.yaml


# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 1

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

Use torchX

Smoke test

torchx run -s ray utils.echo

(ray) ubuntu@ip-172-31-2-209:~/torchx$ torchx run -s ray utils.echo 
torchx 2021-12-08 00:57:45 INFO     Uploading package gcs://_ray_pkg_3db5a02fb49faddf.zip.
torchx 2021-12-08 00:57:45 INFO     Creating a file package for local directory '/tmp/tmpac744j35'.
ray://torchx/raysubmit_f6fHJJsgJ8V8eEFJ
torchx 2021-12-08 00:57:45 INFO     Launched app: ray://torchx/raysubmit_f6fHJJsgJ8V8eEFJ
Status is PENDING
torchx 2021-12-08 00:57:45 INFO     AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2021-12-08 00:57:45 INFO     Job URL: None

Train.py test

torchx run -s ray -cfg cluster_config_file=cluster.yaml,copy_scripts=True dist.ddp --script torchx/schedulers/test/train.py

Usage instructions

torchx runopts

ray:
    usage:
        [cluster_config_file=CLUSTER_CONFIG_FILE],[cluster_name=CLUSTER_NAME],[dashboard_address=DASHBOARD_ADDRESS],[copy_scripts=COPY_SCRIPTS],[copy_script_dirs=COPY_SCRIPT_DIRS]

    optional arguments:
        cluster_config_file=CLUSTER_CONFIG_FILE (str, None)
            Use CLUSTER_CONFIG_FILE to access or create the Ray cluster.
        cluster_name=CLUSTER_NAME (str, None)
            Override the configured cluster name.
        dashboard_address=DASHBOARD_ADDRESS (str, 127.0.0.1)
            Use ray status to get the dashboard_address
        copy_scripts=COPY_SCRIPTS (bool, False)
            Copy the Python script(s) to the cluster.
        copy_script_dirs=COPY_SCRIPT_DIRS (bool, False)
            Copy the directories containing the Python scripts to the cluster.

@codecov bot commented Dec 8, 2021

Codecov Report

Merging #329 (807420c) into main (ad4430b) will increase coverage by 0.09%.
The diff coverage is 95.16%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #329      +/-   ##
==========================================
+ Coverage   94.73%   94.83%   +0.09%     
==========================================
  Files          61       63       +2     
  Lines        3229     3347     +118     
==========================================
+ Hits         3059     3174     +115     
- Misses        170      173       +3     
Impacted Files Coverage Δ
torchx/components/utils.py 60.00% <50.00%> (-0.87%) ⬇️
torchx/schedulers/ray/ray_driver.py 92.72% <92.72%> (ø)
torchx/schedulers/ray_scheduler.py 94.92% <96.12%> (+4.30%) ⬆️
torchx/cli/cmd_log.py 95.23% <100.00%> (ø)
torchx/schedulers/__init__.py 96.55% <100.00%> (+1.55%) ⬆️
torchx/schedulers/ray/ray_common.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad4430b...807420c.

@msaroufim requested review from aivanou and cbalioglu on January 6, 2022 23:27
@msaroufim requested a review from aivanou on January 8, 2022 06:41
def load_actor_json(filename : str) -> List[Dict]:
with open(filename) as f:
actor = json.load(f)
actor = json.loads(actor)
Member Author:

def load_actor_json(filename: str) -> List[Dict]:
    with open(filename) as f:
        actor = json.load(f)       # the file stores a JSON-encoded string, so this yields a str
        actor = json.loads(actor)  # parse that string into the actual list of dicts
    return actor
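
For context, a hedged sketch of a writer side that would make this double decode necessary (the PR's actual dump code is not shown in this hunk; the file contents below are illustrative):

```
import json

actors = [{"command": "python train.py", "num_cpus": 2, "num_gpus": 0}]

# Serializing the list to a string first and then dumping that string as a JSON
# document is what forces the reader to call json.load followed by json.loads.
with open("actor.json", "w") as f:
    json.dump(json.dumps(actors), f)
```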

actors: List[RayActor], pgs: List[PlacementGroup]
) -> List[CommandActor]:
command_actors: List[CommandActor] = []
address, port = get_address_and_port()
Contributor:

That's right, there's no guarantee that the rank 0 worker is placed on the same node that ray_driver is run on. I think you'd want to move this to inside the for loop and call it on the rank 0 actor for each group of actors, which would require some changes to the current for loop.

So add a new method to CommandActor:

def get_address_and_port(self):
    return get_address_and_port()

And then inside the for loop the sequencing should be:

  • create all the actors
  • get the master address and port from the rank 0 worker
  • then pass in the master address and port to all the actors, which means that these values cannot be set during the actor init

Happy to chat more about this as well, or provide some code snippets
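
Roughly, that sequencing could look like the sketch below (a hedged illustration rather than the PR's final code: CommandActor is stripped down here, and the num_replicas field and the placement_group= option name, which matches Ray at the time of this PR, are assumptions):

```
import socket
import ray

def get_address_and_port():
    # IP of the node this runs on plus an OS-assigned free port
    addr = ray.util.get_node_ip_address()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        port = s.getsockname()[1]
    return addr, port

@ray.remote
class CommandActor:
    def get_address_and_port(self):
        return get_address_and_port()  # executed on the actor's own node

    def set_master(self, addr, port):
        # master address/port arrive after creation instead of in __init__
        self.master_addr, self.master_port = addr, port

def create_command_actors(actors, pgs):
    command_actors = []
    for actor, pg in zip(actors, pgs):
        # 1) create all replicas for this role first
        group = [
            CommandActor.options(placement_group=pg).remote()
            for _ in range(actor.num_replicas)
        ]
        # 2) ask the rank 0 actor for its node's address and a free port
        master_addr, master_port = ray.get(group[0].get_address_and_port.remote())
        # 3) hand those values to every actor in the group
        ray.get([a.set_master.remote(master_addr, master_port) for a in group])
        command_actors.extend(group)
    return command_actors
```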

)

if app.metadata:
_logger.warning("The Ray scheduler does not use metadata information.")
def log_iter(
Contributor:

As @jiaodong mentioned, log streaming support was added to jobs in Ray 1.10: ray-project/ray#20976.

After upgrading to 1.10, that functionality can probably be leveraged directly.
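
A hedged sketch of what leveraging that could look like through the job submission SDK (the import path and method names have moved between Ray versions, so treat these as assumptions to verify against the installed release):

```
import asyncio
from ray.job_submission import JobSubmissionClient

async def tail_logs(dashboard_address: str, job_id: str) -> None:
    client = JobSubmissionClient(f"http://{dashboard_address}")
    print(client.get_job_logs(job_id))                 # everything logged so far
    async for lines in client.tail_job_logs(job_id):   # streaming, per ray-project/ray#20976
        print(lines, end="")

# asyncio.run(tail_logs("127.0.0.1:8265", "raysubmit_..."))
```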

@cbalioglu (Contributor) left a comment

LGTM

@facebook-github-bot (Contributor):

@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D33689705

@msaroufim force-pushed the raydriver branch 2 times, most recently from 54f141f to 05e1912, on January 28, 2022 19:13
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D33689705

Summary:
## To run the distributed PyTorch test

```
cd torch/torchx/schedulers/test
python -m unittest ray_scheduler_test.py
```

```
2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
status: PENDING
status: PENDING
status: RUNNING
status: RUNNING
status: RUNNING
status: RUNNING
status: SUCCEEDED
2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
(CommandActor pid=3575) initializing `gloo` process group
(CommandActor pid=3574) initializing `gloo` process group
(CommandActor pid=3575) successfully initialized process group
(CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
(CommandActor pid=3574) successfully initialized process group
(CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2
```

## To run the distributed PyTorch test using the torchX CLI

```
# Setup cluster and get a HEAD NODE IP
ray up -y ray_cluster.yaml
pip install torchx[dev]

# Get a job ID from deployed job
torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=test_dir utils.binary --entrypoint ray_simple.py

# Use job ID to get logs or job status
torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
```

## Or
```
ray up -y ray_cluster.yaml
pip install torchx[dev]
torchx run -s ray -cfg cluster_config_file=ray_cluster.yaml,working_dir=test_dir utils.binary --entrypoint ray_simple.py
```

## ray_simple.py

The unit test `ray_scheduler_test.py` is more interesting.

```
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

def main() -> None:
    print("hello")
    return

if __name__ == "__main__":
    main()
```

Pull Request resolved: #329

Reviewed By: aivanou

Differential Revision: D33689705

Pulled By: msaroufim

fbshipit-source-id: b0fb5b95cf148ab3047a7df43e196b05d443a6d3
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D33689705
