
Ray scheduler driver and job api #329

Closed
wants to merge 1 commit into from
Conversation

@msaroufim (Member) commented Nov 2, 2021

To run the distributed PyTorch test

cd torch/torchx/schedulers/test
python -m unittest ray_scheduler_test.py
2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
status: PENDING
status: PENDING
status: RUNNING
status: RUNNING
status: RUNNING
status: RUNNING
status: SUCCEEDED
2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
(CommandActor pid=3575) initializing `gloo` process group
(CommandActor pid=3574) initializing `gloo` process group
(CommandActor pid=3575) successfully initialized process group
(CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
(CommandActor pid=3574) successfully initialized process group
(CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2

To run the distributed PyTorch test using the torchX CLI

# Setup cluster and get a HEAD NODE IP
ray up -y ray_cluster.yaml
pip install torchx[dev]

# Get a job ID from deployed job
torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=aivanou_test utils.binary --entrypoint ray_simple.py

# Use job ID to get logs or job status
torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW

@facebook-github-bot added the "CLA Signed" label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Nov 2, 2021
@cbalioglu (Contributor) left a comment

I left some comments.

Besides those comments, one major thing that is missing right now is tests. I am not entirely sure to what extent we can test all the logic contained in this PR, but we should give it a try. I suggest syncing up with Aliaksandr and Tristan to learn how they test the Slurm, Kubernetes, and Docker schedulers. We should take a similar approach here.


for actor_dict in actors_dict:

bundle = {"CPU": actor_dict["num_cpus"], "GPU": actor_dict["num_gpus"]}
Contributor:

I would advise getting @amogkam's feedback on the rest of ray_driver.py. We should pay particular attention to the edge cases. How do we handle failures? For instance, what happens if an actor or a process group initialization raises an error? How do we communicate the root cause back to the user?

@msaroufim (Member, Author) commented Nov 3, 2021

Chatted briefly with @d4l3k on what an integration test could look like.

What we can do is set up a single-node Ray cluster on the machine where the test is running; that would instantiate a single Ray node with multiple workers, run the test, and tear down the cluster.

For multi-node we can leverage something like ray up and ray down to set up the infrastructure, but that will be a bit more work.

The last option is to manage our own Ray cluster for tests, which would be a hassle.

@cbalioglu (Contributor) commented:
@amogkam is it possible to simulate a Ray cluster on localhost?

@d4l3k requested a review from cbalioglu on November 3, 2021 21:22
@d4l3k (Member) commented Nov 3, 2021

@cbalioglu sorry for that request, accidentally clicked


# On driver.py
logging.debug("Reading actor.json")
actors_dict = load_actor_json('actor.json')
Contributor:

cc @jiaodong to confirm, but Ray Jobs should automatically set a different working dir for each job that is submitted to the cluster. So even if each job saves to a ray-actors.json, the absolute path on the node for this file for each job should be different.
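
For illustration, a hedged sketch of what this looks like through the job submission SDK (the import path has moved between Ray versions and the entrypoint and address below are made up): each submission uploads its own working_dir, so a relative path like actor.json resolves inside a job-specific directory on the cluster.

```
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python ray_driver.py",   # hypothetical entrypoint
    runtime_env={"working_dir": "."},    # uploaded and unpacked per job
)
print(client.get_job_status(job_id))
```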

while len(unfinished) > 0:
finished, unfinished = ray.wait(unfinished)
# If a failure occurs the ObjectRef will be marked as finished.
# Calling ray.get will expose the failure as a RayActorError.
Contributor:

Suggested change
# Calling ray.get will expose the failure as a RayActorError.
# Calling ray.get will then expose the exception. The same exception that is raised in the actor will be raised on the driver. If the actor is terminated or pre-empted, then a `RayActorError` is raised.
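
To make the suggested comment concrete, here is a minimal, hypothetical sketch of how a failure inside an actor surfaces on the driver through the wait/get loop above (not the PR's code):

```
import ray
from ray.exceptions import RayActorError

@ray.remote
class Worker:  # hypothetical stand-in for CommandActor
    def run(self):
        raise RuntimeError("process group initialization failed")

ray.init()
unfinished = [Worker.remote().run.remote() for _ in range(2)]
while unfinished:
    finished, unfinished = ray.wait(unfinished)
    for ref in finished:
        try:
            ray.get(ref)  # re-raises the exception raised inside the actor
        except RayActorError:
            # raised instead if the actor was terminated or pre-empted
            raise
```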

@amogkam (Contributor) commented Nov 3, 2021

Hey @cbalioglu @msaroufim, for testing a Ray cluster we have a fake cluster utility that you can run on a single machine: https://docs.ray.io/en/master/fake-autoscaler.html.

You can start a fake multi-node cluster programmatically: https://docs.ray.io/en/master/fake-autoscaler.html#using-ray-cluster-utils-autoscalingcluster

@jiaodong is it possible to submit jobs to the fake cluster?
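
For reference, a minimal sketch of the programmatic fake multi-node cluster from the linked docs (resource numbers are illustrative and the exact constructor arguments may differ between Ray versions):

```
import ray
from ray.cluster_utils import AutoscalingCluster

cluster = AutoscalingCluster(
    head_resources={"CPU": 2},
    worker_node_types={
        "cpu_node": {
            "resources": {"CPU": 4},
            "node_config": {},
            "min_workers": 0,
            "max_workers": 2,
        },
    },
)
cluster.start()
ray.init("auto")
# ... run the scheduler tests against this fake cluster ...
ray.shutdown()
cluster.shutdown()
```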

@jiaodong commented Nov 3, 2021

For testing job submission, as long as we have a running Ray cluster with the dashboard module running on the head node, we can use it for testing. Currently what we do most is start a single-node Ray cluster and submit jobs to the head node.

There's a WIP PR that adds a bunch of things to job submission with tests passing: https://github.com/ray-project/ray/pull/19860/files. You should be able to find some test examples there for {python api, sdk, http endpoint} by looking at python/ray/tests/test_job_manager.py and dashboard/modules/job/tests/test_http_job_server.py.

@msaroufim requested a review from d4l3k on November 4, 2021 04:52
@msaroufim (Member, Author) commented Nov 16, 2021

Adding final todos here:

  • Add type annotations to make pyre happy @msaroufim @cbalioglu - Almost done
  • Figure out why ray_driver.py is not found by the Ray job scheduler; run the test below to see this issue @jiaodong - FIXED
  • See a log message with PyTorch logs in python -m unittest ray_scheduler_test.py - FIXED
  • Cleanup and error messages @cbalioglu - FIXED
  • How to download Ray nightly in dev requirements while keeping internal/external CI happy @d4l3k - FIXED
  • Documentation @msaroufim - maybe it makes sense to do this in a third PR - FIXED
  • Looks like torchX had some new changes I ignored that broke my scheduler, so I will update - FIXED
  • Rebase to collapse all commits into one @msaroufim - Will squash and merge when done

@msaroufim (Member, Author) commented Nov 23, 2021

Feedback from Alex, 11/23

P0

  1. Create a Ray YAML - DONE locally
  2. Run a DDP job - DONE
  3. Rename cluster address to Ray Dashboard address - DONE

P2

  1. Remove the ray start and ray stop commands and replace them with ETCD test fixtures. Set up a RayHeadServer class in the test class to make things more reusable - TBD
  2. Combine the Ray cluster address and the Ray YAML config into a single option - either a host:port or a Ray YAML file. Parse the value as a host/port pair; if that fails, treat it as a file (see the sketch after this list) - DONE
  3. copy_scripts and copy_script_dirs feel redundant - use fsspec instead to work with arbitrary file systems; copy2 fails with symlink and same-file errors - FIXED
  4. Take runtime_env with pip and create a class for Image in torchX (https://docs.ray.io/en/master/ray-job-submission/overview.html#example-setup) and expose it to users via YAML, or ask when requirements.txt will be supported in Ray - for now requirements.txt is supported
  5. Clarify which ports users need to know about - do they only need the dashboard port? Make the dashboard port configurable, do not hardcode it - DONE
  6. Test for the existence of ray_driver and ray_common - DONE, shows up correctly in the logs
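
A minimal sketch of the host/port-vs-file parsing heuristic described in item 2 (the function name and fallback behaviour here are assumptions, not the PR's actual implementation):

```
import os
from typing import Tuple, Union

def parse_cluster_target(value: str) -> Union[Tuple[str, int], str]:
    # Try to interpret the value as host:port first.
    host, sep, port = value.rpartition(":")
    if sep and host and port.isdigit():
        return host, int(port)
    # Otherwise fall back to treating it as a path to a Ray cluster YAML file.
    if os.path.isfile(value):
        return value
    raise ValueError(f"{value!r} is neither host:port nor an existing cluster YAML file")
```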

@d4l3k (Member) commented Dec 6, 2021

@msaroufim can you rebase this diff so the integ tests run?

@msaroufim (Member, Author) commented Dec 8, 2021

Component test

Set up the cluster

ray up cluster.yaml


# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# The maximum number of worker nodes to launch in addition to the head
# node. min_workers defaults to 0.
max_workers: 1

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    availability_zone: us-west-2a

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

Use torchX

Smoke test

torchx run -s ray utils.echo

(ray) ubuntu@ip-172-31-2-209:~/torchx$ torchx run -s ray utils.echo 
torchx 2021-12-08 00:57:45 INFO     Uploading package gcs://_ray_pkg_3db5a02fb49faddf.zip.
torchx 2021-12-08 00:57:45 INFO     Creating a file package for local directory '/tmp/tmpac744j35'.
ray://torchx/raysubmit_f6fHJJsgJ8V8eEFJ
torchx 2021-12-08 00:57:45 INFO     Launched app: ray://torchx/raysubmit_f6fHJJsgJ8V8eEFJ
Status is PENDING
torchx 2021-12-08 00:57:45 INFO     AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles: []
  state: PENDING (2)
  structured_error_msg: <NONE>
  ui_url: null

torchx 2021-12-08 00:57:45 INFO     Job URL: None

Train.py test

torchx run -s ray -cfg cluster_config_file=cluster.yaml,copy_scripts=True dist.ddp --script torchx/schedulers/test/train.py

Usage instructions

torchx runopts

ray:
    usage:
        [cluster_config_file=CLUSTER_CONFIG_FILE],[cluster_name=CLUSTER_NAME],[dashboard_address=DASHBOARD_ADDRESS],[copy_scripts=COPY_SCRIPTS],[copy_script_dirs=COPY_SCRIPT_DIRS]

    optional arguments:
        cluster_config_file=CLUSTER_CONFIG_FILE (str, None)
            Use CLUSTER_CONFIG_FILE to access or create the Ray cluster.
        cluster_name=CLUSTER_NAME (str, None)
            Override the configured cluster name.
        dashboard_address=DASHBOARD_ADDRESS (str, 127.0.0.1)
            Use ray status to get the dashboard_address
        copy_scripts=COPY_SCRIPTS (bool, False)
            Copy the Python script(s) to the cluster.
        copy_script_dirs=COPY_SCRIPT_DIRS (bool, False)
            Copy the directories containing the Python scripts to the cluster.

@codecov bot commented Dec 8, 2021

Codecov Report

Merging #329 (807420c) into main (ad4430b) will increase coverage by 0.09%.
The diff coverage is 95.16%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #329      +/-   ##
==========================================
+ Coverage   94.73%   94.83%   +0.09%     
==========================================
  Files          61       63       +2     
  Lines        3229     3347     +118     
==========================================
+ Hits         3059     3174     +115     
- Misses        170      173       +3     
Impacted Files Coverage Δ
torchx/components/utils.py 60.00% <50.00%> (-0.87%) ⬇️
torchx/schedulers/ray/ray_driver.py 92.72% <92.72%> (ø)
torchx/schedulers/ray_scheduler.py 94.92% <96.12%> (+4.30%) ⬆️
torchx/cli/cmd_log.py 95.23% <100.00%> (ø)
torchx/schedulers/__init__.py 96.55% <100.00%> (+1.55%) ⬆️
torchx/schedulers/ray/ray_common.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ad4430b...807420c.

@msaroufim requested review from aivanou and cbalioglu on January 6, 2022 23:27
@msaroufim requested a review from aivanou on January 8, 2022 06:41
def load_actor_json(filename : str) -> List[Dict]:
with open(filename) as f:
actor = json.load(f)
actor = json.loads(actor)
Member Author:

def load_actor_json(filename: str) -> List[Dict]:
    with open(filename) as f:
        actor = json.load(f)       # the file stores a JSON-encoded string, so this yields a str
        actor = json.loads(actor)  # parse that string into the actual list of dicts
    return actor
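
For context, a hedged sketch of a writer side that would make this double decode necessary (the PR's actual dump code is not shown in this hunk; the file contents below are illustrative):

```
import json

actors = [{"command": "python train.py", "num_cpus": 2, "num_gpus": 0}]

# Serializing the list to a string first and then dumping that string as a JSON
# document is what forces the reader to call json.load followed by json.loads.
with open("actor.json", "w") as f:
    json.dump(json.dumps(actors), f)
```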

actors: List[RayActor], pgs: List[PlacementGroup]
) -> List[CommandActor]:
command_actors: List[CommandActor] = []
address, port = get_address_and_port()
Contributor:

That's right, there's no guarantee that the rank 0 worker is placed on the same node that ray_driver is run on. I think you'd want to move this to inside the for loop and call it on the rank 0 actor for each group of actors, which would require some changes to the current for loop.

So add a new method to CommandActor:

def get_address_and_port(self):
    return get_address_and_port()

And then inside the for loop the sequencing should be:

  • create all the actors
  • get the master address and port from the rank 0 worker
  • then pass in the master address and port to all the actors, which means that these values cannot be set during the actor init

Happy to chat more about this as well, or provide some code snippets
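
Roughly, that sequencing could look like the sketch below (a hedged illustration rather than the PR's final code: CommandActor is stripped down here, and the num_replicas field and the placement_group= option name, which matches Ray at the time of this PR, are assumptions):

```
import socket
import ray

def get_address_and_port():
    # IP of the node this runs on plus an OS-assigned free port
    addr = ray.util.get_node_ip_address()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        port = s.getsockname()[1]
    return addr, port

@ray.remote
class CommandActor:
    def get_address_and_port(self):
        return get_address_and_port()  # executed on the actor's own node

    def set_master(self, addr, port):
        # master address/port arrive after creation instead of in __init__
        self.master_addr, self.master_port = addr, port

def create_command_actors(actors, pgs):
    command_actors = []
    for actor, pg in zip(actors, pgs):
        # 1) create all replicas for this role first
        group = [
            CommandActor.options(placement_group=pg).remote()
            for _ in range(actor.num_replicas)
        ]
        # 2) ask the rank 0 actor for its node's address and a free port
        master_addr, master_port = ray.get(group[0].get_address_and_port.remote())
        # 3) hand those values to every actor in the group
        ray.get([a.set_master.remote(master_addr, master_port) for a in group])
        command_actors.extend(group)
    return command_actors
```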

)

if app.metadata:
_logger.warning("The Ray scheduler does not use metadata information.")
def log_iter(
Contributor:

As @jiaodong mentioned, log streaming support was added to jobs in Ray 1.10: ray-project/ray#20976.

After upgrading to 1.10, that functionality can probably be leveraged directly.
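
A hedged sketch of what leveraging that could look like through the job submission SDK (the import path and method names have moved between Ray versions, so treat these as assumptions to verify against the installed release):

```
import asyncio
from ray.job_submission import JobSubmissionClient

async def tail_logs(dashboard_address: str, job_id: str) -> None:
    client = JobSubmissionClient(f"http://{dashboard_address}")
    print(client.get_job_logs(job_id))                 # everything logged so far
    async for lines in client.tail_job_logs(job_id):   # streaming, per ray-project/ray#20976
        print(lines, end="")

# asyncio.run(tail_logs("127.0.0.1:8265", "raysubmit_..."))
```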

@cbalioglu (Contributor) left a comment

LGTM

@facebook-github-bot (Contributor):

@msaroufim has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D33689705

@msaroufim force-pushed the raydriver branch 2 times, most recently from 54f141f to 05e1912, on January 28, 2022 19:13
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D33689705

Summary:
## To run the distributed PyTorch test

```
cd torch/torchx/schedulers/test
python -m unittest ray_scheduler_test.py
```

```
2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
status: PENDING
status: PENDING
status: RUNNING
status: RUNNING
status: RUNNING
status: RUNNING
status: SUCCEEDED
2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
(CommandActor pid=3575) initializing `gloo` process group
(CommandActor pid=3574) initializing `gloo` process group
(CommandActor pid=3575) successfully initialized process group
(CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
(CommandActor pid=3574) successfully initialized process group
(CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2
```

## To run the distributed PyTorch test using the torchX CLI

```
# Setup cluster and get a HEAD NODE IP
ray up -y ray_cluster.yaml
pip install torchx[dev]

# Get a job ID from deployed job
torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=test_dir utils.binary --entrypoint ray_simple.py

# Use job ID to get logs or job status
torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
```

## Or
```
ray up -y ray_cluster.yaml
pip install torchx[dev]
torchx run -s ray -cfg cluster_config_file=ray_cluster.yaml,working_dir=test_dir utils.binary --entrypoint ray_simple.py
```

## ray_simple.py

The unit test `ray_scheduler_test.py` is more interesting.

```
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

def main() -> None:
    print("hello")
    return

if __name__ == "__main__":
    main()
```

Pull Request resolved: #329

Reviewed By: aivanou

Differential Revision: D33689705

Pulled By: msaroufim

fbshipit-source-id: b0fb5b95cf148ab3047a7df43e196b05d443a6d3
@facebook-github-bot (Contributor):

This pull request was exported from Phabricator. Differential Revision: D33689705
