
Cannot start a simple local cluster using the config.yaml - workers are not found #42128

Open
jav-ed opened this issue Dec 29, 2023 · 5 comments
Labels
bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · core-clusters (For launching and managing Ray clusters/jobs/kubernetes) · P1 (Issue that should be fixed within a few weeks) · stability

Comments


jav-ed commented Dec 29, 2023

What happened + What you expected to happen

I have multiple PCs that are connected and can be accessed easily through SSH. Manually logging into a machine, i.e. a node, and defining it to be the head or a worker works fine. The issue arises when I try to do the very same thing using the config.yaml.

First, the manual procedure:

  1. ssh into a node that shall be the head
  2. activate the virtual environment
  3. ray start --head --port=6379

Now SSH into all the other machines that shall be the workers and run
ray start --address=head-node-address:port

Using ray status or viewing the dashboard, it can be observed that all the desired nodes are online.
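
For reference, a condensed sketch of this manual procedure (hostnames, SSH user, and virtual environment path are taken from the config.yaml below; the head address is illustrative):

    # on the head node
    ssh DOM+jabu413e@ilrpoollin04
    source ~/Progs/Virtual_Env/py_P_Bert/bin/activate
    ray start --head --port=6379

    # on each worker node (ilrpoollin08, ilrpoollin09)
    ssh DOM+jabu413e@ilrpoollin08
    source ~/Progs/Virtual_Env/py_P_Bert/bin/activate
    ray start --address=head-node-address:6379   # e.g. ilrpoollin04:6379

    # on any node, to verify
    ray status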

Now this shall be replicated with a config.yaml. However, it only occasionally finds the workers; most of the time it does not.

cluster_name: default
provider:
    type: local
    head_ip: ilrpoollin04
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [ilrpoollin08, ilrpoollin09]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: DOM+jabu413e

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5
cluster_synced_files: []

file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
    # - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: 
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: 
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray stop
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate &&  ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray stop
    # - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray start --address=$RAY_HEAD_IP:6379
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray start --address=141.30.159.38:6379

Versions / Dependencies

(py_P_Bert) ➜  0_Yamls git:(main) ✗ python --version
Python 3.9.18
(py_P_Bert) ➜  0_Yamls git:(main) ✗ ray --version
ray, version 2.9.0
(py_P_Bert) ➜  0_Yamls git:(main) ✗ lsb_release -a
LSB Version:	core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0.fake-amd64:desktop-4.0.fake-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0.fake-amd64:graphics-4.0.fake-noarch
Distributor ID:	openSUSE
Description:	openSUSE Leap 15.5
Release:	15.5
Codename:	n/a

Reproduction script

Please see the description above; the config.yaml there serves as the reproduction script.
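
For completeness, the config above is launched with the standard Ray cluster launcher commands (the filename config.yaml is assumed):

    ray up -y config.yaml              # start the head node and bring up the workers
    ray exec config.yaml 'ray status'  # check from the head node which nodes have joined
    ray down -y config.yaml            # tear the cluster down again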

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@jav-ed jav-ed added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 29, 2023
@stephanie-wang stephanie-wang added core Issues that should be addressed in Ray Core core-clusters For launching and managing Ray clusters/jobs/kubernetes labels Jan 2, 2024
@jjyao jjyao removed the core Issues that should be addressed in Ray Core label Jan 8, 2024
@anyscalesam
Contributor

@architkulkarni can you review and triage?

@architkulkarni architkulkarni added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 8, 2024
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Feb 6, 2024

millefalcon commented Jun 26, 2024

@anyscalesam Hello folks, we're facing the same issue. Any updates or suggestions for working around this? Thanks

@millefalcon

Hello folks, I have found that if we use ray stop --force in both the head and worker start-ray commands, it seems to work.
Also, I had to follow #39565 (comment) for the worker to start the next time, if I had previously shut down the cluster (I have to manually bring down the worker for it to stop).

#46204 #45571 seems related.
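
For clarity, a sketch of what the adjusted start commands look like in the config (mirroring the yaml from the issue description; only ray stop is changed to ray stop --force, and this is the workaround as described, not a confirmed fix):

    head_start_ray_commands:
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

    worker_start_ray_commands:
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray start --address=141.30.159.38:6379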

@anyscalesam anyscalesam added p0.5 and removed P1 Issue that should be fixed within a few weeks labels Jul 19, 2024

pratos commented Jul 30, 2024

@millefalcon I tried using ray stop --force before the head and worker start commands as you suggested, but haven't been able to set up the worker node. I have a yaml file similar to the one presented in the issue. Can you share your yaml file and the steps you performed?


millefalcon commented Aug 2, 2024

@pratos I don't have the exact yaml at the moment, but it is mostly similar to example-full.yaml (local). The main difference was that I used ray stop --force instead of just ray stop for both the head node and the workers.

I followed the exact steps as mentioned here #39565 (comment).

Note: In hindsight, it only worked intermittently. I had to write a wrapper script that SSHes into the worker nodes and runs ray stop, then ray start, etc., to make it work every time; a rough sketch is below.

So I guess it didn't fully fix my issue, sorry.
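
A minimal sketch of such a wrapper script, assuming the hostnames, SSH user, head address, and virtual environment path from the config in the issue description (illustrative only, not the exact script used):

    #!/usr/bin/env bash
    # Restart Ray on each worker over SSH once the head node is up.
    HEAD_ADDRESS="141.30.159.38:6379"
    WORKERS=(ilrpoollin08 ilrpoollin09)

    for host in "${WORKERS[@]}"; do
        ssh "DOM+jabu413e@${host}" \
            "source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force && ray start --address=${HEAD_ADDRESS}"
    done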

@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed P0.5 labels Oct 30, 2024