
Cannot start a simple local cluster using the config.yaml - workers are not found #42128

Open
jav-ed opened this issue Dec 29, 2023 · 5 comments
Labels
bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · core-clusters (For launching and managing Ray clusters/jobs/kubernetes) · P1 (Issue that should be fixed within a few weeks) · stability

Comments


jav-ed commented Dec 29, 2023

What happened + What you expected to happen

I have multiple PCs that are connected and can be accessed easily through SSH. Manually logging into a machine, i.e. a node, and defining it to be the head or a worker works fine. The issue arises when I try to do the very same thing using the config.yaml.

First, the manual procedure:

  1. ssh into a node that shall be the head
  2. activate the virtual environment
  3. ray start --head --port=6379

Now SSH into all the other machines that shall be the workers and run
ray start --address=head-node-address:port

Using ray status or viewing the dashboard, it can be observed that all the desired nodes are online.
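
For reference, a condensed sketch of this manual procedure (hostnames, SSH user, and virtual environment path are taken from the config.yaml below; the head address is illustrative):

    # on the head node
    ssh DOM+jabu413e@ilrpoollin04
    source ~/Progs/Virtual_Env/py_P_Bert/bin/activate
    ray start --head --port=6379

    # on each worker node (ilrpoollin08, ilrpoollin09)
    ssh DOM+jabu413e@ilrpoollin08
    source ~/Progs/Virtual_Env/py_P_Bert/bin/activate
    ray start --address=head-node-address:6379   # e.g. ilrpoollin04:6379

    # on any node, to verify
    ray status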

Now this shall be replicated with a config.yaml. However, it only occasionally finds the workers; most of the time it does not.

cluster_name: default
provider:
    type: local
    head_ip: ilrpoollin04
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [ilrpoollin08, ilrpoollin09]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: DOM+jabu413e

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5
cluster_synced_files: []

file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
    # - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  

# List of shell commands to run to set up each node.
setup_commands: []

# Custom commands that will be run on the head node after common setup.
head_setup_commands: 
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: 
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray stop
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate &&  ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray stop
    # - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray start --address=$RAY_HEAD_IP:6379
    - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate  && ray start --address=141.30.159.38:6379

Versions / Dependencies

(py_P_Bert) ➜  0_Yamls git:(main) ✗ python --version
Python 3.9.18
(py_P_Bert) ➜  0_Yamls git:(main) ✗ ray --version
ray, version 2.9.0
(py_P_Bert) ➜  0_Yamls git:(main) ✗ lsb_release -a
LSB Version:	core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0.fake-amd64:desktop-4.0.fake-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0.fake-amd64:graphics-4.0.fake-noarch
Distributor ID:	openSUSE
Description:	openSUSE Leap 15.5
Release:	15.5
Codename:	n/a

Reproduction script

Please see the description above; the config.yaml there serves as the reproduction script.
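
For completeness, the config above is launched with the standard Ray cluster launcher commands (the filename config.yaml is assumed):

    ray up -y config.yaml              # start the head node and bring up the workers
    ray exec config.yaml 'ray status'  # check from the head node which nodes have joined
    ray down -y config.yaml            # tear the cluster down again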

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@jav-ed jav-ed added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 29, 2023
@stephanie-wang stephanie-wang added core Issues that should be addressed in Ray Core core-clusters For launching and managing Ray clusters/jobs/kubernetes labels Jan 2, 2024
@jjyao jjyao removed the core Issues that should be addressed in Ray Core label Jan 8, 2024
@anyscalesam
Contributor

@architkulkarni can you review and triage?

@architkulkarni architkulkarni added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 8, 2024
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Feb 6, 2024

millefalcon commented Jun 26, 2024

@anyscalesam Hello folks, we're facing the same issue. Any updates or suggestions for working around this? Thanks

@millefalcon

Hello folks, I have found that if we use ray stop --force in both the head and worker start-ray commands, it seems to work.
Also, I had to follow #39565 (comment) for the worker to start the next time, if I had previously shut down the cluster (I have to manually bring down the worker for it to stop).

#46204 #45571 seems related.
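
For clarity, a sketch of what the adjusted start commands look like in the config (mirroring the yaml from the issue description; only ray stop is changed to ray stop --force, and this is the workaround as described, not a confirmed fix):

    head_start_ray_commands:
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

    worker_start_ray_commands:
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force
        - source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray start --address=141.30.159.38:6379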

@anyscalesam anyscalesam added p0.5 and removed P1 Issue that should be fixed within a few weeks labels Jul 19, 2024

pratos commented Jul 30, 2024

@millefalcon I tried using ray stop --force before the head and worker start commands as you suggested, but haven't been able to set up the worker node. I have a yaml file similar to the one presented in the issue. Can you share your yaml file and the steps you performed?


millefalcon commented Aug 2, 2024

@pratos I don't have the exact yaml at the moment, but it is mostly similar to example-full.yaml (local). The main difference was that I used ray stop --force instead of just ray stop for both the head node and the workers.

I followed the exact steps as mentioned here #39565 (comment).

Note: In hindsight, it only worked intermittently. I had to write a wrapper script that SSHes into the worker nodes and runs ray stop, then ray start, etc., to make it work every time; a rough sketch is below.

So I guess it didn't fully fix my issue, sorry.
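
A minimal sketch of such a wrapper script, assuming the hostnames, SSH user, head address, and virtual environment path from the config in the issue description (illustrative only, not the exact script used):

    #!/usr/bin/env bash
    # Restart Ray on each worker over SSH once the head node is up.
    HEAD_ADDRESS="141.30.159.38:6379"
    WORKERS=(ilrpoollin08 ilrpoollin09)

    for host in "${WORKERS[@]}"; do
        ssh "DOM+jabu413e@${host}" \
            "source ~/Progs/Virtual_Env/py_P_Bert/bin/activate && ray stop --force && ray start --address=${HEAD_ADDRESS}"
    done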

@jjyao jjyao added P1 Issue that should be fixed within a few weeks and removed P0.5 labels Oct 30, 2024