
[ray local cluster] nodes marked as uninitialized #39565

Open
jmakov opened this issue Sep 11, 2023 · 66 comments
Labels
bug (Something that is supposed to be working; but isn't) · core (Issues that should be addressed in Ray Core) · core-clusters (For launching and managing Ray clusters/jobs/kubernetes) · P2 (Important issue, but not time-critical) · stability
Milestone
Autoscaler V2
Comments

@jmakov
Contributor

jmakov commented Sep 11, 2023

What happened + What you expected to happen

Running ray up ray.yaml, I'd expect all 4 worker nodes to be set up and join the cluster, since I've set min_workers: 4. However, ray monitor ray.yaml shows the nodes as uninitialized.

Versions / Dependencies

ray 2.6.4
python 3.9.18
manjaro

Reproduction script

ray.yaml

# A unique identifier for the head node and workers of this cluster.
cluster_name: test

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
#docker:
#    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
#    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
#    container_name: "ray_container"
#    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
#    # if no cached version is present.
#    pull_before_run: True
#    run_options:   # Extra options to pass into "docker run"
#        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: 192.168.0.101
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips:
      - 192.168.0.106
      - 192.168.0.107
      - 192.168.0.108
      - 192.168.0.110
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: myuser
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    # ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
min_workers: 4

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
# This field is optional.
#max_workers: 4
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call `conda -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
    "/mnt/ray": ".",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands:
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/env.yaml --prune

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ray stop
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ulimit -c unlimited && ray start --head --disable-usage-stats --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config='{"automatic_object_spilling_enabled":true,"max_io_workers":8,"min_spilling_size":104857600,"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/ray/object_spilling\"}}"}'


# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ray stop
    - source ~/mambaforge-pypy3/etc/profile.d/conda.sh && conda activate test && ulimit -c unlimited && ray start --address=$RAY_HEAD_IP:6379 --disable-usage-stats

Issue Severity

High: It blocks me from completing my task.

@jmakov jmakov added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Sep 11, 2023
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Sep 25, 2023
@rkooo567
Contributor

cc @rickyyx can you follow up with the investigation?

    type: local
    head_ip: 192.168.0.101
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips:
      - 192.168.0.106
      - 192.168.0.107
      - 192.168.0.108
      - 192.168.0.110

Can you tell us what exactly this is for?

@rickyyx rickyyx self-assigned this Sep 25, 2023
@rickyyx rickyyx added this to the Autoscaler V2 milestone Sep 25, 2023
@rickyyx
Contributor

rickyyx commented Sep 25, 2023

Hey @jmakov - would you be able to share any monitor.* logs that were generated? That would be helpful for debugging.
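
(For anyone else gathering these, a rough sketch of how the monitor logs can be pulled, assuming the cluster config file is named ray.yaml and the launcher machine can reach the head node:)

# Stream the autoscaler log through the launcher (as used elsewhere in this thread):
ray monitor ray.yaml

# Or tail the log files directly on the head node over SSH:
ray exec ray.yaml "tail -n 200 /tmp/ray/session_latest/logs/monitor.*"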

@jmakov
Contributor Author

jmakov commented Sep 26, 2023

Didn't see anything exciting happening there, only monitor.log has some entries:

2023-09-22 21:37:07,546 INFO monitor.py:699 -- Starting monitor using ray installation: /home/jernej_m/mambaforge-pypy3/envs/test_ray/lib/python3.10/site-packages/ray/__init__.py
2023-09-22 21:37:07,546 INFO monitor.py:700 -- Ray version: 2.6.3
2023-09-22 21:37:07,546 INFO monitor.py:701 -- Ray commit: {{RAY_COMMIT_SHA}}
2023-09-22 21:37:07,546 INFO monitor.py:702 -- Monitor started with command: ['/home/jernej_m/mambaforge-pypy3/envs/test_ray/lib/python3.10/site-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-09-22_21-37-05_827384_110848/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=192.168.0.101:6379', '--autoscaling-config=~/ray_bootstrap_config.yaml', '--monitor-ip=192.168.0.101']
2023-09-22 21:37:07,552 INFO monitor.py:167 -- session_name: session_2023-09-22_21-37-05_827384_110848
2023-09-22 21:37:07,554 INFO monitor.py:199 -- Starting autoscaler metrics server on port 44217
2023-09-22 21:37:07,556 INFO monitor.py:224 -- Monitor: Started
2023-09-22 21:37:07,571 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: []
2023-09-22 21:37:07,572 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:07,572 INFO autoscaler.py:274 -- disable_node_updaters:False
2023-09-22 21:37:07,572 INFO autoscaler.py:282 -- disable_launch_config_check:False
2023-09-22 21:37:07,572 INFO autoscaler.py:294 -- foreground_node_launch:False
2023-09-22 21:37:07,572 INFO autoscaler.py:304 -- worker_liveness_check:True
2023-09-22 21:37:07,572 INFO autoscaler.py:312 -- worker_rpc_drain:True
2023-09-22 21:37:07,573 INFO autoscaler.py:362 -- StandardAutoscaler: {'cluster_name': 'test', 'auth': {'ssh_user': 'jernej_m', 'ssh_private_key': '~/ray_bootstrap_key.pem'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 5, 'docker': {}, 'initialization_commands': [], 'setup_commands': ['source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/mount/env.yaml -n test_ray --prune'], 'head_setup_commands': ['source ~/mambaforge-pypy3/etc/profile.d/conda.sh && mamba env update -f /mnt/ray/mount/env.yaml -n test_ray --prune'], 'worker_setup_commands': ['source ~/mambaforge-pypy3/etc/profile.d/c>
2023-09-22 21:37:07,574 INFO monitor.py:394 -- Autoscaler has not yet received load metrics. Waiting.
2023-09-22 21:37:12,588 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-09-22 21:37:12,588 INFO load_metrics.py:161 -- LoadMetrics: Removed ip: 192.168.0.108.
2023-09-22 21:37:12,588 INFO load_metrics.py:164 -- LoadMetrics: Removed 1 stale ip mappings: {'192.168.0.108'} not in {'192.168.0.101'}
2023-09-22 21:37:12,589 INFO autoscaler.py:421 --
======== Autoscaler status: 2023-09-22 21:37:12.589294 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 0.0/2.0 GPU
 0B/77.60GiB memory
 0B/37.25GiB object_store_memory

Demands:
 (no resource demands)
2023-09-22 21:37:12,590 INFO autoscaler.py:1368 -- StandardAutoscaler: Queue 4 new nodes for launch
2023-09-22 21:37:12,590 INFO autoscaler.py:464 -- The autoscaler took 0.003 seconds to complete the update iteration.
2023-09-22 21:37:12,591 INFO node_launcher.py:174 -- NodeLauncher0: Got 4 nodes to launch.
2023-09-22 21:37:12,592 INFO monitor.py:424 -- :event_summary:Resized to 56 CPUs, 4 GPUs.
2023-09-22 21:37:12,594 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,594 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,595 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,596 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['192.168.0.106', '192.168.0.107', '192.168.0.108', '192.168.0.110', '192.168.0.101']
2023-09-22 21:37:12,596 INFO node_launcher.py:174 -- NodeLauncher0: Launching 4 nodes, type local.cluster.node.
2023-09-22 21:37:17,608 INFO autoscaler.py:141 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2023-09-22 21:37:17,609 INFO autoscaler.py:421 --
======== Autoscaler status: 2023-09-22 21:37:17.609649 ========
Node status
---------------------------------------------------------------
Healthy:
 2 local.cluster.node
Pending:
 192.168.0.106: local.cluster.node, uninitialized
 192.168.0.107: local.cluster.node, uninitialized
 192.168.0.110: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/56.0 CPU
 0.0/4.0 GPU
 0B/98.01GiB memory
 0B/46.00GiB object_store_memory

Demands:
 (no resource demands)
2023-09-22 21:37:17,619 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.106.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.107.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.108.
2023-09-22 21:37:17,620 INFO autoscaler.py:1316 -- Creating new (spawn_updater) updater thread for node 192.168.0.110.
Running everything manually works. It would be nice to have a working cluster launcher for on-prem clusters.
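
For reference, "running everything manually" on a local cluster like this boils down to something like the following sketch (adapted from the start commands in the yaml above; the conda activation is omitted and the head IP is the one from the config):

# On the head node (192.168.0.101):
ray stop
ray start --head --port=6379

# On each worker node:
ray stop
ray start --address=192.168.0.101:6379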

@ajaichemmanam

+1, same issue for me, even with systems in the cloud (a third-party cloud, not AWS/GCS/Azure). I opened all ports; sometimes the workers get connected, sometimes they show as uninitialized.

@rickyyx
Contributor

rickyyx commented Oct 2, 2023

cc @gvspraveen could someone from the cluster team help take a look? I believe this is more relevant to the cluster launcher than to the actual autoscaling logic, since "running everything manually works".

@rickyyx rickyyx assigned gvspraveen and unassigned rickyyx Oct 2, 2023
@jmakov
Contributor Author

jmakov commented Oct 2, 2023

@rickyyx not to mention manually starting Ray not working and the cluster launcher not working. I'm wondering how Ray works at all for anybody. As someone who has used Ray for more than a year: every other release breaks a core part.

@rkooo567
Contributor

rkooo567 commented Oct 2, 2023

cc @anyscalesam can you triage this issue with @gvspraveen?

@rickyyx rickyyx added core-clusters For launching and managing Ray clusters/jobs/kubernetes and removed core Issues that should be addressed in Ray Core labels Oct 2, 2023
@architkulkarni architkulkarni added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 3, 2023
@architkulkarni
Contributor

architkulkarni commented Oct 5, 2023

I'm able to reproduce this on AWS using pip install "ray[default]"==2.7.0 in the setup commands and the latest ray master on the client side for the cluster launcher. [See below - it was just a port issue on my end.]

@jmakov do you happen to remember if this was working for you on a previous version of Ray, and if so which one?

@jmakov
Contributor Author

jmakov commented Oct 6, 2023

The cluster launcher worked for me for the last 2+ years using a local cluster (without Docker, just a conda env). I think it was 2.6.0 before I made the mistake of upgrading, if I remember correctly. I think I'll just start writing my own tests and run them before every upgrade.
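
A minimal pre-upgrade smoke test along those lines might look like this (a sketch; assumes the config file is ray.yaml and the launcher machine has a matching Ray version installed):

ray up -y ray.yaml                 # bring the cluster up non-interactively
ray exec ray.yaml "ray status"     # check on the head node that all workers registered
ray down -y ray.yaml               # tear everything down again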

@ajaichemmanam

ajaichemmanam commented Oct 9, 2023

2023-10-09 11:46:28,208 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['216.48.179.215', '164.52.201.70']
Fetched IP: 164.52.201.70
Warning: Permanently added '164.52.201.70' (ED25519) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==

==> /tmp/ray/session_latest/logs/monitor.log <==
2023-10-08 23:13:33,485 INFO monitor.py:690 -- Starting monitor using ray installation: /home/ray/anaconda3/lib/python3.11/site-packages/ray/__init__.py
2023-10-08 23:13:33,485 INFO monitor.py:691 -- Ray version: 2.7.1
2023-10-08 23:13:33,485 INFO monitor.py:692 -- Ray commit: 9f07c12615958c3af3760604f6dcacc4b3758a47
2023-10-08 23:13:33,486 INFO monitor.py:693 -- Monitor started with command: ['/home/ray/anaconda3/lib/python3.11/site-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-10-08_23-13-32_012785_2484/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=164.52.201.70:6379', '--autoscaling-config=/home/ray/ray_bootstrap_config.yaml', '--monitor-ip=164.52.201.70']
2023-10-08 23:13:33,489 INFO monitor.py:159 -- session_name: session_2023-10-08_23-13-32_012785_2484
2023-10-08 23:13:33,490 INFO monitor.py:191 -- Starting autoscaler metrics server on port 44217
2023-10-08 23:13:33,491 INFO monitor.py:216 -- Monitor: Started
2023-10-08 23:13:33,506 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: []
2023-10-08 23:13:33,507 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['216.48.179.215', '164.52.201.70']
2023-10-08 23:13:33,507 INFO autoscaler.py:274 -- disable_node_updaters:False
2023-10-08 23:13:33,507 INFO autoscaler.py:282 -- disable_launch_config_check:False
2023-10-08 23:13:33,507 INFO autoscaler.py:294 -- foreground_node_launch:False
2023-10-08 23:13:33,507 INFO autoscaler.py:304 -- worker_liveness_check:True
2023-10-08 23:13:33,507 INFO autoscaler.py:312 -- worker_rpc_drain:True
2023-10-08 23:13:33,508 INFO autoscaler.py:362 -- StandardAutoscaler: {'cluster_name': 'default', 'auth': {'ssh_user': 'user', 'ssh_private_key': '~/ray_bootstrap_key.pem'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 30, 'docker': {'image': 'rayproject/ray:2.7.1.9f07c1-py311-gpu', 'worker_image': 'rayproject/ray:2.7.1.9f07c1-py311-gpu', 'container_name': 'ray_container', 'pull_before_run': True, 'run_options': ['--ulimit nofile=65536:65536']}, 'initialization_commands': [], 'setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6  -y', 'pip install -r "/app/requirements-gpu.txt"'], 'head_setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6  -y', 'pip install -r "/app/requirements-gpu.txt"'], 'worker_setup_commands': ['sudo apt-get update', 'sudo apt-get install gcc ffmpeg libsm6 libxext6  -y', 'pip install -r "/app/requirements-gpu.txt"'], 'head_start_ray_commands': ['ray stop', 'ulimit -c unlimited && export RAY_health_check_timeout_ms=30000 && ray start --head --node-ip-address=164.52.201.70 --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0 --disable-usage-stats --log-color=auto -v'], 'worker_start_ray_commands': ['ray stop', 'ray start --address=164.52.201.70:6379 --object-manager-port=8076'], 'file_mounts': {'~/.ssh/id_rsa': '/home/ray/.ssh/id_rsa', '/app/requirements-gpu.txt': '/app/requirements-gpu.txt'}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': ['**/.git', '**/.git/**'], 'rsync_filter': ['.gitignore'], 'provider': {'type': 'local', 'head_ip': '164.52.201.70', 'worker_ips': ['216.48.179.215']}, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 1, 'max_workers': 1}}, 'head_node_type': 'local.cluster.node', 'max_workers': 1, 'no_restart': False}
2023-10-08 23:13:33,509 INFO monitor.py:385 -- Autoscaler has not yet received load metrics. Waiting.
2023-10-08 23:13:38,522 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-08 23:13:38,522 INFO autoscaler.py:421 -- 
======== Autoscaler status: 2023-10-08 23:13:38.522726 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/12.0 CPU
 0.0/1.0 GPU
 0B/28.57GiB memory
 0B/14.29GiB object_store_memory

Demands:
 (no resource demands)
2023-10-08 23:13:38,524 INFO autoscaler.py:1379 -- StandardAutoscaler: Queue 1 new nodes for launch
2023-10-08 23:13:38,524 INFO autoscaler.py:464 -- The autoscaler took 0.002 seconds to complete the update iteration.
2023-10-08 23:13:38,524 INFO node_launcher.py:177 -- NodeLauncher0: Got 1 nodes to launch.
2023-10-08 23:13:38,525 INFO monitor.py:415 -- :event_summary:Resized to 12 CPUs, 1 GPUs.
2023-10-08 23:13:38,526 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['216.48.179.215', '164.52.201.70']
2023-10-08 23:13:38,526 INFO node_launcher.py:177 -- NodeLauncher0: Launching 1 nodes, type local.cluster.node.
2023-10-08 23:13:43,534 INFO autoscaler.py:141 -- The autoscaler took 0.0 seconds to fetch the list of non-terminated nodes.
2023-10-08 23:13:43,534 INFO autoscaler.py:421 -- 
======== Autoscaler status: 2023-10-08 23:13:43.534774 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 216.48.179.215: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/12.0 CPU
 0.0/1.0 GPU
 0B/28.57GiB memory
 0B/14.29GiB object_store_memory

Demands:
 (no resource demands)
2023-10-08 23:13:43,537 INFO autoscaler.py:1326 -- Creating new (spawn_updater) updater thread for node 216.48.179.215.

@ajaichemmanam

The above log is for
2023-10-08 23:13:33,485 INFO monitor.py:691 -- Ray version: 2.7.1
2023-10-08 23:13:33,485 INFO monitor.py:692 -- Ray commit: 9f07c12

@jmakov
Contributor Author

jmakov commented Oct 9, 2023

This issue is still present in ray 2.7.1

@ajaichemmanam

Let us know if any other details are required

@architkulkarni
Contributor

architkulkarni commented Oct 19, 2023

Actually, when I reproduced the issue earlier, I had forgotten to open all the ports. After opening all ports, I wasn't able to reproduce the issue.

@jmakov or @ajaichemmanam if you're able to reproduce the issue and you have time, it would potentially be very helpful if you could amend your YAML file as follows:

worker_start_ray_commands:
    - ray stop
    - "echo \"Executing: ray start --address=$RAY_HEAD_IP:6379\" >> ray_worker_output.txt"
    - ray start --address=$RAY_HEAD_IP:6379 >> ray_worker_output.txt 2>&1

And share the ray_worker_output.txt from the failing worker nodes. (Or do modify the commands in any way you see fit, as long as we can see the output of ray start --address=...)

@jmakov
Contributor Author

jmakov commented Oct 20, 2023

@architkulkarni I've added ulimit -c unlimited && ray start --address=$RAY_HEAD_IP:6379 --disable-usage-stats >> /tmp/ray_worker_output.txt 2>&1 and got:

ls /tmp/ray_worker_output.txt
ls: cannot access '/tmp/ray_worker_output.txt': No such file or directory

@MatteoCorvi

MatteoCorvi commented Jun 13, 2024

Worker nodes are almost always stuck as launching/uninitialized, or there is no cluster status at all.
The only way a recent version (2.22) seems to work for me is to start from a conda env with an old version of Ray (2.3) and then pip install -U ray==2.22. 100% success creating a working on-prem cluster so far. The new dashboard and logging are there, and the cluster seems more stable, so I assume the improvements from newer versions came through.
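
If I understand the workaround correctly, it is roughly the following (a sketch; the env name and Python version are illustrative):

conda create -n ray-onprem python=3.10 -y
conda activate ray-onprem
pip install "ray[default]==2.3.0"      # start from the old, known-good version
pip install -U "ray[default]==2.22.0"  # then upgrade Ray only, keeping its older dependencies where possible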

@jacksonjacobs1

Hi @MatteoCorvi,

Glad to hear you were able to get this working, but I'm a little confused about your solution. How is this different from simply installing ray version 2.22?

@MatteoCorvi

Hi @jacksonjacobs1,
I'm not sure, but aside from Ray not much else was changed, if I recall correctly, so just updating might have kept old versions of the dependencies that don't cause issues.

@jacksonjacobs1

Interesting, thanks.

It would be fantastic if a Ray dev from the cluster team could comment on why newer versions of ray seem to break on-prem cluster launching & cleanup.

@anyscalesam What would be your recommendation for resolving this issue?

@Tipmethewink

Tipmethewink commented Jul 23, 2024

I'm running Ray on AWS EC2 instances and hit the same issue. ray up launches the head node, but there are no further logs (nothing about setting up nodes), the head node sits in uninitialized status, and eventually ray up times out and everything shuts down. If I commented out file_mounts, the cluster came up fine. That led me to realise that Ray doesn't use rsync over SSH (my assumption); it uses the default port 873, which I hadn't opened (it's not documented here). As soon as I opened 873, it all sprang to life.
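
If the missing piece is an EC2 security-group rule, opening an extra port looks roughly like this with the AWS CLI (the group ID and CIDR are placeholders; whether port 873 is really what's needed will depend on your setup):

aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 873 \
    --cidr 10.0.0.0/16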

@jacksonjacobs1

Hi @Tipmethewink, are you using existing EC2 instances (equivalent to an on-prem cluster) or using ray cluster launcher to provision new EC2 instances?

@Tipmethewink

Tipmethewink commented Jul 23, 2024 via email

@jyakaranda

I got this same painful issue today. After retrieving the code and logs from the Ray dashboard, I finally got my worker node started.
I'm not sure whether this will solve your problem, but I still want to share my debugging process.

  1. If you can start the head node via ray up cluster.yaml, check monitor.log and monitor.out in the dashboard at http://127.0.0.1:8265/#/logs (forwarded by ray dashboard cluster.yaml); sometimes these logs will tell you whether the worker node is starting or hanging. In my case, the head node was hanging on a simple SSH issue;

(screenshot omitted)

  2. The SSH hanging issue is tricky. In my case, it was because Ray uses the same auth for the head node and all worker nodes, but I hadn't created the same user on the worker node as on the head node. After creating the same user on the worker node and uncommenting ssh_private_key, the worker node could finally be SSHed into and started from the head node.

  3. As a former comment mentions, if the worker node didn't stop its container properly, the head node still can't start the worker node properly either, so you might need to docker stop RAY_CONTAINER_NAME manually before ray up.

Hope these findings help.
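
A rough sketch of those two fixes as commands (the user name, public key, and container name are illustrative and should match what's in your cluster yaml):

# On each worker node: create the same user the head node connects as, and authorize the same key
sudo useradd -m -s /bin/bash myuser
sudo mkdir -p /home/myuser/.ssh
echo "ssh-ed25519 AAAA... launcher-key" | sudo tee -a /home/myuser/.ssh/authorized_keys
sudo chown -R myuser:myuser /home/myuser/.ssh
sudo chmod 700 /home/myuser/.ssh
sudo chmod 600 /home/myuser/.ssh/authorized_keys

# If a previous run left a Ray container behind, stop it before the next ray up
docker stop ray_container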

@olly-writes-code

olly-writes-code commented Oct 16, 2024

Hey folks, I ran into a similar issue when trying to set up an "On Prem" 1 click cluster via Lambda Labs.

I could start the cluster successfully when not using a docker image. But as soon as I switched to the docker image, I ran into the uninitialized issue.

I would get something like

poetry run ray status
======== Autoscaler status: 2024-10-16 21:28:29.027359 ========
Node status
---------------------------------------------------------------
Active:
 1 local.cluster.node
Pending:
 scrubbed_ip: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0B/18.61GiB memory
 0B/9.31GiB object_store_memory

Demands:
 (no resource demands)

Here's the config.yaml I was using.

cluster_name: test-cluster

upscaling_speed: 1.0

docker:
  container_name: basic-ray-ml-image
  image: rayproject/ray-ml:latest-gpu
  pull_before_run: true

provider:
 type: local
 head_ip: scrubbed_ip
 worker_ips:
  - scrubbed_ip

auth:
 ssh_user: ubuntu
 ssh_private_key: ~/.ssh/keypair

min_workers: 1
max_workers: 1

setup_commands:
 - pip install ray[default]

head_start_ray_commands:
 - ray stop
 - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
 - ray stop
 - ray start --address=$RAY_HEAD_IP:6379

I managed to fix this by:

  1. Manually rebooting the node that wouldn't initialize.
  2. Noticing, when SSHing into that node, that docker ps would return "permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock". I fixed this by running sudo usermod -aG docker $USER, exiting the machine, and then SSHing in again (see the shell recap after this list). This might be a Lambda Labs thing.
  3. Re-running ray up from the head node.
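
For anyone hitting the same Docker permission problem, a rough recap of those steps as commands (assumes an Ubuntu worker with Docker already installed and a launcher config named cluster.yaml):

# 1. Reboot the stuck worker node
sudo reboot

# 2. After SSHing back in, let the SSH user talk to the Docker daemon,
#    then log out and back in so the new group membership takes effect
sudo usermod -aG docker $USER
exit

# 3. From the machine running the launcher, bring the cluster up again
ray up cluster.yaml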

Maybe this helps some people!

I feel like this stems from poor logging / error reporting from the other nodes.

@olly-writes-code

Additionally I don't see any logging or log file.

Even though the instruction from poetry run ray monitor my_cluster.yaml is to find logs at

==> /tmp/ray/session_latest/logs/monitor.out <==

I don't see such a file on any of the nodes

cat /tmp/ray/session_latest/logs/monitor.out
cat: /tmp/ray/session_latest/logs/monitor.out: No such file or directory

@olly-writes-code

olly-writes-code commented Oct 17, 2024

I want to flag that the worker node always gets stuck in the Pending: uninitialized state when trying to run ray up with a custom docker image from AWS ECR.

A few things to note:

  • My head node spins up successfully, pulling my custom docker image.
  • My worker node is logged into the Docker registry, so it shouldn't be a permissions issue

Here's an example of my cluster.yaml

cluster_name: test-cluster

max_workers: 4
upscaling_speed: 1.0

docker:
  image: xyz.dkr.ecr.region.amazonaws.com/ray-worker:latest
  container_name: ray-worker
  pull_before_run: true

provider:
 type: local
 head_ip: scrubbed_ip
 worker_ips:
  - scrubbed_ip

auth:
 ssh_user: ubuntu
 ssh_private_key: ~/.ssh/keypair

min_workers: 1
max_workers: 1

setup_commands:
 - pip install ray

head_start_ray_commands:
 - ray stop
 - ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0

worker_start_ray_commands:
 - ray stop
 - ray start --address=$RAY_HEAD_IP:6379

If I try to force starting the worker node I get this

ray start --address='scrubbed_ip:6379'
Local node IP: scrubbed_ip
[2024-10-16 23:17:07,188 E 14956 14956] gcs_rpc_client.h:179: Failed to connect to GCS at address scrubbed_ip:6379 within 5 seconds.
[2024-10-16 23:17:38,201 W 14956 14956] gcs_client.cc:177: Failed to get cluster ID from GCS server: TimedOut: Timed out while waiting for GCS to become available.

ray monitor my_cluster.yaml looks like

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2024-10-17 00:50:30,507	INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['IP_REMOVED', 'IP_REMOVED']
Fetched IP: IP_REMOVED
==> /tmp/ray/session_latest/logs/monitor.err <==

==> /tmp/ray/session_latest/logs/monitor.log <==
2024-10-17 00:45:00,544	INFO monitor.py:688 -- Starting monitor using ray installation: /usr/local/lib/python3.11/dist-packages/ray/__init__.py
2024-10-17 00:45:00,545	INFO monitor.py:689 -- Ray version: 2.30.0
2024-10-17 00:45:00,545	INFO monitor.py:690 -- Ray commit: 97c37298df9e997b86ca9efed824e27024f3bd60
2024-10-17 00:45:00,545	INFO monitor.py:691 -- Monitor started with command: ['/usr/local/lib/python3.11/dist-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2024-10-17_00-44-58_905796_119/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=IP_REMOVED:6379', '--autoscaling-config=/root/ray_bootstrap_config.yaml', '--monitor-ip=IP_REMOVED']
2024-10-17 00:45:00,554	INFO monitor.py:159 -- session_name: session_2024-10-17_00-44-58_905796_119
2024-10-17 00:45:00,556	INFO monitor.py:191 -- Starting autoscaler metrics server on port 44217
2024-10-17 00:45:00,569	INFO monitor.py:216 -- Monitor: Started
2024-10-17 00:45:00,585	INFO node_provider.py:53 -- ClusterState: Loaded cluster state: []
2024-10-17 00:45:00,586	INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['IP_REMOVED', 'IP_REMOVED']
2024-10-17 00:45:00,586	INFO autoscaler.py:280 -- disable_node_updaters:False
2024-10-17 00:45:00,586	INFO autoscaler.py:288 -- disable_launch_config_check:False
2024-10-17 00:45:00,586	INFO autoscaler.py:300 -- foreground_node_launch:False
2024-10-17 00:45:00,586	INFO autoscaler.py:310 -- worker_liveness_check:True
2024-10-17 00:45:00,586	INFO autoscaler.py:318 -- worker_rpc_drain:True
2024-10-17 00:45:00,589	INFO autoscaler.py:368 -- StandardAutoscaler: {'cluster_name': 'test-cluster', 'auth': {'ssh_user': 'ubuntu', 'ssh_private_key': '~/ray_bootstrap_key.pem'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 5, 'docker': {'image': 'xyz.dkr.ecr.region.amazonaws.com/ray-worker:latest', 'container_name': 'ray-worker', 'pull_before_run': True}, 'initialization_commands': [], 'setup_commands': ['pip install ray[default]==2.30.0'], 'head_setup_commands': ['pip install ray[default]==2.30.0'], 'worker_setup_commands': ['pip install ray[default]==2.30.0'], 'head_start_ray_commands': ['ray stop', 'ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0'], 'worker_start_ray_commands': ['ray stop', 'ray start --address=$RAY_HEAD_IP:6379'], 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': [], 'rsync_filter': [], 'max_workers': 1, 'provider': {'type': 'local', 'head_ip': 'IP_REMOVED', 'worker_ips': ['IP_REMOVED']}, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 1, 'max_workers': 1}}, 'head_node_type': 'local.cluster.node', 'no_restart': False}
2024-10-17 00:45:00,592	INFO monitor.py:383 -- Autoscaler has not yet received load metrics. Waiting.
2024-10-17 00:45:05,606	INFO autoscaler.py:147 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2024-10-17 00:45:05,607	INFO autoscaler.py:427 --
======== Autoscaler status: 2024-10-17 00:45:05.607296 ========
Node status
---------------------------------------------------------------
Active:
 1 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0B/18.59GiB memory
 0B/9.30GiB object_store_memory

Demands:
 (no resource demands)
2024-10-17 00:45:05,611	INFO autoscaler.py:1389 -- StandardAutoscaler: Queue 1 new nodes for launch
2024-10-17 00:45:05,612	INFO autoscaler.py:470 -- The autoscaler took 0.006 seconds to complete the update iteration.
2024-10-17 00:45:05,612	INFO node_launcher.py:177 -- NodeLauncher0: Got 1 nodes to launch.
2024-10-17 00:45:05,615	INFO monitor.py:413 -- :event_summary:Resized to 8 CPUs.
2024-10-17 00:45:05,665	INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['IP_REMOVED', 'IP_REMOVED']
2024-10-17 00:45:05,666	INFO node_launcher.py:177 -- NodeLauncher0: Launching 1 nodes, type local.cluster.node.
2024-10-17 00:45:10,639	INFO autoscaler.py:147 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2024-10-17 00:45:10,640	INFO autoscaler.py:427 --
======== Autoscaler status: 2024-10-17 00:45:10.640625 ========
Node status
---------------------------------------------------------------
Active:
 1 local.cluster.node
Pending:
 IP_REMOVED: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0B/18.59GiB memory
 0B/9.30GiB object_store_memory

Demands:
 (no resource demands)
2024-10-17 00:45:10,647	INFO autoscaler.py:1336 -- Creating new (spawn_updater) updater thread for node IP_REMOVED.

==> /tmp/ray/session_latest/logs/monitor.out <==

@olly-writes-code

olly-writes-code commented Oct 17, 2024

Anything to help debug things would be very useful!

@olly-writes-code

Fixed! I've managed to ray up the cluster from a private docker image! It looks like the Ray version in my Docker image was different from the one being pip installed on the worker node.

@jacksonjacobs1

Thanks @olly-writes-code . Were you able to successfully tear down and re-initialize your cluster?

I ask because your ray version incompatibility issue was definitely not the case for me - I pulled the pre-built rayproject/ray docker image onto all nodes.

On the first attempt, the cluster spun up without any issues. It was only after running ray down and ray up again that the issue started.

@olly-writes-code

Interesting. It seems that running ray down doesn't stop the docker container on the worker node. I had to SSH into the node and kill the docker container manually. Maybe it's related to this issue: #17689.
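
A manual cleanup along those lines might look like this (worker IPs, SSH user, and container name are placeholders; the container name should match the one in the cluster yaml):

# Stop any leftover Ray containers on the workers before re-running ray up
for ip in 192.0.2.10 192.0.2.11; do
    ssh ubuntu@"$ip" "docker stop ray-worker || true; docker rm ray-worker || true"
done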

@DmitriGekhtman
Contributor

Local node provider is not actively maintained.
I'd recommend looking into alternative strategies for managing Ray on-prem.

@olly-writes-code

Ahh, damn, okay. Is it correct to say that Ray is not recommended for running training on a cluster of GPU machines provided by someone other than AWS, Azure, or GCP?

@DmitriGekhtman
Contributor

DmitriGekhtman commented Oct 18, 2024

Ahh, damn, okay. Is it correct to say that Ray is not recommended for running training on a cluster of GPU machines provided by someone other than AWS, Azure, or GCP?

You can use Ray in any on-prem or cloud environment, but I'd recommend figuring out another way to orchestrate the process of pulling images and running Ray start.
One strategy to run Ray on-prem (or on an unsupported cloud provider) is to first figure out how to run Kubernetes in your environment, then use KubeRay to manage Ray clusters in the Kubernetes cluster.
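
For reference, the KubeRay route usually starts from the project's Helm charts, roughly like this (a sketch; the chart and repo names come from the KubeRay project, and the resulting demo cluster will need tailoring before real use):

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator   # install the KubeRay operator
helm install raycluster kuberay/ray-cluster              # create a sample RayCluster
kubectl get pods                                         # head and worker pods should appear once reconciled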

@olly-writes-code

I see. Thanks for the clarification @DmitriGekhtman :)

For future users, it would be great if the deprecation of local clusters could be made clear in the docs.

@olly-writes-code

This doc is clearly false now https://docs.ray.io/en/latest/cluster/vms/user-guides/launching-clusters/on-premises.html

"This document describes how to set up an on-premise Ray cluster, i.e., to run Ray on bare metal machines, or in a private cloud."

@DmitriGekhtman
Contributor

Ah, I think I might have spoken too soon on another thread about this functionality being officially deprecated.

However, based on my experience with the Ray project, this functionality is not very well maintained.

You will likely get better results by "setting up manually", i.e. running ray start on each of the machines in the cluster. If you have ssh access to each of the machines, you can write a for loop to do this.
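
Such a loop might look roughly like this (a sketch; the head IP, worker IPs, and SSH user are placeholders, and it assumes Ray is already installed on every machine):

HEAD_IP=192.0.2.1

# Start the head node
ssh ubuntu@"$HEAD_IP" "ray stop; ray start --head --port=6379"

# Start each worker and point it at the head
for ip in 192.0.2.10 192.0.2.11 192.0.2.12; do
    ssh ubuntu@"$ip" "ray stop; ray start --address=$HEAD_IP:6379"
done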

@olly-writes-code

However, based on my experience with the Ray project, this functionality is not very well maintained.

Yes, that is my experience, having worked with the code. The docs are badly maintained and the code is very flaky. We have burned 3 days battling this code, only to find out it is poorly supported.

Due to this experience and confusion over the state of deprecation, etc., we will not use any Ray (or Anyscale) anytime soon.

@DmitriGekhtman
Contributor

cc @anyscalesam on the poor UX here. My recommendation to the maintainers would be to officially deprecate local node provider.

The best maintained and most popular method for using Ray in the OSS is to run Ray on Kubernetes using KubeRay.

The Anyscale product is also quite stable and reliable (as it's used by paying customers of Anyscale.)

@olly-writes-code

Right, but the sense is that due to the incentives at play, support for Ray OSS will weaken to encourage people to pay for Anyscale, precisely as described above. This leaves a very bitter taste, which means we won't even try Ray and, therefore, would never consider Anyscale.

@jjyao jjyao added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Oct 30, 2024
@flyingfalling

Yes, it is all very confusing, since local bare metal is the most "obvious" way of running Ray to people from a supercomputer background (like OpenMPI). I've written a suite of Ansible scripts to start up Ray in an OpenMPI-like fashion and to figure out sub-GPU chunks and custom resources as well (https://github.com/flyingfalling/pyraygputils). However, it is VERY hackish (I am polishing it in parallel with some other related projects). It would be very unfortunate if bare-metal Ray were deprecated...

@monsieurzhang

Same issue with "worker nodes uninitialized".
In short, for me, this problem is fixed by removing one line, so that the updater loop becomes:

for t in T:
    t.start()
    t.join()

This changes the initialization of the workers from parallel to serial.
Related issue: #38718

The last line in the log file monitor.log shows:
Creating new (spawn_updater) updater thread for node xxx

In https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/autoscaler.py#L717-L719
it is noted that a similar problem has been detected, but maybe it wasn't the focus for "local" nodes:
"Spawning these threads directly seems to cause problems"

PR-5903 adds back the multi-thread processing.
So we just need to initialize the workers one by one.
