[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698

Open
neex opened this issue Oct 26, 2022 · 4 comments
Labels: bug, core, core-autoscaler, core-clusters, P1, P2

Comments


neex commented Oct 26, 2022

What happened + What you expected to happen

I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for the workers (thus, Google may stop them at any time).

After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and no old workers are destroyed after the idle timeout (moreover, some workers are up but not initialized). I've debugged the issue down to what looks like an infinite exception-restart loop in /tmp/ray/session_latest/logs/monitor.log on the head node; the relevant part of the log is:

2022-10-26 13:26:33,018 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances?filter=%28%28status+%3D+PROVISIONING%29+OR+%28status+%3D+STAGI
NG%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+research%29&alt=json
2022-10-26 13:26:33,136 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json
2022-10-26 13:26:33,195 ERROR autoscaler.py:341 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 338, in update
    self._update()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 397, in _update
    self.process_completed_updates()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 732, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 155, in internal_ip
    node = self._get_cached_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 217, in _get_cached_node
    return self._get_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 45, in method_with_retries
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 209, in _get_node
    instance = resource.get_instance(node_id=node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node.py", line 407, in get_instance
    .execute()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json returned "The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found">
2022-10-26 13:26:33,196 CRITICAL autoscaler.py:350 -- StandardAutoscaler: Too many errors, abort.

This exception repeats again and again with the same worker id ray-research-worker-cbcbb628-compute.

The ray-research-worker-cbcbb628-compute instance does seem to have existed at some point, but it no longer exists at the moment of the exception (thus, the 404 response from GCP is justified).

I believe (though I'm not sure) that the situation is something like this:

  1. Ray started setting up the instance for a worker and added it to some internal data structures.
  2. At some point (probably during setup), the instance was shut down because I use preemptible instances.
  3. Google Cloud Platform immediately forgot about it and started returning 404 for all requests related to the instance.
  4. The autoscaler did not handle this corner case correctly and did not remove the instance from its internal lists.

The expected behavior is that the autoscaler handles this case and continues to set up other workers, shuts down idle ones, and so on.
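
To illustrate what I mean, here is a minimal sketch (not actual Ray code; the helper name is made up) of the kind of 404-tolerant lookup I would expect: an instance that GCP no longer knows about should be treated as terminated and dropped from the internal lists, rather than crashing the update loop.

# Hypothetical sketch, not the actual Ray implementation: treat a 404 from
# GCP as "this instance is gone" (e.g. a preempted worker) instead of letting
# the exception bubble up and abort the autoscaler.
from googleapiclient.errors import HttpError


def get_instance_or_none(resource, node_id):
    """Fetch a GCP instance; return None if it no longer exists."""
    try:
        return resource.get_instance(node_id=node_id)
    except HttpError as e:
        if e.resp.status == 404:
            # The instance was deleted (for example after a preemptible
            # shutdown); report it as gone so the caller can remove it from
            # its node lists instead of retrying forever.
            return None
        raise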

Versions / Dependencies

$ ray --version
ray, version 2.0.1
$ python --version
Python 3.10.6
$ uname -a
Linux ray-research-head-3c5e32a6-compute 5.15.0-1021-gcp #28-Ubuntu SMP Fri Oct 14 15:46:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 22.04.1 LTS \n \l

Google Cloud Platform is used, and preemptible instances are used for the workers (see the config below).

Reproduction script

Config:

cluster_name: ray-debug
max_workers: 30

provider:
  type: gcp
  region: europe-west1
  availability_zone: europe-west1-c
  project_id: wunderfund-research


available_node_types:
    head:
        resources: {"CPU": 0}
        node_config:
            machineType: n2-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts

                  # ubuntu-2204-jammy-v20220712a
    worker:
        # memory 640 GB =  640*1024*1024*1024 = 687194767360
        resources: {"CPU": 1, "memory": 687194767360}
        node_config:
            machineType: n2-standard-2

            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
            scheduling:
              - preemptible: true
            serviceAccounts:
            - email: "[email protected]"
              scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: head
idle_timeout_minutes: 1
upscaling_speed: 2


auth:
   ssh_user: ubuntu


setup_commands:
  - sudo apt update
  - sudo DEBIAN_FRONTEND=noninteractive apt install python3-pip python-is-python3 -y
  - sudo pip install -U pip
  - sudo pip install ray[all]


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Script:

import time
import ray

def test_job(delay):
    time.sleep(delay)
    return f"Waited for {delay} secs"


def run_jobs():
    delays = [i * 10 for i in range(1, 30)]
    jobs = [ray.remote(test_job).options(num_cpus=1).remote(d) for d in delays]

    while jobs:
        done_ids, jobs = ray.wait(jobs)
        for ref in done_ids:
            result = ray.get(ref)
            print(ref, result)


if __name__ == "__main__":
    run_jobs()

To reproduce the issue, you may have to submit the script to the cluster several times so that an instance shutdown is caught in the right state.
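
If it helps, one way to automate the repeated submission is via the Ray Jobs API. This is just a sketch under the assumption that the Jobs API is reachable on the head node at the default port 8265 and that the script above is saved as script.py (the filename is arbitrary):

# Sketch only: resubmit the reproduction script as a Ray job until the
# autoscaler breaks, assuming the Jobs API listens on the default port.
import time

from ray.job_submission import JobStatus, JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

for attempt in range(10):
    job_id = client.submit_job(entrypoint="python script.py")
    print(f"attempt {attempt}: submitted job {job_id}")
    # Wait for the job to reach a terminal state before submitting the next one.
    while client.get_job_status(job_id) not in {
        JobStatus.SUCCEEDED,
        JobStatus.FAILED,
        JobStatus.STOPPED,
    }:
        time.sleep(30)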

Issue Severity

High: It blocks me from completing my task.

@neex neex added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 26, 2022
@hora-anyscale hora-anyscale added core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022
@cadedaniel
Member

@wuisawesome could you help triage this?

@cheremovsky

Same story as OP 😢

@wuisawesome wuisawesome added core-autoscaler autoscaler related issues P1 Issue that should be fixed within a few weeks labels Nov 1, 2022
@cadedaniel cadedaniel removed their assignment Nov 1, 2022
@wuisawesome wuisawesome removed their assignment Nov 8, 2022
@richardliaw richardliaw added core-autoscaler autoscaler related issues infra autoscaler, ray client, kuberay, related issues and removed core-autoscaler autoscaler related issues labels Nov 21, 2022
@architkulkarni architkulkarni self-assigned this Jan 17, 2023
@scv119 scv119 added core-autoscaler autoscaler related issues and removed core Issues that should be addressed in Ray Core labels Feb 16, 2023
@richardliaw richardliaw added core-clusters For launching and managing Ray clusters/jobs/kubernetes and removed infra autoscaler, ray client, kuberay, related issues labels Mar 20, 2023
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Feb 6, 2024
@jjyao jjyao added p0.5 and removed P1 Issue that should be fixed within a few weeks Ray-2.4 labels Jul 9, 2024
@jjyao jjyao removed the p0.5 label Jul 9, 2024
@jjyao jjyao added the P0 Issues that should be fixed in short order label Jul 9, 2024
@anyscalesam
Contributor

anyscalesam commented Jul 15, 2024

More details: the issue also reproduces with TPUs (just a couple is fine; maybe 4-8).

UPDATE:
@hongchaodeng, can you please take a look at this? rickyx@ can help with some of the context around the autoscaler in general. For help with reproducing and a GCP environment, please grab thomas@ so you can get a GCP sandbox to trigger spot preemptions if needed.

@hongchaodeng
Member

hongchaodeng commented Jul 16, 2024

The issue is a known bug in the GCP provider of the cluster launcher.

The Ray autoscaler performs two primary functions:

  1. monitoring the current state of instances
  2. making autoscaling decisions.

The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found

The problem arises during the first step. The cluster launcher code assumes that instances remain available once created. However, any external action, such as manual termination or spot preemption, breaks this assumption. When such a disruption occurs, the cluster launcher does not handle the resulting exception properly and continuously retries the operation.

This behavior is due to the cluster launcher being designed primarily for bootstrapping and prototyping Ray projects. It is important to note that this issue does not affect the Anyscale platform, which uses a different proprietary autoscaler.

To avoid this problem, you may consider leveraging the autoscaling capabilities of Anyscale. Alternatively, you would need to implement additional steps to manage autoscaling effectively.
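
As a rough illustration of such an additional step (purely a sketch; what to do on detection is left as a placeholder), one could watch the autoscaler log on the head node for the abort message quoted in the original report and flag it for manual intervention:

# Rough sketch of an external watchdog: scan the autoscaler log on the head
# node for the abort message from this issue and flag it. The response
# (alert, restart, etc.) is intentionally left as a placeholder.
import time

MONITOR_LOG = "/tmp/ray/session_latest/logs/monitor.log"
ABORT_MARKER = "StandardAutoscaler: Too many errors, abort."


def autoscaler_aborted() -> bool:
    try:
        with open(MONITOR_LOG) as f:
            return any(ABORT_MARKER in line for line in f)
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    while True:
        if autoscaler_aborted():
            print("Autoscaler hit the abort loop; manual intervention needed.")
            break
        time.sleep(60)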

@jjyao jjyao added P1 Issue that should be fixed within a few weeks p0.5 and removed P0 Issues that should be fixed in short order labels Jul 18, 2024
@jjyao jjyao added P2 Important issue, but not time-critical P1 Issue that should be fixed within a few weeks and removed P1 Issue that should be fixed within a few weeks P0.5 labels Oct 30, 2024