[core] autoscaler occasionally goes into exception loop when using preemptible GCP instances #29698

Open
neex opened this issue Oct 26, 2022 · 4 comments
Labels: bug, core, core-autoscaler, core-clusters, P1, P2

Comments


neex commented Oct 26, 2022

What happened + What you expected to happen

I use a Ray cluster on Google Cloud Platform for my tasks. One thing to note is that I use preemptible instances for the workers (thus, Google may stop them at any time).

After a while (about 30-40 minutes of active usage), scaling stops working: no new workers come up, and no old workers are destroyed after the idle timeout (moreover, some workers are up but not initialized). I've debugged the issue down to what looks like an infinite exception-restart loop in /tmp/ray/session_latest/logs/monitor.log on the head node; the relevant part of the log is:

2022-10-26 13:26:33,018 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances?filter=%28%28status+%3D+PROVISIONING%29+OR+%28status+%3D+STAGI
NG%29+OR+%28status+%3D+RUNNING%29%29+AND+%28labels.ray-cluster-name+%3D+research%29&alt=json
2022-10-26 13:26:33,136 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json
2022-10-26 13:26:33,195 ERROR autoscaler.py:341 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 338, in update
    self._update()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 397, in _update
    self.process_completed_updates()
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/autoscaler.py", line 732, in process_completed_updates
    self.load_metrics.mark_active(self.provider.internal_ip(node_id))
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 155, in internal_ip
    node = self._get_cached_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 217, in _get_cached_node
    return self._get_node(node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 45, in method_with_retries
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node_provider.py", line 209, in _get_node
    instance = resource.get_instance(node_id=node_id)
  File "/usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/gcp/node.py", line 407, in get_instance
    .execute()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/googleapiclient/http.py", line 851, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute?alt=json returned "The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found">
2022-10-26 13:26:33,196 CRITICAL autoscaler.py:350 -- StandardAutoscaler: Too many errors, abort.

This exception repeats again and again with the same worker id ray-research-worker-cbcbb628-compute.

The ray-research-worker-cbcbb628-compute instance does seem to have existed at some point, but it no longer exists at the moment of the exception (thus, the 404 response from GCP is justified).

I believe (though I'm not sure) that the situation is something like this:

  1. Ray started setting up the instance for a worker and added it to some internal data structures.
  2. At some point (probably during setup), the instance was shut down because I use preemptible instances.
  3. Google Cloud Platform immediately forgot about it and started returning 404 for all requests related to the instance.
  4. The autoscaler did not handle this corner case correctly and did not remove the instance from its internal lists.

The expected behavior is that the autoscaler handles this case and continues to set up other workers, shuts down idle ones, and so on.
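
To illustrate what I mean, here is a minimal sketch (not actual Ray code; the helper name is made up) of the kind of 404-tolerant lookup I would expect: an instance that GCP no longer knows about should be treated as terminated and dropped from the internal lists, rather than crashing the update loop.

# Hypothetical sketch, not the actual Ray implementation: treat a 404 from
# GCP as "this instance is gone" (e.g. a preempted worker) instead of letting
# the exception bubble up and abort the autoscaler.
from googleapiclient.errors import HttpError


def get_instance_or_none(resource, node_id):
    """Fetch a GCP instance; return None if it no longer exists."""
    try:
        return resource.get_instance(node_id=node_id)
    except HttpError as e:
        if e.resp.status == 404:
            # The instance was deleted (for example after a preemptible
            # shutdown); report it as gone so the caller can remove it from
            # its node lists instead of retrying forever.
            return None
        raise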

Versions / Dependencies

$ ray --version
ray, version 2.0.1
$ python --version
Python 3.10.6
$ uname -a
Linux ray-research-head-3c5e32a6-compute 5.15.0-1021-gcp #28-Ubuntu SMP Fri Oct 14 15:46:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/issue
Ubuntu 22.04.1 LTS \n \l

Google Cloud Platform is used, and preemptible instances are used for the workers (see the config below).

Reproduction script

Config:

cluster_name: ray-debug
max_workers: 30

provider:
  type: gcp
  region: europe-west1
  availability_zone: europe-west1-c
  project_id: wunderfund-research


available_node_types:
    head:
        resources: {"CPU": 0}
        node_config:
            machineType: n2-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts

                  # ubuntu-2204-jammy-v20220712a
    worker:
        # memory 640 GB =  640*1024*1024*1024 = 687194767360
        resources: {"CPU": 1, "memory": 687194767360}
        node_config:
            machineType: n2-standard-2

            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
                  sourceImage: projects/ubuntu-os-cloud/global/images/family/ubuntu-2204-lts
            scheduling:
              - preemptible: true
            serviceAccounts:
            - email: "[email protected]"
              scopes:
              - https://www.googleapis.com/auth/cloud-platform


head_node_type: head
idle_timeout_minutes: 1
upscaling_speed: 2


auth:
   ssh_user: ubuntu


setup_commands:
  - sudo apt update
  - sudo DEBIAN_FRONTEND=noninteractive apt install python3-pip python-is-python3 -y
  - sudo pip install -U pip
  - sudo pip install ray[all]


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Script:

import time
import ray

def test_job(delay):
    time.sleep(delay)
    return f"Waited for {delay} secs"


def run_jobs():
    delays = [i * 10 for i in range(1, 30)]
    jobs = [ray.remote(test_job).options(num_cpus=1).remote(d) for d in delays]

    while jobs:
        done_ids, jobs = ray.wait(jobs)
        for ref in done_ids:
            result = ray.get(ref)
            print(ref, result)


if __name__ == "__main__":
    run_jobs()

To reproduce the issue, you may have to submit the script to the cluster several times so that an instance shutdown is caught in the right state.
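
If it helps, one way to automate the repeated submission is via the Ray Jobs API. This is just a sketch under the assumption that the Jobs API is reachable on the head node at the default port 8265 and that the script above is saved as script.py (the filename is arbitrary):

# Sketch only: resubmit the reproduction script as a Ray job until the
# autoscaler breaks, assuming the Jobs API listens on the default port.
import time

from ray.job_submission import JobStatus, JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

for attempt in range(10):
    job_id = client.submit_job(entrypoint="python script.py")
    print(f"attempt {attempt}: submitted job {job_id}")
    # Wait for the job to reach a terminal state before submitting the next one.
    while client.get_job_status(job_id) not in {
        JobStatus.SUCCEEDED,
        JobStatus.FAILED,
        JobStatus.STOPPED,
    }:
        time.sleep(30)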

Issue Severity

High: It blocks me from completing my task.

@neex neex added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 26, 2022
@hora-anyscale hora-anyscale added core Issues that should be addressed in Ray Core and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Oct 28, 2022
@cadedaniel
Member

@wuisawesome could you help triage this?

@cheremovsky

Same story as OP 😢

@wuisawesome wuisawesome added core-autoscaler autoscaler related issues P1 Issue that should be fixed within a few weeks labels Nov 1, 2022
@cadedaniel cadedaniel removed their assignment Nov 1, 2022
@wuisawesome wuisawesome removed their assignment Nov 8, 2022
@richardliaw richardliaw added core-autoscaler autoscaler related issues infra autoscaler, ray client, kuberay, related issues and removed core-autoscaler autoscaler related issues labels Nov 21, 2022
@architkulkarni architkulkarni self-assigned this Jan 17, 2023
@scv119 scv119 added core-autoscaler autoscaler related issues and removed core Issues that should be addressed in Ray Core labels Feb 16, 2023
@richardliaw richardliaw added core-clusters For launching and managing Ray clusters/jobs/kubernetes and removed infra autoscaler, ray client, kuberay, related issues labels Mar 20, 2023
@jjyao jjyao added the core Issues that should be addressed in Ray Core label Feb 6, 2024
@jjyao jjyao added p0.5 and removed P1 Issue that should be fixed within a few weeks Ray-2.4 labels Jul 9, 2024
@jjyao jjyao removed the p0.5 label Jul 9, 2024
@jjyao jjyao added the P0 Issues that should be fixed in short order label Jul 9, 2024
@anyscalesam
Contributor

anyscalesam commented Jul 15, 2024

More details: the issue also reproduces with TPUs (just a couple is fine; maybe 4-8).

UPDATE:
@hongchaodeng, can you please take a look at this? rickyx@ can help with some of the context around the autoscaler in general. For help with reproducing and a GCP environment, please grab thomas@ so you can get a GCP sandbox to trigger spot preemptions if needed.

@hongchaodeng
Member

hongchaodeng commented Jul 16, 2024

The issue is a known bug in the GCP provider of the cluster launcher.

The Ray autoscaler performs two primary functions:

  1. monitoring the current state of instances
  2. making autoscaling decisions.

The resource 'projects/wunderfund-research/zones/europe-west1-c/instances/ray-research-worker-cbcbb628-compute' was not found

The problem arises during the first step. The cluster launcher code assumes that instances remain available once created. However, any external action, such as manual termination or spot preemption, breaks this assumption. When such a disruption occurs, the cluster launcher does not handle the resulting exception properly and continuously retries the operation.

This behavior is due to the cluster launcher being designed primarily for bootstrapping and prototyping Ray projects. It is important to note that this issue does not affect the Anyscale platform, which uses a different proprietary autoscaler.

To avoid this problem, you may consider leveraging the autoscaling capabilities of Anyscale. Alternatively, you would need to implement additional steps to manage autoscaling effectively.
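
As a rough illustration of such an additional step (purely a sketch; what to do on detection is left as a placeholder), one could watch the autoscaler log on the head node for the abort message quoted in the original report and flag it for manual intervention:

# Rough sketch of an external watchdog: scan the autoscaler log on the head
# node for the abort message from this issue and flag it. The response
# (alert, restart, etc.) is intentionally left as a placeholder.
import time

MONITOR_LOG = "/tmp/ray/session_latest/logs/monitor.log"
ABORT_MARKER = "StandardAutoscaler: Too many errors, abort."


def autoscaler_aborted() -> bool:
    try:
        with open(MONITOR_LOG) as f:
            return any(ABORT_MARKER in line for line in f)
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    while True:
        if autoscaler_aborted():
            print("Autoscaler hit the abort loop; manual intervention needed.")
            break
        time.sleep(60)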

@jjyao jjyao added P1 Issue that should be fixed within a few weeks p0.5 and removed P0 Issues that should be fixed in short order labels Jul 18, 2024
@jjyao jjyao added P2 Important issue, but not time-critical P1 Issue that should be fixed within a few weeks and removed P1 Issue that should be fixed within a few weeks P0.5 labels Oct 30, 2024