[Core][Autoscaler] Refactor v2 Log Formatting #49350

Merged
merged 16 commits into from
Mar 6, 2025
Changes from 6 commits

10 changes: 5 additions & 5 deletions python/ray/autoscaler/v2/tests/test_utils.py
@@ -555,10 +555,10 @@ def test_cluster_status_formatter():
Pending:
worker_node, 1 launching
worker_node_gpu, 1 launching
- 127.0.0.3: worker_node, starting ray
+ instance4: worker_node, starting ray

Member:

I remember that the instance ID for KubeRay is the Pod name. Could you check whether the ray status result also shows the Pod name so that we can map K8s Pods to Ray instances?

Contributor Author:

I don't think ray status currently shows the Pod name; this is from my manual testing:

(base) ray@raycluster-autoscaler-head-p77pc:~$ ray status --verbose
======== Autoscaler status: 2025-02-25 22:35:58.558544 ========
GCS request time: 0.001412s

Node status
---------------------------------------------------------------
Active:
 (no active nodes)
Idle:
 1 headgroup
Pending:
 : small-group, 
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0B/1.86GiB memory
 0B/495.58MiB object_store_memory

Total Demands:
 {'CPU': 1.0, 'TPU': 4.0}: 1+ pending tasks/actors

Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93
 Idle: 82253 ms
 Usage:
  0B/1.86GiB memory
  0B/495.58MiB object_store_memory

Member:

Will the key in the key-value pairs in Pending be the Pod name? That's my expectation.

In addition, have you manually tested this PR? The test below shows that either "head_node" or "worker_node" is appended to the end of the Node: ... line. For example,

Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)

However, the above output from your manual testing is:

Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93

Contributor Author (@ryanaoleary, Mar 4, 2025):

I misunderstood your initial comment; I thought you were asking whether ray status currently shows the Pod name in KubeRay. The above snippet was using the Ray 2.41 image. I've been running into issues lately building an image to test the new changes with the following Dockerfile:

# Use the latest Ray master as base.
FROM rayproject/ray:nightly-py310
# Invalidate the cache so that fresh code is pulled in the next step.
ARG BUILD_DATE
# Retrieve your development code.
ADD . ray
# Install symlinks to your modified Python code.
RUN python ray/python/ray/setup-dev.py -y

With this image, the RayCluster Pods immediately crash and terminate after being pulled. Describing the RayCluster just shows:

Normal   DeletedHeadPod         5m21s (x8 over 5m22s)  raycluster-controller  Deleted head Pod default/raycluster-autoscaler-head-ll2vs; Pod status: Running; Pod restart policy: Never; Ray container terminated status: &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2025-03-04 11:52:38 +0000 UTC,FinishedAt:2025-03-04 11:52:38 +0000 UTC,ContainerID:containerd://81a7332de2046c934ba6725cbb72eb3b228ee8aa66bc26f2db5f6741607ae82f,}
  Normal   DeletedHeadPod         26s (x159 over 5m9s)   raycluster-controller  (combined from similar events): Deleted head Pod default/raycluster-autoscaler-head-5g2c4; Pod status: Running; Pod restart policy: Never; Ray container terminated status: &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2025-03-04 11:57:33 +0000 UTC,FinishedAt:2025-03-04 11:57:34 +0000 UTC,ContainerID:containerd://f423d2877a176beaecb88e6d1d8e61456233b1359c9e8b94e333ea4560e86b1c,}

The head Pod keeps immediately crashing and re-creating, so I can't get any more useful logs from the container. I tried building an image using the latest changes from master (i.e. without any of my Python changes) and it still had the same issue. Is this a problem you've seen before? As soon as I have a working image, I can run a manual test to check for the Pod name in the key-value pairs in Pending.

Contributor Author (@ryanaoleary, Mar 4, 2025):

I was just able to manually test it with my changes; here is the output of ray status --verbose with a Pending node:

======== Autoscaler status: 2025-03-04 12:11:15.078526 ========
GCS request time: 0.001526s

Node status
---------------------------------------------------------------
Active:
 (no active nodes)
Idle:
 1 headgroup
Pending:
 a4dfeafc-8a5e-47ff-9721-cdd559c00dfc: small-group, 
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0B/1.86GiB memory
 0B/511.68MiB object_store_memory

Total Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors

Node: dcb068352f72b5244cfdefaa70055d5cd51b5cd29778295b41cd0775 (headgroup)
 Idle: 10641 ms
 Usage:
  0B/1.86GiB memory
  0B/511.68MiB object_store_memory
  
(base) ray@raycluster-autoscaler-head-hr8pd:~$ ray status --verbose
======== Autoscaler status: 2025-03-04 12:11:36.180572 ========
GCS request time: 0.001749s

Node status
---------------------------------------------------------------
Active:
 1 small-group
Idle:
 1 headgroup
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/1.0 CPU
 0B/2.79GiB memory
 0B/781.36MiB object_store_memory

Total Demands:
 (no resource demands)

Node: 1c0ff9b00d40e332469adb4fdfacd9d0f21599bac65e50666e808a4d (small-group)
 Usage:
  1.0/1.0 CPU
  0B/953.67MiB memory
  0B/269.68MiB object_store_memory
 Activity:
  Resource: CPU currently in use.
  Busy workers on node.

Node: dcb068352f72b5244cfdefaa70055d5cd51b5cd29778295b41cd0775 (headgroup)
 Idle: 31744 ms
 Usage:
  0B/1.86GiB memory
  0B/511.68MiB object_store_memory
 Activity:
  (no activity)

It looks like instance_id isn't set to the Pod name but to some other generated unique ID. Looking at the Autoscaler logs, if we wanted to output the Pod name here, we should use cloud_instance_id:

2025-03-04 12:11:34,960 - INFO - Update instance ALLOCATED->RAY_RUNNING (id=a4dfeafc-8a5e-47ff-9721-cdd559c00dfc, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-qbd8r, ray_id=): ray node 1c0ff9b00d40e332469adb4fdfacd9d0f21599bac65e50666e808a4d is RUNNING
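
For illustration only, here is a minimal sketch of what preferring cloud_instance_id over the generated instance_id could look like when formatting the Pending entries. The dataclass fields and helper name below are hypothetical, not the formatter's actual API; the ID values are taken from the log line above.

from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingInstance:
    # Hypothetical record mirroring the fields shown in the autoscaler log line above.
    instance_id: str                  # autoscaler-generated ID
    cloud_instance_id: Optional[str]  # Pod name on KubeRay; may be unset early on
    ray_node_type_name: str           # e.g. "small-group"
    details: str = ""


def format_pending_line(instance: PendingInstance) -> str:
    # Prefer the cloud instance ID (the Pod name on KubeRay) when it is known;
    # otherwise fall back to the generated instance ID seen in the output above.
    label = instance.cloud_instance_id or instance.instance_id
    return f" {label}: {instance.ray_node_type_name}, {instance.details}"


print(format_pending_line(PendingInstance(
    instance_id="a4dfeafc-8a5e-47ff-9721-cdd559c00dfc",
    cloud_instance_id="raycluster-autoscaler-small-group-worker-qbd8r",
    ray_node_type_name="small-group",
)))
# -> " raycluster-autoscaler-small-group-worker-qbd8r: small-group, "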

Recent failures:
worker_node: LaunchFailed (latest_attempt: 02:46:40) - Insufficient capacity
- worker_node: NodeTerminated (ip: 127.0.0.5)
+ worker_node: NodeTerminated (instance_id: instance5)

Resources
--------------------------------------------------------
@@ -573,18 +573,18 @@ def test_cluster_status_formatter():
{'GPU': 2} * 1 (STRICT_PACK): 2+ pending placement groups
{'GPU': 2, 'CPU': 100}: 2+ from request_resources()

- Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001
+ Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)
Usage:
0.5/1.0 CPU
0.0/2.0 GPU
5.42KiB/10.04KiB object_store_memory

- Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00002
+ Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00002 (worker_node)
Usage:
0/1.0 CPU
0/2.0 GPU

- Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00003
+ Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00003 (worker_node)
Usage:
0.0/1.0 CPU"""
assert actual == expected
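
A minimal sketch, using a hypothetical helper rather than the formatter's actual code, of how the Node: ... (node_type) headers asserted above could be produced:

def format_node_header(node_id: str, ray_node_type_name: str = "") -> str:
    # Append the Ray node type in parentheses when it is known, matching the
    # "(head_node)" / "(worker_node)" suffixes in the expected output above.
    header = f"Node: {node_id}"
    if ray_node_type_name:
        header += f" ({ray_node_type_name})"
    return header


print(format_node_header(
    "fffffffffffffffffffffffffffffffffffffffffffffffffff00001", "head_node"))
# -> "Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)"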