[Core][Autoscaler] Refactor v2 Log Formatting #49350
Conversation
Signed-off-by: ryanaoleary <[email protected]>
cc: @rickyyx, I think I'm going to add some different unit test cases, but this should be good to review.
@kevin85421 This PR would be helpful to add since it moves the V2 autoscaler away from referencing the legacy autoscaler. This refactor will also make it easier to make other improvements to V2 log formatting, since we won't have to parse it to the V1 format first.
reviewing
@@ -555,10 +555,10 @@ def test_cluster_status_formatter():
 Pending:
  worker_node, 1 launching
  worker_node_gpu, 1 launching
- 127.0.0.3: worker_node, starting ray
+ instance4: worker_node, starting ray
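The changed expectation boils down to keying pending-node lines by instance ID instead of IP address. A minimal illustration, where `format_pending` is a hypothetical stand-in for the formatter under test, not Ray's actual helper:

```python
def format_pending(key: str, node_type: str, details: str) -> str:
    # Pending node reports are keyed by instance_id rather than
    # ip_address, since IPs are not necessarily unique across failed nodes.
    return f" {key}: {node_type}, {details}"

# The old expectation used an IP ("127.0.0.3"); the new one uses the instance ID.
assert format_pending("instance4", "worker_node", "starting ray") == \
    " instance4: worker_node, starting ray"
```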
I remember that the instance ID for KubeRay is the Pod name. Could you check whether the `ray status` result also shows the Pod name, so that we can map K8s Pods to Ray instances?
I don't think `ray status` currently shows the Pod name; this is from my manual testing:
(base) ray@raycluster-autoscaler-head-p77pc:~$ ray status --verbose
======== Autoscaler status: 2025-02-25 22:35:58.558544 ========
GCS request time: 0.001412s
Node status
---------------------------------------------------------------
Active:
(no active nodes)
Idle:
1 headgroup
Pending:
: small-group,
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0B/1.86GiB memory
0B/495.58MiB object_store_memory
Total Demands:
{'CPU': 1.0, 'TPU': 4.0}: 1+ pending tasks/actors
Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93
Idle: 82253 ms
Usage:
0B/1.86GiB memory
0B/495.58MiB object_store_memory
Will the key in the key-value pairs in Pending be the Pod name? That's my expectation.
In addition, have you manually tested this PR? The test below shows that either "head node" or "worker node" is appended to the end of the `Node: ...` line. For example,
Node: fffffffffffffffffffffffffffffffffffffffffffffffffff00001 (head_node)
However, the above output from your manual testing is:
Node: 65d0a32bfeee84475a235b3c290824ec3ac0b1ab5148d96fc674ce93
I misunderstood your initial comment; I thought you were asking whether `ray status` currently shows the Pod name in KubeRay. The above snippet was using the Ray 2.41 image. I've been running into issues building an image to test the new changes with the following Dockerfile:
# Use the latest Ray master as base.
FROM rayproject/ray:nightly-py310
# Invalidate the cache so that fresh code is pulled in the next step.
ARG BUILD_DATE
# Retrieve your development code.
ADD . ray
# Install symlinks to your modified Python code.
RUN python ray/python/ray/setup-dev.py -y
where the RayCluster Pods will immediately crash and terminate after pulling the image. Describing the RayCluster just shows:
Normal DeletedHeadPod 5m21s (x8 over 5m22s) raycluster-controller Deleted head Pod default/raycluster-autoscaler-head-ll2vs; Pod status: Running; Pod restart policy: Never; Ray container terminated status: &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2025-03-04 11:52:38 +0000 UTC,FinishedAt:2025-03-04 11:52:38 +0000 UTC,ContainerID:containerd://81a7332de2046c934ba6725cbb72eb3b228ee8aa66bc26f2db5f6741607ae82f,}
Normal DeletedHeadPod 26s (x159 over 5m9s) raycluster-controller (combined from similar events): Deleted head Pod default/raycluster-autoscaler-head-5g2c4; Pod status: Running; Pod restart policy: Never; Ray container terminated status: &ContainerStateTerminated{ExitCode:1,Signal:0,Reason:Error,Message:,StartedAt:2025-03-04 11:57:33 +0000 UTC,FinishedAt:2025-03-04 11:57:34 +0000 UTC,ContainerID:containerd://f423d2877a176beaecb88e6d1d8e61456233b1359c9e8b94e333ea4560e86b1c,}
The head Pod keeps immediately crashing and re-creating, so I can't get any more useful logs from the container. I tried building an image using the latest changes from master (i.e. without any of my Python changes) and it still had the same issue. Is this a problem you've seen before? As soon as I have a working image, I can run a manual test to check for the Pod name in the key-value pairs in Pending.
I was just able to manually test it with my changes. Here is the output of `ray status --verbose` with a Pending node:
======== Autoscaler status: 2025-03-04 12:11:15.078526 ========
GCS request time: 0.001526s
Node status
---------------------------------------------------------------
Active:
(no active nodes)
Idle:
1 headgroup
Pending:
a4dfeafc-8a5e-47ff-9721-cdd559c00dfc: small-group,
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0B/1.86GiB memory
0B/511.68MiB object_store_memory
Total Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
Node: dcb068352f72b5244cfdefaa70055d5cd51b5cd29778295b41cd0775 (headgroup)
Idle: 10641 ms
Usage:
0B/1.86GiB memory
0B/511.68MiB object_store_memory
(base) ray@raycluster-autoscaler-head-hr8pd:~$ ray status --verbose
======== Autoscaler status: 2025-03-04 12:11:36.180572 ========
GCS request time: 0.001749s
Node status
---------------------------------------------------------------
Active:
1 small-group
Idle:
1 headgroup
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
1.0/1.0 CPU
0B/2.79GiB memory
0B/781.36MiB object_store_memory
Total Demands:
(no resource demands)
Node: 1c0ff9b00d40e332469adb4fdfacd9d0f21599bac65e50666e808a4d (small-group)
Usage:
1.0/1.0 CPU
0B/953.67MiB memory
0B/269.68MiB object_store_memory
Activity:
Resource: CPU currently in use.
Busy workers on node.
Node: dcb068352f72b5244cfdefaa70055d5cd51b5cd29778295b41cd0775 (headgroup)
Idle: 31744 ms
Usage:
0B/1.86GiB memory
0B/511.68MiB object_store_memory
Activity:
(no activity)
It looks like `instance_id` isn't set to the Pod name, but to some other generated unique ID. Looking at the Autoscaler logs, if we wanted to output the Pod name here, we should use `cloud_instance_id`:
2025-03-04 12:11:34,960 - INFO - Update instance ALLOCATED->RAY_RUNNING (id=a4dfeafc-8a5e-47ff-9721-cdd559c00dfc, type=small-group, cloud_instance_id=raycluster-autoscaler-small-group-worker-qbd8r, ray_id=): ray node 1c0ff9b00d40e332469adb4fdfacd9d0f21599bac65e50666e808a4d is RUNNING
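If the formatter did prefer `cloud_instance_id` (the Pod name on KubeRay) when it is available, the fallback logic could look roughly like this sketch. The `PendingNode` dataclass and `format_pending_line` helper are hypothetical illustrations, not Ray's actual API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PendingNode:
    instance_id: str
    node_type: str
    details: str
    cloud_instance_id: Optional[str] = None  # e.g. the K8s Pod name on KubeRay

def format_pending_line(node: PendingNode) -> str:
    # Prefer the cloud instance ID so operators can map the report back
    # to a Pod; fall back to the internally generated instance ID.
    key = node.cloud_instance_id or node.instance_id
    return f" {key}: {node.node_type}, {node.details}"

print(format_pending_line(PendingNode(
    instance_id="a4dfeafc-8a5e-47ff-9721-cdd559c00dfc",
    node_type="small-group",
    details="starting ray",
    cloud_instance_id="raycluster-autoscaler-small-group-worker-qbd8r",
)))
```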
@kevin85421 I went through and refactored the code to make it easier to review. I also made sure the PR is consistent in using string concatenation.
RayCluster manifest with Autoscaler v2 used for manual testing:
@ryanaoleary would you mind fixing the CI errors?
I see it passing the test it failed in the CI.
It's now passing the CI after 2084439. cc: @kevin85421
Why are these changes needed?
Currently the V2 Autoscaler formats logs by converting the V2 data structure `ClusterStatus` to the V1 structures `AutoscalerSummary` and `LoadMetricsSummary` and then passing them to the legacy `format_info_string`. It'd be useful for the V2 autoscaler to directly format `ClusterStatus` to the correct output log format. This PR refactors `utils.py` to directly format `ClusterStatus`. Additionally, this PR changes the node reports to output `instance_id` rather than `ip_address`, since the latter is not necessarily unique for failed nodes.
Related issue number
Closes #37856
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
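As a rough sketch of the approach the PR describes (rendering the `Node status` section straight from `ClusterStatus`, with no detour through the V1 summaries), using trimmed-down hypothetical stand-ins for Ray's real dataclasses:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeInfo:
    instance_id: str            # unique per instance, unlike ip_address
    ray_node_type_name: str

@dataclass
class ClusterStatus:
    active_nodes: List[NodeInfo] = field(default_factory=list)
    pending_nodes: List[NodeInfo] = field(default_factory=list)

def format_node_status(status: ClusterStatus) -> str:
    # Format ClusterStatus directly, skipping the conversion to the V1
    # AutoscalerSummary/LoadMetricsSummary structures. Nodes are keyed
    # by instance_id, since IPs can be reused across failed nodes.
    lines = ["Node status", "-" * 63, "Active:"]
    lines += [f" {n.instance_id}: {n.ray_node_type_name}"
              for n in status.active_nodes] or [" (no active nodes)"]
    lines.append("Pending:")
    lines += [f" {n.instance_id}: {n.ray_node_type_name}"
              for n in status.pending_nodes] or [" (no pending nodes)"]
    return "\n".join(lines)

print(format_node_status(ClusterStatus(
    pending_nodes=[NodeInfo("instance4", "worker_node")])))
```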