@aslonnie
Collaborator

merge everything into AnyscaleJobRunner

Signed-off-by: Lonnie Liu <[email protected]>
@aslonnie aslonnie requested a review from a team as a code owner November 18, 2025 00:05
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the command runners by removing the JobRunner class and merging its functionality into AnyscaleJobRunner. The changes are mostly correct, but changing the base class of AnyscaleJobRunner introduces a critical issue: waiting for nodes will now raise a NotImplementedError. Additionally, the change makes the job run type an alias for anyscale_job, a significant behavioral change that should be confirmed as intended. I've left detailed comments on both points.



- class AnyscaleJobRunner(JobRunner):
+ class AnyscaleJobRunner(CommandRunner):
Contributor

critical

Changing the base class from JobRunner to CommandRunner introduces a critical issue. The wait_for_nodes method in this class calls super().wait_for_nodes(). Previously, this resolved to JobRunner.wait_for_nodes(). Now, it will resolve to CommandRunner.wait_for_nodes(), which raises a NotImplementedError.

This will break tests that require waiting for nodes. The implementation of wait_for_nodes from JobRunner should be merged into this class. Specifically, the super() call in AnyscaleJobRunner.wait_for_nodes should be replaced with the logic to schedule the wait_cluster.py script, like this:

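def wait_for_nodes(self, num_nodes: int, timeout: float = 900):
    self._wait_for_nodes_timeout = timeout
    self.job_manager.cluster_startup_timeout += timeout
    # Schedule wait_cluster.py as a prepare command instead of calling super();
    # the command is given 30 seconds beyond the node-wait timeout.
    self.run_prepare_command(
        f"python wait_cluster.py {num_nodes} {timeout}", timeout=timeout + 30
    )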
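
The failure mode described above can be reproduced in isolation. The sketch below is not the repository code: `CommandRunner` here is a hypothetical stand-in that models only `wait_for_nodes`, and `BrokenRunner`, `FixedRunner`, `prepare_commands`, and `raises_not_implemented` are names invented for illustration. It assumes only what the comment states, namely that the base method raises NotImplementedError and that the fix schedules `wait_cluster.py` through `run_prepare_command`.

```python
class CommandRunner:
    """Hypothetical stand-in for the base runner; only wait_for_nodes is modeled."""

    def wait_for_nodes(self, num_nodes: int, timeout: float = 900):
        raise NotImplementedError


class BrokenRunner(CommandRunner):
    """Mirrors the post-refactor state: the method still delegates to super()."""

    def wait_for_nodes(self, num_nodes: int, timeout: float = 900):
        self._wait_for_nodes_timeout = timeout
        # With JobRunner gone from the hierarchy, this now reaches the base stub.
        super().wait_for_nodes(num_nodes, timeout)


class FixedRunner(CommandRunner):
    """Mirrors the suggested fix: schedule the wait script directly."""

    def __init__(self):
        # Hypothetical record of scheduled prepare commands, for inspection only.
        self.prepare_commands = []

    def run_prepare_command(self, command: str, timeout: float):
        self.prepare_commands.append((command, timeout))

    def wait_for_nodes(self, num_nodes: int, timeout: float = 900):
        self._wait_for_nodes_timeout = timeout
        self.run_prepare_command(
            f"python wait_cluster.py {num_nodes} {timeout}", timeout=timeout + 30
        )


def raises_not_implemented(runner) -> bool:
    """Return True if calling wait_for_nodes on the runner hits the base stub."""
    try:
        runner.wait_for_nodes(2, timeout=60)
    except NotImplementedError:
        return True
    return False
```

A check along these lines, adapted to the real classes, would catch the regression before release tests that wait for nodes start failing.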

Comment on lines +58 to 59
"job": AnyscaleJobRunner,
"anyscale_job": AnyscaleJobRunner,
Contributor

medium

This change makes run_type="job" an alias for run_type="anyscale_job". This is a significant behavioral change, as "job" previously used JobRunner with FullClusterManager (which manages the cluster lifecycle), and will now use AnyscaleJobRunner with MinimalClusterManager (where the cluster is managed by the Anyscale Job service). If this is intended, consider removing the "job" run type to avoid confusion, as it is now redundant.
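
To make the redundancy concrete, here is a minimal sketch of a run type registry in which two keys point at the same runner class. The dictionary name `RUN_TYPE_TO_RUNNER`, the empty placeholder class, and the helper `redundant_run_types` are assumptions for illustration; only the two keys and the class name come from the diff.

```python
class AnyscaleJobRunner:
    """Placeholder for the real runner class; no behavior is modeled here."""


# Hypothetical registry name; the two entries mirror the diff lines above.
RUN_TYPE_TO_RUNNER = {
    "job": AnyscaleJobRunner,
    "anyscale_job": AnyscaleJobRunner,
}


def redundant_run_types(registry: dict) -> list:
    """Return sorted groups of run type names that map to the same runner class."""
    groups = {}
    for name, runner_cls in registry.items():
        groups.setdefault(runner_cls, []).append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

If the alias is kept for backward compatibility, a helper like this could flag it in a lint step; if not, dropping the "job" key removes the ambiguity.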

self._copy_script_to_working_dir("anyscale_job_wrapper.py")
super().prepare_remote_env()
self._copy_script_to_working_dir("wait_cluster.py")
self._copy_script_to_working_dir("prometheus_metrics.py")

Bug: Cluster Node Waiting Logic Missing

The wait_for_nodes method calls super().wait_for_nodes() which raises NotImplementedError in the base CommandRunner class. The old JobRunner implementation added a prepare command to run wait_cluster.py, but this logic was lost during the merge. The method should call self.run_prepare_command(f"python wait_cluster.py {num_nodes} {timeout}", timeout=timeout + 30) instead of calling the parent method.


@aslonnie aslonnie added the go add ONLY when ready to merge, run all tests label Nov 18, 2025
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core release-test release test labels Nov 18, 2025
@aslonnie aslonnie merged commit 207bd7e into master Nov 18, 2025
6 checks passed
@aslonnie aslonnie deleted the lonnie-251117-shortrunner branch November 18, 2025 06:49
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
merge everything into AnyscaleJobRunner

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
merge everything into AnyscaleJobRunner

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
merge everything into AnyscaleJobRunner

Signed-off-by: Lonnie Liu <[email protected]>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
merge everything into AnyscaleJobRunner

Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>