-
Notifications
You must be signed in to change notification settings - Fork 7k
[release test] remove JobRunner #58720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
merge everything into AnyscaleJobRunner Signed-off-by: Lonnie Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the command runners by removing the JobRunner class and merging its functionality into AnyscaleJobRunner. The changes are mostly correct, but there is a critical issue introduced by changing the base class of AnyscaleJobRunner which will cause a NotImplementedError when waiting for nodes. Additionally, the change effectively makes the job run type an alias for anyscale_job, which is a significant behavioral change that should be considered. I've left detailed comments on these points.
|
|
||
|
|
||
| class AnyscaleJobRunner(JobRunner): | ||
| class AnyscaleJobRunner(CommandRunner): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the base class from JobRunner to CommandRunner introduces a critical issue. The wait_for_nodes method in this class calls super().wait_for_nodes(). Previously, this resolved to JobRunner.wait_for_nodes(). Now, it will resolve to CommandRunner.wait_for_nodes(), which raises a NotImplementedError.
This will break tests that require waiting for nodes. The implementation of wait_for_nodes from JobRunner should be merged into this class. Specifically, the super() call in AnyscaleJobRunner.wait_for_nodes should be replaced with the logic to schedule the wait_cluster.py script, like this:
def wait_for_nodes(self, num_nodes: int, timeout: float = 900):
self._wait_for_nodes_timeout = timeout
self.job_manager.cluster_startup_timeout += timeout
self.run_prepare_command(
f"python wait_cluster.py {num_nodes} {timeout}", timeout=timeout + 30
)| "job": AnyscaleJobRunner, | ||
| "anyscale_job": AnyscaleJobRunner, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change makes run_type="job" an alias for run_type="anyscale_job". This is a significant behavioral change, as "job" previously used JobRunner with FullClusterManager (which manages the cluster lifecycle), and will now use AnyscaleJobRunner with MinimalClusterManager (where the cluster is managed by the Anyscale Job service). If this is intended, consider removing the "job" run type to avoid confusion, as it is now redundant.
| self._copy_script_to_working_dir("anyscale_job_wrapper.py") | ||
| super().prepare_remote_env() | ||
| self._copy_script_to_working_dir("wait_cluster.py") | ||
| self._copy_script_to_working_dir("prometheus_metrics.py") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Cluster Node Waiting Logic Missing
The wait_for_nodes method calls super().wait_for_nodes() which raises NotImplementedError in the base CommandRunner class. The old JobRunner implementation added a prepare command to run wait_cluster.py, but this logic was lost during the merge. The method should call self.run_prepare_command(f"python wait_cluster.py {num_nodes} {timeout}", timeout=timeout + 30) instead of calling the parent method.
merge everything into AnyscaleJobRunner Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
merge everything into AnyscaleJobRunner Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: YK <[email protected]>
merge everything into AnyscaleJobRunner Signed-off-by: Lonnie Liu <[email protected]>
merge everything into AnyscaleJobRunner Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Future-Outlier <[email protected]>
merge everything into AnyscaleJobRunner