feat: Hyperparameter Optimization APIs in Kubeflow SDK #124
I have implemented |
|
    trial_config: Optional[TrialConfig] = None,
    search_space: dict[str, Any],
    objectives: Optional[list[Objective]] = None,
    algorithm: Optional[RandomSearch] = None,
Should we consider adding options already?
Let's add it in the followup PR, since we want to limit number of APIs user can configure initially for Experiment CR.
Sounds good!
        logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

    def __get_optimization_job_from_crd(
Suggested change:
-   def __get_optimization_job_from_crd(
+   def __get_optimization_job_from_custom_resource(
    def __get_optimization_job_from_crd(
        self,
        optimization_job_crd: models.V1beta1Experiment,
Suggested change:
-   optimization_job_crd: models.V1beta1Experiment,
+   optimization_job_cr: models.V1beta1Experiment,
kubeflow/optimizer/backends/base.py
Outdated
    from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig


    class ExecutionBackend(abc.ABC):
Suggested change:
-   class ExecutionBackend(abc.ABC):
+   class RuntimeBackend(abc.ABC):
Or:
-   class ExecutionBackend(abc.ABC):
+   class OptimizerBackend(abc.ABC):
We previously agreed on ExecutionBackend here: #34 (comment), with @kramaranya and @szaher.
Would you prefer to find a better name for it, @astefanutti?
@andreyvelich That's not a big deal; ExecutionBackend is fine. RuntimeBackend seems more general, since it also covers resources, like the job "registry" (etcd for Kubernetes), and not only the "execution".
That sounds good!
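For readers following the naming thread, here is a minimal sketch of the pattern under discussion: an abstract base class built with Python's `abc` module that concrete backends implement. The method names (`create_job`, `list_jobs`) and the `InMemoryBackend` subclass are hypothetical and chosen only for illustration; they are not the interface defined in this PR.

```python
import abc


class ExecutionBackend(abc.ABC):
    """Hypothetical base class: each backend decides how jobs are stored and run."""

    @abc.abstractmethod
    def create_job(self, name: str) -> str:
        """Create a job and return its name."""

    @abc.abstractmethod
    def list_jobs(self) -> list[str]:
        """Return the names of all known jobs."""


class InMemoryBackend(ExecutionBackend):
    """Toy backend (an assumption) that keeps jobs in a list instead of a cluster."""

    def __init__(self) -> None:
        self._jobs: list[str] = []

    def create_job(self, name: str) -> str:
        self._jobs.append(name)
        return name

    def list_jobs(self) -> list[str]:
        # Return a copy so callers cannot mutate the backend's state.
        return list(self._jobs)


backend = InMemoryBackend()
backend.create_job("hpo-demo")
```

Whatever the class ends up being called, the base class itself cannot be instantiated; only subclasses that implement every abstract method can.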
Force-pushed from 28d2b5e to 92f34a5.
    EXPERIMENT_SUCCEEDED = "Succeeded"

    # Label to identify Experiment's resources.
    EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
Should we start using optimizer.kubeflow.org?
Since we rely on the Katib Experiment CRD for now, we can't use the new labels yet.
So, do we have plans to implement OptimizerRuntime and let OptimizerJob override it in the future?
I don't think we need OptimizerRuntime, since OptimizerJob should natively integrate with TrainingRuntime
@andreyvelich Thanks for this. I left my initial question:)
    steps: list[Step]
    num_nodes: int
    status: str = constants.UNKNOWN
    creation_timestamp: datetime
Why do we need creation_timestamp? Shouldn't it be added automatically in the creation phase?
It does. We set this property from the Experiment.metadata.creation_timestamp:
    creation_timestamp=optimization_job_cr.metadata.creation_timestamp,
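As a rough illustration of the answer above, the sketch below copies the timestamp from the resource metadata into the SDK-side object, so the caller never sets it. The `OptimizationJob` dataclass here is a simplified stand-in, and `job_from_cr` and the `SimpleNamespace` fake resource are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import SimpleNamespace


@dataclass
class OptimizationJob:
    """Simplified, hypothetical stand-in for the SDK type."""

    name: str
    creation_timestamp: datetime


def job_from_cr(cr) -> OptimizationJob:
    # The timestamp is read from the resource metadata, which the API
    # server populates at creation time; the user never supplies it.
    return OptimizationJob(
        name=cr.metadata.name,
        creation_timestamp=cr.metadata.creation_timestamp,
    )


# Fake custom resource mimicking the shape of Experiment.metadata (an assumption).
cr = SimpleNamespace(
    metadata=SimpleNamespace(
        name="hpo-demo",
        creation_timestamp=datetime(2025, 1, 1, tzinfo=timezone.utc),
    )
)
job = job_from_cr(cr)
```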
@kramaranya @Electronic-Waste @astefanutti Any additional comments before we move forward with the initial support of HPO in Kubeflow SDK?
Signed-off-by: Andrey Velichkevich <[email protected]>
Force-pushed from 9b3700a to 1353fc9.
Thank you so much @andreyvelich for this great work!!
I left a few comments
    "pydantic>=2.10.0",
    "kubeflow-trainer-api>=2.0.0",
    # TODO (andreyvelich): Switch to kubeflow-katib-api once it is published.
    "kubeflow_katib_api@git+https://github.com/kramaranya/katib.git@separate-models-from-sdk#subdirectory=api/python_api",
Since it has been merged, we can replace the fork with a ref to the Katib repository. Or shall we cut a new Katib release and publish those models to PyPI?
        initializers.
    """

    trainer: CustomTrainer
Why don't we support BuiltinTrainer initially? Is it due to metrics collection?
    # Import the Kubeflow Trainer types.
    from kubeflow.trainer.types.types import TrainJobTemplate

    __all__ = [
Shall we add GridSearch here?
    """
    # Set the default backend config.
    if not backend_config:
        backend_config = KubernetesBackendConfig()
Nit: for consistency, shall we match the trainer and use the same import style?
    if not backend_config:
        backend_config = common_types.KubernetesBackendConfig()
        logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

    def __get_optimization_job_from_custom_resource(
To align with trainer, should we update this?
Suggested change:
-   def __get_optimization_job_from_custom_resource(
+   def __get_optimization_job_from_cr(
    except multiprocessing.TimeoutError as e:
        raise TimeoutError(
            f"Timeout to list OptimizationJobs in namespace: {self.namespace}"
Can we add OptimizationJob to constants instead?
    # Trainer function arguments for the appropriate substitution.
    parameters_spec = []
    trial_parameters = []
    trial_template.trainer.func_args = {}
Would this not overwrite existing func_args?
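To make the concern concrete, here is a minimal sketch of the alternative: merging the searched parameters into any arguments the user already supplied instead of resetting the dict. The helper `build_func_args` is hypothetical, and the `${trialParameters.<name>}` placeholder format is assumed from Katib's trial template convention. Whether the PR should merge or reset is the open question; this only shows what merging would look like.

```python
def build_func_args(existing, search_space):
    """Merge user-supplied args with one placeholder per searched parameter."""
    merged = dict(existing or {})  # copy, so the caller's dict is not mutated
    for name in search_space:
        # A searched parameter replaces a fixed value of the same name.
        merged[name] = f"${{trialParameters.{name}}}"
    return merged


user_args = {"epochs": 3, "lr": 0.1}
merged = build_func_args(user_args, {"lr": None})
```

Here `epochs` survives untouched, while `lr` is replaced by its placeholder because it appears in the search space.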
    trial_config: Optional[TrialConfig] = None,
    search_space: dict[str, Any],
    objectives: Optional[list[Objective]] = None,
    algorithm: Optional[RandomSearch] = None,
I wonder whether we should accept a base type instead, so that any algorithm works without changing the API in the future?
Suggested change:
-   algorithm: Optional[RandomSearch] = None,
+   algorithm: Optional[BaseAlgorithm] = None,
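A minimal sketch of the base-type idea follows, assuming a `BaseAlgorithm` abstract class that each algorithm subclasses. The `algorithm_name` property, the `random_state` field, the `GridSearch` class body, and the `resolve_algorithm` helper are all hypothetical and only illustrate how a new algorithm could be added without touching the function signature.

```python
import abc
from dataclasses import dataclass
from typing import Optional


class BaseAlgorithm(abc.ABC):
    """Hypothetical common type for all search algorithms."""

    @property
    @abc.abstractmethod
    def algorithm_name(self) -> str:
        """Name passed to the optimization backend."""


@dataclass
class RandomSearch(BaseAlgorithm):
    random_state: Optional[int] = None  # illustrative setting, an assumption

    @property
    def algorithm_name(self) -> str:
        return "random"


@dataclass
class GridSearch(BaseAlgorithm):
    @property
    def algorithm_name(self) -> str:
        return "grid"


def resolve_algorithm(algorithm: Optional[BaseAlgorithm] = None) -> str:
    # Default to random search when the caller does not pick an algorithm.
    if algorithm is None:
        algorithm = RandomSearch()
    return algorithm.algorithm_name
```

With this shape, the parameter annotation stays `Optional[BaseAlgorithm]` no matter how many algorithms are added later.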
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"
What do you think about adding "max" and "min" aliases?
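One way such aliases could be supported, sketched with the standard library's `Enum._missing_` hook, which is called when a value lookup fails. The enum name `Direction` is an assumption (the diff only shows the two members), and the alias table is illustrative.

```python
from enum import Enum


class Direction(Enum):
    """Hypothetical enum name; only the two members appear in the diff."""

    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

    @classmethod
    def _missing_(cls, value):
        # Called only when the value does not match a member directly.
        aliases = {"max": cls.MAXIMIZE, "min": cls.MINIMIZE}
        if isinstance(value, str):
            return aliases.get(value.lower())
        return None  # Enum then raises ValueError as usual
```

This keeps a single canonical value per member while still accepting the short forms on input, for example `Direction("max")`.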
Part of: #46
Depends on: kubeflow/katib#2579
I've added initial support for hyperparameter optimization with OptimizerClient() into Kubeflow SDK.
This PR also introduced some refactoring to re-use code across TrainerClient() and OptimizerClient().
Working example:
Katib Experiment
/assign @kubeflow/kubeflow-sdk-team @akshaychitneni
TODO items:
- get_job() API
- list_jobs() API
- delete_job() API