
Conversation

@andreyvelich
Member

@andreyvelich andreyvelich commented Oct 13, 2025

Part of: #46
Depends on: kubeflow/katib#2579

I've added initial support for hyperparameter optimization with OptimizerClient() to the Kubeflow SDK.
This PR also introduces some refactoring to re-use code across TrainerClient() and OptimizerClient().

Working example:

def train_func(lr: str, num_epochs: str):
    import time
    import random

    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

OptimizerClient().optimize(
    TrainJobTemplate(
        trainer=CustomTrainer(train_func, num_nodes=2),
    ),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
)
Katib Experiment
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: o759d77408d2
  namespace: default
spec:
  algorithm:
    algorithmName: random
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    metricStrategies:
      - name: loss
        value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 1
  parameters:
    - feasibleSpace:
        distribution: logUniform
        max: "0.05"
        min: "0.01"
      name: lr
      parameterType: double
    - feasibleSpace:
        distribution: uniform
        list:
          - "2"
          - "4"
          - "5"
      name: num_epochs
      parameterType: categorical
  resumePolicy: Never
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: node
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: "0"
      jobset.sigs.k8s.io/replicatedjob-name: node
    retain: true
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
      - name: lr
        reference: lr
      - name: num_epochs
        reference: num_epochs
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        runtimeRef:
          name: torch-distributed
        trainer:
          command:
            - bash
            - -c
            - |2-

              read -r -d '' SCRIPT << EOM

              def train_func(lr: str, num_epochs: str):
                  import time
                  import random

                  for i in range(10):
                      time.sleep(1)
                      print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

                  print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

              train_func(**{'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'})

              EOM
              printf "%s" "$SCRIPT" > "test-iceberg.py"
              torchrun "test-iceberg.py"
          numNodes: 2
status:
  completionTime: "2025-10-13T21:48:41Z"
  conditions:
    - lastTransitionTime: "2025-10-13T21:45:37Z"
      lastUpdateTime: "2025-10-13T21:45:37Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
  currentOptimalTrial:
    bestTrialName: o759d77408d2-lfcqff79
    observation:
      metrics:
        - latest: "0.85"
          max: "0.98"
          min: "0.77"
          name: loss
    parameterAssignments:
      - name: lr
        value: "0.018571949792818013"
      - name: num_epochs
        value: "5"
  startTime: "2025-10-13T21:45:37Z"
  succeededTrialList:
    - o759d77408d2-lfcqff79
    - o759d77408d2-qwbkwc9n
    - o759d77408d2-jhqgmnm6
    - o759d77408d2-xjk86z66
    - o759d77408d2-g8mr72v7
    - o759d77408d2-5s2mqftm
    - o759d77408d2-86p9bw4r
    - o759d77408d2-28d5gd8f
    - o759d77408d2-m8gq4pcn
    - o759d77408d2-kxg6f45v
  trials: 10
  trialsSucceeded: 10
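
For reference, here is a minimal sketch of how the Search helpers above might map to the Experiment's feasibleSpace fields. The FeasibleSpace type and its field names are illustrative assumptions inferred from the generated Experiment, not this PR's actual types:

# Illustrative sketch only: one way Search.loguniform()/Search.choice() could
# translate into the feasibleSpace fields seen in the Experiment above. The
# FeasibleSpace type and its field names are assumptions, not the SDK's types.
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FeasibleSpace:
    parameter_type: str                  # "double" or "categorical"
    distribution: Optional[str] = None   # e.g. "logUniform" or "uniform"
    min: Optional[str] = None
    max: Optional[str] = None
    choices: list[str] = field(default_factory=list)  # feasibleSpace.list


class Search:
    """Hypothetical helpers mirroring the generated feasibleSpace blocks."""

    @staticmethod
    def loguniform(low: float, high: float) -> FeasibleSpace:
        # -> parameterType: double, distribution: logUniform, string min/max
        return FeasibleSpace(
            parameter_type="double",
            distribution="logUniform",
            min=str(low),
            max=str(high),
        )

    @staticmethod
    def choice(values: list[Any]) -> FeasibleSpace:
        # -> parameterType: categorical with a list of string values
        return FeasibleSpace(
            parameter_type="categorical",
            distribution="uniform",
            choices=[str(v) for v in values],
        )


print(Search.loguniform(0.01, 0.05))
print(Search.choice([2, 4, 5]))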

/assign @kubeflow/kubeflow-sdk-team @akshaychitneni

TODO items:

  • Add get_job() API
  • Add list_jobs() API
  • Add delete_job() API

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Oct 13, 2025

Pull Request Test Coverage Report for Build 18826185352

Details

  • 31 of 37 (83.78%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+6.2%) to 79.621%

Changes Missing Coverage                           Covered Lines   Changed/Added Lines   %
kubeflow/trainer/backends/kubernetes/backend.py    31              37                    83.78%

Totals Coverage Status
Change from base Build 18655221655: 6.2%
Covered Lines: 168
Relevant Lines: 211

💛 - Coveralls

@andreyvelich andreyvelich marked this pull request as draft October 14, 2025 02:14
@andreyvelich andreyvelich marked this pull request as ready for review October 14, 2025 14:54
@andreyvelich
Member Author

andreyvelich commented Oct 14, 2025

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy @franciscojavierarceo
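
A minimal usage sketch for these APIs, written against the log messages visible in this review; the import path, argument shapes, and result fields are assumptions, not the PR's confirmed signatures:

# Sketch only: get_job()/delete_job() are assumed to take the job name and
# list_jobs() to return the jobs in the client's namespace, based on the
# "OptimizationJob {namespace}/{name}" log messages in this PR. The import
# path and the fields accessed on the returned objects are also assumptions.
from kubeflow.optimizer import OptimizerClient

client = OptimizerClient()

# List the optimization jobs in the client's namespace.
for job in client.list_jobs():
    print(job.name, job.status)

# Inspect a single job, e.g. the Experiment shown in the description above.
job = client.get_job("o759d77408d2")
print(job.status, job.creation_timestamp)

# Clean up once the results have been collected.
client.delete_job("o759d77408d2")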

@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, Fiona-Waters, abhijeet-dhumal, jskswamy.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich andreyvelich changed the title feat(optimizer): Hyperparameter Optimization APIs in Kubeflow SDK feat(api): Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
@andreyvelich andreyvelich changed the title feat(api): Hyperparameter Optimization APIs in Kubeflow SDK feat: Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
@andreyvelich
Member Author

cc @helenxie-bit @mahdikhashan

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor

Should we consider adding options already?

Member Author

Let's add it in a follow-up PR, since we want to limit the number of APIs a user can configure initially for the Experiment CR.

Contributor

Sounds good!
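
For context on the parameters being discussed, a hedged sketch of a call that passes objectives and algorithm explicitly; the import paths and the keyword arguments of Objective and RandomSearch are assumptions inferred from the generated Experiment, not confirmed signatures:

# Illustrative only: the import paths and the Objective/RandomSearch keyword
# arguments are assumptions inferred from the generated Experiment above
# (objective "loss"/minimize, algorithmName "random"), not this PR's API.
from kubeflow.optimizer import Objective, OptimizerClient, RandomSearch, Search
from kubeflow.trainer import CustomTrainer, TrainJobTemplate


def train_func(lr: str, num_epochs: str):
    # Minimal stand-in for the training function from the working example.
    print(f"loss={float(lr):.2f}")


OptimizerClient().optimize(
    TrainJobTemplate(trainer=CustomTrainer(train_func, num_nodes=2)),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
    objectives=[Objective(name="loss", direction="minimize")],  # assumed kwargs
    algorithm=RandomSearch(),  # maps to algorithmName: random
)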


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_crd(
Contributor

Suggested change
def __get_optimization_job_from_crd(
def __get_optimization_job_from_custom_resource(


def __get_optimization_job_from_crd(
self,
optimization_job_crd: models.V1beta1Experiment,
Contributor

Suggested change
optimization_job_crd: models.V1beta1Experiment,
optimization_job_cr: models.V1beta1Experiment,

from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig


class ExecutionBackend(abc.ABC):
Contributor

Suggested change
class ExecutionBackend(abc.ABC):
class RuntimeBackend(abc.ABC):

Contributor

Or:

Suggested change
class ExecutionBackend(abc.ABC):
class OptimizerBackend(abc.ABC):

Member Author

We previously agreed on ExecutionBackend here: #34 (comment) with @kramaranya and @szaher.
Would you prefer to find a better name for it, @astefanutti?

Contributor

@andreyvelich that's not a big deal, ExecutionBackend is fine. RuntimeBackend seems more general, as it also covers resources and not only the "execution", like the job "registry" (etcd for Kubernetes).

Member Author

That sounds good!
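
For reference, a minimal sketch of what such an abstract backend could declare; the method set mirrors the client APIs discussed in this PR and is an assumption, not the contents of the file under review:

# A minimal sketch only: the abstract method set below mirrors the client
# APIs discussed in this PR and is an assumption, not the module's contents.
import abc
from typing import Any, Optional


class ExecutionBackend(abc.ABC):
    """Pluggable backend that runs optimization jobs (e.g. on Kubernetes)."""

    @abc.abstractmethod
    def optimize(
        self,
        trial_template: Any,
        search_space: dict[str, Any],
        objectives: Optional[list[Any]] = None,
        algorithm: Optional[Any] = None,
    ) -> str:
        """Create an optimization job and return its name."""

    @abc.abstractmethod
    def get_job(self, name: str) -> Any:
        """Return a single optimization job by name."""

    @abc.abstractmethod
    def list_jobs(self) -> list[Any]:
        """Return the optimization jobs in the backend's namespace."""

    @abc.abstractmethod
    def delete_job(self, name: str) -> None:
        """Delete an optimization job by name."""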

EXPERIMENT_SUCCEEDED = "Succeeded"

# Label to identify Experiment's resources.
EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
Contributor

Should we start using optimizer.kubeflow.org?

Member Author

Since we rely on the Katib Experiment CRD for now, we can't use the new labels yet.

Member

So, do we have plans to implement OptimizerRuntime and let OptimizerJob override it in the future?

Member Author

I don't think we need OptimizerRuntime, since OptimizerJob should natively integrate with TrainingRuntime.

Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Thanks for this. I left my initial question :)

steps: list[Step]
num_nodes: int
status: str = constants.UNKNOWN
creation_timestamp: datetime
Member

Why do we need creation_timestamp? Shouldn't it be added automatically in the creation phase?

Member Author

It does. We set this property from the Experiment.metadata.creation_timestamp:

creation_timestamp=optimization_job_cr.metadata.creation_timestamp,

@andreyvelich
Member Author

@kramaranya @Electronic-Waste @astefanutti Any additional comments before we move forward with the initial support of HPO in the Kubeflow SDK?

Contributor

@kramaranya kramaranya left a comment

Thank you so much @andreyvelich for this great work!!
I left a few comments.

"pydantic>=2.10.0",
"kubeflow-trainer-api>=2.0.0",
# TODO (andreyvelich): Switch to kubeflow-katib-api once it is published.
"kubeflow_katib_api@git+https://github.com/kramaranya/katib.git@separate-models-from-sdk#subdirectory=api/python_api",
Contributor

Since it has been merged, we can update that with the katib ref instead of the fork. Or shall we cut a new Katib release and publish those models to PyPI?

initializers.
"""

trainer: CustomTrainer
Contributor

Why don't we support BuiltinTrainer initially? Is it due to metrics collection?

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
Contributor

Shall we add GridSearch here?

"""
# Set the default backend config.
if not backend_config:
backend_config = KubernetesBackendConfig()
Contributor

nit: just for consistency, shall we match trainer and use the same import style:

if not backend_config:
    backend_config = common_types.KubernetesBackendConfig()


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_custom_resource(
Contributor

To align with trainer, should we update this?

Suggested change
def __get_optimization_job_from_custom_resource(
def __get_optimization_job_from_cr(


except multiprocessing.TimeoutError as e:
raise TimeoutError(
f"Timeout to list OptimizationJobs in namespace: {self.namespace}"
Contributor

Can we add OptimizationJob to constants instead?

# Trainer function arguments for the appropriate substitution.
parameters_spec = []
trial_parameters = []
trial_template.trainer.func_args = {}
Contributor

Would this not overwrite existing func_args?
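
A small sketch of how the placeholders could be merged into any pre-set arguments instead of replacing them, assuming func_args is a plain dict; illustration only, not the PR's code:

# Sketch for the question above, assuming func_args is a plain dict: merge the
# trial-parameter placeholders into any user-provided arguments rather than
# replacing the whole dict. The placeholder format matches the generated
# trialSpec ("${trialParameters.<name>}").
from typing import Any, Optional


def merge_func_args(
    existing: Optional[dict[str, Any]], search_space: dict[str, Any]
) -> dict[str, Any]:
    """Merge trial-parameter placeholders into user-provided func_args."""
    trial_args = {name: f"${{trialParameters.{name}}}" for name in search_space}
    return {**(existing or {}), **trial_args}


print(merge_func_args({"batch_size": 32}, {"lr": None, "num_epochs": None}))
# {'batch_size': 32, 'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'}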

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor

I wonder whether we should accept a base type instead, so any algorithm works without changing the API in the future?

Suggested change
algorithm: Optional[RandomSearch] = None,
algorithm: Optional[BaseAlgorithm] = None,
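
A minimal sketch of what that base type could look like; RandomSearch and the "random" algorithm name appear in this PR, while GridSearch and the name property are assumptions for illustration:

# Illustrative sketch of the suggested base type. RandomSearch and the
# "random" algorithm name appear in this PR; GridSearch and the "name"
# property are assumptions for illustration.
import abc
from dataclasses import dataclass


class BaseAlgorithm(abc.ABC):
    """Common base so optimize() can accept any search algorithm."""

    @property
    @abc.abstractmethod
    def name(self) -> str:
        """Katib algorithmName for this algorithm."""


@dataclass
class RandomSearch(BaseAlgorithm):
    @property
    def name(self) -> str:
        return "random"


@dataclass
class GridSearch(BaseAlgorithm):
    @property
    def name(self) -> str:
        return "grid"


print(RandomSearch().name, GridSearch().name)  # random grid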

Comment on lines +30 to +31
MAXIMIZE = "maximize"
MINIMIZE = "minimize"
Contributor

What do you think about adding "max" and "min" aliases?
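
A small sketch of one way to accept those aliases, assuming the direction constants live in a str-based Enum; illustration only, not the PR's constants module:

# Sketch only: shows one way a str-based Enum could accept "max"/"min" as
# aliases for "maximize"/"minimize"; assumes the direction is such an Enum.
from enum import Enum


class Direction(str, Enum):
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

    @classmethod
    def _missing_(cls, value):
        # Called when the plain value lookup fails, e.g. Direction("min").
        aliases = {"max": cls.MAXIMIZE, "min": cls.MINIMIZE}
        if isinstance(value, str):
            return aliases.get(value.lower())
        return None


print(Direction("min"))        # Direction.MINIMIZE
print(Direction("maximize"))   # Direction.MAXIMIZE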
