
Conversation

@andreyvelich
Member

@andreyvelich andreyvelich commented Oct 13, 2025

Part of: #46
Depends on: kubeflow/katib#2579

I've added initial support for hyperparameter optimization with OptimizerClient() to the Kubeflow SDK.
This PR also introduces some refactoring to re-use code across TrainerClient() and OptimizerClient().

Working example:

def train_func(lr: str, num_epochs: str):
    import time
    import random

    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

OptimizerClient().optimize(
    TrainJobTemplate(
        trainer=CustomTrainer(train_func, num_nodes=2),
    ),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
)
Katib Experiment
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: o759d77408d2
  namespace: default
spec:
  algorithm:
    algorithmName: random
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    metricStrategies:
      - name: loss
        value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 1
  parameters:
    - feasibleSpace:
        distribution: logUniform
        max: "0.05"
        min: "0.01"
      name: lr
      parameterType: double
    - feasibleSpace:
        distribution: uniform
        list:
          - "2"
          - "4"
          - "5"
      name: num_epochs
      parameterType: categorical
  resumePolicy: Never
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: node
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: "0"
      jobset.sigs.k8s.io/replicatedjob-name: node
    retain: true
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
      - name: lr
        reference: lr
      - name: num_epochs
        reference: num_epochs
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        runtimeRef:
          name: torch-distributed
        trainer:
          command:
            - bash
            - -c
            - |2-

              read -r -d '' SCRIPT << EOM

              def train_func(lr: str, num_epochs: str):
                  import time
                  import random

                  for i in range(10):
                      time.sleep(1)
                      print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

                  print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

              train_func(**{'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'})

              EOM
              printf "%s" "$SCRIPT" > "test-iceberg.py"
              torchrun "test-iceberg.py"
          numNodes: 2
status:
  completionTime: "2025-10-13T21:48:41Z"
  conditions:
    - lastTransitionTime: "2025-10-13T21:45:37Z"
      lastUpdateTime: "2025-10-13T21:45:37Z"
      message: Experiment is created
      reason: ExperimentCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment is running
      reason: ExperimentRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2025-10-13T21:48:41Z"
      lastUpdateTime: "2025-10-13T21:48:41Z"
      message: Experiment has succeeded because max trial count has reached
      reason: ExperimentMaxTrialsReached
      status: "True"
      type: Succeeded
  currentOptimalTrial:
    bestTrialName: o759d77408d2-lfcqff79
    observation:
      metrics:
        - latest: "0.85"
          max: "0.98"
          min: "0.77"
          name: loss
    parameterAssignments:
      - name: lr
        value: "0.018571949792818013"
      - name: num_epochs
        value: "5"
  startTime: "2025-10-13T21:45:37Z"
  succeededTrialList:
    - o759d77408d2-lfcqff79
    - o759d77408d2-qwbkwc9n
    - o759d77408d2-jhqgmnm6
    - o759d77408d2-xjk86z66
    - o759d77408d2-g8mr72v7
    - o759d77408d2-5s2mqftm
    - o759d77408d2-86p9bw4r
    - o759d77408d2-28d5gd8f
    - o759d77408d2-m8gq4pcn
    - o759d77408d2-kxg6f45v
  trials: 10
  trialsSucceeded: 10
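
For reference, here is a minimal sketch of how the Search helpers above might map to the Experiment's feasibleSpace fields. The FeasibleSpace type and its field names are illustrative assumptions inferred from the generated Experiment, not this PR's actual types:

# Illustrative sketch only: one way Search.loguniform()/Search.choice() could
# translate into the feasibleSpace fields seen in the Experiment above. The
# FeasibleSpace type and its field names are assumptions, not the SDK's types.
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FeasibleSpace:
    parameter_type: str                  # "double" or "categorical"
    distribution: Optional[str] = None   # e.g. "logUniform" or "uniform"
    min: Optional[str] = None
    max: Optional[str] = None
    choices: list[str] = field(default_factory=list)  # feasibleSpace.list


class Search:
    """Hypothetical helpers mirroring the generated feasibleSpace blocks."""

    @staticmethod
    def loguniform(low: float, high: float) -> FeasibleSpace:
        # -> parameterType: double, distribution: logUniform, string min/max
        return FeasibleSpace(
            parameter_type="double",
            distribution="logUniform",
            min=str(low),
            max=str(high),
        )

    @staticmethod
    def choice(values: list[Any]) -> FeasibleSpace:
        # -> parameterType: categorical with a list of string values
        return FeasibleSpace(
            parameter_type="categorical",
            distribution="uniform",
            choices=[str(v) for v in values],
        )


print(Search.loguniform(0.01, 0.05))
print(Search.choice([2, 4, 5]))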

/assign @kubeflow/kubeflow-sdk-team @akshaychitneni

TODO items:

  • Add get_job() API
  • Add list_jobs() API
  • Add delete_job() API

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andreyvelich. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Oct 13, 2025

Pull Request Test Coverage Report for Build 18826185352

Details

  • 31 of 37 (83.78%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+6.2%) to 79.621%

Changes Missing Coverage                           Covered Lines   Changed/Added Lines   %
kubeflow/trainer/backends/kubernetes/backend.py    31              37                    83.78%

Totals Coverage Status
Change from base Build 18655221655: 6.2%
Covered Lines: 168
Relevant Lines: 211

💛 - Coveralls

@andreyvelich andreyvelich marked this pull request as draft October 14, 2025 02:14
@andreyvelich andreyvelich marked this pull request as ready for review October 14, 2025 14:54
@andreyvelich
Member Author

andreyvelich commented Oct 14, 2025

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy @franciscojavierarceo
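
A minimal usage sketch for these APIs, written against the log messages visible in this review; the import path, argument shapes, and result fields are assumptions, not the PR's confirmed signatures:

# Sketch only: get_job()/delete_job() are assumed to take the job name and
# list_jobs() to return the jobs in the client's namespace, based on the
# "OptimizationJob {namespace}/{name}" log messages in this PR. The import
# path and the fields accessed on the returned objects are also assumptions.
from kubeflow.optimizer import OptimizerClient

client = OptimizerClient()

# List the optimization jobs in the client's namespace.
for job in client.list_jobs():
    print(job.name, job.status)

# Inspect a single job, e.g. the Experiment shown in the description above.
job = client.get_job("o759d77408d2")
print(job.status, job.creation_timestamp)

# Clean up once the results have been collected.
client.delete_job("o759d77408d2")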

@google-oss-prow

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/kubeflow-sdk-team, Fiona-Waters, abhijeet-dhumal, jskswamy.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

I have implemented create_job(), get_job(), list_jobs(), and delete_job() APIs for OptimizerClient().
Please take a look at this PR.
/cc @kubeflow/kubeflow-sdk-team @briangallagher @Fiona-Waters @abhijeet-dhumal @anencore94 @jskswamy

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andreyvelich andreyvelich changed the title feat(optimizer): Hyperparameter Optimization APIs in Kubeflow SDK feat(api): Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
@andreyvelich andreyvelich changed the title feat(api): Hyperparameter Optimization APIs in Kubeflow SDK feat: Hyperparameter Optimization APIs in Kubeflow SDK Oct 14, 2025
@andreyvelich
Member Author

cc @helenxie-bit @mahdikhashan

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor

Should we consider adding options already?

Member Author

Let's add it in a follow-up PR, since we want to limit the number of APIs a user can configure initially for the Experiment CR.

Contributor

Sounds good!
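
For context on the parameters being discussed, a hedged sketch of a call that passes objectives and algorithm explicitly; the import paths and the keyword arguments of Objective and RandomSearch are assumptions inferred from the generated Experiment, not confirmed signatures:

# Illustrative only: the import paths and the Objective/RandomSearch keyword
# arguments are assumptions inferred from the generated Experiment above
# (objective "loss"/minimize, algorithmName "random"), not this PR's API.
from kubeflow.optimizer import Objective, OptimizerClient, RandomSearch, Search
from kubeflow.trainer import CustomTrainer, TrainJobTemplate


def train_func(lr: str, num_epochs: str):
    # Minimal stand-in for the training function from the working example.
    print(f"loss={float(lr):.2f}")


OptimizerClient().optimize(
    TrainJobTemplate(trainer=CustomTrainer(train_func, num_nodes=2)),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
    objectives=[Objective(name="loss", direction="minimize")],  # assumed kwargs
    algorithm=RandomSearch(),  # maps to algorithmName: random
)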


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_crd(
Contributor

Suggested change
def __get_optimization_job_from_crd(
def __get_optimization_job_from_custom_resource(


def __get_optimization_job_from_crd(
self,
optimization_job_crd: models.V1beta1Experiment,
Contributor

Suggested change
optimization_job_crd: models.V1beta1Experiment,
optimization_job_cr: models.V1beta1Experiment,

from kubeflow.optimizer.types.optimization_types import Objective, OptimizationJob, TrialConfig


class ExecutionBackend(abc.ABC):
Contributor

Suggested change
class ExecutionBackend(abc.ABC):
class RuntimeBackend(abc.ABC):

Contributor

Or:

Suggested change
class ExecutionBackend(abc.ABC):
class OptimizerBackend(abc.ABC):

Member Author

We previously agreed on ExecutionBackend here: #34 (comment) with @kramaranya and @szaher.
Would you prefer to find a better name for it, @astefanutti?

Contributor

@andreyvelich that's not a big deal, ExecutionBackend is fine. RuntimeBackend seems more general, as it also covers resources and not only the "execution", like the job "registry" (etcd for Kubernetes).

Member Author

That sounds good!
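
For reference, a minimal sketch of what such an abstract backend could declare; the method set mirrors the client APIs discussed in this PR and is an assumption, not the contents of the file under review:

# A minimal sketch only: the abstract method set below mirrors the client
# APIs discussed in this PR and is an assumption, not the module's contents.
import abc
from typing import Any, Optional


class ExecutionBackend(abc.ABC):
    """Pluggable backend that runs optimization jobs (e.g. on Kubernetes)."""

    @abc.abstractmethod
    def optimize(
        self,
        trial_template: Any,
        search_space: dict[str, Any],
        objectives: Optional[list[Any]] = None,
        algorithm: Optional[Any] = None,
    ) -> str:
        """Create an optimization job and return its name."""

    @abc.abstractmethod
    def get_job(self, name: str) -> Any:
        """Return a single optimization job by name."""

    @abc.abstractmethod
    def list_jobs(self) -> list[Any]:
        """Return the optimization jobs in the backend's namespace."""

    @abc.abstractmethod
    def delete_job(self, name: str) -> None:
        """Delete an optimization job by name."""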

EXPERIMENT_SUCCEEDED = "Succeeded"

# Label to identify Experiment's resources.
EXPERIMENT_LABEL = "katib.kubeflow.org/experiment"
Contributor

Should we start using optimizer.kubeflow.org?

Member Author

Since we rely on the Katib Experiment CRD for now, we can't use the new labels yet.

Member

So, do we have plans to implement OptimizerRuntime and let OptimizerJob override it in the future?

Member Author

I don't think we need OptimizerRuntime, since OptimizerJob should natively integrate with TrainingRuntime.

Member

@Electronic-Waste Electronic-Waste left a comment

@andreyvelich Thanks for this. I left my initial question :)

steps: list[Step]
num_nodes: int
status: str = constants.UNKNOWN
creation_timestamp: datetime
Member

Why do we need creation_timestamp? Shouldn't it be added automatically in the creation phase?

Member Author

It does. We set this property from the Experiment.metadata.creation_timestamp:

creation_timestamp=optimization_job_cr.metadata.creation_timestamp,

@andreyvelich
Member Author

@kramaranya @Electronic-Waste @astefanutti Any additional comments before we move forward with the initial support of HPO in the Kubeflow SDK?

Contributor

@kramaranya kramaranya left a comment

Thank you so much @andreyvelich for this great work!!
I left a few comments.

"pydantic>=2.10.0",
"kubeflow-trainer-api>=2.0.0",
# TODO (andreyvelich): Switch to kubeflow-katib-api once it is published.
"kubeflow_katib_api@git+https://github.com/kramaranya/katib.git@separate-models-from-sdk#subdirectory=api/python_api",
Contributor

Since it has been merged, we can update that with the katib ref instead of the fork. Or shall we cut a new Katib release and publish those models to PyPI?

initializers.
"""

trainer: CustomTrainer
Contributor

Why don't we support BuiltinTrainer initially? Is it due to metrics collection?

# Import the Kubeflow Trainer types.
from kubeflow.trainer.types.types import TrainJobTemplate

__all__ = [
Contributor

Shall we add GridSearch here?

"""
# Set the default backend config.
if not backend_config:
backend_config = KubernetesBackendConfig()
Contributor

nit: just for consistency, shall we match trainer and use the same import style:

if not backend_config:
    backend_config = common_types.KubernetesBackendConfig()


logger.debug(f"OptimizationJob {self.namespace}/{name} has been deleted")

def __get_optimization_job_from_custom_resource(
Contributor

To align with trainer, should we update this?

Suggested change
def __get_optimization_job_from_custom_resource(
def __get_optimization_job_from_cr(


except multiprocessing.TimeoutError as e:
raise TimeoutError(
f"Timeout to list OptimizationJobs in namespace: {self.namespace}"
Contributor

Can we add OptimizationJob to constants instead?

# Trainer function arguments for the appropriate substitution.
parameters_spec = []
trial_parameters = []
trial_template.trainer.func_args = {}
Contributor

Would this not overwrite existing func_args?
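
A small sketch of how the placeholders could be merged into any pre-set arguments instead of replacing them, assuming func_args is a plain dict; illustration only, not the PR's code:

# Sketch for the question above, assuming func_args is a plain dict: merge the
# trial-parameter placeholders into any user-provided arguments rather than
# replacing the whole dict. The placeholder format matches the generated
# trialSpec ("${trialParameters.<name>}").
from typing import Any, Optional


def merge_func_args(
    existing: Optional[dict[str, Any]], search_space: dict[str, Any]
) -> dict[str, Any]:
    """Merge trial-parameter placeholders into user-provided func_args."""
    trial_args = {name: f"${{trialParameters.{name}}}" for name in search_space}
    return {**(existing or {}), **trial_args}


print(merge_func_args({"batch_size": 32}, {"lr": None, "num_epochs": None}))
# {'batch_size': 32, 'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'}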

trial_config: Optional[TrialConfig] = None,
search_space: dict[str, Any],
objectives: Optional[list[Objective]] = None,
algorithm: Optional[RandomSearch] = None,
Contributor

I wonder whether we should accept a base type instead, so any algorithm works without changing the API in the future?

Suggested change
algorithm: Optional[RandomSearch] = None,
algorithm: Optional[BaseAlgorithm] = None,
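
A minimal sketch of what that base type could look like; RandomSearch and the "random" algorithm name appear in this PR, while GridSearch and the name property are assumptions for illustration:

# Illustrative sketch of the suggested base type. RandomSearch and the
# "random" algorithm name appear in this PR; GridSearch and the "name"
# property are assumptions for illustration.
import abc
from dataclasses import dataclass


class BaseAlgorithm(abc.ABC):
    """Common base so optimize() can accept any search algorithm."""

    @property
    @abc.abstractmethod
    def name(self) -> str:
        """Katib algorithmName for this algorithm."""


@dataclass
class RandomSearch(BaseAlgorithm):
    @property
    def name(self) -> str:
        return "random"


@dataclass
class GridSearch(BaseAlgorithm):
    @property
    def name(self) -> str:
        return "grid"


print(RandomSearch().name, GridSearch().name)  # random grid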

Comment on lines +30 to +31
MAXIMIZE = "maximize"
MINIMIZE = "minimize"
Contributor

What do you think about adding "max" and "min" aliases?
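
A small sketch of one way to accept those aliases, assuming the direction constants live in a str-based Enum; illustration only, not the PR's constants module:

# Sketch only: shows one way a str-based Enum could accept "max"/"min" as
# aliases for "maximize"/"minimize"; assumes the direction is such an Enum.
from enum import Enum


class Direction(str, Enum):
    MAXIMIZE = "maximize"
    MINIMIZE = "minimize"

    @classmethod
    def _missing_(cls, value):
        # Called when the plain value lookup fails, e.g. Direction("min").
        aliases = {"max": cls.MAXIMIZE, "min": cls.MINIMIZE}
        if isinstance(value, str):
            return aliases.get(value.lower())
        return None


print(Direction("min"))        # Direction.MINIMIZE
print(Direction("maximize"))   # Direction.MAXIMIZE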
