Conversation

@kramaranya
Contributor

What this PR does / why we need it:
Move the generated Python models from sdk/python/v1beta1/kubeflow/katib/models into a new api/kubeflow_katib_api/models package. This is needed so that we can implement the new OptimizerClient and import those models from the Kubeflow SDK.

Which issue(s) this PR fixes:
Fixes #2577

cc @kubeflow/kubeflow-sdk-team @kubeflow/wg-automl-leads

Member

@andreyvelich andreyvelich left a comment

Thank you for this @kramaranya!
I left a few comments.

api/README.md Outdated
@@ -0,0 +1,3 @@
# Kubeflow Katib API
Member

Let's keep the folder consistent with Trainer and put it under:

api/python_api/...

POST_GEN_PYTHON_HANDLER="hack/gen-python-api/post_gen.py"
KATIB_VERSIONS=(v1beta1)

# Download JAR package if file doesn't exist.
Member

Can we please use the OpenAPI Generator container to generate the modules, like we do for Trainer?
https://github.com/kubeflow/trainer/blob/master/hack/python-api/gen-api.sh#L34-L48
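For readers unfamiliar with the container-based approach suggested here, the following is a minimal sketch of what a models-only OpenAPI Generator run could look like, written as a Python helper that only builds the command line. It is not the Trainer script or the one merged in this PR. The image name without a pinned tag, the function name `build_generator_command`, and the paths in the usage section are assumptions for illustration; a real script would likely pin an image version.

```python
import shlex

# Assumption: the public OpenAPI Generator CLI image, left unpinned for illustration.
GENERATOR_IMAGE = "openapitools/openapi-generator-cli"


def build_generator_command(repo_root, swagger_file, output_dir, package_name,
                            runtime="docker", image=GENERATOR_IMAGE):
    """Return the argv for a containerized, models-only generator run.

    The repository is mounted at /local inside the container, so the
    swagger file and output directory are given relative to repo_root.
    """
    return [
        runtime, "run", "--rm",
        "-v", f"{repo_root}:/local",
        image, "generate",
        "-i", f"/local/{swagger_file}",
        "-g", "python",
        "-o", f"/local/{output_dir}",
        "-p", f"packageName={package_name}",
        # Emit model classes only: no api_client, model tests, or model docs.
        "--global-property", "models,modelTests=false,modelDocs=false",
    ]


if __name__ == "__main__":
    # Hypothetical paths and package name, chosen only to show the shape of the call.
    cmd = build_generator_command(
        repo_root="/path/to/katib",
        swagger_file="swagger.json",
        output_dir="api/python_api",
        package_name="kubeflow_katib_api",
    )
    print(shlex.join(cmd))
```

Because the `--global-property models,...` selector limits the output to model classes, there is little left for a post-generation cleanup step to remove.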

@@ -0,0 +1,132 @@
# Copyright 2025 The Kubeflow Authors.
Member

You don't need this post_gen script for the modules, since we don't need api_client and the other files, the same as in Trainer: https://github.com/kubeflow/trainer/blob/master/hack/python-api/gen-api.sh#L50-L54

Signed-off-by: Andrey Velichkevich <[email protected]>
@andreyvelich
Member

@kramaranya I've made the required updates to make sure the Katib models work correctly with the Kubeflow SDK.
We still need to do some testing, but at least I was able to create a Katib Experiment:

def train_func(lr: str, num_epochs: str):
    import time
    import random

    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")


OptimizerClient().optimize(
    TrainJobTemplate(
        trainer=CustomTrainer(train_func),
    ),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
)
Katib Experiment
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2025-10-10T03:13:40Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: j570986812db
  namespace: default
  resourceVersion: "10962"
  uid: ea3b8710-6534-415c-a836-7153b03fbc70
spec:
  algorithm:
    algorithmName: random
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    metricStrategies:
    - name: loss
      value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 1
  parameters:
  - feasibleSpace:
      distribution: logUniform
      max: "0.05"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      distribution: uniform
      list:
      - "2"
      - "4"
      - "5"
    name: num_epochs
    parameterType: categorical
  resumePolicy: Never
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: node
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: "0"
      jobset.sigs.k8s.io/replicatedjob-name: node
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: lr
      reference: lr
    - name: num_epochs
      reference: num_epochs
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        runtimeRef:
          name: torch-distributed
        trainer:
          command:
          - bash
          - -c
          - |2-

            read -r -d '' SCRIPT << EOM

            def train_func(lr: str, num_epochs: str):
                import time
                import random

                for i in range(10):
                    time.sleep(1)
                    print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")

                print(f"loss={round(random.uniform(0.77, 0.99), 2)}")

            train_func(**{'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'})

            EOM
            printf "%s" "$SCRIPT" > "test-iceberg.py"
            torchrun "test-iceberg.py"
status:
  conditions:
  - lastTransitionTime: "2025-10-10T03:13:41Z"
    lastUpdateTime: "2025-10-10T03:13:41Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2025-10-10T03:14:01Z"
    lastUpdateTime: "2025-10-10T03:14:01Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  pendingTrialList:
  - j570986812db-z7xbnp4m
  startTime: "2025-10-10T03:13:41Z"
  trials: 1
  trialsPending: 1
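To make the link between the `search_space` argument and the `parameters` and `trialParameters` sections of the Experiment above easier to follow, here is a small self-contained sketch of that mapping. It is not the SDK's implementation: `LogUniform`, `Choice`, `to_katib_parameters`, and `to_trial_parameters` are hypothetical stand-ins, and the output shape is copied from the manifest shown above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LogUniform:
    """Hypothetical stand-in for Search.loguniform(min, max)."""
    min: float
    max: float


@dataclass
class Choice:
    """Hypothetical stand-in for Search.choice(values)."""
    values: list


def to_katib_parameters(search_space: dict[str, Any]) -> list[dict[str, Any]]:
    """Build Experiment `parameters` entries shaped like the manifest above."""
    params = []
    for name, spec in search_space.items():
        if isinstance(spec, LogUniform):
            feasible = {
                "distribution": "logUniform",
                # The manifest stores bounds as strings, so convert here.
                "min": str(spec.min),
                "max": str(spec.max),
            }
            param_type = "double"
        elif isinstance(spec, Choice):
            feasible = {
                "distribution": "uniform",
                "list": [str(v) for v in spec.values],
            }
            param_type = "categorical"
        else:
            raise TypeError(f"unsupported search spec for {name!r}: {spec!r}")
        params.append(
            {"name": name, "parameterType": param_type, "feasibleSpace": feasible}
        )
    return params


def to_trial_parameters(search_space: dict[str, Any]) -> list[dict[str, str]]:
    """Each search parameter is referenced by a trial parameter of the same name."""
    return [{"name": name, "reference": name} for name in search_space]


if __name__ == "__main__":
    space = {"lr": LogUniform(0.01, 0.05), "num_epochs": Choice([2, 4, 5])}
    print(to_katib_parameters(space))
    print(to_trial_parameters(space))
```

The string conversion is also why `train_func` annotates `lr` and `num_epochs` as `str`: in the generated script the values arrive through the `${trialParameters.lr}` and `${trialParameters.num_epochs}` placeholders as quoted strings.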

@kramaranya
Contributor Author

/retest

@kramaranya
Contributor Author

> @kramaranya I've done required updates to make sure Katib models can work with Kubeflow SDK correctly. We still need to perform some testing, but at least I was able to create Katib Experiment:

Thank you for this, @andreyvelich! I've updated the PR with the container runtime script. Is there anything else that should be updated as part of this PR?

Member

@andreyvelich andreyvelich left a comment

Thanks @kramaranya 🎉
These modules look to be working fine for the create API: kubeflow/sdk#124

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit f3bfb48 into kubeflow:master Oct 13, 2025
88 of 89 checks passed

Development

Successfully merging this pull request may close these issues.

Separate API Models from SDK Package

2 participants