chore(models): Move models into kubeflow_katib_api package #2579
Signed-off-by: kramaranya <[email protected]>
Thank you for this @kramaranya!
I left a few comments.
        
          
api/README.md (Outdated)
    @@ -0,0 +1,3 @@
    # Kubeflow Katib API
Let's keep the folder consistent with Trainer and put it under:
api/python_api/...
        
          
hack/gen-python-api/gen-api.sh (Outdated)
    POST_GEN_PYTHON_HANDLER="hack/gen-python-api/post_gen.py"
    KATIB_VERSIONS=(v1beta1)

    # Download JAR package if file doesn't exist.
Can we please use the OpenAPI Generator container to generate the modules, like we do for Trainer?
https://github.com/kubeflow/trainer/blob/master/hack/python-api/gen-api.sh#L34-L48
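To make the suggestion concrete, here is a rough sketch of how a containerized OpenAPI Generator invocation could be composed. It is not the Trainer script itself: the function name, the paths passed to it, and the `latest` image tag are assumptions for illustration (a real script would pin a version), while `generate`, `-g`, `-i`, `-o`, and `--global-property` are standard openapi-generator-cli options. Restricting generation to models is one way to avoid emitting `api_client` and the other supporting files.

```python
from typing import List


def build_generator_command(
    container_runtime: str,
    repo_root: str,
    swagger_file: str,
    output_path: str,
    # Assumption: the image tag is illustrative; pin a specific version in practice.
    image: str = "docker.io/openapitools/openapi-generator-cli:latest",
) -> List[str]:
    """Compose the argv for a containerized OpenAPI Generator run.

    The repository is mounted at /local inside the container, so the
    input spec and the output directory are addressed relative to it.
    """
    return [
        container_runtime, "run", "--rm",
        "-v", f"{repo_root}:/local",
        image, "generate",
        "-g", "python",
        "-i", f"/local/{swagger_file}",
        "-o", f"/local/{output_path}",
        # Generate only the model classes; skip model tests and docs.
        "--global-property", "models,modelTests=false,modelDocs=false",
    ]


if __name__ == "__main__":
    # Hypothetical paths, used only to show the resulting command line.
    cmd = build_generator_command(
        "docker", "/path/to/katib", "api/openapi-spec/swagger.json", "api/python_api"
    )
    print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` as-is; taking the container runtime as a parameter keeps the same code usable with either `docker` or `podman`.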
        
          
hack/gen-python-api/post_gen.py (Outdated)
    @@ -0,0 +1,132 @@
    # Copyright 2025 The Kubeflow Authors.
You don't need this post_gen script for the modules, since we don't need api_client and the other files, as in Trainer: https://github.com/kubeflow/trainer/blob/master/hack/python-api/gen-api.sh#L50-L54
Signed-off-by: Andrey Velichkevich <[email protected]>
@kramaranya I've made the required updates to make sure the Katib models work correctly with the Kubeflow SDK.

def train_func(lr: str, num_epochs: str):
    import time
    import random
    for i in range(10):
        time.sleep(1)
        print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")
    print(f"loss={round(random.uniform(0.77, 0.99), 2)}")
OptimizerClient().optimize(
    TrainJobTemplate(
        trainer=CustomTrainer(train_func),
    ),
    search_space={
        "lr": Search.loguniform(0.01, 0.05),
        "num_epochs": Search.choice([2, 4, 5]),
    },
)

Katib Experiment

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2025-10-10T03:13:40Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: j570986812db
  namespace: default
  resourceVersion: "10962"
  uid: ea3b8710-6534-415c-a836-7153b03fbc70
spec:
  algorithm:
    algorithmName: random
  maxTrialCount: 10
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    metricStrategies:
    - name: loss
      value: min
    objectiveMetricName: loss
    type: minimize
  parallelTrialCount: 1
  parameters:
  - feasibleSpace:
      distribution: logUniform
      max: "0.05"
      min: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      distribution: uniform
      list:
      - "2"
      - "4"
      - "5"
    name: num_epochs
    parameterType: categorical
  resumePolicy: Never
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: node
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: "0"
      jobset.sigs.k8s.io/replicatedjob-name: node
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: lr
      reference: lr
    - name: num_epochs
      reference: num_epochs
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        runtimeRef:
          name: torch-distributed
        trainer:
          command:
          - bash
          - -c
          - |2-
            read -r -d '' SCRIPT << EOM
            def train_func(lr: str, num_epochs: str):
                import time
                import random
                for i in range(10):
                    time.sleep(1)
                    print(f"Training {i}, lr: {lr}, num_epochs: {num_epochs}")
                print(f"loss={round(random.uniform(0.77, 0.99), 2)}")
            train_func(**{'lr': '${trialParameters.lr}', 'num_epochs': '${trialParameters.num_epochs}'})
            EOM
            printf "%s" "$SCRIPT" > "test-iceberg.py"
            torchrun "test-iceberg.py"
status:
  conditions:
  - lastTransitionTime: "2025-10-10T03:13:41Z"
    lastUpdateTime: "2025-10-10T03:13:41Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2025-10-10T03:14:01Z"
    lastUpdateTime: "2025-10-10T03:14:01Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  pendingTrialList:
  - j570986812db-z7xbnp4m
  startTime: "2025-10-10T03:13:41Z"
  trials: 1
  trialsPending: 1
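The relationship between the `search_space` argument and the `parameters` and `trialParameters` sections of the Experiment above can be sketched roughly as follows. This is an illustration built on plain dictionaries, not the Kubeflow SDK's implementation: `loguniform`, `choice`, `to_experiment_parameters`, and `render_trial_template` are hypothetical helpers, and the field names and values simply mirror the YAML shown in this comment.

```python
from typing import Any, Dict, List, Tuple


def loguniform(low: float, high: float) -> Dict[str, Any]:
    """Hypothetical stand-in for a log-uniform search dimension."""
    return {
        "parameterType": "double",
        "feasibleSpace": {
            "distribution": "logUniform",
            "min": str(low),  # the Experiment stores bounds as strings
            "max": str(high),
        },
    }


def choice(values: List[Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a categorical search dimension."""
    return {
        "parameterType": "categorical",
        "feasibleSpace": {
            "distribution": "uniform",
            "list": [str(v) for v in values],
        },
    }


def to_experiment_parameters(
    search_space: Dict[str, Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Build the `parameters` and `trialParameters` lists from a search space."""
    parameters = [{"name": name, **spec} for name, spec in search_space.items()]
    trial_parameters = [{"name": name, "reference": name} for name in search_space]
    return parameters, trial_parameters


def render_trial_template(template: str, assignments: Dict[str, str]) -> str:
    """Fill ${trialParameters.<name>} placeholders with one trial's values."""
    for name, value in assignments.items():
        template = template.replace("${trialParameters." + name + "}", value)
    return template


if __name__ == "__main__":
    params, trial_params = to_experiment_parameters(
        {"lr": loguniform(0.01, 0.05), "num_epochs": choice([2, 4, 5])}
    )
    print(params)
    print(trial_params)
    print(render_trial_template(
        "train_func(**{'lr': '${trialParameters.lr}'})", {"lr": "0.02"}
    ))
```

Because every value is carried as a string, `train_func` in the example is annotated with `str` arguments and would need to cast them before using them numerically.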
Signed-off-by: kramaranya <[email protected]>
/retest

Thank you @andreyvelich for this! I've updated the PR with the container runtime script. Is there anything else that should be updated as part of this PR?
Thanks @kramaranya 🎉
These modules appear to work fine for the create API: kubeflow/sdk#124
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: andreyvelich
What this PR does / why we need it:
Move the generated Python models from sdk/python/v1beta1/kubeflow/katib/models into a new api/kubeflow_katib_api/models package. This is needed so that we can implement the new OptimizerClient and import those models from the Kubeflow SDK.

Which issue(s) this PR fixes:
Fixes #2577
cc @kubeflow/kubeflow-sdk-team @kubeflow/wg-automl-leads