
Katib v1alpha2 API for CRDs #381

Merged: 6 commits into kubeflow:master, Mar 8, 2019

Conversation

@richardsliu (Contributor) commented Feb 15, 2019

@YujiOshima @gaocegege @johnugeorge @alexandraj777 @hougangliu @xyhuang

This is an initial proposal for the Katib v1alpha2 API. The changes here reflect the discussion in #370.

Comments and suggestions are welcome.

Please note that the NAS APIs are not included here since the feature is still in an early phase.



@hougangliu (Member)

@richardsliu Thanks! Some comments.

@xyhuang (Member) commented Feb 15, 2019

@richardsliu are you planning to have a separate PR for NAS?

@johnugeorge (Member)

Shall we make this PR a complete v1alpha2 API so that it will be easier to review/implement?

API modifications that are not included:

  1. Do we need any extra status fields w.r.t. "Katib status should return optimal parameter values" (#356)?
  2. "Manual suggest" (#352) proposes ReuseStudyID @YujiOshima
  3. NAS-specific fields are missing @andreyvelich
  4. "Support worker/metricsCollector template from specified configmap + path" (#349) proposes configMap-related fields @hougangliu

type WorkerCondition struct {
    WorkerID  string    `json:"workerId,omitempty"`
    Kind      string    `json:"kind,omitempty"`
    Condition Condition `json:"condition,omitempty"`
Review comment from a Contributor:

What worker conditions are allowed right now? When tracking failed trials I think it is important to distinguish between trials that failed for reasons independent of the suggested parameters (e.g. k8s killed the pod due to resource constraints on a node) and trials that failed due to the suggested parameters (e.g. the learning rate was too high and the loss blew up). The first type of failure is worth retrying, while the second type should be avoided in future.
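For illustration, a sketch of how that distinction could be encoded (the type and constants are invented, not part of the proposal):

```go
// Invented sketch, not part of the proposal: classify why a worker failed
// so the controller can decide whether a retry makes sense.
type FailureReason string

const (
    // Retryable: the pod was evicted or the node died; the same
    // parameters may well succeed on another node.
    FailureReasonInfrastructure FailureReason = "Infrastructure"
    // Not retryable: training diverged (e.g. loss became NaN), so the
    // suggestion service should steer away from these parameters.
    FailureReasonParameters FailureReason = "Parameters"
)
```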

    ObjectiveValue *float64    `json:"objectiveValue,omitempty"`
    StartTime      metav1.Time `json:"startTime,omitempty"`
    CompletionTime metav1.Time `json:"completionTime,omitempty"`
}
Review comment from a Contributor:

How do you feel about allowing the option to add metadata to a trial? For example, when a trial fails we'll often add a message containing the error so we can figure out why that parameter combination didn't work (e.g. if it's a GPU OOM error then our model is too large and we might reduce the batch size or the number of neurons; if it's a loss=NaN error we reduce the maximum learning rate).
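For illustration, a sketch of where such metadata could live (the struct name and the Metadata field are invented; the other fields mirror the excerpt above):

```go
// Invented sketch: the struct name and Metadata field are illustrative;
// the other fields mirror the excerpt above. metav1 is the usual alias
// for k8s.io/apimachinery/pkg/apis/meta/v1.
type TrialStatus struct {
    ObjectiveValue *float64    `json:"objectiveValue,omitempty"`
    StartTime      metav1.Time `json:"startTime,omitempty"`
    CompletionTime metav1.Time `json:"completionTime,omitempty"`
    // e.g. {"failureMessage": "CUDA out of memory"} for a failed trial.
    Metadata map[string]string `json:"metadata,omitempty"`
}
```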

    OptimizationGoal   *float64 `json:"optimizationGoal,omitempty"`
    ObjectiveValueName string   `json:"objectiveValueName,omitempty"`
    MaxSuggestionCount int      `json:"maxSuggestionCount,omitempty"`
    MetricsNames       []string `json:"metricsNames,omitempty"`
Review comment from a Contributor:

Based on our experience using SigOpt, I think the user should be able to set an ObservationBudget, which sets the number of trials they expect to run, independent of the MaxSuggestionCount, which sets the maximum number of trials the experiment runs before stopping. There are two advantages to this (a sketch follows the list):

  1. If the HPO still hasn't found a good model when it hits the ObservationBudget, we will often leave a study running until it reaches a reasonable value. It would be nice to allow studies to just keep running in this scenario without us having to restart them manually.

  2. We often change the ObservationBudget to influence the explore-vs-exploit trade-off in the HPO algorithm. If the algorithm thinks it has fewer observations left to find a good solution, it will explore the hyperparameter space more aggressively. The normal use case for this is when we are happy to trade a little long-term accuracy for finding a decent model quickly.
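For concreteness, a sketch of the two limits side by side (ObservationBudget and the struct name are invented; MaxSuggestionCount is from the proposal):

```go
// Invented sketch (struct name included): the two limits side by side.
type ExperimentLimits struct {
    // Soft target, not in the proposal: how many trials the user expects
    // to need. The algorithm can plan its explore-vs-exploit trade-off
    // around it, and the study may keep running past it.
    ObservationBudget int `json:"observationBudget,omitempty"`
    // Hard cap, from the proposal: stop once this many suggestions have
    // been generated.
    MaxSuggestionCount int `json:"maxSuggestionCount,omitempty"`
}
```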

@richardsliu (Contributor, Author)

Thanks for the initial comments. I've made a few changes.

With regards to NAS (@xyhuang), I would like to keep that API design separate from this effort. Whenever we finalize this API, I will add the NAS fields to the schema, with the understanding that NAS is an alpha feature and can change in incompatible ways.

Please continue to provide feedback. Let's aim to have the API structure stabilized by mid-March.

@jdplatt (Contributor) commented Feb 17, 2019

I'm still a little confused about the relationship between a Trial and a Worker. Am I right to assume that, as in the Vizier paper, a Trial is meant to be an evaluation of a single suggestion, and a Worker is meant to be the process that runs a Trial? In that case I am unsure:

  1. Why can there be multiple workers attached to a single trial?
  2. Why is the objective value inside the WorkerMetadata struct and not the Trial struct? If the goal of a Trial is to calculate the objective value for a given suggestion, then it might be worth putting the resulting value in the Trial rather than in the worker that ran it.
  3. What does the Kind field in WorkerMetadata mean?

@cvenets commented Feb 17, 2019

I agree with @jdplatt's comments above. Also, if a Trial is actually a single run with specific parameters, why don't we call it a 'Run' rather than a 'Trial'? Wouldn't that make it easier for users to understand what it is?

And since @richardsliu asked regarding naming: I think 'Study' is a very Vizier-specific term. How would people feel about simply calling it 'Optimization', which seems to be a more generic and well-known term?

@jdplatt (Contributor) commented Feb 18, 2019

Based on the discussion in #386 I think we also need to add the ability to track what order the trials were run in. We could do this by adding an index field to Trial indicating the order, or by inferring the order later from the start/completion times. Tracking the order the trials were generated in will make it possible to monitor a running study and see how the metric is improving over time. We look at these sorts of plots a lot to decide when to end a study (for example, if a new best model hasn't been found in a while).
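A sketch of the index-field option (the field is invented for illustration):

```go
// Invented sketch: an Index field on Trial. A monotonically increasing
// sequence number assigned at creation stays unambiguous when trials run
// in parallel, so "best metric so far vs. trial number" plots are easy.
type Trial struct {
    // ... existing fields ...
    Index int `json:"index,omitempty"`
}
```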

@richardsliu (Contributor, Author)

@jdplatt Your understanding is correct. A "Trial" is a Vizier concept mapping to one evaluation of a suggestion, whereas a "Worker" is a k8s resource that runs the Trial.

  1. According to https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L558, each Trial should correspond to exactly one Worker. @YujiOshima can you confirm this?

  2. This is probably because the objective value is calculated by each Worker and fetched by the metrics collector (which maps to workers and not trials). Perhaps we can combine/restructure this.

  3. Kind refers to the type of worker: "Job" for a generic k8s batch Job, "TFJob" and "PytorchJob" for Kubeflow distributed training jobs. We are also looking into making this more generic, see "Make Katib generic for operator support" #341.
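For illustration, those accepted values written as constants (the constant names are invented; the string values are the ones listed above):

```go
// Constant names invented for illustration; the string values are the
// ones listed above.
const (
    WorkerKindJob        = "Job"        // generic Kubernetes batch job
    WorkerKindTFJob      = "TFJob"      // Kubeflow distributed TensorFlow job
    WorkerKindPyTorchJob = "PytorchJob" // Kubeflow distributed PyTorch job
)
```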

@hougangliu (Member) commented Feb 19, 2019

> According to https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L558, each Trial should correspond to exactly one Worker. @YujiOshima can you confirm this?

#352 aims to reuse trials, so multiple workers may belong to one trial.
BTW, currently when a worker fails we mark the studyjob failed, too. But in some cases, if we re-create another worker, everything may work well (for example, if the worker was killed by an eviction policy or a node error, rescheduling it to another node can succeed). Maybe for a trial, when a worker fails, we can re-create another one until a max-fail count is reached before marking the studyjob Failed. In that case, multiple workers (at most max-fail count) could belong to one trial; a sketch follows below. I logged #390 to track the discussion.

@jlewi (Contributor) commented Feb 19, 2019

One alternative to Study would be to call it an Experiment.

The term Experiment appears in other places; I think Kubeflow Pipelines and MLflow both use it. I don't know if that is a good thing or a bad thing.

@jdplatt (Contributor) commented Feb 19, 2019

@richardsliu I think some of the confusion over Trials vs Workers is because Katib uses workers differently than Vizier. Below is a figure from the Vizier paper showing the logic inside a worker.

[Figure: worker logic from the Vizier paper]

Each worker runs for the length of a Study and ends up running many trials. However, in Katib a worker runs only for the life of a single trial.

  1. Is Katib meant to be an open source version of Vizier, or do you foresee the project going in its own direction over time?
  2. If we stick with the current approach in Katib, where workers and trials have the same lifecycle, can we just collapse them into a single object?

@jdplatt (Contributor) commented Feb 19, 2019

@jlewi Sigopt uses the term Experiment instead of Study as well.

@jdplatt (Contributor) commented Feb 19, 2019

Another feature I think we should add to the new API is the ability to scale the feasible parameter space (e.g. #224). It is really common to use log scaling on parameters such as the learning rate or a regularization coefficient.
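A sketch of how a scaling option could hang off a parameter's feasible space (the Scaling field and its values are invented for illustration):

```go
// Invented sketch: the Scaling field and its values are illustrative.
type FeasibleSpace struct {
    Min string `json:"min,omitempty"`
    Max string `json:"max,omitempty"`
    // "linear" (default) or "log"; with "log" the suggestion service
    // samples uniformly between log10(Min) and log10(Max), which suits
    // learning rates spanning several orders of magnitude.
    Scaling string `json:"scaling,omitempty"`
}
```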

@hougangliu (Member)

> Each worker runs for the length of a Study and ends up running many trials. However, in Katib a worker runs only for the life of a single trial.

I think Katib is consistent with Vizier; you can take RunTrial(trial) as a worker lifecycle in Katib.

@richardsliu (Contributor, Author)

@jdplatt In response to your questions:

  1. I see Katib as a project inspired by Vizier, but not necessarily the "open source version of Vizier". My long-term vision is for it to evolve into an open source HP-tuning/NAS/AutoML service that integrates well with Kubernetes. @YujiOshima what do you think?

  2. I agree with @hougangliu's suggestion - we can collapse the vocabulary and just use "Trial/RunTrial".

@johnugeorge (Member)

> @jdplatt In response to your questions:
>
>   1. I see Katib as a project inspired by Vizier, but not necessarily the "open source version of Vizier". My long-term vision is for it to evolve into an open source HP-tuning/NAS/AutoML service that integrates well with Kubernetes. @YujiOshima what do you think?
>   2. I agree with @hougangliu's suggestion - we can collapse the vocabulary and just use "Trial/RunTrial".

I agree. Though the initial design was inspired by Vizier, evolving it into a k8s-native solution with the best user experience is more important in the longer run.

@YujiOshima (Contributor)

@jdplatt @richardsliu I'm sorry for the late reply.

> I see Katib as a project inspired by Vizier, but not necessarily the "open source version of Vizier". My long-term vision is for it to evolve into an open source HP-tuning/NAS/AutoML service that integrates well with Kubernetes. @YujiOshima what do you think?

I agree. We do not need to stick to the design of Vizier.
In my first design, a Trial is only a set of parameters, and a Worker is an evaluation process for a Trial. So if you need to evaluate one set of parameters multiple times, you can make multiple Workers for a Trial.
Though I think we can use another name instead of Worker, we still need a concept for an evaluation process that is independent of the Trial, because the objective value and metrics may change between runs even for the same trial.

To avoid confusion, how about using consistent names for resources and processes?
Currently:

| Resource | Process  |
|----------|----------|
| Study    | StudyJob |
| Trial    | Worker   |

Suggestion:

| Resource            | Process                   |
|---------------------|---------------------------|
| Study (Experiment?) | StudyRun (ExperimentRun?) |
| Trial               | TrialRun                  |

@hougangliu (Member)

/lgtm

@k8s-ci-robot removed the lgtm label Mar 5, 2019
@richardsliu (Contributor, Author)

  • Renamed "hyperparameter" and "suggestionparameter" to "ParameterAssignment".
  • Restructured AlgorithmSpec

type AlgorithmSpec struct {
    AlgorithmName string `json:"algorithmName,omitempty"`
    // Key-value pairs for hyperparameters and assignment values.
    ParameterAssignments []trial.ParameterAssignment `json:"parameterAssignments"`
Review comment from a Member:

These are algorithm-specific parameters, e.g. https://github.com/kubeflow/katib/blob/master/examples/grid-example.yaml#L41.

Will this be confusing alongside the ParameterAssignment of a Trial, given that both are key-value struct types?

Review comment from a Contributor:

I agree with @johnugeorge.
In the gRPC API, we distinguish these concepts:
ParameterAssignment of a Trial: https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto#L284
Parameter for the Suggestion service: https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto#L338

It is also key-value, but we should use another name. How about AlgorithmParameterAssignment?

@richardsliu (Contributor, Author) replied on Mar 6, 2019:

I think it should be something like AlgorithmSettings. "Assignment" is for assigning HP values, which should not be confused with internal configuration settings for suggestion algorithms (sorry if my previous answer was misleading). So maybe it is better to avoid terms like "parameter" and "assignment" here entirely.
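For illustration, a sketch of that direction (field and type names are tentative until settled in the PR):

```go
// Sketch of the AlgorithmSettings direction: plain key-value settings for
// the suggestion algorithm, named so they cannot be confused with a
// trial's ParameterAssignment. Field names are tentative.
type AlgorithmSetting struct {
    Name  string `json:"name,omitempty"`
    Value string `json:"value,omitempty"`
}

type AlgorithmSpec struct {
    AlgorithmName     string             `json:"algorithmName,omitempty"`
    AlgorithmSettings []AlgorithmSetting `json:"algorithmSettings,omitempty"`
}
```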

@richardsliu (Contributor, Author)

Some small fixes:

  • Added "retainHistoricalData" to Experiment spec
  • Added a type for "AlgorithmSetting" to avoid confusion with HP assignments

Everyone please take a look and lgtm if you think the API looks ok. We can still make minor edits after merging this PR.

@richardsliu (Contributor, Author)

/hold

@richardsliu changed the title from "WIP: Katib v1alpha2 API proposal" to "Katib v1alpha2 API for CRDs" Mar 6, 2019
@YujiOshima (Contributor)

@richardsliu Thank you!
/lgtm

@johnugeorge (Member)

LGTM
Nitpick: we might not need to repeat the "Algorithm" term in the AlgorithmSpec fields. We can change this later, too.

@alexandraj777 (Contributor) left a comment:

Looking good!

@richardsliu (Contributor, Author)

Thanks everyone!
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 61451ef into kubeflow:master Mar 8, 2019
@YujiOshima mentioned this pull request Mar 8, 2019
@jlewi mentioned this pull request Mar 10, 2019