
Katib v1alpha2 API for CRDs #381

Merged: 6 commits into kubeflow:master, Mar 8, 2019

Conversation

@richardsliu (Contributor) commented Feb 15, 2019

@YujiOshima @gaocegege @johnugeorge @alexandraj777 @hougangliu @xyhuang

This is an initial proposal for the Katib v1alpha2 API. The changes here reflect the discussion in #370.

Comments and suggestions are welcome.

Please note that the NAS APIs are not included here since the feature is still in an early phase.



@hougangliu (Member)

@richardsliu Thanks! Some comments.

@xyhuang (Member) commented Feb 15, 2019

@richardsliu are you planning to have a separate PR for NAS?

@johnugeorge (Member)

Shall we make this PR a complete v1alpha2 API so that it will be easier to review/implement?

API modifications that are not included:

  1. Do we need any extra status fields w.r.t. "Katib status should return optimal parameter values" (#356)?
  2. "Manual suggest" (#352) proposes ReuseStudyID @YujiOshima
  3. NAS-specific fields are missing @andreyvelich
  4. "Support worker/metricsCollector template from specified configmap + path" (#349) proposes configMap-related fields @hougangliu

type WorkerCondition struct {
    WorkerID  string    `json:"workerId,omitempty"`
    Kind      string    `json:"kind,omitempty"`
    Condition Condition `json:"condition,omitempty"`
Review comment from a Contributor:

What worker conditions are allowed right now? When tracking failed trials I think it is important to distinguish between trials that failed for reasons independent of the suggested parameters (e.g. k8s killed the pod due to resource constraints on a node) and trials that failed due to the suggested parameters (e.g. the learning rate was too high and the loss blew up). The first type of failure is worth retrying, while the second type should be avoided in future.
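For illustration, a sketch of how that distinction could be encoded (the type and constants are invented, not part of the proposal):

```go
// Invented sketch, not part of the proposal: classify why a worker failed
// so the controller can decide whether a retry makes sense.
type FailureReason string

const (
    // Retryable: the pod was evicted or the node died; the same
    // parameters may well succeed on another node.
    FailureReasonInfrastructure FailureReason = "Infrastructure"
    // Not retryable: training diverged (e.g. loss became NaN), so the
    // suggestion service should steer away from these parameters.
    FailureReasonParameters FailureReason = "Parameters"
)
```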

    ObjectiveValue *float64    `json:"objectiveValue,omitempty"`
    StartTime      metav1.Time `json:"startTime,omitempty"`
    CompletionTime metav1.Time `json:"completionTime,omitempty"`
}
Review comment from a Contributor:

How do you feel about allowing the option to add metadata to a trial? For example, when a trial fails we'll often add a message containing the error so we can figure out why that parameter combination didn't work (e.g. if it's a GPU OOM error then our model is too large and we might reduce the batch size or the number of neurons; if it's a loss=NaN error we reduce the maximum learning rate).
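For illustration, a sketch of where such metadata could live (the struct name and the Metadata field are invented; the other fields mirror the excerpt above):

```go
// Invented sketch: the struct name and Metadata field are illustrative;
// the other fields mirror the excerpt above. metav1 is the usual alias
// for k8s.io/apimachinery/pkg/apis/meta/v1.
type TrialStatus struct {
    ObjectiveValue *float64    `json:"objectiveValue,omitempty"`
    StartTime      metav1.Time `json:"startTime,omitempty"`
    CompletionTime metav1.Time `json:"completionTime,omitempty"`
    // e.g. {"failureMessage": "CUDA out of memory"} for a failed trial.
    Metadata map[string]string `json:"metadata,omitempty"`
}
```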

    OptimizationGoal   *float64 `json:"optimizationGoal,omitempty"`
    ObjectiveValueName string   `json:"objectiveValueName,omitempty"`
    MaxSuggestionCount int      `json:"maxSuggestionCount,omitempty"`
    MetricsNames       []string `json:"metricsNames,omitempty"`
Review comment from a Contributor:

Based on our experience using SigOpt, I think the user should be able to set an ObservationBudget, which sets the number of trials they expect to run, independent of the MaxSuggestionCount, which sets the maximum number of trials the experiment runs before stopping. There are two advantages to this (a sketch follows the list):

  1. If the HPO still hasn't found a good model when it hits the ObservationBudget, we will often leave a study running until it reaches a reasonable value. It would be nice to allow studies to just keep running in this scenario without us having to restart them manually.

  2. We often change the ObservationBudget to influence the explore-vs-exploit trade-off in the HPO algorithm. If the algorithm thinks it has fewer observations left to find a good solution, it will explore the hyperparameter space more aggressively. The normal use case for this is when we are happy to trade a little long-term accuracy for finding a decent model quickly.
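For concreteness, a sketch of the two limits side by side (ObservationBudget and the struct name are invented; MaxSuggestionCount is from the proposal):

```go
// Invented sketch (struct name included): the two limits side by side.
type ExperimentLimits struct {
    // Soft target, not in the proposal: how many trials the user expects
    // to need. The algorithm can plan its explore-vs-exploit trade-off
    // around it, and the study may keep running past it.
    ObservationBudget int `json:"observationBudget,omitempty"`
    // Hard cap, from the proposal: stop once this many suggestions have
    // been generated.
    MaxSuggestionCount int `json:"maxSuggestionCount,omitempty"`
}
```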

@richardsliu (Contributor, Author)

Thanks for the initial comments. I've made a few changes.

With regards to NAS (@xyhuang), I would like to keep that API design separate from this effort. Whenever we finalize this API, I will add the NAS fields to the schema, with the understanding that NAS is an alpha feature and can change in incompatible ways.

Please continue to provide feedback. Let's aim to have the API structure stabilized by mid-March.

@jdplatt (Contributor) commented Feb 17, 2019

I'm still a little confused about the relationship between a Trial and a Worker. Am I right to assume that, as in the Vizier paper, a Trial is meant to be an evaluation of a single suggestion, and a Worker is meant to be the process that runs a Trial? In that case I am unsure:

  1. Why can there be multiple workers attached to a single trial?
  2. Why is the objective value inside the WorkerMetadata struct and not the Trial struct? If the goal of a Trial is to calculate the objective value for a given suggestion, then it might be worth putting the resulting value in the Trial rather than in the worker that ran it.
  3. What does the Kind field in WorkerMetadata mean?

@cvenets commented Feb 17, 2019

I agree with @jdplatt's comments above. Also, if a Trial is actually a single run with specific parameters, why don't we call it a 'Run' rather than a 'Trial'? Wouldn't that make it easier for users to understand what it is?

And since @richardsliu asked regarding naming: I think 'Study' is a very Vizier-specific term. How would people feel about simply calling it 'Optimization', which seems to be a more generic and well-known term?

@jdplatt (Contributor) commented Feb 18, 2019

Based on the discussion in #386 I think we also need to add the ability to track what order the trials were run in. We could do this by adding an index field to Trial indicating the order, or by inferring the order later from the start/completion times. Tracking the order the trials were generated in will make it possible to monitor a running study and see how the metric is improving over time. We look at these sorts of plots a lot to decide when to end a study (for example, if a new best model hasn't been found in a while).
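A sketch of the index-field option (the field is invented for illustration):

```go
// Invented sketch: an Index field on Trial. A monotonically increasing
// sequence number assigned at creation stays unambiguous when trials run
// in parallel, so "best metric so far vs. trial number" plots are easy.
type Trial struct {
    // ... existing fields ...
    Index int `json:"index,omitempty"`
}
```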

@richardsliu (Contributor, Author)

@jdplatt Your understanding is correct. A "Trial" is a Vizier concept mapping to one evaluation of a suggestion, whereas a "Worker" is a k8s resource that runs the Trial.

  1. According to https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L558, each Trial should correspond to exactly one Worker. @YujiOshima can you confirm this?

  2. This is probably because the objective value is calculated by each Worker and fetched by the metrics collector (which maps to workers and not trials). Perhaps we can combine/restructure this.

  3. Kind refers to the type of worker: "Job" for a generic k8s batch Job, "TFJob" and "PytorchJob" for Kubeflow distributed training jobs. We are also looking into making this more generic, see "Make Katib generic for operator support" #341.
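For illustration, those accepted values written as constants (the constant names are invented; the string values are the ones listed above):

```go
// Constant names invented for illustration; the string values are the
// ones listed above.
const (
    WorkerKindJob        = "Job"        // generic Kubernetes batch job
    WorkerKindTFJob      = "TFJob"      // Kubeflow distributed TensorFlow job
    WorkerKindPyTorchJob = "PytorchJob" // Kubeflow distributed PyTorch job
)
```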

@hougangliu (Member) commented Feb 19, 2019

> According to https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/studyjob_controller.go#L558, each Trial should correspond to exactly one Worker. @YujiOshima can you confirm this?

#352 aims to reuse trials, so multiple workers may belong to one trial.
BTW, currently when a worker fails we mark the studyjob failed, too. But in some cases, if we re-create another worker, everything may work well (for example, if the worker was killed by an eviction policy or a node error, rescheduling it to another node can succeed). Maybe for a trial, when a worker fails, we can re-create another one until a max-fail count is reached before marking the studyjob Failed. In that case, multiple workers (at most max-fail count) could belong to one trial; a sketch follows below. I logged #390 to track the discussion.

@jlewi (Contributor) commented Feb 19, 2019

One alternative to Study would be to call it an Experiment.

The term Experiment appears in other places; I think Kubeflow Pipelines and MLflow both use it. I don't know if that is a good thing or a bad thing.

@jdplatt (Contributor) commented Feb 19, 2019

@richardsliu I think some of the confusion over Trials vs Workers is because Katib uses workers differently than Vizier. Below is a figure from the Vizier paper showing the logic inside a worker.

[Figure: worker logic from the Vizier paper]

Each worker runs for the length of a Study and ends up running many trials. However, in Katib a worker runs only for the life of a single trial.

  1. Is Katib meant to be an open source version of Vizier, or do you foresee the project going in its own direction over time?
  2. If we stick with the current approach in Katib, where workers and trials have the same lifecycle, can we just collapse them into a single object?

@jdplatt (Contributor) commented Feb 19, 2019

@jlewi Sigopt uses the term Experiment instead of Study as well.

@jdplatt (Contributor) commented Feb 19, 2019

Another feature I think we should add to the new API is the ability to scale the feasible parameter space (e.g. #224). It is really common to use log scaling on parameters such as the learning rate or a regularization coefficient.
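A sketch of how a scaling option could hang off a parameter's feasible space (the Scaling field and its values are invented for illustration):

```go
// Invented sketch: the Scaling field and its values are illustrative.
type FeasibleSpace struct {
    Min string `json:"min,omitempty"`
    Max string `json:"max,omitempty"`
    // "linear" (default) or "log"; with "log" the suggestion service
    // samples uniformly between log10(Min) and log10(Max), which suits
    // learning rates spanning several orders of magnitude.
    Scaling string `json:"scaling,omitempty"`
}
```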

@hougangliu (Member)

> Each worker runs for the length of a Study and ends up running many trials. However, in Katib a worker runs only for the life of a single trial.

I think Katib is consistent with Vizier; you can take RunTrial(trial) as a worker lifecycle in Katib.

@richardsliu (Contributor, Author)

@jdplatt In response to your questions:

  1. I see Katib as a project inspired by Vizier, but not necessarily the "open source version of Vizier". My long-term vision is for it to evolve into an open source HP-tuning/NAS/AutoML service that integrates well with Kubernetes. @YujiOshima what do you think?

  2. I agree with @hougangliu's suggestion - we can collapse the vocabulary and just use "Trial/RunTrial".

@johnugeorge (Member)

> @jdplatt In response to your questions:
>
>   1. I see Katib as a project inspired by Vizier, but not necessarily the "open source version of Vizier". My long-term vision is for it to evolve into an open source HP-tuning/NAS/AutoML service that integrates well with Kubernetes. @YujiOshima what do you think?
>   2. I agree with @hougangliu's suggestion - we can collapse the vocabulary and just use "Trial/RunTrial".

I agree. Though the initial design was inspired by Vizier, evolving it into a k8s-native solution with the best user experience is more important in the longer run.

@YujiOshima (Contributor)

@jdplatt @richardsliu I'm sorry for the late reply.

> I see Katib as a project inspired by Vizier, but not necessarily the "open source version of Vizier". My long-term vision is for it to evolve into an open source HP-tuning/NAS/AutoML service that integrates well with Kubernetes. @YujiOshima what do you think?

I agree. We do not need to stick to the design of Vizier.
In my first design, a Trial is only a set of parameters, and a Worker is an evaluation process for a Trial. So if you need to evaluate one set of parameters multiple times, you can make multiple Workers for a Trial.
Though I think we can use another name instead of Worker, we still need a concept for an evaluation process that is independent of the Trial, because the objective value and metrics may change between runs even for the same trial.

To avoid confusion, how about using consistent names for resources and processes?
Currently:

| Resource | Process  |
|----------|----------|
| Study    | StudyJob |
| Trial    | Worker   |

Suggestion:

| Resource            | Process                   |
|---------------------|---------------------------|
| Study (Experiment?) | StudyRun (ExperimentRun?) |
| Trial               | TrialRun                  |

@hougangliu (Member)

/lgtm

@k8s-ci-robot removed the lgtm label Mar 5, 2019
@richardsliu (Contributor, Author)

  • Renamed "hyperparameter" and "suggestionparameter" to "ParameterAssignment".
  • Restructured AlgorithmSpec

type AlgorithmSpec struct {
    AlgorithmName string `json:"algorithmName,omitempty"`
    // Key-value pairs for hyperparameters and assignment values.
    ParameterAssignments []trial.ParameterAssignment `json:"parameterAssignments"`
Review comment from a Member:

These are algorithm-specific parameters, e.g. https://github.com/kubeflow/katib/blob/master/examples/grid-example.yaml#L41.

Will this be confusing alongside the ParameterAssignment of a Trial, given that both are key-value struct types?

Review comment from a Contributor:

I agree with @johnugeorge.
In the gRPC API, we distinguish these concepts:
ParameterAssignment of a Trial: https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto#L284
Parameter for the Suggestion service: https://github.com/kubeflow/katib/blob/master/pkg/api/api.proto#L338

It is also key-value, but we should use another name. How about AlgorithmParameterAssignment?

@richardsliu (Contributor, Author) replied on Mar 6, 2019:

I think it should be something like AlgorithmSettings. "Assignment" is for assigning HP values, which should not be confused with internal configuration settings for suggestion algorithms (sorry if my previous answer was misleading). So maybe it is better to avoid terms like "parameter" and "assignment" here entirely.
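For illustration, a sketch of that direction (field and type names are tentative until settled in the PR):

```go
// Sketch of the AlgorithmSettings direction: plain key-value settings for
// the suggestion algorithm, named so they cannot be confused with a
// trial's ParameterAssignment. Field names are tentative.
type AlgorithmSetting struct {
    Name  string `json:"name,omitempty"`
    Value string `json:"value,omitempty"`
}

type AlgorithmSpec struct {
    AlgorithmName     string             `json:"algorithmName,omitempty"`
    AlgorithmSettings []AlgorithmSetting `json:"algorithmSettings,omitempty"`
}
```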

@richardsliu (Contributor, Author)

Some small fixes:

  • Added "retainHistoricalData" to Experiment spec
  • Added a type for "AlgorithmSetting" to avoid confusion with HP assignments

Everyone please take a look and lgtm if you think the API looks ok. We can still make minor edits after merging this PR.

@richardsliu (Contributor, Author)

/hold

@richardsliu changed the title from "WIP: Katib v1alpha2 API proposal" to "Katib v1alpha2 API for CRDs" Mar 6, 2019
@YujiOshima (Contributor)

@richardsliu Thank you!
/lgtm

@johnugeorge (Member)

LGTM
Nitpick: we might not need to repeat the "Algorithm" term in the AlgorithmSpec fields. We can change this later, too.

@alexandraj777 (Contributor) left a comment:

Looking good!

@richardsliu (Contributor, Author)

Thanks everyone!
/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: richardsliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 61451ef into kubeflow:master Mar 8, 2019
@YujiOshima mentioned this pull request Mar 8, 2019
@jlewi mentioned this pull request Mar 10, 2019