Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation for NAS job in Katib controller #398

Merged

Conversation

andreyvelich
Copy link
Member

@andreyvelich andreyvelich commented Feb 20, 2019

Fixes: #372.
As well, I added validation for HP job.
/assign @hougangliu @YujiOshima


This change is Reviewable

return fmt.Errorf("Missing OutputSize in NasConfig.GraphConfig: %v", instance.Spec.NasConfig.GraphConfig)
}

if instance.Spec.NasConfig.GraphConfig.NumLayers == 0 {
Copy link
Member

@hougangliu hougangliu Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumLayers is not always in need for all NAS (ENAS need it). I think we need validate NAS config considering suggestion algorithm.
Please also verify if other parameters are general or only ENAS-specified

Copy link
Member Author

@andreyvelich andreyvelich Feb 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hougangliu I think we can check these parameters exactly for Reinforcement Learning algorithm for now. For NumLayers and Step here:

if jobType == jobTypeNAS && pc.Feasible.Step == "" && pc.ParameterType == katibv1alpha1.ParameterTypeDouble {
, we should validate only for RL algorithm.
What do you think about it?

@andreyvelich
Copy link
Member Author

/retest

if reflect.DeepEqual(pc.Feasible, katibv1alpha1.FeasibleSpace{}) {
return fmt.Errorf("Missing Feasible in ParameterConfig: %v", pc)
}
if pc.ParameterType == katibv1alpha1.ParameterTypeCategorical {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not only categorical but discrete.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In discrete we also use only List structure or something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently discrete only need List. But we may need more type for discretes, int or double.
Please treated in the same way as Categorical for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return validJobErr
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move validateHPJob and validateNASJob into validateStudy, and move validateStudy away from initializeStudy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return err
}
return nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return validateParameterConfigs to replace line 215-219

return validErr
}

if getJobType(instance) != jobTypeNAS {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use suggestionSpec.suggestionAlgorithm to return jobTypeNAS or jobTypeHP in getJobType to handle the invalid studyjob case: .Spec.NasConfig is not nil and suggestionSpec.suggestionAlgorithm is not nas-like

PyTorchJobWorker = "PyTorchJob"
jobTypeNAS = "NAS"
jobTypeHP = "HP"
suggestionAlgorithmRL = "nasrl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/suggestionAlgorithmRL/suggestionNASRL

return fmt.Errorf("Missing Max in ParameterConfig.Feasible: %v", pc.Feasible)
}
if jobType == jobTypeNAS && pc.Feasible.Step == "" && pc.ParameterType == katibv1alpha1.ParameterTypeDouble && suggestionAlgorithm == suggestionAlgorithmRL {
return fmt.Errorf("Missing Step in ParameterConfig.Feasible for NAS job: %v", pc.Feasible)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define a suggestionNASList (now only nasrl is in)
s/suggestionAlgorithm == suggestionAlgorithmRL/contains(suggestionNASList, suggestionAlgorithm)
so that when we add other nas suggestion, we only need update suggestionNASList

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Better way to handle this?

Change GetJobType function
Make a list with NAS algorithms
@andreyvelich
Copy link
Member Author

@hougangliu Thank you! I changed it.

@@ -237,7 +251,7 @@ func validateNASJob(instance *katibv1alpha1.StudyJob) error {
return fmt.Errorf("Missing OutputSize in NasConfig.GraphConfig: %v", instance.Spec.NasConfig.GraphConfig)
}

if instance.Spec.NasConfig.GraphConfig.NumLayers == 0 && instance.Spec.SuggestionSpec.SuggestionAlgorithm == suggestionAlgorithmRL {
if instance.Spec.NasConfig.GraphConfig.NumLayers == 0 && contains(suggestionNASList, instance.Spec.SuggestionSpec.SuggestionAlgorithm) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think line 298 can use contains(suggestionNASList,..), but here I don't think so.
If we add another nas suggestion, it may not need NumLayers.

Copy link
Member Author

@andreyvelich andreyvelich Feb 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hougangliu You think we should specify this block only for suggestionNASRL ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a better place to put algorithm specific checks other than the controller? This list can grow over time with more algorithms and these checks are very specific to the algorithm implementation. @andreyvelich @hougangliu Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, it is important to check it in the controller. We don't need to run getSuggestion if something wrong with yaml file. We can't set failed condition for the StudyJob inside Suggestion. Controller do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should do algorithm-specific check in suggestion service.

  • Add ValidateSuggestionParameter API to Suggestion service
    or
  • GetSuggesiton returns a specific error code if the config is invalid then the controller handle the error code

@andreyvelich @johnugeorge @hougangliu Thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YujiOshima In that case, should we send StudyConfig to ValidateSuggestionParameters, because we don't save StudyConfig in db, if it is not valid.
We can't GetStudy like here: https://github.com/kubeflow/katib/blob/master/pkg/suggestion/nasrl_service.py#L219 ?
As well, should we put this function after this line: https://github.com/kubeflow/katib/blob/master/pkg/controller/studyjob/katib_api_util.go#L43 ?
We generate StudyConfig and can send it to Suggestion to check, whether it is valid or not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich You're right. StudyConf and SuggestionParams should be included in an args of ValidateSuggestionParameters.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YujiOshima Ok. In this PR, should I make a validation only for NAS RL service, or we should include BO here, as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich Only for NAS RL is enough here. If you could add validation for BO in another PR, it would be fantastic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YujiOshima @hougangliu What do you think about name of this API: ValidateStudyParameters ?
It looks confused with ValidateSuggestionParameters since we validate StudyConfig and SuggestionParameters in this function

)

var ValidWorkerKindList = [...]string{DefaultJobWorker, TFJobWorker, PyTorchJobWorker}

var suggestionNASList = []string{suggestionNASRL}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our goal is to keep controller independent of worker type and suggestion algos. Currently there are no worker specific code in the code(except for watch. Hope, it can be fixed soon and ValidWorkerKindList won't not needed any more). Similarly i feel that there should not be any suggestion algorithm specific changes in the controller code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, how we can make a validation for the NAS job ?

@andreyvelich
Copy link
Member Author

Validation in NAS RL service is not ready
/hold

@andreyvelich
Copy link
Member Author

andreyvelich commented Feb 28, 2019

I created a new API function called ValidateSuggestionParameters and added validation inside NAS RL service, as noticed here #398 (comment).
Right now, we can add this function inside Suggestion service and validate StudyJob.
Logic works like this:

StudyJobController -> vizier-core-manager -> Suggestion service

Case1: Suggestion can't reply (Suggestion doesn't have ValidateSuggestionParameters function)

StudyJobController (Request)-> vizier-core-manager  (Request)->x Failed Suggestion service
StudyJobController <-(Reply{}, error code Unknown) vizier-core-manager - Suggestion service
instance.Status.Condition = katibv1alpha1.ConditionRunning

Case2: StudyJob is Valid

StudyJobController (Request)-> vizier-core-manager  (Request)-> Suggestion service
StudyJobController <-(Reply{}, nil) vizier-core-manager <-(Reply{}, nil) Suggestion service
instance.Status.Condition = katibv1alpha1.ConditionRunning

Case3: StudyJob is Invalid

StudyJobController (Request)-> vizier-core-manager  (Request)-> Suggestion service
StudyJobController <-(Reply{}, error code InvalidArgument + error message from Suggestion) vizier-core-manager <-(Reply{}, error code InvalidArgument + error message from Suggestion) Suggestion service
instance.Status.Condition = katibv1alpha1.ConditionFailed

@YujiOshima @hougangliu @johnugeorge @richardsliu @DeeperMind Please, check it.

/hold cancel

@andreyvelich
Copy link
Member Author

andreyvelich commented Feb 28, 2019

/test kubeflow-katib-presubmit

return err
}

log.Println("Validate Parameters inside Suggestion Start")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Start to validate Suggestion Parameters

} else {
instance.Status.Condition = katibv1alpha1.ConditionFailed
return errors.New("Suggestion Parameters is not valid")
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/s/is/are
or Valid suggestion Parameters and Invalid suggestion Parameters

* Return 1 if Suggestion Parameters is Valid, 0 if not
*/
message ValidateSuggestionParametersReply {
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool is_valid 1;
string msg 2;

@andreyvelich
Copy link
Member Author

/test kubeflow-katib-presubmit

@andreyvelich
Copy link
Member Author

/retest

3 similar comments
@andreyvelich
Copy link
Member Author

/retest

@andreyvelich
Copy link
Member Author

/retest

@andreyvelich
Copy link
Member Author

/retest

@andreyvelich
Copy link
Member Author

@jlewi @YujiOshima @hougangliu @richardsliu
Should we switch test server to us-east1-c since us-east1-d has VM shortage?
Like here: kubeflow/kubeflow#2583

@andreyvelich
Copy link
Member Author

/retest

@andreyvelich
Copy link
Member Author

/retest

@@ -212,10 +212,20 @@ service Manager {
get: "/api/Manager/GetSavedModels"
};
};
/**
* Validate Suggestion Parameters from Study Job.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain a suggestion service should return INVALID_ARGUMENT error when the parameter is invalid.

}

conn, err := grpc.Dial("vizier-suggestion-"+in.SuggestionAlgorithm+":6789", grpc.WithInsecure())
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should return specific error code here (e.g. Unavailable ) and set message.

Copy link
Member Author

@andreyvelich andreyvelich Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YujiOshima By default this Error https://github.com/kubeflow/katib/pull/398/files#diff-9e63f7b0791edcb4a53c944552b88970R120 will return Unavailable code. I think, we don't need to specify this error code there.
Right now, I handle this error here, as you said: b88030b#diff-7a381fdaeae1295724b74c0a04f7644cR386.
Please check, if it is ok.

if statusCode.Code() == codes.Unknown {
log.Printf("Method ValidateSuggestionParameters not found inside Suggestion service: %s", suggestionAlgorithm)
return true
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please handle InvalidArgument and Unavailable error and print the reason for the error (the parameter is invalid or service is unavailable).

* Suggestion service should return INVALID_ARGUMENT Error when the parameter is invalid
*/
rpc ValidateSuggestionParameters(ValidateSuggestionParametersRequest) returns (ValidateSuggestionParametersReply){
option (google.api.http) = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that ValidateSuggestionParametersReply is used as return &api.ValidateSuggestionParametersReply{} all the time. can we drop ValidateSuggestionParametersReply, alway return nothing?
or you can use below struct to replace current one

message ValidateSuggestionParametersReply {
    bool is_valid 1;
    string msg 2;
 }

Copy link
Member Author

@andreyvelich andreyvelich Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hougangliu We don't need to return anything, like here:
https://github.com/kubeflow/katib/blob/master/cmd/manager/main.go#L213.
If parameter is not valid, we return INVALID_ARGUMENT Error message.

@andreyvelich
Copy link
Member Author

@YujiOshima @hougangliu Do you have any other suggestions on this PR?

@hougangliu
Copy link
Member

/lgtm
/approve
Thanks!

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hougangliu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 8f89ad4 into kubeflow:master Mar 5, 2019
@andreyvelich andreyvelich deleted the issue-372-validate-nasconfig branch September 27, 2021 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants