Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DB: add studyjob table and extend worker table #362

Closed
wants to merge 2 commits into from

Conversation

YujiOshima
Copy link
Contributor

@YujiOshima YujiOshima commented Feb 7, 2019

Related to #352 (comment)
I split changes related to DB scheme from #352 .
In this PR, information of studyjob and worker are stored in DB and users can refer them after the studyjob CR is deleted.
It makes easier to rerun the workers since the manifests of them are stored in DB.

@hougangliu @gaocegege @richardsliu @johnugeorge

Signed-off-by: YujiOshima [email protected]


This change is Reviewable

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: yujioshima

If they are not already assigned, you can assign the PR to them by writing /assign @yujioshima in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@YujiOshima
Copy link
Contributor Author

/retest

@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

@googlebot
Copy link

CLAs look good, thanks!

@johnugeorge
Copy link
Member

Needs rebase to get gcloud fix

@YujiOshima YujiOshima force-pushed the enrich_workerinfo branch 2 times, most recently from f8e5fa0 to 819b32f Compare February 14, 2019 06:01
@YujiOshima
Copy link
Contributor Author

YujiOshima commented Feb 14, 2019

CI passed!
@johnugeorge @hougangliu @richardsliu Could you take a look?

@YujiOshima
Copy link
Contributor Author

/retest

State status = 5; /// Status of Worker.
string TemplatePath = 6; /// Path for the manufest template of Worker.
repeated Tag tags = 7; /// Tags of Worker.
string manufest = 6; /// A json formatted manufest for Worker.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

manifest?

string TemplatePath = 6; /// Path for the manufest template of Worker.
repeated Tag tags = 7; /// Tags of Worker.
string manufest = 6; /// A json formatted manufest for Worker.
string metrics_collector_manufest = 7; /// A json formatted manufest for MetricsCollector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@@ -248,7 +244,7 @@ func (r *ReconcileStudyJobController) Reconcile(request reconcile.Request) (reco
default:
now := metav1.Now()
instance.Status.StartTime = &now
err = initializeStudy(instance)
update, err = initializeStudy(instance, request.Namespace)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe it's not your change, just merge by mistake?

@@ -53,8 +62,11 @@ func (d *dbConn) DBInit() {
trial_id CHAR(16),
type VARCHAR(255),
status TINYINT,
template_path TEXT,
manufest TEXT,
metrics_collector_manufest TEXT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo err

study_id CHAR(16),
job_type TEXT,
worker_template TEXT,
metrics_collector_template TEXT,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'd better choose one of manifest and template to name var/column related consistently

@@ -226,11 +227,6 @@ func (r *ReconcileStudyJobController) Reconcile(request reconcile.Request) (reco
if !contains(pendingFinalizers, cleanDataFinalizer) {
return reconcile.Result{}, nil
}
err = deleteStudy(instance)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we can add a field like "Retain" for studyjob to decide if deleteStudy when delete studyjob

}
defer conn.Close()
c := katibapi.NewManagerClient(conn)
_, err = c.GetStudyJob(context.Background(), &katibapi.GetStudyJobRequest{
StudyJobUid: string(instance.UID),
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which case instance.UID can be duplicated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By CR operator implementation, the reconcile function will be called parallelly. So initializeStudy is often called two times. I don't know how to avoid it.

if mt, ok := mtl[instance.Spec.MetricsCollectorSpec.GoTemplate.TemplatePath]; ok {
regStudyJobreq.StudyJob.MetricsCollectorTemplate = mt
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please also handle GoTemplate.TemplatePath == "" && GoTemplate.RawTemplate == "" case both for worker and metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, the defaultWorkerTemplate.yaml and defaultMetricsCollectorTemplate.yaml will be used. It's not an error.

@hougangliu
Copy link
Member

hougangliu commented Feb 21, 2019

tables studyjobs stores workertemplate/metricscollectortemplate info.
table study stores other info except workertemplate/metricscollectortemplate.

@YujiOshima do you plan to support below configuration?

spec:
   reusestudyid: STUDYID
   reusestudyjobid: UUID

@YujiOshima
Copy link
Contributor Author

@hougangliu Thank you for your review!
I will add reusestudyid. But can I reuse studyjobid? The studyjobid is a UID of a studyjob CR.
It is automatically filled by k8s.

@@ -341,12 +353,35 @@ func TestGetWorkerList(t *testing.T) {
}

func TestUpdateWorker(t *testing.T) {
mock.ExpectExec(`UPDATE workers SET status = \? WHERE id = \?`).WithArgs(api.State_COMPLETED, defaultWorkerID).WillReturnResult(sqlmock.NewResult(1, 1))
err := dbInterface.UpdateWorker(defaultWorkerID, api.State_COMPLETED)
wmanufest := "woker manufest"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: manufest -> manifest, woker ->worker

mock.ExpectExec(`UPDATE workers SET status = \? WHERE id = \?`).WithArgs(api.State_COMPLETED, defaultWorkerID).WillReturnResult(sqlmock.NewResult(1, 1))
err := dbInterface.UpdateWorker(defaultWorkerID, api.State_COMPLETED)
wmanufest := "woker manufest"
mcmanufest := "metricscollector manufest"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. Please correct the other lines too

@hougangliu
Copy link
Member

@hougangliu Thank you for your review!
I will add reusestudyid. But can I reuse studyjobid? The studyjobid is a UID of a studyjob CR.
It is automatically filled by k8s.

Do you mean when a new studyjob uses reusestudyid, it will use configuration of row from studys tables and worker/metrics of row from study jobs tables whose studyid == reusestudyid?
is there one record or multiple records in studyjobs table which studyid == reusestudyid? if there is many, if worker/metrics template keeps same or not?

@YujiOshima
Copy link
Contributor Author

@hougangliu Sorry for confusing.
Yes, I will add

spec:
   reusestudyid: STUDYID

But I don't have plan to add reusestudyjobid: UUID since studyjobid cannot be reused.

is there one record or multiple records in studyjobs table which studyid == reusestudyid? if there is many, if worker/metrics template keeps same or not?

Multiple studyjob for a studyid and worker/metrics template do not need to be the same for each studyjobs even they have the same studyid.

@hougangliu
Copy link
Member

But I don't have plan to add reusestudyjobid: UUID since studyjobid cannot be reused.

studyjob has been stored in db, why we cannot reuse it? (we create a studyjob crd whose uid is xxx, a studyjob with studyjobid xxx also created in db, when we delete studyjob with uid xx, a studyjob with studyjobid xxx still stored in db.)

Multiple studyjob for a studyid and worker/metrics template do not need to be the same for each studyjobs even they have the same studyid.

I think maybe you plan to use it as below, which can lead to multiple studyjob for a studyid, right?
if yes, we can only reuse a studyjob spec without worker/metrics spec. That is, even we store studyjob in db, we cannot reuse it.

spec:
   reusestudyid: STUDYID
   workerSpec:
     ...
   metricsCollectorSpec:
     ...

@YujiOshima
Copy link
Contributor Author

@hougangliu

studyjob has been stored in db, why we cannot reuse it?
Oh, I misunderstood. We can reuse a config for a studyjob but a UID will change in the new studyjob.

My assumption is below.

  • Reuse StudyID: Run another config(algorithm, worker template) for an existing study. Only reuse studyID and studyconfig.
  • Reuse StudyJobID: Rerun a studyjob with the same config (same algorithm, same worker template)

@hougangliu
Copy link
Member

@YujiOshima I think the change is a little complex. To avoid misunderstanding, I think a detail example can make it clear.
saying: there are existing studyjob s1 and s2 (what it looks config and worker/metrics, simple one is ok). when create a new studyjob s3 with reusestudy/reusestudyjob, what s3 will look like.
Thanks!

@hougangliu
Copy link
Member

@YujiOshima it makes sense.
Then I think there is only one studyjob for a studyid (one to one). Otherwise, for below case, it's hard to decide selecting one of studyjob to fill workerSpec automatically.

apiVersion: "kubeflow.org/v1alpha1"
kind: StudyJob
metadata:
  namespace: kubeflow
  labels:
    controller-tools.k8s.io: "1.0"
  name: random-example-reusejob
spec:
  reuseStudyJobId: f4a4b946-3b33-11e9-adcb-08002750630b

@YujiOshima
Copy link
Contributor Author

YujiOshima commented Mar 4, 2019

@hougangliu

Then I think there is only one studyjob for a studyid (one to one).

No, multi studyjobs can use a studyID and the studyjobs share historical data of Trial and Worker belong to the studyID.
The use cases are

  • You can get initial values for bayesian opt, running random or manual studyjob. After that, run bayesian opt job reusing the StudyID.
  • You can retry or try another algorithm after a job is completed and see all results on the same graph in UI.
  • You can add a trial manually during job running since the algorithm doesn't search near the trial.

Otherwise, for below case, it's hard to decide selecting one of studyjob to fill workerSpec automatically.

A studyjob has only one WorkerSpec. So it can fill with the WorkerSpec.

@gaocegege
Copy link
Member

/close

We have dropped v1alpha1. Thus I think we cannot merge it into master.

@k8s-ci-robot
Copy link

@gaocegege: Closed this PR.

In response to this:

/close

We have dropped v1alpha1. Thus I think we cannot merge it into master.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants