Skip to content
This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

simplify reconciler #166

Merged
merged 1 commit into from
Dec 19, 2021
Merged

simplify reconciler #166

merged 1 commit into from
Dec 19, 2021

Conversation

zw0610
Copy link
Member

@zw0610 zw0610 commented Sep 28, 2021

To support GenericJob in tf-operator, a kind of Job independent from any deep learning frameworks, this pull request introduces the following modification to reconciler.v1 package:

  1. Simplify multiple reconciler names (by dropping redundant words like Kubeflow)
  2. Fix PodList, ServiceList converting issue
  3. Fix default GetReconcilerName returned value
  4. Simplify reconciler.v1 package by moving the KubeflowReconciler to tf-operator [WIP]: add a GenericJob type and controller training-operator#1398
  5. Simplify TestReconciler construction function

@google-oss-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign jeffwan after the PR has been reviewed.
You can assign the PR to them by writing /assign @jeffwan in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/reconciler.v1/common/service.go Outdated Show resolved Hide resolved
@@ -6,19 +6,19 @@ To use the reconciler, following methods must be overridden according to the API

```go
// GetJob returns the job that matches the request
func (r *KubeflowJobReconciler) GetJob(ctx context.Context, req ctrl.Request) (client.Object, error)
func (r *JobReconciler) GetJob(ctx context.Context, req ctrl.Request) (client.Object, error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like Training or TrainingJob since we are always dealing with model trainings here. Job seems very generic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TrainingJob sounds good to me. As the common library is designed for all kinds of controllers, I would suggest renaming the GenericJob to TrainingJob in the pull request I mentioned above for tf-operator (kubeflow/training-operator#1398). What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@terrytangyuan
Copy link
Member

/cc @kubeflow/wg-training-leads

@Jeffwan
Copy link
Member

Jeffwan commented Sep 29, 2021

Sorry for the delay. Let me merge #163 first before accepting any new changes. This is really annoying.

@zw0610
Copy link
Member Author

zw0610 commented Dec 1, 2021

@Jeffwan could we move this pr forward as the revert string issue is resolved now.

Copy link
Member

@terrytangyuan terrytangyuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
/assign @Jeffwan


// GetReconcilerName returns the name of this reconciler, which is "Kubeflow Reconciler"
// GetReconcilerName returns the name of this reconciler, which is "common-reconciler"
func (r *ReconcilerUtil) GetReconcilerName() string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Kubeflow Reconciler" won't work because it's not a word?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. When generating the default labels, the returned string from GetReconcilerName will be used as the value for key commonv1.OperatorNameLable (""training.kubeflow.org/operator-name"").

@@ -239,21 +239,3 @@ type JobInterface interface {
// PastActiveDeadline CAN be overridden to customize how to determine if this job has past activate deadline.
PastActiveDeadline(runPolicy *commonv1.RunPolicy, jobStatus *commonv1.JobStatus) bool
}

// KubeflowReconcilerInterface defines the abstract interface for a base reconciler for kubeflow jobs.
type KubeflowReconcilerInterface interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation to move KubeflowReconcilerInterface to training-operator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When writing the test code, test_reconciler, which could be considered as one example of how the common.reconciler will be used, I found that:

  1. Such interface will not be used for any abstraction when implementing a real reconciler
  2. Embedding modularized interfaces in KubeflowReconcilerInterface conflicts embedding them in the real reconciler struct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it make sense to me. current testing integration between common and training-operator is a headache.

svcList := &corev1.ServiceList{}
err := r.List(ctx, svcList, client.MatchingLabels(r.GenLabels(job.GetName())))
if err != nil {
return nil, err
}

var svcs []*corev1.Service = nil
for _, svc := range svcList.Items {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

em. what's the problem of original way?

Copy link
Member Author

@zw0610 zw0610 Dec 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the old way will not give a new address for svc so when we take the pointer of svc, it remains the same. As a result, all Services in the svc slice are the same. Please see the screenshot below.

Screen Shot 2021-12-12 at 20 14 53

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. yeah. This pointer points to the shared memory location that each of the slice items are being copied into.
All the pointers created using &svc point to the same memory location.

Current way looks good to me.

pkg/reconciler.v1/common/service.go Outdated Show resolved Hide resolved
@Jeffwan
Copy link
Member

Jeffwan commented Dec 12, 2021

@zw0610 sorry for late response. Please have a check on above comments

@Jeffwan
Copy link
Member

Jeffwan commented Dec 12, 2021

/lgtm
/hold

This change looks good to me. Let's see if anyone else have further feedbacks. /cc @kubeflow/wg-training-leads

@zw0610
Copy link
Member Author

zw0610 commented Dec 19, 2021

Any feedback before we can merge this pr? @kubeflow/wg-training-leads

@terrytangyuan
Copy link
Member

/hold cancel

@terrytangyuan
Copy link
Member

/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Jeffwan,terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 4322683 into kubeflow:master Dec 19, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants