
Kustomize package for mxnet operator #279

Merged
2 commits merged into kubeflow:master on Aug 27, 2019

Conversation

@Jeffwan
Member

Jeffwan commented Aug 15, 2019

Which issue is resolved by this Pull Request:
Resolves #228

Description of your changes:
Generate packages using ksonnet and migrate them to kustomize.
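
For illustration, a kustomize base for this package could look roughly like the following; the file names and labels here are assumptions for the sketch, not necessarily the exact layout merged in this PR.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
- crd.yaml
- cluster-role.yaml
- cluster-role-binding.yaml
- service-account.yaml
- deployment.yaml
commonLabels:
  kustomize.component: mxnet-operator

Overlays can then patch the namespace, image, or labels without touching the base.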

Checklist:

  • Unit tests have been rebuilt:
    1. cd manifests/tests
    2. make generate
    3. make test

The e2e test works as expected:

INFO:root:Epoch[9] Batch [810]	Speed: 18104.75 samples/sec	accuracy=0.998437
INFO:root:Epoch[9] Batch [820]	Speed: 16208.60 samples/sec	accuracy=0.993750
INFO:root:Epoch[9] Batch [830]	Speed: 18071.96 samples/sec	accuracy=0.998437
INFO:root:Epoch[9] Batch [840]	Speed: 17168.77 samples/sec	accuracy=0.998437
INFO:root:Epoch[9] Batch [850]	Speed: 17117.32 samples/sec	accuracy=0.996875
INFO:root:Epoch[9] Batch [860]	Speed: 17657.56 samples/sec	accuracy=0.993750
INFO:root:Epoch[9] Batch [870]	Speed: 17583.30 samples/sec	accuracy=0.996875
INFO:root:Epoch[9] Batch [880]	Speed: 17988.64 samples/sec	accuracy=0.992188
INFO:root:Epoch[9] Batch [890]	Speed: 17165.04 samples/sec	accuracy=0.993750
INFO:root:Epoch[9] Batch [900]	Speed: 17731.97 samples/sec	accuracy=0.996875
INFO:root:Epoch[9] Batch [910]	Speed: 17920.19 samples/sec	accuracy=0.996875
INFO:root:Epoch[9] Batch [920]	Speed: 17885.09 samples/sec	accuracy=1.000000
INFO:root:Epoch[9] Batch [930]	Speed: 18132.88 samples/sec	accuracy=0.992188
INFO:root:Epoch[9] Train-accuracy=0.997768
INFO:root:Epoch[9] Time cost=3.427
INFO:root:Epoch[9] Validation-accuracy=0.980195
worker
9091
mxnet-job-scheduler-0
1
3



@jlewi
Contributor

jlewi commented Aug 16, 2019

I might suggest splitting the package into two:

  • one for cluster-wide resources (cluster roles and bindings, and the CRD)
  • one for everything else

The reason is that this way, if a user doesn't have cluster admin privileges, they can just ask a cluster admin to do the first part.
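
For illustration only, the suggested split might look like the hypothetical layout below; the directory and file names are made up for this sketch, not the structure that was actually merged.

mxnet-job/
  mxnet-job-crds/            # cluster-scoped: CRD, ClusterRole, ClusterRoleBinding
    base/
      kustomization.yaml
      crd.yaml
      cluster-role.yaml
      cluster-role-binding.yaml
  mxnet-operator/            # namespaced: ServiceAccount, Deployment, Service
    base/
      kustomization.yaml
      service-account.yaml
      deployment.yaml

A cluster admin would apply the first package once, and individual teams could then install the second package into their own namespaces.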

@Jeffwan
Member Author

Jeffwan commented Aug 16, 2019

The reason is that this way, if a user doesn't have cluster admin privileges, they can just ask a cluster admin to do the first part.

Yeah, that's very reasonable. I will make the change.

@jlewi
Contributor

jlewi commented Aug 18, 2019

/assign @jlewi
/assign @johnugeorge

@jlewi
Contributor

jlewi commented Aug 18, 2019

@Jeffwan Let me know when this is ready for another look.

@Jeffwan
Member Author

Jeffwan commented Aug 19, 2019

@jlewi I am not sure whether we need to put service-account.yaml in the same package as the cluster roles and CRDs. If an end user wants to deploy the operator under their own namespace, they also need a service account in that namespace, and the ClusterRoleBinding needs to know the namespace of the service account. I feel it's kind of hard to split into two packages.
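
As a minimal sketch of that coupling (resource names here are assumptions), the cluster-scoped ClusterRoleBinding has to hard-code the namespace of the namespaced ServiceAccount:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: mxnet-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: mxnet-operator
subjects:
- kind: ServiceAccount
  name: mxnet-operator
  namespace: kubeflow   # must match wherever the operator and its ServiceAccount are deployed

If the operator moves to a different namespace, this cluster-scoped object has to be updated as well, which is what makes a clean two-package split awkward.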

If another user, userB, wants to use the operator, should they deploy another operator instance? I am trying to understand the best practice here.

The CRD is totally separate and can be moved to another package. But if the cluster admin will be responsible for both the CRD and the cluster roles and bindings, it's not that meaningful to have only the CRD in a separate package.

Any thoughts?

@johnugeorge
Member

Good point raised by @jlewi. However, we haven't followed the same approach for the other operators:

https://github.com/kubeflow/manifests/tree/master/pytorch-job -- CRDs are separate
https://github.com/kubeflow/manifests/tree/master/tf-training/tf-job-operator/base -- no separation

@jlewi
Contributor

jlewi commented Aug 20, 2019

@johnugeorge is correct that we haven't consistently followed the pattern of splitting cluster-level resources into separate applications.

The request to split it into two packages, one for cluster-scoped resources and one for non-cluster-scoped resources, comes from some users who have asked for it given the way permissions are structured across different teams.

If you don't want to split it into two, I can live with that.

@Jeffwan
Member Author

Jeffwan commented Aug 20, 2019

@johnugeorge is correct that we haven't consistently followed the pattern of splitting cluster-level resources into separate applications.

The request to split it into two packages, one for cluster-scoped resources and one for non-cluster-scoped resources, comes from some users who have asked for it given the way permissions are structured across different teams.

If you don't want to split it into two, I can live with that.

@jlewi I think splitting into two is a good idea, but I met some issues going this way. I am not sure where I should put the service account. The SA is referenced by both the cluster-role-binding (cluster-scoped) and the operator deployment (non-cluster-scoped). The operator was designed as a cluster-scoped controller which listens for job requests from all namespaces.

@jlewi
Contributor

jlewi commented Aug 23, 2019

I don't have a good answer, so my suggestion is to pick whichever you think is best, and we can change it later if that doesn't work.

@Jeffwan
Member Author

Jeffwan commented Aug 26, 2019

I don't have a good answer, so my suggestion is to pick whichever you think is best, and we can change it later if that doesn't work.

Let's merge it, and if anyone has requests on the split, we can do a refactor across all training operators to make the structure unified.

@jlewi
Contributor

jlewi commented Aug 27, 2019

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit d4c09c3 into kubeflow:master Aug 27, 2019
@Jeffwan deleted the mxnet branch August 27, 2019 06:30
Tomcli pushed a commit to Tomcli/manifests that referenced this pull request Dec 11, 2020