This repository has been archived by the owner on Feb 1, 2022. It is now read-only.

Mxnet operator v1 API #36

Open
3 of 4 tasks
wackxu opened this issue May 13, 2019 · 6 comments

Comments

@wackxu
Contributor

wackxu commented May 13, 2019

A couple of minor API changes have been suggested. We can incorporate all of them in the next API version.

Related: kubeflow/training-operator#935

  • Require support for the Status subresource in the CRD
  • Add ActiveDeadlineSeconds and BackoffLimit
  • Use pod group instead of PDB for gang scheduling
  • Support multiple versions of the CRD

@suleisl2000 @gaocegege

@wackxu
Contributor Author

wackxu commented May 13, 2019

If this is reasonable, I can help do this.

@suleisl2000
Contributor

@wackxu Thanks for the summary. It makes sense to track the changes made in tf-operator and pytorch-operator. Just one question: we have already merged the third item, "Use pod group instead of PDB for gang scheduling", into v1beta1. Does that introduce a compatibility issue?

@gaocegege
Member

It may have an effect, but I do not think it blocks the development of v1beta2. Actually, we are working on v1 in tfjob now. Maybe we could implement v1 directly in mxnet-operator.

@wackxu
Contributor Author

wackxu commented May 13, 2019

@suleisl2000 It should have no effect. For an old mxjob that uses a PDB, the new controller will also create a podgroup for the mxjob and delete the podgroup when the mxjob is deleted. For a PDB that was created earlier by the controller, the k8s garbage collector will delete the PDB when the mxjob is deleted, so in the end everything belonging to the mxjob is removed.
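
To make the cleanup reasoning concrete, here is a toy in-memory model of it. It is a simplified sketch, not the real Kubernetes garbage collector or the controller code: `Object`, `Store`, and `DeleteMXJob` are hypothetical names, and it assumes the old PDB carries an owner reference to the mxjob while the podgroup is deleted explicitly by the controller.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for a Kubernetes object.
type Object struct {
	Kind      string
	UID       string
	OwnerUIDs []string // UIDs this object lists in its owner references
}

// Store is a toy in-memory "cluster", keyed by UID.
type Store struct {
	objects map[string]Object
}

func NewStore() *Store { return &Store{objects: map[string]Object{}} }

func (s *Store) Add(o Object) { s.objects[o.UID] = o }

func (s *Store) Has(uid string) bool {
	_, ok := s.objects[uid]
	return ok
}

// anyOwnerExists reports whether at least one owner of o is still present.
func (s *Store) anyOwnerExists(o Object) bool {
	for _, uid := range o.OwnerUIDs {
		if s.Has(uid) {
			return true
		}
	}
	return false
}

// Delete removes an object, then repeatedly removes dependents whose owners
// are all gone. This imitates cascading garbage collection in a simplified way.
func (s *Store) Delete(uid string) {
	delete(s.objects, uid)
	for changed := true; changed; {
		changed = false
		for id, o := range s.objects {
			if len(o.OwnerUIDs) > 0 && !s.anyOwnerExists(o) {
				delete(s.objects, id)
				changed = true
			}
		}
	}
}

// DeleteMXJob models the behavior described above: the controller deletes
// the podgroup itself, and the legacy PDB is left to garbage collection.
func DeleteMXJob(s *Store, jobUID, podGroupUID string) {
	s.Delete(podGroupUID) // explicit cleanup by the (new) controller
	s.Delete(jobUID)      // owner-reference cascade removes the old PDB
}

func main() {
	s := NewStore()
	s.Add(Object{Kind: "MXJob", UID: "job-1"})
	s.Add(Object{Kind: "PodDisruptionBudget", UID: "pdb-1", OwnerUIDs: []string{"job-1"}})
	s.Add(Object{Kind: "PodGroup", UID: "pg-1"})

	DeleteMXJob(s, "job-1", "pg-1")
	fmt.Println("objects left:", len(s.objects))
}
```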

@wackxu
Contributor Author

wackxu commented May 13, 2019

I agree with @gaocegege. The changes in the list have been in tf-operator for a while and have been tested enough, so we can implement v1 directly in mxnet-operator. @suleisl2000 WDYT?

@suleisl2000
Contributor

@wackxu It is OK with me to work on v1 directly.
