[Feature] Support Kubeflow MPIJob in MultiKueue #2880
/test pull-kueue-test-integration-main Due to #2901.
Overall LGTM. Only this warning left:
# Warning: 'patchesStrategicMerge' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
But I think we can fix it on follow-up.
Thanks!
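For the follow-up, the warning itself points at `kustomize edit fix` as the way to migrate. To illustrate what that migration amounts to in the simplest case, here is a rough sketch that rewrites a `patchesStrategicMerge` list of file paths into the `patches` form. The file names (`kustomization.old.yaml`, `manager_patch.yaml`) and the `../base` resource are hypothetical and not taken from this PR, and the `awk` rewrite only handles plain file-path entries.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory and file names, for illustration only.
workdir="$(mktemp -d)"
old_file="$workdir/kustomization.old.yaml"
new_file="$workdir/kustomization.yaml"

# A kustomization that still uses the deprecated field.
cat > "$old_file" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../base
patchesStrategicMerge:
  - manager_patch.yaml
EOF

# Rename the field and turn each list entry into a `path:` entry.
# Only list items directly under patchesStrategicMerge are rewritten.
awk '
  /^patchesStrategicMerge:/ { print "patches:"; in_block = 1; next }
  in_block && /^[ \t]*- /   { sub(/- /, "- path: "); print; next }
  { in_block = 0; print }
' "$old_file" > "$new_file"

cat "$new_file"
```

In practice, running `kustomize edit fix` as the warning suggests is probably the better route, since it also covers inline patches.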
mkdir -p $(EXTERNAL_CRDS_DIR)/training-operator/
cp -f $(KF_TRAINING_ROOT)/manifests/base/crds/* $(EXTERNAL_CRDS_DIR)/training-operator/
cp -prf $(KF_TRAINING_ROOT)/manifests/* $(EXTERNAL_CRDS_DIR)/training-operator/
## Removing kubeflow.org_mpijobs.yaml is required as the version of MPIJob is conflicting between training-operator and mpi-operator - in integration tests.
Thanks for the details!
It makes me think that we need a note on this page https://kueue.sigs.k8s.io/docs/tasks/run/kubeflow/mpijobs/ explaining that you need to disable MPIJob support in training-operator if you use both operators.
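The CRD conflict discussed in this thread could be handled in a setup script roughly as follows: copy the training-operator manifests, then drop its copy of the MPIJob CRD so that only mpi-operator's version is installed. This is a sketch under stated assumptions, not the Makefile target from this PR: `src_root` and `dest_dir` are hypothetical stand-ins for `KF_TRAINING_ROOT` and `EXTERNAL_CRDS_DIR`, and the empty CRD files are placeholders created only to make the example self-contained.

```shell
#!/bin/sh
set -eu

# Hypothetical stand-ins for KF_TRAINING_ROOT and EXTERNAL_CRDS_DIR.
src_root="$(mktemp -d)"
dest_dir="$(mktemp -d)/training-operator"

# Placeholder CRD files so the sketch runs without a real checkout.
mkdir -p "$src_root/manifests/base/crds"
touch "$src_root/manifests/base/crds/kubeflow.org_mpijobs.yaml" \
      "$src_root/manifests/base/crds/kubeflow.org_pytorchjobs.yaml"

# Copy the whole manifests tree, preserving attributes.
mkdir -p "$dest_dir"
cp -prf "$src_root/manifests/." "$dest_dir/"

# Remove training-operator's MPIJob CRD wherever it ended up in the copy,
# leaving the other Kubeflow job CRDs untouched.
find "$dest_dir" -name 'kubeflow.org_mpijobs.yaml' -exec rm -f {} +

find "$dest_dir" -type f
```

If a `kustomization.yaml` in the copied tree lists the removed file as a resource, it would need to be adjusted as well; this sketch does not attempt that.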
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow
Why do we need this namespace? Are the tests creating any objects in it?
# 2. Training-operator deployment is modified to enable all kubeflow jobs except for mpi - https://github.com/kubeflow/training-operator/issues/1777

# Modify the `newTag` for the `kubeflow/training-operator` to use the one training-operator version
$YQ eval '(.images[] | select(.name == "kubeflow/training-operator").newTag) = env(KUBEFLOW_IMAGE_VERSION)' -i "$KUBEFLOW_MANIFEST_MANAGER/kustomization.yaml"
Are you reverting these changes after testing?
Like here:
kueue/hack/multikueue-e2e-test.sh, line 109 in 467b4ee:
(cd config/components/manager && $KUSTOMIZE edit set image controller="$IMAGE_TAG")
Lines 97 to 99 in 467b4ee:
function restore_managers_image {
    (cd config/components/manager && $KUSTOMIZE edit set image controller="$INITIAL_IMAGE")
}
If not, users may accidentally commit unnecessary changes to this file.
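One way to guard against that is to back up the tracked file before editing it and restore it when the script finishes. A minimal sketch, assuming a hypothetical `kustomization.yaml` in a scratch directory; `sed` stands in for the `yq` edit, and `test-tag` and the function names are placeholders rather than anything from this PR.

```shell
#!/bin/sh
set -eu

# Hypothetical manifest in a scratch directory, standing in for a tracked file.
workdir="$(mktemp -d)"
manifest="$workdir/kustomization.yaml"
backup="$manifest.bak"
printf 'images:\n- name: kubeflow/training-operator\n  newTag: original\n' > "$manifest"

# Put the original file back if a backup exists; safe to call more than once.
restore_manifest() {
  if [ -f "$backup" ]; then
    mv -f "$backup" "$manifest"
  fi
}

# Also restore on exit, so a failing test run does not leave the edit behind.
trap restore_manifest EXIT

run_with_temporary_tag() {
  cp -p "$manifest" "$backup"
  # Stand-in for the yq edit: rewrite newTag in place for the test run.
  sed 's/newTag: .*/newTag: '"$1"'/' "$backup" > "$manifest"
  # The tests would run here; record the tag they would see.
  tag_during_run="$(sed -n 's/^ *newTag: //p' "$manifest")"
  restore_manifest
}

run_with_temporary_tag "test-tag"
echo "tag during run: $tag_during_run"
cat "$manifest"
```

The explicit `restore_manifest` call handles the normal path, and the `trap` covers early exits, which mirrors the intent of `restore_managers_image` in the existing script.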
I don't. Good point, I will add the restore.
Isn't this file outside of the tracked files?
No. It's tracked in git, and yq will edit this file in place.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alculquicondor, mszadkow.
Leaving LGTM to @mbobrovskyi
/lgtm Thanks!
LGTM label has been added. Git tree hash: e6c09ab9b855734fcb0824b61368ab7fe86456f5
/cherry-pick website
@mbobrovskyi: #2880 failed to apply on top of branch "website":
@mszadkow IIUC, this PR includes changes to the site. Could you please prepare a cherry-pick to the website branch?
We should not merge this PR to the website branch since this feature has not been released yet.
* Introduce MultiKueue adapter for MPIJob
* Add MPIJobs integration tests
* Update access rights for mpijob
* Remove v1 MPIJob yaml from training-operator dep-crds
* Update MPIJob version to v2beta1
* Introduce mpi-operator to multikueue e2e tests
  Modify training-operator setup to be able to work alongside mpi-operator
* Reduce the amount of KFJob e2e multikueue tests
  due to consolidation of MultiKueue adapters for the KFJobs
* Add e2e Multikueue tests for MPIJob
* Apply suggestions from verify
* Use one makefile target to prep kubeflow training-operator manifest and crds
* Apply code review changes
* Yet another small fix
* Rework after code review
* move modifications of training-operator deployment and crds to kustomize
* Another rework after code review
* Cleanup after manager kustomize modification
What type of PR is this?
/kind feature
What this PR does / why we need it:
The PR introduces a new MultiKueue adapter to handle MPIJob (Kubeflow).
We want to extend MultiKueue capabilities to satisfy the needs of early adopters.
Which issue(s) this PR fixes:
Fixes #2552
Special notes for your reviewer:
Does this PR introduce a user-facing change?