
[sync]: code from ODH incubation to RHOAI main #285

Closed
wants to merge 18 commits into from

Conversation

zdtsw

@zdtsw zdtsw commented Jun 18, 2024

No description provided.

aslakknutsen and others added 18 commits June 18, 2024 11:09
…#1033)

context should be determined by the caller and propagated
down the call chain.
- we might see throttling in some clusters; this just raises the default value

Signed-off-by: Wen Zhou <[email protected]>
…pendatahub-io#996)

- both dependent operators have already been checked in configureServerlessFeatures

Signed-off-by: Wen Zhou <[email protected]>
…tahub-io#1027)

They do not belong to pkg/deploy as they are about reading/writing cluster resources rather than deploying resources.
* Update version to v2.13.0

* Fix image

* Update version in manifests

Signed-off-by: Wen Zhou <[email protected]>
The GITHUB_URL was set to https://github.com/ (note the trailing slash),
which causes the script to fail in environments where git is
configured to rewrite the URL, e.g. to always use the SSH protocol:

  [url "[email protected]:"]
    insteadOf = https://github.com

Signed-off-by: Luca Burgazzoli <[email protected]>
…atahub-io#1046)

Avoid returning (nil, nil); convert the error when the CRD is not
present into NotFound and handle it in the caller code.

Avoid checking for found == nil; handle only the error code.

This is more idiomatic and less error-prone.

Signed-off-by: Yauheni Kaliuta <[email protected]>
…ndatahub-io#1051)

* tests: envtest: Add OperatorCondition CRD

Add OLM's[1] OperatorCondition/OperatorConditionList to the external
CRDs. Will be used in future patches.

[1] https://github.com/operator-framework/operator-lifecycle-manager/blob/master/deploy/upstream/manifests/0.18.3/0000_50_olm_00-operatorconditions.crd.yaml

Signed-off-by: Yauheni Kaliuta <[email protected]>

* tests: dscinitialization: add OperatorCondition CRD to schema

Following patches will change initialization to list the objects.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* cluster: GetPlatform: replace CSV list with OperatorExists calls

Jira: https://issues.redhat.com/browse/RHOAIENG-8483

Depending on how operators are installed, CSVs can be seen in all
namespaces, so listing them produces N * M results (where N is the
number of namespaces and M is the number of CSVs). This does not
scale in general, and in practice it causes timeouts on large
clusters.

The function basically checks whether the ODH or RHOAI operator is
installed, and the package already has a function for that,
OperatorExists(). So reuse it.

Signed-off-by: Yauheni Kaliuta <[email protected]>

---------

Signed-off-by: Yauheni Kaliuta <[email protected]>
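The idea can be sketched without any cluster access: look for a single operator by name prefix instead of enumerating every CSV in every namespace. The function names and platform labels below are illustrative stand-ins, not the operator's actual constants:

```go
package main

import (
	"fmt"
	"strings"
)

// operatorExists stands in for the package's OperatorExists() helper: a
// targeted check for one operator by CSV name prefix, rather than listing
// all N namespaces x M CSVs.
func operatorExists(csvNames []string, prefix string) bool {
	for _, n := range csvNames {
		if strings.HasPrefix(n, prefix) {
			return true
		}
	}
	return false
}

// getPlatform mirrors the decision GetPlatform makes: which operator is
// installed determines the platform.
func getPlatform(csvNames []string) string {
	switch {
	case operatorExists(csvNames, "rhods-operator"):
		return "Self Managed Red Hat OpenShift AI"
	case operatorExists(csvNames, "opendatahub-operator"):
		return "Open Data Hub"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(getPlatform([]string{"opendatahub-operator.v2.13.0"}))
}
```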
* cluster: rename isXManaged to detectX

The "is" prefix suggests that the function returns a boolean, while
these functions return the actual results.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* cluster: tune managed/selfmanaged user facing name

Jira: https://issues.redhat.com/browse/RHOAIENG-8497

The config map name "addon-managed-odh-catalog" is not user
friendly. The self-managed variant is also ambiguous and does not
state it explicitly.

Change the names to:
- Managed Red Hat OpenShift AI
- Self Managed Red Hat OpenShift AI

Signed-off-by: Yauheni Kaliuta <[email protected]>

---------

Signed-off-by: Yauheni Kaliuta <[email protected]>
After removing leader election, the operator fails to start if it is
instructed to create the default DSCI. It looks like the webhook is
not ready by that time:

```
create default DSCI CR.
{"level":"error","ts":"2024-05-13T09:25:58Z","logger":"setup","msg":"unable to create initial setup for the operator","error":"Internal error occurred: failed calling webhook \"operator.opendatahub.io\": failed to call webhook: Post \"https://opendatahub-operator-controller-manager-service.oo-2ts9m.svc:443/validate-opendatahub-io-v1?timeout=10s\": no endpoints available for service \"opendatahub-operator-controller-manager-service\"","stacktrace":"main.main.func1\n\t/workspace/main.go:200\nsigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:336\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219"}
```
Leader election added some delay.

The problem does not happen in the default configuration, since
DSCI creation is explicitly disabled in the manifests:

```
       containers:
       - command:
         - /manager
         env:
           - name: DISABLE_DSC_CONFIG
             value: 'true'
         args:
         - --operator-name=opendatahub
         image: controller:latest
```

Add a wrapper function, cluster.CreateWithRetry, for client.Object
creation with a timeout. It uses a hardcoded 5s interval, which
simply seems reasonable, and takes the timeout in minutes as a
parameter.

This requires disabling the nilerr linter, since for the polling
function an error on creation is a valid condition, something the
function waits to disappear.

Fixes: 3610b0b ("feat: remove leader election for operator (opendatahub-io#1000)")

Signed-off-by: Yauheni Kaliuta <[email protected]>
* main: move DSC creation after DSCIs one

Reorder a bit to keep definition of cleanExistingResourceFunc next
to the Add() call.

Signed-off-by: Yauheni Kaliuta <[email protected]>

* main: convert DSC creation to RunnableFunc

DSC creation is validated by the webhook, so it will not work from
main before the manager starts if the webhook is enabled.

This is the same situation as with DSCI. It also requires
CreateWithRetry() for the same reasons as

e26100e ("upgrade: retry if default DSCI creation fails (opendatahub-io#1008)")

Signed-off-by: Yauheni Kaliuta <[email protected]>

---------

Signed-off-by: Yauheni Kaliuta <[email protected]>
As the first step of improving the Feature DSL, this PR adds godoc explaining the purpose of each method in the builder chain.
- Source is already used elsewhere in Feature DSL, so we might want to
  avoid confusion
- Location fits the purpose of this field better
The actual use with resMap is inlined and moved to the caller.

This way we can construct plugins on demand and use them as building blocks instead.
Signed-off-by: Wen Zhou <[email protected]>
Signed-off-by: Wen Zhou <[email protected]>
@zdtsw zdtsw requested a review from ykaliuta June 18, 2024 10:40

openshift-ci bot commented Jun 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zdtsw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zdtsw

zdtsw commented Jun 18, 2024

@ykaliuta this is only for a sanity check, not intended for full code review.

@ykaliuta

Can we remove the squashing policy for this repo? The PRs were already squashed upstream.

@zdtsw

zdtsw commented Jun 18, 2024

Can we remove the squashing policy for this repo? The PRs were already squashed upstream.

I am open to making this change, but I am trying to understand the reason behind it (multiple commits rather than one final squash).

If we want to better track the changes introduced in RHOAI, we have the commits in the list, so we can go back and see what the original change from ODH was for, right? For example, #274 shows 4 commits from ODH.

These cherry-picks are not 100% the same between the ODH commits and the commits in this PR; conflicts are resolved on every commit. Therefore, if anything is wrong in the "final big sync PR", it could be:

  1. a manual error when resolving conflicts during cherry-pick, or 2. the original ODH PR being wrong

@ykaliuta

These cherry-picks are not 100% the same between the ODH commits and the commits in this PR.

Isn't it easier to track on a per-commit basis? You have the original commit in ODH and the amended commit in RHOAI. You can easily get the amendments just by comparing the diffs, instead of digging them out of one huge commit.

It is also good to have the list of amendments in the commit message to understand the intentions.

@zdtsw

zdtsw commented Jun 18, 2024

These cherry-picks are not 100% the same between the ODH commits and the commits in this PR.

Isn't it easier to track on a per-commit basis? You have the original commit in ODH and the amended commit in RHOAI. You can easily get the amendments just by comparing the diffs, instead of digging them out of one huge commit.

It is also good to have the list of amendments in the commit message to understand the intentions.

In theory, yes, if it were a 1 ODH commit : 1 RHOAI commit mapping, in order.
But we do not cherry-pick in that order, e.g. some old commits are not synced down (mostly human mistakes, but it could also be features that cannot be added to RHOAI yet).
We would need to run tests and ensure every commit in this PR works before starting to cherry-pick the next one.
That is extra effort for whoever is doing the sync on behalf of the team.
One bad example: if commit A does not work after the sync from ODH to RHOAI because of some missing functions in that commit, then we can only fix it in a later commit X, and that would not help when comparing commit A from RHOAI against commit A from ODH.

We can have another discussion in the team and identify a work process.
One option I can think of is that whoever makes a change in ODH should be responsible for landing that change in RHOAI main.
But I am open to more options.

@openshift-merge-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@zdtsw

zdtsw commented Jun 19, 2024

Closing this PR in favor of #287.

But we can keep discussing how such a [sync] should be done until we have the "keep ODH only" feature done.

@zdtsw zdtsw closed this Jun 19, 2024
7 participants