Conversation

@inesqyx (Contributor) commented Apr 2, 2024

  • What I did

Mimicking the way leader election is set up in the Machine Config Controller and the Machine Config Operator, we set up leader election in the Machine OS Builder (MOB) as well. Doing so ensures that only a single Machine OS Builder pod is running at any given time.

  • How to verify it
  1. Deploy an OpenShift cluster.
  2. Opt into on-cluster builds.
  3. Retrieve the logs for the Machine OS Builder pod and verify that it pauses for leader election and eventually starts.
  4. Delete the pod and wait for the Deployment to start a replacement pod.
  5. Retrieve the replacement pod logs and verify that it pauses for leader election and eventually starts.
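
For reference, below is a minimal sketch of the pattern this description refers to, using client-go's leaderelection package. The lock name and namespace mirror the ones visible in the QE log later in this thread; the POD_NAME identity source, callback bodies, and timing values are illustrative assumptions rather than the PR's exact code.

```go
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	// In-cluster client config, matching the "Using in-cluster kube client
	// config" line in the pod log below.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease lock with the name/namespace seen in the log:
	// openshift-machine-config-operator/machine-os-builder.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "machine-os-builder",
			Namespace: "openshift-machine-config-operator",
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			// Hypothetical identity source; any per-pod unique value works.
			Identity: os.Getenv("POD_NAME"),
		},
	}

	// Blocks until this pod wins the election, then runs the callbacks.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock: lock,
		// Illustrative timings only; the MCO derives its real values from
		// shared defaults (the log below reports the resulting tolerances).
		LeaseDuration: 137 * time.Second,
		RenewDeadline: 107 * time.Second,
		RetryPeriod:   26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the leader starts the build controllers.
				klog.Info("became leader, starting Machine OS Builder")
			},
			OnStoppedLeading: func() {
				klog.Fatal("leader election lost")
			},
		},
	})
}
```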

openshift-ci-robot added the jira/valid-reference label on Apr 2, 2024
@openshift-ci-robot (Contributor) commented Apr 2, 2024

@inesqyx: This pull request references MCO-790, which is a valid Jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


openshift-ci bot added the do-not-merge/work-in-progress label on Apr 2, 2024
@openshift-ci bot commented Apr 2, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@inesqyx (Contributor, Author) commented Apr 2, 2024

/test all

@openshift-ci bot commented Apr 2, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: inesqyx
Once this PR has been reviewed and has the lgtm label, please assign dkhater-redhat for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

inesqyx marked this pull request as ready for review on April 2, 2024 20:48
openshift-ci bot removed the do-not-merge/work-in-progress label on Apr 2, 2024
openshift-ci bot requested review from dkhater-redhat and jkyros on April 2, 2024 20:50
@openshift-ci bot commented Apr 2, 2024

@inesqyx: The following tests failed; say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                     Commit    Required  Rerun command
ci/prow/okd-scos-e2e-aws-ovn                  dd4e9b9   false     /test okd-scos-e2e-aws-ovn
ci/prow/e2e-azure-ovn-upgrade-out-of-change   dd4e9b9   false     /test e2e-azure-ovn-upgrade-out-of-change


@inesqyx (Contributor, Author) commented Apr 4, 2024

/jira refresh

@openshift-ci-robot (Contributor) commented Apr 4, 2024

@inesqyx: This pull request references MCO-790, which is a valid Jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


@sergiordlr (Contributor) commented Apr 15, 2024

When we configure a new imageBuilderType, the machine-os-builder pod is restarted, but the outgoing pod does not release its lease, so when the new machine-os-builder pod starts it cannot acquire the lease and reports failures like this:

$ oc logs machine-os-builder-fb856c6f4-nvrvf 
I0415 15:59:33.756112       1 start.go:89] Options parsed: {kubeconfig:}
I0415 15:59:33.756133       1 start.go:92] Version: machine-config-daemon-4.6.0-202006240615.p0-2682-g200c5f24-dirty (200c5f24043dee744f5b1680eb09cffcaa7d7a8f)
I0415 15:59:33.756143       1 builder.go:93] Using in-cluster kube client config
I0415 15:59:33.756369       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0415 15:59:33.770262       1 leaderelection.go:250] attempting to acquire leader lease openshift-machine-config-operator/machine-os-builder...
I0415 15:59:33.775174       1 leaderelection.go:354] lock is held by machine-os-builder-fb856c6f4-nvrvf_42ca8477-a939-4883-89f4-643b5dcfa7b0 and has not yet expired
I0415 15:59:33.775195       1 leaderelection.go:255] failed to acquire lease openshift-machine-config-operator/machine-os-builder

It keeps failing for about two and a half minutes, matching the 2m43s worst-case non-graceful lease acquisition reported in the log, and then takes the lease ungracefully.

To reproduce it, enable the on-cluster build functionality and reconfigure the imageBuilderType, for example with this command:

$ oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'

Since reconfiguring the imageBuilderType is a controlled situation, the lease should be released and reacquired gracefully, shouldn't it?
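
For context, client-go has a knob for exactly this: with ReleaseOnCancel set, the outgoing leader clears the Lease holder when its context is cancelled, so a successor acquires the lock within roughly one retry period instead of waiting out the full lease. A sketch of how MOB could wire that to pod shutdown follows; runWithGracefulRelease, runBuilder, and the SIGTERM wiring are assumptions for illustration, not the PR's actual code (imports as in the earlier sketch, plus os/signal and syscall):

```go
// runWithGracefulRelease is a sketch: lock is the same *resourcelock.LeaseLock
// built in the earlier example, and runBuilder is a hypothetical stand-in for
// the real build controllers.
func runWithGracefulRelease(lock *resourcelock.LeaseLock, runBuilder func(context.Context)) {
	// Cancel the context on SIGTERM so the election loop gets a chance to
	// clear the Lease holder before the old pod exits.
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer cancel()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock: lock,
		// With ReleaseOnCancel, the outgoing leader empties the Lease on
		// shutdown, so the successor never hits the "lock is held ... and
		// has not yet expired" wait seen in the log above.
		ReleaseOnCancel: true,
		LeaseDuration:   137 * time.Second,
		RenewDeadline:   107 * time.Second,
		RetryPeriod:     26 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: runBuilder,
			OnStoppedLeading: func() {
				klog.Info("lease released, shutting down")
			},
		},
	})
}
```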

A pre-merge Jira ticket has been created to track this behaviour: https://issues.redhat.com/browse/OCPBUGS-32271

@sergiordlr (Contributor) commented

With the new MachineOSConfig resource there is only one image builder type, so the ticket that we opened regarding this PR no longer applies.

We are adding the qe-approved label.

/label qe-approved

openshift-ci bot added the qe-approved label on Apr 24, 2024
@openshift-ci-robot (Contributor) commented Apr 24, 2024

@inesqyx: This pull request references MCO-790, which is a valid Jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.


@openshift-bot (Contributor) commented

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci bot added the lifecycle/stale label on Jul 24, 2024
openshift-merge-robot added the needs-rebase label on Jul 24, 2024
@openshift-merge-robot (Contributor) commented

PR needs rebase.


@inesqyx (Contributor, Author) commented Jul 24, 2024

Closing the PR; merged in #4327.

inesqyx closed this on Jul 24, 2024

Labels

jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
qe-approved: Signifies that QE has signed off on this PR.
