MCO-1230: Retry build and push operations multiple times #4469

cheesesashimi · 2024-07-11T13:38:25Z

- What I did

Occasionally, an image build and/or push operation fails because of a transient network condition. To make this process more robust, we should retry these operations several times. This PR takes a simplified approach: the build and push operations are themselves wrapped in a retry function. A key limitation of this approach is that it does not cover situations where the build pod is evicted or rescheduled onto a different node. For those, we may want to investigate using a Kubernetes Job, which provides additional resilience around evictions and rescheduling.
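For readers who want a concrete picture of the retry wrapper idea, here is a minimal Bash sketch. It is not the code from this PR: the function name `retry`, its argument order, the fixed delay, and the intermediate "retrying in" message are all assumptions made for illustration. Only the final "no more retries left" message mirrors the log line quoted in the verification steps.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not taken from the PR):
#   retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2

  local attempt=1
  local rc=0
  while true; do
    # Capture the exit code explicitly so a failure here does not
    # trip `set -e` in the calling script.
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Retry ${attempt}/${max_attempts} exited ${rc}, no more retries left." >&2
      return "$rc"
    fi
    # Assumed wording for the intermediate message.
    echo "Retry ${attempt}/${max_attempts} exited ${rc}, retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}
```

With a helper like this, the build and push steps could be retried independently, for example `retry 3 5 build_image && retry 3 5 push_image`, where `build_image` and `push_image` are hypothetical functions standing in for the real build and push commands.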

- How to verify it

1. Create a MachineOSConfig with a syntax error in the Containerfile, or use an invalid image push secret.
2. Allow the build to run. Eventually, it will fail.
3. Review the logs for the build pod. You should see statements such as "Retry 3/3 exited 1, no more retries left."
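
The log check in the last step can also be scripted. The sketch below is only an illustration: the function name `retries_exhausted` is hypothetical, and the pattern assumes the log line always has the same shape as the example message quoted above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads log text on stdin and succeeds if it
# contains a line reporting that all retries were used up.
retries_exhausted() {
  grep -Eq 'Retry [0-9]+/[0-9]+ exited [0-9]+, no more retries left\.'
}
```

One possible use is to pipe the build pod's logs into it, for example `oc logs "$BUILD_POD" -n "$NAMESPACE" | retries_exhausted`, where `BUILD_POD` and `NAMESPACE` are placeholders you would need to fill in for your cluster.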

- Description for the changelog
Image builds and pushes should be retried

openshift-ci · 2024-07-11T13:38:34Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2024-07-22T14:41:51Z

@cheesesashimi: This pull request references MCO-1230 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

cheesesashimi · 2024-07-26T13:45:57Z

/test e2e-gcp-op-techpreview

yuqi-zhang

/lgtm

Code logically makes sense. Curious: would we ever want to retry at the pod level, i.e. if a builder pod fails, we retry with a new builder pod? That said, I do think this way is a bit better, since the different operations (build and push) can be retried separately.

cheesesashimi · 2024-07-29T16:12:28Z

@yuqi-zhang

> if a builder pod fails, we retry with a new builder pod

I want to do that eventually. I opened MCO-1231 to consider using Kubernetes Jobs instead of bare pods like we're doing here.

cheesesashimi · 2024-07-29T16:18:44Z

Also, I just wanted to point out that the TestYumReposBuilds failure should be alleviated once #4471 lands. Basically, the credential got rotated while the build was running and the push operation failed. Although I would've expected it to retry a few more times before failing, hmm...

I think I found it. It's another one of those weird Bash footguns. I think I eventually want to do away with Bash as the entrypoint for this and have a Golang binary that does all of the setup, retries, etc. It would be nice if I could do something like this instead: https://github.com/containers/buildah/blob/main/docs/tutorials/04-include-in-your-build-tool.md

yuqi-zhang · 2024-07-29T23:43:10Z

/lgtm

openshift-ci · 2024-07-29T23:43:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cheesesashimi,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2024-07-30T00:28:59Z

/retest-required

Remaining retests: 0 against base HEAD 4a518fe and 2 for PR HEAD a0c1d84 in total

openshift-ci-robot · 2024-07-30T04:28:59Z

/retest-required

Remaining retests: 0 against base HEAD 11015e9 and 1 for PR HEAD a0c1d84 in total

openshift-ci · 2024-07-30T06:00:37Z

@cheesesashimi: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-upi | `a0c1d84` | link | false | `/test e2e-vsphere-ovn-upi` |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2024-07-30T08:23:24Z

[ART PR BUILD NOTIFIER]

Distgit: ose-machine-config-operator
This PR has been included in build ose-machine-config-operator-container-v4.18.0-202407300742.p0.g9f30598.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 11, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 11, 2024

cheesesashimi changed the title ~~retry build and push operations multiple times~~ MCO-1230: Retry build and push operations multiple times Jul 22, 2024

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 22, 2024

cheesesashimi marked this pull request as ready for review July 22, 2024 14:43

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 22, 2024

openshift-ci bot requested review from jkyros and sinnykumari July 22, 2024 14:45

yuqi-zhang approved these changes Jul 26, 2024

openshift-ci bot assigned yuqi-zhang Jul 26, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 26, 2024

retry build and push operations multiple times

a0c1d84

cheesesashimi force-pushed the zzlotnik/add-build-retries branch from 4aa53f8 to a0c1d84 Compare July 29, 2024 16:27

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 29, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 29, 2024

openshift-merge-bot bot merged commit 9f30598 into openshift:master Jul 30, 2024
