Restart ironic-agent.service when it fails #34

zhouhao3 · 2022-01-25T03:13:29Z

Avoid the situation that ironic-agent.service will not restart after failure due to some circumstances.

Signed-off-by: Zhou Hao [email protected]

openshift-ci · 2022-01-25T03:15:06Z

Hi @zhouhao3. Thanks for your PR.

I'm waiting for a openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

hs0210 · 2022-01-26T10:20:14Z

IMO adding Restart=on-failure is a code improvement, not a bug fix.

When testing IPI with #33 on iRMC, ironic-agent sometimes failed to start successfully, but after a manual restart, the service ran successfully. We noticed that the error was related to the network. Unfortunately, we didn't record the error message at that time. Now we've done 6 tests in a row but can't reproduce the error. We guess that when the service was pulling the image, the proxy was just unavailable, causing the service to fail to start.

Therefore, if the service itself supports restart after failure, it can avoid permanent failure of the service due to factors such as network instability.

dtantsur · 2022-01-28T17:13:24Z

/ok-to-test

hs0210 · 2022-01-28T23:56:50Z

The error reappeared.

-- Logs begin at Fri 2022-01-28 23:33:55 UTC, end at Fri 2022-01-28 23:38:23 UTC. --
Jan 28 23:34:41 master-1 systemd[1]: Starting Ironic Agent...
Jan 28 23:34:42 master-1 podman[2243]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b9f235258d1bf30f5819089506e1ee5b698dcbe1b52ce9e49cbaad6ad96c1897...
Jan 28 23:34:46 master-1 podman[2243]: Getting image source signatures
Jan 28 23:34:50 master-1 podman[2243]: Error: Error reading blob sha256:eac1b95df832dc9f172fd1f07e7cb50c1929b118a4249ddd02c6318a677b506a: Get "https://cdn02.quay.io/sha256/ea/eac1b95df832dc9f172fd1f07e7cb50c1929b118a4249ddd02c6318a677b506a?Expires=1643413370&Signature=Tix43Y0YVkJaSgjvc6stWUiASYCRxEc7C-JSNge-R4zbwrS9bhgNNIMJDBpy4lbNj-r7KJJw7q~rmO8xiI47sYJkI1zLRAFFijTFDVCKbNGxaYUBfUvrC0BEcCdLTosgkS8m2Nv1BOqjQcZI61B2ChjLMtYFzmwllSOMga9UV9A2iZJDTi9pSjLPaFl-qam1W57FJQIh107gfA3lo7NounE337WKt6rQWSVgFfvFe0G~Sjs3yMknHcCvjuiKlF86F05RJAD0VBPMOCnDqYHLHOA~-6d-ZmSmBqK307Ab2~Bic~eqahQKxewWuRK70AWtG24LLKlDeGBumH006nTKSw__&Key-Pair-Id=APKAJ67PQLWGCSP66DGA": ERROR
Jan 28 23:34:50 master-1 systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
Jan 28 23:34:50 master-1 systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
Jan 28 23:34:50 master-1 systemd[1]: Failed to start Ironic Agent.

fenggw-fnst · 2022-01-29T07:44:49Z

It happened again:

-- Logs begin at Sat 2022-01-29 06:31:32 UTC, end at Sat 2022-01-29 07:46:48 UTC. --
Jan 29 06:32:19 worker-0 systemd[1]: Starting Ironic Agent...
Jan 29 06:32:20 worker-0 podman[2277]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f07e1596763e4f1b0834cb700037fd1958071ae26492d8c3fc686afc5ef560cb...
Jan 29 06:32:26 worker-0 podman[2277]: Error: Error parsing image configuration: Get "https://cdn02.quay.io/sha256/db/db1135106ca881540040a27ecd1e8d3af97eff3881095ea9a48a1f5b1d1e407d?Expires=1643438266&Signature=NFPoGINK-IW0lgieHhLa~G-9m462emXWjEvcXqMJhMAJSNPZWrA4ywibhgxguUYy60GZFRVpZcZsm0OQ3aRhLkbTrRPCgv6H3ndmCV-kpHgvqajCm-QZM5PBI~pdnNajxyLPKqfVEZq9cntOE3CidbERlUNT~92P4cL2Zlp~nE1JHvFS~Y5BkpvscQWshEUEBlc~ZmAPO1QF9CSGbLlB7TFsNg22M65rpikNqHszyk9N5STuq6p5SqrRTGs78JUm-AIEDq~CZWfJVcHTqliCUxfMvSJEsl-wLIfcz2NfQIhctDHwrmGVEWdKjMs1Y7T4UkHGb5IEpBoarp~nApvSkw__&Key-Pair-Id=APKAJ67PQLWGCSP66DGA": ERROR
Jan 29 06:32:26 worker-0 systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
Jan 29 06:32:26 worker-0 systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
Jan 29 06:32:26 worker-0 systemd[1]: Failed to start Ironic Agent.

andfasano · 2022-01-31T09:40:58Z

In case of non-transient failure, with this approach isn't the service restarted continuosly?

dtantsur · 2022-01-31T11:07:12Z

Restarting the service matches the behavior of our previous ramdisk, so it should be fine.

Please fix the unit test failure.

andfasano · 2022-01-31T17:32:53Z

Restarting the service matches the behavior of our previous ramdisk, so it should be fine.

Yes, I was thinking more to the fact that in case of permanent failure (ie network outage) the installation will probably fail after the bootstrap timeout (30m), while a mechanism based on a counter could fail faster.

dtantsur · 2022-01-31T18:06:57Z

I don't think we have a way to communicate the agent failure back to the control plane.

andfasano · 2022-02-01T12:21:59Z

I don't think we have a way to communicate the agent failure back to the control plane.

Right, not sure if at least the error reporting could be improved then ( I guess @hs0210 fetched the journal log directly from within the instance?)

Signed-off-by: Zhou Hao <[email protected]>

zhouhao3 · 2022-02-07T02:29:56Z

/retest

zhouhao3 · 2022-02-07T02:30:49Z

@dtantsur @andfasano updated, PTAL.

zhouhao3 · 2022-02-07T05:15:23Z

/retest

openshift-ci · 2022-02-07T08:03:25Z

@zhouhao3: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

zhouhao3 · 2022-02-14T01:26:46Z

Right, not sure if at least the error reporting could be improved then ( I guess @hs0210 fetched the journal log directly from within the instance?)

We can log in to the master to get the journal log, but we haven't thought of a good way to improve the error report.

@dtantsur @andfasano When we use IPI deployment, we still encounter this problem from time to time. Can you review this PR? thanks.

dtantsur · 2022-02-14T12:59:32Z

/lgtm

andfasano · 2022-02-14T13:36:25Z

/approve

openshift-ci · 2022-02-14T13:36:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, zhouhao3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [andfasano]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sshnaidm · 2022-04-19T15:14:03Z

Can we port it to 4.10 as well?

elfosardo · 2022-06-13T14:50:02Z

/cherry-pick release-4.10

openshift-cherrypick-robot · 2022-06-13T14:50:50Z

@elfosardo: new pull request created: #52

Details

In response to this:

/cherry-pick release-4.10

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot requested review from andfasano and sadasu January 25, 2022 03:15

openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 25, 2022

openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 28, 2022

zhouhao3 force-pushed the add-restart-for-service branch from 71d5d4e to 02d1bec Compare February 7, 2022 01:37

openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2022

zhouhao3 force-pushed the add-restart-for-service branch 2 times, most recently from d46a421 to 6a48405 Compare February 7, 2022 01:46

openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 7, 2022

Restart ironic-agent.service when it fails

6a48405

Signed-off-by: Zhou Hao <[email protected]>

openshift-ci bot assigned dtantsur Feb 14, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2022

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2022

openshift-merge-robot merged commit 5c400b9 into openshift:main Feb 14, 2022

openshift-ci bot mentioned this pull request Jun 13, 2022

Bug 2060968: Add delay between ironic-agent restart, but restart forever #51

Merged

openshift-cherrypick-robot mentioned this pull request Jun 13, 2022

[release-4.10] Restart ironic-agent.service when it fails #52

Closed

Restart ironic-agent.service when it fails #34

Restart ironic-agent.service when it fails #34

Uh oh!

Conversation

zhouhao3 commented Jan 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openshift-ci bot commented Jan 25, 2022

Uh oh!

hs0210 commented Jan 26, 2022

Uh oh!

dtantsur commented Jan 28, 2022

Uh oh!

hs0210 commented Jan 28, 2022

Uh oh!

fenggw-fnst commented Jan 29, 2022

Uh oh!

andfasano commented Jan 31, 2022

Uh oh!

dtantsur commented Jan 31, 2022

Uh oh!

andfasano commented Jan 31, 2022

Uh oh!

dtantsur commented Jan 31, 2022

Uh oh!

andfasano commented Feb 1, 2022

Uh oh!

zhouhao3 commented Feb 7, 2022

Uh oh!

zhouhao3 commented Feb 7, 2022

Uh oh!

zhouhao3 commented Feb 7, 2022

Uh oh!

openshift-ci bot commented Feb 7, 2022

Uh oh!

zhouhao3 commented Feb 14, 2022

Uh oh!

dtantsur commented Feb 14, 2022

Uh oh!

andfasano commented Feb 14, 2022

Uh oh!

openshift-ci bot commented Feb 14, 2022

Uh oh!

sshnaidm commented Apr 19, 2022

Uh oh!

elfosardo commented Jun 13, 2022

Uh oh!

openshift-cherrypick-robot commented Jun 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

zhouhao3 commented Jan 25, 2022 •

edited

Loading