Restart ironic-agent.service when it fails #34
|
Hi @zhouhao3. Thanks for your PR. I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. |
|
IMO adding When testing IPI with #33 on iRMC, ironic-agent sometimes failed to start, but it ran successfully after a manual restart. The error appeared to be network-related. Unfortunately, we did not record the error message at the time, and we have since run 6 tests in a row without reproducing it. Our guess is that the proxy was unavailable while the service was pulling the image, which caused the service to fail to start. If the service itself restarts after a failure, it can avoid failing permanently because of factors such as network instability. |
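For readers unfamiliar with the mechanism under discussion, here is a minimal sketch of a restart-on-failure policy in a systemd unit. `Restart=` and `RestartSec=` are standard systemd directives; the specific values are illustrative assumptions and are not taken from this PR's diff.

```ini
# Hypothetical fragment of ironic-agent.service (values are assumptions).
[Service]
# Restart the service whenever it exits with a failure,
# for example when an image pull fails because the proxy is briefly unavailable.
Restart=on-failure
# Wait between attempts so a transient network problem has time to clear.
RestartSec=5
```

With a policy like this, a failed start no longer requires a manual `systemctl restart ironic-agent.service`.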
|
/ok-to-test |
|
The error reappeared. |
|
It happened again: |
|
In case of a non-transient failure, won't this approach restart the service continuously? |
|
Restarting the service matches the behavior of our previous ramdisk, so it should be fine. Please fix the unit test failure. |
Yes, I was thinking more of the fact that in case of a permanent failure (i.e. a network outage) the installation will probably fail only after the bootstrap timeout (30m), while a mechanism based on a counter could fail faster.
|
I don't think we have a way to communicate the agent failure back to the control plane. |
Right, I'm not sure whether at least the error reporting could be improved then (I guess @hs0210 fetched the journal log directly from within the instance?)
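The counter-based alternative raised above could be sketched with systemd's start rate limiting. This is only an illustration of the idea, not something this PR implements; `StartLimitIntervalSec=` and `StartLimitBurst=` are standard systemd directives, and all values are assumptions.

```ini
# Hypothetical fragment: give up after a bounded number of start attempts.
[Unit]
# Allow at most 5 starts within a 10-minute window; a further start
# is refused and the unit is left in the failed state.
StartLimitIntervalSec=600
StartLimitBurst=5

[Service]
Restart=on-failure
# The window above must be long enough to cover 5 attempts spaced
# 30 seconds apart, otherwise the limit is never reached.
RestartSec=30
```

With values like these the unit would stop retrying well before a 30m bootstrap timeout, but, as noted in the thread, the failure would still have to be read from the journal on the host because nothing reports it back to the control plane.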
71d5d4e to 02d1bec
d46a421 to 6a48405
Signed-off-by: Zhou Hao <[email protected]>
|
/retest |
|
@dtantsur @andfasano updated, PTAL. |
|
/retest |
|
@zhouhao3: all tests passed! |
We can log in to the master to get the journal log, but we haven't thought of a good way to improve the error reporting. @dtantsur @andfasano We still encounter this problem from time to time with IPI deployments. Can you review this PR? Thanks. |
|
/lgtm |
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: andfasano, zhouhao3 |
|
Can we port it to 4.10 as well? |
|
/cherry-pick release-4.10 |
|
@elfosardo: new pull request created: #52 |
Avoid the situation where ironic-agent.service is not restarted after it fails.
Signed-off-by: Zhou Hao [email protected]