pkg/start/start: Drop bootstrapPodsRunningTimeout #24
Conversation
And plumb through contexts from runCmdStart so we can drop the context.TODO() calls.

bootstrapPodsRunningTimeout was added in d07548e (Add --tear-down-event flag to delay tear down, 2019-01-24, openshift#9), although Stefan had no strong opinion on them then [1]. But as it stands, a hung pod creates loops like [2]:

    $ tar xf log-bundle.tar.gz
    $ cd bootstrap/journals
    $ grep 'Started Bootstrap\|Error: error while checking pod status' bootkube.log
    Apr 16 17:46:23 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.
    Apr 16 18:12:41 ip-10-0-4-87 bootkube.sh[1510]: Error: error while checking pod status: timed out waiting for the condition
    Apr 16 18:12:41 ip-10-0-4-87 bootkube.sh[1510]: Error: error while checking pod status: timed out waiting for the condition
    Apr 16 18:12:46 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.
    Apr 16 18:33:02 ip-10-0-4-87 bootkube.sh[11418]: Error: error while checking pod status: timed out waiting for the condition
    Apr 16 18:33:02 ip-10-0-4-87 bootkube.sh[11418]: Error: error while checking pod status: timed out waiting for the condition
    Apr 16 18:33:07 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.

Instead of having systemd keep kicking bootkube.sh (which in turn keeps launching cluster-bootstrap), removing this timeout will just leave cluster-bootstrap running while folks gather logs from the broken cluster. And the less spurious-restart noise there is in those logs, the easier it will be to find what actually broke.

[1]: openshift#9 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1700504#c14
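As a rough illustration of the intended behavior change, here is a self-contained sketch with stand-in names (waitUntilReady is a dummy polling loop, not the real waitUntilPodsRunning in pkg/start): instead of wrapping the pod check in a context.WithTimeout rooted at context.TODO(), the wait runs on a context plumbed down from the command entry point, so it ends only when the pods come up or the process itself is stopped.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"time"
)

// waitUntilReady stands in for the pod-status polling loop
// (waitUntilPodsRunning in pkg/start); it polls check until it
// succeeds or ctx is done.
func waitUntilReady(ctx context.Context, check func() bool) error {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("error while checking pod status: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	// Plumb a single context down from the command entry point (runCmdStart
	// in the real code) instead of wrapping context.TODO() in a fixed
	// bootstrapPodsRunningTimeout. The wait now ends only when the check
	// succeeds or the process is interrupted.
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	if err := waitUntilReady(ctx, func() bool { return false }); err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
	}
}
```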
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking

If they are not already assigned, you can assign the PR to them by writing. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing.
    if err = waitUntilPodsRunning(ctx, client, b.requiredPodPrefixes); err != nil {
        return err
    }
    cancel()
This doesn't work; it breaks the switch-over logic. We have to force createAssetsInBackground to stop creating assets. We did that with this cancel() call, and then restarted asset creation afterwards, potentially against the ELB, so that we can shut down the bootstrap control plane.
To fix this: wrap assetContext in another context, cancel it here, and then use the inner context below.
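A minimal sketch of one way to read that suggestion, using stand-in names rather than the real pkg/start code (createAssetsInBackground here is a dummy loop): the first round of asset creation runs on a child of assetContext, only the child is cancelled at the switch-over point, and asset creation is then restarted on a context that is still alive.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// createAssetsInBackground is a dummy stand-in for the real asset-creation
// loop; it just runs until its context is cancelled.
func createAssetsInBackground(ctx context.Context, target string) {
	for {
		select {
		case <-ctx.Done():
			fmt.Println("stopping asset creation against", target)
			return
		case <-time.After(20 * time.Millisecond):
			fmt.Println("creating assets against", target)
		}
	}
}

func main() {
	// Outer context: lives for the whole bootstrap run ("assetContext").
	assetContext, assetCancel := context.WithCancel(context.Background())
	defer assetCancel()

	// Inner context: wraps assetContext just for the first round of asset
	// creation, so that round can be stopped without killing the outer
	// context that later phases still need.
	innerCtx, innerCancel := context.WithCancel(assetContext)
	go createAssetsInBackground(innerCtx, "bootstrap API endpoint")

	// ... wait for the required pods here ...
	time.Sleep(100 * time.Millisecond)

	// Switch-over point: stop only the first round of asset creation.
	innerCancel()

	// Restart asset creation on the still-live outer context, e.g. pointed
	// at the load balancer, so the bootstrap control plane can be torn down.
	go createAssetsInBackground(assetContext, "load-balanced endpoint")
	time.Sleep(100 * time.Millisecond)
}
```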
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.

/lifecycle stale
@wking: The following tests failed, say /retest to rerun all failed tests.

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close.

/lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen.

/close
@openshift-bot: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.