bootstrap: record progress of services #4742
Example of the release-image.service.json file when the pull secret is not valid.
Each OpenShift service running on the bootstrap machine will now create a json file in /var/log/openshift/ that contains an array of entries detailing the progress that the service has made. The entries included in the json file are the following.
* Service start
* Service end, with result and error details
* Service stage start
* Service stage end, with result and error details

The json files in /var/log/openshift will be collected by the bootstrap gather in /bootstrap/services/ for evaluation by the installer for improved failure reporting to the user. The evaluation is left for follow-on work.

https://issues.redhat.com/browse/CORS-1542
https://issues.redhat.com/browse/CORS-1543
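To make the file format described above concrete, here is a minimal sketch of how a service script could append entries to its json array. This is not the PR's bootstrap-service-record.sh: the function name `record_entry`, its arguments, and the sed-based append are assumptions made for illustration, and values are written without JSON escaping. The directory default and the SERVICE_RECORDS_DIR override are taken from this conversation.

```shell
# Hypothetical sketch, not the script from this PR.
# record_entry SERVICE PHASE [RESULT]
# Appends one entry to <records dir>/SERVICE.json, keeping the file a valid json array.
# Assumes SERVICE, PHASE and RESULT contain no characters that need JSON escaping.
record_entry() {
  local service="$1" phase="$2" result="${3:-}"
  # SERVICE_RECORDS_DIR overrides the default location (useful for local testing).
  local dir="${SERVICE_RECORDS_DIR:-/var/log/openshift}"
  local file="${dir}/${service}.json"
  local timestamp entry tmp

  timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

  # Build the entry one key per line, in the same layout as the examples in this PR.
  entry="$(printf '  {\n    "timestamp": "%s",\n    "phase": "%s"' "${timestamp}" "${phase}")"
  if [ -n "${result}" ]; then
    entry="$(printf '%s,\n    "result": "%s"' "${entry}" "${result}")"
  fi
  entry="$(printf '%s\n  }' "${entry}")"

  mkdir -p "${dir}"
  if [ ! -s "${file}" ]; then
    # First entry: create the array.
    printf '[\n%s\n]\n' "${entry}" > "${file}"
  else
    # Later entries: drop the closing "]", add a comma after the previous
    # entry, then append the new entry and close the array again.
    tmp="${file}.tmp"
    sed '$d' "${file}" | sed '$s/$/,/' > "${tmp}"
    printf '%s\n]\n' "${entry}" >> "${tmp}"
    mv "${tmp}" "${file}"
  fi
}
```

With this sketch, calling `record_entry release-image "start"` and later `record_entry release-image "end" success` would leave a two-entry array in release-image.json. A production script would more likely build the json with a tool such as jq so that error details are escaped correctly.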
Force-pushed from 038ce92 to a35c8dd.
patrickdillon
left a comment
The 1st entry in the release-image example does not have a name or any identifying features:
  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },
Is that a bug? I think each entry should identify either the stage or service...
# add_service_record_entry adds a record entry to the service records file.
# PHASE - phase being recorded; one of "start", "end", "stage start", or "stage end"
consider
start -> service start
end -> service end
Sounds reasonable.
I'll also change the phases for pre- and post-commands to "pre-command start", "pre-command end", "post-command start", and "post-command end".
If a service has pre- or post-commands that could either run for significant periods or could
potentially fail, then those commands should add to the json file as well. Such a command should
source the same /usr/local/bin/bootstrap-service-record.sh script. It should also set either the
I don't think I'm grasping the PRE_COMMAND/POST_COMMAND concept, although this description is fine.
What value does this add? It doesn't seem like it, but does it link services together?
Let's take the kubelet service as an example.
As described in the doc, that service has a pre-command of /usr/local/bin/kubelet-pause-image.sh. That means that, when the kubelet service runs, the first step of the service is to run the kubelet-pause-image script. If the kubelet-pause-image script fails, then the service fails. On restart of the service, the kubelet-pause-image script is run again. The main kubelet command does not start until the kubelet-pause-image script succeeds.
When analyzing a bootstrap gather bundle, we want to be able to look at the kubelet.json file to see (1) whether the service completed successfully, (2) whether the service failed and restarted, or (3) whether the service is still running. If the service is restarting because the kubelet-pause-image script is failing, then we want to be able to ascertain that from the preCommand: kubelet-pause-image entries in the kubelet.json file. Likewise, if the service is stuck waiting on the kubelet-pause-image script to complete, then we want to be able to see that from a preCommand: kubelet-pause-image, phase: start entry without a subsequent preCommand: kubelet-pause-image, phase: end entry.
You could almost think of the pre- and post-command entries as similar to stage entries, except that they run before and after the main command. The entries are included in the same json file as the rest of the entries for the service.
Here is an example of a kubelet.json file.
[
  {
    "timestamp": "2021-03-12T00:43:37Z",
    "preCommand": "kubelet-pause-image",
    "phase": "start"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "preCommand": "kubelet-pause-image",
    "phase": "end",
    "result": "success"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "phase": "start"
  }
]
The service is identified by the name of the json file. I could add the service name to each entry, if you feel strongly that that would be helpful.
Filename is good. I forgot about that viewing in this context.
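The "stuck in a pre-command" case discussed in this thread (a preCommand start entry with no subsequent end entry) can be checked mechanically. The following is a rough sketch only: the function name `pending_pre_commands` is hypothetical, and it assumes the file is laid out with one key per line and string values, as in the kubelet.json example in this thread. It matches phases by their "start" and "end" suffixes so that it tolerates either naming of the pre-command phases.

```shell
# Hypothetical sketch for inspecting a gathered service record file.
# pending_pre_commands FILE
# Prints the name of each pre-command whose latest entry is a start with no
# later end. Assumes one "key": "value" pair per line, as in kubelet.json.
pending_pre_commands() {
  awk '
    # Return the string value from a line of the form  "key": "value"
    function value(line,    parts) {
      split(line, parts, "\"")
      return parts[4]
    }
    /[{]/                 { pre = ""; phase = "" }   # a new entry begins
    /"preCommand"[ \t]*:/ { pre = value($0) }
    /"phase"[ \t]*:/      { phase = value($0) }
    /[}]/ {                                          # the entry is complete
      if (pre != "") {
        if (phase ~ /start$/)    pending[pre] = 1
        else if (phase ~ /end$/) delete pending[pre]
      }
    }
    END { for (name in pending) print name }
  ' "$1"
}
```

Run against the kubelet.json example in this thread, the function prints nothing, because the kubelet-pause-image start entry is followed by an end entry. If the file stopped after the first entry, it would print kubelet-pause-image. The installer's actual evaluation is left for follow-on work, so this only illustrates the idea.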
* add "pre-command start", "pre-command end", "post-command start" and "post-command end" phases
* fixed issue where the kubelet was not notifying systemd that it had started since it had been moved to a script
echo "Gathering bootstrap service records ..."
mkdir -p "${ARTIFACTS}/bootstrap/services"
sudo cp -r /var/log/openshift/* "${ARTIFACTS}/bootstrap/services/"
This does not respect $SERVICE_RECORDS_DIR but that seems ok to me.
I think that is fine. The $SERVICE_RECORDS_DIR override is really just there for testing the entry recording script locally.
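For reference, if the gather step ever did need to honor the override, a variant along these lines would be one way to do it. This is a sketch under assumptions, not a proposed change to this PR: the function name `gather_service_records` is hypothetical, and sudo is omitted so the function can be exercised locally without root.

```shell
# Hypothetical variant of the gather step that honors SERVICE_RECORDS_DIR.
# gather_service_records ARTIFACTS_DIR
gather_service_records() {
  local artifacts="$1"
  # Fall back to the hardcoded location when no override is set.
  local records_dir="${SERVICE_RECORDS_DIR:-/var/log/openshift}"

  echo "Gathering bootstrap service records ..."
  mkdir -p "${artifacts}/bootstrap/services"
  if [ -d "${records_dir}" ]; then
    # Copying "dir/." instead of "dir/*" also works when the directory is empty.
    cp -r "${records_dir}/." "${artifacts}/bootstrap/services/"
  fi
}
```

Since the override exists only for testing the recording script locally, keeping the hardcoded path in the real gather script remains the simpler choice.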
/lgtm Left one comment that gather is using a hardcoded path, so logs could (in theory) be in a different dir for gather. It does not seem like erring on the
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: staebler
/retest Please review the full test history for this PR and help us cut down flakes.
@staebler: The following tests failed, say