Conversation

@staebler
Contributor

Each OpenShift service running on the bootstrap machine will now create a JSON file in /var/log/openshift/ that contains an array of entries detailing the progress that the service has made.

The entries included in the JSON file are the following:

  • Service start
  • Service end, with result and error details
  • Service stage start
  • Service stage end, with result and error details

The JSON files in /var/log/openshift will be collected by the bootstrap gather into /bootstrap/services/ so that the installer can evaluate them for improved failure reporting to the user. The evaluation itself is left for follow-on work.

https://issues.redhat.com/browse/CORS-1542
https://issues.redhat.com/browse/CORS-1543
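
For illustration, a minimal sketch of what recording these entries from a service script might look like. The helper script path and the add_service_record_entry function appear later in this PR; the calling convention (phase passed as the first argument) and the placeholder work are assumptions, not the PR's exact code.

#!/bin/bash
# Hypothetical sketch of a bootstrap service script that records progress entries.
# The helper path and function name come from the PR diff; how the stage name
# and result details are passed is assumed here.
. /usr/local/bin/bootstrap-service-record.sh

add_service_record_entry "start"
add_service_record_entry "stage start"
echo "placeholder for the stage's real work"
add_service_record_entry "stage end"
add_service_record_entry "end"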

@staebler
Contributor Author

staebler commented Mar 11, 2021

Example of the release-image.service.json file when the pull secret is not valid.

[
  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },
  {
    "timestamp": "2021-03-11T04:50:00Z",
    "stage": "pull-release-image",
    "phase": "stage start"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "25 main /usr/local/bin/release-image-download.sh",
    "errorMessage": "Starting Download the OpenShift Release Image...\nPulling registry.ci.openshift.org/origin/release:4.8...\nError: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage start"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "25 main /usr/local/bin/release-image-download.sh",
    "errorMessage": "Error: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required\nPull failed. Retrying registry.ci.openshift.org/origin/release:4.8...\nError: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage start"
  },
  {
    "timestamp": "2021-03-11T04:50:02Z",
    "stage": "pull-release-image",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "25 main /usr/local/bin/release-image-download.sh",
    "errorMessage": "Error: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required\nPull failed. Retrying registry.ci.openshift.org/origin/release:4.8...\nError: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required"
  }
]

@staebler force-pushed the bootstrap_service_markers branch from 038ce92 to a35c8dd on March 12, 2021
Contributor

@patrickdillon left a comment

The 1st entry in the release-image example does not have a name or any identifying features:

  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },

Is that a bug? I think each entry should identify either the stage or service...



# add_service_record_entry adds a record entry to the service records file.
# PHASE - phase being recorded; one of "start", "end", "stage start", or "stage end"
Contributor

consider
start -> service start
end -> service end

Contributor Author

Sounds reasonable.

Contributor Author

I'll change the phase for pre- and post-commands to pre-command start, pre-command end, post-command start, and post-command end, too.


If a service has pre- or post-commands that could either run for significant periods or could
potentially fail, then those commands should add to the json file as well. Such a command should
source the same /usr/local/bin/bootstrap-service-record.sh script. It should also set either the
Contributor

I don't think I'm grasping the PRE-/POST-COMMAND concept, although this description is fine.

What value does this add? Does it link services together? It doesn't seem like it.

Contributor Author

Let's take the kubelet service as an example.

As described in the doc, that service has a pre-command of /usr/local/bin/kubelet-pause-image.sh. What that means is that, when the kubelet service runs, the first step of the service is to run the kubelet-pause-image script. If the kubelet-pause-image script fails, then the service fails. On restart of the service, the kubelet-pause-image script is run again. The main kubelet command does not start until the kubelet-pause-image script succeeds.

When analyzing a bootstrap gather bundle, we want to be able to look at the kubelet.json file to see (1) whether the service completed successfully, (2) whether the service failed and restarted, or (3) whether the service is still running. If the service is restarting because the kubelet-pause-image script is failing, then we want to be able to ascertain that from preCommand: kubelet-pause-image entries in the kubelet.json file. Likewise, if the service is stuck waiting on the kubelet-pause-image script to complete, then we want to be able to see that from a preCommand: kubelet-pause-image, phase: start entry without a subsequent preCommand: kubelet-pause-image, phase: end entry.

You could almost think of the pre- and post-commands entries as similar to stage entries except that they run before and after the main command. The entries are included in the same json file as the rest of the entries for the service.
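
For context, the wiring described above maps onto the systemd unit roughly like this (an abridged, assumed fragment; the main command's script name is a guess, not copied from the PR):

[Unit]
Description=Kubernetes Kubelet

[Service]
# Pre-command: if this script fails, the unit fails and restarts; the main
# command below does not start until the pre-command succeeds.
ExecStartPre=/usr/local/bin/kubelet-pause-image.sh
ExecStart=/usr/local/bin/kubelet.sh
Restart=always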

Contributor Author

@staebler Mar 12, 2021

Here is an example of a kubelet.json file.

[
  {
    "timestamp": "2021-03-12T00:43:37Z",
    "preCommand": "kubelet-pause-image",
    "phase": "start"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "preCommand": "kubelet-pause-image",
    "phase": "end",
    "result": "success"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "phase": "start"
  }
]

@staebler
Contributor Author

The 1st entry in the release-image example does not have a name or any identifying features:

  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },

Is that a bug? I think each entry should identify either the stage or service...

The service is identified by the name of the json file. I could add the service name to each entry, if you feel strongly that that would be helpful.

@patrickdillon
Contributor

The 1st entry in the release-image example does not have a name or any identifying features:

  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },

Is that a bug? I think each entry should identify either the stage or service...

The service is identified by the name of the json file. I could add the service name to each entry, if you feel strongly that that would be helpful.

The filename is good. I forgot about that when viewing the snippet in this context.

* add "pre-command start", "pre-command end", "post-command start" and "post-command end" phases
* fixed issue where the kubelet was not notifying systemd that it had started since it had been moved to a script
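
With that change, pre-command entries in a file like kubelet.json would presumably look like the following (illustrative; based on the earlier example, with the renamed phases):

[
  {
    "timestamp": "2021-03-12T00:43:37Z",
    "preCommand": "kubelet-pause-image",
    "phase": "pre-command start"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "preCommand": "kubelet-pause-image",
    "phase": "pre-command end",
    "result": "success"
  }
]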

echo "Gathering bootstrap service records ..."
mkdir -p "${ARTIFACTS}/bootstrap/services"
sudo cp -r /var/log/openshift/* "${ARTIFACTS}/bootstrap/services/"
Contributor

This does not respect $SERVICE_RECORDS_DIR but that seems ok to me.

Contributor Author

I think that is fine. The $SERVICE_RECORDS_DIR override is really just there for testing the entry recording script locally.
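
For instance, a local smoke test might look something like this (hypothetical; how the helper derives the per-service record filename is not shown in this thread):

# Point the helper at a scratch directory instead of /var/log/openshift.
export SERVICE_RECORDS_DIR=/tmp/service-records
mkdir -p "${SERVICE_RECORDS_DIR}"
. ./bootstrap-service-record.sh    # assumed path to a local copy of the helper
add_service_record_entry "start"
cat "${SERVICE_RECORDS_DIR}"/*.json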

@patrickdillon
Contributor

/lgtm

Left one comment noting that gather uses a hardcoded path, so the logs could (in theory) be in a different dir from the one gather copies. It does not seem like erring on the cp for a non-existent dir would stop (break) gather, though.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2021
@staebler
Contributor Author

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

28 similar comments

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2021

@staebler: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-crc                4a1bdaa  link     /test e2e-crc
ci/prow/e2e-aws-workers-rhel7  4a1bdaa  link     /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
