Conversation

@staebler
Contributor

Each OpenShift service running on the bootstrap machine will now create a JSON file in /var/log/openshift/ that contains an array of entries detailing the progress that the service has made.

The entries included in the JSON file are the following:

  • Service start
  • Service end, with result and error details
  • Service stage start
  • Service stage end, with result and error details

The JSON files in /var/log/openshift will be collected by the bootstrap gather into /bootstrap/services/ so that the installer can evaluate them for improved failure reporting to the user. The evaluation itself is left for follow-on work.

https://issues.redhat.com/browse/CORS-1542
https://issues.redhat.com/browse/CORS-1543
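
For illustration, a minimal sketch of what recording these entries from a service script might look like. The helper script path and the add_service_record_entry function appear later in this PR; the calling convention (phase passed as the first argument) and the placeholder work are assumptions, not the PR's exact code.

#!/bin/bash
# Hypothetical sketch of a bootstrap service script that records progress entries.
# The helper path and function name come from the PR diff; how the stage name
# and result details are passed is assumed here.
. /usr/local/bin/bootstrap-service-record.sh

add_service_record_entry "start"
add_service_record_entry "stage start"
echo "placeholder for the stage's real work"
add_service_record_entry "stage end"
add_service_record_entry "end"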

@staebler
Contributor Author

staebler commented Mar 11, 2021

Example of the release-image.service.json file when the pull secret is not valid.

[
  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },
  {
    "timestamp": "2021-03-11T04:50:00Z",
    "stage": "pull-release-image",
    "phase": "stage start"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "25 main /usr/local/bin/release-image-download.sh",
    "errorMessage": "Starting Download the OpenShift Release Image...\nPulling registry.ci.openshift.org/origin/release:4.8...\nError: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage start"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "25 main /usr/local/bin/release-image-download.sh",
    "errorMessage": "Error: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required\nPull failed. Retrying registry.ci.openshift.org/origin/release:4.8...\nError: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required"
  },
  {
    "timestamp": "2021-03-11T04:50:01Z",
    "stage": "pull-release-image",
    "phase": "stage start"
  },
  {
    "timestamp": "2021-03-11T04:50:02Z",
    "stage": "pull-release-image",
    "phase": "stage end",
    "result": "failure",
    "errorLine": "25 main /usr/local/bin/release-image-download.sh",
    "errorMessage": "Error: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required\nPull failed. Retrying registry.ci.openshift.org/origin/release:4.8...\nError: unable to pull registry.ci.openshift.org/origin/release:4.8: Error initializing source docker://registry.ci.openshift.org/origin/release:4.8: unable to retrieve auth token: invalid username/password: unauthorized: authentication required"
  }
]

@staebler force-pushed the bootstrap_service_markers branch from 038ce92 to a35c8dd on March 12, 2021
Contributor

@patrickdillon left a comment

The 1st entry in the release-image example does not have a name or any identifying features:

  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },

Is that a bug? I think each entry should identify either the stage or service...



# add_service_record_entry adds a record entry to the service records file.
# PHASE - phase being recorded; one of "start", "end", "stage start", or "stage end"
Contributor

consider
start -> service start
end -> service end

Contributor Author

Sounds reasonable.

Contributor Author

I'll change the phase for pre- and post-commands to pre-command start, pre-command end, post-command start, and post-command end, too.


If a service has pre- or post-commands that could either run for significant periods or could
potentially fail, then those commands should add to the json file as well. Such a command should
source the same /usr/local/bin/bootstrap-service-record.sh script. It should also set either the
Contributor

I don't think I'm grasping the PRE-/POST-COMMAND concept, although this description is fine.

What value does this add? Does it link services together? It doesn't seem like it.

Contributor Author

Let's take the kubelet service as an example.

As described in the doc, that service has a pre-command of /usr/local/bin/kubelet-pause-image.sh. What that means is that, when the kubelet service runs, the first step of the service is to run the kubelet-pause-image script. If the kubelet-pause-image script fails, then the service fails. On restart of the service, the kubelet-pause-image script is run again. The main kubelet command does not start until the kubelet-pause-image script succeeds.

When analyzing a bootstrap gather bundle, we want to be able to look at the kubelet.json file to see (1) whether the service completed successfully, (2) whether the service failed and restarted, or (3) whether the service is still running. If the service is restarting because the kubelet-pause-image script is failing, then we want to be able to ascertain that from preCommand: kubelet-pause-image entries in the kubelet.json file. Likewise, if the service is stuck waiting on the kubelet-pause-image script to complete, then we want to be able to see that from a preCommand: kubelet-pause-image, phase: start entry without a subsequent preCommand: kubelet-pause-image, phase: end entry.

You could almost think of the pre- and post-commands entries as similar to stage entries except that they run before and after the main command. The entries are included in the same json file as the rest of the entries for the service.
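
For context, the wiring described above maps onto the systemd unit roughly like this (an abridged, assumed fragment; the main command's script name is a guess, not copied from the PR):

[Unit]
Description=Kubernetes Kubelet

[Service]
# Pre-command: if this script fails, the unit fails and restarts; the main
# command below does not start until the pre-command succeeds.
ExecStartPre=/usr/local/bin/kubelet-pause-image.sh
ExecStart=/usr/local/bin/kubelet.sh
Restart=always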

Contributor Author

@staebler Mar 12, 2021

Here is an example of a kubelet.json file.

[
  {
    "timestamp": "2021-03-12T00:43:37Z",
    "preCommand": "kubelet-pause-image",
    "phase": "start"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "preCommand": "kubelet-pause-image",
    "phase": "end",
    "result": "success"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "phase": "start"
  }
]

@staebler
Contributor Author

The 1st entry in the release-image example does not have a name or any identifying features:

  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },

Is that a bug? I think each entry should identify either the stage or service...

The service is identified by the name of the json file. I could add the service name to each entry, if you feel strongly that that would be helpful.

@patrickdillon
Contributor

The 1st entry in the release-image example does not have a name or any identifying features:

  {
    "timestamp": "2021-03-11T04:50:00Z",
    "phase": "start"
  },

Is that a bug? I think each entry should identify either the stage or service...

The service is identified by the name of the json file. I could add the service name to each entry, if you feel strongly that that would be helpful.

The filename is good. I forgot about that when viewing the snippet in this context.

* add "pre-command start", "pre-command end", "post-command start" and "post-command end" phases
* fixed issue where the kubelet was not notifying systemd that it had started since it had been moved to a script
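
With that change, pre-command entries in a file like kubelet.json would presumably look like the following (illustrative; based on the earlier example, with the renamed phases):

[
  {
    "timestamp": "2021-03-12T00:43:37Z",
    "preCommand": "kubelet-pause-image",
    "phase": "pre-command start"
  },
  {
    "timestamp": "2021-03-12T00:43:38Z",
    "preCommand": "kubelet-pause-image",
    "phase": "pre-command end",
    "result": "success"
  }
]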

echo "Gathering bootstrap service records ..."
mkdir -p "${ARTIFACTS}/bootstrap/services"
sudo cp -r /var/log/openshift/* "${ARTIFACTS}/bootstrap/services/"
Contributor

This does not respect $SERVICE_RECORDS_DIR but that seems ok to me.

Contributor Author

I think that is fine. The $SERVICE_RECORDS_DIR override is really just there for testing the entry recording script locally.
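
For instance, a local smoke test might look something like this (hypothetical; how the helper derives the per-service record filename is not shown in this thread):

# Point the helper at a scratch directory instead of /var/log/openshift.
export SERVICE_RECORDS_DIR=/tmp/service-records
mkdir -p "${SERVICE_RECORDS_DIR}"
. ./bootstrap-service-record.sh    # assumed path to a local copy of the helper
add_service_record_entry "start"
cat "${SERVICE_RECORDS_DIR}"/*.json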

@patrickdillon
Contributor

/lgtm

Left one comment noting that gather uses a hardcoded path, so the logs could (in theory) be in a different dir from the one gather copies. It does not seem like erring on the cp for a non-existent dir would stop (break) gather, though.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 17, 2021
@staebler
Contributor Author

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 17, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

28 similar comments

@openshift-ci
Contributor

openshift-ci bot commented Mar 19, 2021

@staebler: The following tests failed, say /retest to rerun all failed tests:

Test name                      Commit   Details  Rerun command
ci/prow/e2e-crc                4a1bdaa  link     /test e2e-crc
ci/prow/e2e-aws-workers-rhel7  4a1bdaa  link     /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
