OCPBUGS-56876: gather: collect logs & analyze node-image-pull #9761
As part of the overlay node image, a new service was introduced to pull the node image in 60c63bb. This commit updates the installer's gather and analyze steps to collect that service's logs and analyze them.
@patrickdillon: This pull request references Jira Issue OCPBUGS-56876, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@patrickdillon: This pull request references Jira Issue OCPBUGS-56876, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
tthvo left a comment:
Nice, with this change, I can now see the logs from journal for the node-image-pull service under bootstrap/journals/node-image-pull.log 😄
However, the installer could not analyze the bundle (i.e. openshift-install analyze) for such image-pull errors. I believe the service record for node-image-pull is missing:
$ ls -la <log-bundle-dir>/bootstrap/services/
total 4
drwxr-xr-x. 1 thvo thvo 40 May 29 16:35 .
drwxr-xr-x. 1 thvo thvo 94 May 29 16:35 ..
Looking at the template for the node-image-pull script, it seems to be missing the crucial . /usr/local/bin/bootstrap-service-record.sh line that records the service.
installer/data/data/bootstrap/files/usr/local/bin/node-image-pull.sh.template
Lines 1 to 6 in 88ba667
#!/bin/bash
set -euo pipefail
# shellcheck source=release-image.sh.template
. /usr/local/bin/release-image.sh
Adding . /usr/local/bin/bootstrap-service-record.sh at the top of the template file seems to record the service phases, and the installer can then analyze the failed service.
pkg/gather/service/analyze.go (outdated)
check func(analysis) bool
optional bool
}{
{name: "node-image-pull", check: checkReleaseImageDownload, optional: false},
I was thinking about adding a unit test case for node-image-pull in:
func TestAnalyzeGatherBundle(t *testing.T) {
But the release-image and node-image-pull services are handled the same way. Let's just rename the test cases to node-image-pull instead, to avoid duplicates and to reflect the service that is actually being used?
Issues go stale after 90d of inactivity.
/lifecycle stale
/remove-lifecycle stale
Update analyze command to check for the failed node-image-pull service, so that users are presented with a helpful error message if they have a bad pull secret.
{name: "release-image", check: checkReleaseImageDownload, optional: false},
{name: "node-image-pull", check: checkNodeImagePull, optional: false},
{name: "bootkube", check: checkBootkubeService, optional: false},
Suggested change:
- {name: "release-image", check: checkReleaseImageDownload, optional: false},
- {name: "node-image-pull", check: checkNodeImagePull, optional: false},
- {name: "bootkube", check: checkBootkubeService, optional: false},
+ {name: "node-image-pull", check: checkNodeImagePull, optional: false},
+ {name: "release-image", check: checkReleaseImageDownload, optional: false},
+ {name: "bootkube", check: checkBootkubeService, optional: false},
I think the order matters, right, according to #4751 (comment)?
IIUC, node-image-pull starts before the other two 🤔, as release-image never seemed to start while node-image-pull was throwing errors... Though I am clueless how that works, because the service unit files don't define such dependencies 😞
$ cat log-bundle-20251027132247/bootstrap/journals/node-image-pull.log
...output-omitted...
Oct 27 19:48:33 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 19:48:43 ip-10-0-160-222 ostree-containe[2243]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bfa8d59154df08085bb75510454b98aa0fda51e
Oct 27 19:48:44 ip-10-0-160-222 node-image-pull.sh[2243]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest sha256:c7ba2a9638c369c24f9d564f9bfa8d59154df08085bb75510454b98aa0fda51e in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
...output-omitted...
$ cat log-bundle-20251027132247/bootstrap/journals/release-image.log
-- No entries --
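Why the order of the entries changes what the user sees can be illustrated with a small sketch. It assumes the analyzer walks the checks in order and reports only the first unhealthy service; that assumption matches the behaviour described in this thread but is not confirmed from the installer source. The types serviceState and check and the function firstProblem are hypothetical stand-ins, not the installer's real types, and the release-image failure message is a placeholder.

```go
package main

import "fmt"

// serviceState is a hypothetical, simplified view of what a log bundle
// records about one bootstrap service.
type serviceState struct {
	started bool // the unit ran at least once
	failed  bool // the unit recorded a failure
}

// check names a service and the message reported when that service failed.
type check struct {
	name    string
	message string
}

// firstProblem walks the checks in order and returns the message for the
// first service that never started or that failed. It returns "" when every
// service looks healthy.
func firstProblem(checks []check, states map[string]serviceState) string {
	for _, c := range checks {
		s := states[c.name] // zero value means "no record", i.e. never started
		if !s.started {
			return fmt.Sprintf("The bootstrap machine did not execute the %s.service systemd unit", c.name)
		}
		if s.failed {
			return c.message
		}
	}
	return ""
}

func main() {
	// Assumed bundle contents: node-image-pull ran and failed,
	// release-image never started.
	states := map[string]serviceState{
		"node-image-pull": {started: true, failed: true},
	}
	releaseFirst := []check{
		{name: "release-image", message: "release-image failed (placeholder message)"},
		{name: "node-image-pull", message: "Node image pull failed on the bootstrap machine"},
	}
	pullFirst := []check{releaseFirst[1], releaseFirst[0]}

	fmt.Println(firstProblem(releaseFirst, states))
	fmt.Println(firstProblem(pullFirst, states))
}
```

With release-image listed first, the sketch stops at the service that never ran and hides the real cause; with node-image-pull listed first, the pull failure is reported.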
With the current change, we will only ever see the output below, which is not what we want, right?
$ openshift-install analyze --file=log-bundle-20251027132247.tar.gz
ERROR The bootstrap machine did not execute the release-image.service systemd unit
If I change the order as in the above comment, we can now see:
$ openshift-install analyze --file=log-bundle-20251027132247.tar.gz
ERROR Node image pull failed on the bootstrap machine
INFO
I noticed the empty INFO line, which is supposed to print the last 3 lines of the service logs. Here, it does not. It seems the node-image-pull service loops on the bootstrap machine and never exits, so its error is never captured.
installer/data/data/bootstrap/files/usr/local/bin/node-image-pull.sh.template
Lines 54 to 58 in d7dc751
while ! ostree container image pull --authfile "/root/.docker/config.json" \
"${ostree_repo}" ostree-unverified-image:docker://"${COREOS_IMAGE}"; do
echo 'Failed to fetch release image; retrying...'
sleep 10
done
$ systemctl status node-image-pull
● node-image-pull.service - Node Image Pull
Loaded: loaded (/etc/systemd/system/node-image-pull.service; static)
Active: activating (start) since Mon 2025-10-27 19:47:56 UTC; 1h 9min ago
Process: 1943 ExecStartPre=chcon --reference=/usr/bin/ostree /usr/local/bin/node-image-pull.sh (code=exited, status=0/SUCCESS)
Main PID: 1949 (node-image-pull)
Tasks: 2 (limit: 99952)
Memory: 608.0M
CPU: 1min 10.703s
CGroup: /system.slice/node-image-pull.service
├─1949 /bin/bash /usr/local/bin/node-image-pull.sh
└─7897 sleep 10
Oct 27 20:57:05 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 20:57:15 ip-10-0-160-222 ostree-containe[7814]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bf>
Oct 27 20:57:15 ip-10-0-160-222 node-image-pull.sh[7814]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest s>
Oct 27 20:57:15 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 20:57:25 ip-10-0-160-222 ostree-containe[7826]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bf>
Oct 27 20:57:26 ip-10-0-160-222 node-image-pull.sh[7826]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest s>
I think we can also improve the UX a bit by checking if the error message is present. If not, we can direct the user to the log file. It seems like the simplest way. WDYT @patrickdillon ?
installer/pkg/gather/service/analyze.go
Lines 188 to 192 in d7dc751
func (a analysis) logLastError() {
for _, l := range strings.Split(a.lastError, "\n") {
logrus.Info(l)
}
}
/jira refresh
@gpei: This pull request references Jira Issue OCPBUGS-56876, which is invalid.
/jira refresh
@gpei: This pull request references Jira Issue OCPBUGS-56876, which is valid. 3 validation(s) were run on this bug.
Still testing this...