Conversation

@patrickdillon
Contributor

As part of the overlay node image, a new service was introduced to pull the node image in
60c63bb

This commit updates the installer gather and analyze to collect these logs and analyze them.

Still testing this...

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels May 29, 2025
@openshift-ci-robot
Contributor

@patrickdillon: This pull request references Jira Issue OCPBUGS-56876, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label May 29, 2025
@openshift-ci openshift-ci bot requested review from bfournie and sadasu May 29, 2025 19:25
@openshift-ci
Contributor

openshift-ci bot commented May 29, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bfournie for approval. For more information see the Code Review Process.

@patrickdillon
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 29, 2025
@openshift-ci-robot
Contributor

@patrickdillon: This pull request references Jira Issue OCPBUGS-56876, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @gpei

@openshift-ci openshift-ci bot requested a review from gpei May 29, 2025 19:27
Member

@tthvo tthvo left a comment

Nice, with this change I can now see the journal logs for the node-image-pull service under bootstrap/journals/node-image-pull.log 😄

However, the installer could not analyze the bundle (i.e. openshift-install analyze) for such image-pull errors. I believe the service record for node-image-pull is missing:

$ ls -la <log-bundle-dir>/bootstrap/services/
total 4
drwxr-xr-x. 1 thvo thvo 40 May 29 16:35 .
drwxr-xr-x. 1 thvo thvo 94 May 29 16:35 ..

Looking at the template for the node-image-pull script, it seems to be missing the crucial . /usr/local/bin/bootstrap-service-record.sh line that records the service (see here).

#!/bin/bash
set -euo pipefail
# shellcheck source=release-image.sh.template
. /usr/local/bin/release-image.sh

Adding . /usr/local/bin/bootstrap-service-record.sh at the top of the template file seems to record the service phases, and the installer can then analyze the failed service.
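
To make the "missing service record" symptom easier to spot, here is a minimal Go sketch that lists which expected services have no record under bootstrap/services/ in an unpacked log bundle. It is only an illustration: the `missingRecords` helper and the one-`<service>.json`-file-per-service naming are assumptions, not the installer's actual code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// missingRecords returns the expected services that have no record file.
// ASSUMPTION: each recorded service has one "<service>.json" file; the
// real bundle layout may differ.
func missingRecords(files []string, expected []string) []string {
	seen := make(map[string]bool, len(files))
	for _, f := range files {
		seen[strings.TrimSuffix(f, ".json")] = true
	}
	var missing []string
	for _, s := range expected {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: records <log-bundle-dir>")
		return
	}
	// The directory is the one listed in the comment above.
	dir := filepath.Join(os.Args[1], "bootstrap", "services")
	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Println("cannot read", dir+":", err)
		return
	}
	var files []string
	for _, e := range entries {
		files = append(files, e.Name())
	}
	expected := []string{"node-image-pull", "release-image", "bootkube"}
	fmt.Println("missing records:", missingRecords(files, expected))
}
```

With the empty directory shown above, all three services would be reported as missing.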

check func(analysis) bool
optional bool
}{
{name: "node-image-pull", check: checkReleaseImageDownload, optional: false},
Member

I was thinking about adding a unit test case for node-image-pull in:

func TestAnalyzeGatherBundle(t *testing.T) {

But the release-image and node-image-pull services are handled the same way. Shall we just rename the existing test cases to node-image-pull instead, to avoid duplicates and reflect the service that is actually being used now?

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 28, 2025
@tthvo
Member

tthvo commented Aug 28, 2025

/remove-lifecycle stale

@openshift-ci openshift-ci bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 28, 2025
Update analyze command to check for the failed node-image-pull
service, so that users are presented with a helpful error
message if they have a bad pull secret.
Comment on lines 87 to 89
{name: "release-image", check: checkReleaseImageDownload, optional: false},
{name: "node-image-pull", check: checkNodeImagePull, optional: false},
{name: "bootkube", check: checkBootkubeService, optional: false},
Member

@tthvo tthvo Oct 27, 2025

Suggested change
- {name: "release-image", check: checkReleaseImageDownload, optional: false},
- {name: "node-image-pull", check: checkNodeImagePull, optional: false},
- {name: "bootkube", check: checkBootkubeService, optional: false},
+ {name: "node-image-pull", check: checkNodeImagePull, optional: false},
+ {name: "release-image", check: checkReleaseImageDownload, optional: false},
+ {name: "bootkube", check: checkBootkubeService, optional: false},

I think the order matters, right? See #4751 (comment).

IIUC, node-image-pull starts before the other two 🤔 I saw that release-image never seemed to start while node-image-pull was throwing errors. Though I am clueless about how that works, because the service unit files don't define such dependencies 😞

$ cat log-bundle-20251027132247/bootstrap/journals/node-image-pull.log 
...output-omitted...
Oct 27 19:48:33 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 19:48:43 ip-10-0-160-222 ostree-containe[2243]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bfa8d59154df08085bb75510454b98aa0fda51e
Oct 27 19:48:44 ip-10-0-160-222 node-image-pull.sh[2243]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest sha256:c7ba2a9638c369c24f9d564f9bfa8d59154df08085bb75510454b98aa0fda51e in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized
...output-omitted...

$ cat log-bundle-20251027132247/bootstrap/journals/release-image.log 
-- No entries --

Member

With the current change, we will only ever see the output below, which is not what we want, right?

$ openshift-install analyze --file=log-bundle-20251027132247.tar.gz 
ERROR The bootstrap machine did not execute the release-image.service systemd unit 

If I change the order as in the comment above, we now see:

$ openshift-install analyze --file=log-bundle-20251027132247.tar.gz 
ERROR Node image pull failed on the bootstrap machine 
INFO        
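
To illustrate why the order changes what the user sees, here is a small self-contained Go sketch of a first-match check list. It is not the installer's code: `serviceState`, `check`, `firstProblem`, and the message wording are hypothetical stand-ins, and the state map only approximates the bundle above (node-image-pull failing, release-image never started).

```go
package main

import "fmt"

// serviceState is a hypothetical, simplified view of what an analyzer
// might learn about one bootstrap service from a log bundle.
type serviceState struct {
	started bool // the unit was started at least once
	failed  bool // the unit is known to have failed
}

// check names a service to evaluate; slice order is evaluation order.
type check struct {
	name     string
	optional bool
}

// firstProblem walks the checks in order and reports only the first
// service that did not start or that failed. Later checks are not reached.
func firstProblem(checks []check, states map[string]serviceState) string {
	for _, c := range checks {
		s, ok := states[c.name]
		if !ok || !s.started {
			if c.optional {
				continue
			}
			return fmt.Sprintf("%s.service did not execute", c.name)
		}
		if s.failed {
			return fmt.Sprintf("%s.service failed", c.name)
		}
	}
	return ""
}

func main() {
	// Only node-image-pull has any recorded state, and it is failing.
	states := map[string]serviceState{
		"node-image-pull": {started: true, failed: true},
	}
	releaseFirst := []check{{name: "release-image"}, {name: "node-image-pull"}, {name: "bootkube"}}
	pullFirst := []check{{name: "node-image-pull"}, {name: "release-image"}, {name: "bootkube"}}

	fmt.Println(firstProblem(releaseFirst, states)) // reports release-image
	fmt.Println(firstProblem(pullFirst, states))    // reports node-image-pull
}
```

Because the walk stops at the first problem, listing release-image first hides the node-image-pull failure behind a "did not execute" message, which matches the two analyze outputs above.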

Member

I noticed the empty INFO line, which is supposed to print the last 3 lines of the service logs but here prints nothing. It seems that the node-image-pull service loops on the bootstrap machine and never exits, so its error is never captured.

while ! ostree container image pull --authfile "/root/.docker/config.json" \
    "${ostree_repo}" ostree-unverified-image:docker://"${COREOS_IMAGE}"; do
    echo 'Failed to fetch release image; retrying...'
    sleep 10
done

$ systemctl status node-image-pull
● node-image-pull.service - Node Image Pull
     Loaded: loaded (/etc/systemd/system/node-image-pull.service; static)
     Active: activating (start) since Mon 2025-10-27 19:47:56 UTC; 1h 9min ago
    Process: 1943 ExecStartPre=chcon --reference=/usr/bin/ostree /usr/local/bin/node-image-pull.sh (code=exited, status=0/SUCCESS)
   Main PID: 1949 (node-image-pull)
      Tasks: 2 (limit: 99952)
     Memory: 608.0M
        CPU: 1min 10.703s
     CGroup: /system.slice/node-image-pull.service
             ├─1949 /bin/bash /usr/local/bin/node-image-pull.sh
             └─7897 sleep 10

Oct 27 20:57:05 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 20:57:15 ip-10-0-160-222 ostree-containe[7814]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bf>
Oct 27 20:57:15 ip-10-0-160-222 node-image-pull.sh[7814]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest s>
Oct 27 20:57:15 ip-10-0-160-222 node-image-pull.sh[1949]: Failed to fetch release image; retrying...
Oct 27 20:57:25 ip-10-0-160-222 ostree-containe[7826]: Fetching ostree-unverified-registry:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c7ba2a9638c369c24f9d564f9bf>
Oct 27 20:57:26 ip-10-0-160-222 node-image-pull.sh[7826]: error: Creating importer: failed to invoke method OpenImage: failed to invoke method OpenImage: reading manifest s>

Member

I think we can also improve the UX a bit by checking whether the error message is present. If it is not, we can direct the user to the log file. That seems like the simplest way. WDYT @patrickdillon?

func (a analysis) logLastError() {
	for _, l := range strings.Split(a.lastError, "\n") {
		logrus.Info(l)
	}
}
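
A rough sketch of that fallback, assuming a hypothetical helper `lastErrorLines` (not part of the installer) and the bootstrap/journals/<service>.log path mentioned earlier in this thread. It returns the lines to log rather than calling logrus, so it stays self-contained:

```go
package main

import (
	"fmt"
	"strings"
)

// lastErrorLines returns the lines to show for a failed service.
// If no error text was captured (for example because the service is
// still retrying and never exited), it points the user at the journal
// file instead of printing an empty line.
// ASSUMPTION: journals live at bootstrap/journals/<service>.log.
func lastErrorLines(service, lastError string) []string {
	if strings.TrimSpace(lastError) == "" {
		return []string{fmt.Sprintf(
			"No error output was captured; see bootstrap/journals/%s.log in the log bundle",
			service)}
	}
	return strings.Split(strings.TrimRight(lastError, "\n"), "\n")
}

func main() {
	// Empty lastError, as in the node-image-pull case above.
	for _, l := range lastErrorLines("node-image-pull", "") {
		fmt.Println(l)
	}
}
```

In the real code, logLastError could apply the same check before looping over a.lastError; the exact wording of the hint is open for discussion.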

@openshift-ci
Contributor

openshift-ci bot commented Oct 27, 2025

@patrickdillon: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-vsphere-ovn-multi-network | 50c0261 | link | false | /test e2e-vsphere-ovn-multi-network |
| ci/prow/aro-unit | 50c0261 | link | true | /test aro-unit |
| ci/prow/okd-scos-e2e-aws-ovn | 250b7ff | link | false | /test okd-scos-e2e-aws-ovn |

@gpei
Contributor

gpei commented Oct 28, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. and removed jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Oct 28, 2025
@openshift-ci-robot
Contributor

@gpei: This pull request references Jira Issue OCPBUGS-56876, which is invalid:

  • expected the bug to target either version "4.21." or "openshift-4.21.", but it targets "4.20.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

@gpei
Contributor

gpei commented Oct 28, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 28, 2025
@openshift-ci-robot
Contributor

@gpei: This pull request references Jira Issue OCPBUGS-56876, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jinyunma

@openshift-ci openshift-ci bot requested a review from jinyunma October 28, 2025 03:34
Labels

jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
