@petr-muller
Member

This is a cleaned-up version of #30100, which it supersedes.

The PR adds a programmatic model of oc adm upgrade status outputs, based on
something that vaguely resembles a recursive descent parser. The model is
similar to the page object pattern used in web application testing:
instead of checking the raw output directly, tests interact with
a programmatic model of that output.

Every successfully captured output snapshot is parsed into a programmatic model,
and the parsing itself serves as a test (any valid output should be possible to model).

These models are then checked by four new tests:

  • Test for control plane section content
  • Test for worker section content
  • Test for health section content
  • Test for consistent update lifecycle reporting over time
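
To make the parsing idea concrete, here is a minimal sketch in the same spirit, not the code from this PR. It assumes a simplified, made-up layout (a `= Name =` header followed by `Key: value` lines, with sections separated by blank lines); the `section` and `parser` types and all method names are hypothetical. Each grammar rule gets its own method, and any input that does not fit the rules is reported as an error, which is what lets parsing double as a layout test.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strings"
)

// section is a hypothetical model of one "= Name =" block.
type section struct {
	Name   string
	Fields map[string]string
}

// parser walks the output line by line; each parse* method
// consumes the lines that belong to one grammar rule.
type parser struct {
	lines []string
	pos   int
}

func newParser(out string) *parser {
	var lines []string
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return &parser{lines: lines}
}

// peek returns the current line without consuming it.
func (p *parser) peek() (string, bool) {
	if p.pos >= len(p.lines) {
		return "", false
	}
	return p.lines[p.pos], true
}

// parseSection: a header line, then "Key: value" lines until a
// blank line or the end of input.
func (p *parser) parseSection() (section, error) {
	line, ok := p.peek()
	if !ok || !strings.HasPrefix(line, "= ") || !strings.HasSuffix(line, " =") {
		return section{}, fmt.Errorf("line %d: expected section header, got %q", p.pos+1, line)
	}
	p.pos++
	s := section{Name: strings.Trim(line, "= "), Fields: map[string]string{}}
	for {
		line, ok := p.peek()
		if !ok || line == "" {
			return s, nil
		}
		key, value, found := strings.Cut(line, ": ")
		if !found {
			return s, fmt.Errorf("line %d: expected \"Key: value\", got %q", p.pos+1, line)
		}
		s.Fields[key] = value
		p.pos++
	}
}

// parseAll: zero or more blank lines between sections, at least one section.
func (p *parser) parseAll() ([]section, error) {
	var sections []section
	for {
		for line, ok := p.peek(); ok && line == ""; line, ok = p.peek() {
			p.pos++ // skip blank separator lines
		}
		if _, ok := p.peek(); !ok {
			break
		}
		s, err := p.parseSection()
		if err != nil {
			return nil, err
		}
		sections = append(sections, s)
	}
	if len(sections) == 0 {
		return nil, errors.New("no sections found")
	}
	return sections, nil
}

func main() {
	// Made-up sample input in the simplified layout described above.
	out := "= Control Plane =\nAssessment: Progressing\n\n= Update Health =\nMessage: all good\n"
	sections, err := newParser(out).parseAll()
	if err != nil {
		fmt.Println("output does not match the expected layout:", err)
		return
	}
	for _, s := range sections {
		fmt.Println(s.Name, s.Fields)
	}
}
```

Tests can then assert on fields of the model instead of matching substrings of the raw text, which is the page-object idea described above.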

hongkailiu and others added 10 commits August 13, 2025 13:21
Some tests will want to walk the snapshots in a timewise order so it is more practical to maintain them in a slice.
The output with the most information is more useful than the one with less.
…tputs

This is more complicated than I wanted but here we are. It is something like a recursive descent parser of `oc adm upgrade status` outputs, parsing into a programmatic model that the tests can interact with (similar to how page objects work in web application testing).
This adds a test that builds the programmatic model ("page object") for each successfully collected `oc adm upgrade status` output. This serves as a basic layout test (anything that cannot be parsed into a model is likely a bad output) and also as a foundation for further tests, which can base their checks on the model instead of depending on the textual output.
Checks the control plane section using the programmatic model.
Checks the worker section using the programmatic model.
Checks the health section using the programmatic model.
Add a test that the reported cluster update state is consistent across all snapshots and goes through the expected update stages
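
As a rough illustration of what a lifecycle check over time-ordered snapshots can look like, here is a minimal sketch. It is not the test from this PR: the `stage` type, its three values, and the rule that the stage must never move backwards are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// stage is a hypothetical coarse classification of one snapshot.
type stage int

const (
	beforeUpdate stage = iota
	updating
	afterUpdate
)

func (s stage) String() string {
	return [...]string{"beforeUpdate", "updating", "afterUpdate"}[s]
}

// checkLifecycle walks the snapshots in time order and fails if the
// reported stage ever moves backwards, or if no snapshot ever showed
// the update in progress.
func checkLifecycle(snapshots []stage) error {
	sawUpdating := false
	for i, s := range snapshots {
		if i > 0 && s < snapshots[i-1] {
			return fmt.Errorf("snapshot %d: stage went from %s back to %s", i, snapshots[i-1], s)
		}
		if s == updating {
			sawUpdating = true
		}
	}
	if !sawUpdating {
		return errors.New("no snapshot observed the cluster updating")
	}
	return nil
}

func main() {
	good := []stage{beforeUpdate, updating, updating, afterUpdate}
	bad := []stage{beforeUpdate, updating, beforeUpdate}
	fmt.Println(checkLifecycle(good))
	fmt.Println(checkLifecycle(bad))
}
```

A walk like this needs the snapshots in time order, which is why keeping them in a slice is more practical than keeping them in a map.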
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Aug 13, 2025
@openshift-ci-robot

openshift-ci-robot commented Aug 13, 2025

@petr-muller: This pull request references OTA-1580 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.20.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@petr-muller petr-muller changed the title OTA-1580: Further tests for oc adm upgrade status OTA-1580: Further tests for oc adm upgrade status Aug 13, 2025
@petr-muller
Member Author

/cc @wking @hongkailiu

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 13, 2025
@petr-muller
Member Author

/test ?

@openshift-ci
Contributor

openshift-ci bot commented Aug 13, 2025

@petr-muller: The following commands are available to trigger required jobs:

/test e2e-aws-jenkins
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-image-registry
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-ovn
/test e2e-gcp-ovn-builds
/test e2e-gcp-ovn-image-ecosystem
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi
/test images
/test lint
/test okd-scos-images
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-agnostic-ovn-cmd
/test e2e-aws-csi
/test e2e-aws-disruptive
/test e2e-aws-etcd-certrotation
/test e2e-aws-etcd-recovery
/test e2e-aws-ovn
/test e2e-aws-ovn-cgroupsv2
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-kube-apiserver-rollout
/test e2e-aws-ovn-kubevirt
/test e2e-aws-ovn-serial-ipsec
/test e2e-aws-ovn-serial-publicnet-1of2
/test e2e-aws-ovn-serial-publicnet-2of2
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-single-node-techpreview
/test e2e-aws-ovn-single-node-techpreview-serial
/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-upgrade-rollback
/test e2e-aws-ovn-upi
/test e2e-aws-proxy
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-azure-ovn-upgrade
/test e2e-baremetalds-kubevirt
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-external-vsphere-ccm
/test e2e-gcp-csi
/test e2e-gcp-disruptive
/test e2e-gcp-fips-serial-1of2
/test e2e-gcp-fips-serial-2of2
/test e2e-gcp-ovn-etcd-scaling
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-techpreview-serial-1of2
/test e2e-gcp-ovn-techpreview-serial-2of2
/test e2e-gcp-ovn-usernamespace
/test e2e-hypershift-conformance
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-bgp-virt-dualstack
/test e2e-metal-ipi-ovn-bgp-virt-dualstack-techpreview
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-dualstack-bgp
/test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
/test e2e-metal-ipi-ovn-dualstack-local-gateway
/test e2e-metal-ipi-ovn-kube-apiserver-rollout
/test e2e-metal-ipi-serial-1of2
/test e2e-metal-ipi-serial-2of2
/test e2e-metal-ipi-serial-ovn-ipv6-1of2
/test e2e-metal-ipi-serial-ovn-ipv6-2of2
/test e2e-metal-ipi-virtualmedia
/test e2e-metal-ovn-single-node-live-iso
/test e2e-metal-ovn-single-node-with-worker-live-iso
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-openstack-ovn
/test e2e-openstack-serial
/test e2e-vsphere-ovn-dualstack-primaryv6
/test e2e-vsphere-ovn-etcd-scaling
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-origin-main-e2e-agnostic-ovn-cmd
pull-ci-openshift-origin-main-e2e-aws-csi
pull-ci-openshift-origin-main-e2e-aws-disruptive
pull-ci-openshift-origin-main-e2e-aws-ovn
pull-ci-openshift-origin-main-e2e-aws-ovn-cgroupsv2
pull-ci-openshift-origin-main-e2e-aws-ovn-edge-zones
pull-ci-openshift-origin-main-e2e-aws-ovn-fips
pull-ci-openshift-origin-main-e2e-aws-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-serial
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade
pull-ci-openshift-origin-main-e2e-aws-ovn-upgrade
pull-ci-openshift-origin-main-e2e-aws-proxy
pull-ci-openshift-origin-main-e2e-azure
pull-ci-openshift-origin-main-e2e-gcp-csi
pull-ci-openshift-origin-main-e2e-gcp-ovn
pull-ci-openshift-origin-main-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-origin-main-e2e-gcp-ovn-techpreview
pull-ci-openshift-origin-main-e2e-gcp-ovn-techpreview-serial-1of2
pull-ci-openshift-origin-main-e2e-gcp-ovn-techpreview-serial-2of2
pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade
pull-ci-openshift-origin-main-e2e-hypershift-conformance
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-local-gateway
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-1of2
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-2of2
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-1of2
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-2of2
pull-ci-openshift-origin-main-e2e-metal-ipi-virtualmedia
pull-ci-openshift-origin-main-e2e-openstack-ovn
pull-ci-openshift-origin-main-e2e-vsphere-ovn
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi
pull-ci-openshift-origin-main-images
pull-ci-openshift-origin-main-lint
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-origin-main-okd-scos-images
pull-ci-openshift-origin-main-unit
pull-ci-openshift-origin-main-verify
pull-ci-openshift-origin-main-verify-deps
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@petr-muller
Member Author

/test e2e-azure-ovn-upgrade

`oc adm upgrade status` renders operator messages that contain line breaks poorly; we can tolerate this for now but will fix it in the future
Fixed a typo in a condition: for "nodes are not updated" we need to test `!cp.NodesUpdated`
MCO churn sometimes briefly tricks our code into thinking the cluster is updating, which we need to tolerate for now
Member

@hongkailiu hongkailiu left a comment

There are still lots of details I need to catch up on and understand.
But please do not block on my review comments: they are mainly questions I collected while reading the code.

  var total int
- for when, observed := range w.ocAdmUpgradeStatus {
+ for _, snap := range w.ocAdmUpgradeStatus {
      total++
Member

nit:
(this is code introduced by me:)
We can use len(w.ocAdmUpgradeStatus) (which works for both map and slice) instead of counting the elements.
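
A minimal sketch of that point, using a hypothetical `snapshot` type in place of the monitor's real data: the built-in `len` gives the same count as a manual loop, for a slice and for a map alike.

```go
package main

import (
	"fmt"
	"time"
)

// snapshot is a hypothetical stand-in for one captured output.
type snapshot struct {
	when time.Time
	out  string
}

// countByLoop counts elements the long way, as the loop under review does.
func countByLoop(snaps []snapshot) int {
	var total int
	for range snaps {
		total++
	}
	return total
}

func main() {
	asSlice := []snapshot{
		{when: time.Unix(1, 0), out: "first"},
		{when: time.Unix(2, 0), out: "second"},
	}
	asMap := map[time.Time]string{
		time.Unix(1, 0): "first",
		time.Unix(2, 0): "second",
	}

	// len works on both container kinds without iterating in user code.
	fmt.Println(countByLoop(asSlice), len(asSlice), len(asMap))
}
```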

"strings"
)

type ControlPlaneStatus struct {
Member

nit:
ControlPlaneStatus, WorkersStatus, and Health could be unexported, since they are unlikely to be used outside the admupgradestatus package.

var getMessage func() (string, error)
if strings.HasPrefix(line, "Message: ") {
    getMessage = p.parseHealthMessage
    health.Detailed = true
Member

Isn't health.Detailed always true?
The status command we execute in the test uses --details=all.

}

if total == 0 {
noFailures.SkipMessage = &junitapi.SkipMessage{
Member

Should we Fail or Skip here?
Do we have a case in CI that justifies total==0?


  // Zero failures is too strict for at least SNO clusters
- p := (len(failures) / total) * 100
+ p := (float32(len(failures)) / float32(total)) * 100
Member

Thanks for catching this. My bad.
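
For readers skimming the change above: in Go, dividing two `int` values truncates toward zero, so a failure ratio below 1 becomes 0 before it is multiplied by 100. A minimal sketch with hypothetical helper names shows the difference.

```go
package main

import "fmt"

// percentInt mirrors the original expression: integer division truncates,
// so any ratio below 1 yields 0.
func percentInt(failures, total int) int {
	return (failures / total) * 100
}

// percentFloat mirrors the fixed expression: convert before dividing.
func percentFloat(failures, total int) float32 {
	return (float32(failures) / float32(total)) * 100
}

func main() {
	fmt.Println(percentInt(1, 3))            // prints 0
	fmt.Printf("%.1f\n", percentFloat(1, 3)) // prints 33.3
}
```

With the integer version, a tolerance threshold expressed as a percentage can only ever be exceeded when every snapshot fails.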

if err != nil {
    return false, fmt.Errorf("failed to get cluster version: %w", err)
}
return len(cv.Status.History) > len(w.initialClusterVersion.Status.History), nil
Member

This relies on cv.Status.History (read when the collection is done) to tell whether the test is an upgrade test.
If an upgrade test fails to refresh cv.Status.History for any reason, the test result might be misleading.

I understand that my way might be even worse.
28dc69b#diff-840f994ffd52dd53189c8e78b470a8c93a1d6a7cbaf7eac9a5e83c5e16deec7cR74-R77

Ideally, the framework should tell us if a test is doing a cluster upgrade or not.

// and we do not need to skip
expectedLayout.SkipMessage = nil

if observed.out == "" {
Member

I feel this should cause the parser to error out, so that it falls into the observed.err != nil case.

@openshift-trt

openshift-trt bot commented Aug 14, 2025

Job Failure Risk Analysis for sha: 5db0df8

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-cgroupsv2 Medium
[sig-instrumentation] Metrics should grab all metrics from kubelet /metrics/resource endpoint [Suite:openshift/conformance/parallel] [Suite:k8s]
This test has passed 96.52% of 2071 runs on release 4.20 [Overall] in the last week.

Open Bugs
e2e-aws-ovn-edge-zones is unstable
Kubelet metrics endpoint test regressed
pull-ci-openshift-origin-main-e2e-hypershift-conformance Medium
[sig-sippy] infrastructure should work
This test has passed 87.53% of 3738 runs on release 4.20 [Overall] in the last week.

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New tests seen in this PR at sha: 5db0df8

  • "[sig-cli][OCPFeatureGate:UpgradeStatus] oc adm upgrade status control plane section is consistent" [Total: 33, Pass: 33, Fail: 0, Flake: 0]
  • "[sig-cli][OCPFeatureGate:UpgradeStatus] oc adm upgrade status health section is consistent" [Total: 33, Pass: 33, Fail: 0, Flake: 0]
  • "[sig-cli][OCPFeatureGate:UpgradeStatus] oc adm upgrade status output has expected layout" [Total: 33, Pass: 33, Fail: 0, Flake: 0]
  • "[sig-cli][OCPFeatureGate:UpgradeStatus] oc adm upgrade status snapshots reflect the cluster upgrade lifecycle" [Total: 33, Pass: 33, Fail: 0, Flake: 0]
  • "[sig-cli][OCPFeatureGate:UpgradeStatus] oc adm upgrade status workers section is consistent" [Total: 33, Pass: 33, Fail: 0, Flake: 0]

Member

@wking wking left a comment

Job analysis has everything passing 100%.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 14, 2025
@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member

wking commented Aug 14, 2025

In case any of the failed jobs are blockers:

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2025

@petr-muller: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp-ovn-techpreview-serial-2of2 | 5db0df8 | link | false | /test e2e-gcp-ovn-techpreview-serial-2of2 |
| ci/prow/okd-scos-e2e-aws-ovn | 5db0df8 | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-gcp-ovn-techpreview | 5db0df8 | link | false | /test e2e-gcp-ovn-techpreview |
| ci/prow/e2e-aws-ovn-edge-zones | 5db0df8 | link | false | /test e2e-aws-ovn-edge-zones |
| ci/prow/e2e-aws-ovn-single-node | 5db0df8 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-hypershift-conformance | 5db0df8 | link | false | /test e2e-hypershift-conformance |
| ci/prow/e2e-metal-ipi-ovn-dualstack | 5db0df8 | link | false | /test e2e-metal-ipi-ovn-dualstack |
| ci/prow/e2e-metal-ipi-virtualmedia | 5db0df8 | link | false | /test e2e-metal-ipi-virtualmedia |
| ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway | 5db0df8 | link | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway |
| ci/prow/e2e-aws-ovn-cgroupsv2 | 5db0df8 | link | false | /test e2e-aws-ovn-cgroupsv2 |
| ci/prow/e2e-aws-disruptive | 5db0df8 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | 5db0df8 | link | false | /test e2e-metal-ipi-ovn-kube-apiserver-rollout |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 5db0df8 | link | false | /test e2e-aws-ovn-single-node-upgrade |


@openshift-merge-bot openshift-merge-bot bot merged commit 2557f36 into openshift:main Aug 14, 2025
34 of 47 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.20.0-202508140915.p0.g2557f36.assembly.stream.el9.
All builds following this will include this PR.

@petr-muller petr-muller deleted the ota-1580-03-all-tests branch August 14, 2025 16:32