@gbenhaim
Member

  • Add a test for metal ipi upgrade
  • Extend "baremetalds-e2e-test-commands.sh" to run the upgrade tests if
    a destination cluster image is specified using
    "OPENSHIFT_UPGRADE_RELEASE_IMAGE_OVERRIDE".

Signed-off-by: Eran Israeli [email protected]
Signed-off-by: gbenhaim [email protected]

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 25, 2020
@openshift-ci-robot
Contributor

Hi @gbenhaim. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 25, 2020
@yuvalturg
Contributor

/ok-to-test

@openshift-ci-robot openshift-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 25, 2020
@danielBelenky

/retest

@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch 4 times, most recently from c71a06e to f689350 Compare October 25, 2020 15:53
Contributor

It would probably be better to use OPENSHIFT_UPGRADE_RELEASE_IMAGE; see the changes currently in progress in #12959.

Member Author

Fixed

Contributor

The check is OK. Still, where possible I'd suggest following an approach similar to the one in #12959 (to stay as close as possible to what the other profiles do), i.e. checking directly against OPENSHIFT_UPGRADE_RELEASE_IMAGE as in https://github.com/openshift/release/blob/d4f03b5848e88d409f6b0cac7ec205f47530e07b/ci-operator/step-registry/openshift/e2e/test/openshift-e2e-test-commands.sh#L75 (we'll save one variable).

Member Author

I thought about this approach. The problem is that OPENSHIFT_UPGRADE_RELEASE_IMAGE will always be non-empty, since the step specifies it as a dependency, so I needed a new variable.

Where does the TEST_COMMAND variable come from? I didn't see it in any of the templates.
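
The point above can be illustrated with a minimal sketch. It assumes a hypothetical flag variable named RUN_UPGRADE_TEST (that name is not taken from the PR): because OPENSHIFT_UPGRADE_RELEASE_IMAGE is populated whenever the step declares it as a dependency, a non-empty check on it cannot tell upgrade jobs from regular ones, so the decision has to rest on a separate variable.

```shell
# Sketch only: RUN_UPGRADE_TEST is a hypothetical flag name, not the PR's variable.

# Prints "upgrade" or "conformance" depending on the explicit flag.
select_test_mode() {
    # OPENSHIFT_UPGRADE_RELEASE_IMAGE is always set when it is declared as a
    # step dependency, so it is deliberately NOT used to make the decision.
    if [ "${RUN_UPGRADE_TEST:-false}" = "true" ]; then
        echo "upgrade"
    else
        echo "conformance"
    fi
}
```

With this shape, a regular job that merely inherits the image variable still runs the conformance path.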

Contributor

Yes, I think that's a sanity check (it is always expected to be non-empty). They've defined TEST_COMMAND as a normal env var in the openshift-e2e-test step.
In our case we have the additional requirement to filter out some of the conformance tests (in the case of the installation test). Checking the current value of TEST_COMMAND could help in properly crafting the various pieces required by openshift-tests (the command itself, additional options such as --to-image, and the filter list).
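
A rough sketch of what "crafting the pieces" from TEST_COMMAND could look like is given below. The helper name build_test_command and the accepted values are illustrative assumptions, not the actual step code; the command only gets printed, never executed.

```shell
# Sketch only: build_test_command and the accepted TEST_COMMAND values are
# illustrative assumptions, not the code of the actual step.
# Usage: build_test_command <test_command> <to_image>
build_test_command() {
    test_command="$1"   # e.g. "run" or "run-upgrade"
    to_image="$2"       # target release image, used only for upgrades
    case "${test_command}" in
        run-upgrade)
            echo "openshift-tests run-upgrade all --to-image ${to_image}"
            ;;
        run)
            # "-f -" makes openshift-tests read the (filtered) test list from stdin
            echo "openshift-tests run -f -"
            ;;
        *)
            echo "unsupported TEST_COMMAND: ${test_command}" >&2
            return 1
            ;;
    esac
}
```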

Member Author

Currently, https://github.com/openshift/release/pull/13120/files#diff-7b4946fc316154f2f0fe2beac2371cf82bc14d8071298890e8a014417e88159f doesn't use the TEST_COMMAND variable. If we want to use it, I would ask that we do it in a separate PR.

Contributor

I'd prefer to adopt it in the current PR if possible, so that the code duplication is also removed. If not, it's OK to address it in another PR.

@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch from f689350 to 8bed02b Compare October 26, 2020 10:59
@gbenhaim gbenhaim changed the title WIP: Test metal ipi upgrade Test metal ipi upgrade Oct 26, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2020
@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch 3 times, most recently from a0661e7 to 2217245 Compare October 26, 2020 16:40
Contributor

This can be removed here

Member Author

fixed

@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch from 2217245 to 107cfd7 Compare October 26, 2020 16:50
@andfasano
Contributor

/test pj-rehearse

1 similar comment
@gbenhaim
Member Author

/test pj-rehearse

@gbenhaim
Member Author

The upgrade test fails with a timeout:

2020/10/26 23:26:04 Executing pod "e2e-metal-ipi-upgrade-baremetalds-e2e-test"
2020/10/26 23:26:07 Container cp-secret-wrapper in pod e2e-metal-ipi-upgrade-baremetalds-e2e-test completed successfully
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2020-10-27T02:16:39Z"}

@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch 2 times, most recently from c0d6a94 to 1d73149 Compare October 27, 2020 13:54
@gbenhaim
Member Author

Failed on an unrelated issue

error: some steps failed:
  * could not run steps: step e2e-metal-ipi-upgrade failed: "e2e-metal-ipi-upgrade" pre steps failed: "e2e-metal-ipi-upgrade" pod "e2e-metal-ipi-upgrade-baremetalds-devscripts-setup" failed: the pod ci-op-1lp7bydh/e2e-metal-ipi-upgrade-baremetalds-devscripts-setup failed after 55m25s (failed containers: test): ContainerFailed one or more containers exited
Container test exited with code 1, reason Error
---
el=debug msg="Gather remote logs"
[skipped 18 lines]
level=debug msg="Gathering master journals ..."
level=debug msg="Gathering master containers ..."
level=debug msg="Waiting for logs ..."
level=debug msg="Log bundle written to /var/home/core/log-bundle-20201027150400.tar.gz"
level=info msg="Bootstrap gather logs captured here \"/root/dev-scripts/ocp/ostest/log-bundle-20201027150400.tar.gz\""
level=fatal msg="Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition"
+(utils.sh:1): create_cluster(): removetmp
+(utils.sh:487): removetmp(): '[' -n '' ']'
+(utils.sh:487): removetmp(): true
make: *** [Makefile:27: ocp_run] Error 1
dev-scripts setup completed, fetching logs
Warning: Permanently added '139.178.69.183' (ECDSA) to the list of known hosts.
tar: Removing leading `/' from member names
error: failed to execute wrapped command: exit status 2
---
Link to step on registry info site: https://steps.ci.openshift.org/reference/baremetalds-devscripts-setup
Link to job on registry info site: https://steps.ci.openshift.org/job?org=openshift-metal3&repo=dev-scripts&branch=master&test=e2e-metal-ipi-upgrade
time="2020-10-27T15:06:47Z" level=info msg="Reporting job state 'failed' with reason 'executing_graph:step_failed:utilizing_lease:executing_test:executing_multi_stage_test'"

@gbenhaim
Member Author

/retest ci/rehearse/openshift-metal3/dev-scripts/master/e2e-metal-ipi-upgrade

@gbenhaim
Member Author

gbenhaim commented Nov 8, 2020

/test pj-rehearse

@gbenhaim
Member Author

gbenhaim commented Nov 9, 2020

/test pj-rehearse

1 similar comment
@gbenhaim
Member Author

/test pj-rehearse

@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch from 661f560 to f0e0601 Compare November 10, 2020 11:42
@gbenhaim
Member Author

@andfasano the test runs and fails on:


2/74 Tests Failed.

[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] (1h11m37s)
fail [github.com/openshift/origin/test/e2e/upgrade/service/service.go:85]: Unexpected error:
    <*errors.errorString | 0xc00246d730>: {
        s: "timed out waiting for service \"service-test\" to have a load balancer",
    }
    timed out waiting for service "service-test" to have a load balancer
occurred

can you please take a look?

@andfasano
Contributor

andfasano commented Nov 11, 2020

Hi @gbenhaim, it seems that this test is failing because the baremetal platform currently does not support the LoadBalancer service type (see [1]). Can you please try to filter it out? You can follow an approach similar to the one adopted for the installation e2e test.

[1] openshift/enhancements#356

cc @hardys

@gbenhaim
Member Author

gbenhaim commented Nov 11, 2020

@andfasano
Contributor

@andfasano from looking at the code, I can see that it's not possible to filter this test:
https://github.com/openshift/openshift-tests/blob/release-4.6/cmd/extended-platform-tests/upgrade.go#L46

You can also see that the tests that shouldn't run were commented out: https://github.com/openshift/openshift-tests/blob/release-4.6/test/e2e/upgrade/upgrade.go#L32

(IIRC the code now lives in https://github.com/openshift/origin/blob/8251e84a8208e949712d34185e71b09392c561d1/test/e2e/upgrade/upgrade.go#L36)

Do you mean that the approach I reported in my previous comment doesn't work for the upgrade?

ssh "${SSHOPTS[@]}" "root@${IP}" openshift-tests run "openshift/conformance/parallel" --dry-run | grep 'Feature:ProjectAPI' | openshift-tests run -o /tmp/artifacts/e2e.log --junit-dir /tmp/artifacts/junit -f -

(Note: there's also some documentation about it here: https://github.com/openshift/origin/blob/01b8d264e7f5e33b01b70ac602d85b8b64102b3c/test/extended/README.md#running-tests)

It could also be worth trying the platform suite instead of the all one.
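
The dry-run/filter/run pipeline quoted above selects tests with grep; the same shape can exclude tests instead. The sketch below assumes a hypothetical helper named exclude_tests and an illustrative pattern. Whether this kind of filtering applies to the upgrade suite at all is exactly the open question in this thread.

```shell
# Sketch only: exclude_tests and the 'LoadBalancer' pattern are illustrative
# assumptions, not part of the PR.
# Reads test names on stdin and drops the ones matching an extended regex.
exclude_tests() {
    grep -v -E "$1"
}

# Intended use (not executed here):
#   openshift-tests run "openshift/conformance/parallel" --dry-run \
#     | exclude_tests 'LoadBalancer' \
#     | openshift-tests run -o /tmp/artifacts/e2e.log --junit-dir /tmp/artifacts/junit -f -
```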

eisraeli and others added 5 commits November 17, 2020 13:34
- Add a test for metal ipi upgrade
- Extend "baremetalds-e2e-test-commands.sh" to run only the upgrade tests if
  the workflow requested it.
- Choose the cluster images based on the type of the job. For regular
  jobs the cluster will use release:latest. For upgrade jobs the
  cluster will use release:initial and will be upgraded to
  release:latest.

Signed-off-by: Eran Israeli <[email protected]>
Signed-off-by: gbenhaim <[email protected]>

We would like to collect the logs from the remote
server also when the script is terminated by a signal.

Signed-off-by: gbenhaim <[email protected]>

Because of a bug in "ci-operator", logs are not collected when the
job is canceled because of a timeout [1].

To work around this issue, add an internal timeout to the upgrade
command.

[1] openshift/ci-tools#1340

Signed-off-by: gbenhaim <[email protected]>

Upgrades are not supported for IPv6-only clusters, since quay.io
and api.openshift.com can't serve IPv6 requests.

Also, add the env var that marks the test as an upgrade test to
the ci-operator config, since the env map in it overrides the env
map in the workflow.

Signed-off-by: gbenhaim <[email protected]>

The entire test suite contains tests that are not
suitable for BM deployment.

Signed-off-by: gbenhaim <[email protected]>
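
One common way to get the behavior described in the second commit message above (collecting logs even when the script is terminated by a signal) is an EXIT trap combined with signal traps that convert the signal into a normal exit. The sketch below is an assumption about the general technique, not the PR's code; collect_logs and install_log_traps are hypothetical names.

```shell
# Sketch only: collect_logs and install_log_traps are hypothetical names;
# the real step would fetch logs from the remote server here.
collect_logs() {
    echo "collecting logs from remote host"
}

install_log_traps() {
    # Run the collector whenever the shell exits.
    trap collect_logs EXIT
    # Turn INT/TERM into a regular exit so the EXIT trap also fires
    # when the script is signalled (128 + signal number).
    trap 'exit 130' INT
    trap 'exit 143' TERM
}
```

Calling install_log_traps near the top of a script is enough; the original exit status is preserved because the collector itself does not call exit.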
@gbenhaim gbenhaim force-pushed the baremetalds_e2e_upgrade_gal branch from f0e0601 to e56263a Compare November 17, 2020 11:39
@gbenhaim
Member Author

@andfasano the e2e-metal-ipi-ocp-sdn-ipv4-upgrade job passed. I've used the platform tests.

Contributor

@andfasano andfasano left a comment

Just some minor cleanups; the rest looks fine to me, thanks!

echo "### Running Upgrade tests"
# In case of a timeout we don't get the trace log of this script
# Let's verify that the we started the upgrade process
touch "${ARTIFACT_DIR}/upgrade_started"
Contributor

Do we still need it?

Member Author

done

timeout \
--kill-after 10m \
120m \
ssh \
Contributor

If it is no longer required, could this timeout be removed? (The CI default timeout for the job will still apply.)

Member Author

I would like to keep it in case other issues cause the script to get stuck. In that case, it would be good to have the logs.
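
The internal-timeout pattern discussed here can be sketched as a small wrapper around GNU timeout. The helper name run_with_timeout is hypothetical; the 124 exit code is what GNU timeout returns when the command is stopped by the time limit (without --preserve-status).

```shell
# Sketch only: run_with_timeout is a hypothetical helper name.
# Usage: run_with_timeout <duration> <grace> <command...>
run_with_timeout() {
    duration="$1"
    grace="$2"
    shift 2
    rc=0
    # GNU timeout sends TERM after <duration>, then KILL after <grace> more.
    timeout --kill-after "${grace}" "${duration}" "$@" || rc=$?
    if [ "${rc}" -eq 124 ]; then
        # The script itself survives, so log collection can still run.
        echo "command timed out after ${duration}" >&2
    fi
    return "${rc}"
}
```

Because the wrapper returns instead of being killed by the outer CI timeout, any log-gathering code that follows it still gets a chance to run.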

Since the logs will be collected even if there is a timeout,
we don't need additional debug code.

Signed-off-by: gbenhaim <[email protected]>
@andfasano
Contributor

/test pj-rehearse

@andfasano
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 19, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, gbenhaim

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit d5d1b68 into openshift:master Nov 19, 2020
@openshift-ci-robot
Contributor

@gbenhaim: Updated the following 3 configmaps:

  • job-config-master configmap in namespace ci at cluster api.ci using the following files:
    • key openshift-metal3-dev-scripts-master-presubmits.yaml using file ci-operator/jobs/openshift-metal3/dev-scripts/openshift-metal3-dev-scripts-master-presubmits.yaml
  • step-registry configmap in namespace ci at cluster api.ci using the following files:
    • key baremetalds-devscripts-setup-commands.sh using file ci-operator/step-registry/baremetalds/devscripts/setup/baremetalds-devscripts-setup-commands.sh
    • key baremetalds-devscripts-setup-ref.yaml using file ci-operator/step-registry/baremetalds/devscripts/setup/baremetalds-devscripts-setup-ref.yaml
    • key baremetalds-e2e-test-commands.sh using file ci-operator/step-registry/baremetalds/e2e/test/baremetalds-e2e-test-commands.sh
    • key baremetalds-e2e-test-ref.yaml using file ci-operator/step-registry/baremetalds/e2e/test/baremetalds-e2e-test-ref.yaml
    • key OWNERS using file ci-operator/step-registry/baremetalds/e2e/upgrade/OWNERS
    • key baremetalds-e2e-upgrade-workflow.metadata.json using file ci-operator/step-registry/baremetalds/e2e/upgrade/baremetalds-e2e-upgrade-workflow.metadata.json
    • key baremetalds-e2e-upgrade-workflow.yaml using file ci-operator/step-registry/baremetalds/e2e/upgrade/baremetalds-e2e-upgrade-workflow.yaml
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key baremetalds-devscripts-setup-commands.sh using file ci-operator/step-registry/baremetalds/devscripts/setup/baremetalds-devscripts-setup-commands.sh
    • key baremetalds-devscripts-setup-ref.yaml using file ci-operator/step-registry/baremetalds/devscripts/setup/baremetalds-devscripts-setup-ref.yaml
    • key baremetalds-e2e-test-commands.sh using file ci-operator/step-registry/baremetalds/e2e/test/baremetalds-e2e-test-commands.sh
    • key baremetalds-e2e-test-ref.yaml using file ci-operator/step-registry/baremetalds/e2e/test/baremetalds-e2e-test-ref.yaml
    • key OWNERS using file ci-operator/step-registry/baremetalds/e2e/upgrade/OWNERS
    • key baremetalds-e2e-upgrade-workflow.metadata.json using file ci-operator/step-registry/baremetalds/e2e/upgrade/baremetalds-e2e-upgrade-workflow.metadata.json
    • key baremetalds-e2e-upgrade-workflow.yaml using file ci-operator/step-registry/baremetalds/e2e/upgrade/baremetalds-e2e-upgrade-workflow.yaml
@openshift-merge-robot
Contributor

@gbenhaim: The following tests failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
-- | -- | -- | --
ci/rehearse/openshift/cluster-baremetal-operator/master/e2e-metal-ipi | 107cfd746c3db9e10f0c6e48a25f4f3c7da6c772 | link | /test pj-rehearse
ci/rehearse/openshift-metal3/dev-scripts/master/e2e-metal-ipi-upgrade | b691bfe6a5f742bfd666c5d7ea61016c16646cd3 | link | /test pj-rehearse
ci/rehearse/openshift-metal3/dev-scripts/master/e2e-metal-ipi-ovn-dualstack-upgrade | a711a4668eabdeff112fc2e20e82fe614bb1accb | link | /test pj-rehearse
ci/rehearse/openshift/cluster-baremetal-operator/release-4.6/e2e-metal-ipi | d951049 | link | /test pj-rehearse
ci/prow/pj-rehearse | d951049 | link | /test pj-rehearse
ci/rehearse/openshift/baremetal-operator/master/e2e-metal-ipi | d951049 | link | /test pj-rehearse
ci/rehearse/openshift-metal3/dev-scripts/master/e2e-metal-ipi-ovn-dualstack | d951049 | link | /test pj-rehearse
ci/rehearse/openshift/machine-config-operator/master/e2e-metal-ipi-ovn-dualstack | d951049 | link | /test pj-rehearse

@stbenjam
Member

Small related follow-up to this: #13769
