Skip to content

Conversation

@jsafrane
Copy link
Contributor

@jsafrane jsafrane commented Jun 24, 2019

Run ssh bastion and set up its env. variables before running openshift-tests in openshift_installer template. Some upstream e2e tests require SSH access to nodes.

Partly fixes https://bugzilla.redhat.com/show_bug.cgi?id=1711600

cc @vrutkovs to double check restore-cluster-state and recover-from-etcd-quorum-loss changes.

Tested with:

./ci-operator \
    --artifact-dir artifacts/ \
    --config openshift-origin-master.yaml \
    --git-ref jsafrane/origin@mount-propagation-test \
    --template cluster-launch-installer-e2e.yaml \
    --target cluster-launch-installer-e2e \
    --secret-dir "$name-cluster-profile/" \
    --namespace $ns

Where jsafrane/origin@mount-propagation-test is this PR: openshift/origin#22966

Mount propagation tests (= which need ssh to nodes) succeed.

@openshift-ci-robot openshift-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 24, 2019
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be removed before the tests start, otherwise e2e test would fail - we're grabbing image from outside of the release image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole point of this PR is to run the tests with the bastion running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in that case we have to use some other images in eparis/ssh-bastion repo. See https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_release/3595/rehearse-3595-pull-ci-openshift-installer-master-e2e-restore-cluster-state/67, test [Feature:Platform][Smoke] Managed cluster should ensure pods use images from our release image with proper ImagePullPolicy [Suite:openshift/conformance/parallel] would fail with fail [github.com/openshift/origin/test/extended/operators/images.go:112]: May 10 09:33:15.658: Pods found with invalid container images not present in release payload: openshift-ssh-bastion/ssh-bastion-f4d5bbcbd-xcx68/ssh-bastion image=quay.io/eparis/ssh:latest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that one is going to fail. I think the test could skip openshift-ssh-bastion namespace if $KUBE_SSH_BASTION is set.

openshift/origin#23252

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Later on we shoud be vendoring the manifests in bindata and run it in ssh-bastion namespace in BeforeAll

Once openshift/origin#23208 merged I could take care of updating this

@vrutkovs
Copy link
Contributor

/hold

The PR looks good, I doubt it would make tests pass though.

Could you add a space at the end of

in a separate commit to trigger a rehearse of restore etcd test? It would be reverted later on

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2019
@jsafrane
Copy link
Contributor Author

Could you add a space at the end of restore-cluster-state?

Not exactly empty space, but done.

@jsafrane jsafrane changed the title Run ssh bastion during openshift-tests WIP: Run ssh bastion during openshift-tests Jun 24, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2019
@jsafrane jsafrane force-pushed the enable-ssh-bastion branch 2 times, most recently from 9de3374 to a92564c Compare June 24, 2019 13:00
@abhinavdahiya
Copy link
Contributor

SSH access to clusters that are up should not be allowed for e2e tests.

the e2e cluster restore are given SSH access because it brings the control plane down.

/cc @crawford @eparis

@jsafrane
Copy link
Contributor Author

I am a bit confused. How ssh access brings control plane down?

There are tests that simply need to access OS on nodes. Upstream has chosen ssh as the access method. We can either disable those tests and leave corresponding parts untested, or, as I am trying, to run the tests with ssh bastion.

Access to nodes is mostly required by [Disruptive] tests that we don't run anyway (*), but there are few marked as [Disabled:Broken] by OpenShift and could run if we have SSH:

`SSH`, // TRIAGE
`should implement service.kubernetes.io/service-proxy-name`, // this is an optional test that requires SSH. sig-network
`should propagate mounts to the host`,                                        // requires SSH, https://bugzilla.redhat.com/show_bug.cgi?id=1711600

*) To be honest, I don't understand why we don't run [Disruptive] tests. Speaking from sig-storage perspective, they check kubelet restart / node reboot cases that have great potential to be buggy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to verify that the project is fully removed before we continue? is it possible that we will leak a loadbalancer otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, oc delete waits for the namespace to be fully deleted.

@derekwaynecarr
Copy link
Member

@abhinavdahiya I am not following your comment, we should make every effort to run upstream tests.

@abhinavdahiya
Copy link
Contributor

With openshift 4 we don't ship with ssh access to the clusters by default. And my understanding was that we stopped tests that required ssh access in the hope that we can move them off needing ssh access.
And this PR turns on ssh for all tests which in my opinion is weakening our position. All tests from now on can use ssh...

@jsafrane jsafrane force-pushed the enable-ssh-bastion branch from a92564c to cab5e61 Compare July 1, 2019 08:08
@jsafrane
Copy link
Contributor Author

jsafrane commented Jul 1, 2019

/retest

@jsafrane
Copy link
Contributor Author

jsafrane commented Jul 1, 2019

I am updating Eric's ssh bastion to have configurable namespace: eparis/ssh-bastion#10

  • SSH access works well (Mount propagation should propagate mounts to the host passes)
  • This one fails reliably, but it looks unrelated to the template change:
    The default cluster RBAC policy should only allow the system:authenticated group to access certain policy rules [Suite:openshift/conformance/parallel]
    
    fail [github.com/openshift/origin/test/extended/authorization/rbac/groups_default_rules.go:196]: Jul  1 13:08:13.102: system:authenticated has extra permissions in namespace "":
    {APIGroups:["console.openshift.io"], Resources:["consoleclidownloads"], Verbs:["get" "list" "watch"]}
    {APIGroups:["console.openshift.io"], Resources:["consolelinks"], Verbs:["get" "list" "watch"]}
    {APIGroups:["console.openshift.io"], Resources:["consolenotifications"], Verbs:["get" "list" "watch"]}
    

@jsafrane jsafrane force-pushed the enable-ssh-bastion branch 3 times, most recently from 358eede to 8676108 Compare July 3, 2019 09:58
@jsafrane jsafrane changed the title WIP: Run ssh bastion during openshift-tests Run ssh bastion during openshift-tests Jul 3, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 3, 2019
@vrutkovs
Copy link
Contributor

vrutkovs commented Jul 3, 2019

/hold cancel
/approve

Looks fine to me

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 3, 2019
@jsafrane
Copy link
Contributor Author

jsafrane commented Jul 3, 2019

Removed WIP patches for rehearsals.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jul 26, 2019

@jsafrane: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/rehearse/openshift/installer/master/e2e-restore-cluster-state e8ecee4669ea87de455e4eb10887edd661b5aaec link /test pj-rehearse

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@jsafrane
Copy link
Contributor Author

/assign @vrutkovs

@jsafrane
Copy link
Contributor Author

jsafrane commented Sep 3, 2019

/test pj-rehearse

Copy link
Contributor

@vrutkovs vrutkovs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vrutkovs
Copy link
Contributor

vrutkovs commented Sep 3, 2019

/lgtm
/hold

waiting for rehearses to pass

@openshift-ci-robot openshift-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Sep 3, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsafrane, vrutkovs
To complete the pull request process, please assign crawford
You can assign the PR to them by writing /assign @crawford in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jsafrane
Copy link
Contributor Author

jsafrane commented Sep 4, 2019

Rehearsal went OK.

/assign @wking @smarterclayton
for approval

@vrutkovs
Copy link
Contributor

vrutkovs commented Sep 4, 2019

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 4, 2019
@smarterclayton
Copy link
Contributor

/hold

I don't want this on for everyone. Let me review.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 4, 2019
queue /tmp/artifacts/must-gather/must-gather.log oc --insecure-skip-tls-verify adm must-gather --dest-dir /tmp/artifacts/must-gather
echo "Removing ssh-bastion ..."
queue /dev/null oc --insecure-skip-tls-verify --request-timeout=5s delete project testing-ssh-bastion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you doing this? Why isn't this being torn down by the cluster tear down?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was requested in #4161 (review)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service load balancers aren't leaked.

@smarterclayton
Copy link
Contributor

Upstream has chosen ssh as the access method

This statement is not true, upstream has been moving away from SSH for a long time. In general, tests that use SSH should not use SSH unless there is no other alternative. Disruptive (most of which won't work on OpenShift anyway because they encode assumptions about the shape of the OS) is the biggest place where that happens, and even most of those can be done via pods.

I would prefer to see the remaining tests moved away from SSH than turning this on globally for all test frameworks. I'm certainly ok with having this option be available for certain test suites (like disruptive) but opt-in is probably better, not on by default.

I think for this PR if you remove the automatic opt-in and leave the function there, which specific tests can use by adding it before the test runs (specifically disruptive, which needs to get enabled).

@jsafrane
Copy link
Contributor Author

jsafrane commented Sep 5, 2019

Upstream has been moving away from SSH for a long time

I haven't noticed any coordinated effort, either as KEP, issue or announced on kubernetes-dev. We (sig-storage) are still adding new tests that use ssh. I don't remember anyone reporting issues for that (with one exception, we use kubernetes-in-docker to tests CSI stuff and ssh does not work there; kubernetes/kubernetes#81751)

I think for this PR if you remove the automatic opt-in and leave the function there, which specific tests can use by adding it before the test runs (specifically disruptive, which needs to get enabled).

Node tests run as part of e2e-aws, I don't think it's a good idea to clutter the tests just for https://bugzilla.redhat.com/show_bug.cgi?id=1711600.

@smarterclayton
Copy link
Contributor

You haven't noticed it but it has repeatedly been enforced at test level. You can't have conformance tests that require SSH. Large sets of tests have had SSH access removed.

@jsafrane
Copy link
Contributor Author

jsafrane commented Sep 5, 2019

it has repeatedly been enforced at test level

Any references? Again, no KEP, no announce, how are we supposed to know about SSH deprecation?

@smarterclayton
Copy link
Contributor

We run almost every test in the suite. A year ago there were a hundred or more that depend on SSH. Now there are 3 or 4 in the core path.

@jsafrane
Copy link
Contributor Author

jsafrane commented Sep 5, 2019

I am not arguing that there tests were not reworked but the communication around. How are we supposed to know about SSH deprecation?

@jsafrane
Copy link
Contributor Author

jsafrane commented Sep 6, 2019

/close
Disruptive tests seem to be the only acceptable usage of SSH bastion and we don't run them, so this PR is useless.

@openshift-ci-robot
Copy link
Contributor

@jsafrane: Closed this PR.

Details

In response to this:

/close
Disruptive tests seem to be the only acceptable usage of SSH bastion and we don't run them, so this PR is useless.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

do
# AWS fills only .hostname of a service
BASTION_HOST=$(oc get service -n "${SSH_BASTION_NAMESPACE}" ssh-bastion -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
if [[ -n "${BASTION_HOST}" ]]; then break; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be in the script, let's get Eric to fix his bastion

@smarterclayton
Copy link
Contributor

We’re about to start running disruptive tests from e2e and that suite does need SSH

@smarterclayton
Copy link
Contributor

We also might end up with an “other” suite for some of the disabled tests - that’s a good spot to include SSH as wel

@smarterclayton
Copy link
Contributor

I'm going to pull this code out and fix it up as part of getting the disruptive suite going.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants