steps/proxy: Port to Fedora CoreOS #11750

cgwalters · 2020-09-10T20:07:42Z

We're currently using RHCOS as a way to run a container image
in a single disposable VM. Let's use FCOS because it's more
oriented towards this use case and also gets us out of needing
to deal with Ignition version dependencies - we can just
unconditionally use spec 3 (which RHCOS also uses in 4.6).

cgwalters · 2020-09-10T20:08:04Z

(Didn't test this locally, we'll see what rehearse says)

ewolinetz · 2020-09-10T20:09:43Z

do we also need to bump here? https://github.com/openshift/release/pull/11750/files#diff-e687af75ecf4bc3dee02258e9703afa7R218

cgwalters · 2020-09-10T20:11:00Z

> do we also need to bump here? https://github.com/openshift/release/pull/11750/files#diff-e687af75ecf4bc3dee02258e9703afa7R218

That link is taking me to the toplevel of the diff - can you elaborate on "here"?

ewolinetz · 2020-09-10T20:30:49Z

we are specifying version 2.1.0 here still, does that need to be bumped?

UserData:
        Fn::Base64: !Sub
        - '{"ignition":{"config":{"replace":{"source":"\${IgnitionLocation}","verification":{}}},"timeouts":{},"version":"2.1.0"},"networkd":{},"passwd":{},"storage":{},"systemd":{}}'
        - {
          IgnitionLocation: !Ref ProxyIgnitionLocation
        }

cgwalters · 2020-09-10T20:33:35Z

> we are specifying version 2.1.0 here still, does that need to be bumped?

It does and done!
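
For reference, a minimal sketch of what the bumped pointer config might look like under spec 3. The `3.0.0` version string and the `pointer_config` helper name are assumptions for illustration; the thread only says "spec 3", and the actual change in this PR may differ. The empty `networkd`, `passwd`, `storage`, and `systemd` stubs from the 2.1.0 blob are left out here, on the understanding that spec 3 no longer has a `networkd` section and does not need the empty ones.

```python
import json


def pointer_config(ignition_location: str, version: str = "3.0.0") -> str:
    """Return a compact Ignition pointer config that replaces itself with a remote config."""
    config = {
        "ignition": {
            # Assumed spec version; adjust to whatever the target image supports.
            "version": version,
            "config": {"replace": {"source": ignition_location}},
        }
    }
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    # "${IgnitionLocation}" is the CloudFormation !Sub placeholder from the template above.
    print(pointer_config("${IgnitionLocation}"))
```

The output is a single-line JSON string, so it can be dropped into the `Fn::Base64: !Sub` entry in place of the 2.1.0 blob.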

cgwalters · 2020-09-11T13:20:06Z

/retest

cgwalters · 2020-09-11T14:03:53Z

rpm-md issues
/retest

ewolinetz · 2020-09-11T20:24:48Z

it looks like the proxy instance still isn't able to come up...
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_release/11750/rehearse-11750-pull-ci-openshift-installer-master-e2e-aws-proxy/1304420641117573120/artifacts/e2e-aws-proxy/gather-aws-console/i-0951c95379a9e8bc5

cgwalters · 2020-09-11T20:48:00Z

OK yeah I'm refactoring this script so I can more easily generate the Ignition outside of Prow and test things.

cgwalters · 2020-09-11T20:58:54Z

OK, running the generated Ignition config appears to work in some quick tests in qemu, but I noticed we were still missing the After=network-online.target, so I pushed that.

We also have docs for this use case of course: https://docs.fedoraproject.org/en-US/fedora-coreos/running-containers/
It'd be better to port this to fcct (inline data etc.) but one step at a time.
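
To make the `After=network-online.target` fix concrete, here is a rough sketch of generating a spec 3 config with a systemd unit that runs a container, in the spirit of the FCOS docs linked above. The unit name `proxy.service`, the image `quay.io/example/proxy:latest`, the `3.0.0` version, and the `podman run` flags are placeholders, not what this PR actually ships.

```python
import json

# Hypothetical unit; the image and name are placeholders, not taken from this PR.
UNIT_TEMPLATE = """[Unit]
Description=Proxy container
# Wait for the network to be up before starting, so the image pull can succeed.
Wants=network-online.target
After=network-online.target

[Service]
ExecStartPre=-/usr/bin/podman rm -f {name}
ExecStart=/usr/bin/podman run --name {name} --net host {image}
Restart=always

[Install]
WantedBy=multi-user.target
"""


def proxy_ignition(image: str, name: str = "proxy") -> dict:
    """Build an Ignition config (as a dict) with one enabled systemd unit."""
    unit = UNIT_TEMPLATE.format(name=name, image=image)
    return {
        "ignition": {"version": "3.0.0"},
        "systemd": {
            "units": [{"name": f"{name}.service", "enabled": True, "contents": unit}]
        },
    }


if __name__ == "__main__":
    print(json.dumps(proxy_ignition("quay.io/example/proxy:latest"), indent=2))
```

Generating the config as a standalone step like this is what makes it easy to boot the result in qemu outside of Prow.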

cgwalters · 2020-09-12T16:19:21Z

OK this time the proxy host definitely came up, and I see access logs in its console.

And I think the problem we're hitting now is that the machine API isn't able to provision workers because of the proxy:

E0912 15:38:22.441079       1 reconciler.go:236] ci-op-v9v01r44-4c51b-npq94-worker-us-east-2b-pzx25: error getting existing instances: RequestError: send request failed
caused by: Post "https://ec2.us-east-2.amazonaws.com/": dial tcp 52.95.18.3:443: i/o timeout
E0912 15:38:22.441108       1 controller.go:272] ci-op-v9v01r44-4c51b-npq94-worker-us-east-2b-pzx25: failed to check if machine exists: RequestError: send request failed
caused by: Post "https://ec2.us-east-2.amazonaws.com/": dial tcp 52.95.18.3:443: i/o timeout
E0912 15:38:22.441173       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="RequestError: send request failed\ncaused by: Post \"https://ec2.us-east-2.amazonaws.com/\": dial tcp 52.95.18.3:443: i/o timeout"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ci-op-v9v01r44-4c51b-npq94-worker-us-east-2b-pzx25"}

So we could probably force merge this and then fix that as a followup?
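
When skimming controller logs like the ones above, a small script can summarize which endpoints are timing out. This is only a sketch: the regex is fitted to the three lines quoted here, and the `timed_out_endpoints` name is made up for illustration.

```python
import re
from collections import Counter

# Matches e.g.: Post "https://ec2.us-east-2.amazonaws.com/": dial tcp 52.95.18.3:443: i/o timeout
# The optional backslashes cover the escaped quotes in the structured log line.
TIMEOUT_RE = re.compile(
    r'Post \\?"(?P<url>https?://[^"\\]+)\\?": dial tcp (?P<addr>[\d.]+:\d+): i/o timeout'
)


def timed_out_endpoints(log_text: str) -> Counter:
    """Count i/o timeouts per (URL, resolved address) pair."""
    return Counter((m["url"], m["addr"]) for m in TIMEOUT_RE.finditer(log_text))
```

If every timeout points at the same AWS API endpoint, as it does in the excerpt, that is consistent with the controller dialing the endpoint directly rather than going through the proxy.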

wking · 2020-09-14T18:31:10Z

Hmm. How do we square "proxy console has access logs" with "we still cannot SSH in to gather those logs"?

ci-operator/step-registry/gather/aws-console/gather-aws-console-commands.sh

ci-operator/step-registry/ipi/conf/aws/proxy/ipi-conf-aws-proxy-commands.sh

cgwalters · 2020-09-14T18:34:01Z

"we still cannot SSH in to gather those logs"?

The ssh process is being run from the CI cluster - it can't SSH into a private VPC right?

ci-operator/step-registry/ipi/conf/aws/proxy/ipi-conf-aws-proxy-commands.sh

cgwalters · 2020-09-14T19:10:45Z

Re-simplified this and addressed comments.

cgwalters · 2020-09-14T21:01:36Z

Logs from that run also show the proxy successfully running.

wking · 2020-09-15T17:05:37Z

> The ssh process is being run from the CI cluster...

Yes.

> ... it can't SSH into a private VPC right?

We should be launching the proxy instance in a public subnet, so it should be reachable from the CI cluster.

We're currently using RHCOS as a way to run a container image in a single disposable VM. Let's use FCOS because it's more oriented towards this use case and also gets us out of needing to deal with Ignition version dependencies - we can just unconditionally use spec 3 (which RHCOS also uses in 4.6). Switch instance type to `m5.xlarge` to match the current OpenShift standard on general principle; there's no obvious reason we'd need "storage optimized".

cgwalters · 2020-09-15T17:33:45Z

> We should be launching the proxy instance in a public subnet, so it should be reachable from the CI cluster.

OK. Then is there a security group set up allowing SSH?

cgwalters · 2020-09-15T17:34:22Z

Updated the commit message per #11750 (comment)

wking · 2020-09-15T20:54:47Z

/lgtm

e2e-aws-proxy failed to build openstack-installer, presumably a flake. Hold until we see the proxy's connection logs again:

/hold
/retest

openshift-ci-robot · 2020-09-15T20:55:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, wking

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/ipi/conf/aws/proxy/OWNERS~~ [wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2020-09-15T20:57:50Z

> Then is there a security group set up allowing SSH?

Yup.

cgwalters · 2020-09-15T21:15:09Z

I definitely verified that if I booted an instance with this Ignition config outside of Prow (using coreos-assembler tooling), I was able to SSH in just fine. So I still think this is something related to the VPC.
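
A plain TCP probe of port 22 from wherever the gather step runs might help narrow this down: a timeout would point toward routing or security-group filtering, while a refusal would suggest the host is reachable but sshd isn't listening. A minimal sketch; the `probe_tcp` helper is hypothetical and not part of the step registry.

```python
import socket


def probe_tcp(host: str, port: int = 22, timeout: float = 5.0) -> str:
    """Return 'open', 'refused', 'timeout', or 'error: ...' for a TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. no route to host or a DNS failure.
        return f"error: {exc}"
```

For example, `probe_tcp("203.0.113.10")` (a documentation-range address used purely as a stand-in for the proxy's public IP) would most likely return `"timeout"` if nothing answers at that address.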

cgwalters · 2020-09-16T00:43:20Z

/retest

cgwalters · 2020-09-16T12:46:21Z

/test ci/prow/pj-rehearse

openshift-ci-robot · 2020-09-16T12:46:37Z

@cgwalters: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test app-ci-config-dry
/test build01-dry
/test build02-dry
/test ci-operator-config
/test ci-operator-config-metadata
/test ci-operator-registry
/test config
/test core-dry
/test core-valid
/test correctly-sharded-config
/test generated-config
/test generated-dashboards
/test ordered-prow-config
/test owners
/test pj-rehearse
/test prow-config
/test prow-config-filenames
/test prow-config-semantics
/test release-controller-config
/test services-dry
/test services-valid
/test step-registry-metadata
/test step-registry-shellcheck
/test vsphere-dry
/test pylint

Use /test all to run the following jobs:

pull-ci-openshift-release-master-app-ci-config-dry
pull-ci-openshift-release-master-build01-dry
pull-ci-openshift-release-master-build02-dry
pull-ci-openshift-release-master-ci-operator-config
pull-ci-openshift-release-master-ci-operator-config-metadata
pull-ci-openshift-release-master-ci-operator-registry
pull-ci-openshift-release-master-config
pull-ci-openshift-release-master-core-dry
pull-ci-openshift-release-master-core-valid
pull-ci-openshift-release-master-correctly-sharded-config
pull-ci-openshift-release-master-generated-config
pull-ci-openshift-release-master-generated-dashboards
pull-ci-openshift-release-master-ordered-prow-config
pull-ci-openshift-release-master-owners
pull-ci-openshift-release-master-pj-rehearse
pull-ci-openshift-release-master-prow-config
pull-ci-openshift-release-master-prow-config-filenames
pull-ci-openshift-release-master-prow-config-semantics
pull-ci-openshift-release-master-release-controller-config
pull-ci-openshift-release-master-services-dry
pull-ci-openshift-release-master-services-valid
pull-ci-openshift-release-master-step-registry-metadata
pull-ci-openshift-release-master-step-registry-shellcheck
pull-ci-openshift-release-master-vsphere-dry

cgwalters · 2020-09-16T12:46:53Z

/test pj-rehearse

cgwalters · 2020-09-16T13:04:29Z

/test pj-rehearse

openshift-ci-robot · 2020-09-16T14:48:37Z

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/aws-ebs-csi-driver/master/e2e-aws-csi | f4541a0fa1bbfbd855c11b2892414cffaef1cbe5 | link | `/test pj-rehearse` |
| ci/rehearse/cri-o/cri-o/master/e2e-aws | f4541a0fa1bbfbd855c11b2892414cffaef1cbe5 | link | `/test pj-rehearse` |

wking · 2020-09-16T16:13:46Z

The most recent rehearsal failed with:

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring"

So it got past bootstrap-complete, which means the proxy must be working. It's possible that the machine API not supporting the proxy, plus the lack of a functional EC2 VPC endpoint, is the only remaining problem blocking install, or there may be more. SSH into the proxy still fails, and we still don't understand why. AWS flaked out on console log gathering too, or I'd expect to see some logs here. Still, progress. Let's just land this and keep poking at SSH access in follow-up work.

/hold cancel

openshift-bot · 2020-09-16T16:16:06Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-09-16T16:27:28Z

@cgwalters: Updated the following 2 configmaps:

step-registry configmap in namespace ci at cluster api.ci using the following files:
- key ipi-conf-aws-proxy-commands.sh using file ci-operator/step-registry/ipi/conf/aws/proxy/ipi-conf-aws-proxy-commands.sh
step-registry configmap in namespace ci at cluster app.ci using the following files:
- key ipi-conf-aws-proxy-commands.sh using file ci-operator/step-registry/ipi/conf/aws/proxy/ipi-conf-aws-proxy-commands.sh

openshift-ci-robot requested review from ewolinetz and wking September 10, 2020 20:08

cgwalters force-pushed the proxy-port-fcos branch from 6b466dd to 71716ad Compare September 10, 2020 20:08

cgwalters force-pushed the proxy-port-fcos branch from 71716ad to fffac5e Compare September 10, 2020 20:32

cgwalters force-pushed the proxy-port-fcos branch 2 times, most recently from 7a85ecc to 33dacdf Compare September 11, 2020 12:53

cgwalters force-pushed the proxy-port-fcos branch from 33dacdf to f458c78 Compare September 11, 2020 20:56

cgwalters force-pushed the proxy-port-fcos branch 2 times, most recently from b824d7e to f4541a0 Compare September 12, 2020 14:07

cgwalters mentioned this pull request Sep 14, 2020

Bug 1875773: ci-operator/step-registry/ipi/conf/aws/blackholenetwork/blackhole_vpc_yaml: Add EC2 endpoint #11723

Merged