@cgwalters
Member

We're currently using RHCOS as a way to run a container image
in a single disposable VM. Let's use FCOS because it's more
oriented towards this use case and also gets us out of needing
to deal with Ignition version dependencies - we can just
unconditionally use spec 3 (which RHCOS also uses in 4.6).

@cgwalters
Member Author

(Didn't test this locally, we'll see what rehearse says)

@ewolinetz
Contributor

@cgwalters
Member Author

do we also need to bump here? https://github.com/openshift/release/pull/11750/files#diff-e687af75ecf4bc3dee02258e9703afa7R218

That link is taking me to the toplevel of the diff - can you elaborate on "here"?

@ewolinetz
Contributor

we are specifying version 2.1.0 here still, does that need to be bumped?

UserData:
        Fn::Base64: !Sub
        - '{"ignition":{"config":{"replace":{"source":"\${IgnitionLocation}","verification":{}}},"timeouts":{},"version":"2.1.0"},"networkd":{},"passwd":{},"storage":{},"systemd":{}}'
        - {
          IgnitionLocation: !Ref ProxyIgnitionLocation
        }
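For reference, a minimal sketch of what the spec 3 equivalent of this pointer config could look like, built in Python so the JSON stays valid. The `pointer_config` helper and the example URL are hypothetical and not taken from the PR; the sketch assumes spec version `3.0.0`, drops the empty `verification` and `timeouts` objects, and omits the top-level `networkd` section, which spec 3 no longer has.

```python
import json


def pointer_config(source_url: str, version: str = "3.0.0") -> str:
    """Return a compact Ignition spec 3 config that is replaced by the config at source_url.

    Hypothetical helper for illustration; the real script embeds the JSON
    directly in the CloudFormation template.
    """
    config = {
        "ignition": {
            "version": version,
            # Same "replace" indirection as the 2.1.0 config, minus empty sections.
            "config": {"replace": {"source": source_url}},
        }
    }
    return json.dumps(config, separators=(",", ":"))


if __name__ == "__main__":
    # Placeholder URL; the template substitutes ${IgnitionLocation} here.
    print(pointer_config("https://example.com/proxy.ign"))
```

The resulting string can stand in for the quoted JSON in the `!Sub` expression, with the placeholder URL swapped back for the template variable.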

@cgwalters
Member Author

we are specifying version 2.1.0 here still, does that need to be bumped?

It does and done!

@cgwalters force-pushed the proxy-port-fcos branch 2 times, most recently from 7a85ecc to 33dacdf, on September 11, 2020 12:53
@cgwalters
Member Author

/retest

@cgwalters
Member Author

rpm-md issues
/retest

@ewolinetz
Contributor

@cgwalters
Member Author

OK yeah I'm refactoring this script so I can more easily generate the Ignition outside of Prow and test things.

@cgwalters
Member Author

OK, running the generated Ignition appears to work in some quick tests in qemu, but I noticed we were still missing After=network-online.target, so I pushed that.

We also have docs for this use case of course: https://docs.fedoraproject.org/en-US/fedora-coreos/running-containers/
It'd be better to port this to fcct (inline data etc.) but one step at a time.
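As an illustration of the ordering fix mentioned above, here is a rough sketch of an Ignition spec 3 config that carries a systemd unit ordered after `network-online.target`. The unit name, image reference, and `container_unit_config` helper are hypothetical, not the ones used in the actual script; the sketch only shows where `After=` and `Wants=` would sit.

```python
import json

# Hypothetical unit; the image and description are placeholders.
UNIT = """[Unit]
Description=Example proxy container
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/podman run --rm --net=host example.com/proxy:latest
Restart=on-failure

[Install]
WantedBy=multi-user.target
"""


def container_unit_config(name: str, contents: str) -> dict:
    """Wrap a systemd unit in a minimal Ignition spec 3 config (assumes version 3.0.0)."""
    return {
        "ignition": {"version": "3.0.0"},
        "systemd": {
            "units": [{"name": name, "enabled": True, "contents": contents}]
        },
    }


if __name__ == "__main__":
    print(json.dumps(container_unit_config("proxy.service", UNIT), indent=2))
```

Without the `After=` line the container could start before the network is up and fail to pull its image, which is the kind of failure the fix is meant to avoid.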

@cgwalters force-pushed the proxy-port-fcos branch 2 times, most recently from b824d7e to f4541a0, on September 12, 2020 14:07
@cgwalters
Member Author

cgwalters commented Sep 12, 2020

OK this time the proxy host definitely came up, and I see access logs in its console.

And I think now the problem we're hitting is that machineAPI isn't able to provision workers due to the proxy:

E0912 15:38:22.441079       1 reconciler.go:236] ci-op-v9v01r44-4c51b-npq94-worker-us-east-2b-pzx25: error getting existing instances: RequestError: send request failed
caused by: Post "https://ec2.us-east-2.amazonaws.com/": dial tcp 52.95.18.3:443: i/o timeout
E0912 15:38:22.441108       1 controller.go:272] ci-op-v9v01r44-4c51b-npq94-worker-us-east-2b-pzx25: failed to check if machine exists: RequestError: send request failed
caused by: Post "https://ec2.us-east-2.amazonaws.com/": dial tcp 52.95.18.3:443: i/o timeout
E0912 15:38:22.441173       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="RequestError: send request failed\ncaused by: Post \"https://ec2.us-east-2.amazonaws.com/\": dial tcp 52.95.18.3:443: i/o timeout"  "controller"="machine_controller" "request"={"Namespace":"openshift-machine-api","Name":"ci-op-v9v01r44-4c51b-npq94-worker-us-east-2b-pzx25"}

So we could probably force merge this and then fix that as a followup?

@wking
Member

wking commented Sep 14, 2020

Hmm. How do we square "proxy console has access logs" with "we still cannot SSH in to gather those logs"?

@cgwalters
Member Author

"we still cannot SSH in to gather those logs"?

The ssh process is being run from the CI cluster - it can't SSH into a private VPC right?

@cgwalters
Member Author

Re-simplified this and addressed comments.

@cgwalters requested a review from wking on September 14, 2020 21:00
@cgwalters
Member Author

Logs from that run also show the proxy successfully running.

@wking
Member

wking commented Sep 15, 2020

The ssh process is being run from the CI cluster...

Yes.

... it can't SSH into a private VPC right?

We should be launching the proxy instance in a public subnet, so it should be reachable from the CI cluster.

We're currently using RHCOS as a way to run a container image
in a single disposable VM.  Let's use FCOS because it's more
oriented towards this use case and also gets us out of needing
to deal with Ignition version dependencies - we can just
unconditionally use spec 3 (which RHCOS also uses in 4.6).

Switch instance type to `m5.xlarge` to match the current OpenShift
standard on general principle; there's no obvious reason we'd
need "storage optimized".
@cgwalters
Member Author

We should be launching the proxy instance in a public subnet, so it should be reachable from the CI cluster.

OK. Then is there a security group set up allowing SSH?
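One way to narrow this down from any host that should have access is a plain TCP connect to port 22. The sketch below is a generic reachability probe, not something from the PR; the `tcp_reachable` name and the example address are hypothetical. Checking the exception type separately would tell more: a timeout usually points at packets being dropped on the way (security group or routing), while a refused connection suggests the host was reached but nothing is listening.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


if __name__ == "__main__":
    # Placeholder address from the documentation range; substitute the proxy's public IP.
    print(tcp_reachable("203.0.113.10", 22, timeout=3.0))
```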

@cgwalters
Member Author

Updated the commit message per #11750 (comment)

@wking
Member

wking commented Sep 15, 2020

/lgtm

e2e-aws-proxy failed to build openstack-installer, presumably a flake. Hold until we see the proxy's connection logs again:

/hold
/retest

@openshift-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 15, 2020
@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Sep 15, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, wking

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Sep 15, 2020
@wking
Member

wking commented Sep 15, 2020

Then is there a security group set up allowing SSH?

Yup.

@cgwalters
Member Author

I definitely verified that when I booted an instance with this Ignition config outside of Prow (using coreos-assembler tooling), I was able to ssh in just fine. So I still think this is something related to the VPC.

@cgwalters
Member Author

/retest

@cgwalters
Member Author

/test ci/prow/pj-rehearse

@openshift-ci-robot
Contributor

@cgwalters: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test app-ci-config-dry
  • /test build01-dry
  • /test build02-dry
  • /test ci-operator-config
  • /test ci-operator-config-metadata
  • /test ci-operator-registry
  • /test config
  • /test core-dry
  • /test core-valid
  • /test correctly-sharded-config
  • /test generated-config
  • /test generated-dashboards
  • /test ordered-prow-config
  • /test owners
  • /test pj-rehearse
  • /test prow-config
  • /test prow-config-filenames
  • /test prow-config-semantics
  • /test release-controller-config
  • /test services-dry
  • /test services-valid
  • /test step-registry-metadata
  • /test step-registry-shellcheck
  • /test vsphere-dry
  • /test pylint

Use /test all to run the following jobs:

  • pull-ci-openshift-release-master-app-ci-config-dry
  • pull-ci-openshift-release-master-build01-dry
  • pull-ci-openshift-release-master-build02-dry
  • pull-ci-openshift-release-master-ci-operator-config
  • pull-ci-openshift-release-master-ci-operator-config-metadata
  • pull-ci-openshift-release-master-ci-operator-registry
  • pull-ci-openshift-release-master-config
  • pull-ci-openshift-release-master-core-dry
  • pull-ci-openshift-release-master-core-valid
  • pull-ci-openshift-release-master-correctly-sharded-config
  • pull-ci-openshift-release-master-generated-config
  • pull-ci-openshift-release-master-generated-dashboards
  • pull-ci-openshift-release-master-ordered-prow-config
  • pull-ci-openshift-release-master-owners
  • pull-ci-openshift-release-master-pj-rehearse
  • pull-ci-openshift-release-master-prow-config
  • pull-ci-openshift-release-master-prow-config-filenames
  • pull-ci-openshift-release-master-prow-config-semantics
  • pull-ci-openshift-release-master-release-controller-config
  • pull-ci-openshift-release-master-services-dry
  • pull-ci-openshift-release-master-services-valid
  • pull-ci-openshift-release-master-step-registry-metadata
  • pull-ci-openshift-release-master-step-registry-shellcheck
  • pull-ci-openshift-release-master-vsphere-dry

@cgwalters
Member Author

/test pj-rehearse

@openshift-ci-robot
Contributor

openshift-ci-robot commented Sep 16, 2020

@cgwalters: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/rehearse/openshift/aws-ebs-csi-driver/master/e2e-aws-csi | f4541a0fa1bbfbd855c11b2892414cffaef1cbe5 | link | /test pj-rehearse |
| ci/rehearse/cri-o/cri-o/master/e2e-aws | f4541a0fa1bbfbd855c11b2892414cffaef1cbe5 | link | /test pj-rehearse |

@wking
Member

wking commented Sep 16, 2020

The most recent rehearsal failed with:

level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, kube-storage-version-migrator, monitoring"

So it got past bootstrap-complete, which means the proxy must be working. It's possible that the machine-API not supporting the proxy, plus the lack of a functional EC2 VPC endpoint, is the only remaining problem blocking install, or there may be more. SSH into the proxy still fails, and we still don't understand why. AWS flaked out on console log gathering too, or I'd expect to see some logs here. Still, this is progress. Let's just land this and keep poking at SSH access in follow-up work.

/hold cancel

@openshift-ci-robot removed the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Sep 16, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 1f2dba0 into openshift:master on Sep 16, 2020
@openshift-ci-robot
Contributor

@cgwalters: Updated the following 2 configmaps:

  • step-registry configmap in namespace ci at cluster api.ci using the following files:
    • key ipi-conf-aws-proxy-commands.sh using file ci-operator/step-registry/ipi/conf/aws/proxy/ipi-conf-aws-proxy-commands.sh
  • step-registry configmap in namespace ci at cluster app.ci using the following files:
    • key ipi-conf-aws-proxy-commands.sh using file ci-operator/step-registry/ipi/conf/aws/proxy/ipi-conf-aws-proxy-commands.sh
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.

6 participants