@jhixson74 jhixson74 commented Jul 24, 2019

This code copies the loopback kubeconfig into the Kubernetes configuration so that static pods can use it. The approve-csr service script is also changed to use it.

This is necessary due to a limitation with Azure internal load balancers. See limitation #2 here: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-overview#limitations

"Unlike public Load Balancers which provide outbound connections when transitioning from private IP addresses inside the virtual network to public IP addresses, internal Load Balancers do not translate outbound originated connections to the frontend of an internal Load Balancer as both are in private IP address space. This avoids potential for SNAT port exhaustion inside unique internal IP address space where translation is not required. The side effect is that if an outbound flow from a VM in the backend pool attempts a flow to frontend of the internal Load Balancer in which pool it resides and is mapped back to itself, both legs of the flow don't match and the flow will fail."

https://jira.coreos.com/browse/CORS-1094
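As a rough illustration of what a "loopback kubeconfig" means here, the sketch below derives one from a regular kubeconfig by pointing its `server:` field at the API endpoint on the local machine, so requests never pass through the internal load balancer. This is not how the installer does it (per the commits referenced on this page, the installer generates its loopback kubeconfig itself). The function name `make_loopback_kubeconfig` and its arguments are hypothetical, and the `https://localhost:6443` endpoint is taken from the log excerpts quoted further down this page.

```shell
#!/usr/bin/env bash
# Sketch only: rewrite the server URL of a kubeconfig so clients talk to the
# API server on the local machine instead of the load-balanced endpoint.
# make_loopback_kubeconfig and its arguments are hypothetical names.
make_loopback_kubeconfig() {
    local src="$1" dst="$2"
    # Replace whatever follows "server:" while keeping the line's indentation.
    sed -E 's#^([[:space:]]*server:[[:space:]]*).*$#\1https://localhost:6443#' \
        "$src" > "$dst"
}
```

A kubeconfig rewritten this way only works if the serving certificate is also valid for `localhost`. Otherwise clients fail with x509 hostname errors.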

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 24, 2019
@jhixson74 jhixson74 force-pushed the master_azure_bootstrap_use_loopback_kubeconfig branch from 2d72ab0 to 19415cb Compare July 31, 2019 21:21
@jhixson74

/retest

@jhixson74 jhixson74 force-pushed the master_azure_bootstrap_use_loopback_kubeconfig branch from 19415cb to 16ccc5d Compare August 1, 2019 20:53
jhixson74 commented Aug 2, 2019

@jhixson74 jhixson74 force-pushed the master_azure_bootstrap_use_loopback_kubeconfig branch from 16ccc5d to 82d81d9 Compare August 2, 2019 18:41
wking commented Aug 2, 2019

Grepping around for /kubeconfig, do we also need to update these and this? And maybe this for consistency (CC @tomassedovic for OpenStack feedback. Obviously OpenStack services won't be used on Azure, but if the other bootstrap services are using a loopback kubeconfig we might want that service to follow for consistency)?

abhinavdahiya commented Aug 2, 2019

> Grepping around for /kubeconfig, do we also need to update these and this?

queue resources/nodes.list oc --config=/opt/openshift/auth/kubeconfig --request-timeout=5s get nodes -o jsonpath --template '{range .items[*]}{.metadata.name}{"\n"}{end}'

I'm not too concerned with this. The use case for this script is that bootstrapping failed but we have an API running. The bootstrap API server is usually not running when this script is run, so keeping it on the LB is fine.

https://github.com/openshift/installer/blob/40f7fa5ed322f06d37d7dfd1e8c222499b3be71/data/data/bootstrap/systemd/units/progress.service#L8

progress always runs when bootstrapping completes, so it needs to stay on the LB.

export KUBECONFIG=/opt/openshift/auth/kubeconfig

OpenStack doesn't end up with the weird LB black hole and can keep using the LB if it likes. I'm not concerned when it's platform-specific.
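The reasoning in this comment can be summarized as a small lookup: consumers that run while the bootstrap API server is up on the same machine use the loopback kubeconfig, and consumers that run after (or without) it keep using the load-balanced one. This is a sketch only. The function name `kubeconfig_for`, the `gather` label for the failure-collection script, and the loopback path `/opt/openshift/auth/kubeconfig-loopback` are assumptions for illustration. Only `/opt/openshift/auth/kubeconfig` appears in the discussion.

```shell
# Sketch only: which kubeconfig each bootstrap consumer uses, per the
# discussion above. The loopback path and the "gather" label are assumptions.
LB_KUBECONFIG=/opt/openshift/auth/kubeconfig
LOOPBACK_KUBECONFIG=/opt/openshift/auth/kubeconfig-loopback  # assumed path

kubeconfig_for() {
    case "$1" in
        approve-csr|openshift)
            # These run while the bootstrap API server is up on this machine.
            echo "$LOOPBACK_KUBECONFIG" ;;
        progress|gather)
            # These run after (or without) the bootstrap API server, so they
            # reach the API through the load balancer.
            echo "$LB_KUBECONFIG" ;;
        *)
            echo "unknown consumer: $1" >&2
            return 1 ;;
    esac
}
```

For example, `oc --config="$(kubeconfig_for progress)" get nodes` would go through the load balancer.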

wking commented Aug 5, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 5, 2019
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jhixson74, wking


@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 5, 2019
@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot

@jhixson74: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-openstack | 82d81d9 | link | /test e2e-openstack |
| ci/prow/e2e-aws-scaleup-rhel7 | 82d81d9 | link | /test e2e-aws-scaleup-rhel7 |


@openshift-merge-robot openshift-merge-robot merged commit d6a211b into openshift:master Aug 5, 2019
wking added a commit to wking/openshift-installer that referenced this pull request Aug 20, 2019
…ys for etcd-signer

Since the pivots to prefer loopback Kube-API access:

* bf59ebf (azure: generate loopback kubeconfig to access API
  locally, 2019-07-17, openshift#2085).
* 82d81d9 (data/data/bootstrap: use loopback kubeconfig for API
  access, 2019-07-24, openshift#2086).
* openshift/cluster-bootstrap@61d1428bea (pkg/start: use loopback
  kubeconfig to talk to API, 2019-07-23,
  openshift/cluster-bootstrap#28).
* possibly more

logs on the bootstrap machine have contained distracting errors like
these reported in [1]:

  $ grep 'not localhost\|etcd-signer' journal-bootstrap.log
  ...
  Aug 20 10:33:56 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com podman[8366]: 2019-08-20 10:33:56.090073216 +0000 UTC m=+2.644782091 container start d0dcc42a1335c1224df35a48a279f63f1cb7a03c94de5ebb29e2633e6ee6c429 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f20394d571ff9a28aed9366434521d221d8d743a6efe2a3d6c6ad242198a522e, name=etcd-signer)
  Aug 20 10:33:58 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com openshift.sh[2867]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://localhost:6443/api?timeout=32s: x509: certificate is valid for api.bm1.oc4, not localhost
  Aug 20 10:34:01 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com approve-csr.sh[2870]: Unable to connect to the server: x509: certificate is valid for api.bm1.oc4, not localhost
  ...
  Aug 20 10:43:55 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com openshift.sh[2867]: error: unable to recognize "./99_kubeadmin-password-secret.yaml": Get https://localhost:6443/api?timeout=32s: x509: certificate is valid for api.bm1.oc4, not localhost
  Aug 20 10:43:59 cnv-qe-08.cnvqe.lab.eng.rdu2.redhat.com podman[15272]: 2019-08-20 10:43:59.68789639 +0000 UTC m=+0.188325679 container died d0dcc42a1335c1224df35a48a279f63f1cb7a03c94de5ebb29e2633e6ee6c429 (image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:f20394d571ff9a28aed9366434521d221d8d743a6efe2a3d6c6ad242198a522e, name=etcd-signer)
  ...

With this commit, we pass the localhost cert to etcd-signer so we can
form the TLS connection to gracefully say "sorry, I'm not really a
Kube API server".  Fixes [2].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1743661
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1743840
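To see which bootstrap units were hitting the hostname mismatch, the grep above can be narrowed to the x509 error and tallied per unit. This is a minimal sketch, assuming journal lines in the format quoted in this commit message (`Mon DD HH:MM:SS host unit[pid]: message`). The function name `count_localhost_x509_errors` is hypothetical.

```shell
# Sketch only: count "not localhost" x509 errors per unit in a saved journal.
# Assumes the fifth whitespace-separated field is "unit[pid]:".
count_localhost_x509_errors() {
    local log="$1"
    grep 'x509: certificate is valid for .*, not localhost' "$log" \
        | awk '{ unit = $5; sub(/\[[0-9]+\]:$/, "", unit); print unit }' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Calling `count_localhost_x509_errors journal-bootstrap.log` prints one `unit count` pair per line.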
wking added a commit to wking/openshift-installer that referenced this pull request Sep 17, 2019
With this commit, I take advantage of
openshift/cluster-bootstrap@fc5e0941 (start: wire the library-go
dynamic client create, 2019-02-05, openshift/cluster-bootstrap#14) to
replace our previous openshift.sh (with a minor change to the manifest
directory).  I'm currently using a cp in bootkube.sh to shift those
manifests into the generic directory; I plan on consolidating
Openshift into Manifests in pkg/asset/manifests in follow-up work.

This change is especially important since the pivot to loopback
kubeconfigs in openshift.sh: 82d81d9 (data/data/bootstrap: use
loopback kubeconfig for API access, 2019-07-24, openshift#2086), because once
cluster-bootstrap (launched from bootkube.sh) decides it's done it
tears down the bootstrap control plane.  Without the bootstrap control
plane, further attempts by openshift.sh to push manifests via the
loopback kubeconfig fail [1].

We could roll reporting into bootkube.sh as well (dropping
progress.service), but Abhinav wanted to keep it separate [2].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1748452
[2]: openshift#1381 (comment)
wking added a commit to wking/openshift-installer that referenced this pull request Oct 8, 2019
…ntrol-plane teardown

Since 82d81d9 (data/data/bootstrap: use loopback kubeconfig for API
access, 2019-07-24, openshift#2086), we've been pushing the OpenShift-specific
components via a loopback kubeconfig and the bootstrap control plane.
Since 108a45b (data/bootstrap: Replace openshift.sh with
cluster-bootstrap, 2019-01-24, landed 2019-09-30, openshift#1381), that push
has always happened before the bootstrap control plane shuts down.

I've changed "into" to "via" because the data passes through either
control plane on its way to rest in the shared etcd cluster.

I've dropped "then" from the final step because none of the other
steps said "then".  I think ordering is clear enough from our use of
an ordered list ;).
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
…ys for etcd-signer

jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
jhixson74 pushed a commit to jhixson74/installer that referenced this pull request Dec 6, 2019
…ntrol-plane teardown
