Skip to content

Conversation

@tomassedovic
Copy link
Contributor

@tomassedovic tomassedovic commented Jul 9, 2019

openstack: Remove the Service VM

The experimental OpenStack backend used to create an extra server
running DNS and load balancer services that the cluster needed.
OpenStack does not always come with DNSaaS or LBaaS so we had to provide
the functionality the OpenShift cluster depends on (e.g. the etcd SRV
records, the api-int records & load balancing, etc.).

This approach is undesirable for two reasons: first, it adds an extra
node that the other IPI platforms do not need. Second, this node is a
single point of failure.

The Baremetal platform has faced the same issues and they have solved
them with a few virtual IP addresses managed by keepalived in
combination with coredns static pod running on every node using the mDNS
protocol to update records as new nodes are added or removed and a
similar static pod haproxy to load balance the control plane internally.

The VIPs are defined here in the installer and they use the
PlatformStatus field to be passed to the necessary
machine-config-operator fields:

openshift/api#374

The Bare Metal IPI Networking Infrastructure document is broadly
applicable here as well:

https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md

Notable differences in OpenStack:

  • We only use the API and DNS VIPs right now
  • Instead of Baremetal's Ingress VIP (which is attached to the OpenShift
    routers) our haproxy static pods balance the 80 & 443 pods to the
    worker nodes
  • We do not run coredns on the bootstrap node. Instead, bootstrap itself
    uses one of the masters for DNS.

These differences are not fundamental to OpenStack and we will be
looking at aligning more closely with the Baremetal provider in the
future.

There is also a great oportunity to share some of the configuration
files and scripts here.

This change needs several other pull requests:

Keepalived plus the coredns & haproxy static pods in the MCO:
openshift/machine-config-operator/pull/740

Passing the API and DNS VIPs through the installer:
#1998

Vendoring the OpenStack PlatformStatus changes in the MCO:
openshift/machine-config-operator#978

Allowing to use PlatformStatus in the MCO templates:
openshift/machine-config-operator#943

Co-authored-by: Emilio Garcia [email protected]
Co-authored-by: John Trowbridge [email protected]
Co-authored-by: Martin Andre [email protected]
Co-authored-by: Tomas Sedovic [email protected]

Massive thanks to the Bare Metal and oVirt people!

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 9, 2019
@tomassedovic
Copy link
Contributor Author

/label platform/openstack

@tomassedovic
Copy link
Contributor Author

Right now, I'm just pushing this out there so we can discuss & collaborate on it in public.

Most of the code has been written by @trown, @mandre and @iamemilio. I consolidated it and rebased on master.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 17, 2019
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2019
@openshift-ci-robot openshift-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jul 26, 2019
@mandre
Copy link
Member

mandre commented Jul 26, 2019

/retitle openstack: remove the service VM

@openshift-ci-robot openshift-ci-robot changed the title WIP openstack: remove the service VM openstack: remove the service VM Jul 26, 2019
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 26, 2019
@tomassedovic
Copy link
Contributor Author

Note regarding the e2e-openstack job: it will keep failing until openshift/machine-config-operator#740 is merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, we could just require floating IP support for this first iteration and make this non-optional.

In general, VMs without a floating IP attached are not accessible from the outside (their IP addresses are private). So without a floating IP, the installer or anyone else can't reach the OpenShift API running there.

Some OpenStack environments do not allow any floating IPs at all or limit the amount the user can get.

@tomassedovic
Copy link
Contributor Author

@abhinavdahiya thanks for your comments! They should all be addressed now.

Please let me know if there's anything else.

mandre added a commit to mandre/machine-config-operator that referenced this pull request Jul 30, 2019
openshift#740 and
openshift/installer#1959 depend on each other.
We need to break this dependency to allow them to merge while keeping
the e2e-openstack job green.

This patch disables the static pods added for the openstack platform in
openshift#740 unless the
installer provided the needed info set via
openshift/installer#1959.

It can be safely reverted once
openshift/installer#1959 merges.
mandre added a commit to trown/machine-config-operator that referenced this pull request Jul 30, 2019
…tches

openshift#740 and
openshift/installer#1959 depend on each other.
We need to break this dependency to allow them to merge while keeping
the e2e-openstack job green.

This patch disables the static pods added for the openstack platform in
openshift#740 unless the
installer provided the needed info set via
openshift/installer#1959.

It can be safely reverted once
openshift/installer#1959 merges.
@mandre
Copy link
Member

mandre commented Aug 8, 2019

/retest

@mandre
Copy link
Member

mandre commented Aug 8, 2019

/test e2e-openstack

@tomassedovic
Copy link
Contributor Author

/test e2e-openstack

@tomassedovic
Copy link
Contributor Author

tomassedovic commented Aug 8, 2019

Looks like the current CI failures were caused by: #2086

We're investigating now. This should hopefully fix it:

openshift/machine-config-operator#1050

@tomassedovic
Copy link
Contributor Author

Okay, I've reproduced the issue locally (by rebasing this PR on top of master) and verified that openshift/machine-config-operator#1050 fixes it.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2019
The experimental OpenStack backend used to create an extra server
running DNS and load balancer services that the cluster needed.
OpenStack does not always come with DNSaaS or LBaaS so we had to provide
the functionality the OpenShift cluster depends on (e.g. the etcd SRV
records, the api-int records & load balancing, etc.).

This approach is undesirable for two reasons: first, it adds an extra
node that the other IPI platforms do not need. Second, this node is a
single point of failure.

The Baremetal platform has faced the same issues and they have solved
them with a few virtual IP addresses managed by keepalived in
combination with coredns static pod running on every node using the mDNS
protocol to update records as new nodes are added or removed and a
similar static pod haproxy to load balance the control plane internally.

The VIPs are defined here in the installer and they use the
PlatformStatus field to be passed to the necessary
machine-config-operator fields:

openshift/api#374

The Bare Metal IPI Networking Infrastructure document is applicable here as
well:

https://github.com/openshift/installer/blob/master/docs/design/baremetal/networking-infrastructure.md

There is also a great opportunity to share some of the configuration
files and scripts here.

This change needs several other pull requests:

Keepalived plus the coredns & haproxy static pods in the MCO:
openshift/machine-config-operator#740

Co-authored-by: Emilio Garcia <[email protected]>
Co-authored-by: John Trowbridge <[email protected]>
Co-authored-by: Martin Andre <[email protected]>
Co-authored-by: Tomas Sedovic <[email protected]>

Massive thanks to the Bare Metal and oVirt people!
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 8, 2019
@tomassedovic
Copy link
Contributor Author

Rebased. The release image now has openshift/machine-config-operator#1050 so this should pass e2e-openstack (unless I've messed the rebase up or there's another breakage).

@tomassedovic
Copy link
Contributor Author

/hold cancel

The OpenStack job passed, I've validated it locally and this PR is rabased on top of master. I think we're ready to merge.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 8, 2019
@mandre
Copy link
Member

mandre commented Aug 8, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 8, 2019
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@iamemilio
Copy link

/test e2e-aws-scaleup-rhel7

@Fedosin
Copy link
Contributor

Fedosin commented Aug 8, 2019

"/test e2e-aws-scaleup-rhel7" is broken, no need to retest it

@iamemilio
Copy link

/lgtm

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, iamemilio, mandre, tomassedovic

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

@tomassedovic: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-aws-scaleup-rhel7 0c1768d link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. platform/openstack size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants