Conversation

Contributor

@csrwng commented Feb 4, 2020

No description provided.

@openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Feb 4, 2020
@csrwng mentioned this pull request Feb 4, 2020
Contributor Author

@csrwng commented Feb 4, 2020

@derekwaynecarr

@derekwaynecarr
Member

/assign @derekwaynecarr

FYI @smarterclayton @eparis

@sudhaponnaganti left a comment

FYI @jupierce

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Feb 5, 2020
authors:
- "@csrwng"
reviewers:
- "@derekwaynecarr"
Contributor

Contributor Author

@sttts did you mean a different pr? the link above is for this one

Contributor

No, meant this one. Just expected the owners of the mentioned operators to be informed about these plans by being reviewers of the enhancement.

Contributor Author

Ack, will add more reviewers

- kubernetes apiserver
- kubernetes controller manager
- kubernetes scheduler
- openshift apiserver
Contributor

Has it been discussed what is needed to move that into the customer cluster?

Contributor Author

It hasn't... When we first looked at this, however, we were in a catch-22 situation: in order to be able to schedule pods, we needed the OpenShift CRDs and controllers to be functional.

code should include config observers that assemble a new configuration for their
respective control plane components. This will ensure that drift in future versions
is kept under control and that a single code base is used to manage control plane
configuration.
Contributor

Where is this beta control plane operator?

Contributor Author

We're working on it; it will be added to the current hypershift-toolkit repo. For the second phase we will create separate repos for each of the control plane controllers.
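The config-observer pattern described in the excerpt above can be sketched roughly as follows. This is a minimal illustration only, not the actual library-go API; the observer function and `assemble_config` names are hypothetical:

```python
# Sketch of the config-observer pattern: each observer is a pure function
# that derives one fragment of control-plane configuration from observed
# cluster state, and the fragments are assembled into a single config.
# This mirrors the shape of the pattern only; the real OpenShift
# implementation lives in library-go and differs in detail.

def observe_internal_registry_hostname(observed_state, existing_config):
    """Return (config_fragment, errors) for one configuration key."""
    errors = []
    hostname = observed_state.get("imageConfig", {}).get("internalRegistryHostname")
    if not hostname:
        # Keep the previously observed value rather than dropping it.
        previous = existing_config.get("imagePolicyConfig", {}).get(
            "internalRegistryHostname")
        return {"imagePolicyConfig": {"internalRegistryHostname": previous}}, errors
    return {"imagePolicyConfig": {"internalRegistryHostname": hostname}}, errors


def assemble_config(observers, observed_state, existing_config):
    """Merge the fragments produced by each observer into one config."""
    merged = {}
    all_errors = []
    for observer in observers:
        fragment, errors = observer(observed_state, existing_config)
        all_errors.extend(errors)
        for key, value in fragment.items():
            merged.setdefault(key, {}).update(value)
    return merged, all_errors
```

Keeping observers as a single shared code base, as the excerpt suggests, means both the self-hosted operators and the hosted control plane assemble configuration the same way, which is what keeps drift under control across versions.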

Public Cloud team the necessary tools to generate manifests needed for a hosted
control plane.
- Ensure that this deployment model remains functional through regular e2e testing
on IBM Public Cloud.
Contributor

"regular" means we get a normal CI job in the openshift org making sure our control plane code changes don't break it?

Contributor

Will the CI job be blocking for OpenShift PRs? E.g., if the deployment topology changes and some apiserver suddenly does not serve certain APIs because they moved, their CI will break.

Contributor Author

"regular" means we get a normal CI job in the openshift org making sure our control plane code changes don't break it?

That is the plan. We currently have a periodic job that creates clusters based on the 4.3 branches. One will be added for the master/4.4 branch.

Will the CI job be blocking for OpenShift PRs? E.g., if the deployment topology changes and some apiserver suddenly does not serve certain APIs because they moved, their CI will break.

That will be harder. We would require capacity on the IBM Cloud to run that many jobs, and I'm not sure that is feasible right now. The periodic job should block a release, but not individual PRs.

Contributor

We describe those apiserver changes (they are happening, now for oauth) in enhancements. The IBM team has to watch that repo to be informed.

Contributor Author

@csrwng commented Feb 5, 2020

@deads2k @mfojtik @ironcladlou @spadgett @abhinavdahiya @crawford @miabbott

Please let me know if I should include other reviewers


#### Console Changes
The console should not report the control plane as being down if no metrics
datapoints exist for control plane components in this configuration.
Member

Ack, this shouldn't be an issue.

cc @rawagner @andybraren


I believe Control Plane components would be shown as "Not available" if that's acceptable (better than "Down").

(screenshot attached)

Member

@andybraren We probably need to remove that since we never expect to have control plane metrics. It's misleading to say not available.

@rawagner commented Feb 6, 2020

Agreed, created an issue to track this for 4.5 Dashboards https://issues.redhat.com/browse/MGMT-438

- openshift controller manager
- cluster version operator
- control plane operator(s)\*
- oauth server\+
Member

@csrwng Are there any lingering issues where the console backend rejects the OAuth server certificate in this deployment?

Contributor Author

@spadgett no issue at the moment. Thx!

Member

@miabbott commented Feb 7, 2020

cc: @lucab

@lucab commented Feb 7, 2020

From an OS point of view, there are at least two things in this proposal that look uncomfortably hairy to me:

  1. There is no mention of which cloud flavor this is trying to target. At a compute level, "IBM Cloud" is really an umbrella label for three different kinds of infrastructure: Classic, Gen 1, and Gen 2.
    Out of those three, only the last (IMHO) qualifies as a proper environment where we can sanely support provisioning "cattle nodes" with RHCOS and Ignition.
  2. This briefly mentions several topics which seem to require host-level customization (e.g. VPN setup, service network, certificate minting, IBM-specific automation) without ever mentioning how the logic for that is containerized and provided to the nodes at provisioning time (i.e. before kubelet is bootstrapped).

In short, the post-GA story is under-specified, so it's quite hard to judge it. The rest of the document seems to hint at a heavy UPI+RHEL environment, which offers a lot of escape hatches and is more likely to result in a "pet nodes" provisioning flow.

If that's indeed the priority, then it would be better to descope RHCOS and leave it for "future exploration" (with the risk that it may be very hard or impossible to retrofit). If RHCOS workers are instead a requirement, then clarifying the points above may result in a vastly different design and required work.


### Non-Goals

- Make hosted control planes a supported deployment model outside of IBM Public Cloud.
Member

I find this confusing...IBM Public Cloud means a lot of different things, and a sort of baseline obvious one is having OpenShift support the default "self driving" path in their existing IaaS. But I guess we're doing hosted control plane first?

Maybe the enhancement should be called: "IBM Public Cloud Hosted Control Plane" ?

And one thing I would say here is that we should think of this "fairly" - if some other IaaS showed up and was willing to commit significant resources to maintaining a similar thing... clearly vast amounts of the design would likely be shared. But that can come later.

Contributor Author

Absolutely, I think the things we learn from this work are things we can likely reuse in other cases. And perhaps at the proper time, we can make the pattern something standalone that's configured per provider. So definitely, this non-goal is a point-in-time statement.

Contributor Author

I find this confusing...IBM Public Cloud means a lot of different things, and a sort of baseline obvious one is having OpenShift support the default "self driving" path in their existing IaaS. But I guess we're doing hosted control plane first?

At least in the foreseeable future, supporting the self-hosted path is not a priority afaik, but @derekwaynecarr can likely provide more insight into that.

Member

@cgwalters left a comment

Thanks for writing this enhancement BTW!

I will say I have trouble keeping in my head the fundamental impacts this makes to the default OpenShift 4 "self-driving" ("non-hosted"? We need a term...) mode. Particularly given the other fundamental changes going on, like the etcd operator, that affect how we think of the control plane too.

Maybe we can use "hostedCP" as a shorthand term when discussing this? (HCP is obvious, but three-letter acronyms are too common, etc.)

Contributor Author

@csrwng commented Feb 7, 2020

In short, the post-GA story is under-specified so it's quite hard to judge it. The rest of the document seems to hint at a heavy UPI+RHEL environment, which offers a lot of escape hatches and is more likely to result in a "pet nodes" provisioning flow.

@lucab thank you for the feedback. Yes, this proposal is definitely under-specified when it comes to RHCOS. It's more a statement that we don't expect to continue supporting IBM Cloud without RHCOS forever. We should have a separate enhancement proposal/design specifically for RHCOS, given that, as I understand it, the RHCOS team has already done some initial investigations around this.


Enables an OpenShift cluster to be hosted on top of a Kubernetes/OpenShift cluster.

Given a release image, a CLI tool generates manifests that instantiate the control plane
Member

OS upgrades for the worker nodes are owned by the customer? Or does IBM provide tooling for that? Are they using openshift-ansible?

In the "Post-GA" world with RHCOS... do we foresee trying to enable the MCO to manage upgrades for the workers with RHCOS?

Member

Isn't this covered in the Managed Workers section below?

Member

The second half is yes, thanks!

#### Managed Workers
RHCOS adds support for bootstrapping on IBM Public Cloud. The MCO is added to
the components that get installed on the management cluster. This enables upgrading
of RHCOS nodes using the same mechanisms as in self-hosted OpenShift.
Member

So compute nodes still run machine-config daemons and cluster admins can write MachineConfig entries, create MachineSets, and all that good stuff? They just don't have any objects representing or control over the control-plane machines?

@derekwaynecarr
Member

The service is now GA and running 4.4.11, so let's merge this and then make updates to explain usage of Cluster Profile(s).

/approve
/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Aug 4, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, derekwaynecarr, sudhaponnaganti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [derekwaynecarr,sudhaponnaganti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit e26a363 into openshift:master Aug 4, 2020
minimum set of manifests to allow skipping the component should be annotated.
However, in the case of the Machine API and Machine Configuration operators,
the CRDs that represent machines, machinesets and autoscalers should also be
skipped. Monitoring alerts for components that do not get installed in the user
Contributor

This part won't be that easy unless we are provided with a list of them. Also, alerts will not fire if there are no metrics for those components, so I don't think it's a problem.

Contributor Author

This is what we addressed with openshift/cluster-monitoring-operator#705
No other alerts related to control plane components have surfaced.
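The skip mechanism discussed in this thread can be sketched as a simple annotation filter over the release payload's manifests. This is an illustrative sketch only; the annotation name below is hypothetical (as noted later in the PR, the actual mechanism evolved into cluster profiles):

```python
# Sketch of annotation-based manifest filtering, in the spirit of the
# CVO skip mechanism discussed above. The annotation name is made up
# for illustration; the real CVO uses cluster-profile annotations.

SKIP_ANNOTATION = "exclude.release.openshift.io/hosted-control-plane"

def filter_manifests(manifests, hosted=True):
    """Return only the manifests the CVO should apply in this topology."""
    kept = []
    for manifest in manifests:
        annotations = manifest.get("metadata", {}).get("annotations", {})
        if hosted and annotations.get(SKIP_ANNOTATION) == "true":
            continue  # component runs on the management cluster instead
        kept.append(manifest)
    return kept
```

For the Machine API and Machine Config cases called out in the excerpt, the CRD manifests themselves would carry the annotation, so the user cluster never sees machine/machineset/autoscaler types at all.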


Changes are required in different areas of the product in order to make clusters deployed
using this method viable. These include changes to the cluster version operator (CVO), web
console, second level operators (SLOs) deployed by the CVO, and RHCOS.
Contributor

Please don't use SLO as an acronym here. It is very commonly used for "Service Level Objectives".

Contributor Author

Ack, will do a follow-up to remove it.

the new cluster. Minting of kubelet certificates for these worker nodes is handled
by IBM automation.

Components that run on the management cluster include:
Contributor

For cluster-monitoring: as apiserver monitoring is effectively being disabled when running in "ROKS mode", what is monitoring control plane components on the management cluster?

Contributor Author

IBM is running their own monitoring solution on their management/tugboat clusters.

cluster should also be skipped where possible.

#### Console Changes
The console should not report the control plane as being down if no metrics
Contributor

Are all edge cases for disabling monitoring of control plane components on the worker clusters covered in openshift/cluster-monitoring-operator#705 ?

Contributor Author

So far, yes.
