Conversation

Contributor

@csrwng commented Feb 4, 2020

No description provided.

@openshift-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Feb 4, 2020
@csrwng mentioned this pull request Feb 4, 2020
Contributor Author

@csrwng commented Feb 4, 2020

@derekwaynecarr

@derekwaynecarr
Member

/assign @derekwaynecarr

FYI @smarterclayton @eparis

@sudhaponnaganti left a comment

FYI @jupierce

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Feb 5, 2020
authors:
- "@csrwng"
reviewers:
- "@derekwaynecarr"
Contributor

Contributor Author

@sttts did you mean a different pr? the link above is for this one

Contributor

No, meant this one. Just expected the owners of the mentioned operators to be informed about these plans by being reviewers of the enhancement.

Contributor Author

Ack, will add more reviewers

- kubernetes apiserver
- kubernetes controller manager
- kubernetes scheduler
- openshift apiserver
Contributor

Has it been discussed what is needed to move that into the customer cluster?

Contributor Author

It hasn't... When we first looked at this, however, we were in a catch-22 situation: in order to be able to schedule pods, we needed the OpenShift CRDs and controllers to be functional.

code should include config observers that assemble a new configuration for their
respective control plane components. This will ensure that drift in future versions
is kept under control and that a single code base is used to manage control plane
configuration.
Contributor

Where is this beta control plane operator?

Contributor Author

We're working on it; it will be added to the current hypershift-toolkit repo. For the second phase we will create separate repos for each of the control plane controllers.
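The config-observer pattern described in the excerpt above can be sketched roughly as follows. This is a minimal illustration only, not the actual library-go API; the observer function and `assemble_config` names are hypothetical:

```python
# Sketch of the config-observer pattern: each observer is a pure function
# that derives one fragment of control-plane configuration from observed
# cluster state, and the fragments are assembled into a single config.
# This mirrors the shape of the pattern only; the real OpenShift
# implementation lives in library-go and differs in detail.

def observe_internal_registry_hostname(observed_state, existing_config):
    """Return (config_fragment, errors) for one configuration key."""
    errors = []
    hostname = observed_state.get("imageConfig", {}).get("internalRegistryHostname")
    if not hostname:
        # Keep the previously observed value rather than dropping it.
        previous = existing_config.get("imagePolicyConfig", {}).get(
            "internalRegistryHostname")
        return {"imagePolicyConfig": {"internalRegistryHostname": previous}}, errors
    return {"imagePolicyConfig": {"internalRegistryHostname": hostname}}, errors


def assemble_config(observers, observed_state, existing_config):
    """Merge the fragments produced by each observer into one config."""
    merged = {}
    all_errors = []
    for observer in observers:
        fragment, errors = observer(observed_state, existing_config)
        all_errors.extend(errors)
        for key, value in fragment.items():
            merged.setdefault(key, {}).update(value)
    return merged, all_errors
```

Keeping observers as a single shared code base, as the excerpt suggests, means both the self-hosted operators and the hosted control plane assemble configuration the same way, which is what keeps drift under control across versions.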

Public Cloud team the necessary tools to generate manifests needed for a hosted
control plane.
- Ensure that this deployment model remains functional through regular e2e testing
on IBM Public Cloud.
Contributor

"regular" means we get a normal CI job in the openshift org making sure our control plane code changes don't break it?

Contributor

Will the CI job be blocking for OpenShift PRs? E.g., if the deployment topology changes and some apiserver suddenly does not serve certain APIs because they moved, their CI will break.

Contributor Author

"regular" means we get a normal CI job in the openshift org making sure our control plane code changes don't break it?

That is the plan. We currently have a periodic job that creates clusters based on the 4.3 branches. One will be added for the master/4.4 branch.

Will the CI job be blocking for OpenShift PRs? E.g., if the deployment topology changes and some apiserver suddenly does not serve certain APIs because they moved, their CI will break.

That will be harder. We would require capacity on the IBM Cloud to run that many jobs, and I'm not sure that is feasible right now. The periodic job should block a release, but not individual PRs.

Contributor

We describe those apiserver changes (they are happening, now for oauth) in enhancements. The IBM team has to watch that repo to be informed.

Contributor Author

@csrwng commented Feb 5, 2020

@deads2k @mfojtik @ironcladlou @spadgett @abhinavdahiya @crawford @miabbott

Please let me know if I should include other reviewers


#### Console Changes
The console should not report the control plane as being down if no metrics
datapoints exist for control plane components in this configuration.
Member

Ack, this shouldn't be an issue.

cc @rawagner @andybraren


I believe Control Plane components would be shown as "Not available" if that's acceptable (better than "Down").

(screenshot attached)

Member

@andybraren We probably need to remove that since we never expect to have control plane metrics. It's misleading to say not available.

@rawagner commented Feb 6, 2020

Agreed, created an issue to track this for 4.5 Dashboards https://issues.redhat.com/browse/MGMT-438

- openshift controller manager
- cluster version operator
- control plane operator(s)\*
- oauth server\+
Member

@csrwng Are there any lingering issues where the console backend rejects the OAuth server certificate in this deployment?

Contributor Author

@spadgett no issue at the moment. Thx!

Member

@miabbott commented Feb 7, 2020

cc: @lucab

@lucab commented Feb 7, 2020

From an OS point of view, there are at least two things in this proposal that look uncomfortably hairy to me:

  1. There is no mention of which cloud flavor this is trying to target. At a compute level, "IBM Cloud" is really an umbrella label for three different kinds of infrastructure: Classic, Gen 1, and Gen 2.
    Out of those three, only the last (IMHO) qualifies as a proper environment where we can sanely support provisioning "cattle nodes" with RHCOS and Ignition.
  2. This briefly mentions several topics which seem to require host-level customization (e.g. VPN setup, service network, certificate minting, IBM-specific automation) without ever mentioning how the logic for that is containerized and provided to the nodes at provisioning time (i.e. before kubelet is bootstrapped).

In short, the post-GA story is under-specified, so it's quite hard to judge it. The rest of the document seems to hint at a heavy UPI+RHEL environment, which offers a lot of escape hatches and is more likely to result in a "pet nodes" provisioning flow.

If that's indeed the priority, then it would be better to descope RHCOS and leave it for "future exploration" (with the risk that it may be very hard or impossible to retrofit). If RHCOS workers are instead a requirement, then clarifying the points above may result in a vastly different design and required work.


### Non-Goals

- Make hosted control planes a supported deployment model outside of IBM Public Cloud.
Member

I find this confusing...IBM Public Cloud means a lot of different things, and a sort of baseline obvious one is having OpenShift support the default "self driving" path in their existing IaaS. But I guess we're doing hosted control plane first?

Maybe the enhancement should be called: "IBM Public Cloud Hosted Control Plane" ?

And one thing I would say here is that we should think of this "fairly" - if some other IaaS showed up and was willing to commit significant resources to maintaining a similar thing... clearly vast amounts of the design would likely be shared. But that can come later.

Contributor Author

Absolutely, I think the things we learn from this work are things we can likely reuse in other cases. And perhaps at the proper time, we can make the pattern something standalone that's configured per provider. So definitely, this non-goal is a point-in-time statement.

Contributor Author

I find this confusing...IBM Public Cloud means a lot of different things, and a sort of baseline obvious one is having OpenShift support the default "self driving" path in their existing IaaS. But I guess we're doing hosted control plane first?

At least in the foreseeable future, supporting the self-hosted path is not a priority afaik, but @derekwaynecarr can likely provide more insight into that.

Member

@cgwalters left a comment

Thanks for writing this enhancement BTW!

I will say I have trouble keeping in my head the fundamental impacts this makes to the default OpenShift 4 "self-driving" ("non-hosted"? We need a term...) mode. Particularly given the other fundamental changes going on, like the etcd operator, that affect how we think of the control plane too.

Maybe we can use "hostedCP" as a shorthand term when discussing this? (HCP is obvious, but three-letter acronyms are too common, etc.)

Contributor Author

@csrwng commented Feb 7, 2020

In short, the post-GA story is under-specified so it's quite hard to judge it. The rest of the document seems to hint at a heavy UPI+RHEL environment, which offers a lot of escape hatches and is more likely to result in a "pet nodes" provisioning flow.

@lucab thank you for the feedback. Yes, this proposal is definitely under-specified when it comes to RHCOS. It's more a statement that we don't expect to continue supporting IBM Cloud without RHCOS forever. We should have a separate enhancement proposal/design specifically for RHCOS, given that, as I understand it, the RHCOS team has already done some initial investigations around this.


Enables an OpenShift cluster to be hosted on top of a Kubernetes/OpenShift cluster.

Given a release image, a CLI tool generates manifests that instantiate the control plane
Member

OS upgrades for the worker nodes are owned by the customer? Or does IBM provide tooling for that? Are they using openshift-ansible?

In the "Post-GA" world with RHCOS... do we foresee trying to enable the MCO to manage upgrades for the workers with RHCOS?

Member

Isn't this covered in the Managed Workers section below?

Member

The second half is yes, thanks!

#### Managed Workers
RHCOS adds support for bootstrapping on IBM Public Cloud. The MCO is added to
the components that get installed on the management cluster. This enables upgrading
of RHCOS nodes using the same mechanisms as in self-hosted OpenShift.
Member

So compute nodes still run machine-config daemons and cluster admins can write MachineConfig entries, create MachineSets, and all that good stuff? They just don't have any objects representing or control over the control-plane machines?

@derekwaynecarr
Member

The service is now GA and running 4.4.11, so let's merge this and then make updates to explain usage of Cluster Profile(s).

/approve
/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Aug 4, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: csrwng, derekwaynecarr, sudhaponnaganti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [derekwaynecarr,sudhaponnaganti]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit e26a363 into openshift:master Aug 4, 2020
minimum set of manifests to allow skipping the component should be annotated.
However, in the case of the Machine API and Machine Configuration operators,
the CRDs that represent machines, machinesets and autoscalers should also be
skipped. Monitoring alerts for components that do not get installed in the user
Contributor

This part won't be that easy unless we are provided with a list of them. Also, alerts will not fire if there are no metrics for those components, so I don't think it's a problem.

Contributor Author

This is what we addressed with openshift/cluster-monitoring-operator#705
No other alerts related to control plane components have surfaced.
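The skip mechanism discussed in this thread can be sketched as a simple annotation filter over the release payload's manifests. This is an illustrative sketch only; the annotation name below is hypothetical (as noted later in the PR, the actual mechanism evolved into cluster profiles):

```python
# Sketch of annotation-based manifest filtering, in the spirit of the
# CVO skip mechanism discussed above. The annotation name is made up
# for illustration; the real CVO uses cluster-profile annotations.

SKIP_ANNOTATION = "exclude.release.openshift.io/hosted-control-plane"

def filter_manifests(manifests, hosted=True):
    """Return only the manifests the CVO should apply in this topology."""
    kept = []
    for manifest in manifests:
        annotations = manifest.get("metadata", {}).get("annotations", {})
        if hosted and annotations.get(SKIP_ANNOTATION) == "true":
            continue  # component runs on the management cluster instead
        kept.append(manifest)
    return kept
```

For the Machine API and Machine Config cases called out in the excerpt, the CRD manifests themselves would carry the annotation, so the user cluster never sees machine/machineset/autoscaler types at all.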


Changes are required in different areas of the product in order to make clusters deployed
using this method viable. These include changes to the cluster version operator (CVO), web
console, second level operators (SLOs) deployed by the CVO, and RHCOS.
Contributor

Please don't use SLO as an acronym here. It is very commonly used for "Service Level Objectives".

Contributor Author

Ack, will do a follow-up to remove it.

the new cluster. Minting of kubelet certificates for these worker nodes is handled
by IBM automation.

Components that run on the management cluster include:
Contributor

For cluster-monitoring: as apiserver monitoring is effectively being disabled when running in "ROKS mode", what is monitoring control plane components on the management cluster?

Contributor Author

IBM is running their own monitoring solution on their management/tugboat clusters.

cluster should also be skipped where possible.

#### Console Changes
The console should not report the control plane as being down if no metrics
Contributor

Are all edge cases for disabling monitoring of control plane components on the worker clusters covered in openshift/cluster-monitoring-operator#705 ?

Contributor Author

So far, yes.
