Allow openshift-ansible image to deploy to GCP #6541
sdodson merged 4 commits into openshift:master
Conversation
Force-pushed ff3fd34 to 0262f30
Force-pushed 0262f30 to 734e436
Force-pushed 734e436 to f932017
Force-pushed 432cf00 to 231a877
Force-pushed 0b26570 to 716a206
/assign kwoodson mgugino
@sdodson: GitHub didn't allow me to assign the following users: mgugino. Note that only openshift members can be assigned.
/assign michaelgugino |
Force-pushed 0ab5f0a to e990042
/retest
ansible.cfg (outdated)
gathering = smart
fact_caching = jsonfile
- fact_caching_connection = $HOME/ansible/facts
+ fact_caching_connection = /tmp/ansible/cached_facts
This should probably stay variablized.
Hrm, why would a folder in $HOME be the right place for a cache? Especially a non-dotted name?
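For illustration, one way to reconcile both points (a sketch, not necessarily what this PR merged): keep the /tmp default in ansible.cfg and rely on Ansible's standard environment override instead of a $HOME path, so the location stays configurable without editing the file.

```ini
[defaults]
gathering = smart
fact_caching = jsonfile
# Default cache location; a specific run (e.g. inside the installer image) can
# override it by exporting ANSIBLE_CACHE_PLUGIN_CONNECTION rather than editing this file.
fact_caching_connection = /tmp/ansible/cached_facts
```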
@@ -0,0 +1,402 @@
#!/usr/bin/env python
any reason to rename this file? This looks like gce.py. I'd prefer if we kept the same name.
I want to make it clear what it's for, and the naming is redundant. It has additional patches on top of gce.py. I can rename it back, but I could also add a comment at the top that talks about its provenance.
Ok, this should be fine. Add the comment, along with what the expectation is around keeping this file updated when a newer gce.py comes out.
@@ -0,0 +1 @@
{}
Yes, this is the default inventory for dynamic behavior.
@@ -0,0 +1,109 @@
---
Should we rename aws/gcp files to match? I think the aws equivalent is build_ami.yml.
Also, should we move most of the heavy lifting under the roles/openshift_gce/build_image.yml so that the logic is under the role?
AMI is Amazon specific, GCE always calls it an image. I would argue for build_image for both before build_ami for both.
I think maybe yes, but this was going to be a future refactor point.
Do you mind if I refactor this as a follow up?
@@ -0,0 +1,38 @@
---
Consider moving the logic under the role and having a shim that calls import_role with tasks_from: publish_image.yml.
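A minimal sketch of that shim pattern, assuming the task file lives in the openshift_gcp role (the playbook name and hosts are illustrative):

```yaml
---
# Thin entrypoint playbook; the real logic stays under the role.
- name: Publish the GCP image
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
  - import_role:
      name: openshift_gcp
      tasks_from: publish_image.yml
```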
@@ -0,0 +1,18 @@
---
Is this something that we should be sharing across clouds? Should this move into a logrotate role?
Probably yes, ok to do as a followup? This was originally to ensure we were rotating everything, and on cloud images we can definitely be more opinionated.
Fine as a follow up. Just want to share as much of this useful code as possible.
@@ -0,0 +1,12 @@
---
This file or an equivalent lives here:
openshift-ansible/roles/openshift_cloud_provider
These settings should move to the tasks/gce.yml section.
Moving. Note that multizone does not work on GCP if only one zone is deployed, so I defaulted this to false.
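A hedged sketch of how that default could be expressed once the setting moves into the cloud-provider role (the variable name is illustrative, not from this PR):

```yaml
---
# Illustrative defaults entry for the GCE cloud provider configuration.
# multizone breaks on GCP when only a single zone is deployed, so keep it off
# by default and let deployments that actually span zones opt in.
openshift_gcp_multizone: false
```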
@@ -0,0 +1,11 @@
{% set storagedriver = provision_gce_docker_storage_driver | default('overlay2') %}
This lives in container-runtime/templates/docker-storage-setup.j2.
oc serviceaccounts create-kubeconfig -n openshift-infra node-bootstrapper > /root/bootstrap.kubeconfig
gcloud compute project-info --project '{{ openshift_gcp_project }}' add-metadata --metadata-from-file '{{ openshift_gcp_prefix + openshift_gcp_clusterid | default("default") }}-bootstrap-config=/root/bootstrap.kubeconfig'
rm -f /root/bootstrap.kubeconfig
set -euo pipefail

oc serviceaccounts create-kubeconfig -n openshift-infra node-bootstrapper > /root/bootstrap.kubeconfig
Where does this run? Should this /root/bootstrap.kubeconfig be in /etc/origin/master?
This is the job that keeps the bootstrap metadata in project instances up to date. So it's creating a new bootstrap config, uploading it to gcp metadata, then deleting the temporary local file.
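As a sketch of how such a refresh job might be scheduled from Ansible (the script path and interval are illustrative, not taken from this PR; the script would contain the oc/gcloud commands shown above):

```yaml
- name: Schedule periodic refresh of the bootstrap kubeconfig in project metadata
  cron:
    name: refresh-bootstrap-config
    minute: "0"
    hour: "*/3"
    user: root
    # Illustrative path to a wrapper script with the commands from the diff above.
    job: /usr/local/bin/openshift-gcp-bootstrap-update
```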
Force-pushed 5880872 to 7e4dbb7
/test gcp
/test gcp
/test gcp
The "install to GCP from openshift-ansible PR using image" test job is now passing. Can you guys help get the final round of reviews done so we can merge this and iterate? I want to see some movement on AWS cluster setup. |
michaelgugino left a comment:
I would like to see a few minor changes.
playbooks/deploy_cluster.yml (outdated)
- import_playbook: init/main.yml

- import_playbook: openshift-checks/private/install.yml
- import_playbook: control_plane.yml
Are users going to be encouraged to call control_plane.yml, or is this just for dev/test purposes?
As-is, init/main.yml will be called 3 times; we don't want that. We need to come up with a way to ensure init/main.yml is only called once per run of ansible-playbook.
I suggest we don't do this at all.
I think eventually control_plane is the only part that will be in openshift-ansible, and the other parts will be called by different tools (or at a higher level).
I agree the init part, and this being available to end users, are both issues. Do we have a practical way to run-once an included playbook? If we don't, then I think we should get rid of deploy_cluster and have control_plane and components (or whatever we call it) and do them in sequence. Installing a control plane has nothing to do with the components on top of it.
I would prefer to have the split of playbooks entirely as you suggest. However, I don't think we want to do that this late in the cycle.
As a way forward, I think we can simply make the two parts "private" in a separate folder and include them here, and in AWS and GCP, with comments about pre and post conditions.
Updated with the less contentious path, including a thorough description of what the two playbooks are for.
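To make the resulting shape concrete, a rough sketch of the wrapper being described (the paths mirror ones discussed later in this review and are illustrative):

```yaml
---
# playbooks/deploy_cluster.yml as a thin wrapper over the two "private" halves.
- import_playbook: init/main.yml

# Precondition: hosts provisioned and prerequisites satisfied; brings up the API.
- import_playbook: common/private/control_plane.yml

# Precondition of the next step: a running control plane; this layers the
# opinionated components (router, registry, ...) on top of it.
- import_playbook: common/private/components.yml
```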
playbooks/control_plane.yml (outdated)
- import_playbook: openshift-master/private/additional_config.yml

- import_playbook: openshift-node/private/config.yml
We could just add the install_gcp step here; it only runs against masters. Just make it controlled by inventory; no need to split these steps up. In the gcp playbook above, the only real difference is that there's an additional play called in the middle of deploy_cluster.yml.
My concern is that long term we are going to have a very hard split here. If we're not ready to do that it's fine, but I don't want to see coupling between "have APIs running" and "install opinionated components" for much longer.
I don't see any reason the control plane needs to know about GCP at this point - that's a GCP specific detail, or an AWS specific detail, not something the generic playbook should be aware of. We should be using composition, not inheritance, for many of these areas (where we are leveraging openshift-ansible as lower level playbooks)
> My concern is that long term we are going to have a very hard split here. If we're not ready to do that it's fine, but I don't want to see coupling between "have APIs running" and "install opinionated components" for much longer.
We have done a lot of decoupling in this release. Many people want a 'run this one command line' operation to install everything.
I suggest putting the install_gcp play here for simplicity's sake. It's not a lot of tasks to skip if it's not normally used, and it saves you from having to include a bunch of plays in your playbook.
deploy_cluster.yml is a special, top-level play. It utilizes re-usable plays. I don't see value in splitting it at this time, as this gcp play is the only one that needs that to happen and we can seamlessly add the install_gcp play into deploy_cluster.
AWS needs it split as well. Basically, deploy_cluster.yml is really "BYO cluster". BYO, AWS, GCP, OpenStack, and Azure would all be peers in importance. I'm fine with "BYO cluster" being its own playbook, but we should not be duplicating the steps in it across 5 different providers. So it's not really the right factoring even for 2/3 the targets we deploy to at this very moment.
IMO, the ideal workflow is:
- Deploy hosts, either manually or via a play we provide
- Prerequisites and user-provided setup tasks (cloud-init script, what have you)
- deploy_cluster.yml
Each individual cloud provider is going to have its own set of unique steps that need to be performed; using image-based deployments is a whole other topic altogether. I think deploy_cluster should be provider agnostic; the more we can have all providers converge on the same path, the better.
For cloud providers (including OpenStack), the flow is going to end up looking more like:
- Use our golden image, or customize your own, where nodes are already preconfigured
- Deploy one or more core scale groups, using those images (that already have nodes configured)
- Install containerized components onto the masters (not RPM based, but image based) such that the control plane comes up
- Accept / bootstrap the master nodes and any other initial capacity provided
- Install core components like the autoscaler and autosigner
- Install additional components like the router, registry, etc
However, we are not going to be driving these from unified playbooks for much longer - instead, the cluster operator is going to be invoking provisioning in what is probably going to be three stages
- provision cloud topology (load balancers, etc) and scale groups
- provision control plane onto a set of nodes
- install individual components
Each of those is going to be an entrypoint playbook that needs to share code with deploy_cluster - deploy_cluster is not really going to be called by anyone except people on bare metal. I agree on bare metal the workflow you described is probably ok, except that's only 25% of the install footprint and not a priority. We should not be optimizing for "deploy_cluster.yml" as the single entrypoint any longer. We need to be treating the three steps as discrete, and make sure that we are reusing those pieces. Deploy_cluster can be cloud provider agnostic, but only because it won't get called :)
I definitely prefer not having a single playbook try to do all things. I think it makes sense to have them be completely separate concerns.
> Each of those is going to be an entrypoint playbook that needs to share code with deploy_cluster - deploy_cluster is not really going to be called by anyone except people on bare metal. I agree on bare metal the workflow you described is probably ok, except that's only 25% of the install footprint and not a priority.
While I think many users will not be using actual bare metal hosts, many of them will be provisioning hosts manually on a variety of clouds and virt infrastructure. I imagine some users will be quite happy with the instance/infra provisioning plays, and others are not.
I think there is a lot of value in having a simple deploy process of
- Provision some elbs
- Provision some hosts
- Provision openshift
We had discussed previously making all the nodes 'bootstrap-able'. I think we should try to move that direction so we have as common an install experience as possible across all deployment types.
    name: openshift_gcp
    tasks_from: setup_scale_group_facts.yml

- name: configure the control plane
just make this ../deploy_cluster.yml and add install_gcp.yml to deploy_cluster.yml in the appropriate spot. You can use a boolean set via inventory or via this playbook to ensure that install_gcp.yml is called only when we want.
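A minimal sketch of that boolean-gated approach (the flag name and playbook path are illustrative, not from this PR; the `when` pattern matches the one already used for openshift-management):

```yaml
# Inside deploy_cluster.yml, gate the GCP-specific play on an inventory variable.
- import_playbook: gcp/install_gcp.yml
  when: run_install_gcp | default(false) | bool
```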
We're probably going to want to add prerequisites.yml to this playbook. I have been working on some patches that integrate prerequisites.yml into these types of special plays.
See this node-scaleup PR for an example: #6784
The question is - why does prerequisites have to be special, vs being part of lower level playbooks? Is it only the dependency on openshift_facts?
It's special because it incorporates some things (like installing docker and configuring docker storage) that were previously not included in openshift-ansible and had to be completed by the user before running openshift-ansible.
container_runtime (previously the docker role) needs to be configured very early in the deployment process. The reasons for this are:
- configuring registry authentication credentials
- openshift_version depends on correctly setup container runtime to query latest/available tags for containerized installs
- health checks want to ensure containers are available and registries are configured before running the installer and failing halfway through.
We could put this role in the init code path, but certain users don't want us to touch docker after it's been installed, because as you may know, we have given users license to modify things outside of openshift-ansible, and re-running the container_runtime role may have undesired effects for them. Additionally, it runs against the 'nodes' group, so having to run those tasks each time a play is run adds a lot of execution time to large clusters.
We could probably add it to deploy_cluster.yml now that we have refactored a lot of other components. But I think we're a little late in the cycle for that for 3.9 and we could potentially do it in 3.10.
If you look at 3.7 (or even worse, 3.6), you will see that it was a very tangled web of dependencies surrounding the docker role. I agree that prerequisites.yml is a compromise, might not be the best solution, but it works and is an improvement over past releases.
> openshift_version depends on correctly setup container runtime to query latest/available tags for containerized installs
This is bad. This shouldn't be a dependency then.
> health checks want to ensure containers are available and registries are configured before running the installer and failing halfway through.
This seems buggy, like the above. That's not something we should be doing on the nodes.
Note that I'm not arguing that these aren't valuable and important checks. My issue is that they are fundamentally things that provisioning has to do, so for the cloud providers we will always need to call them. Doing that from inside a system is going to be problematic because we will always have external dependencies. We either need to avoid the dependency on the container runtime to do the check, or figure out how to make those checks special and move them somewhere else.
> This is bad. This shouldn't be a dependency then.
If you compare what we have now versus 3.7 and 3.6 releases you'll see a vast reduction in the web of dependencies in this area. I agree, we can definitely do more in this area, but I think we're in a pretty good place at the moment compared to previous releases in this area. Personally, I would like to force users to specify exactly what they want and we take the training wheels off in regards to openshift_version, but that's not something we implemented this cycle. I think we have the ground work now to move that direction if we choose to.
> We either need to avoid the dependency on the container runtime to do the check, or figure out how to make those checks special and move them somewhere else.
container_runtime needs to be setup at some point regardless. We might as well do it right up front and check to make sure things are working before we start deploying to 100's of nodes. Whether we're using atomic, skopeo, or docker to interact with registries, we still need to ensure those items are properly configured on the hosts. I don't see a reason to not configure container_runtime up front when it's a prerequisite to pretty much the whole deployment.
> We might as well do it right up front and check to make sure things are working before we start deploying to 100's of nodes
This is the old way :) In the cloud provider model, we will need to check these things before we have a single node provisioned, because we are going to be using golden images. Anything that we need in order to check this needs to be something that can be installed into the docker image that the playbook runs out of, and it's better for it not to be tied in any way to a particular container runtime choice (whether skopeo or docker) because the component won't know that.
I think the checks are valuable, but we can't rely on installing a container runtime on a given provisioned node in order to run the checks. We probably shouldn't be installing skopeo either, given the amount of dependencies it pulls in, or its relative inflexibility. The checks won't make any sense in cloud providers really, because we won't even invoke ansible on the nodes during the install (i.e. both AWS and GCP really operate only on the nodes selected as masters).
At best, this seems like a check that has to be done after container runtime is installed in the node job, not up front in the playbooks.
> This is the old way :) In the cloud provider model, we will need to check these things before we have a single node provisioned, because we are going to be using golden images.
Building the golden images is currently no different on aws vs deploying on bare metal, as far as container_runtime is concerned. If container runtime is not setup, then the node isn't provisioned; if there's no container runtime, it can't run containers.
We can't really check 'before' a single node is provisioned; the value is in checking that each node is configured correctly (ie, networking, proxy), not so much that you put in valid credentials and registry url.
We have a lot of checks built in for user friendliness. Probably you or I or anyone that is committing PRs here can identify most problems somewhat quickly; the average user just sees that the nodes didn't deploy and doesn't know why. Having the checks in place that we do gives them much better feedback if they are not advanced ansible users (I would argue, many of our users are not 'advanced' ansible users, they just want the installer to work).
@@ -0,0 +1 @@
../../../filter_plugins
We are no longer doing filter_plugins this way. All filter plugins should be moved to lib_utils and that role should be brought into scope via a meta-depends in a role or at the play level in the 'roles:' section.
I'm not using any special filter plugin, I'll run this without.
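For reference, a hedged sketch of the meta-depends pattern being described (the consuming role name and file location are illustrative):

```yaml
---
# roles/openshift_gcp/meta/main.yml (illustrative location): pulling lib_utils
# into scope makes its filter plugins available without a filter_plugins symlink.
dependencies:
- role: lib_utils
```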
Force-pushed 7e4dbb7 to c49038b
/test launch-gcp
/retest
Also add libcloud (required for dynamic GCE lookup) and which (relied on by gcloud).
This moves all core functionality into the openshift-ansible repo, adds the necessary equivalent entrypoint to the openshift-ansible installer image, and ensures the dynamic inventory mechanisms in openshift-ansible continue to work. Notable changes from origin-gce:
* playbook extensions changed to .yml
* dynamic inventory subdirectory created to prevent accidental use
* use the custom entrypoint entrypoint-gcp for this image
* move tasks into openshift_gcp role
Force-pushed a8c2910 to 9f01113
playbooks/deploy_cluster.yml (outdated)
- import_playbook: init/main.yml

- import_playbook: openshift-checks/private/install.yml
- import_playbook: common/openshift-cluster/private/control_plane.yml
We're trying to remove things from 'openshift-cluster'
We can't call this control_plane if it installs nodes, because our upgrade_control_plane only upgrades masters and etcd hosts. Calling everything below pods the control plane is confusing.
I think 'base_cluster' is better than 'control_plane'. I think the file path should be common/private/base_cluster.yml.
I'm fine splitting out the nodes. GCP and AWS will be using golden images so in general they wouldn't call this step either (I call it today in case people want to manage masters directly in an invocation).
Actually, when we change how the control plane is installed (moving to static pods and self-hosted) we'll have to configure masters as nodes first. So we shouldn't get attached to the current node order anyway.
For posterity, the arc we're on for control plane is going to be only supporting the following steps:
- runc installed
- kubelet installed inside system container
- host level config that includes static pods
- kubelet launches static pods, which include the control plane
- static pods form working control plane
- master nodes are bootstrapped into the cluster
The "bootstrap node" configuration will then be required as part of control plane setup (or required to have been completed). So we should expect that some node config goes before the control plane, but that all nodes are configured in bootstrap mode (the current node config is going to be deprecated and removed).
I'm going to move the node setup into the callers for now.
playbooks/deploy_cluster.yml (outdated)
- import_playbook: openshift-management/private/config.yml
  when: openshift_management_install_management | default(false) | bool
- import_playbook: common/openshift-cluster/private/components.yml
I prefer this to be in common/private as well.
Force-pushed 9f01113 to 1d262b3
/test gcp
Force-pushed 1d262b3 to f2b14d2
/test gcp
@michaelgugino last comments addressed
kwoodson left a comment:
Sorry for the late 👍 . Was out of town last week. Looks like a lot of good code coming in. Some follow up needed but thanks for the PR @smarterclayton.
/lgtm
/test all [submit-queue is verifying that this PR is safe to merge]
@smarterclayton: The following tests failed.
well known conformance flake
This moves all core GCP functionality into the openshift-ansible repo, adds the necessary equivalent entrypoint to the openshift-ansible installer image, and ensures the dynamic inventory mechanisms in openshift-ansible continue to work. After this change https://github.com/openshift/origin-gce can be removed.
TODO: