[WIP] Hostlevel DNS/connectivity pre-flight checks by bogdando · Pull Request #5883 · openshift/openshift-ansible

bogdando · 2017-10-25T17:02:53Z

Add a common pre-flight check that tests DNS resolution of cluster nodes
and runs a smoke connectivity test with ICMP ping.
Add openshift_override_resolve_check, openshift_override_icmp_check
(default to False) to allow overriding those checks.
Add an auto-discovery stub for the added connectivity check and
tag to be listed alongside the *.py scripted checks and tags.
Trigger those checks from the common checks pre-install playbook.
Included as a standard ansible task for now.
Also trigger the checks from the common checks adhoc playbook, if
openshift_checks has the '@connectivity' tag.
Included as a standard ansible task for now.
Allow the connectivity checks to be directly invoked as a playbook
as well.

TODO implement the check playbook invocation in the connectivity.py
to replace the include directives.

Signed-off-by: Bogdan Dobrelya <bdobreli@redhat.com>

rh-atomic-bot · 2017-10-25T17:03:33Z

Can one of the admins verify this patch?
I understand the following commands:

bot, add author to whitelist
bot, test pull request
bot, test pull request once

bogdando · 2017-10-25T17:03:48Z

@e-minguez @tomassedovic @sdodson ^^ PTAL

bogdando · 2017-10-25T17:04:25Z

The check and its tag support auto-discovery and are listed alongside the other available checks, although for now the check is still managed externally by an Ansible playbook.

bogdando · 2017-10-25T17:08:31Z

playbooks/common/openshift-checks/connectivity.yml

+      shell: getent ahostsv4 {{ item }}
+      register: lookuphost
+      changed_when: false
+      failed_when: false


this doesn't work yet, wip
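
For comparison, here is a minimal Python sketch of the same kind of IPv4 lookup, assuming the check were moved out of the shell task. `socket.getaddrinfo` with `AF_INET` is roughly what `getent ahostsv4` does under the hood; the helper names `resolve_ipv4` and `unresolvable` are hypothetical and are not part of this patch.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses a name resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the address string.
    return sorted({info[4][0] for info in infos})


def unresolvable(hostnames):
    """Return every name that failed to resolve, not just the first one."""
    return [name for name in hostnames if not resolve_ipv4(name)]
```

Returning the full list of failing names, rather than raising on the first one, lets a caller report all resolution problems in a single run.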

bogdando · 2017-10-25T17:15:07Z

example commands:

ansible-playbook -v openshift-ansible/playbooks/byo/openshift-checks/adhoc.yml -e openshift_disable_check=None -e openshift_checks=@connectivity,@pre-flight,disk_availability

ansible-playbook -v openshift-ansible/playbooks/byo/openshift-checks/pre-install.yml

ansible-playbook -v openshift-ansible/playbooks/byo/openshift-checks/connectivity.yml


michaelgugino · 2017-11-06T22:16:07Z

/ok-to-test

bogdando · 2017-11-08T09:02:17Z

^^
The 3.7 branch is forked, so hopefully we can "unfreeze" with code reviews now? :)

sosiouxme · 2017-11-09T22:19:22Z

@bogdando sorry for the delay, still catching up after vacation.

Seems like it would be a useful check to check host inter-connectivity. Could it be brought in line with how the rest of the checks work? As opposed to the way Ansible runs and quits on the first task failure, we want to run all the checks and report back a summary of possibly-multiple failures. That's why they're all in python modules, not that we don't love Ansible tasks. If the logic were in roles/openshift_health_checker/openshift_checks/connectivity.py then most of the rest of this would be unnecessary - the check would automatically run under the pre-install playbook or from adhoc on request, and we would not need separate variables to disable.
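
The "run everything, then summarize" behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not the actual openshift_health_checker API; `run_checks` and `summarize` are hypothetical names, and each check is assumed to be a callable that raises on failure.

```python
def run_checks(checks):
    """Run every check; return {name: error message} for the ones that failed."""
    failures = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:  # one failing check must not stop the others
            failures[name] = str(exc)
    return failures


def summarize(failures):
    """Build a single end-of-run report covering all failed checks."""
    if not failures:
        return "All checks passed."
    lines = ["{} check(s) failed:".format(len(failures))]
    lines += ["  {}: {}".format(name, msg) for name, msg in sorted(failures.items())]
    return "\n".join(lines)
```

The contrast with a plain Ansible task list is that a failure is recorded and the loop continues, so the user sees every problem at once.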

I'm also trying to think of the different scenarios where this might give a false positive. In a cloud environment (openstack, EC2, ...) the hosts might not be able to reach each other on their external IPs and yet can reach each other fine internally. The hostnames in the inventory file might be aliased in ssh config and not actually intended to resolve at all, while each host can define internal and external names for use with openshift. My feeling would be we should be having the hosts try to reach each other on the internal name if set. It's not necessary to have the check be accurate in every single conceivable scenario (false positives can be overridden), but it should only flag things that are pretty likely to be a problem.

bogdando · 2017-11-10T15:07:02Z

@sosiouxme thanks for inputs!

> That's why they're all in python modules, not that we don't love Ansible tasks

Yeah, I suspected that. That becomes a tough task, but I'll do my best :)

> we would not need separate variables to disable.
> the hosts might not be able to reach each other on their external IPs and yet can reach each other fine
> internally. The hostnames in the inventory file might be aliased in ssh config and not actually intended
> to resolve at all, while each host can define internal and external names for use with openshift. My
> feeling would be we should be having the hosts try to reach each other on the internal name if set. It's
> not necessary to have the check be accurate in every single conceivable scenario (false positives can
> be overridden), but it should only flag things that are pretty likely to be a problem

right, could you point me to a 'soft-fail' mode for a check, if there is one, such as issuing a warning?
Also, I'm quite new to openshift-ansible and not too sure where to look for variables representing internal names for nodes, for all places and cases :) For example, when I deploy my test env, I can see the master-api fail if the public hostname can't be resolved, even though I've changed the hostnames to internal FQDNs and expect only those to be contacted anywhere (I have a related WIP patch). And certificates issued for public FQDNs might be a tricky place to touch as well... So I'd appreciate some guidance on those internal vs external configuration nuances. Thanks!

wrt

> The hostnames in the inventory file might be aliased in ssh config and not actually intended
> to resolve at all

I suppose that would be a major issue, especially for --net=host pods, which rely on host name resolution AFAICT. But I can make this a soft-fail or overridable, no problem.

sosiouxme · 2017-11-10T15:46:13Z

On Fri, Nov 10, 2017 at 10:07 AM, Bogdan Dobrelya wrote:

> > That's why they're all in python modules, not that we don't love Ansible tasks
>
> Yeah, I suspected that. That becomes a tough task, but I'll do my best :)

If building the python module is too intimidating it's something we could conceivably add to our team backlog. BTW is there any particular motivation for this check, e.g. have you seen resolution/connectivity as a common problem in trial installs? What are the common symptoms?

> right, could you point me if there is a 'soft-fail' mode for a check, like an issued warning?

Ansible doesn't really have the concept of a warning. Your task either fails and execution stops, or it creates some output that gets buried under an avalanche of later task output so is useless unless someone is specifically looking for it. That's the current state, at least. While I don't see Ansible changing that, I suspect at some point we'll create a callback plugin where checks (and perhaps other tasks) can register a warning that doesn't halt execution but does appear in the summary at the end of the run.
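
As a rough illustration of that idea (a warning that does not halt execution but still shows up in the final summary), here is a minimal plain-Python sketch. It is not an Ansible callback plugin and does not exist in openshift-ansible; the `CheckReport` class and its methods are hypothetical.

```python
class CheckReport:
    """Collects failures and warnings so both appear in one end-of-run summary."""

    def __init__(self):
        self.failures = []
        self.warnings = []

    def fail(self, check, message):
        self.failures.append((check, message))

    def warn(self, check, message):
        # A warning is recorded but never changes the overall result.
        self.warnings.append((check, message))

    @property
    def ok(self):
        # Only hard failures make the run unsuccessful.
        return not self.failures

    def summary(self):
        lines = []
        for label, items in (("FAILED", self.failures), ("WARNING", self.warnings)):
            lines += ["{} {}: {}".format(label, check, msg) for check, msg in items]
        return "\n".join(lines)
```

With something like this, a soft-fail connectivity check would call `warn` instead of `fail`, and the message would still surface at the end rather than being buried in task output.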

> Also, I'm quite a new to openshift-ansible and not too sure where to look for variables representing internal names for nodes, for all places and cases :) Like, when I deploy my test env, I can see the master-api would fail, if the public hostname can't be resolved, even if I changed the hostnames to internal FQDNs (I have a related WIP patch, openshift/openshift-ansible-contrib#845). So I'd appreciate some guidance on that internal vs external configuration nuances. Thanks!

Well, here's an example host entry for an AWS host:

ec2-34-229-60-235.compute-1.amazonaws.com openshift_ip=172.18.1.43 openshift_public_ip=34.229.60.235 openshift_hostname=ip-172-18-1-43.ec2.internal openshift_public_hostname=ec2-34-229-60-235.compute-1.amazonaws.com openshift_schedulable=True openshift_node_labels="{'region': 'infra', 'zone': 'east'}"

I would recommend consulting openshift_hostname, falling back to openshift_public_hostname, falling back to the inventory hostname.
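
The recommended fallback order can be sketched as a small Python helper. The function name `cluster_hostname` and the plain-dict `hostvars` argument are assumptions for illustration; only the variable names openshift_hostname and openshift_public_hostname come from the inventory example above.

```python
def cluster_hostname(inventory_hostname, hostvars):
    """Pick the name other nodes should use to reach this host.

    Preference order: openshift_hostname, then openshift_public_hostname,
    then the inventory hostname itself.
    """
    for key in ("openshift_hostname", "openshift_public_hostname"):
        value = hostvars.get(key)
        if value:  # skip variables that are unset or empty
            return value
    return inventory_hostname
```

For the AWS entry above this would select ip-172-18-1-43.ec2.internal, so a connectivity check would probe the internal name rather than the external one.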

> I suppose that would be a major issue, especially for --net=host pods, which rely on the hosts names resolution AFAICT. But I can do this a soft-fail or overridable, no problems.

The hosts defined in the inventory don't need to have any relationship to the hostnames that actually get used in the cluster; in the example above that first field could be "foobar" and as long as my ssh config specified how to reach "foobar" (key, user, host), ansible could reach it and the actual names and IPs used in the cluster would come from the openshift_ parameters on the host.

bogdando · 2017-11-10T16:13:41Z

@sosiouxme thank you for an example!

> BTW is there any particular motivation for this check, e.g. have you seen
> resolution/connectivity as a common problem in trial installs? What are the
> common symptoms?

The motivation is to help those poor souls who have to deploy a DIY DNS solution
alongside a cloud-hosted openshift cluster, as we are used to doing in the shift-on-stack team :)
That is, unless we get a better solution, such as natively supported DNSaaS that provides a smooth UX for OpenShift users on OpenStack.

> The hosts defined in the inventory don't need to have any relationship to
> the hostnames that actually get used in the cluster; in the example above
> that first field could be "foobar" and as long as my ssh config specified
> how to reach "foobar" (key, user, host), ansible could reach it and the
> actual names and IPs used in the cluster would come from the openshift_
> parameters on the host.

right, indeed. IIUC, that's only the case for static inventory and static SSH config?
What if we want to only use a dynamic inventory?

sosiouxme · 2017-11-10T20:55:42Z

> right, indeed. IIUC, that's only the case for static inventory and static SSH config? What if we want to only use a dynamic inventory?

The initial hostname is what Ansible uses to reach the host with ssh. The other parameters are all optional for being more specific about what OpenShift is going to use. If the inventory (dynamic or static) doesn't specify them, the playbooks just default to using the same hostname as Ansible does. Just saying, have to look at the parameters if they exist. There might also be some logic in use that already normalizes all this into a fact to avoid the need to look up multiple things... I don't think anything internal refers to external hostnames (if they're different). Master certificates ought to be valid for internal or external names. Generally the only things that need to resolve externally are the API and any domains, which ought to be going through a LB...

openshift-merge-robot · 2017-11-18T08:04:52Z

@bogdando PR needs rebase

bogdando · 2017-11-28T10:50:41Z

So the use case is no longer relevant: we do not want to deploy or verify DIY DNS setups.

openshift-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Oct 25, 2017

bogdando changed the title ~~Hostlevel DNS/connectivity pre-flight checks~~ [WIP] Hostlevel DNS/connectivity pre-flight checks Oct 25, 2017

openshift-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) labels Oct 25, 2017

bogdando changed the title ~~[WIP] Hostlevel DNS/connectivity pre-flight checks~~ Hostlevel DNS/connectivity pre-flight checks Oct 25, 2017

openshift-merge-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Oct 28, 2017

openshift-merge-robot removed the needs-rebase label Oct 30, 2017

michaelgugino self-assigned this Nov 6, 2017

michaelgugino requested a review from sosiouxme November 6, 2017 22:16

openshift-ci-robot removed the needs-ok-to-test label Nov 6, 2017

bogdando changed the title ~~Hostlevel DNS/connectivity pre-flight checks~~ [WIP] Hostlevel DNS/connectivity pre-flight checks Nov 10, 2017

openshift-ci-robot added the do-not-merge/work-in-progress label Nov 10, 2017

bogdando mentioned this pull request Nov 10, 2017

[WIP] Allow different public and private DNS hostnames openshift/openshift-ansible-contrib#845

Closed

openshift-merge-robot added the needs-rebase label Nov 18, 2017

bogdando closed this Nov 28, 2017

bogdando deleted the dns_preflight branch November 28, 2017 11:29

bogdando mentioned this pull request Apr 18, 2018

Set OpenStack VM hostname to the entry in Nova #7971

Merged
