
Add playbook to run adhoc health checks or list existing checks #4570

Merged
merged 9 commits on Aug 28, 2017

Conversation

rhcarvalho
Contributor

This is useful while developing/testing health checks.

One step towards making checks more discoverable and easier to run.

args = self._task.args
requested_checks = normalize(args.get('checks', []))

if not requested_checks:
Contributor

maybe I'm missing something, but wouldn't this cause other playbooks like health to fail here if they don't have a checks argument passed to them as well?

Contributor Author

Yes, they would fail if either checks is not passed or is passed and is empty.

Did I overlook some genuine case when we want to call the action plugin without arguments?

Note that it fails because that's the easiest way to cause text to go to stdout/stderr (without messing with the verbosity level).
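The failure-as-output trick described here can be sketched roughly as follows. This is a simplified illustration, not the shipped plugin code; function and variable names are assumed.

```python
# Simplified illustration of the behavior discussed above -- not the
# shipped plugin code. Failing the task with a message is the easiest
# way to get text onto the user's terminal at default verbosity.
def run_action(requested_checks, known_checks):
    if not requested_checks:
        return {
            'failed': True,
            'msg': 'No checks requested. Known checks: '
                   + ', '.join(sorted(known_checks)),
        }
    return {'failed': False, 'checks': sorted(requested_checks)}
```

When invoked without a `checks` argument, the result dict marks the task failed purely so that the listing in `msg` is shown to the user.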

Contributor

Note that it fails because that's the easiest way to cause text to go to stdout/stderr (without messing with the verbosity level)

That makes sense.

Did I overlook some genuine case when we want to call the action plugin without arguments?

No, I had misunderstood this block, but looking at it again with context, it makes sense.

pre_tasks:
- name: Print list of known health checks
action: openshift_health_check
when: checks is undefined or not checks
Contributor Author

To print a list of known checks:

$ ansible-playbook -i <inventory file> playbooks/byo/openshift-checks/adhoc.yml

But that's not documented (yet). The problem with the current output is that it repeats for as many hosts as you have in the OSEv3 group.

I would like to de-duplicate error messages by grouping together hosts that had the same error.

Member

De-duplication will probably be important for clusters with 10s of hosts. Might need to put some thought into how "same" the messages need to be to be folded together.

@sosiouxme
Member

You'll want to adjust this to eval groups like #4495 once that merges.

@rhcarvalho
Contributor Author

@sosiouxme thanks for pointing it out!

@openshift-bot openshift-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2017
Member

@sosiouxme sosiouxme left a comment

Couple things, otherwise looks good.

roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: adhoc
Member

nit: if running the adhoc playbook they expect to run checks, so adhoc should be added to the list that gets the less-info summary.

Contributor Author

Done



try:
known_checks = self.load_known_checks()
args = self._task.args
requested_checks = normalize(args.get('checks', []))
Member

While we're at it... shouldn't this be named r_openshift_health_checker_names or some such? The adhoc playbook could still pass it in from the value in checks or whatever we want the user to specify.

@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch 2 times, most recently from d2ccce6 to 44813ad Compare July 18, 2017 12:30
@openshift-bot openshift-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 18, 2017
@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch 3 times, most recently from bfc4186 to aebc18e Compare July 21, 2017 09:44
@@ -27,6 +27,9 @@ callback plugin summarizes execution errors at the end of a playbook run.
3. Certificate expiry playbooks ([certificate_expiry](certificate_expiry)) -
check that certificates in use are valid and not expiring soon.

4. Adhoc playbook ([adhoc.yml](adhoc.yml)) - useful for running adhoc checks.
See the next section for an usage example.
Member

s/an/a/

("y" sound at beginning of word apparently doesn't feel like a vowel... English is weird)

Contributor Author

Fixed, thanks!

@@ -0,0 +1,16 @@
---
- name: OpenShift health checks
hosts: OSEv3
Member

This playbook needs to follow the pattern of the other check playbooks, invoking a parallel playbooks/common/openshift-checks/adhoc.yml playbook and evaluating groups along the way. openshift_version chokes otherwise. Also at some point we'll want to switch from using BYO group names to the common group names in our is_active methods.

Contributor Author

I see. I have two questions about how that is working now, maybe you could help:

  1. Why do we include: ../openshift-cluster/initialize_groups.yml in byo/openshift-checks and then include: ../openshift-cluster/evaluate_groups.yml in common/openshift-checks?

  2. Why do we have tags: always in include: ../openshift-cluster/evaluate_groups.yml, but not in the evaluate_groups include?

Contributor Author

I'm noticing those includes make the playbook run considerably slower, though I didn't measure time explicitly 😢

For the case when there are no checks to be run (and we print a list of available checks), the difference is 6s without the includes vs. 8s with them.

Member

How are you getting it to run without those includes at all? I get:

     Play:     OpenShift health checks
     Task:     openshift_version : Set rpm version to configure if openshift_pkg_version specified
     Message:  The conditional check 'inventory_hostname in groups['oo_masters_to_config'] or inventory_hostname in groups['oo_nodes_to_config']' failed. The error was: error while evaluating conditional (inventory_hostname in groups['oo_masters_to_config'] or inventory_hostname in groups['oo_nodes_to_config']): Unable to look up a name or access an attribute in template string ({% if inventory_hostname in groups['oo_masters_to_config'] or inventory_hostname in groups['oo_nodes_to_config'] %} True {% else %} False {% endif %}).
               Make sure your variable name does not contain invalid characters like '-': argument of type 'StrictUndefined' is not iterable
               
               The error appears to have been in '/home/lmeyer/go/src/github.com/openshift/openshift-ansible/roles/openshift_version/tasks/set_version_rpm.yml': line 2, column 3, but may
               be elsewhere in the file depending on the exact syntax problem.

Contributor Author

Yup, I get the same. It worked against an older version of master.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a patch to logging.py to return {}, which is not even a real check:

$ time ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e checks=logging -vvv
...
PLAY RECAP ***************************************************************************************************************************************
localhost                  : ok=8    changed=0    unreachable=0    failed=0   
master1                    : ok=36   changed=2    unreachable=0    failed=0   
master2                    : ok=36   changed=2    unreachable=0    failed=0   
node1                      : ok=36   changed=2    unreachable=0    failed=0   
node2                      : ok=36   changed=2    unreachable=0    failed=0   


real    1m42.949s
user    0m36.358s
sys     0m30.438s

So that's more than a minute and a half to run the simplest of the checks 😢

But anyway, do you understand why the includes are in different files and why the tag always is only in one of them?

Member

Why do we include: ../openshift-cluster/initialize_groups.yml in
byo/openshift-checks and then include: ../openshift-cluster/evaluate_
groups.yml in common/openshift-checks?

The first is for translating BYO group names into g_ names. It's a pretty
simple translation in BYO, more complex for other starting points. Then
these names are fed into the second which further defines oo_first_master
and such that are to actually be used in the roles.

Why do we have tags: always in include: ../openshift-cluster/evaluate_
groups.yml, but not in the evaluate_groups include?

This is probably cruft. Basically nothing actually works if you specify
tags (maybe some very narrow dev use cases).

Member

All the other roles that run before openshift_health_checker are a problem :( and not just because it's slow. I wish we could separate them out into roles that gather info and set variables versus those which perform an action. Unfortunately there are a few actions (installs) necessary even to gather info. But it could be that those aren't necessary for our use case, and we could further divide the info-gatherers into fast and slow and just stick to info that can be gathered quickly for our checks. Maybe. I don't really look forward to that refactoring but running the checks is pretty heavy like this.

Contributor Author

FWIW, the original comment has been addressed and it is now in two separate playbooks, following the same pattern as others.

@@ -0,0 +1 @@
../../../roles
Member

I think it is not necessary when invoking the playbooks from common.

Contributor Author

Yes, common already has a symlink. I think back when I started this I was exploring how to have fewer layers to get something started, but we should definitely follow a single pattern.

Contributor Author

Symlink removed.

- name: Run health checks
action: openshift_health_check
args:
checks: '{{ checks | default([]) }}'
Member

Debating whether the checks inventory variable should be qualified. Inputs for most other things are expected to be openshift_<whatever> usually, so maybe openshift_checks? It seems kind of pedantic since this is specific to the playbook. But it's consistent.

Contributor Author

Yes, I agree. I think we went with checks for making it easier to type.

Still, I can't say I'm happy with such a long command:

$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e checks=fluentd

In this case oc diagnostics (which could have a short alias like oc diag) is much easier on the eyes and takes fewer keystrokes. Surely keystrokes matter less for scripting.

If you think openshift_checks is more appropriate, we can plan a migration to that.

Member

What we pass in as an arg to the action plugin is irrelevant, as that's not an inventory variable. But we don't have a checks inventory variable before this PR so there's no migration necessary to change that.

You think that command line is long until you type the commands for running via the container... :)

I think the user should have openshift_checks for consistency with openshift_disable_checks...

args:
  checks: '{{ openshift_checks | default([]) }}'

Contributor Author

Made it openshift_checks, thanks for pointing it out.

There is still room for debate when we compare with disable, because the latter is in singular form: openshift_disable_check...

@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch 4 times, most recently from 10aa64c to 3abbc6f Compare July 27, 2017 13:01
@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch 3 times, most recently from b2a7fcf to f8a64cb Compare August 1, 2017 15:08
@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch 2 times, most recently from 0aa5ee6 to 90f2782 Compare August 3, 2017 15:18
@rhcarvalho
Contributor Author

@sosiouxme this is ready for another pass of review.

Member

@sosiouxme sosiouxme left a comment

Long PR... not sure I got it all. Anyway, some questions, some suggestions.

pre_tasks:
- name: List known health checks
action: openshift_health_check
when: openshift_checks is undefined or not openshift_checks
Member

👍 nice that this runs almost instantly

Contributor Author

This is good for the two cases in the condition, but doesn't cover other cases such as when there is a typo in a check name.

I was thinking we could have a "preflight check" for the openshift_health_check plugin itself, implemented in the plugin perhaps with a flag like dry-run: true. But that doesn't have to be part of this PR.

Member

pre_tasks always run before roles. Did you mean to use post_tasks here? The order of execution for play stages is:

- hosts: foo
  vars:
  pre_tasks:
  roles:
  tasks:
  post_tasks:

Putting these items in a different order in the yaml file does not change execution order. Same with vars:, they are defined at the beginning of the play.

result['failed'] = True
result['msg'] = list_known_checks(known_checks)
return result

resolved_checks = resolve_checks(requested_checks, known_checks.values())
Member

Below if they specify an unknown name/tag they get "Make sure there is no typo in the playbook and no files are missing." Surely with the introduction of this playbook it is much more likely that the user actually specified the wrong name/tag.

Contributor Author

Yes, the observation is correct. Do you suggest some change to the message?

Member

Well in that case it's probably not a problem with the code we shipped, but rather the user's input... They can do something about it without looking through our Ansible code. So something like "Make sure you have specified only valid check tags or names. Valid tags and names are as follows..."

Contributor Author

Right. We could list the existing checks.

I was thinking we may also detect that etcd is a tag name and should be spelled @etcd -- I made that mistake ;-)

Contributor Author

Update: now the error message when the requested check is unknown includes the list of known checks and tags.
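A rough sketch of what such a resolution step could look like (names and data shapes assumed here; this is not the shipped `resolve_checks`): `@tag` entries expand to every check carrying that tag, and unknown entries fail with the known names and tags listed, so a typo like `etcd` instead of `@etcd` is easy to spot.

```python
def resolve_checks(requested, known_checks):
    """known_checks maps check name -> list of tags on that check."""
    names = set(known_checks)
    tags = {tag for check_tags in known_checks.values() for tag in check_tags}

    # Collect entries that match neither a check name nor an @tag.
    unknown = {entry for entry in requested
               if not entry.startswith('@') and entry not in names}
    unknown |= {entry for entry in requested
                if entry.startswith('@') and entry[1:] not in tags}
    if unknown:
        raise ValueError(
            'Unknown checks: {}. Known check names: {}. Known tags: {}.'
            ' (Tags must be prefixed with @, e.g. @etcd.)'.format(
                ', '.join(sorted(unknown)),
                ', '.join(sorted(names)),
                ', '.join(sorted(tags))))

    # Expand @tags into the names of all checks carrying that tag.
    resolved = {entry for entry in requested if not entry.startswith('@')}
    for entry in requested:
        if entry.startswith('@'):
            resolved |= {name for name, check_tags in known_checks.items()
                         if entry[1:] in check_tags}
    return resolved
```

For example, requesting `['@preflight']` resolves to every check tagged `preflight`, while requesting `['etcd']` when `etcd` only exists as a tag raises with the full listing.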


try:
known_checks = self.load_known_checks(tmp, task_vars)
args = self._task.args
requested_checks = normalize(args.get('checks', []))

if not requested_checks:
result['failed'] = True
Member

I'm not thrilled about having it show up as a failure but I can't think of anything better for now. Possibly the callback plugin could display differently so it doesn't look like a failure.

Contributor Author

I'm also not happy with it being a failure, but that's the most expedient way to get it working without adding yet more features.

An alternative would be to have either a separate action plugin or a different operation mode of the same plugin specific to list the checks, the latter being very similar to what we have now.

The difficulty of printing arbitrary text remains. Ansible will print its operational information, and showing anything else would require either a flag or a callback plugin.

try:
return check.run()
except Exception as exc:
return dict(failed=True, msg=str(exc))
Member

In the case of an OpenShiftCheckException we probably only want to show the text. But for any other exception, it's probably a bug in the check and I would want to get a stack trace. If not in the user-facing message, maybe as another field in the check return dict so it shows up in -v? We could also give a little preamble explaining that it's an unexpected error while running the check before dumping out "KeyError: blah blah blah" that's not meaningful to the user.

Contributor Author

I like your idea, thank you. Will make it give out more context on arbitrary errors.

For background on how it became this way: I had two except clauses, one was the original except OpenShiftCheckException and the other was the general except Exception; I dropped the first after making both handle the exception in the same way...

Contributor Author

Update: I made it include a traceback on arbitrary exceptions using the exception key as described in http://docs.ansible.com/ansible/latest/common_return_values.html#exception
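The shape of that change could look something like this (a self-contained sketch with assumed names, following the Ansible `exception` return-value convention linked above; the real code lives in the check runner):

```python
import traceback

class OpenShiftCheckException(Exception):
    """Raised by checks for expected, user-facing failures."""

def run_check(check):
    try:
        return check.run()
    except OpenShiftCheckException as exc:
        # Expected failure: show only the message text.
        return dict(failed=True, msg=str(exc))
    except Exception as exc:
        # Likely a bug in the check: keep a short message for the user
        # and stash the full stack trace under 'exception', which
        # Ansible displays when run with -v.
        return dict(
            failed=True,
            msg='Unexpected error while running the check: {}'.format(exc),
            exception=traceback.format_exc(),
        )
```

This keeps "KeyError: blah" out of the default output while still making the traceback available for bug reports.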

# TODO: we could include a description of each check by taking it from a
# check class attribute (e.g., __doc__) when building the message below.
msg = (
'This playbook is meant to run health checks, but no checks were '
Member

could these have \n so it's not so widescreen...

Contributor Author

Hmm, I'm relying on the terminal doing the line wrapping...
With manual line breaks it looks weird when the text is re-flowed to fit the screen width.

Manually inserting new lines in strings is rather brittle, sort of like font tags in HTML?

Member

I know what you mean, it looks bad if your terminal is thinner than the arbitrary point where we've placed newlines so it wraps anyway. I guess the real issue with that is on the check outputs since they're shown as indented blocks - tried to keep those short enough so that most people won't see them wrap. If we wanted to do that right we'd need to know the width of the terminal and have a real formatter on the output...

Just in general though I think for prose it makes sense to keep lines pretty short, like in a book. I may have a terminal that's 200 columns wide to look at logs and code and JSON, but it doesn't mean I want to read text at that width. I suppose ideally we'd have a formatter that broke on whitespace at the shorter of the terminal width or a comfortable-to-read width.

Well, I don't really want to start a typographical debate, just saying personally I would wrap it for legibility and consistency... it stuck out at me.

Contributor Author

I agree with you, though the difficulty is missing the tools to do it right as you say. We write using some display object; it might mean writing to stdout, but it could be anything, even a file that is not a terminal.

failures = [failure_to_dict(failure) for failure in failures]
failures = deduplicate_failures(failures)

summary = [u'', u'', u'Failure summary:', u'']
Member

is there a reason for prefixing everything with u?

Contributor Author

Yes, though my memory is a bit vague on the details; looking it up.

Ansible's stringc function operates on unicode strings (Python 2). If you give it a str with non-ASCII content, it triggers an implicit decode using the default encoding (normally ascii unless customized at the system level) and can raise decoding exceptions. We may ourselves have nothing outside the ASCII range, but keep in mind that on systems with different locales a system message may bubble up and break our code.

In Python 3, strings are always unicode and the u prefix is simply ignored (it was re-added to the language in Python 3.3, via PEP 414, to ease compatibility with Python 2 code).

There might be similar considerations to using the display object, though I did not revisit it now.

Generally speaking, it is good practice to work internally with unicode strings, decoding from a known charset when reading and encoding when writing.
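A quick runnable illustration of the points above (under Python 3, where the u prefix is a no-op):

```python
# In Python 3 the u prefix changes nothing: strings are already unicode.
assert u'Failure summary:' == 'Failure summary:'

# The practice described above: decode bytes from a known charset at the
# input boundary, work with text internally, encode again when writing out.
raw = b'\xc3\xa9chec'            # e.g. a localized system message, as bytes
text = raw.decode('utf-8')       # decode once, at the boundary
assert text == u'\u00e9chec'     # internal representation is text ("échec")
assert text.encode('utf-8') == raw
```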

name = cls.name
if name in known_checks:
other_cls = known_checks[name].__class__
msg = "non-unique check name '{}' in: '{}' and '{}'".format(
Member

So this looks better than before but... with a personal bias toward brevity... how about:

            raise OpenShiftCheckException(
              "duplicate check name '{}' in: '{}' and '{}'"
              "".format(name, full_class_name(cls), full_class_name(other_cls))
            )

("non-unique" feels like a double negative and takes a second to process. formatters line up somewhat with contents...)

Contributor Author

Done.
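For context, the guard under discussion can be exercised as a self-contained sketch (class and helper names mirror the snippet but are otherwise assumed):

```python
class OpenShiftCheckException(Exception):
    pass

def full_class_name(cls):
    return '{}.{}'.format(cls.__module__, cls.__name__)

def load_known_checks(check_classes):
    """Instantiate checks, refusing to let one name shadow another."""
    known_checks = {}
    for cls in check_classes:
        name = cls.name
        if name in known_checks:
            other_cls = known_checks[name].__class__
            raise OpenShiftCheckException(
                "duplicate check name '{}' in: '{}' and '{}'"
                "".format(name, full_class_name(cls), full_class_name(other_cls))
            )
        known_checks[name] = cls()
    return known_checks
```

Two check classes sharing a `name` attribute raise instead of silently overwriting each other in the registry.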

hosts are groupped together in a single entry. The relative order of
failures is preserved.
"""
groups = defaultdict(list)
Member

What you have below is clever but hard for me at least to follow, and it seems like you could do it in one pass through the loop without altering the original failure results:

groups = {}
result = []
for failure in failures:
    # group the failures that are identical except for host
    group_key = tuple(sorted((key, value) for key, value in failure.items() if key != 'host'))
    if group_key not in groups:
        copy = failure.copy()
        copy['host'] = []
        groups[group_key] = copy
        result.append(copy)
    groups[group_key]['host'].append(failure['host'])

One other thought... why isn't the group key just the msg? What else do we expect to care about?

Contributor Author

could do it in one pass through the loop without altering the original failure results

Could be done in one pass. Originally, I had intended to have the two stages in separate functions and I wanted to give a name to each part instead of doing it all together. Naming was hard, and I could only think of "deduplicate_failures", though the multi-stage algorithm stayed.

The first part is concerned with grouping, the second part is concerned with reducing items from a group to a single item, keeping the original order.

Note that failure.copy() does a shallow copy. While the property of not mutating the input is appreciated, it seemed unnecessary here because we're mutating and overwriting failures anyway. Obviously, we could revisit the whole thing.

why isn't the group key just the msg? What else do we expect to care about?

If the playbook / task / etc is different, we shall not mix the failures. Similarly if the message of two failures is "One or more checks failed" -- we can't group hosts together unless the check failures are the same, or we'd be throwing away information about check results.
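The two-stage shape described here, as a runnable sketch (field names are assumed; the real helper is `deduplicate_failures` in the `zz_failure_summary` callback plugin):

```python
def deduplicate_failures(failures):
    # Stage 1: bucket failures that are identical except for 'host'.
    # Plain dicts preserve insertion order on Python 3.7+.
    buckets = {}
    for failure in failures:
        key = tuple(sorted(
            (k, v) for k, v in failure.items() if k != 'host'))
        buckets.setdefault(key, []).append(failure)
    # Stage 2: reduce each bucket to a single entry whose 'host' is the
    # list of affected hosts, preserving the original relative order.
    return [dict(bucket[0], host=[f['host'] for f in bucket])
            for bucket in buckets.values()]
```

Failures that differ in any field other than host (play, task, msg, per-check results) land in separate buckets, so no information about distinct check outcomes is thrown away.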

"""Group together similar failures from different hosts.

Returns a new list of failures such that similar failures from different
hosts are groupped together in a single entry. The relative order of
Member

"grouped"

also "similar" is a bit vague when it really means "identical except for host" (for now) so I'd s/similar/identical/

Contributor Author

Thanks, will be updated in the next push.

@@ -0,0 +1,63 @@
from zz_failure_summary import deduplicate_failures
Member

👍 it's a start, thanks!

@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch from 90f2782 to 4824662 Compare August 8, 2017 13:07
@rhcarvalho
Contributor Author

@sosiouxme could you please review the documentation changes? Wondering if that is clear enough. Being inside of a role is not the most visible location, but it is good to at least have a place to point people to. The docs here would be the base for something more in the official docs.

@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch 2 times, most recently from a5f69ef to 48af29f Compare August 21, 2017 16:14
@rhcarvalho
Contributor Author

@sosiouxme @juanvallejo if you have some time, could you please try this out and give feedback?

I'm using it like this:

List available checks

$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml

This is notably unintuitive, and feels equivalent to calling a command without the required arguments and getting a usage error message, though it does serve the purpose of listing checks and tags.

It also made me realize an inconsistency with pre-install and preflight tags.

Run a single check

$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e openshift_checks=disk_availability

Run multiple checks

by tag

$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e openshift_checks=@preflight

by name, comma-separated

$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e openshift_checks=docker_storage,disk_availability

by name, YAML

$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e '{"openshift_checks": ["docker_storage", "disk_availability"]}'

Except for this last one, these use cases are all described in openshift-checks/README.md. I omit the last one for it is not very convenient for command line usage.
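All of these input styles can funnel through one small helper. Here is a plausible sketch of the `normalize` function seen in the diff (the real implementation may differ):

```python
def normalize(checks):
    """Accept a comma-separated string (-e openshift_checks=a,b) or a
    YAML/JSON list and return a clean list of entry names."""
    if isinstance(checks, str):          # e.g. '-e openshift_checks=a,b'
        checks = checks.split(',')
    return [name.strip() for name in checks if name.strip()]
```

This is why `-e openshift_checks=docker_storage,disk_availability` and the explicit YAML list form behave the same inside the plugin.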

# first and that takes time. To speed up listing checks, we use a separate play
# that runs before the include to save time and improve the UX.
- name: OpenShift health checks
hosts: OSEv3
Contributor Author

@sosiouxme, in trying to change from OSEv3 to oo_all_hosts (an empty group before initialization), I realized we might as well run this play only on localhost. The downside is that it will not play nice with host-specific variables.
WDYT?

Member

I think yes, run it against localhost so we don't have to init groups at this point nor put OSEv3 in a common playbook. The only inconsistency would be if they didn't define openshift_checks globally but only on individual hosts, and I think that's not a very important use case.

Member

Agreed on not using OSEv3 in common playbooks.

Member

@sosiouxme sosiouxme left a comment

LGTM aside from last comment and unit test failures

This is useful on its own, and also aids in developing/testing new
checks that are not part of any playbook.

Since the intent when running this playbook is to execute checks, opt
for a less verbose explanation on the error summary.
This is a simple mechanism to learn what health checks are available.

Note that we defer task_vars verification, so that we can compute
requested_checks and resolved_checks earlier, allowing us to list checks
even if openshift_facts has not run.
This prevents an exception in one check from interfering with other
checks. Skips checks that raise an exception in their is_active method.

Whenever capturing a broad exception in the `is_active` or `run`
methods, include traceback information that can be useful in bug
reports.
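The exception isolation described above can be sketched as follows — hypothetical names (`resolve_checks`, `check.name`), not the module's exact API:

```python
import traceback


def resolve_checks(known_checks, task_vars):
    """Filter checks down to the active ones, isolating failures.

    A check whose is_active() raises is skipped, and its traceback is
    recorded so it can surface in bug reports instead of aborting the
    whole run.
    """
    active, errors = [], {}
    for check in known_checks:
        try:
            if check.is_active(task_vars):
                active.append(check)
        except Exception:
            # keep going; one broken check must not block the others
            errors[check.name] = traceback.format_exc()
    return active, errors
```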
The intent is to deduplicate similar errors that happened in many hosts,
making the summary more concise.
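That deduplication amounts to grouping failures by identical message, so an error common to many hosts is reported once. A minimal sketch, with an assumed `(host, message)` data shape:

```python
def summarize_failures(failures):
    """Group per-host failures by identical message.

    failures: iterable of (host, message) pairs. Returns a list of
    (message, sorted hosts) so that an error shared by many hosts is
    printed once with all affected hosts listed together.
    """
    by_message = {}
    for host, message in failures:
        by_message.setdefault(message, set()).add(host)
    return [(msg, sorted(hosts)) for msg, hosts in sorted(by_message.items())]
```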
This serves two purposes:

- Gracefully omit the summary if there was an error computing it, no
confusion to the regular end user.

- Provide a stacktrace of the error when running verbose, giving
developers or users reporting bugs a better insight of what went wrong,
as opposed to Ansible's opaque handling of errors in callback plugins.
And beautify the code a bit.
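The fail-safe summary described above might look like this — a sketch with assumed names, not the callback plugin's real internals:

```python
import traceback


def print_summary(display, failures, verbosity=0):
    """Print the error summary; never let a summary bug break the run.

    On error, the summary is silently omitted for regular users, while
    verbose runs get the traceback for debugging, instead of Ansible's
    opaque handling of errors in callback plugins.
    """
    try:
        for msg, hosts in failures:
            display.display("%s: %s" % (", ".join(hosts), msg))
    except Exception:
        if verbosity:
            display.display(traceback.format_exc())
```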
@rhcarvalho rhcarvalho force-pushed the adhoc-check-runner-misc branch from ba3a4e7 to 51645ea Compare August 24, 2017 13:00
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: adhoc
pre_tasks:
Contributor Author

cc @mtnbikenc

At this point in time this could be any of pre_tasks, tasks or post_tasks and the observable end result would be the same.

pre_tasks run before any tasks defined in the role (there are none) and seemed to best express what we want: to run the conditional as soon as possible.

Member

pre_tasks also runs before any role dependencies from meta, which is also desirable here (though I suspect we can now remove openshift_facts from dependencies anyway).

@rhcarvalho
Contributor Author

aos-ci-test

# individual hosts while not defined for localhost, we do not support that
# usage. Running this play only in localhost speeds up execution.
hosts: localhost
connection: local
Contributor Author

@sosiouxme FYI, I made the listing of checks run on localhost only.

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for 51645ea (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for 51645ea (logs)

@rhcarvalho
Contributor Author

[merge]

@rhcarvalho
Contributor Author

Flake openshift/origin#16005

[merge]

@openshift-bot

Evaluated for openshift ansible merge up to 51645ea

@openshift-bot

continuous-integration/openshift-jenkins/merge Running (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/930/) (Base Commit: 60b91e4) (PR Branch Commit: 51645ea)

@sdodson
Member

sdodson commented Aug 28, 2017

flake openshift/origin#11873

@sdodson sdodson merged commit ca5ebcb into openshift:master Aug 28, 2017
@rhcarvalho rhcarvalho deleted the adhoc-check-runner-misc branch August 28, 2017 22:09