Add playbook to run adhoc health checks or list existing checks #4570
Conversation
```python
args = self._task.args
requested_checks = normalize(args.get('checks', []))

if not requested_checks:
```
maybe I'm missing something, but wouldn't this cause other playbooks like `health` to fail here if they don't have a `checks` argument passed to them as well?
Yes, they would fail if either `checks` is not passed or is passed and is empty. Did I overlook some genuine case when we want to call the action plugin without arguments?

Note that it fails because that's the easiest way to cause text to go to stdout/stderr (without messing with the verbosity level).
> Note that it fails because that's the easiest way to cause text to go to stdout/stderr (without messing with the verbosity level)
That makes sense.
> Did I overlook some genuine case when we want to call the action plugin without arguments?
No, I had misunderstood this block, but looking at it again with context, it makes sense.
```yaml
pre_tasks:
- name: Print list of known health checks
  action: openshift_health_check
  when: checks is undefined or not checks
```
To print a list of known checks:

```shell
$ ansible-playbook -i <inventory file> playbooks/byo/openshift-checks/adhoc.yml
```
But that's not documented (yet). The problem with the current output is that it repeats for as many hosts as you have in the `OSEv3` group.
I would like to de-duplicate error messages by grouping together hosts that had the same error.
De-duplication will probably be important for clusters with 10s of hosts. Might need to put some thought into how "same" the messages need to be to be folded together.
You'll want to adjust this to eval groups like #4495 once that merges.

@sosiouxme thanks for pointing it out!
Couple things, otherwise looks good.
```yaml
roles:
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: adhoc
```
nit: if running the adhoc playbook they expect to run checks, so `adhoc` should be added to the list that gets the less-info summary.
Done
```python
try:
    known_checks = self.load_known_checks()
    args = self._task.args
    requested_checks = normalize(args.get('checks', []))
```
While we're at it... shouldn't this be named `r_openshift_health_checker_names` or some such? The adhoc playbook could still pass it in from the value in `checks` or whatever we want the user to specify.
```diff
@@ -27,6 +27,9 @@ callback plugin summarizes execution errors at the end of a playbook run.
 3. Certificate expiry playbooks ([certificate_expiry](certificate_expiry)) -
    check that certificates in use are valid and not expiring soon.

+4. Adhoc playbook ([adhoc.yml](adhoc.yml)) - useful for running adhoc checks.
+   See the next section for an usage example.
```
s/an/a/
("y" sound at beginning of word apparently doesn't feel like a vowel... English is weird)
Fixed, thanks!
```diff
@@ -0,0 +1,16 @@
+---
+- name: OpenShift health checks
+  hosts: OSEv3
```
This playbook needs to follow the pattern of the other check playbooks, invoking a parallel `playbooks/common/openshift-checks/adhoc.yml` playbook and evaluating groups along the way. `openshift_version` chokes otherwise. Also at some point we'll want to switch from using BYO group names to the common group names in our `is_active` methods.
I see. I have two questions about how that is working now, maybe you could help:

1. Why do we `include: ../openshift-cluster/initialize_groups.yml` in `byo/openshift-checks` and then `include: ../openshift-cluster/evaluate_groups.yml` in `common/openshift-checks`?
2. Why do we have `tags: always` in the `include: ../openshift-cluster/initialize_groups.yml`, but not in the evaluate_groups include?
I'm noticing those includes make the playbook run considerably slower, though I didn't measure time explicitly 😢

For the case when there are no checks to be run (and we print a list of available checks), the difference is 6s without includes, 8s with includes.
How are you getting it to run without those includes at all? I get:
```
Play: OpenShift health checks
Task: openshift_version : Set rpm version to configure if openshift_pkg_version specified
Message: The conditional check 'inventory_hostname in groups['oo_masters_to_config'] or inventory_hostname in groups['oo_nodes_to_config']' failed. The error was: error while evaluating conditional (inventory_hostname in groups['oo_masters_to_config'] or inventory_hostname in groups['oo_nodes_to_config']): Unable to look up a name or access an attribute in template string ({% if inventory_hostname in groups['oo_masters_to_config'] or inventory_hostname in groups['oo_nodes_to_config'] %} True {% else %} False {% endif %}).
Make sure your variable name does not contain invalid characters like '-': argument of type 'StrictUndefined' is not iterable
The error appears to have been in '/home/lmeyer/go/src/github.com/openshift/openshift-ansible/roles/openshift_version/tasks/set_version_rpm.yml': line 2, column 3, but may be elsewhere in the file depending on the exact syntax problem.
```
Yup, I get the same. It worked against an older version of master.
With a patch to `logging.py` to return `{}`, which is not even a real check:
```
$ time ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e checks=logging -vvv
...
PLAY RECAP ***************************************************************************************************************************************
localhost                  : ok=8    changed=0    unreachable=0    failed=0
master1                    : ok=36   changed=2    unreachable=0    failed=0
master2                    : ok=36   changed=2    unreachable=0    failed=0
node1                      : ok=36   changed=2    unreachable=0    failed=0
node2                      : ok=36   changed=2    unreachable=0    failed=0

real    1m42.949s
user    0m36.358s
sys     0m30.438s
```
So that's more than a minute and a half to run the simplest of the checks 😢
But anyway, do you understand why the includes are in different files and why the `always` tag is only in one of them?
> Why do we `include: ../openshift-cluster/initialize_groups.yml` in `byo/openshift-checks` and then `include: ../openshift-cluster/evaluate_groups.yml` in `common/openshift-checks`?

The first is for translating BYO group names into `g_` names. It's a pretty simple translation in BYO, more complex for other starting points. Then these names are fed into the second, which further defines `oo_first_master` and such that are to actually be used in the roles.

> Why do we have `tags: always` in the `include: ../openshift-cluster/initialize_groups.yml`, but not in the evaluate_groups include?

This is probably cruft. Basically nothing actually works if you specify tags (maybe some very narrow dev use cases).
All the other roles that run before `openshift_health_checker` are a problem :( and not just because it's slow. I wish we could separate them out into roles that gather info and set variables versus those which perform an action. Unfortunately there are a few actions (installs) necessary even to gather info. But it could be that those aren't necessary for our use case, and we could further divide the info-gatherers into fast and slow and just stick to info that can be gathered quickly for our checks. Maybe. I don't really look forward to that refactoring, but running the checks is pretty heavy like this.
FWIW, the original comment has been addressed and it is now in two separate playbooks, following the same pattern as others.
`playbooks/byo/openshift-checks/roles` (outdated)

```diff
@@ -0,0 +1 @@
+../../../roles
```
I think it is not necessary when invoking the playbooks from common.
Yes, `common` already has a symlink. I think back when I started this I was exploring how to have fewer layers to get something started, but definitely we should follow a single pattern.
Symlink removed.
```yaml
- name: Run health checks
  action: openshift_health_check
  args:
    checks: '{{ checks | default([]) }}'
```
Debating whether the `checks` inventory variable should be qualified. Inputs for most other things are expected to be `openshift_<whatever>` usually, so maybe `openshift_checks`? It seems kind of pedantic since this is specific to the playbook. But it's consistent.
Yes, I agree. I think we went with `checks` to make it easier to type.

Still, I can't say I'm happy with such a long command:

```shell
$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e checks=fluentd
```

In this case `oc diagnostics` (which could have a short alias like `oc diag`) is much easier on the eyes and fewer keystrokes. Surely keystrokes matter less for scripting.
If you think `openshift_checks` is more appropriate, we can plan a migration to that.
What we pass in as an arg to the action plugin is irrelevant, as that's not an inventory variable. But we don't have a `checks` inventory variable before this PR, so there's no migration necessary to change that.

You think that command line is long until you type the commands for running via the container... :)

I think the user should have `openshift_checks` for consistency with `openshift_disable_checks`...

```yaml
args:
  checks: '{{ openshift_checks | default([]) }}'
```
Made it `openshift_checks`, thanks for pointing it out.

There is still room for debate when we compare with disable, because the latter is in singular form: `openshift_disable_check`...
@sosiouxme this is ready for another pass of review.
Long PR... not sure I got it all. Anyway, some questions, some suggestions.
```yaml
pre_tasks:
- name: List known health checks
  action: openshift_health_check
  when: openshift_checks is undefined or not openshift_checks
```
👍 nice that this runs almost instantly
This is good for the two cases in the condition, but doesn't cover other cases, such as when there is a typo in a check name.

I was thinking we could have a "preflight check" for the `openshift_health_check` plugin itself, implemented in the plugin perhaps with a flag like `dry-run: true`. But that doesn't have to be part of this PR.
`pre_tasks` always run before `roles`. Did you mean to use `post_tasks` here? The order of execution for play stages is:

```yaml
- hosts: foo
  vars:
  pre_tasks:
  roles:
  tasks:
  post_tasks:
```

Putting these items in a different order in the yaml file does not change execution order. Same with `vars:`; they are defined at the beginning of the play.
```python
result['failed'] = True
result['msg'] = list_known_checks(known_checks)
return result

resolved_checks = resolve_checks(requested_checks, known_checks.values())
```
Below if they specify an unknown name/tag they get "Make sure there is no typo in the playbook and no files are missing." Surely with the introduction of this playbook it is much more likely that the user actually specified the wrong name/tag.
Yes, the observation is correct. Do you suggest some change to the message?
Well in that case it's probably not a problem with the code we shipped, but rather the user's input... They can do something about it without looking through our Ansible code. So something like "Make sure you have specified only valid check tags or names. Valid tags and names are as follows..."
Right. We could list the existing checks.

I was thinking we may also detect that `etcd` is a tag name and should be spelled `@etcd` -- I made that mistake ;-)
Update: now the error message when the requested check is unknown includes the list of known checks and tags.
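The improved message described above could look something like the following sketch (a hypothetical helper, not the actual plugin code; `known_checks` is assumed here to be a collection of check classes exposing `name` and `tags` attributes):

```python
# Hypothetical sketch of a friendlier error for unknown check names; not the
# actual plugin code. 'known_checks' is assumed to be check classes exposing
# 'name' and 'tags' attributes.
def unknown_check_message(requested, known_checks):
    names = sorted(cls.name for cls in known_checks)
    tags = sorted({tag for cls in known_checks for tag in getattr(cls, 'tags', [])})
    lines = [
        "Unknown check '{}'.".format(requested),
        "Make sure you have specified only valid check names or tags.",
        "Valid names: {}".format(', '.join(names)),
        "Valid tags: {}".format(', '.join('@' + tag for tag in tags)),
    ]
    if requested in tags:
        # Catch the common mistake of typing a tag without the '@' prefix.
        lines.append("Hint: '{0}' is a tag; did you mean '@{0}'?".format(requested))
    return '\n'.join(lines)
```

This would cover both suggestions from the thread: listing valid names/tags and hinting at the `@tag` spelling.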
```python
try:
    known_checks = self.load_known_checks(tmp, task_vars)
    args = self._task.args
    requested_checks = normalize(args.get('checks', []))

    if not requested_checks:
        result['failed'] = True
```
I'm not thrilled about having it show up as a failure but I can't think of anything better for now. Possibly the callback plugin could display differently so it doesn't look like a failure.
I'm also not happy with it being a failure, but that's the most expedient way to get it working without adding yet more features.
An alternative would be to have either a separate action plugin or a different operation mode of the same plugin specific to list the checks, the latter being very similar to what we have now.
The difficulty of printing arbitrary text remains. Ansible will print its operational information, and showing anything else would require either a flag or a callback plugin.
```python
try:
    return check.run()
except Exception as exc:
    return dict(failed=True, msg=str(exc))
```
In the case of an `OpenShiftCheckException` we probably only want to show the text. But for any other exception, it's probably a bug in the check and I would want to get a stack trace. If not in the user-facing message, maybe as another field in the check return dict so it shows up in `-v`? We could also give a little preamble explaining that it's an unexpected error while running the check before dumping out "KeyError: blah blah blah" that's not meaningful to the user.
I like your idea, thank you. Will make it give out more context on arbitrary errors.

For background on how it became this way: I had two `except` clauses, one was the original `except OpenShiftCheckException` and the other was the general `except Exception`; I dropped the first after making both handle the exception in the same way...
Update: I made it include a traceback on arbitrary exceptions using the `exception` key as described in http://docs.ansible.com/ansible/latest/common_return_values.html#exception
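A rough sketch of that approach (names are assumptions; the merged plugin code may differ): known check failures surface only their message, while unexpected exceptions also carry a traceback under Ansible's `exception` key, which is shown with `-v`.

```python
import traceback

# Sketch, not the exact PR code: expected check failures report only their
# message; unexpected exceptions also return a traceback under the
# 'exception' key so it can be inspected in verbose mode or bug reports.
class OpenShiftCheckException(Exception):
    pass

def run_check(check):
    try:
        return check.run()
    except OpenShiftCheckException as exc:
        # Expected failure: the message alone is meaningful to the user.
        return dict(failed=True, msg=str(exc))
    except Exception as exc:
        # Likely a bug in the check: keep the traceback for diagnosis.
        return dict(failed=True, msg=str(exc), exception=traceback.format_exc())
```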
```python
# TODO: we could include a description of each check by taking it from a
# check class attribute (e.g., __doc__) when building the message below.
msg = (
    'This playbook is meant to run health checks, but no checks were '
```
could these have `\n` so it's not so widescreen...
Hmm, I'm relying on the terminal doing the line wrapping...
With manual line breaks it looks weird when the text is re-flowed to fit the screen width.
Manually inserting new lines in strings is rather brittle, sort of like font tags in HTML?
I know what you mean, it looks bad if your terminal is thinner than the arbitrary point where we've placed newlines so it wraps anyway. I guess the real issue with that is on the check outputs since they're shown as indented blocks - tried to keep those short enough so that most people won't see them wrap. If we wanted to do that right we'd need to know the width of the terminal and have a real formatter on the output...
Just in general though I think for prose it makes sense to keep lines pretty short, like in a book. I may have a terminal that's 200 columns wide to look at logs and code and JSON, but it doesn't mean I want to read text at that width. I suppose ideally we'd have a formatter that broke on whitespace at the shorter of the terminal width or a comfortable-to-read width.
Well, I don't really want to start a typographical debate, just saying personally I would wrap it for legibility and consistency... it stuck out at me.
I agree with you, though the difficulty is missing the tools to do it right, as you say. We write using some `display` object; it might mean writing to stdout, but it could be anything, even a file that is not a terminal.
```python
failures = [failure_to_dict(failure) for failure in failures]
failures = deduplicate_failures(failures)

summary = [u'', u'', u'Failure summary:', u'']
```
is there a reason for prefixing everything with `u`?
Yes, though my memory is a bit vague on the details; looking it up.
Ansible's `stringc` function operates on unicode strings (Python 2). If you give it a `str` with non-ascii content, it will lead to an implicit conversion to the default encoding (normally ascii unless customized at the system level) and to potential decoding exceptions. We may ourselves have nothing outside of the ASCII range, but keep in mind that on systems with different locales a system message may bubble up and break our code.
In Python 3, strings are always unicode and the `u` prefix is simply ignored (it was removed in 3.0 and reintroduced in Python 3.3 to ease porting Python 2 code).
There might be similar considerations to using the `display` object, though I did not revisit it now.
Generally speaking, it is good practice to work internally with unicode strings, decoding from a known charset when reading and encoding when writing.
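As an illustration of that practice (a generic sketch, not code from this PR), converting to text explicitly at the boundary avoids relying on implicit ascii decoding:

```python
# Generic helper illustrating "decode at the boundary": accept either bytes
# or text and always return text, so internal code deals only with unicode.
def to_text(value, encoding='utf-8'):
    if isinstance(value, bytes):
        # Explicit decode with a known charset; 'replace' avoids raising on
        # unexpected byte sequences (e.g. locale-specific system messages).
        return value.decode(encoding, errors='replace')
    return value
```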
```python
name = cls.name
if name in known_checks:
    other_cls = known_checks[name].__class__
    msg = "non-unique check name '{}' in: '{}' and '{}'".format(
```
So this looks better than before but... with a personal bias toward brevity... how about:

```python
raise OpenShiftCheckException(
    "duplicate check name '{}' in: '{}' and '{}'"
    "".format(name, full_class_name(cls), full_class_name(other_cls))
)
```

("non-unique" feels like a double negative and takes a second to process. formatters line up somewhat with contents...)
Done.
```python
hosts are groupped together in a single entry. The relative order of
failures is preserved.
"""
groups = defaultdict(list)
```
What you have below is clever but hard for me at least to follow, and it seems like you could do it in one pass through the loop without altering the original failure results:

```python
groups = {}
result = []
for failure in failures:
    # group the failures that are identical except for host
    group_key = tuple(sorted((key, value) for key, value in failure.items() if key != 'host'))
    if group_key not in groups:
        copy = failure.copy()
        copy['host'] = []
        groups[group_key] = copy
        result.append(copy)
    groups[group_key]['host'].append(failure['host'])
```

One other thought... why isn't the group key just the `msg`? What else do we expect to care about?
> could do it in one pass through the loop without altering the original failure results
Could be done in one pass. Originally, I had intended to have the two stages in separate functions and I wanted to give a name to each part instead of doing it all together. Naming was hard, and I could only think of "deduplicate_failures", though the multi-stage algorithm stayed.
The first part is concerned with grouping, the second part is concerned with reducing items from a group to a single item, keeping the original order.
Note that `failure.copy()` does a shallow copy. While the property of not mutating the input is appreciated, in this case it seemed not necessary because we're anyway mutating and overwriting `failures`. Obviously, we could revisit the whole thing.
> why isn't the group key just the msg? What else do we expect to care about?
If the playbook / task / etc is different, we shall not mix the failures. Similarly if the message of two failures is "One or more checks failed" -- we can't group hosts together unless the check failures are the same, or we'd be throwing away information about check results.
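Wrapping the one-pass variant sketched in the review comment above in a function makes the behavior easy to verify on a small sample (illustrative only; the merged implementation may differ):

```python
def deduplicate_failures(failures):
    """Group failures that are identical except for 'host' into one entry."""
    groups = {}
    result = []
    for failure in failures:
        # The group key includes every field except 'host', so failures are
        # only merged when playbook/task/msg/etc. all match.
        group_key = tuple(sorted((k, v) for k, v in failure.items() if k != 'host'))
        if group_key not in groups:
            copy = failure.copy()
            copy['host'] = []
            groups[group_key] = copy
            result.append(copy)
        groups[group_key]['host'].append(failure['host'])
    return result

# Sample input: two hosts with the same message, one with a different one.
sample = [
    {'host': 'master1', 'msg': 'disk full'},
    {'host': 'node1', 'msg': 'disk full'},
    {'host': 'node2', 'msg': 'low memory'},
]
```

On this sample, the identical "disk full" failures fold into a single entry listing both hosts, while "low memory" stays separate.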
"""Group together similar failures from different hosts. | ||
|
||
Returns a new list of failures such that similar failures from different | ||
hosts are groupped together in a single entry. The relative order of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"grouped"
also "similar" is a bit vague when it really means "identical except for host" (for now) so I'd s/similar/identical/
Thanks, will be updated in the next push.
```diff
@@ -0,0 +1,63 @@
+from zz_failure_summary import deduplicate_failures
```
👍 it's a start, thanks!
@sosiouxme could you please review the documentation changes? Wondering if that is clear enough. Being inside of a role is not the most visible location, but it is good to at least have a place to point people to. The docs here would be the base for something more in the official docs.
@sosiouxme @juanvallejo if you have some time, could you please try this out and give feedback? I'm using it like this:

**List available checks**

```shell
$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml
```

This is notably unintuitive, and feels equivalent to calling a command without the required arguments and getting a usage error message, though it does serve the purpose of listing checks and tags. It also made me realize an inconsistency with ...

**Run a single check**

```shell
$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e openshift_checks=disk_availability
```

**Run multiple checks**

by tag:

```shell
$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e openshift_checks=@preflight
```

by name, comma-separated:

```shell
$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e openshift_checks=docker_storage,disk_availability
```

by name, YAML:

```shell
$ ansible-playbook -i hosts playbooks/byo/openshift-checks/adhoc.yml -e '{"openshift_checks": ["docker_storage", "disk_availability"]}'
```

Except for this last one, these use cases are all described in ...
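The multiple input forms above (comma-separated string, YAML list) imply a normalization step; a minimal sketch of what such a `normalize` helper could look like (hypothetical, the real helper in the plugin may differ):

```python
def normalize(checks):
    # Accept either a comma-separated string ('a,b') or a list (['a', 'b'])
    # and return a clean list of check names/tags. Hypothetical sketch.
    if isinstance(checks, str):
        checks = checks.split(',')
    return [name.strip() for name in checks if name.strip()]
```

Tags would keep their `@` prefix, so `openshift_checks=@preflight` arrives as `['@preflight']` for later resolution against tag names.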
```yaml
# first and that takes time. To speed up listing checks, we use a separate play
# that runs before the include to save time and improve the UX.
- name: OpenShift health checks
  hosts: OSEv3
```
@sosiouxme, in trying to change from OSEv3 to oo_all_hosts (an empty group before initialization), I realized we could just as well run this play only on localhost. The downside is it will not play nice with host-specific variables.

WDYT?
I think yes, run it against localhost so we don't have to init groups at this point nor put `OSEv3` in a `common` playbook. The only inconsistency would be if they didn't define `openshift_checks` globally but only on individual hosts, and I think that's not a very important use case.
Agreed on not using OSEv3 in common playbooks.
LGTM aside from last comment and unit test failures
This is useful on its own, and also aids in developing/testing new checks that are not part of any playbook. Since the intent when running this playbook is to execute checks, opt for a less verbose explanation on the error summary.
This is a simple mechanism to learn what health checks are available. Note that we defer task_vars verification, so that we can compute requested_checks and resolved_checks earlier, allowing us to list checks even if openshift_facts has not run.
This prevents an exception in one check from interfering with other checks. Skips checks that raise an exception in their is_active method. Whenever capturing a broad exception in the `is_active` or `run` methods, include traceback information that can be useful in bug reports.
The intent is to deduplicate similar errors that happened in many hosts, making the summary more concise.
This serves two purposes: - Gracefully omit the summary if there was an error computing it, no confusion to the regular end user. - Provide a stacktrace of the error when running verbose, giving developers or users reporting bugs a better insight of what went wrong, as opposed to Ansible's opaque handling of errors in callback plugins.
And beautify the code a bit.
```yaml
- openshift_health_checker
vars:
- r_openshift_health_checker_playbook_context: adhoc
pre_tasks:
```
cc @mtnbikenc
At this point in time this could be any of `pre_tasks`, `tasks` or `post_tasks` and the observable end result would be the same. `pre_tasks` run before any tasks defined in the role (there are none), and seemed to express what we want: to run the conditional as soon as possible.
`pre_tasks` also runs before any role dependencies from meta, which is also desirable here (though I suspect we can now remove `openshift_facts` from dependencies anyway).
aos-ci-test
```yaml
# individual hosts while not defined for localhost, we do not support that
# usage. Running this play only in localhost speeds up execution.
hosts: localhost
connection: local
```
@sosiouxme FYI, I made the listing of checks run on localhost only.
[merge]

Flake openshift/origin#16005 [merge]

Evaluated for openshift ansible merge up to 51645ea

continuous-integration/openshift-jenkins/merge Running (https://ci.openshift.redhat.com/jenkins/job/merge_pull_request_openshift_ansible/930/) (Base Commit: 60b91e4) (PR Branch Commit: 51645ea)

flake openshift/origin#11873
This is useful while developing/testing health checks.
One step towards making checks more discoverable and easier to run.