-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
diagnostics: individual parameters #17773
diagnostics: individual parameters #17773
Conversation
16eb3d9
to
af825ea
Compare
pkg/oc/admin/diagnostics/client.go
Outdated
|
||
// buildClientDiagnostics builds client Diagnostic objects based on the rawConfig passed in. | ||
// Returns the Diagnostics built, "ok" bool for whether to proceed or abort, and an error if any was encountered during the building of diagnostics.) { | ||
func (o DiagnosticsOptions) buildClientDiagnostics(rawConfig *clientcmdapi.Config) ([]types.Diagnostic, bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any strong reason for changing the struct name? *Options
is the pattern used in most commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, actually, just struck me as odd. Didn't realize it was a pattern, happy to change it back.
af825ea
to
f8fa831
Compare
o := &DiagnosticsConfig{ | ||
RequestedDiagnostics: available.Names().Difference(defaultSkipDiagnostics()), | ||
ParameterizedDiagnostics: types.NewParameterizedDiagnosticMap(available...), | ||
LogOptions: &log.LoggerOptions{Out: out}, | ||
} | ||
|
||
cmd := &cobra.Command{ | ||
Use: name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that you lowercase the names so that we have all subcommands as lowercase. For backwards compatibility we can add the current format as aliases with the Aliases
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I meant this for each individual command, in NewCmdDiagnosticsIndividual
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, will do.
o.Logger.Summary(warnCount, errorCount) | ||
|
||
kcmdutil.CheckErr(err) | ||
if failed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any case where failed
will be true but you get a nil err
? Otherwise you don't need this check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually; this is meant to distinguish between problems encountered while gathering everything the diagnostics need (errors as we usually think of them), versus the the problems they find and report (failures which still need to be reflected in the return code).
However, I think it may actually be possible and clearer if those two stages were extricated and handled separately. If it's not too messy, I'll try including it in this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess it's no surprise - it's messy. And I'd like to improve that; there just has to be a way for all this code to be more legible.
To revise my earlier statement, there is more than one error mode here. There are errors in constructing diagnostics that need to be reported to the user without halting execution. There are conditions under which we should abort without running diagnostics at all. And there are diagnostic results that indicate problems to report with an error exit code. I need to think more about the best way to signal which mode is of concern in each place; ok
and err
and failure
just don't provide much nuance.
But probably not for this patch set.
f8fa831
to
c3f1780
Compare
I updated the usage output due to the lowercasing of subcommands. |
re https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17773/test_pull_request_origin_verify/6938/ when I run hack/update-generated-docs.sh locally it persists in generating the man pages for uppercase command names despite those being only aliases. hack/verify-generated-docs.sh passes. Apparently what runs in Jenkins is different. Do I need to update a tool or something? This is pretty annoying. |
c3f1780
to
64e7e60
Compare
@sosiouxme not sure but might just be a matter of calling |
64e7e60
to
e33c6ad
Compare
@fabianofranz thanks, that was it. |
/retest |
looks like flake #17769 |
alright... now all we need is a lgtm... |
|
||
kcmdutil.CheckErr(err) | ||
if failed { | ||
os.Exit(255) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this exit code documented somewhere by any chance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Diagnostics exit code only indicates success/failure with no further nuance. Is there a better code to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, could not really find any relevant docs; only this from v2: https://access.redhat.com/documentation/en-US/OpenShift_Online/2.0/html/Cartridge_Specification_Guide/Exit_Status_Codes.html
We do use the 255
exit code elsewhere in our code, but it only seems to be when starting a node, master, etc. from a set of given invalid options [1].
Would an exit code of 1
make more sense? (Although I would not see this as a blocking change)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes no difference to me. I'll change it to 1 as I'm making other tweaks anyway.
if !expected { | ||
// no diagnostic required a client config, nothing to do | ||
} else if !detected { | ||
// there just plain isn't any client config file available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this constitute a diagnostics failure (rather than just a log warning) if a clientConfig was expected but was not detected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general design of diagnostics has been to run all applicable diagnostics and skip the rest (try to be as useful as possible without making the user understand a lot up front). So if you're missing a master config file, skip diagnostics that require that config file. Or if you have a client config file but aren't cluster-admin, skip the cluster diagnostics. It's not an error per se, just narrowing the scope of operation.
On the other hand, it would probably make sense to indicate failure if all diagnostics are skipped. With this PR it will be more common that only one diagnostic is requested, and if that's skipped and diagnostics happily declares success, that would be surprising.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the other hand, it would probably make sense to indicate failure if all diagnostics are skipped.
With this PR it will be more common that only one diagnostic is requested, and if that's skipped and diagnostics happily declares success, that would be surprising.
I agree, it would be reasonable to fail if all are skipped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm putting something in func (o DiagnosticsOptions) Run()
to check for that.
if !expected { | ||
// no diagnostic required a client config, nothing to do | ||
} else if !detected { | ||
// there just plain isn't any client config file available | ||
o.Logger.Notice("CED3014", "No client configuration specified; skipping client and cluster diagnostics.") | ||
} else if rawConfig, err := o.buildRawConfig(); err != nil { // client config is totally broken - won't parse etc (problems may have been detected and logged) | ||
o.Logger.Error("CED3015", fmt.Sprintf("Client configuration failed to load; skipping client and cluster diagnostics due to error: %s", err.Error())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing as above - if expected, but config broken, stop diagnostic and fail?
@@ -154,6 +154,9 @@ var ( | |||
|
|||
// Name is part of the Diagnostic interface and just returns name. | |||
func (d ConfigContext) Name() string { | |||
if d.ContextName == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: len(d.ContextName) == 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked up a long gonuts thread a while ago on which way was better, and the consensus was essentially "whichever looks clearest to you". I can look that up again if you want :) Do we have a coding standard on this that says otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needless to say, I think comparing to empty string is less to parse than checking length against 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a coding standard on this that says otherwise?
Not really a coding standard so much as a cli convention (most likely with a few cases here and there that have been missed :) )
{LatestImageParam, "If true, when expanding the image template, use latest version, not release version", &d.ImageTemplate.Latest, false}, | ||
} | ||
} | ||
|
||
// CanRun is part of the Diagnostic interface; it determines if the conditions are right to run this diagnostic. | ||
func (d *DiagnosticPod) CanRun() (bool, error) { | ||
if d.PreventModification { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor change to error message below:
running the diagnostic pod is an API change, which is prevented because the --prevent-modification flag was specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, thanks!
test/cmd/diagnostics.sh
Outdated
os::cmd::expect_success "oc adm diagnostics EtcdWriteVolume --duration=10s --help" | ||
os::cmd::expect_success "oc adm diagnostics MasterConfigCheck --master-config=${MASTER_CONFIG_DIR}/master-config.yaml" | ||
os::cmd::expect_success "oc adm diagnostics NodeConfigCheck --node-config=${NODE_CONFIG_DIR}/node-config.yaml" | ||
os::cmd::expect_success "oc adm diagnostics ServiceExternalIPs --master-config=${MASTER_CONFIG_DIR}/master-config.yaml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: mind changing at least one of these to all lowercase to ensure aliases work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
certainly
updated (and rebased, for good measure) |
79207ee
to
a9387ba
Compare
@soltysh can i get a lgtm? https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17773/test_pull_request_origin_extended_conformance_install/5256/ looks like flake #16994 /retest |
Unless, of course, one of the PRs that includes this one gets lgtm'd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
@@ -144,8 +145,23 @@ func (d *AggregatedLogging) Description() string { | |||
return "Check aggregated logging integration for proper configuration" | |||
} | |||
|
|||
func (d *AggregatedLogging) Requirements() (client bool, host bool) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to define named return parameters if you don't use them. If that's for documentation a doc comment is more appropriate.
@sosiouxme you need someone from top level owners to approve it |
you should figure out which dir is missing. These all look cli related. /approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, soltysh, sosiouxme The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS Files:
You can indicate your approval by writing |
Completions changed since last run, will fix. |
a9387ba
to
4b5f3db
Compare
Adds the ability to specify parameters for individual diagnostics on the command line (without proliferating flags). Addresses openshift#14640
4b5f3db
to
241fd4f
Compare
Automatic merge from submit-queue. |
…-summary Automatic merge from submit-queue (batch tested with PRs 17857, 18252, 18198). diagnostics: refactor build-and-run for clarity This builds on #17773 which is the source of the first commit. Look at the second commit for the new changes. ---- Improve the legibility of the code that builds and runs diagnostics. The main confusion was the need to track and report the number of diagnostic errors and warnings versus problems that halt execution prematurely and the need to return a correct status code at completion. In the end it seemed simplest to just have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors. As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away. Commands for main diagnostics as well as pod diagnostics are now implemented more uniformly.
…cs-unify Automatic merge from submit-queue. openshift-diagnostics => diagnostics subcommands Builds on commit from #17773 (diagnostics: individual parameters). Removes `openshift-diagnostics` in favor of hidden `oc adm diagnostics` subcommands as proposed with #18149 (comment). Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1534513 and #18141
Automatic merge from submit-queue (batch tested with PRs 16658, 18643). AppCreate diagnostic Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic. https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality Status: - [x] Create and cleanup project - [x] Deploy and cleanup app - [x] Wait for app to start - [x] Test ability to connect to app via service - [x] Test that app responds correctly - [x] Test ability to connect via route - [x] Write stats/results to file as json Not yet addressed in this PR (depending on how reviews progress vs development): - [ ] Run a build to completion - [ ] Test ability to attach storage - [ ] Gather and write useful information (logs, status) on failure Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.
Updated version of #16589 based on feedback.
This addresses #14640 by making individual diagnostics into subcommands that can have their own flags. Existing top-level flags for
NetworkCheck
are removed and the config envvar forEtcdWriteVolume
is deprecated in favor of a flag. All individual flags are available underneath theall
subcommand.This required rather more refactoring as the flags had to be known in order to define the command, not just at runtime. Usages are given below:
(Note
all
is now intermingled with the individual subcommands.)