diagnostics: individual parameters #17773

Conversation

@sosiouxme (Member) commented Dec 14, 2017:

Updated version of #16589 based on feedback.

This addresses #14640 by making individual diagnostics into subcommands that can have their own flags. The existing top-level flags for NetworkCheck are removed, and the config environment variable for EtcdWriteVolume is deprecated in favor of a flag. All individual flags are available under the "all" subcommand.

This required rather more refactoring, as the flags had to be known in order to define the commands, not just at runtime. Usage output is shown below:
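For illustration, a minimal sketch of the subcommand-per-diagnostic pattern, assuming cobra wiring similar to what the help output below implies; the option struct, flag default, and RunE body are illustrative, not the PR's actual code:

package main

import (
	"fmt"
	"os"
	"time"

	"github.com/spf13/cobra"
)

// etcdWriteVolumeOptions holds the flags for one individual diagnostic;
// defining it up front is what makes the flags known at command-definition time.
type etcdWriteVolumeOptions struct {
	Duration time.Duration
}

func newEtcdWriteVolumeCommand() *cobra.Command {
	o := &etcdWriteVolumeOptions{Duration: time.Minute}
	cmd := &cobra.Command{
		Use:     "etcdwritevolume",
		Aliases: []string{"EtcdWriteVolume"}, // capitalized alias kept for backwards compatibility
		Short:   "Check the volume of writes against etcd over a time period",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Printf("running etcd write-volume test for %s\n", o.Duration)
			return nil
		},
	}
	cmd.Flags().DurationVar(&o.Duration, "duration", o.Duration, "How long to perform the write test")
	return cmd
}

func main() {
	root := &cobra.Command{Use: "diagnostics"}
	root.AddCommand(newEtcdWriteVolumeCommand())
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}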

$ oc adm diagnostics --help
This utility helps troubleshoot and diagnose known problems for an OpenShift cluster and/or local host. The base command
runs a standard set of diagnostics: 

  oc adm diagnostics
  
[...]

An individual diagnostic may be run as a subcommand which may have flags for specifying options specific to that
diagnostic. 

Finally, the "all" subcommand runs all available diagnostics (including heavyweight ones skipped in the standard set)
and provides all individual diagnostic flags.

Usage:
  oc adm diagnostics [options]

Available Commands:
  aggregatedlogging          Check aggregated logging integration for proper configuration
  all                        Diagnose common cluster problems
[...]
  unitstatus                 Check status for related systemd units

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --host=false: If true, look for systemd and journald units even without master/node config
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API

(Note that "all" is now intermingled with the individual subcommands.)

$ oc adm diagnostics all --help
This utility helps troubleshoot and diagnose known problems for an OpenShift cluster and/or local host. This subcommand
exists to run all available diagnostics: 

  oc adm diagnostics all
  
Available diagnostics vary based on client config and local OpenShift host config. All flags from the base command work
similarly here, but all possible flags for individual diagnostics are also available.

Usage:
  oc adm diagnostics all [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --diagnosticpod-images='openshift/origin-${component}:${version}': Image template to use in creating a pod
      --diagnosticpod-latest-images=false: If true, when expanding the image template, use latest version, not release version
      --etcdwritevolume-duration='1m': How long to perform the write test
      --host=false: If true, look for systemd and journald units even without master/node config
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --networkcheck-logdir='/tmp/openshift/': Path to store diagnostic results in case of errors
      --networkcheck-pod-image='openshift/origin:v3.9.0-alpha.0': Image to use for diagnostic pod
      --networkcheck-test-pod-image='openshift/origin-deployer:v3.9.0-alpha.0': Image to use for diagnostic test pod
      --networkcheck-test-pod-port=8080: Serving port on the diagnostic test pod
      --networkcheck-test-pod-protocol='TCP': Protocol used to connect to diagnostic test pod
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API
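(Note how each diagnostic's flags appear on "all" under a "<diagnostic>-" prefix, e.g. --duration becomes --etcdwritevolume-duration. One plausible way to wire that with pflag, sketched here as an assumption rather than the PR's actual plumbing:)

package diagnostics

import (
	"fmt"

	"github.com/spf13/cobra"
	"github.com/spf13/pflag"
)

// addPrefixedFlags re-registers a diagnostic subcommand's flags on the "all"
// command under prefixed names. The copied flag shares the original's Value,
// so setting --etcdwritevolume-duration writes into the same option variable.
func addPrefixedFlags(all *cobra.Command, name string, diag *cobra.Command) {
	diag.Flags().VisitAll(func(f *pflag.Flag) {
		pf := *f // shallow copy of the flag definition
		pf.Name = fmt.Sprintf("%s-%s", name, f.Name)
		pf.Shorthand = "" // shorthands would collide across diagnostics
		all.Flags().AddFlag(&pf)
	})
}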
$ oc adm diagnostics EtcdWriteVolume --help
Runs the EtcdWriteVolume diagnostic. 

Check the volume of writes against etcd over a time period and classify them by operation and key

Aliases:
etcdwritevolume, EtcdWriteVolume

Usage:
  oc adm diagnostics etcdwritevolume [options]

Options:
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --duration='1m': How long to perform the write test
      --host=false: If true, look for systemd and journald units even without master/node config
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --node-config='': Path to node config file (implies --host)
$ oc adm diagnostics NetworkCheck --help
Runs the NetworkCheck diagnostic. 

Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

Aliases:
networkcheck, NetworkCheck

Usage:
  oc adm diagnostics networkcheck [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --logdir='/tmp/openshift/': Path to store diagnostic results in case of errors
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --pod-image='openshift/origin:v3.9.0-alpha.0': Image to use for diagnostic pod
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API
      --test-pod-image='openshift/origin-deployer:v3.9.0-alpha.0': Image to use for diagnostic test pod
      --test-pod-port=8080: Serving port on the diagnostic test pod
      --test-pod-protocol='TCP': Protocol used to connect to diagnostic test pod

@openshift-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Dec 14, 2017.
@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from 16eb3d9 to af825ea on December 14, 2017 15:51.

// buildClientDiagnostics builds client Diagnostic objects based on the rawConfig passed in.
// Returns the diagnostics built, an "ok" bool for whether to proceed or abort, and an error if any was encountered while building the diagnostics.
func (o DiagnosticsOptions) buildClientDiagnostics(rawConfig *clientcmdapi.Config) ([]types.Diagnostic, bool, error) {
Member:

Any strong reason for changing the struct name? *Options is the pattern used in most commands.

Member Author:

No, actually, just struck me as odd. Didn't realize it was a pattern, happy to change it back.

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from af825ea to f8fa831 on December 15, 2017 16:34.
o := &DiagnosticsConfig{
RequestedDiagnostics: available.Names().Difference(defaultSkipDiagnostics()),
ParameterizedDiagnostics: types.NewParameterizedDiagnosticMap(available...),
LogOptions: &log.LoggerOptions{Out: out},
}

cmd := &cobra.Command{
Use: name,
Member:

I suggest that you lowercase the names so that we have all subcommands as lowercase. For backwards compatibility we can add the current format as aliases with the Aliases attribute.

Member:

Sorry I meant this for each individual command, in NewCmdDiagnosticsIndividual.

Member Author:

Alright, will do.

o.Logger.Summary(warnCount, errorCount)

kcmdutil.CheckErr(err)
if failed {
Member:

Is there any case where failed will be true but you get a nil err? Otherwise you don't need this check.

Member Author (@sosiouxme, Dec 15, 2017):

There is actually; this is meant to distinguish between problems encountered while gathering everything the diagnostics need (errors as we usually think of them) and the problems they find and report (failures, which still need to be reflected in the return code).

However, I think it may actually be possible and clearer if those two stages were extricated and handled separately. If it's not too messy, I'll try including it in this change.

Member Author:

So I guess it's no surprise - it's messy. And I'd like to improve that; there just has to be a way for all this code to be more legible.

To revise my earlier statement, there is more than one error mode here. There are errors in constructing diagnostics that need to be reported to the user without halting execution. There are conditions under which we should abort without running diagnostics at all. And there are diagnostic results that indicate problems to report with an error exit code. I need to think more about the best way to signal which mode is of concern in each place; ok, err, and failed just don't provide much nuance.

But probably not for this patch set.
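(For what that separation might look like, a sketch under the assumption that the three modes get distinct signals; the names here are illustrative, not this PR's code:)

package diagnostics

import "errors"

// errAbort marks conditions under which no diagnostics should run at all.
var errAbort = errors.New("cannot run any diagnostics")

// runResult separates the two countable outcome modes from hard errors.
type runResult struct {
	buildErrors int // problems constructing diagnostics; reported but not fatal
	failures    int // problems the diagnostics found; must drive the exit code
}

func (r runResult) exitCode() int {
	if r.buildErrors > 0 || r.failures > 0 {
		return 1
	}
	return 0
}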

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from f8fa831 to c3f1780 on December 16, 2017 03:55.
@sosiouxme (Member Author):

I updated the usage output due to the lowercasing of subcommands.

@sosiouxme (Member Author):

re https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17773/test_pull_request_origin_verify/6938/: when I run hack/update-generated-docs.sh locally, it persists in generating the man pages for uppercase command names despite those being only aliases. hack/verify-generated-docs.sh passes. Apparently what runs in Jenkins is different. Do I need to update a tool or something? This is pretty annoying.

@fabianofranz (Member):

@sosiouxme not sure but might just be a matter of calling make clean before generating the docs.

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from 64e7e60 to e33c6ad on December 19, 2017 11:51.
@sosiouxme (Member Author):

@fabianofranz thanks, that was it.

@sosiouxme (Member Author):

new test flakes #17881 #17883
/retest

@sosiouxme (Member Author):

/retest

@sosiouxme (Member Author):

looks like flake #17769
/retest

@sosiouxme (Member Author):

alright... now all we need is an lgtm...

@sosiouxme mentioned this pull request Jan 9, 2018.

kcmdutil.CheckErr(err)
if failed {
os.Exit(255)
Contributor:

is this exit code documented somewhere by any chance?

Member Author:

No. The diagnostics exit code only indicates success or failure, with no further nuance. Is there a better code to use?

Contributor:

Hm, could not really find any relevant docs; only this from v2: https://access.redhat.com/documentation/en-US/OpenShift_Online/2.0/html/Cartridge_Specification_Guide/Exit_Status_Codes.html

We do use the 255 exit code elsewhere in our code, but only, it seems, when starting a node, master, etc. from a set of invalid options [1].

Would an exit code of 1 make more sense? (Although I would not see this as a blocking change)

  1. https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/start_allinone.go#L96

Member Author:

It makes no difference to me. I'll change it to 1 as I'm making other tweaks anyway.
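(Presumably the quoted hunk then becomes something like the following; a sketch of the agreed change, not the verbatim commit:)

kcmdutil.CheckErr(err)
if failed {
	os.Exit(1) // conventional generic failure code instead of 255
}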

if !expected {
// no diagnostic required a client config, nothing to do
} else if !detected {
// there just plain isn't any client config file available
Contributor:

could this constitute a diagnostics failure (rather than just a log warning) if a clientConfig was expected but was not detected?

Member Author (@sosiouxme, Jan 9, 2018):

The general design of diagnostics has been to run all applicable diagnostics and skip the rest (try to be as useful as possible without making the user understand a lot up front). So if you're missing a master config file, skip diagnostics that require that config file. Or if you have a client config file but aren't cluster-admin, skip the cluster diagnostics. It's not an error per se, just narrowing the scope of operation.

On the other hand, it would probably make sense to indicate failure if all diagnostics are skipped. With this PR it will be more common that only one diagnostic is requested, and if that's skipped and diagnostics happily declares success, that would be surprising.

Contributor:

On the other hand, it would probably make sense to indicate failure if all diagnostics are skipped.
With this PR it will be more common that only one diagnostic is requested, and if that's skipped and diagnostics happily declares success, that would be surprising.

I agree, it would be reasonable to fail if all are skipped

Member Author:

I'm putting something in func (o DiagnosticsOptions) Run() to check for that.
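(A sketch of that check, assuming Run tracks how many diagnostics actually executed; the function name and message are illustrative:)

package diagnostics

import "fmt"

// checkAnyRan turns "everything was skipped" into a failure, so a single
// requested-but-skipped diagnostic cannot silently report success.
func checkAnyRan(ran int) error {
	if ran == 0 {
		return fmt.Errorf("requested diagnostics were all skipped; nothing was diagnosed")
	}
	return nil
}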

if !expected {
// no diagnostic required a client config, nothing to do
} else if !detected {
// there just plain isn't any client config file available
o.Logger.Notice("CED3014", "No client configuration specified; skipping client and cluster diagnostics.")
} else if rawConfig, err := o.buildRawConfig(); err != nil { // client config is totally broken - won't parse etc (problems may have been detected and logged)
o.Logger.Error("CED3015", fmt.Sprintf("Client configuration failed to load; skipping client and cluster diagnostics due to error: %s", err.Error()))
Contributor:

same thing as above - if expected, but config broken, stop diagnostic and fail?

@@ -154,6 +154,9 @@ var (

// Name is part of the Diagnostic interface and just returns name.
func (d ConfigContext) Name() string {
if d.ContextName == "" {
Contributor:

nit: len(d.ContextName) == 0

Member Author (@sosiouxme, Jan 9, 2018):

I looked up a long gonuts thread a while ago on which way was better, and the consensus was essentially "whichever looks clearest to you". I can look that up again if you want :) Do we have a coding standard on this that says otherwise?

Member Author:

Needless to say, I think comparing to the empty string is less to parse than checking length against 0.

Contributor:

Do we have a coding standard on this that says otherwise?

Not really a coding standard so much as a cli convention (most likely with a few cases here and there that have been missed :) )

{LatestImageParam, "If true, when expanding the image template, use latest version, not release version", &d.ImageTemplate.Latest, false},
}
}

// CanRun is part of the Diagnostic interface; it determines if the conditions are right to run this diagnostic.
func (d *DiagnosticPod) CanRun() (bool, error) {
if d.PreventModification {
Contributor:

minor change to error message below:
running the diagnostic pod is an API change, which is prevented because the --prevent-modification flag was specified

Member Author:

I like it, thanks!
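(With the suggested wording, CanRun would presumably read roughly as follows; the struct is trimmed to the one field the excerpt shows:)

package pod

import "errors"

type DiagnosticPod struct {
	PreventModification bool
	// other fields elided
}

// CanRun is part of the Diagnostic interface; it refuses to run when the
// user asked diagnostics not to make any changes via the API.
func (d *DiagnosticPod) CanRun() (bool, error) {
	if d.PreventModification {
		return false, errors.New("running the diagnostic pod is an API change, which is prevented because the --prevent-modification flag was specified")
	}
	return true, nil
}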

os::cmd::expect_success "oc adm diagnostics EtcdWriteVolume --duration=10s --help"
os::cmd::expect_success "oc adm diagnostics MasterConfigCheck --master-config=${MASTER_CONFIG_DIR}/master-config.yaml"
os::cmd::expect_success "oc adm diagnostics NodeConfigCheck --node-config=${NODE_CONFIG_DIR}/node-config.yaml"
os::cmd::expect_success "oc adm diagnostics ServiceExternalIPs --master-config=${MASTER_CONFIG_DIR}/master-config.yaml"
Contributor:

nit: mind changing at least one of these to all lowercase to ensure aliases work?

Member Author:

certainly

@sosiouxme (Member Author):

updated (and rebased, for good measure)

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch 3 times, most recently from 79207ee to a9387ba on January 10, 2018 03:25.
@sosiouxme (Member Author) commented Jan 10, 2018:

Unless, of course, one of the PRs that includes this one gets lgtm'd

@soltysh (Contributor) left a comment:

/lgtm
/approve

@@ -144,8 +145,23 @@ func (d *AggregatedLogging) Description() string {
return "Check aggregated logging integration for proper configuration"
}

func (d *AggregatedLogging) Requirements() (client bool, host bool) {
Contributor:

No need to define named return parameters if you don't use them. If that's for documentation, a doc comment is more appropriate.
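(In other words, roughly the following, mirroring the hunk above; the return values are placeholders, not AggregatedLogging's real requirements:)

// Requirements reports whether this diagnostic needs a client config
// (first return) and host-level access (second return).
func (d *AggregatedLogging) Requirements() (bool, bool) {
	return true, false // placeholder values
}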

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Jan 12, 2018.
@soltysh (Contributor) commented Jan 12, 2018:

@sosiouxme you need someone from top level owners to approve it

@deads2k (Contributor) commented Jan 19, 2018:

@sosiouxme you need someone from top level owners to approve it

you should figure out which dir is missing. These all look CLI-related.

/approve

@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, soltysh, sosiouxme


@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jan 19, 2018.
@sosiouxme (Member Author):

unit test flake is #17974

Completions changed since last run, will fix.

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from a9387ba to 4b5f3db on January 19, 2018 21:13.
@openshift-merge-robot removed the lgtm label on Jan 19, 2018.
Adds the ability to specify parameters for individual diagnostics on the
command line (without proliferating flags).

Addresses openshift#14640
@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from 4b5f3db to 241fd4f on January 19, 2018 21:46.
@deads2k added the lgtm label on Jan 22, 2018.
@openshift-merge-robot (Contributor):

Automatic merge from submit-queue.

@openshift-merge-robot merged commit e703e5b into openshift:master on Jan 23, 2018.
@sosiouxme deleted the 20171213-diagnostic-parameters branch on January 23, 2018 13:47.
openshift-merge-robot added a commit that referenced this pull request Jan 24, 2018
…-summary

Automatic merge from submit-queue (batch tested with PRs 17857, 18252, 18198).

diagnostics: refactor build-and-run for clarity

This builds on #17773 which is the source of the first commit. Look at the second commit for the new changes.

----

Improve the legibility of the code that builds and runs diagnostics.

The main source of confusion was the need to track and report the number of diagnostic errors and warnings, as opposed to problems that halt execution prematurely, together with the need to return a correct status code at completion. In the end it seemed simplest to have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away.

Commands for main diagnostics as well as pod diagnostics are now implemented more uniformly.
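(A sketch of that shape, with illustrative names: the logger owns the tallies, so build/run functions return only hard errors:)

package diagnostics

import "fmt"

// Logger owns the error/warning tallies so that build/run functions can
// return only hard errors instead of threading counts through signatures.
type Logger struct {
	errors, warnings int
}

func (l *Logger) Error(id, msg string) { l.errors++; fmt.Println("ERROR", id, msg) }
func (l *Logger) Warn(id, msg string)  { l.warnings++; fmt.Println("WARN ", id, msg) }

// Summary reports how many diagnostic errors and warnings were seen.
func (l *Logger) Summary() { fmt.Printf("errors: %d, warnings: %d\n", l.errors, l.warnings) }

// ErrorsSeen lets the command derive its exit status at completion.
func (l *Logger) ErrorsSeen() bool { return l.errors > 0 }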
openshift-merge-robot added a commit that referenced this pull request Jan 25, 2018
…cs-unify

Automatic merge from submit-queue.

openshift-diagnostics => diagnostics subcommands

Builds on commit from #17773 (diagnostics: individual parameters).
Removes `openshift-diagnostics` in favor of hidden `oc adm diagnostics` subcommands as proposed with #18149 (comment).

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1534513 and #18141
openshift-merge-robot added a commit that referenced this pull request Feb 17, 2018
Automatic merge from submit-queue (batch tested with PRs 16658, 18643).

AppCreate diagnostic

Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.

https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality

Status:
- [x] Create and cleanup project
- [x] Deploy and cleanup app
- [x] Wait for app to start
- [x] Test ability to connect to app via service
- [x] Test that app responds correctly
- [x] Test ability to connect via route
- [x] Write stats/results to file as json

Not yet addressed in this PR (depending on how reviews progress vs development):
- [ ] Run a build to completion
- [ ] Test ability to attach storage
- [ ] Gather and write useful information (logs, status) on failure

Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.