diagnostics: individual parameters #17773

Conversation

@sosiouxme (Member) commented Dec 14, 2017:

Updated version of #16589 based on feedback.

This addresses #14640 by making individual diagnostics into subcommands that can have their own flags. The existing top-level flags for NetworkCheck are removed, and the config environment variable for EtcdWriteVolume is deprecated in favor of a flag. All individual flags are available under the "all" subcommand.

This required rather more refactoring, as the flags had to be known in order to define the commands, not just at runtime. Usage output is shown below:
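For illustration, a minimal sketch of the subcommand-per-diagnostic pattern, assuming cobra wiring similar to what the help output below implies; the option struct, flag default, and RunE body are illustrative, not the PR's actual code:

package main

import (
	"fmt"
	"os"
	"time"

	"github.com/spf13/cobra"
)

// etcdWriteVolumeOptions holds the flags for one individual diagnostic;
// defining it up front is what makes the flags known at command-definition time.
type etcdWriteVolumeOptions struct {
	Duration time.Duration
}

func newEtcdWriteVolumeCommand() *cobra.Command {
	o := &etcdWriteVolumeOptions{Duration: time.Minute}
	cmd := &cobra.Command{
		Use:     "etcdwritevolume",
		Aliases: []string{"EtcdWriteVolume"}, // capitalized alias kept for backwards compatibility
		Short:   "Check the volume of writes against etcd over a time period",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Printf("running etcd write-volume test for %s\n", o.Duration)
			return nil
		},
	}
	cmd.Flags().DurationVar(&o.Duration, "duration", o.Duration, "How long to perform the write test")
	return cmd
}

func main() {
	root := &cobra.Command{Use: "diagnostics"}
	root.AddCommand(newEtcdWriteVolumeCommand())
	if err := root.Execute(); err != nil {
		os.Exit(1)
	}
}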

$ oc adm diagnostics --help
This utility helps troubleshoot and diagnose known problems for an OpenShift cluster and/or local host. The base command
runs a standard set of diagnostics: 

  oc adm diagnostics
  
[...]

An individual diagnostic may be run as a subcommand which may have flags for specifying options specific to that
diagnostic. 

Finally, the "all" subcommand runs all available diagnostics (including heavyweight ones skipped in the standard set)
and provides all individual diagnostic flags.

Usage:
  oc adm diagnostics [options]

Available Commands:
  aggregatedlogging          Check aggregated logging integration for proper configuration
  all                        Diagnose common cluster problems
[...]
  unitstatus                 Check status for related systemd units

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --host=false: If true, look for systemd and journald units even without master/node config
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API

(Note that "all" is now intermingled with the individual subcommands.)

$ oc adm diagnostics all --help
This utility helps troubleshoot and diagnose known problems for an OpenShift cluster and/or local host. This subcommand
exists to run all available diagnostics: 

  oc adm diagnostics all
  
Available diagnostics vary based on client config and local OpenShift host config. All flags from the base command work
similarly here, but all possible flags for individual diagnostics are also available.

Usage:
  oc adm diagnostics all [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --diagnosticpod-images='openshift/origin-${component}:${version}': Image template to use in creating a pod
      --diagnosticpod-latest-images=false: If true, when expanding the image template, use latest version, not release version
      --etcdwritevolume-duration='1m': How long to perform the write test
      --host=false: If true, look for systemd and journald units even without master/node config
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --networkcheck-logdir='/tmp/openshift/': Path to store diagnostic results in case of errors
      --networkcheck-pod-image='openshift/origin:v3.9.0-alpha.0': Image to use for diagnostic pod
      --networkcheck-test-pod-image='openshift/origin-deployer:v3.9.0-alpha.0': Image to use for diagnostic test pod
      --networkcheck-test-pod-port=8080: Serving port on the diagnostic test pod
      --networkcheck-test-pod-protocol='TCP': Protocol used to connect to diagnostic test pod
      --node-config='': Path to node config file (implies --host)
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API
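(Note how each diagnostic's flags appear on "all" under a "<diagnostic>-" prefix, e.g. --duration becomes --etcdwritevolume-duration. One plausible way to wire that with pflag, sketched here as an assumption rather than the PR's actual plumbing:)

package diagnostics

import (
	"fmt"

	"github.com/spf13/cobra"
	"github.com/spf13/pflag"
)

// addPrefixedFlags re-registers a diagnostic subcommand's flags on the "all"
// command under prefixed names. The copied flag shares the original's Value,
// so setting --etcdwritevolume-duration writes into the same option variable.
func addPrefixedFlags(all *cobra.Command, name string, diag *cobra.Command) {
	diag.Flags().VisitAll(func(f *pflag.Flag) {
		pf := *f // shallow copy of the flag definition
		pf.Name = fmt.Sprintf("%s-%s", name, f.Name)
		pf.Shorthand = "" // shorthands would collide across diagnostics
		all.Flags().AddFlag(&pf)
	})
}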
$ oc adm diagnostics EtcdWriteVolume --help
Runs the EtcdWriteVolume diagnostic. 

Check the volume of writes against etcd over a time period and classify them by operation and key

Aliases:
etcdwritevolume, EtcdWriteVolume

Usage:
  oc adm diagnostics etcdwritevolume [options]

Options:
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --duration='1m': How long to perform the write test
      --host=false: If true, look for systemd and journald units even without master/node config
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --master-config='': Path to master config file (implies --host)
      --node-config='': Path to node config file (implies --host)
$ oc adm diagnostics NetworkCheck --help
Runs the NetworkCheck diagnostic. 

Create a pod on all schedulable nodes and run network diagnostics from the application standpoint

Aliases:
networkcheck, NetworkCheck

Usage:
  oc adm diagnostics networkcheck [options]

Options:
      --cluster-context='': Client context to use for cluster administrator
      --config='': Path to the config file to use for CLI requests.
      --context='': The name of the kubeconfig context to use
  -l, --diaglevel=1: Level of diagnostic output: 4: Error, 3: Warn, 2: Notice, 1: Info, 0: Debug
      --logdir='/tmp/openshift/': Path to store diagnostic results in case of errors
      --loglevel=0: Set the level of log output (0-10)
      --logspec='': Set per module logging with file|pattern=LEVEL,...
      --pod-image='openshift/origin:v3.9.0-alpha.0': Image to use for diagnostic pod
      --prevent-modification=false: If true, may be set to prevent diagnostics making any changes via the API
      --test-pod-image='openshift/origin-deployer:v3.9.0-alpha.0': Image to use for diagnostic test pod
      --test-pod-port=8080: Serving port on the diagnostic test pod
      --test-pod-protocol='TCP': Protocol used to connect to diagnostic test pod

@openshift-ci-robot added the size/XXL label (denotes a PR that changes 1000+ lines, ignoring generated files) on Dec 14, 2017.
@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from 16eb3d9 to af825ea on December 14, 2017 15:51.

// buildClientDiagnostics builds client Diagnostic objects based on the rawConfig passed in.
// Returns the diagnostics built, an "ok" bool for whether to proceed or abort, and an error if any was encountered while building the diagnostics.
func (o DiagnosticsOptions) buildClientDiagnostics(rawConfig *clientcmdapi.Config) ([]types.Diagnostic, bool, error) {
Member:

Any strong reason for changing the struct name? *Options is the pattern used in most commands.

Member Author:

No, actually, just struck me as odd. Didn't realize it was a pattern, happy to change it back.

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from af825ea to f8fa831 on December 15, 2017 16:34.
o := &DiagnosticsConfig{
RequestedDiagnostics: available.Names().Difference(defaultSkipDiagnostics()),
ParameterizedDiagnostics: types.NewParameterizedDiagnosticMap(available...),
LogOptions: &log.LoggerOptions{Out: out},
}

cmd := &cobra.Command{
Use: name,
Member:

I suggest that you lowercase the names so that we have all subcommands as lowercase. For backwards compatibility we can add the current format as aliases with the Aliases attribute.

Member:

Sorry I meant this for each individual command, in NewCmdDiagnosticsIndividual.

Member Author:

Alright, will do.

o.Logger.Summary(warnCount, errorCount)

kcmdutil.CheckErr(err)
if failed {
Member:

Is there any case where failed will be true but you get a nil err? Otherwise you don't need this check.

Member Author (@sosiouxme, Dec 15, 2017):

There is actually; this is meant to distinguish between problems encountered while gathering everything the diagnostics need (errors as we usually think of them) and the problems they find and report (failures, which still need to be reflected in the return code).

However, I think it may actually be possible and clearer if those two stages were extricated and handled separately. If it's not too messy, I'll try including it in this change.

Member Author:

So I guess it's no surprise - it's messy. And I'd like to improve that; there just has to be a way for all this code to be more legible.

To revise my earlier statement, there is more than one error mode here. There are errors in constructing diagnostics that need to be reported to the user without halting execution. There are conditions under which we should abort without running diagnostics at all. And there are diagnostic results that indicate problems to report with an error exit code. I need to think more about the best way to signal which mode is of concern in each place; ok, err, and failed just don't provide much nuance.

But probably not for this patch set.
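(For what that separation might look like, a sketch under the assumption that the three modes get distinct signals; the names here are illustrative, not this PR's code:)

package diagnostics

import "errors"

// errAbort marks conditions under which no diagnostics should run at all.
var errAbort = errors.New("cannot run any diagnostics")

// runResult separates the two countable outcome modes from hard errors.
type runResult struct {
	buildErrors int // problems constructing diagnostics; reported but not fatal
	failures    int // problems the diagnostics found; must drive the exit code
}

func (r runResult) exitCode() int {
	if r.buildErrors > 0 || r.failures > 0 {
		return 1
	}
	return 0
}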

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from f8fa831 to c3f1780 on December 16, 2017 03:55.
@sosiouxme (Member Author):

I updated the usage output due to the lowercasing of subcommands.

@sosiouxme (Member Author):

re https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/17773/test_pull_request_origin_verify/6938/: when I run hack/update-generated-docs.sh locally, it persists in generating the man pages for uppercase command names despite those being only aliases. hack/verify-generated-docs.sh passes. Apparently what runs in Jenkins is different. Do I need to update a tool or something? This is pretty annoying.

@fabianofranz (Member):

@sosiouxme not sure but might just be a matter of calling make clean before generating the docs.

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from 64e7e60 to e33c6ad on December 19, 2017 11:51.
@sosiouxme (Member Author):

@fabianofranz thanks, that was it.

@sosiouxme (Member Author):

new test flakes #17881 #17883
/retest

@sosiouxme (Member Author):

/retest

@sosiouxme (Member Author):

looks like flake #17769
/retest

@sosiouxme (Member Author):

alright... now all we need is an lgtm...

@sosiouxme mentioned this pull request Jan 9, 2018.

kcmdutil.CheckErr(err)
if failed {
os.Exit(255)
Contributor:

is this exit code documented somewhere by any chance?

Member Author:

No. The diagnostics exit code only indicates success or failure, with no further nuance. Is there a better code to use?

Contributor:

Hm, could not really find any relevant docs; only this from v2: https://access.redhat.com/documentation/en-US/OpenShift_Online/2.0/html/Cartridge_Specification_Guide/Exit_Status_Codes.html

We do use the 255 exit code elsewhere in our code, but only, it seems, when starting a node, master, etc. from a set of invalid options [1].

Would an exit code of 1 make more sense? (Although I would not see this as a blocking change)

  1. https://github.com/openshift/origin/blob/master/pkg/cmd/server/start/start_allinone.go#L96

Member Author:

It makes no difference to me. I'll change it to 1 as I'm making other tweaks anyway.
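(Presumably the quoted hunk then becomes something like the following; a sketch of the agreed change, not the verbatim commit:)

kcmdutil.CheckErr(err)
if failed {
	os.Exit(1) // conventional generic failure code instead of 255
}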

if !expected {
// no diagnostic required a client config, nothing to do
} else if !detected {
// there just plain isn't any client config file available
Contributor:

could this constitute a diagnostics failure (rather than just a log warning) if a clientConfig was expected but was not detected?

Member Author (@sosiouxme, Jan 9, 2018):

The general design of diagnostics has been to run all applicable diagnostics and skip the rest (try to be as useful as possible without making the user understand a lot up front). So if you're missing a master config file, skip diagnostics that require that config file. Or if you have a client config file but aren't cluster-admin, skip the cluster diagnostics. It's not an error per se, just narrowing the scope of operation.

On the other hand, it would probably make sense to indicate failure if all diagnostics are skipped. With this PR it will be more common that only one diagnostic is requested, and if that's skipped and diagnostics happily declares success, that would be surprising.

Contributor:

On the other hand, it would probably make sense to indicate failure if all diagnostics are skipped.
With this PR it will be more common that only one diagnostic is requested, and if that's skipped and diagnostics happily declares success, that would be surprising.

I agree, it would be reasonable to fail if all are skipped

Member Author:

I'm putting something in func (o DiagnosticsOptions) Run() to check for that.
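(A sketch of that check, assuming Run tracks how many diagnostics actually executed; the function name and message are illustrative:)

package diagnostics

import "fmt"

// checkAnyRan turns "everything was skipped" into a failure, so a single
// requested-but-skipped diagnostic cannot silently report success.
func checkAnyRan(ran int) error {
	if ran == 0 {
		return fmt.Errorf("requested diagnostics were all skipped; nothing was diagnosed")
	}
	return nil
}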

if !expected {
// no diagnostic required a client config, nothing to do
} else if !detected {
// there just plain isn't any client config file available
o.Logger.Notice("CED3014", "No client configuration specified; skipping client and cluster diagnostics.")
} else if rawConfig, err := o.buildRawConfig(); err != nil { // client config is totally broken - won't parse etc (problems may have been detected and logged)
o.Logger.Error("CED3015", fmt.Sprintf("Client configuration failed to load; skipping client and cluster diagnostics due to error: %s", err.Error()))
Contributor:

same thing as above - if expected, but config broken, stop diagnostic and fail?

@@ -154,6 +154,9 @@ var (

// Name is part of the Diagnostic interface and just returns name.
func (d ConfigContext) Name() string {
if d.ContextName == "" {
Contributor:

nit: len(d.ContextName) == 0

Member Author (@sosiouxme, Jan 9, 2018):

I looked up a long gonuts thread a while ago on which way was better, and the consensus was essentially "whichever looks clearest to you". I can look that up again if you want :) Do we have a coding standard on this that says otherwise?

Member Author:

Needless to say, I think comparing to the empty string is less to parse than checking length against 0.

Contributor:

Do we have a coding standard on this that says otherwise?

Not really a coding standard so much as a cli convention (most likely with a few cases here and there that have been missed :) )

{LatestImageParam, "If true, when expanding the image template, use latest version, not release version", &d.ImageTemplate.Latest, false},
}
}

// CanRun is part of the Diagnostic interface; it determines if the conditions are right to run this diagnostic.
func (d *DiagnosticPod) CanRun() (bool, error) {
if d.PreventModification {
Contributor:

minor change to error message below:
running the diagnostic pod is an API change, which is prevented because the --prevent-modification flag was specified

Member Author:

I like it, thanks!
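(With the suggested wording, CanRun would presumably read roughly as follows; the struct is trimmed to the one field the excerpt shows:)

package pod

import "errors"

type DiagnosticPod struct {
	PreventModification bool
	// other fields elided
}

// CanRun is part of the Diagnostic interface; it refuses to run when the
// user asked diagnostics not to make any changes via the API.
func (d *DiagnosticPod) CanRun() (bool, error) {
	if d.PreventModification {
		return false, errors.New("running the diagnostic pod is an API change, which is prevented because the --prevent-modification flag was specified")
	}
	return true, nil
}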

os::cmd::expect_success "oc adm diagnostics EtcdWriteVolume --duration=10s --help"
os::cmd::expect_success "oc adm diagnostics MasterConfigCheck --master-config=${MASTER_CONFIG_DIR}/master-config.yaml"
os::cmd::expect_success "oc adm diagnostics NodeConfigCheck --node-config=${NODE_CONFIG_DIR}/node-config.yaml"
os::cmd::expect_success "oc adm diagnostics ServiceExternalIPs --master-config=${MASTER_CONFIG_DIR}/master-config.yaml"
Contributor:

nit: mind changing at least one of these to all lowercase to ensure aliases work?

Member Author:

certainly

@sosiouxme (Member Author):

updated (and rebased, for good measure)

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch 3 times, most recently from 79207ee to a9387ba on January 10, 2018 03:25.
@sosiouxme (Member Author) commented Jan 10, 2018:

Unless, of course, one of the PRs that includes this one gets lgtm'd

@soltysh (Contributor) left a comment:

/lgtm
/approve

@@ -144,8 +145,23 @@ func (d *AggregatedLogging) Description() string {
return "Check aggregated logging integration for proper configuration"
}

func (d *AggregatedLogging) Requirements() (client bool, host bool) {
Contributor:

No need to define named return parameters if you don't use them. If that's for documentation, a doc comment is more appropriate.
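(In other words, roughly the following, mirroring the hunk above; the return values are placeholders, not AggregatedLogging's real requirements:)

// Requirements reports whether this diagnostic needs a client config
// (first return) and host-level access (second return).
func (d *AggregatedLogging) Requirements() (bool, bool) {
	return true, false // placeholder values
}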

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Jan 12, 2018.
@soltysh (Contributor) commented Jan 12, 2018:

@sosiouxme you need someone from top level owners to approve it

@deads2k (Contributor) commented Jan 19, 2018:

@sosiouxme you need someone from top level owners to approve it

you should figure out which dir is missing. These all look CLI-related.

/approve

@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, soltysh, sosiouxme


@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Jan 19, 2018.
@sosiouxme (Member Author):

unit test flake is #17974

Completions changed since last run, will fix.

@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from a9387ba to 4b5f3db on January 19, 2018 21:13.
@openshift-merge-robot removed the lgtm label on Jan 19, 2018.
Adds the ability to specify parameters for individual diagnostics on the
command line (without proliferating flags).

Addresses openshift#14640
@sosiouxme force-pushed the 20171213-diagnostic-parameters branch from 4b5f3db to 241fd4f on January 19, 2018 21:46.
@deads2k added the lgtm label on Jan 22, 2018.
@openshift-merge-robot (Contributor):

Automatic merge from submit-queue.

@openshift-merge-robot merged commit e703e5b into openshift:master on Jan 23, 2018.
@sosiouxme deleted the 20171213-diagnostic-parameters branch on January 23, 2018 13:47.
openshift-merge-robot added a commit that referenced this pull request Jan 24, 2018
…-summary

Automatic merge from submit-queue (batch tested with PRs 17857, 18252, 18198).

diagnostics: refactor build-and-run for clarity

This builds on #17773 which is the source of the first commit. Look at the second commit for the new changes.

----

Improve the legibility of the code that builds and runs diagnostics.

The main source of confusion was the need to track and report the number of diagnostic errors and warnings, as opposed to problems that halt execution prematurely, together with the need to return a correct status code at completion. In the end it seemed simplest to have the logger report how many diagnostic errors and warnings were seen, leaving function signatures to return only build/run errors.

As a side effect, I looked at the ConfigLoading code that does an early check to see if there is a client config, and concluded it was confusing and unnecessary for it to be a diagnostic, so I refactored it away.

Commands for main diagnostics as well as pod diagnostics are now implemented more uniformly.
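(A sketch of that shape, with illustrative names: the logger owns the tallies, so build/run functions return only hard errors:)

package diagnostics

import "fmt"

// Logger owns the error/warning tallies so that build/run functions can
// return only hard errors instead of threading counts through signatures.
type Logger struct {
	errors, warnings int
}

func (l *Logger) Error(id, msg string) { l.errors++; fmt.Println("ERROR", id, msg) }
func (l *Logger) Warn(id, msg string)  { l.warnings++; fmt.Println("WARN ", id, msg) }

// Summary reports how many diagnostic errors and warnings were seen.
func (l *Logger) Summary() { fmt.Printf("errors: %d, warnings: %d\n", l.errors, l.warnings) }

// ErrorsSeen lets the command derive its exit status at completion.
func (l *Logger) ErrorsSeen() bool { return l.errors > 0 }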
openshift-merge-robot added a commit that referenced this pull request Jan 25, 2018
…cs-unify

Automatic merge from submit-queue.

openshift-diagnostics => diagnostics subcommands

Builds on commit from #17773 (diagnostics: individual parameters).
Removes `openshift-diagnostics` in favor of hidden `oc adm diagnostics` subcommands as proposed with #18149 (comment).

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1534513 and #18141
openshift-merge-robot added a commit that referenced this pull request Feb 17, 2018
Automatic merge from submit-queue (batch tested with PRs 16658, 18643).

AppCreate diagnostic

Implements https://trello.com/c/Zv4hVlyQ/130-diagnostic-to-recreate-app-create-loop-script as a diagnostic.

https://trello.com/c/Zv4hVlyQ/27-3-continue-appcreate-diagnostic-work
https://trello.com/c/aNWlMtMk/61-demo-merge-appcreate-diagnostic
https://trello.com/c/H0jsgQwu/63-3-complete-appcreate-diagnostic-functionality

Status:
- [x] Create and cleanup project
- [x] Deploy and cleanup app
- [x] Wait for app to start
- [x] Test ability to connect to app via service
- [x] Test that app responds correctly
- [x] Test ability to connect via route
- [x] Write stats/results to file as json

Not yet addressed in this PR (depending on how reviews progress vs development):
- [ ] Run a build to completion
- [ ] Test ability to attach storage
- [ ] Gather and write useful information (logs, status) on failure

Builds on top of #17773 for handling parameters to the diagnostic as well as #17857 which is a refactor on top of that.