Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
346 changes: 346 additions & 0 deletions enhancements/insights/insights-config-api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,346 @@
---
title: insights-config-api
authors:
- "@tremes"
reviewers:
- "@bparees"
- "@deads2k"
- "@mfojtik"
- "@wking"
- "@soltysh"
approvers:
- "@bparees"
- "@deads2k"
- "@mfojtik"
- "@soltysh"
api-approvers:
- "@deads2k"
- "@mfojtik"
- "@soltysh"
creation-date: 2022-02-16
last-updated: 2022-05-27
status: implementable
tracking-link:
- "https://issues.redhat.com/browse/CCXDEV-6852"
see-also:
- "https://issues.redhat.com/browse/CCXDEV-6852"
- "https://issues.redhat.com/browse/CCX-195"
---

# Define and expose Insights configuration API

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [x] Graduation criteria for dev preview, tech preview, GA
- [x] User-facing documentation is created/updated in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

This enhancement introduces new Insights API resources. The new API will introduce new configuration options (including an option to force the data gathering as well as to disable it completely) as well as status reporting of the Insights operator.

## Motivation

The main motivation for this new API is to allow users to refresh (or force) the gathering of the Insights data (see the [User stories](#user-stories) section for more details) as well as to disable the gathering completely. This new API will allows us to report various Insights status reports (including the Insights analysis report that is now exposed via Prometheus metrics).

### Goals

The goal of this enhancement is to propose a basic structure for a new Insights configuration API and define new configuration options. This includes following:

- Define a new configuration option providing a way to trigger on demand Insights gathering
- Define a new configuration option providing a way to include/exclude particular Insights data gatherers
- Provide Insights status & reporting in the status field in the new API

### Non-Goals

- Move the majority of the current configuration options to a new configuration API - Insights has quite a lot of configuration options
and it does not make sense to expose everything in the new configuration API
- Removing the original configuration via `support` secret

## Proposal

The proposal is to introduce new Insights `CustomResourceDefinition` types in the OpenShift API providing a way to configure the Insights operator and to report its status.
The `support` secret configuration still takes precedence over the new config API.

The proposal for the new `insights.config.openshift.io` resource is:

```go
type Insights struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`

// spec holds user settable values for configuration
// +kubebuilder:validation:Required
// +required
Spec InsightsSpec `json:"spec"`
// status holds observed values from the cluster. They may not be overridden.
// +optional
Status InsightsStatus `json:"status"`
}

type InsightsSpec struct {
// GatherConfig spec attribute includes all the configuration options related to
// gathering of the Insights archive and its uploading to the ingress.
GatherConfig *GatherConfig `json:"gatherConfig,omitempty"`
}

type InsightsStatus struct {
}

type GatherConfig struct {
// DataPolicy allows user to enable additional global obfuscation of the IP addresses and base domain
// in the Insights archive data.
// +kubebuilder:default=NoPolicy
DataPolicy DataPolicy `json:"dataPolicy"`
// ForceGatherReason enables user to force Insights data gathering by setting a new reason.
// When there is some gathering in the progress then it is interrupted.
// When all the gatherers are deactivated by the `DisabledGatherers`, nothing happens.
ForceGatherReason string `json:"forceGatherReason"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens here after the forced gathering, is this field cleared afterwards, or it's being used every time from now on?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value will be cleared as we discussed yesterday. I added the comment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems transient and the rest seem permanent. I'd make two separate values under spect for them.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deads2k are you referring to forced gathering (transient) and the disablement of gatherers (permanent)? If so, that's already handled separately. This field is only for the former, the latter is done through DisabledGatherers below. Or are you talking about something else?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed in slack. Also, this could get worked out in an API PR. It's important, but the fitting impact here is relatively minor

// List of gatherers to be excluded from the gathering. All the gatherers can be disabled by providing "all" value.
// If all the gatherers are disabled, the Insights operator does not gather any data.
// The particular gatherers IDs can be found at https://github.com/openshift/insights-operator/blob/master/docs/gathered-data.md.
// An example of disabling gatherers looks like this: `disabledGatherers: ["clusterconfig/machine_configs", "workloads/workload_info"]`
DisabledGatherers []string `json:"disabledGatherers"`
}

const (
// No data obfuscation
NoPolicy DataPolicy = "NoPolicy"
// IP addresses and cluster domain name is obfuscated
IPsAndClusterDomainPolicy DataPolicy = "IPsAndClusterDomainPolicy"
)

type DataPolicy string
```
The proposal for the new `insights.operator.openshift.io` resource is:

```go
type Insights struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata"`

// spec is the specification of the desired behavior of the Insights
// +kubebuilder:validation:Required
// +required
Spec InsightsSpec `json:"spec"`

// status is the most recently observed status of the Insights operator
// +optional
Status InsightsStatus `json:"status"`
}

type InsightsSpec struct {
OperatorSpec `json:",inline"`
}

type InsightsStatus struct {
OperatorStatus `json:",inline"`
// GatheringStatus provides basic information about the last Insights gathering.
GatheringStatus *GatheringStatus `json:"gatheringStatus,omitempty"`
// ReportStatus provides general Insights analysis results.
ReportStatus *ReportStatus `json:"reportStatus,omitempty"`
}

type GatheringStatus struct {
// LastGatherTime is the last time when Insights gathering finished.
LastGatherTime metav1.Time `json:"lastGatherTime,omitempty"`
// LastGatherReason provides last known reason of gathering. This is helpful
// especially when gathering was forced by user
LastGatherReason string `json:"lastGatherReason,omitempty"`
// StartGatherTime is the time when gathering started. The value is 0
// when there is no gathering in progress.
StartGatherTime metav1.Time `json:"startGatherTime,omitempty"`
// List of active gatherers (and their statuses) in the last gathering.
GathererStatuses []GathererStatus `json:"gathererStatuses,omitempty"`
}

type ReportStatus struct {
// Number of active Insights healthchecks with low severity
LowHealthChecksCount int `json:"low"`
// Number of active Insights healthchecks with moderate severity
ModerateHealthChecksCount int `json:"moderate"`
// Number of active Insights healthchecks with important severity
ImportantHealthChecksCount int `json:"important"`
// Number of active Insights healthchecks with critical severity
CriticalHealthChecksCount int `json:"critical"`
// TotalCount is the count of all active Insights healthchecks
TotalHealthChecksCount int `json:"total"`
}

type GathererStatus struct {
// Name is the name of the gatherer.
Name string `json:"name"`
// GathererConditions provide details on the status of each gatherer.
GathererConditions []GathererCondition `json:"conditions"`
// DurationMillisecond represents the time spent gathering.
DurationMillisecond int64 `json:"durationMillisecond"`
}

type GathererCondition struct {
// Type of the gatherer condition
Type GathererConditionType `json:"type"`
// Status is last known status of the particular gatherer condition
Status corev1.ConditionStatus `json:"status"`
// Messages is an optional attribute that provides error and warning messages
// from the gatherer
Messages []string `json:"messages,omitempty"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you're using condition, I'd suggest also adding at least LastTransitionTime and Reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't notice that earlier, but since each gatherer has its own status GathererStatus, and within that you have an array of conditions GathererConditions you could probably change this Messages array to be just Message string I don't expect that much of data in it, based on my previous experience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as the #1037 (comment). I think it's more convenient as a slice for our purposes. We can have more various messages as in https://github.com/openshift/insights-operator/blob/master/docs/insights-archive-sample/insights-operator/gathers.json#L7

// Last time the condition transit from one status to another.
LastTransitionTime metav1.Time `json:"lastTransitionTime,omitempty"`
// Reason for the condition's last transition.
Reason string `json:"reason,omitempty" protobuf:"bytes,5,opt,name=reason"`
}

// GathererConditionType is a valid value for GathererCondition.Type
type GathererConditionType string

const (
// Gatherer was successful withnout any error reported
Successful GathererConditionType = "Successful"
// Gatherer was disabled by user
Disabled GathererConditionType = "Disabled"
// Gatherer failed to run
Failed GathererConditionType = "Failed"
// Gatherer was running, but there were some errors
Warning GathererConditionType = "Warning"
)
```
### Workflow Description

See the [User Stories](#user-stories) and other sections.

### User Stories
#### Force gathering - refresh Insights status
As an OCP user/cluster-admin I would like to be able to trigger on demand gathering (there were some active Insights health checks in the cluster, which should be fixed and I want to refresh the Insights status). Steps:
- the user will set a new value of `ForceGatherReason`
- if all the gatherers are disabled (`disabledGatherers: ["all"]`), nothing happens
- if there is some gathering in progress (`StartGatherTime` is not nil in the `GatheringStatus` section) then it will be interrupted and the new gathering will be triggered
- if there is no gathering in progress (`StartGatherTime` is nil in the `GatheringStatus` section) then the new gathering will be triggered almost immediately
- user can check the `StartGatherTime` or the `LastGatheringTime` attribute in the `GatheringStatus` section to check if the new gathering already finished
- when the gathering is finished then the `LastGatherReason` and `LastGatherTime` in the `GatheringStatus` are updated

#### Disable Insights gathering completely
As an OCP user/cluster-admin I would like to disable Insights gathering completely. Steps:
- Set the `DisabledGatherers` to `["all"]` and the gathering will be turned off - i.e no new data will be gathered

#### Exposing latest Insights status
As an OCP user/cluster-admin I would like to see the latest Insights status. The status is represented by the `InsightsStatus` and it is only updated by the operator.

#### Excluding particular gatherers
As an OCP user/cluster-admin I would like to exclude particular gatherers. Steps:
- Set the `DisabledGatherers` with the list of gatherers to be excluded/disabled - e.g `DisabledGatherers: ["clusterconfig/config_maps", "clusterconfig/node_logs"]`

### API Extensions

This is adding new Insights resource in the `config.openshift.io/v1` and in the `operator.openshift.io/v1` API resources. The proposal of the structure of the new API is mentioned in the [Proposal](#Proposal) section.

### Risks and Mitigations

Risk: Insights operator is not able to read the new config API.

Mitigation: The Insights operator is marked as degraded in the corresponding clusteroperator condition.

Risk: Somebody/something is repeatedly updating the `ForceGatherReason` attribute and thus forcing gathering. There is a risk that
the cluster would upload data too often, but this assumes that the "repeating interval" allows uploading (if it's forced too quickly then there will be no upload, because the gathering is not over).

Mitigation: Maybe we should introduce some limit on the number of forced gatherings within 1 hour, for example.

## Design Details

### On demand data gathering

The on demand data gathering can happen almost immediately when there is no active gathering in the progress, but when there is some in progress
then it will be interrupted and the new gathering will be triggered.

The forced gathering is defined as follows:
- force it - e.g using a `ForceGatherReason` attribute where the user would provide some string as reason (mentioned in the [Proposal](#proposal)).
- pros
- providing a reason (e.g “Refresh Insights”) is simpler for configuration than providing a time
- gathering would be triggered almost immediately
- there is already similar precedent with `ForceRedeploymentReason` when redeploying static pods
- cons
- need to be able to interrupt gathering (if any in progress)

### Open Questions

- ~~Can the API define default values? How does it work or how is it implemented?~~ This is done by using `// +kubebuilder:default=`
- ~~Should the new time of forced gathering (if any defined by user) set a new time for the existing gathering interval? If I force gathering at time A, will the new automatic gathering be in time A + interval?~~ - yes. It makes sense to update the last known time of gathering so that the regular gathering does not happen immediately after the forced gathering.
- Should the `ForceGatherReason` attribute have its own configuration (given the fact that it's transient config) or reuse the existing one? - For now we decided to use the existing `GatherConfig` configuration.

### Test Plan

The new config API will require updates of the existing Insights operator testsuite. The testsuite will be extended to cover the new API and respective attributes.
Following behavior will need to be tested:
- `support` secret config takes precedence over the new config API
- forced gathering works as expected
- disabling some gatherers works as expected and diasbling all will disable the gathering completely

### Graduation Criteria

The plan is to introduce the first version of the new API behind the `TechPreviewNoUpgrade` feature gate. We would like to
keep the original way of configuring the Insights operator (that is via `support` secret), because it is problematic to move every existing configuration option to the new configuration API and the existing `support` secret serves basically as a configuration API.

#### Dev Preview -> Tech Preview

The original way of configuring the Insights is still available. If there is a conflict, in other words, if there is some configuration option defined in the `support` secret as well as in the new configuration API then the value from the `support` secret takes precedence.

#### Tech Preview -> GA

The `TechPreviewNoUpgrade` feature gate requirement is removed. The behavior defined above in the [Dev Preview -> Tech Preview](#dev-preview---tech-preview) section is still true.

Other than that we would like to:
- get some time to gather feedback and some user experience
- ensure stability
- do more e2e testing

#### Removing a deprecated feature

The existing configuration via `support` secret should be still available for backward compatibility reasons. The reason of the `support` secret existence is the option of using basic authentication for the Ingress service, which Insights operator uses as the upload endpoint.

### Upgrade / Downgrade Strategy

No update of the configuration is required. The existing configuration in the `support` secret continues to work. When there is an existing `insights.config.openshift.io` kind in a cluster after upgrade, then the user can optionally use the new configuration API.
There can be a conflict when a configuration option is defined in the `support` secret as well as in the configuration API. The configuration in the `support` secret takes precedence in this case.
If you downgrade to a cluster without the Insights API then the `support` secret configuration will continue to work as before, or it must be created (and possibly configured accordingly)).

### Version Skew Strategy

### Operational Aspects of API Extensions

This API is provided exclusively for the Insights operator and there will be no other component depending on this new API. Current configobserver in the Insights operator runs every 5 minutes to check the `support` secret values.
The new API attributes will have a watch (using an informer).

#### Failure Modes

As noted in the [Risk and Mitigations](#risks-and-mitigations) section, the only possible failure is probably when the Insights operator
cannot communicate with the config API.

#### Support Procedures

The new config API will require a new OCP documentation chapter describing the new way of the Insights configuration
and the new way of the Insights status reporting.

## Implementation History

As noted in the [Graduation criteria](#graduation-criteria) section, the first API version will be behind the `TechPreviewNoUpgrade` feature gate.

### Drawbacks

The original reason for having the config in the `support` secret was the basic authentication option to communicate with the Ingress service.
Currently the preferred authentication option is to use the token provided in the `pull-secret` secret,
but the basic authentication option might be still valid for some use cases (e.g testing with Ingress service in the staging environment) so we should preserve this option. This is described more in the [Graduation](#graduation-criteria) and [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) sections.

## Alternatives

The alternative is to keep all the original configuration options in the `support` secret, but this is not very suitable for the new configuration options (`ForceGatherReason` and `DisabledGatherers`) introduced in this enhancement and it does not allow to expose the Insights status in the API.

The alternative to forcing the gathering is to schedule it. The alternative could look like following:

- schedule it - e.g using something like the `ScheduleGatheringAt` attribute where the user would provide some time in the future.
- Pros
- no need to interrupt gathering in progress
- we could probably use the `RateLimiting` queue and add a new gathering to the queue with the `AddAfter` method. Otherwise we would need to define some waiting time (when there’s already some gathering in progress) and right now it’s difficult to estimate an average gathering time since the gatherers run in parallel
- Cons
- user has to define the scheduling time correctly in the future
- gathering doesn’t happen immediately