openshift · openshift-merge-robot · Dec 16, 2021 · Jul 15, 2021 · Jul 16, 2021 · Jul 16, 2021
diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md
@@ -0,0 +1,294 @@
+---
+title: conditional-data-gathering
+authors:
+  - "@Sergey1011010"
+reviewers:
+  - "@inecas"
+  - "@tremes"
+  - "@sttts"
+  - "@deads2k"
+approvers:
+  - "@tremes"
+creation-date: 2021-07-15
+last-updated: 2021-07-15
+status: implementable
+---
+
+# Conditional Data Gathering
+
+## Release Signoff Checklist
+
+- [x] Enhancement is `implementable`
+- [x] Design details are appropriately documented from clear requirements
+- [x] Test plan is defined
+- [ ] Operational readiness criteria is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)
+
+## Summary
+
+The conditional gatherer for Insights Operator collects data according to the defined gathering rules*.
+Each rule contains one or more conditions such as "alert A is firing"
+and one or more gatherers with parameters such as "collect X lines of logs from containers in namespace N".
+Current version has these rules defined in the code and the proposal is to load them from an external source
+to make collection of new conditional data faster. It's NOT about running brand new code,
+but just calling the existing gatherers with different validated parameters, so the operator can't
+collect anything we don't expect.
+
+\* note that rule here and later has nothing to do with rules used to analyze data in archives written
+by CCX presentation team
+
+## Motivation
+
+Collecting data on some condition such as an alert is quite common pattern
+during the root-cause analysis and the ability to define the rules
+what to collect on which condition externally will help to do that faster.
+
+### Goals
+
+- Implement validation of the rules for conditional gatherer
+- Create a service taking gathering rules from the git repo and providing them through the API
+- Fetch the rules definitions from the service's API and apply them to the conditional gatherer
+
+### Non-Goals
+
+It's NOT about running brand new code.
+
+## Proposal
+
+Right now we have conditional gatherer which can collect data based on some conditions
+which are defined in the code, for example the next config would collect 100 lines of logs
+from each container in namespace `openshift-cluster-samples-operator`
+and image streams of the same namespace when alert `SamplesImagestreamImportFailing` is firing
+and add it to the next archive.
+
+```json
+[{
+  "conditions": [{
+    "type": "AlertIsFiring",
+    "alert": {
+      "name": "SamplesImagestreamImportFailing"
+    }
+  }],
+  "gathering_functions": {
+    "logs_of_namespace": {
+      "namespace": "openshift-cluster-samples-operator",
+      "tail_lines": 100
+    },
+    "image_streams_of_namespace": {
+      "namespace": "openshift-cluster-samples-operator"
+    }
+  }
+}]
+```
+
+The API for conditions definition is unified with
+[Targeted Update Edge Blocking API](https://github.com/openshift/enhancements/pull/821)
+
+Conditions can have type and parameters, implemented types are:
+
+- `AlertIsFiring` which is met when the alert with the name from parameter `name` is firing
+
+A gathering function consists of its name and parameters, implemented functions are:
+
+- `logs_of_namespace` which collects logs from each container in the namespace from parameter `namespace`
+limiting it to only last N lines from parameter `tail_lines`
+- `image_streams_of_namespace` which collects image streams of the namespace from parameter `namespace`
+
+Conditions to be implemented:
+
+- `ClusterVersionMatches` with parameter `version` makes the rule to be applied on specific cluster version.
+It will use semantic versioning so you can wildcards and ranges like here https://github.com/blang/semver#features
+
+The proposal is to implement the next process of adding new rules:
+
+1. We'll have a repo with json configs defining the rules. The repo is going to have a simple CI with validation
+against JSON schema and possibly some review process. The repo should live in https://github.com/RedHatInsights
+We have created a PoC for the repo:
+
+https://github.com/tremes/insights-operator-gathering-rules
+
+The JSON schema can be found here:
+
+https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rule.schema.json
+
+2. There will be a service living in console.redhat.com which will have the rules baked into the container and will
+provide all the rules through its API. The very first version of the service is going to be simple, but later we may
+add some filtering on the API level (for example by cluster version).
+
+3. Insights Operator fetches a config with the rules from the service and unmarshalls JSON to Go structs.
+
+4. Insights Operator makes the most important validation against the next JSON schemas:
+
+- https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rule.schema.json
+- https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rules.schema.json
+
+which check the next things:
+
+- The JSON version of the config matches the structs defined in the code
+- The maximum number of rules is 64
+- The rules should not repeat
+- There can be up to 8 conditions in each rule
+- Only implemented conditions can be used
+- Alert name from `AlertIsFiring` condition should be a string of length between 1 and 128
+and consist of only alphanumeric characters
+- For each rule, there's at least one gathering function
+- Only implemented gathering functions can be used
+- There can be up to 8 gathering functions in each rule
+- Namespace from `gather_logs_of_namespace` function should be a string of length between 1 and 128,
+match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-`
+- Tail lines from `gather_logs_of_namespace` function should be an integer between 1 and 4096
+- Namespace from `gather_image_streams_of_namespace` function should be a string of length between 1 and 128,
+match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-`
+
+If anything fails to be validated, the config is rejected and an error is written to the logs.
+The PR with validation on operator side - https://github.com/openshift/insights-operator/pull/483
+
+5. Insights Operator goes rule by rule and collects the requested data if corresponding conditions are met
+
+The new gathering functions and conditions are added the same way as any other code changes to insights-operator.
+The JSON schema also lives in the operator repo and is changed through the regular process.
+
+### User Stories
+
+For OpenShift engineers it will be easier to get necessary information when some problem occurs (like alert is firing).
+Getting such information now means changing operator's code which is especially painful with older versions of clusters.
+When the data is not needed anymore, the rule can be simply removed from the repo.
+
+### API Extensions
+
+Empty
+
+### Implementation Details/Notes/Constraints [optional]
+
+Empty
+
+### Risks and Mitigations
+
+The potential risks could come from an attacker spoofing the config somehow (if they got access to the repo),
+all they could do is let the insights operator collect more data and send it to the c.r.c, but because of validation
+on the operator side, the potential of collecting data is limited and there still would be
+the same anonymization of potentially sensitive data as before.
+For example, we check that namespaces start with `openshift-`, we're also limiting
+the amount of potentially collected data by, for example, introducing the limit for amount of collected logs
+(per container and the number of containers) and, in the worst case, if the conditional gathering takes too much time
+it would be stopped by the timeout and we would just get less data.
+
+Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator)
+and probably a simple review process.
+
+Not really a risk, but in case the service is not available or provides invalid config for the operator,
+we would just have less data in the archives (everything except conditional gatherer's data),
+an error would be written to the logs and to the metadata which would allow us to know
+that something is broken and we need to bring the service back.
+
+In case GitHub is not available, we won't be able to add new rules, but the old ones would still be returned by
+the service.
+
+## Design Details
+
+On the initial start the conditional gatherer will try to fetch the config from the service
+by doing http(s) request to the appropriate endpoint `https://cloud.redhat.com/.../gathering_rules`.
+The token from pull secret is used for authentication. There is a timeout of one minute for the request. 
+The response is then cached and sent again only when 12 hours pass since the last request, 
+this value can be configured. In case when Insights Operator can't get the new set of gathering rules,
+the old ones will be used. In case of an error the service returns non 200 response.
+In case of success, the service responds with the rules in the next format:
+
+```json
+{
+  "version": "1.0",
+  "rules": [
+    {
+      "conditions": [
+        {
+          "type": "alert_is_firing",
+          "alert": {
+            "name": "ClusterVersionOperatorIsDown"
+          }
+        },
+        {
+          "type": "cluster_version_matches",
+          "cluster_version": {
+            "version": "4.8.x"
+          }
+        }
+      ],
+      "gathering_functions": {
+        "gather_logs_of_namespace": {
+          "namespace": "openshift-cluster-version",
+          "keep_lines": 100
+        }
+      }
+    }
+  ]
+}
+```
+
+`version` fields shows the version of the rules. `rules` contains the rules for the conditional gatherer.
+
+### Open Questions [optional]
+
+Empty
+
+### Test Plan
+
+The conditional gatherer is covered with unit tests considering many different scenarios.
+Later there will be integration tests as well.
+
+### Graduation Criteria
+
+Empty
+
+#### Dev Preview -> Tech Preview
+
+Empty
+
+#### Tech Preview -> GA
+
+Empty
+
+#### Removing a deprecated feature
+
+Empty
+
+### Upgrade / Downgrade Strategy
+
+The conditional gatherer itself (e.g. adding new/removing old conditions or gathering functions) is
+modified through a standard process like all other operator's code. Only the rules to collect X with Y params on Z
+are updated through the new process described above.
+
+### Version Skew Strategy
+
+Out of scope.
+
+### Operational Aspects of API Extensions
+
+Empty
+
+#### Failure Modes
+
+Empty
+
+#### Support Procedures
+
+Empty
+
+## Implementation History
+
+Major milestones in the life cycle of a proposal should be tracked in `Implementation
+History`.
+
+## Drawbacks
+
+None?
+
+## Alternatives
+
+Thin IO idea (https://github.com/openshift/enhancements/pull/465) could also solve this, but thin IO was about adding
+new code bypassing the standard release process which could potentially be very dangerous.
+
+## Infrastructure Needed [optional]
+
+- The repo in RedHatInsights
+- The service living in console.redhat.com