From ad786ebf592733c6387f96ddbe09a674dca86c82 Mon Sep 17 00:00:00 2001
From: Serhii Zakharov
Date: Thu, 15 Jul 2021 20:56:13 +0200
Subject: [PATCH 01/18] conditional data gathering

---
 .../insights/conditional-data-gathering.md | 194 ++++++++++++++++++
 1 file changed, 194 insertions(+)
 create mode 100644 enhancements/insights/conditional-data-gathering.md

diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md
new file mode 100644
index 0000000000..11058e4820
--- /dev/null
+++ b/enhancements/insights/conditional-data-gathering.md
@@ -0,0 +1,194 @@
---
title: conditional-data-gathering
authors:
  - "@Sergey1011010"
reviewers:
  - "@inecas"
  - "@tremes"
  - "@smarterclayton"
approvers:
  - "@smarterclayton"
creation-date: 2021-07-15
last-updated: 2021-07-15
status: implementable
---

# Conditional Data Gathering

## Release Signoff Checklist

- [x] Enhancement is `implementable`
- [x] Design details are appropriately documented from clear requirements
- [x] Test plan is defined
- [ ] Operational readiness criteria is defined
- [ ] Graduation criteria for dev preview, tech preview, GA
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/)

## Summary

The conditional gatherer for Insights Operator collects data according to the defined rules*.
Each rule contains one or more conditions such as "alert A is firing"
and one or more gatherers with parameters such as "collect N lines of logs from containers in namespace N".
The current version has these rules defined in the code, and the proposal is to load them from an external source
to make collection of new conditional data faster. Unlike the Thin IO idea, this is not about running brand-new
code, but about calling the existing gatherers with different validated parameters, so the operator can't
collect anything we don't expect.

\* Note that "rule" here and below has nothing to do with the rules used to analyze data in archives,
which are written by the CCX presentation team.

## Motivation

Collecting data on some condition, such as a firing alert, is quite a common pattern
during root-cause analysis, and the ability to define externally which data to collect
on which condition will help to do that faster.

### Goals

- Implement validation of the rules for the conditional gatherer
- Create a service providing these rules
- Fetch the rules from the service and apply them to the conditional gatherer

### Non-Goals

Empty

## Proposal

Right now we have a conditional gatherer which can collect data based on conditions
defined in the code. For example, the following config would collect 100 lines of logs
from each container in the namespace `openshift-cluster-samples-operator`,
plus the image streams of the same namespace, when the alert `SamplesImagestreamImportFailing` is firing,
and add them to the next archive.

```json
[{
    "conditions": [{
        "type": "alert_is_firing",
        "params": {
            "name": "SamplesImagestreamImportFailing"
        }
    }],
    "gathering_functions": {
        "logs_of_namespace": {
            "namespace": "openshift-cluster-samples-operator",
            "tail_lines": 100
        },
        "image_streams_of_namespace": {
            "namespace": "openshift-cluster-samples-operator"
        }
    }
}]
```

Conditions have a type and parameters; the implemented types are:

- `alert_is_firing`, which is met when the alert with the name from parameter `name` is firing

A gathering function consists of its name and parameters; the implemented functions are:

- `logs_of_namespace`, which collects logs from each container in the namespace from parameter `namespace`,
limited to the last N lines given by parameter `tail_lines`
- `image_streams_of_namespace`, which collects image streams of the namespace from parameter `namespace`

The proposal is to implement the following process for adding new rules:

1. We have a repo with JSON configs defining the rules. The repo will have some simple CI with validation
against a JSON schema and possibly some review process.
We have created a PoC for the repo:

https://github.com/tremes/insights-operator-gathering-rules

or the version with an example of JSON schema validation:

https://github.com/tremes/insights-operator-gathering-rules/tree/schema

2. There's a service living in cloud.redhat.com fetching the rules from the repo and providing them through its API.
The very first version is just going to provide all the rules, but later we may consider splitting them by
cluster version and introducing some more complicated logic around fetching the rules
3. Insights Operator fetches a config with the rules from the service and unmarshals the JSON to Go structs
(see the sketch after the user stories below)
4. Insights Operator performs the most important validation, which checks the following:

- The JSON version of the config matches the structs defined in the code
- For each rule, there's at least one gathering function
- Only implemented conditions can be used
- The alert name from the `alert_is_firing` condition should be a string of length between 1 and 128
and consist of only alphanumeric characters
- Only implemented gathering functions can be used
- The namespace from the `gather_logs_of_namespace` function should be a string of length between 1 and 128,
match the regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-`
- Tail lines from the `gather_logs_of_namespace` function should be an integer between 1 and 4096
- The namespace from the `gather_image_streams_of_namespace` function should be a string of length between 1 and 128,
match the regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-`

If anything fails validation, the config is rejected and an error is written to the logs.
The PR with validation on the operator side: https://github.com/openshift/insights-operator/pull/470

5. Insights Operator goes rule by rule and collects the requested data if the corresponding conditions are met

### User Stories

For OpenShift engineers it will be easier to get the necessary information when some problem occurs (like an alert firing).
Getting such information now means changing the operator's code, which is especially painful with older versions of clusters.
When the data is not needed anymore, the rule can simply be removed from the repo.
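
As context for steps 3 and 4 above, here is a minimal Go sketch of what unmarshalling the rules into
structs and applying a couple of the checks could look like. The struct and function names are
illustrative only; the real definitions live in the insights-operator repo and the linked PR.

```go
package conditional

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// Illustrative shapes only; the real structs live in the insights-operator repo.
type GatheringRule struct {
	Conditions         []Condition                `json:"conditions"`
	GatheringFunctions map[string]json.RawMessage `json:"gathering_functions"`
}

type Condition struct {
	Type   string            `json:"type"`
	Params map[string]string `json:"params"`
}

var namespaceRe = regexp.MustCompile(`^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$`)

// parseAndValidate unmarshals the fetched config (step 3) and applies a couple
// of the checks listed in step 4; the full validation is richer than this.
func parseAndValidate(data []byte) ([]GatheringRule, error) {
	var rules []GatheringRule
	if err := json.Unmarshal(data, &rules); err != nil {
		return nil, fmt.Errorf("config does not match the expected structs: %w", err)
	}
	for i, rule := range rules {
		if len(rule.GatheringFunctions) == 0 {
			return nil, fmt.Errorf("rule %d has no gathering functions", i)
		}
	}
	return rules, nil
}

// validNamespace mirrors the namespace checks from step 4.
func validNamespace(ns string) bool {
	return len(ns) >= 1 && len(ns) <= 128 &&
		namespaceRe.MatchString(ns) &&
		strings.HasPrefix(ns, "openshift-")
}
```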

### Implementation Details/Notes/Constraints [optional]

Empty

### Risks and Mitigations

The potential risks come from an attacker spoofing the config somehow (if they got access to the repo);
all they could do is make the Insights Operator collect more data and send it to the c.r.c. Because of validation
on the operator side, the potential for collecting data is limited. For example,
we check that namespaces start with `openshift-`, and we also limit the amount of potentially collected data by,
for example, introducing a limit on the amount of collected logs.

Also, in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator)
and probably a simple review process.

## Design Details

Empty

### Open Questions [optional]

Empty

### Test Plan

The conditional gatherer is covered with unit tests considering many different scenarios.
Later there will be integration tests as well.

### Graduation Criteria

Empty

### Upgrade / Downgrade Strategy

The conditional gatherer itself (e.g. adding new or removing old conditions or gathering functions) is
modified through a standard process like all other operator code. Only the rules to collect X with params Y on condition Z
are updated through the new process described above.

### Version Skew Strategy

Out of scope.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Drawbacks

None?

## Alternatives

The Thin IO idea could also solve this, but Thin IO was about adding new code bypassing the standard release process,
which could potentially be very dangerous.

## Infrastructure Needed [optional]

- The repo in RedHatInsights
- The service living in cloud.redhat.com

From f8493440d2e8cb4c8dbe38b7b8a7c276b57e40f3 Mon Sep 17 00:00:00 2001
From: Serhii Zakharov
Date: Fri, 16 Jul 2021 13:15:54 +0200
Subject: [PATCH 02/18] mention thin io only in alternatives with a link

---
 enhancements/insights/conditional-data-gathering.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md
index 11058e4820..51b67ad257 100644
--- a/enhancements/insights/conditional-data-gathering.md
+++ b/enhancements/insights/conditional-data-gathering.md
@@ -30,8 +30,8 @@ The conditional gatherer for Insights Operator collects data according to the de
Each rule contains one or more conditions such as "alert A is firing"
and one or more gatherers with parameters such as "collect N lines of logs from containers in namespace N".
The current version has these rules defined in the code, and the proposal is to load them from an external source
to make collection of new conditional data faster. It's NOT about running brand new code,
but just calling the existing gatherers with different validated parameters, so the operator can't
collect anything we don't expect.

\* Note that "rule" here and below has nothing to do with the rules used to analyze data in archives,
which are written by the CCX presentation team.

@@ -185,8 +185,8 @@ None?

## Alternatives

The Thin IO idea could also solve this, but Thin IO was about adding new code bypassing the standard release process,
which could potentially be very dangerous.
+Thin IO idea (https://github.com/openshift/enhancements/pull/465) could also solve this, but thin IO was about adding +new code bypassing the standard release process which could potentially be very dangerous. ## Infrastructure Needed [optional] From c742a0418be39e67c8d90eb0dde989c0d3bd619d Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Fri, 16 Jul 2021 13:27:46 +0200 Subject: [PATCH 03/18] example of what would happen in case the service is not available --- enhancements/insights/conditional-data-gathering.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 51b67ad257..31c7ab1907 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -147,6 +147,11 @@ for example, introducing the limit for amount of collected logs. Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator) and probably a simple review process. +Not really a risk, but in case the service is not available or provides invalid config for the operator, +we would just have less data in the archives (everything except conditional gatherer's data), +an error would be written to the logs and to the metadata which would allow us to know +that something is broken and we need to bring the service back. + ## Design Details Empty From 3bf0924c6506a4839395b54344e3659fb46a525c Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Thu, 22 Jul 2021 16:33:37 +0200 Subject: [PATCH 04/18] a few fixes --- enhancements/insights/conditional-data-gathering.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 31c7ab1907..8199ff3da8 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -26,7 +26,7 @@ status: implementable ## Summary -The conditional gatherer for Insights Operator collects data according to the defined rules*. +The conditional gatherer for Insights Operator collects data according to the defined gathering rules*. Each rule contains one or more conditions such as "alert A is firing" and one or more gatherers with parameters such as "collect N lines of logs from containers in namespace N". Current version has these rules defined in the code and the proposal is to load them from an external source @@ -46,8 +46,8 @@ what to collect on which condition externally will help to do that faster. ### Goals - Implement validation of the rules for conditional gatherer -- Create a service providing these rules -- Fetch them from there and apply to the conditional gatherer +- Create a service taking gathering rules from the git repo and providing them through the API +- Fetch the rules definitions from the service's API and apply them to the conditional gatherer ### Non-Goals @@ -103,7 +103,7 @@ or the version with an example of JSON schema validation: https://github.com/tremes/insights-operator-gathering-rules/tree/schema -2. There's a service living in cloud.redhat.com fetching the rules from the repo and providing them through its API. +2. There's a service living in console.redhat.com fetching the rules from the repo and providing them through its API. 
The very first version is going to provide just all the rules, but later we may consider splitting them by cluster version and introducing some more complicated logic around fetching the rules 3. Insights Operator fetches a config with the rules from the service and unmarshalls JSON to Go structs @@ -196,4 +196,4 @@ new code bypassing the standard release process which could potentially be very ## Infrastructure Needed [optional] - The repo in RedHatInsights -- The service living in cloud.redhat.com +- The service living in console.redhat.com From eca1dacb3cbdba2c08cea7e32fba4137b4680de8 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Tue, 10 Aug 2021 23:09:55 +0200 Subject: [PATCH 05/18] updated the enhancement --- .../insights/conditional-data-gathering.md | 46 +++++++++++++------ 1 file changed, 31 insertions(+), 15 deletions(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 8199ff3da8..b124e5779f 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -28,7 +28,7 @@ status: implementable The conditional gatherer for Insights Operator collects data according to the defined gathering rules*. Each rule contains one or more conditions such as "alert A is firing" -and one or more gatherers with parameters such as "collect N lines of logs from containers in namespace N". +and one or more gatherers with parameters such as "collect X lines of logs from containers in namespace N". Current version has these rules defined in the code and the proposal is to load them from an external source to make collection of new conditional data faster. It's NOT about running brand new code, but just calling the existing gatherers with different validated parameters, so the operator can't @@ -93,28 +93,39 @@ limiting it to only last N lines from parameter `tail_lines` The proposal is to implement the next process of adding new rules: -1. We have a repo with json configs defining the rules. The repo will have some simple CI with validation -against JSON schema and possibly some review process. +1. We'll have a repo with json configs defining the rules. The repo is going to have a simple CI with validation +against JSON schema and possibly some review process. The repo should live in https://github.com/RedHatInsights We have created a PoC for the repo: https://github.com/tremes/insights-operator-gathering-rules -or the version with an example of JSON schema validation: +The JSON schema can be found here: -https://github.com/tremes/insights-operator-gathering-rules/tree/schema +https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rule.schema.json -2. There's a service living in console.redhat.com fetching the rules from the repo and providing them through its API. -The very first version is going to provide just all the rules, but later we may consider splitting them by -cluster version and introducing some more complicated logic around fetching the rules -3. Insights Operator fetches a config with the rules from the service and unmarshalls JSON to Go structs -4. Insights Operator makes the most important validation which checks the next things: +2. There will be a service living in console.redhat.com which will have the rules baked into the container and will +provide all the rules through its API. 
The very first version of the service is going to be simple, but later we may +add some filtering on the API level (for example by cluster version). + +3. Insights Operator fetches a config with the rules from the service and unmarshalls JSON to Go structs. + +4. Insights Operator makes the most important validation against the next JSON schemas: + +- https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rule.schema.json +- https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rules.schema.json + +which check the next things: - The JSON version of the config matches the structs defined in the code -- For each rule, there's at least one gathering function +- The maximum number of rules is 64 +- The rules should not repeat +- There can be up to 8 conditions in each rule - Only implemented conditions can be used - Alert name from `alert_is_firing` condition should be a string of length between 1 and 128 and consist of only alphanumeric characters +- For each rule, there's at least one gathering function - Only implemented gathering functions can be used +- There can be up to 8 gathering functions in each rule - Namespace from `gather_logs_of_namespace` function should be a string of length between 1 and 128, match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-` - Tail lines from `gather_logs_of_namespace` function should be an integer between 1 and 4096 @@ -122,10 +133,13 @@ match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-` If anything fails to be validated, the config is rejected and an error is written to the logs. -The PR with validation on operator side - https://github.com/openshift/insights-operator/pull/470 +The PR with validation on operator side - https://github.com/openshift/insights-operator/pull/483 5. Insights Operator goes rule by rule and collects the requested data if corresponding conditions are met +The new gathering functions and conditions are added the same way as any other code changes to insights-operator. +The JSON schema also lives in the operator repo and is changed through the regular process. + ### User Stories For OpenShift engineers it will be easier to get necessary information when some problem occurs (like alert is firing). @@ -140,9 +154,11 @@ Empty The potential risks could come from an attacker spoofing the config somehow (if they got access to the repo), all they could do is let the insights operator collect more data and send it to the c.r.c, but because of validation -on the operator side, the potential of collecting data is limited. For example, -we check that namespaces start with `openshift-`, we're also limiting the amount of potentially collected data by, -for example, introducing the limit for amount of collected logs. +on the operator side, the potential of collecting data is limited and there still would be +the same anonymization of potentially sensitive data as before. +For example, we check that namespaces start with `openshift-`, we're also limiting +the amount of potentially collected data by, for example, introducing the limit for amount of collected logs +(per container and the number of containers). 
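
To make that log-volume limit concrete, here is a hedged client-go sketch of gathering at most
`tail_lines` lines per container. The function and variable names are illustrative assumptions,
not the operator's actual implementation.

```go
package conditional

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// gatherNamespaceLogs reads at most tailLines lines from every container in
// the namespace, so even a spoofed rule cannot exfiltrate complete logs.
func gatherNamespaceLogs(ctx context.Context, cs kubernetes.Interface, ns string, tailLines int64) ([][]byte, error) {
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var result [][]byte
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			req := cs.CoreV1().Pods(ns).GetLogs(pod.Name, &corev1.PodLogOptions{
				Container: c.Name,
				TailLines: &tailLines, // the validated 1..4096 limit from the rule
			})
			stream, err := req.Stream(ctx)
			if err != nil {
				continue // one failing container shouldn't fail the whole gather
			}
			data, _ := io.ReadAll(stream)
			stream.Close()
			result = append(result, data)
		}
	}
	return result, nil
}
```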
Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator) and probably a simple review process. From 25370353a93f276a4d972badf6f5fefefcf35b18 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Wed, 11 Aug 2021 10:47:18 +0200 Subject: [PATCH 06/18] one more thing to the risks --- enhancements/insights/conditional-data-gathering.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index b124e5779f..daa8cabfae 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -168,6 +168,9 @@ we would just have less data in the archives (everything except conditional gath an error would be written to the logs and to the metadata which would allow us to know that something is broken and we need to bring the service back. +In case GitHub is not available, we won't be able to add new rules, but the old ones would still be returned by +the service. + ## Design Details Empty From ea932443baa20cdfdc907fa4ff676412049f625b Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Wed, 11 Aug 2021 10:58:07 +0200 Subject: [PATCH 07/18] elaborate on timeout --- enhancements/insights/conditional-data-gathering.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index daa8cabfae..66aab6f0b0 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -158,7 +158,8 @@ on the operator side, the potential of collecting data is limited and there stil the same anonymization of potentially sensitive data as before. For example, we check that namespaces start with `openshift-`, we're also limiting the amount of potentially collected data by, for example, introducing the limit for amount of collected logs -(per container and the number of containers). +(per container and the number of containers) and, in the worst case, if the conditional gathering takes too much time +it would be stopped by the timeout and we would just get less data. Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator) and probably a simple review process. From 92dd7e8d1a2353376ba32b85568d206ee34838db Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Mon, 16 Aug 2021 15:48:06 +0200 Subject: [PATCH 08/18] lint --- .../insights/conditional-data-gathering.md | 74 +++++++++++-------- 1 file changed, 43 insertions(+), 31 deletions(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 66aab6f0b0..2cec15ed4d 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -26,21 +26,21 @@ status: implementable ## Summary -The conditional gatherer for Insights Operator collects data according to the defined gathering rules*. -Each rule contains one or more conditions such as "alert A is firing" +The conditional gatherer for Insights Operator collects data according to the defined gathering rules*. +Each rule contains one or more conditions such as "alert A is firing" and one or more gatherers with parameters such as "collect X lines of logs from containers in namespace N". 
Current version has these rules defined in the code and the proposal is to load them from an external source -to make collection of new conditional data faster. It's NOT about running brand new code, -but just calling the existing gatherers with different validated parameters, so the operator can't -collect anything we don't expect. +to make collection of new conditional data faster. It's NOT about running brand new code, +but just calling the existing gatherers with different validated parameters, so the operator can't +collect anything we don't expect. -\* note that rule here and later has nothing to do with rules used to analyze data in archives written +\* note that rule here and later has nothing to do with rules used to analyze data in archives written by CCX presentation team ## Motivation -Collecting data on some condition such as an alert is quite common pattern -during the root-cause analysis and the ability to define the rules +Collecting data on some condition such as an alert is quite common pattern +during the root-cause analysis and the ability to define the rules what to collect on which condition externally will help to do that faster. ### Goals @@ -55,9 +55,9 @@ Empty ## Proposal -Right now we have conditional gatherer which can collect data based on some conditions -which are defined in the code, for example the next config would collect 100 lines of logs -from each container in namespace `openshift-cluster-samples-operator` +Right now we have conditional gatherer which can collect data based on some conditions +which are defined in the code, for example the next config would collect 100 lines of logs +from each container in namespace `openshift-cluster-samples-operator` and image streams of the same namespace when alert `SamplesImagestreamImportFailing` is firing and add it to the next archive. @@ -67,7 +67,7 @@ and add it to the next archive. "type": "alert_is_firing", "params": { "name": "SamplesImagestreamImportFailing" - } + } }], "gathering_functions": { "logs_of_namespace": { @@ -95,7 +95,7 @@ The proposal is to implement the next process of adding new rules: 1. We'll have a repo with json configs defining the rules. The repo is going to have a simple CI with validation against JSON schema and possibly some review process. The repo should live in https://github.com/RedHatInsights -We have created a PoC for the repo: +We have created a PoC for the repo: https://github.com/tremes/insights-operator-gathering-rules @@ -116,20 +116,20 @@ add some filtering on the API level (for example by cluster version). 
which check the next things: -- The JSON version of the config matches the structs defined in the code +- The JSON version of the config matches the structs defined in the code - The maximum number of rules is 64 - The rules should not repeat - There can be up to 8 conditions in each rule - Only implemented conditions can be used -- Alert name from `alert_is_firing` condition should be a string of length between 1 and 128 +- Alert name from `alert_is_firing` condition should be a string of length between 1 and 128 and consist of only alphanumeric characters - For each rule, there's at least one gathering function - Only implemented gathering functions can be used - There can be up to 8 gathering functions in each rule -- Namespace from `gather_logs_of_namespace` function should be a string of length between 1 and 128, +- Namespace from `gather_logs_of_namespace` function should be a string of length between 1 and 128, match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-` - Tail lines from `gather_logs_of_namespace` function should be an integer between 1 and 4096 -- Namespace from `gather_image_streams_of_namespace` function should be a string of length between 1 and 128, +- Namespace from `gather_image_streams_of_namespace` function should be a string of length between 1 and 128, match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-` If anything fails to be validated, the config is rejected and an error is written to the logs. @@ -137,7 +137,7 @@ The PR with validation on operator side - https://github.com/openshift/insights- 5. Insights Operator goes rule by rule and collects the requested data if corresponding conditions are met -The new gathering functions and conditions are added the same way as any other code changes to insights-operator. +The new gathering functions and conditions are added the same way as any other code changes to insights-operator. The JSON schema also lives in the operator repo and is changed through the regular process. ### User Stories @@ -154,22 +154,22 @@ Empty The potential risks could come from an attacker spoofing the config somehow (if they got access to the repo), all they could do is let the insights operator collect more data and send it to the c.r.c, but because of validation -on the operator side, the potential of collecting data is limited and there still would be -the same anonymization of potentially sensitive data as before. -For example, we check that namespaces start with `openshift-`, we're also limiting -the amount of potentially collected data by, for example, introducing the limit for amount of collected logs +on the operator side, the potential of collecting data is limited and there still would be +the same anonymization of potentially sensitive data as before. +For example, we check that namespaces start with `openshift-`, we're also limiting +the amount of potentially collected data by, for example, introducing the limit for amount of collected logs (per container and the number of containers) and, in the worst case, if the conditional gathering takes too much time -it would be stopped by the timeout and we would just get less data. +it would be stopped by the timeout and we would just get less data. -Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator) -and probably a simple review process. 
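
To illustrate what such a CI check could look like, here is a minimal sketch that validates the rules
file against the published JSON schema. The gojsonschema library and the file paths are assumptions
for illustration, not necessarily the repo's actual tooling.

```go
package main

import (
	"fmt"
	"os"

	"github.com/xeipuuv/gojsonschema"
)

func main() {
	// Placeholder paths; a real CI job would point at the checked-out files.
	schema := gojsonschema.NewReferenceLoader("file:///ci/gathering_rules.schema.json")
	document := gojsonschema.NewReferenceLoader("file:///ci/rules.json")

	result, err := gojsonschema.Validate(schema, document)
	if err != nil {
		fmt.Fprintln(os.Stderr, "validation error:", err)
		os.Exit(1)
	}
	if !result.Valid() {
		for _, desc := range result.Errors() {
			fmt.Fprintf(os.Stderr, "- %s\n", desc)
		}
		os.Exit(1) // fail the CI job so an invalid config never merges
	}
	fmt.Println("rules are valid")
}
```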
+Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator) +and probably a simple review process. Not really a risk, but in case the service is not available or provides invalid config for the operator, -we would just have less data in the archives (everything except conditional gatherer's data), -an error would be written to the logs and to the metadata which would allow us to know +we would just have less data in the archives (everything except conditional gatherer's data), +an error would be written to the logs and to the metadata which would allow us to know that something is broken and we need to bring the service back. -In case GitHub is not available, we won't be able to add new rules, but the old ones would still be returned by +In case GitHub is not available, we won't be able to add new rules, but the old ones would still be returned by the service. ## Design Details @@ -182,16 +182,28 @@ Empty ### Test Plan -The conditional gatherer is covered with unit tests considering many different scenarios. +The conditional gatherer is covered with unit tests considering many different scenarios. Later there will be integration tests as well. ### Graduation Criteria Empty +#### Dev Preview -> Tech Preview + +Empty + +#### Tech Preview -> GA + +Empty + +#### Removing a deprecated feature + +Empty + ### Upgrade / Downgrade Strategy -The conditional gatherer itself (e.g. adding new/removing old conditions or gathering functions) is +The conditional gatherer itself (e.g. adding new/removing old conditions or gathering functions) is modified through a standard process like all other operator's code. Only the rules to collect X with Y params on Z are updated through the new process described above. @@ -210,7 +222,7 @@ None? ## Alternatives -Thin IO idea (https://github.com/openshift/enhancements/pull/465) could also solve this, but thin IO was about adding +Thin IO idea (https://github.com/openshift/enhancements/pull/465) could also solve this, but thin IO was about adding new code bypassing the standard release process which could potentially be very dangerous. ## Infrastructure Needed [optional] From 40eb5e1916944a479a4f851702ebb73f1d8ebb53 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Mon, 6 Sep 2021 17:42:07 +0200 Subject: [PATCH 09/18] mention versioning of the rules --- enhancements/insights/conditional-data-gathering.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 2cec15ed4d..e7223b1623 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -91,6 +91,11 @@ A gathering function consists of its name and parameters, implemented functions limiting it to only last N lines from parameter `tail_lines` - `image_streams_of_namespace` which collects image streams of the namespace from parameter `namespace` +Conditions to be implemented: + +- `cluster_version_matches` with parameter `version` makes the rule to be applied on specific cluster version. +It will use semantic versioning so you can wildcards and ranges like here https://github.com/blang/semver#features + The proposal is to implement the next process of adding new rules: 1. We'll have a repo with json configs defining the rules. 
The repo is going to have a simple CI with validation From 18fcbc865ea90db2f6bfc629817fc9132fde2da6 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Thu, 9 Sep 2021 16:07:46 +0200 Subject: [PATCH 10/18] updated reviewers and approvers list --- enhancements/insights/conditional-data-gathering.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index e7223b1623..339b092282 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -5,9 +5,10 @@ authors: reviewers: - "@inecas" - "@tremes" - - "@smarterclayton" + - "@sttts" + - "@deads2k" approvers: - - "@smarterclayton" + - "@tremes" creation-date: 2021-07-15 last-updated: 2021-07-15 status: implementable From 51446fc1b601ad4e7e12488ee6eda6dd1a346d13 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Thu, 14 Oct 2021 17:36:18 +0200 Subject: [PATCH 11/18] unify conditional gatherer api with targeted update edge blocking api --- enhancements/insights/conditional-data-gathering.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 339b092282..9eab35e37c 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -66,7 +66,7 @@ and add it to the next archive. [{ "conditions": [{ "type": "alert_is_firing", - "params": { + "alert": { "name": "SamplesImagestreamImportFailing" } }], @@ -82,6 +82,9 @@ and add it to the next archive. }] ``` +The API for conditions definition is unified with +[Targeted Update Edge Blocking API](https://github.com/openshift/enhancements/pull/821) + Conditions can have type and parameters, implemented types are: - `alert_is_firing` which is met when the alert with the name from parameter `name` is firing From 74d4a5b42a01a530dda10186da610cbf9b4af027 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Wed, 20 Oct 2021 11:15:34 +0200 Subject: [PATCH 12/18] added non-goals --- enhancements/insights/conditional-data-gathering.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 9eab35e37c..04b027c785 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -52,7 +52,7 @@ what to collect on which condition externally will help to do that faster. ### Non-Goals -Empty +It's NOT about running brand new code. ## Proposal From b56e0e634b83b3e35937ec35fc694efc706000db Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Wed, 20 Oct 2021 17:07:23 +0200 Subject: [PATCH 13/18] fix lint --- enhancements/insights/conditional-data-gathering.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 04b027c785..92481e1b29 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -82,7 +82,7 @@ and add it to the next archive. 
}] ``` -The API for conditions definition is unified with +The API for conditions definition is unified with [Targeted Update Edge Blocking API](https://github.com/openshift/enhancements/pull/821) Conditions can have type and parameters, implemented types are: From 310df60a0dc7f89a9c34dbb254ecbcfb96cb4ed0 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Wed, 1 Dec 2021 10:39:39 +0100 Subject: [PATCH 14/18] condition types should be in camel case --- enhancements/insights/conditional-data-gathering.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 92481e1b29..e0389963f8 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -65,7 +65,7 @@ and add it to the next archive. ```json [{ "conditions": [{ - "type": "alert_is_firing", + "type": "AlertIsFiring", "alert": { "name": "SamplesImagestreamImportFailing" } @@ -87,7 +87,7 @@ The API for conditions definition is unified with Conditions can have type and parameters, implemented types are: -- `alert_is_firing` which is met when the alert with the name from parameter `name` is firing +- `AlertIsFiring` which is met when the alert with the name from parameter `name` is firing A gathering function consists of its name and parameters, implemented functions are: @@ -97,7 +97,7 @@ limiting it to only last N lines from parameter `tail_lines` Conditions to be implemented: -- `cluster_version_matches` with parameter `version` makes the rule to be applied on specific cluster version. +- `ClusterVersionMatches` with parameter `version` makes the rule to be applied on specific cluster version. It will use semantic versioning so you can wildcards and ranges like here https://github.com/blang/semver#features The proposal is to implement the next process of adding new rules: @@ -130,7 +130,7 @@ which check the next things: - The rules should not repeat - There can be up to 8 conditions in each rule - Only implemented conditions can be used -- Alert name from `alert_is_firing` condition should be a string of length between 1 and 128 +- Alert name from `AlertIsFiring` condition should be a string of length between 1 and 128 and consist of only alphanumeric characters - For each rule, there's at least one gathering function - Only implemented gathering functions can be used From a4563aa5f43a78e29c7437b37aec55436ccad51b Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Wed, 1 Dec 2021 10:59:06 +0100 Subject: [PATCH 15/18] lint --- .../insights/conditional-data-gathering.md | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index e0389963f8..1bdc53a4ed 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -155,6 +155,10 @@ For OpenShift engineers it will be easier to get necessary information when some Getting such information now means changing operator's code which is especially painful with older versions of clusters. When the data is not needed anymore, the rule can be simply removed from the repo. +### API Extensions + +Empty + ### Implementation Details/Notes/Constraints [optional] Empty @@ -220,6 +224,18 @@ are updated through the new process described above. Out of scope. 
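
That said, the `ClusterVersionMatches` condition described earlier lets individual rules target specific
cluster versions. A minimal sketch of such a check using the blang/semver library the proposal links to;
the helper name is illustrative, and the `4.8.x` wildcard is shown expanded to an explicit range.

```go
package conditional

import "github.com/blang/semver/v4"

// clusterVersionMatches reports whether the running cluster version
// satisfies the range requested by the rule.
func clusterVersionMatches(clusterVersion, expectedRange string) (bool, error) {
	v, err := semver.Parse(clusterVersion) // e.g. "4.8.11"
	if err != nil {
		return false, err
	}
	r, err := semver.ParseRange(expectedRange) // e.g. ">=4.8.0 <4.9.0" for "4.8.x"
	if err != nil {
		return false, err
	}
	return r(v), nil
}
```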
+### Operational Aspects of API Extensions + +Empty + +#### Failure Modes + +Empty + +#### Support Procedures + +Empty + ## Implementation History Major milestones in the life cycle of a proposal should be tracked in `Implementation From 01d1eafd2e39caf42c360c662b32278aa752098f Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Tue, 14 Dec 2021 16:49:57 +0100 Subject: [PATCH 16/18] design details --- .../insights/conditional-data-gathering.md | 39 ++++++++++++++++++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index 1bdc53a4ed..a2e1b5ae6f 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -187,7 +187,44 @@ the service. ## Design Details -Empty +On the initial start the conditional gatherer will try to fetch the config from the service +by doing http(s) request to the appropriate endpoint `https://cloud.redhat.com/.../gathering_rules`. +There is a timeout of one minute for the request. The response is then cached and sent again only when 12 hours pass +since the last request. In case when Insights Operator can't get the new set of gathering rules, +the old ones will be used. In case of an error the service returns non 200 response. +In case of success, the service responds with the rules in the next format: + +```json +{ + "version": "1.0", + "rules": [ + { + "conditions": [ + { + "type": "alert_is_firing", + "alert": { + "name": "ClusterVersionOperatorIsDown" + } + }, + { + "type": "cluster_version_matches", + "cluster_version": { + "version": "4.8.x" + } + } + ], + "gathering_functions": { + "gather_logs_of_namespace": { + "namespace": "openshift-cluster-version", + "keep_lines": 100 + } + } + } + ] +} +``` + +`version` fields shows the version of the rules. `rules` contains the rules for the conditional gatherer. ### Open Questions [optional] From 4994df823ca627d37ba1d2edd01e5e43f7b9dff9 Mon Sep 17 00:00:00 2001 From: Serhii Zakharov Date: Thu, 16 Dec 2021 10:26:06 +0100 Subject: [PATCH 17/18] fix --- enhancements/insights/conditional-data-gathering.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md index a2e1b5ae6f..fbce4d37af 100644 --- a/enhancements/insights/conditional-data-gathering.md +++ b/enhancements/insights/conditional-data-gathering.md @@ -190,7 +190,7 @@ the service. On the initial start the conditional gatherer will try to fetch the config from the service by doing http(s) request to the appropriate endpoint `https://cloud.redhat.com/.../gathering_rules`. There is a timeout of one minute for the request. The response is then cached and sent again only when 12 hours pass -since the last request. In case when Insights Operator can't get the new set of gathering rules, +since the last request, this value can be configured. In case when Insights Operator can't get the new set of gathering rules, the old ones will be used. In case of an error the service returns non 200 response. 
In case of success, the service responds with the rules in the following format:

From 233e65d19fc473d48a1a1ce2d25da5cf7344e0d4 Mon Sep 17 00:00:00 2001
From: Serhii Zakharov
Date: Thu, 16 Dec 2021 11:02:22 +0100
Subject: [PATCH 18/18] authentication

---
 enhancements/insights/conditional-data-gathering.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/enhancements/insights/conditional-data-gathering.md b/enhancements/insights/conditional-data-gathering.md
index fbce4d37af..a1615f81b4 100644
--- a/enhancements/insights/conditional-data-gathering.md
+++ b/enhancements/insights/conditional-data-gathering.md
@@ -189,8 +189,9 @@ the service.

On the initial start, the conditional gatherer will try to fetch the config from the service
by making an HTTP(S) request to the appropriate endpoint `https://cloud.redhat.com/.../gathering_rules`.
The token from the pull secret is used for authentication. There is a timeout of one minute for the request.
The response is then cached, and the request is sent again only when 12 hours have passed since the last request;
this value can be configured. If Insights Operator can't get a new set of gathering rules,
the old ones will be used. In case of an error, the service returns a non-200 response.
In case of success, the service responds with the rules in the following format:
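
To tie the pieces of this last patch together, here is a hedged sketch of the fetch-and-cache flow it
describes: pull-secret token authentication, a one-minute request timeout, a 12-hour refresh interval,
and falling back to the previously fetched rules on failure. All names, the endpoint placeholder, and
the wiring are illustrative assumptions.

```go
package conditional

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

type rulesClient struct {
	endpoint  string        // e.g. the gathering_rules endpoint on console.redhat.com
	token     string        // taken from the cluster pull secret
	interval  time.Duration // 12h by default, configurable
	lastFetch time.Time
	cached    []byte // last successfully fetched rules
}

// currentRules returns cached rules until the refresh interval elapses;
// on a failed refresh it keeps serving the old rules and reports the error.
func (c *rulesClient) currentRules(ctx context.Context) ([]byte, error) {
	if c.cached != nil && time.Since(c.lastFetch) < c.interval {
		return c.cached, nil
	}
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.endpoint, nil)
	if err != nil {
		return c.cached, err
	}
	req.Header.Set("Authorization", "Bearer "+c.token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return c.cached, err // fall back to the old rules
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return c.cached, fmt.Errorf("service returned %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return c.cached, err
	}
	c.cached, c.lastFetch = body, time.Now()
	return c.cached, nil
}
```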