-
Notifications
You must be signed in to change notification settings - Fork 533
Conditional Data Gathering for Insights Operator #837
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
ad786eb
f849344
c742a04
3bf0924
eca1dac
2537035
ea93244
92dd7e8
40eb5e1
18fcbc8
51446fc
74d4a5b
b56e0e6
310df60
a4563aa
01d1eaf
4994df8
233e65d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,294 @@ | ||
| --- | ||
| title: conditional-data-gathering | ||
| authors: | ||
| - "@Sergey1011010" | ||
| reviewers: | ||
| - "@inecas" | ||
| - "@tremes" | ||
| - "@sttts" | ||
| - "@deads2k" | ||
| approvers: | ||
| - "@tremes" | ||
| creation-date: 2021-07-15 | ||
| last-updated: 2021-07-15 | ||
| status: implementable | ||
| --- | ||
|
|
||
| # Conditional Data Gathering | ||
|
|
||
| ## Release Signoff Checklist | ||
|
|
||
| - [x] Enhancement is `implementable` | ||
| - [x] Design details are appropriately documented from clear requirements | ||
| - [x] Test plan is defined | ||
| - [ ] Operational readiness criteria is defined | ||
| - [ ] Graduation criteria for dev preview, tech preview, GA | ||
| - [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/) | ||
|
|
||
| ## Summary | ||
|
|
||
| The conditional gatherer for Insights Operator collects data according to the defined gathering rules*. | ||
| Each rule contains one or more conditions such as "alert A is firing" | ||
| and one or more gatherers with parameters such as "collect X lines of logs from containers in namespace N". | ||
| Current version has these rules defined in the code and the proposal is to load them from an external source | ||
| to make collection of new conditional data faster. It's NOT about running brand new code, | ||
| but just calling the existing gatherers with different validated parameters, so the operator can't | ||
| collect anything we don't expect. | ||
|
|
||
| \* note that rule here and later has nothing to do with rules used to analyze data in archives written | ||
| by CCX presentation team | ||
|
|
||
| ## Motivation | ||
|
|
||
| Collecting data on some condition such as an alert is quite common pattern | ||
| during the root-cause analysis and the ability to define the rules | ||
| what to collect on which condition externally will help to do that faster. | ||
|
|
||
| ### Goals | ||
|
|
||
| - Implement validation of the rules for conditional gatherer | ||
| - Create a service taking gathering rules from the git repo and providing them through the API | ||
| - Fetch the rules definitions from the service's API and apply them to the conditional gatherer | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| It's NOT about running brand new code. | ||
|
|
||
| ## Proposal | ||
|
|
||
| Right now we have conditional gatherer which can collect data based on some conditions | ||
| which are defined in the code, for example the next config would collect 100 lines of logs | ||
| from each container in namespace `openshift-cluster-samples-operator` | ||
| and image streams of the same namespace when alert `SamplesImagestreamImportFailing` is firing | ||
| and add it to the next archive. | ||
|
|
||
| ```json | ||
| [{ | ||
| "conditions": [{ | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| "type": "AlertIsFiring", | ||
| "alert": { | ||
| "name": "SamplesImagestreamImportFailing" | ||
| } | ||
| }], | ||
| "gathering_functions": { | ||
| "logs_of_namespace": { | ||
| "namespace": "openshift-cluster-samples-operator", | ||
| "tail_lines": 100 | ||
| }, | ||
| "image_streams_of_namespace": { | ||
| "namespace": "openshift-cluster-samples-operator" | ||
| } | ||
| } | ||
| }] | ||
| ``` | ||
|
|
||
| The API for conditions definition is unified with | ||
| [Targeted Update Edge Blocking API](https://github.com/openshift/enhancements/pull/821) | ||
|
|
||
| Conditions can have type and parameters, implemented types are: | ||
|
|
||
| - `AlertIsFiring` which is met when the alert with the name from parameter `name` is firing | ||
|
|
||
| A gathering function consists of its name and parameters, implemented functions are: | ||
|
|
||
| - `logs_of_namespace` which collects logs from each container in the namespace from parameter `namespace` | ||
| limiting it to only last N lines from parameter `tail_lines` | ||
| - `image_streams_of_namespace` which collects image streams of the namespace from parameter `namespace` | ||
|
|
||
| Conditions to be implemented: | ||
|
|
||
| - `ClusterVersionMatches` with parameter `version` makes the rule to be applied on specific cluster version. | ||
| It will use semantic versioning so you can wildcards and ranges like here https://github.com/blang/semver#features | ||
|
|
||
| The proposal is to implement the next process of adding new rules: | ||
|
|
||
| 1. We'll have a repo with json configs defining the rules. The repo is going to have a simple CI with validation | ||
| against JSON schema and possibly some review process. The repo should live in https://github.com/RedHatInsights | ||
| We have created a PoC for the repo: | ||
|
|
||
| https://github.com/tremes/insights-operator-gathering-rules | ||
|
|
||
| The JSON schema can be found here: | ||
|
|
||
| https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rule.schema.json | ||
|
|
||
| 2. There will be a service living in console.redhat.com which will have the rules baked into the container and will | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| provide all the rules through its API. The very first version of the service is going to be simple, but later we may | ||
| add some filtering on the API level (for example by cluster version). | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| 3. Insights Operator fetches a config with the rules from the service and unmarshalls JSON to Go structs. | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| 4. Insights Operator makes the most important validation against the next JSON schemas: | ||
|
|
||
| - https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rule.schema.json | ||
| - https://raw.githubusercontent.com/openshift/insights-operator/f9b762149cd10ec98079e48b8a96fc02a2aca3c6/pkg/gatherers/conditional/gathering_rules.schema.json | ||
|
|
||
| which check the next things: | ||
|
|
||
| - The JSON version of the config matches the structs defined in the code | ||
| - The maximum number of rules is 64 | ||
| - The rules should not repeat | ||
| - There can be up to 8 conditions in each rule | ||
| - Only implemented conditions can be used | ||
| - Alert name from `AlertIsFiring` condition should be a string of length between 1 and 128 | ||
| and consist of only alphanumeric characters | ||
| - For each rule, there's at least one gathering function | ||
| - Only implemented gathering functions can be used | ||
| - There can be up to 8 gathering functions in each rule | ||
| - Namespace from `gather_logs_of_namespace` function should be a string of length between 1 and 128, | ||
| match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-` | ||
| - Tail lines from `gather_logs_of_namespace` function should be an integer between 1 and 4096 | ||
| - Namespace from `gather_image_streams_of_namespace` function should be a string of length between 1 and 128, | ||
| match the next regular expression `^[a-zA-Z]([a-zA-Z0-9\-]+[\.]?)*[a-zA-Z0-9]$` and start with `openshift-` | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| If anything fails to be validated, the config is rejected and an error is written to the logs. | ||
| The PR with validation on operator side - https://github.com/openshift/insights-operator/pull/483 | ||
|
|
||
| 5. Insights Operator goes rule by rule and collects the requested data if corresponding conditions are met | ||
|
|
||
| The new gathering functions and conditions are added the same way as any other code changes to insights-operator. | ||
| The JSON schema also lives in the operator repo and is changed through the regular process. | ||
|
|
||
| ### User Stories | ||
|
|
||
| For OpenShift engineers it will be easier to get necessary information when some problem occurs (like alert is firing). | ||
| Getting such information now means changing operator's code which is especially painful with older versions of clusters. | ||
| When the data is not needed anymore, the rule can be simply removed from the repo. | ||
|
|
||
| ### API Extensions | ||
|
|
||
| Empty | ||
|
|
||
| ### Implementation Details/Notes/Constraints [optional] | ||
|
|
||
| Empty | ||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| The potential risks could come from an attacker spoofing the config somehow (if they got access to the repo), | ||
| all they could do is let the insights operator collect more data and send it to the c.r.c, but because of validation | ||
| on the operator side, the potential of collecting data is limited and there still would be | ||
| the same anonymization of potentially sensitive data as before. | ||
| For example, we check that namespaces start with `openshift-`, we're also limiting | ||
| the amount of potentially collected data by, for example, introducing the limit for amount of collected logs | ||
| (per container and the number of containers) and, in the worst case, if the conditional gathering takes too much time | ||
| it would be stopped by the timeout and we would just get less data. | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| Also in the regular workflow, changing the config involves going through the repo's CI (JSON schema validator) | ||
| and probably a simple review process. | ||
|
|
||
| Not really a risk, but in case the service is not available or provides invalid config for the operator, | ||
| we would just have less data in the archives (everything except conditional gatherer's data), | ||
| an error would be written to the logs and to the metadata which would allow us to know | ||
| that something is broken and we need to bring the service back. | ||
|
|
||
| In case GitHub is not available, we won't be able to add new rules, but the old ones would still be returned by | ||
| the service. | ||
|
|
||
| ## Design Details | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am missing here few words about the authentication. I guess the Insights operator will have to authenticate (using the "cloud.openshift.com" token from the I think it would be also good to describe the communication between the operator and the service more. How often will we query the service? What will the service API return - will be there some metadata e.g version or some time of update of the conditions? I remember there were some discussions about caching so this would be good to mention as well IMO.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added some info about implementation. Do you think we need the authentication? The rules aren't secret after all
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am not sure, I guess everything running in crc requires authentication. We can check with @tisnik. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tremes Officially it would be needed. This is two-way thing as it allows the cluster (operator) to trust with whom it communicates (so not only for us to trust the cluster) |
||
|
|
||
| On the initial start the conditional gatherer will try to fetch the config from the service | ||
| by doing http(s) request to the appropriate endpoint `https://cloud.redhat.com/.../gathering_rules`. | ||
| The token from pull secret is used for authentication. There is a timeout of one minute for the request. | ||
| The response is then cached and sent again only when 12 hours pass since the last request, | ||
| this value can be configured. In case when Insights Operator can't get the new set of gathering rules, | ||
| the old ones will be used. In case of an error the service returns non 200 response. | ||
| In case of success, the service responds with the rules in the next format: | ||
|
|
||
| ```json | ||
| { | ||
| "version": "1.0", | ||
| "rules": [ | ||
| { | ||
| "conditions": [ | ||
| { | ||
| "type": "alert_is_firing", | ||
| "alert": { | ||
| "name": "ClusterVersionOperatorIsDown" | ||
| } | ||
| }, | ||
| { | ||
| "type": "cluster_version_matches", | ||
| "cluster_version": { | ||
| "version": "4.8.x" | ||
| } | ||
| } | ||
| ], | ||
| "gathering_functions": { | ||
| "gather_logs_of_namespace": { | ||
| "namespace": "openshift-cluster-version", | ||
| "keep_lines": 100 | ||
| } | ||
| } | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| `version` fields shows the version of the rules. `rules` contains the rules for the conditional gatherer. | ||
|
|
||
| ### Open Questions [optional] | ||
|
|
||
| Empty | ||
|
|
||
| ### Test Plan | ||
|
|
||
| The conditional gatherer is covered with unit tests considering many different scenarios. | ||
| Later there will be integration tests as well. | ||
|
|
||
| ### Graduation Criteria | ||
|
|
||
| Empty | ||
|
|
||
| #### Dev Preview -> Tech Preview | ||
|
|
||
| Empty | ||
|
|
||
| #### Tech Preview -> GA | ||
|
|
||
| Empty | ||
|
|
||
| #### Removing a deprecated feature | ||
|
|
||
| Empty | ||
|
|
||
| ### Upgrade / Downgrade Strategy | ||
|
|
||
| The conditional gatherer itself (e.g. adding new/removing old conditions or gathering functions) is | ||
| modified through a standard process like all other operator's code. Only the rules to collect X with Y params on Z | ||
| are updated through the new process described above. | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| ### Version Skew Strategy | ||
|
|
||
| Out of scope. | ||
|
|
||
| ### Operational Aspects of API Extensions | ||
|
|
||
| Empty | ||
|
|
||
| #### Failure Modes | ||
|
|
||
| Empty | ||
|
|
||
| #### Support Procedures | ||
|
|
||
| Empty | ||
|
|
||
| ## Implementation History | ||
|
|
||
| Major milestones in the life cycle of a proposal should be tracked in `Implementation | ||
| History`. | ||
|
|
||
| ## Drawbacks | ||
|
|
||
| None? | ||
|
|
||
| ## Alternatives | ||
|
|
||
| Thin IO idea (https://github.com/openshift/enhancements/pull/465) could also solve this, but thin IO was about adding | ||
| new code bypassing the standard release process which could potentially be very dangerous. | ||
Serhii1011010 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| ## Infrastructure Needed [optional] | ||
|
|
||
| - The repo in RedHatInsights | ||
| - The service living in console.redhat.com | ||
Uh oh!
There was an error while loading. Please reload this page.