[8.10] [Stack Monitoring] Update flows for cpu stats fetching (#167244)#170740
Closed
tonyghiani wants to merge 1 commit intoelastic:8.10from
Closed
[8.10] [Stack Monitoring] Update flows for cpu stats fetching (#167244)#170740tonyghiani wants to merge 1 commit intoelastic:8.10from
tonyghiani wants to merge 1 commit intoelastic:8.10from
Conversation
## 📓 Summary When retrieving the CPU stats for containerized (or non-container) clusters, we were not considering a scenario where the user could run in a cgroup but without limits set. These changes re-write the conditions to determine whether we allow treating limitless containers as non-containerized, covering the case where a user run in a cgroup and for some reason hasn't set the limit. ## Testing > Taken from elastic#159351 since it reproduced the same behaviours There are 3 main states to test: No limit set but Kibana configured to use container stats. Limit changed during lookback period (to/from real value, to/from no limit). Limit set and CPU usage crossing threshold and then falling down to recovery **Note: Please also test the non-container use case for this rule to ensure that didn't get broken during this refactor** **1. Start Elasticsearch in a container without setting the CPU limits:** ``` docker network create elastic docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT ``` (We're using `master-SNAPSHOT` to include a recent fix to reporting for cgroup v2) Make note of the generated password for the `elastic` user. **2. Start another Elasticsearch instance to act as the monitoring cluster** **3. Configure Kibana to connect to the monitoring cluster and start it** **4. Configure Metricbeat to collect metrics from the Docker cluster and ship them to the monitoring cluster, then start it** Execute the below command next to the Metricbeat binary to grab the CA certificate from the Elasticsearch cluster. ``` docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt . ``` Use the `elastic` password and the CA certificate to configure the `elasticsearch` module: ``` - module: elasticsearch xpack.enabled: true period: 10s hosts: - "https://localhost:9201" username: "elastic" password: "PASSWORD" ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt" ``` **5. Configure an alert in Kibana with a chosen threshold** OBSERVE: Alert gets fired to inform you that there looks to be a misconfiguration, together with reporting the current value for the fallback metric (warning if the fallback metric is below threshold, danger is if is above). **6. Set limit** First stop ES using `docker stop es01`, then set the limit using `docker update --cpus=1 es01` and start it again using `docker start es01`. After a brief delay you should now see the alert change to a warning about the limits having changed during the alert lookback period and stating that the CPU usage could not be confidently calculated. Wait for change event to pass out of lookback window. **7. Generate load on the monitored cluster** [Slingshot](https://github.com/elastic/slingshot) is an option. After you clone it, you need to update the `package.json` to match [this change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46) before running `npm install`. Then you can modify the value for `elasticsearch` in the `configs/hosts.json` file like this: ``` "elasticsearch": { "node": "https://localhost:9201", "auth": { "username": "elastic", "password": "PASSWORD" }, "ssl": { "ca": "PATH_TO_CERT/http_ca.crt", "rejectUnauthorized": false } } ``` Then you can start one or more instances of Slingshot like this: `npx ts-node bin/slingshot load --config configs/hosts.json` **7. Observe the alert firing in the logs** Assuming you're using a connector for server log output, you should see a message like below once the threshold is breached: ``` `[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))` ``` The alert should also be visible in the Stack Monitoring UI overview page. At this point you can stop Slingshot and confirm that the alert recovers once CPU usage goes back down below the threshold. **8. Stop the load and confirm that the rule recovers.** --------- Co-authored-by: Marco Antonio Ghiani <marcoantonio.ghiani@elastic.co> Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com> (cherry picked from commit 833c075)
🤖 GitHub commentsExpand to view the GitHub comments
Just comment with:
|
Contributor
Author
|
Closing since it won't be released with any new patch for 8.10.x. |
💔 Build FailedFailed CI Steps
Test Failures
Metrics [docs]
HistoryTo update your PR or re-run it, just comment with: |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Backport
This will backport the following commits from
mainto8.10:Questions ?
Please refer to the Backport tool documentation