Skip to content

[8.10] [Stack Monitoring] Update flows for cpu stats fetching (#167244)#170740

Closed
tonyghiani wants to merge 1 commit intoelastic:8.10from
tonyghiani:backport/8.10/pr-167244
Closed

[8.10] [Stack Monitoring] Update flows for cpu stats fetching (#167244)#170740
tonyghiani wants to merge 1 commit intoelastic:8.10from
tonyghiani:backport/8.10/pr-167244

Conversation

@tonyghiani
Copy link
Contributor

Backport

This will backport the following commits from main to 8.10:

Questions ?

Please refer to the Backport tool documentation

## 📓 Summary

When retrieving the CPU stats for containerized (or non-container)
clusters, we were not considering a scenario where the user could run in
a cgroup but without limits set.
These changes re-write the conditions to determine whether we allow
treating limitless containers as non-containerized, covering the case
where a user run in a cgroup and for some reason hasn't set the limit.

## Testing

> Taken from elastic#159351 since it
reproduced the same behaviours

There are 3 main states to test:
No limit set but Kibana configured to use container stats.
Limit changed during lookback period (to/from real value, to/from no
limit).
Limit set and CPU usage crossing threshold and then falling down to
recovery

**Note: Please also test the non-container use case for this rule to
ensure that didn't get broken during this refactor**

**1. Start Elasticsearch in a container without setting the CPU
limits:**
```
docker network create elastic
docker run --name es01 --net elastic -p 9201:9200 -e xpack.license.self_generated.type=trial -it docker.elastic.co/elasticsearch/elasticsearch:master-SNAPSHOT
```

(We're using `master-SNAPSHOT` to include a recent fix to reporting for
cgroup v2)

Make note of the generated password for the `elastic` user.

**2. Start another Elasticsearch instance to act as the monitoring
cluster**

**3. Configure Kibana to connect to the monitoring cluster and start
it**

**4. Configure Metricbeat to collect metrics from the Docker cluster and
ship them to the monitoring cluster, then start it**

Execute the below command next to the Metricbeat binary to grab the CA
certificate from the Elasticsearch cluster.

```
docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
```

Use the `elastic` password and the CA certificate to configure the
`elasticsearch` module:
```
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts:
      - "https://localhost:9201"
    username: "elastic"
    password: "PASSWORD"
    ssl.certificate_authorities: "PATH_TO_CERT/http_ca.crt"
```

**5. Configure an alert in Kibana with a chosen threshold**

OBSERVE: Alert gets fired to inform you that there looks to be a
misconfiguration, together with reporting the current value for the
fallback metric (warning if the fallback metric is below threshold,
danger is if is above).

**6. Set limit**
First stop ES using `docker stop es01`, then set the limit using `docker
update --cpus=1 es01` and start it again using `docker start es01`.
After a brief delay you should now see the alert change to a warning
about the limits having changed during the alert lookback period and
stating that the CPU usage could not be confidently calculated.
Wait for change event to pass out of lookback window.

**7. Generate load on the monitored cluster**

[Slingshot](https://github.com/elastic/slingshot) is an option. After
you clone it, you need to update the `package.json` to match [this
change](https://github.com/elastic/slingshot/blob/8bfa8351deb0d89859548ee5241e34d0920927e5/package.json#L45-L46)
before running `npm install`.

Then you can modify the value for `elasticsearch` in the
`configs/hosts.json` file like this:
```
"elasticsearch": {
    "node": "https://localhost:9201",
    "auth": {
      "username": "elastic",
      "password": "PASSWORD"
    },
    "ssl": {
      "ca": "PATH_TO_CERT/http_ca.crt",
      "rejectUnauthorized": false
    }
  }
```

Then you can start one or more instances of Slingshot like this:
`npx ts-node bin/slingshot load --config configs/hosts.json`

**7. Observe the alert firing in the logs**
Assuming you're using a connector for server log output, you should see
a message like below once the threshold is breached:
```
`[2023-06-13T13:05:50.036+02:00][INFO ][plugins.actions.server-log] Server log: CPU usage alert is firing for node e76ce10526e2 in cluster: docker-cluster. [View node](/app/monitoring#/elasticsearch/nodes/OyDWTz1PS-aEwjqcPN2vNQ?_g=(cluster_uuid:kasJK8VyTG6xNZ2PFPAtYg))`
```

The alert should also be visible in the Stack Monitoring UI overview
page.

At this point you can stop Slingshot and confirm that the alert recovers
once CPU usage goes back down below the threshold.

**8. Stop the load and confirm that the rule recovers.**

---------

Co-authored-by: Marco Antonio Ghiani <marcoantonio.ghiani@elastic.co>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
(cherry picked from commit 833c075)
@tonyghiani tonyghiani added the backport This PR is a backport of another PR label Nov 7, 2023
@tonyghiani tonyghiani enabled auto-merge (squash) November 7, 2023 13:57
@ghost
Copy link

ghost commented Nov 7, 2023

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • /oblt-deploy-serverless : Deploy a serverless Kibana instance using the Observability test environments.
  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@tonyghiani
Copy link
Contributor Author

Closing since it won't be released with any new patch for 8.10.x.

@tonyghiani tonyghiani closed this Nov 7, 2023
@tonyghiani tonyghiani deleted the backport/8.10/pr-167244 branch November 7, 2023 14:35
@kibana-ci
Copy link

kibana-ci commented Nov 7, 2023

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #36 / Agents fleet_list_agent should return metrics if available and called with withMetrics
  • [job] [logs] FTR Configs #36 / Agents fleet_list_agent should return metrics if available and called with withMetrics

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport This PR is a backport of another PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants