
[Elasticsearch] Limit maxSockets to 800 by default #151911

Merged

afharo merged 18 commits into elastic:main from afharo:es-limit-max-sockets-to-800 on Nov 29, 2023

Conversation

@afharo (Member) commented Feb 22, 2023

Summary

It lowers the default elasticsearch.maxSockets from the current Infinity to 800.

Why 800?
We are trying to prioritize 0% drops. Our tests indicate that 800 is the highest value we can set today to achieve this.

Scalability tests

The scalability tests (compared to main) show an overall improvement in the resilience of Kibana (the rejection rate drops to 0%) at the cost of larger response times on average, although improving the 95 and 99 percentiles (lower standard deviation).

👍 Lower drop rate / Higher capacity

Limiting the number of connections to ES allows Kibana to use those extra sockets to handle more incoming requests, drastically reducing the number of rejections and allowing it to handle a higher load.

👎 Slower average response time

The average response time doubles in all tested scenarios. However, the Std Dev and 95 percentiles are lower in the socket-limited scenario.

This branch: Similar behavior when not loaded. Slower responses during mid-load. Slightly better response time when loaded. As soon as the load is higher than the number of sockets, Kibana queues further requests, increasing the response time for those.

main: Random failures across any load. Mind that the max values are higher here (and they failed). It takes a higher load of users to start increasing the response time. However, the blanks indicate failures (either timed out or rejected).
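The queuing described above is standard Node.js `http.Agent` behavior. As a minimal sketch (assumed for illustration, not Kibana's actual code), bounding the socket pool looks like this:

```typescript
import { Agent } from 'http';

// Sketch: an HTTP agent with a bounded socket pool. Once all sockets are
// busy, Node's Agent queues further requests instead of opening new
// connections — the queuing behavior described in the table above.
const agent = new Agent({ keepAlive: true, maxSockets: 800 });

// Requests issued through this agent share the bounded pool.
console.log(agent.maxSockets); // 800
```

With the previous default of `Infinity`, the agent never queued: every request opened a socket immediately, starving the rest of the process under load.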

⏳ Note about timeouts

APIs that are typically slow in main (response times are close to 60s) tend to 60s-timeout more consistently in this branch. We may want to extend the timeout for these tests.

POST /api/metrics/vis/data

Risk Matrix

Risk | Probability | Severity | Mitigation/Notes
Deploying Kibana on machines with limited resources may require lower values. | Medium | Low | Users can override the default in the configuration. 800 is still better than Infinity in those scenarios.
Deploying Kibana on machines with plenty of resources may require higher values. | Low | Low | Users can override the default in the configuration. We may want to run performance tests on a matrix of different hardware to identify the best defaults for different configurations (cc @elastic/kibana-qa). We could create a "Tuning Kibana" guide.
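The override mentioned in the mitigations is a single setting in kibana.yml (the value shown is illustrative):

```yaml
# kibana.yml — tune to the machine's resources; the new default is 800
elasticsearch.maxSockets: 1024
```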

For maintainers

This PR was built on top of #151110. The actual changes in this PR are in the commit c947695 (#151911).

@afharo afharo added the Team:Core (Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc.), performance, Feature:elasticsearch, and resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility) labels Feb 22, 2023
@afharo afharo force-pushed the es-limit-max-sockets-to-800 branch from 86fb563 to c947695 Compare February 23, 2023 10:04
@afharo afharo added release_note:enhancement backport:skip This PR does not require backporting labels Feb 23, 2023
@afharo (Member, Author) commented Feb 23, 2023

@elasticmachine merge upstream

@afharo (Member, Author) commented Feb 23, 2023

☝️ Updating from main because Scalability tests now print Ops Metrics to the logs (and we can learn more about memory and CPU usage)

@afharo afharo self-assigned this Feb 24, 2023
@afharo (Member, Author) commented Feb 24, 2023

There's a bug in the recent changes to the API Capacity CI. @dmlemeshko is looking into resolving it.

We'll run the tests again to compare CPU and Memory usage once it's resolved.

@rudolf (Contributor) commented Feb 24, 2023

Analysing throughput at different response-time cutoffs shows slight improvements in throughput for some APIs and slight degradation for others. But overall, nothing concerning here.


Source https://telemetry-v2-staging.elastic.dev/s/kibana-performance/app/dashboards#/view/d0d6bd30-b390-11ed-a6e6-d32d2209b7b7?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-3d%2Cto%3Anow))

It would be useful to be able to compare the "1600 data views no cache" scenario with main. We should also ask ResponseOps to do their scalability testing against this branch.

@afharo (Member, Author) commented Feb 27, 2023

To extend @rudolf's assessment, I thought it was useful to compare the sum of RPS for all thresholds (which version is capable of handling more requests?) and the RPS for the first threshold (which version is faster?):

(Chart) Median of the sum of RPS for all thresholds.

(Chart) Median of the RPS for the first threshold.

As highlighted by Rudolf, it shows slight improvements for some APIs and slight degradation in others. The deeper analysis I went through earlier showed the degradation is caused by a higher number of timeouts on already timeout-prone APIs.

It would be useful to be able to compare the "1600 data views no cache" scenario with main. We should also ask ResponseOps to do their scalability testing against this branch.

I'll push to get #151110 merged, so we can have those metrics from main.

I'll also work today on comparing the CPU and memory usage in both versions.

@afharo (Member, Author) commented Feb 27, 2023

I'll also work today on comparing the CPU and memory usage in both versions.

I processed the kibana.log files from main and this branch, and extracted the reported metrics.ops. I uploaded those to my own ES server and compared both versions for each API.

The logs and scripts used are here: scalability tests.zip.

TL;DR: CPU load and memory utilization are very similar. Event loop delays might be a decisive factor. However, I noticed that Ops Metrics is not able to report them when they're too high 😢. And, with the current data, I'd say they are inconclusive.

I'll process and upload the logs from subsequent CI jobs to have more data points and reduce the randomness of the data.

Preliminary analysis from 1 CI job for each version

api.core.capabilities

Based on the reported metrics, I think we can claim that CPU and memory utilization are very similar.

This version seems to improve the event loop delay for this API.

However, it's worth noting that both versions stopped reporting the event loop delay after two-thirds of the execution (while memory and CPU were still being reported).

api.metrics.vis.data

Again, CPU and memory look very similar.

Event loop delay is really bad on both versions; both stop reporting it right after the middle of the execution.

However, this version was able to report some event loop delay (despite it being very high), while main simply stopped reporting it and never came back.

Looking at the logs, we can leverage the log entries [plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. to attempt to fill in the blanks. Around 2m40s into the execution, main logs:

[2023-02-24T16:32:46.471+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 6443.761664ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
... 40s later
[2023-02-24T16:33:27.624+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 5338.693632ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
... 30s later
[2023-02-24T16:34:02.916+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 4639.883264ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
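Hypothetically, filling those blanks could be scripted by extracting the reported delay from each warning line (a sketch with assumed helper names, not part of the shared scripts):

```typescript
// Pulls the reported delay (in ms) out of kibanaUsageCollection warning
// lines like the ones quoted above, so the gaps in metrics.ops can be
// approximated from the log stream.
const DELAY_RE =
  /Average event loop delay threshold exceeded \d+ms\. Received ([\d.]+)ms/;

function extractDelaysMs(logLines: string[]): number[] {
  return logLines
    .map((line) => DELAY_RE.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => parseFloat(m[1]));
}
```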

However, I'd take these values with a grain of salt: I've seen logs where kibanaUsageCollection logged 6401ms and, 3 seconds later, metrics.ops logged 27724ms. I don't think the average can increase by 21s in only 3s... Bug?

api.saved_objects_tagging.tags

CPU and memory (charts omitted).

Regarding event-loop delays, it looks like main is able to keep reporting them for longer (and even comes back).

Looking closer at this version's logs, we can see the following log entries (the first appears at 3 minutes into the execution):

[2023-02-24T18:35:37.038+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 392.9593831325301ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:36:21.432+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 2630.936762181818ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:37:05.526+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 11021.01504ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:37:41.168+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 8484.480341333334ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:38:15.660+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 5732.847616ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.

It goes up to 11s at 4m30s into the execution and starts going down after that.

api.telemetry.cluster_stats

CPU and memory are very similar... I'll focus on event-loop delays from now on.

bundles.core.entry · internal.security.session · internal.security.user_profile · security.me

(event-loop delay charts for these four APIs omitted)

@afharo (Member, Author) commented Feb 27, 2023

I'll process and upload the logs from subsequent CI jobs to have more data points and reduce the randomness of the data.

Done! Results can be analyzed at https://sec-tests.kb.europe-west1.gcp.cloud.es.io:9243/s/ops-metrics-analysis/app/dashboards#/view/cab987b0-8c68-4432-b920-1abf76874bb6 (u: ops-metrics-read-only, p: ops-metrics-123 🔒😅).

TL;DR: the differences are smoothed out, and event-loop delays are more similar now. The overall picture stays the same: for some APIs main is slightly better, and for others this version slightly improves on it.

Raw data and processing scripts here.

@afharo afharo requested a review from a team as a code owner February 27, 2023 18:57
@afharo (Member, Author) commented Feb 27, 2023

Setting it as "ready to review" now... it looks like most of the results indicate that this change is good for Kibana's resilience in general, and the downsides are pretty minimal based on our tests (RPS and overall CPU and memory utilization are fairly similar).

Adding @dmlemeshko as an additional reviewer.

@rudolf (Contributor) left a comment

I think this would be an improvement, and I think 800 is a good value to start with. But at this point I'm not confident that our testing has ruled out a negative impact on some workloads.

Because we don't have socket pool size metrics on ESS, we also can't detect how many Kibanas might be exhausting their socket pools.

Perhaps we can add a logger similar to the "event loop delay threshold exceeded" logger. It would log whenever one of the httpAgents' open-socket counts reaches the maxSockets limit. This way, we (and users) can know that a cluster's throughput might be limited by the maxSockets config.
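A rough sketch of what that logger could look like (all names here are hypothetical, not Kibana's implementation; it relies on Node's Agent exposing `sockets`, a map of host → open sockets):

```typescript
import type { Agent } from 'http';

// Hypothetical helper: sum all open sockets across an agent's per-host pools.
function countOpenSockets(
  sockets: Readonly<Record<string, ReadonlyArray<unknown> | undefined>>
): number {
  return Object.values(sockets).reduce(
    (total, list) => total + (list?.length ?? 0),
    0
  );
}

// Hypothetical monitor: poll the agent and warn when the pool is saturated,
// mirroring the "event loop delay threshold exceeded" style of warning.
function startSocketSaturationLogger(
  agent: Agent,
  maxSockets: number,
  warn: (msg: string) => void,
  intervalMs = 30_000
): NodeJS.Timeout {
  return setInterval(() => {
    const open = countOpenSockets(agent.sockets);
    if (open >= maxSockets) {
      warn(
        `All ${maxSockets} sockets to Elasticsearch are in use; ` +
          `throughput may be limited by elasticsearch.maxSockets.`
      );
    }
  }, intervalMs);
}
```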

@afharo (Member, Author) commented Mar 1, 2023

Perhaps we can add a logger similar to the "event loop delay threshold exceeded" logger. This would log whenever one of the httpAgents open sockets count matches the maxSockets limit. This way we (and users) can know that a cluster's throughput might be limited by the maxSockets config.

I like that! I can look at how to add that logger.

@afharo (Member, Author) commented Mar 1, 2023

After taking a quick look, it seems adding such a log is not a trivial change. I've created #152452 so we can plan that work.

@rudolf, I'll move this to blocked until that issue is resolved.

@afharo (Member, Author) commented Mar 6, 2023

@elasticmachine merge upstream

@rudolf (Contributor) commented May 23, 2023

Update: we've enabled this on Cloud for deployments on 8.8.0+ as a way to monitor any potential impact. Once it's released with ms-93 in early June we should be able to validate the impact.

@gsoldevila (Member) commented Jul 10, 2023

Alternatively, we enriched Metricbeat with the kibana.stats.elasticsearch_client.* metrics. In particular, the number of queued requests should be monitored. AFAIK it was merged in beats 8.7.0, but the problem is that Cloud is still using 7.17.9 ATM.

Do you think we should push to have this updated?

@rudolf (Contributor) commented Aug 3, 2023

Do you think we should push to have this updated?

This is on their roadmap, but they didn't have a timeline the last time I asked.

@afharo afharo requested a review from a team November 28, 2023 15:30
@afharo afharo removed the blocked label Nov 28, 2023
@afharo afharo requested a review from rudolf November 28, 2023 15:32
@kibana-ci

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @afharo

@afharo afharo enabled auto-merge (squash) November 28, 2023 23:47
@justinkambic (Contributor) left a comment

Rubberstamp Codeowners LGTM, changing a value in a test file we own.

@afharo afharo merged commit ac16c65 into elastic:main Nov 29, 2023
@afharo afharo deleted the es-limit-max-sockets-to-800 branch November 29, 2023 04:17

Labels

backport:skip (This PR does not require backporting) · Feature:elasticsearch · performance · release_note:enhancement · resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility) · Team:Core (Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc.) · v8.12.0


10 participants