
[Elasticsearch] Limit maxSockets to 800 by default #151911

Merged

afharo merged 18 commits into elastic:main from afharo:es-limit-max-sockets-to-800 on Nov 29, 2023

Conversation

@afharo (Member) commented Feb 22, 2023

Summary

It lowers the default elasticsearch.maxSockets from the current Infinity to 800.

Why 800?
We are trying to prioritize 0% drops. Our tests indicate that 800 is the highest value we can set today to achieve this.

Scalability tests

The scalability tests (compared to main) show an overall improvement in the resilience of Kibana (the rejection rate drops to 0%) at the cost of larger response times on average, although improving the 95 and 99 percentiles (lower standard deviation).

👍 Lower drop rate / Higher capacity

Limiting the number of connections to ES allows Kibana to use those extra sockets to handle more incoming requests, drastically reducing the number of rejections and allowing it to handle a higher load.

👎 Slower average response time

The average response time doubles in all tested scenarios. However, the Std Dev and 95 percentiles are lower in the socket-limited scenario.

This branch: Similar behavior when not loaded. Slower responses during mid-load. Slightly better response time when loaded. As soon as the load is higher than the number of sockets, Kibana queues further requests, increasing the response time for those.

main: Random failures across any load. Mind that the max values are higher here (and they failed). It takes a higher load of users to start increasing the response time. However, the blanks indicate failures (either timed out or rejected).
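The queuing described above is standard Node.js `http.Agent` behavior. As a minimal sketch (assumed for illustration, not Kibana's actual code), bounding the socket pool looks like this:

```typescript
import { Agent } from 'http';

// Sketch: an HTTP agent with a bounded socket pool. Once all sockets are
// busy, Node's Agent queues further requests instead of opening new
// connections — the queuing behavior described in the table above.
const agent = new Agent({ keepAlive: true, maxSockets: 800 });

// Requests issued through this agent share the bounded pool.
console.log(agent.maxSockets); // 800
```

With the previous default of `Infinity`, the agent never queued: every request opened a socket immediately, starving the rest of the process under load.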

⏳ Note about timeouts

APIs that are typically slow in main (response times are close to 60s) tend to 60s-timeout more consistently in this branch. We may want to extend the timeout for these tests.

POST /api/metrics/vis/data

Risk Matrix

Risk | Probability | Severity | Mitigation/Notes
Deploying Kibana on machines with limited resources may require lower values. | Medium | Low | Users can override the default in the configuration. 800 is still better than Infinity in those scenarios.
Deploying Kibana on machines with plenty of resources may require higher values. | Low | Low | Users can override the default in the configuration. We may want to run performance tests on a matrix of different hardware to identify the best defaults for different configurations (cc @elastic/kibana-qa). We could create a "Tuning Kibana" guide.
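The override mentioned in the mitigations is a single setting in kibana.yml (the value shown is illustrative):

```yaml
# kibana.yml — tune to the machine's resources; the new default is 800
elasticsearch.maxSockets: 1024
```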

For maintainers

This PR was built on top of #151110. The actual changes in this PR are in the commit c947695 (#151911).

@afharo afharo added the Team:Core (Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc.), performance, Feature:elasticsearch, and resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility) labels Feb 22, 2023
@afharo afharo force-pushed the es-limit-max-sockets-to-800 branch from 86fb563 to c947695 Compare February 23, 2023 10:04
@afharo afharo added release_note:enhancement backport:skip This PR does not require backporting labels Feb 23, 2023
@afharo (Member, Author) commented Feb 23, 2023

@elasticmachine merge upstream

@afharo (Member, Author) commented Feb 23, 2023

☝️ Updating from main because Scalability tests now print Ops Metrics to the logs (and we can learn more about memory and CPU usage)

@afharo afharo self-assigned this Feb 24, 2023
@afharo (Member, Author) commented Feb 24, 2023

There's a bug in the recent changes to the API Capacity CI. @dmlemeshko is looking into resolving it.

We'll run the tests again to compare CPU and Memory usage once it's resolved.

@rudolf (Contributor) commented Feb 24, 2023

Analysing throughput at different response-time cutoffs shows slight improvements in throughput for some APIs and slight degradation for others. But overall, nothing concerning here.


Source https://telemetry-v2-staging.elastic.dev/s/kibana-performance/app/dashboards#/view/d0d6bd30-b390-11ed-a6e6-d32d2209b7b7?_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow-3d%2Cto%3Anow))

It would be useful to be able to compare the "1600 data views no cache" scenario with main. We should also ask ResponseOps to do their scalability testing against this branch.

@afharo (Member, Author) commented Feb 27, 2023

To extend @rudolf's assessment, I thought it was useful to compare the sum of RPS for all thresholds (which version is capable of handling more requests?) and the RPS for the first threshold (which version is faster?):

(Chart) Median of the sum of RPS for all thresholds.

(Chart) Median of the RPS for the first threshold.

As highlighted by Rudolf, it shows slight improvements for some APIs and slight degradation in others. The deeper analysis I went through earlier showed the degradation is caused by a higher number of timeouts on already timeout-prone APIs.

It would be useful to be able to compare the "1600 data views no cache" scenario with main. We should also ask ResponseOps to do their scalability testing against this branch.

I'll push to get #151110 merged, so we can have those metrics from main.

I'll also work today on comparing the CPU and memory usage in both versions.

@afharo (Member, Author) commented Feb 27, 2023

I'll also work today on comparing the CPU and memory usage in both versions.

I processed the kibana.log files from main and this branch, and extracted the reported metrics.ops. I uploaded those to my own ES server and compared both versions for each API.

The logs and scripts used are here: scalability tests.zip.

TL;DR: CPU load and memory utilization are very similar. Event loop delays might be a decisive factor. However, I noticed that Ops Metrics is not able to report them when they're too high 😢. And, with the current data, I'd say they are inconclusive.

I'll process and upload the logs from subsequent CI jobs to have more data points and reduce the randomness of the data.

Preliminary analysis from 1 CI job for each version

api.core.capabilities

Based on the reported metrics, I think we can claim that CPU and memory utilization are very similar.

This version seems to improve the event loop delay for this API.

However, it's worth noting that both versions stopped reporting the event loop delay after two-thirds of the execution (while memory and CPU were still being reported).

api.metrics.vis.data

Again, CPU and memory look very similar.

Event loop delay is really bad on both versions; both stop reporting it right after the middle of the execution.

However, this version was able to report some event loop delay (despite it being very high), while main simply stopped reporting it and never came back.

Looking at the logs, we can leverage the log entries [plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. to attempt to fill in the blanks. Around 2m40s into the execution, main logs:

[2023-02-24T16:32:46.471+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 6443.761664ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
... 40s later
[2023-02-24T16:33:27.624+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 5338.693632ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
... 30s later
[2023-02-24T16:34:02.916+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 4639.883264ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
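Hypothetically, filling those blanks could be scripted by extracting the reported delay from each warning line (a sketch with assumed helper names, not part of the shared scripts):

```typescript
// Pulls the reported delay (in ms) out of kibanaUsageCollection warning
// lines like the ones quoted above, so the gaps in metrics.ops can be
// approximated from the log stream.
const DELAY_RE =
  /Average event loop delay threshold exceeded \d+ms\. Received ([\d.]+)ms/;

function extractDelaysMs(logLines: string[]): number[] {
  return logLines
    .map((line) => DELAY_RE.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => parseFloat(m[1]));
}
```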

However, I'd take these values with a grain of salt: I've seen logs where kibanaUsageCollection logged 6401ms and, 3 seconds later, metrics.ops logged 27724ms. I don't think the average can increase by 21s in only 3s... Bug?

api.saved_objects_tagging.tags

CPU and memory (charts omitted).

Regarding event-loop delays, it looks like main is able to keep reporting them for longer (and even comes back).

Looking closer at this version's logs, we can see the following log entries (the first appears at 3 minutes into the execution):

[2023-02-24T18:35:37.038+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 392.9593831325301ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:36:21.432+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 2630.936762181818ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:37:05.526+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 11021.01504ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:37:41.168+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 8484.480341333334ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
...
[2023-02-24T18:38:15.660+01:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 5732.847616ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.

It goes up to 11s at 4m30s into the execution and starts going down after that.

api.telemetry.cluster_stats

CPU and memory are very similar... I'll focus on event-loop delays from now on.

bundles.core.entry · internal.security.session · internal.security.user_profile · security.me

(event-loop delay charts for these four APIs omitted)

@afharo (Member, Author) commented Feb 27, 2023

I'll process and upload the logs from subsequent CI jobs to have more data points and reduce the randomness of the data.

Done! Results can be analyzed at https://sec-tests.kb.europe-west1.gcp.cloud.es.io:9243/s/ops-metrics-analysis/app/dashboards#/view/cab987b0-8c68-4432-b920-1abf76874bb6 (u: ops-metrics-read-only, p: ops-metrics-123 🔒😅).

TL;DR: the differences are smoothed out, and event-loop delays are more similar now. The overall picture stays the same: for some APIs main is slightly better, and for others this version slightly improves on it.

Raw data and processing scripts here.

@afharo afharo requested a review from a team as a code owner February 27, 2023 18:57
@afharo (Member, Author) commented Feb 27, 2023

Setting it as "ready to review" now... it looks like most of the results indicate that this change is good for Kibana's resilience in general, and the downsides are pretty minimal based on our tests (RPS and overall CPU and memory utilization are fairly similar).

Adding @dmlemeshko as an additional reviewer.

@rudolf (Contributor) left a comment

I think this would be an improvement, and I think 800 is a good value to start with. But at this point I'm not confident that our testing has ruled out a negative impact on some workloads.

Because we don't have socket pool size metrics on ESS, we also can't detect how many Kibanas might be exhausting their socket pools.

Perhaps we can add a logger similar to the "event loop delay threshold exceeded" logger. It would log whenever one of the httpAgents' open-socket counts reaches the maxSockets limit. This way, we (and users) can know that a cluster's throughput might be limited by the maxSockets config.
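A rough sketch of what that logger could look like (all names here are hypothetical, not Kibana's implementation; it relies on Node's Agent exposing `sockets`, a map of host → open sockets):

```typescript
import type { Agent } from 'http';

// Hypothetical helper: sum all open sockets across an agent's per-host pools.
function countOpenSockets(
  sockets: Readonly<Record<string, ReadonlyArray<unknown> | undefined>>
): number {
  return Object.values(sockets).reduce(
    (total, list) => total + (list?.length ?? 0),
    0
  );
}

// Hypothetical monitor: poll the agent and warn when the pool is saturated,
// mirroring the "event loop delay threshold exceeded" style of warning.
function startSocketSaturationLogger(
  agent: Agent,
  maxSockets: number,
  warn: (msg: string) => void,
  intervalMs = 30_000
): NodeJS.Timeout {
  return setInterval(() => {
    const open = countOpenSockets(agent.sockets);
    if (open >= maxSockets) {
      warn(
        `All ${maxSockets} sockets to Elasticsearch are in use; ` +
          `throughput may be limited by elasticsearch.maxSockets.`
      );
    }
  }, intervalMs);
}
```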

@afharo (Member, Author) commented Mar 1, 2023

Perhaps we can add a logger similar to the "event loop delay threshold exceeded" logger. This would log whenever one of the httpAgents open sockets count matches the maxSockets limit. This way we (and users) can know that a cluster's throughput might be limited by the maxSockets config.

I like that! I can look at how to add that logger.

@afharo (Member, Author) commented Mar 1, 2023

After taking a quick look, it seems adding such a log is not a trivial change. I've created #152452 so we can plan that work.

@rudolf, I'll move this to blocked until that issue is resolved.

@afharo (Member, Author) commented Mar 6, 2023

@elasticmachine merge upstream

@rudolf (Contributor) commented May 23, 2023

Update: we've enabled this on Cloud for deployments on 8.8.0+ as a way to monitor any potential impact. Once it's released with ms-93 in early June we should be able to validate the impact.

@gsoldevila (Member) commented Jul 10, 2023

Alternatively, we enriched Metricbeat with the kibana.stats.elasticsearch_client.* metrics. In particular, the number of queued requests should be monitored. AFAIK it was merged in beats 8.7.0, but the problem is that Cloud is still using 7.17.9 ATM.

Do you think we should push to have this updated?

@rudolf (Contributor) commented Aug 3, 2023

Do you think we should push to have this updated?

This is on their roadmap, but they didn't have a timeline the last time I asked.

@afharo afharo requested a review from a team November 28, 2023 15:30
@afharo afharo removed the blocked label Nov 28, 2023
@afharo afharo requested a review from rudolf November 28, 2023 15:32
@kibana-ci

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @afharo

@afharo afharo enabled auto-merge (squash) November 28, 2023 23:47
@justinkambic (Contributor) left a comment

Rubberstamp Codeowners LGTM, changing a value in a test file we own.

@afharo afharo merged commit ac16c65 into elastic:main Nov 29, 2023
@afharo afharo deleted the es-limit-max-sockets-to-800 branch November 29, 2023 04:17

Labels

backport:skip (This PR does not require backporting) · Feature:elasticsearch · performance · release_note:enhancement · resilience (Issues related to Platform resilience in terms of scale, performance & backwards compatibility) · Team:Core (Platform Core services: plugins, logging, config, saved objects, http, ES client, i18n, etc.) · v8.12.0


10 participants