Add maxIdleSockets and idleSocketTimeout to Elasticsearch config#142019
Conversation
Documentation preview:
```ts
disablePrototypePoisoningProtection: true,
agent: {
  maxSockets: config.maxSockets,
  maxTotalSockets: config.maxSockets,
```
maxSockets only limits the maximum number of sockets per host. What we want, and what our documentation describes, is the maximum number of sockets that Kibana would open to ES. Because there are multiple ES client instances, and hence multiple agents, this still doesn't behave 100% as advertised, but at least maxTotalSockets should bring the behaviour somewhat closer to what we advertise. https://nodejs.org/api/http.html#agentmaxtotalsockets
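As an illustration (not the PR's actual code), the difference between the two caps on a Node `http.Agent` looks like this; the value 800 is hypothetical:

```typescript
// Sketch: maxSockets caps concurrent sockets per origin (host:port),
// while maxTotalSockets caps them across every origin the agent serves.
// Setting both to the configured value gets closest to "total sockets
// Kibana opens to ES" as the docs describe.
import { Agent } from 'http';

const configMaxSockets = 800; // hypothetical kibana.yml value

const agent = new Agent({
  keepAlive: true,
  maxSockets: configMaxSockets, // per-origin limit only
  maxTotalSockets: configMaxSockets, // agent-wide limit
});
```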
```ts
maxSockets: config.maxSockets,
maxTotalSockets: config.maxSockets,
keepAlive: config.keepAlive ?? true,
timeout: getDurationAsMs(config.idleSocketTimeout),
```
Looking at the documentation, it seems that Node itself does not close / destroy the socket.
Rather, it emits a timeout event, and then it's up to the user to end the connection.
Thus, I'm not sure the connections are currently being closed on timeout. I'll give it a try locally and get back to you.
We're good, the Agent subscribes to the 'timeout' event and destroys the socket:
https://github.com/nodejs/node/blob/main/lib/_http_agent.js#L402-L414
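For reference, what the linked `_http_agent.js` code does can be sketched like this (simplified and with assumed names — the real code also removes the socket from the free pool):

```typescript
import { Socket } from 'net';

// Simplified version of what Node's Agent does when it parks a
// keep-alive socket in the free pool: arm the idle timeout and destroy
// the socket when it fires, so idle connections really are closed.
function parkFreeSocket(socket: Socket, idleTimeoutMs: number): void {
  socket.setTimeout(idleTimeoutMs);
  socket.once('timeout', () => {
    socket.destroy(); // the Agent, not the user, ends the connection
  });
}

const socket = new Socket();
parkFreeSocket(socket, 60_000);
```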
```ts
maxTotalSockets: config.maxSockets,
keepAlive: config.keepAlive ?? true,
timeout: getDurationAsMs(config.idleSocketTimeout),
maxFreeSockets: config.maxIdleSockets,
```
```diff
-import type { ClientOptions, HttpAgentOptions } from '@elastic/elasticsearch';
+import type { ClientOptions } from '@elastic/elasticsearch';
 import type { ElasticsearchClientConfig } from '@kbn/core-elasticsearch-server';
+import { AgentOptions } from 'https';
```
Although Elasticsearch-js exposes an HttpAgentOptions type, that type is not up to date with the Node.js type, so we use Node's own AgentOptions instead.
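A sketch of the resulting workaround, with illustrative option values:

```typescript
// Because @elastic/elasticsearch's HttpAgentOptions lags behind Node,
// typing agent options with Node's own AgentOptions (from 'https')
// keeps newer fields like maxTotalSockets available. The values below
// are illustrative, not the PR's defaults.
import type { AgentOptions } from 'https';

const agentOptions: AgentOptions = {
  keepAlive: true,
  maxTotalSockets: 1024, // typed in Node, may be missing upstream
  maxFreeSockets: 256,
};
```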
```ts
defaultValue: 'http://localhost:9200',
}),
maxSockets: schema.number({ defaultValue: Infinity, min: 1 }),
maxIdleSockets: schema.number({ defaultValue: 256, min: 1 }),
```
256 is the Node.js default, so setting it as the config's default value just makes this explicit and decouples us from any changes to Node.js itself.
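A minimal sketch of that decoupling (illustrative, not the PR's code):

```typescript
import { Agent } from 'http';

// Pinning maxFreeSockets explicitly keeps Kibana's idle-socket pool at
// 256 even if a future Node.js release changes its built-in default.
const maxIdleSockets = 256; // matches Node's current default, made explicit

const agent = new Agent({ keepAlive: true, maxFreeSockets: maxIdleSockets });
```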
```ts
}),
maxSockets: schema.number({ defaultValue: Infinity, min: 1 }),
maxIdleSockets: schema.number({ defaultValue: 256, min: 1 }),
idleSocketTimeout: schema.duration({ defaultValue: '9m' }),
```
The services forwarder has a timeout of 10 minutes. But a race condition is possible: the proxy sends a close, the close packet reaches Kibana with some delay, and in the meantime Kibana might start a new request on that socket. To be on the safe side we keep a margin of more than requestTimeout (30s by default) so that Kibana closes any idle sockets before the proxy would try to close them.
On Cloud we can configure Kibana to default to 9 minutes, but because this would be a behaviour change and might cause problems if there's a transparent proxy, I chose to default to 60s.
(This image shows the old services-forwarder socket timeout of 30s)

Beware that your implementation is still setting '9m'
Apart from the configuration options, this introduces a behaviour change: sockets now have a 60s timeout instead of Elasticsearch-js's 30s default.

Tested this locally by creating load with K6.

Kibana <-> Elasticsearch socket count:

K6 <-> Kibana socket count:

Some observations from the tests. When Kibana has a large number of open sockets, the socket count continuously jumps up and down by several hundred, e.g. 1500 sockets and 2s later 1200 sockets. This is because the default maxIdleSockets is 256, so when there are e.g. 2000 sockets open it quickly happens that more than 256 of them become free, and then we close a ton of sockets just to re-open them a second or so later.

When I set a really high maxIdleSockets value (e.g. the same as maxSockets, or > 2000), the number of open sockets is much more stable. That basically means sockets will only be closed if they're idle, which seems preferable to me. If there's a sudden spike in traffic, there will be a large number of unnecessary sockets for the next idleSocketTimeout, but that's not a huge issue in and of itself.
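The churn-free tuning from the observations above would look roughly like this (a sketch; 2048 is an arbitrary example value):

```typescript
import { Agent } from 'http';

const maxSockets = 2048;

// With maxFreeSockets raised to match maxSockets, the free-pool cap no
// longer forces close/re-open churn after a traffic spike; sockets are
// then closed only by the idle timeout.
const agent = new Agent({
  keepAlive: true,
  maxSockets,
  maxTotalSockets: maxSockets,
  maxFreeSockets: maxSockets, // pool cap no longer forces churn
  timeout: 60_000, // idle sockets still reclaimed after 60s
});
```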
Pinging @elastic/kibana-core (Team:Core)
weltenwort
left a comment
monitoring test changes LGTM
@elasticmachine merge upstream
pgayvallet
left a comment
Overall looking good to me, just have one comment.
As a side note, do we want to allow-list these new settings in Cloud?
```ts
const config = Object.assign(
  {},
  DEFAULT_CONFIG,
  this.agentOptions,
  agentOptions,
  connectionOpts.tls
);
```
I'm not sure I get the exact implications of the changes in this file?
We had a way to pass a default agent config into the agent manager constructor. But this functionality was never used, and it sometimes confused me while developing because it wasn't obvious where the final agent's config actually came from. It was just a small moment of "what's going on here", but if we're not using this "feature", it's easier to reason about the code if we just remove it altogether.
Actually, there were several levels:

1. `DEFAULT_CONFIG`: static defaults in code
2. `this.agentOptions`: optional constructor parameter that can set defaults for any agent created by the agent factories of this agent manager
3. finally, the kibana.yml defaults that are passed into the agent factories when they are created.

So now we just have (3)
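Since `Object.assign` merges left to right with later sources winning, dropping levels (1) and (2) is behaviour-preserving as long as (1) only duplicated defaults and (2) was empty. A hypothetical sketch with made-up option names:

```typescript
type Opts = Record<string, unknown>;

// Level (1): static defaults equivalent to what later sources set anyway.
const DEFAULT_CONFIG: Opts = { keepAlive: true, maxFreeSockets: 256 };
// Level (3): options derived from kibana.yml.
const kibanaYmlOptions: Opts = { keepAlive: true, maxFreeSockets: 256, timeout: 60_000 };
// Per-connection TLS options, merged last.
const tls: Opts = { rejectUnauthorized: false };

// Before: Object.assign({}, DEFAULT_CONFIG, this.agentOptions, ymlOpts, tls)
// After:  Object.assign({}, ymlOpts, tls) — identical result here.
const before = Object.assign({}, DEFAULT_CONFIG, {}, kibanaYmlOptions, tls);
const after = Object.assign({}, kibanaYmlOptions, tls);

console.log(JSON.stringify(before) === JSON.stringify(after)); // true
```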
Makes sense! (1)'s values are equivalent to Node.js's defaults, and (2) was not used.
I would say it makes sense, so that we can tweak them once we have metrics about them. LGTM!
@elasticmachine merge upstream
💚 Build Succeeded
…stic#142019)

* Add maxIdleSockets and idleSocketTimeout to Elasticsearch config
* Simplify agent manager
* Fix types
* Fix types
* Reduce idleSocketTimeout default to 60s
* Fix tests
* Update docs/setup/settings.asciidoc
* Address review comments

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Summary
Closes #137673