reduce fluentd connection churn during high event-handler load #48909

Merged
fspmarshall merged 1 commit into master from fspmarshall/reduce-fluentd-connection-resets on Nov 15, 2024

@fspmarshall fspmarshall commented Nov 13, 2024

This PR fixes an issue where the Teleport event handler could cause excess CPU usage and connection reset errors in Fluentd when under load. Previously, the event handler's Fluentd client would create an unlimited number of outbound connections but retain at most 2 idle connections. This led to frequent creation and destruction of TLS connections whenever the event handler tried to send many concurrent events to Fluentd. With these changes, the Fluentd client now limits peak connections and allows all idle connections to persist for up to 30s.

We've been aware for a while that the Fluentd client had a tendency toward connection churn, but it never caused any measurable issues in our performance tuning tests. It turns out this was likely due to how our performance tests were being run. Our performance tuning is generally done with the intent of maximizing event throughput for very large Teleport clusters (i.e. clusters with many tens of thousands of events per minute), and for that reason tends to be done with pretty beefy VMs running the exporter and Fluentd. It seems that, for whatever reason, the churn issue only really comes to the fore on smaller, more resource-constrained machines, where it can cause up to a 4x slowdown in overall event throughput.

changelog: Fixed an issue resulting in excess CPU usage and connection resets when teleport-event-handler is under moderate to high load.

Comment thread integrations/event-handler/fluentd_client.go Outdated
@fspmarshall fspmarshall force-pushed the fspmarshall/reduce-fluentd-connection-resets branch from 5eefff1 to dbbfa16 Compare November 13, 2024 18:15
@fspmarshall fspmarshall force-pushed the fspmarshall/reduce-fluentd-connection-resets branch from dbbfa16 to 79fa179 Compare November 13, 2024 18:28
@fspmarshall fspmarshall force-pushed the fspmarshall/reduce-fluentd-connection-resets branch from 79fa179 to 77cfffe Compare November 14, 2024 22:01
@fspmarshall fspmarshall added this pull request to the merge queue Nov 14, 2024
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Nov 14, 2024
@fspmarshall fspmarshall added this pull request to the merge queue Nov 15, 2024
Merged via the queue into master with commit 8ab4052 Nov 15, 2024
@fspmarshall fspmarshall deleted the fspmarshall/reduce-fluentd-connection-resets branch November 15, 2024 00:22
@public-teleport-github-review-bot

@fspmarshall See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v15 | Failed |
| branch/v16 | Failed |
| branch/v17 | Failed |

3 participants