reduce fluentd connection churn during high event-handler load #48909
Merged
fspmarshall merged 1 commit into master (Nov 15, 2024)
Conversation
rosstimothy approved these changes (Nov 13, 2024)
tigrato reviewed (Nov 13, 2024)
tigrato approved these changes (Nov 13, 2024)
Force-pushed: 5eefff1 to dbbfa16
tigrato approved these changes (Nov 13, 2024)
Force-pushed: dbbfa16 to 79fa179
Force-pushed: 79fa179 to 77cfffe
@fspmarshall See the table below for backport results.
fspmarshall added a commit that referenced this pull request (Nov 15, 2024)
This was referenced Nov 15, 2024
This PR fixes an issue where the teleport event handler could cause excess CPU usage and connection reset errors in Fluentd when under load. Previously, the event handler's Fluentd client would create an unlimited number of outbound connections but retain at most 2 idle connections. This led to frequent creation and teardown of TLS connections whenever the event handler was sending many concurrent events to Fluentd. With these changes, the Fluentd client both caps peak connections and allows all idle connections to persist for up to 30s.
We've been aware for a while that the Fluentd client had a tendency toward connection churn, but it never caused any measurable issues in our performance tuning tests. It turns out this was likely due to how our performance tests were performed. Our performance tuning is generally done with the intent of maximizing event throughput for very large Teleport clusters (i.e. clusters producing many tens of thousands of events per minute), and for that reason tends to be done with fairly beefy VMs running the exporter and Fluentd. It seems that, for whatever reason, it's only when running on smaller, more resource-constrained machines that the churn issue really comes to the fore, causing up to a 4x slowdown in overall event throughput.
changelog: fixed an issue resulting in excess CPU usage and connection resets when teleport-event-handler is under moderate to high load.