fix hot keepalives by fspmarshall · Pull Request #53298 · gravitational/teleport

fspmarshall · 2025-03-21T19:08:06Z

The original transition to control-stream-based heartbeats for the multi-resource services would write the full set of resources associated with a given service in a single hot loop. With relatively small resource counts this didn't matter much, but some folks have deployments with thousands of resources per service. In these scenarios the large write spikes are likely to induce throttling and/or performance issues, and may interfere with the event systems of some backends.

This PR switches over to using a jittered delay per resource in order to evenly distribute writes. This effectively returns us to the old-style backend write load pattern from before the switch to control-stream-based heartbeats, though at a potentially lower rate now that we have variable-rate heartbeats.

changelog: fixed an issue that could cause backend instability when running very large numbers of app/db/kube resources through a single agent.

espadolini · 2025-03-25T17:38:04Z

+	// key for later return.
+	root := h.heap.Root()
+	key = root.key
+	root.tick = now.Add(h.interval(false /* first */))


Should we bump the tick up from itself rather than from now?

Suggested change

-	root.tick = now.Add(h.interval(false /* first */))
+	root.tick = root.tick.Add(h.interval(false /* first */))

fspmarshall · 2025-03-25

I kind of like it this way. If perf is good, the difference is inconsequential and we might as well calculate from the variable that isn't behind a pointer. If perf is bad, this adds a bit of slack by always scheduling the (N+1)th firing a full duration after the Nth firing, even if the Nth firing was delayed in being observed for some reason. For the intended use case, I think that's a desirable property, and I think it's as sensible a behavior as anything else given that the API doesn't try to hide the fact that we're not resetting until Tick is called.

espadolini · 2025-03-25

I don't know if I agree - this is not a goroutine that's only tightly looping on the heartbeats, we might be stuck somewhere else for a bit before coming back to the timer, couldn't we?

If choosing the correct time is inconsequential, why are we bringing over the timestamp from the timer instead of just calling Now()?

backport-bot-workflows · 2025-03-25T20:46:54Z

@fspmarshall See the table below for backport results.

Branch	Result
branch/v17	Failed

fspmarshall added the backport/branch/v17 label Mar 21, 2025

github-actions Bot requested review from gzdunek and rosstimothy March 21, 2025 19:08

github-actions Bot added the size/md label Mar 21, 2025

rosstimothy requested a review from espadolini March 21, 2025 19:09

fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch 2 times, most recently from 393808a to 6b03d00 Compare March 24, 2025 15:26

espadolini reviewed Mar 24, 2025

fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch 3 times, most recently from 71f8bc2 to 99ae4c3 Compare March 25, 2025 15:33

fspmarshall requested a review from espadolini March 25, 2025 15:33

fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch 2 times, most recently from 3574156 to 5455a20 Compare March 25, 2025 17:30

espadolini approved these changes Mar 25, 2025

fix hot keepalives

1daa4fc

fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch from 5455a20 to 1daa4fc Compare March 25, 2025 17:57

rosstimothy approved these changes Mar 25, 2025

fspmarshall added this pull request to the merge queue Mar 25, 2025

Merged via the queue into master with commit b615f23 Mar 25, 2025
40 checks passed

fspmarshall deleted the fspmarshall/fix-hot-keepalives branch March 25, 2025 20:45

fspmarshall added a commit that referenced this pull request Mar 25, 2025

fix hot keepalives (#53298)

8bbf1d0

fspmarshall mentioned this pull request Mar 25, 2025

[v17] fix hot keepalives #53419

Merged

github-merge-queue Bot pushed a commit that referenced this pull request Mar 25, 2025

fix hot keepalives (#53298) (#53419)

93a4d9f

fspmarshall mentioned this pull request Mar 26, 2025

rate limit resource cleanup #53463

Merged

espadolini mentioned this pull request Apr 8, 2025

Avoid throttling on dynamodb DescribeStream #53790

Merged

fspmarshall merged 1 commit into master from fspmarshall/fix-hot-keepalives


3 participants
