Skip to content

fix hot keepalives#53298

Merged
fspmarshall merged 1 commit intomasterfrom
fspmarshall/fix-hot-keepalives
Mar 25, 2025
Merged

fix hot keepalives#53298
fspmarshall merged 1 commit intomasterfrom
fspmarshall/fix-hot-keepalives

Conversation

@fspmarshall
Copy link
Copy Markdown
Contributor

@fspmarshall fspmarshall commented Mar 21, 2025

The original transition to control-stream based heartbeats for the multi-resource services would write the full set of resources associated with a given service in a single hot loop. With relative small resource counts this didn't matter much, but some folks have deployments with thousands of resources per service. In these scenarios the large write spikes are likely to induce throttling and/or performance issues, and may interfere with the event systems of some backends.

This PR switches over to using a jittered delay per resource in order to evenly distribute writes (effectively returning us to the old style backend write load pattern before the switch to control-stream based heartbeats, though at a potentially lower rate now that we have variable-rate heartbeats).

changelog: fixed an issue that could cause backend instability when running very large numbers of app/db/kube resources through a single agent.

@github-actions github-actions Bot requested review from gzdunek and rosstimothy March 21, 2025 19:08
@rosstimothy rosstimothy requested a review from espadolini March 21, 2025 19:09
@fspmarshall fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch 2 times, most recently from 393808a to 6b03d00 Compare March 24, 2025 15:26
Comment thread lib/inventory/controller.go Outdated
Comment thread lib/inventory/controller.go
Comment thread lib/inventory/internal/delay/heap.go Outdated
Comment thread lib/inventory/internal/delay/heap.go Outdated
Comment thread lib/inventory/internal/delay/multi.go Outdated
Comment thread lib/inventory/internal/delay/multi.go Outdated
Comment thread lib/inventory/internal/delay/multi.go Outdated
Comment thread lib/inventory/internal/delay/multi.go Outdated
Comment thread lib/inventory/internal/delay/multi.go Outdated
Comment thread lib/inventory/internal/delay/multi.go Outdated
@fspmarshall fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch 3 times, most recently from 71f8bc2 to 99ae4c3 Compare March 25, 2025 15:33
@fspmarshall fspmarshall requested a review from espadolini March 25, 2025 15:33
@fspmarshall fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch 2 times, most recently from 3574156 to 5455a20 Compare March 25, 2025 17:30
// key for later return.
root := h.heap.Root()
key = root.key
root.tick = now.Add(h.interval(false /* first */))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we bump the tick up from itself rather than from now?

Suggested change
root.tick = now.Add(h.interval(false /* first */))
root.tick = root.tick.Add(h.interval(false /* first */))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of like it this way. If perf is good, the difference is inconsequential and we might as well calculate from the variable that isn't behind a pointer. If perf is bad, this adds a bit of slack by always scheduling the N+1th firing a full duration after then Nth firing, even if the Nth firing was delayed in being observed for some reason. For the intended usecase, I think thats a desirable property, and I think its as sensible a behavior as anything else given that the API doesn't try to hide the fact that we're not resetting until Tick is called.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if I agree - this is not a goroutine that's only tightly looping on the heartbeats, we might be stuck somewhere else for a bit before coming back to the timer, couldn't we?

If choosing the correct time is inconsequential, why are we bringing over the timestamp from the timer instead of just calling Now()?

Comment thread lib/inventory/internal/delay/multi.go Outdated
@fspmarshall fspmarshall force-pushed the fspmarshall/fix-hot-keepalives branch from 5455a20 to 1daa4fc Compare March 25, 2025 17:57
@fspmarshall fspmarshall added this pull request to the merge queue Mar 25, 2025
Merged via the queue into master with commit b615f23 Mar 25, 2025
40 checks passed
@fspmarshall fspmarshall deleted the fspmarshall/fix-hot-keepalives branch March 25, 2025 20:45
@backport-bot-workflows
Copy link
Copy Markdown
Contributor

@fspmarshall See the table below for backport results.

Branch Result
branch/v17 Failed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants