fix: reuse serialization buffer in PayloadSenderV2 to eliminate LOH fragmentation#2770
Conversation
|
💚 CLA has been signed |
🤖 GitHub commentsJust comment with:
|
|
👋 @nimanikoo Thanks a lot for your contribution! It may take some time before we review a PR, so even if you don’t see activity for some time, it does not mean that we have forgotten about it. Every once in a while we go through a process of prioritization, after which we are focussing on the tasks that were planned for the upcoming milestone. The prioritization status is typically reflected through the PR labels. It could be pending triage, a candidate for a future milestone, or have a target milestone set to it. |
There was a problem hiding this comment.
Thanks @nimanikoo for identifying and fixing this. The reasoning is sound and the fix looks appropriate. Code-wise, I'm happy to take this as-is. My slight concern is keeping the large buffer if it grows which may mean one large payload causes a large retained allocation than we generally need. But that concern is likely small and I'll review if we want to make any followup changes to resize.
A couple of minor nits on the styling of the XML docs comments and this is good to merge once CI checks pass.
EDIT: Looks like you may need to run dotnet format over this to fix some of the whitespace etc.
|
@stevejgordon I appreciate you taking the time to look into both the implementation and the potential memory implications. I'll address the XML doc styling comments and run I'm looking forward to contributing more meaningful improvements to the project after this PR . This was my first contribution and more of a starting point, and I'm excited to continue getting involved and helping wherever I can. Thanks again |
stevejgordon
left a comment
There was a problem hiding this comment.
Thanks again @nimanikoo. Looks good, but could you please exclude the LICENSE file from the commit. It gets regenerated and often shows as a change we don't really need.
Restore src/instrumentations/Elastic.Apm.MongoDb/LICENSE to its upstream state — it is auto-generated and should not be part of this PR.
|
run docs-build |
We've been seeing steady process memory growth in long-running deployments (visible after ~2 weeks uptime).
The root cause:
ProcessQueueItemswas allocating anew MemoryStream(1024)on every flush batch.A
MemoryStreamdoubles its internal byte array as it grows, and for batches with many spansthe final buffer crosses the 85 KB Large Object Heap threshold. LOH is not compacted during
ordinary GC cycles, so each flush permanently fragments heap and over millions of batches
the process RSS just keeps climbing.
The fix is simple: allocate one
MemoryStreamas a field, and reset it withSetLength(0)before each batch instead of creating a new one.
SetLength(0)resets the logical length andwrite cursor without releasing the underlying buffer, so the same allocation is reused forever.
This is safe because
ProcessQueueItemsis always called from a single dedicated backgroundthread (
ElasticApmPayloadSenderV2), so there's no concurrency concern.One subtlety caught during testing:
StreamContentdisposes its underlying stream when theusingblock exits. The fix wraps the reusable buffer in a lightweight, non-owningMemoryStreamview before handing it toStreamContent, so the shared buffer is neverclosed between batches.
Benchmark results (Apple M4, .NET 8, BenchmarkDotNet):
At production-scale batch sizes, Gen1 (and with it, LOH-adjacent promotion pressure) drops
to zero. In a high-throughput service flushing thousands of batches per minute, this is the
difference between stable and ever-growing resident memory.
Testing: added
SequentialBatches_SerializationBufferIsIsolatedto guard that buffer reusenever bleeds content from one batch into the next.