Cap TrackedSession activity grid at 500 rows to prevent OOM on timeout #3045
Merged
A noisy integration test (e.g. a chaos-monkey'd cluster emitting
thousands of alert / heartbeat / DLQ envelopes during a tracked
window) can have BuildActivityMessage produce a grid so large that
Grid.Write -> StringBuilder.ToString() throws System.OutOfMemoryException
before the TimeoutException is ever surfaced.
Concrete repro from CritterWatch#304 follow-up: ~7,078,703 envelope
records accumulated in 60s due to a relay-handler recursion on the
consumer side; the OOM masked the actual timeout for several diagnostic
cycles. With the cap, the same scenario surfaces:
(showing last 500 of 7,078,703 envelope records — earlier 7,078,203
omitted for readability)
which immediately points at the runaway message type ("99.8% of envelopes
are alert_changes_batch") and the bug is named in seconds.
500 rows is chosen as a value that:
- comfortably formats within StringBuilder bounds (~50KB output);
- shows enough trailing activity to identify where the session is hung
or what message type is dominating the activity;
- doesn't truncate normal-shape integration tests (most run with far
fewer than 500 envelope events).
Cap is exposed as `internal const int ActivityGridRowLimit` so tests
can read it back if needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced Jun 8, 2026
Merged
A chaos-monkey'd integration test (originally hit on JasperFx/CritterWatch#304) can have `BuildActivityMessage` produce a grid so large that `Grid.Write` → `StringBuilder.ToString()` throws `System.OutOfMemoryException` before the `TimeoutException` is ever surfaced to the test.
The OOM masks the actual timeout for several diagnostic cycles: you don't see "tracked session timed out at 60s waiting for X", you see a stack trace bottoming out in `StringBuilder.ToString` and nothing about which message type was actually flooding the activity buffer.
Concrete repro from the CritterWatch follow-up: 7,078,703 envelope records accumulated in 60s because of a relay-handler recursion on the consumer side, and the OOM had me chasing the wrong root cause for a couple iterations before I patched the grid cap locally to see the real story.
The fix
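Cap `BuildActivityMessage` at the most recent 500 envelope records when the activity list would otherwise be larger. The cap is exposed as `internal const int ActivityGridRowLimit = 500` so tests can read it back if needed. When truncation kicks in, the output gets a one-line preamble so the operator knows what was omitted:
(showing last 500 of 7,078,703 envelope records — earlier 7,078,203 omitted for readability)
500 is chosen as a value that:
- comfortably formats within StringBuilder bounds (~50KB output);
- shows enough trailing activity to identify where the session is hung or what message type is dominating the activity;
- doesn't truncate normal-shape integration tests (most run with far fewer than 500 envelope events).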
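The capping behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the class name `ActivityGridSketch`, the use of plain strings as stand-ins for envelope records, and the method signature are all assumptions made for the example. Only the `ActivityGridRowLimit` name, its value of 500, and the preamble wording come from this PR.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical stand-in for the real TrackedSession code. Records are
// modelled as pre-formatted strings purely to keep the sketch small.
public static class ActivityGridSketch
{
    // Name and value taken from the PR description.
    internal const int ActivityGridRowLimit = 500;

    public static string BuildActivityMessage(IReadOnlyList<string> records)
    {
        var builder = new StringBuilder();

        // Number of leading records that will not be rendered.
        var omitted = Math.Max(0, records.Count - ActivityGridRowLimit);

        if (omitted > 0)
        {
            // One-line preamble so the reader knows rows were dropped.
            // \u2014 is the dash character used in the PR's sample output.
            builder.AppendLine(
                $"(showing last {Format(ActivityGridRowLimit)} of {Format(records.Count)} " +
                $"envelope records \u2014 earlier {Format(omitted)} omitted for readability)");
        }

        // Render only the trailing rows, so the output size is bounded by
        // the cap rather than by how many records accumulated.
        for (var i = omitted; i < records.Count; i++)
        {
            builder.AppendLine(records[i]);
        }

        return builder.ToString();
    }

    // Invariant culture keeps the thousands separator as ',' on any machine.
    private static string Format(int value) =>
        value.ToString("N0", CultureInfo.InvariantCulture);
}
```

In this sketch, a list of 502 records would render the preamble followed by records 3 through 502, while a list of 500 or fewer is written out unchanged with no preamble.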
Why "last N" and not "first N"
Once you're already at avalanche scale, the leading edge is uninteresting, since you already know there was an avalanche. The trailing edge is the actionable signal: which message type was the session still chasing when the budget ran out. In the CritterWatch repro this immediately surfaced "every recent envelope is `alert_changes_batch`" and named the recursion loop within seconds of reading the output.
Testing
Verified against the same chaos-monkey'd integration scenario that surfaced the bug (~7M envelope records in 60s):
- Without the cap: `BuildActivityMessage` throws `OutOfMemoryException`, and the timeout cause is masked.
- With the cap: `BuildActivityMessage` formats cleanly with the truncation preamble, and the timeout cause is immediately visible.
No existing TrackedSession unit/integration tests in the repo are affected: the cap only changes output when there are more than 500 records, and Wolverine's own test suites stay well under that.