Cap TrackedSession activity grid at 500 rows to prevent OOM on timeout #3045

Merged
jeremydmiller merged 1 commit into main from fix/304-trackedsession-grid-truncate
Jun 6, 2026

Conversation

@jeremydmiller

Member

A chaos-monkey'd integration test (originally hit on JasperFx/CritterWatch#304) can make BuildActivityMessage produce a grid so large that Grid.Write -> StringBuilder.ToString() throws System.OutOfMemoryException before the TimeoutException is ever surfaced to the test.

The OOM masks the actual timeout for several diagnostic cycles — you don't see "tracked session timed out at 60s waiting for X", you see a stack trace bottoming out in StringBuilder.ToString and nothing about which message type was actually flooding the activity buffer.

Concrete repro from the CritterWatch follow-up: 7,078,703 envelope records accumulated in 60s because of a relay-handler recursion on the consumer side, and the OOM had me chasing the wrong root cause for a couple of iterations before I patched the grid cap locally to see the real story.

The fix

Cap BuildActivityMessage at the most recent 500 envelope records when the activity list would otherwise be larger. The cap is exposed as internal const int ActivityGridRowLimit = 500 so tests can read it back if needed. When truncation kicks in, the output gets a one-line preamble so the operator knows what was omitted:

(showing last 500 of 7,078,703 envelope records — earlier 7,078,203 omitted for readability)
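
As a rough illustration of the cap (a standalone sketch, not the code in this PR), the following keeps the trailing rows and builds the preamble. `MostRecent`, `TruncationPreamble`, `RowLimit`, and the string rows are hypothetical stand-ins; only the 500 limit and the preamble wording come from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Same value as the PR's ActivityGridRowLimit; the name here is illustrative.
const int RowLimit = 500;

// Keep only the most recent `limit` rows, preserving their order.
List<T> MostRecent<T>(IReadOnlyList<T> records, int limit) =>
    records.Skip(Math.Max(0, records.Count - limit)).ToList();

// One-line note describing what was cut, or an empty string when nothing was.
string TruncationPreamble(int total, int limit)
{
    if (total <= limit) return string.Empty;

    // \u2014 is the dash used in the preamble text quoted above.
    return string.Format(
        CultureInfo.InvariantCulture,
        "(showing last {0:N0} of {1:N0} envelope records \u2014 earlier {2:N0} omitted for readability)",
        limit, total, total - limit);
}

// Usage: 1,200 fake rows stand in for the recorded envelopes.
var allRecords = Enumerable.Range(1, 1200).Select(i => $"envelope {i}").ToList();
var rows = MostRecent(allRecords, RowLimit);

Console.WriteLine(TruncationPreamble(allRecords.Count, RowLimit));
Console.WriteLine($"{rows.Count} rows, from '{rows[0]}' to '{rows[^1]}'");
```

Slicing before formatting is the point: the grid writer only ever sees at most 500 rows, so the size of the rendered string no longer depends on how many envelopes were recorded.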

500 is chosen as a value that:

  • comfortably formats within StringBuilder bounds (~50KB output);
  • shows enough trailing activity to identify where the session hung or what message type dominated;
  • doesn't truncate normal-shape integration tests (most run with well under 500 envelope events).
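
A back-of-the-envelope size estimate shows the gap between the capped and uncapped grids. This is only a sketch: the 100 characters per row is an assumption (reading "~50KB" over 500 rows as roughly 50,000 characters), and `EstimatedChars` / `EstimatedBytes` are hypothetical helpers, not part of Wolverine.

```csharp
using System;

// Assumption: about 100 characters per rendered grid row (~50KB / 500 rows).
const long AssumedCharsPerRow = 100;

// Rough size of the rendered grid in characters for a given row count.
long EstimatedChars(long rowCount) => rowCount * AssumedCharsPerRow;

// .NET strings are UTF-16, so each char occupies two bytes in memory.
long EstimatedBytes(long rowCount) => EstimatedChars(rowCount) * sizeof(char);

Console.WriteLine($"capped:   {EstimatedChars(500):N0} chars, {EstimatedBytes(500):N0} bytes");
Console.WriteLine($"uncapped: {EstimatedChars(7_078_703):N0} chars, {EstimatedBytes(7_078_703):N0} bytes");
```

Under that assumption the uncapped grid from the repro would need on the order of 1.4 GB for the final string alone, before counting the StringBuilder's own buffers, which is consistent with the OutOfMemoryException described above.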

Why "last N" and not "first N"

Once you're already at avalanche scale, the leading edge is uninteresting — you already know there was an avalanche. The trailing edge is the actionable signal: which message type was the session still chasing when the budget ran out. In the CritterWatch repro this immediately surfaced "every recent envelope is alert_changes_batch" and named the recursion loop within seconds of reading the output.
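
The "which message type dominated" reading of the trailing window can be sketched as a simple tally. `DominantTypes` is a hypothetical helper and the input data is made up for illustration; the real activity rows carry more than a type name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Count how often each message type appears in the trailing window, most frequent first.
List<(string MessageType, int Count)> DominantTypes(IEnumerable<string> trailingMessageTypes) =>
    trailingMessageTypes
        .GroupBy(name => name)
        .Select(sameType => (MessageType: sameType.Key, Count: sameType.Count()))
        .OrderByDescending(entry => entry.Count)
        .ToList();

// Usage: an invented 500-row window that is almost entirely one message type.
var window = Enumerable.Repeat("alert_changes_batch", 499).Append("heartbeat");
var top = DominantTypes(window)[0];

Console.WriteLine($"{top.MessageType}: {top.Count} of 500");
```

The same tally over the first 500 rows would mostly show whatever the session was doing before the runaway loop began, which is why the cap keeps the tail.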

Testing

Verified against the same chaos-monkey'd integration scenario that surfaced the bug (~7M envelope records in 60s):

  • Before: BuildActivityMessage throws OutOfMemoryException, and the timeout cause is masked.
  • After: BuildActivityMessage formats cleanly with the truncation preamble, and the timeout cause is immediately visible.

No existing TrackedSession unit/integration tests in the repo are affected — the cap only changes output when there are >500 records, and Wolverine's own test suites stay well under that.

A noisy integration test (e.g. a chaos-monkey'd cluster emitting
thousands of alert / heartbeat / DLQ envelopes during a tracked
window) can have BuildActivityMessage produce a grid so large that
Grid.Write -> StringBuilder.ToString() throws System.OutOfMemoryException
before the TimeoutException is ever surfaced.

Concrete repro from CritterWatch#304 follow-up: ~7,078,703 envelope
records accumulated in 60s due to a relay-handler recursion on the
consumer side; the OOM masked the actual timeout for several diagnostic
cycles. With the cap, the same scenario surfaces:

    (showing last 500 of 7,078,703 envelope records — earlier 7,078,203
     omitted for readability)

which immediately points at the runaway message type ("99.8% of envelopes
are alert_changes_batch") and the bug is named in seconds.

500 rows is chosen as a value that:
- comfortably formats within StringBuilder bounds (~50KB output);
- shows enough trailing activity to identify where the session is hung
  or what message type is dominating the activity;
- doesn't truncate normal-shape integration tests (most run with far
  fewer than 500 envelope events).

Cap is exposed as `internal const int ActivityGridRowLimit` so tests
can read it back if needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jeremydmiller jeremydmiller merged commit 01517a4 into main Jun 6, 2026
23 of 24 checks passed
This was referenced Jun 8, 2026