perf(server): SIGUSR2 writes V8 heap snapshot #768

Merged: buremba merged 2 commits into `main` from `perf/heap-snapshot-on-sigusr2` on May 16, 2026

@buremba buremba commented May 16, 2026

Why

Post-incident measurement (see lobu#767): with the queue healthy again, the app pod still grows from baseline toward the 1Gi limit. Without an inspector port we can't see what's allocating.

Sample taken from `summaries-app-lobu-app-77756ccdd7-dkh2l`:

| T+ | RSS | cgroup.memory.current |
|---|---|---|
| 90 min | 649 MB | 690 MB |
| 90.5 min | 652 MB | 693 MB |

That's ~3 MB / 30s of slow growth. Pre-fix the same pod hit 1Gi in 70 min from the same baseline (driven by the schema-mismatch error pile-up). Post-fix it's slower but not zero — there's a residual leak.
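To make the growth figure concrete, here is a minimal sketch that extrapolates linearly from the two `cgroup.memory.current` samples in the table. The helper name `minutesUntilLimit` is hypothetical, the 1Gi limit is treated as 1024 MB for simplicity, and two samples 30 seconds apart give only a rough estimate.

```typescript
// Hypothetical helper: linear extrapolation from two memory samples.
interface Sample {
  minutes: number;
  megabytes: number;
}

function minutesUntilLimit(a: Sample, b: Sample, limitMb: number): number {
  const ratePerMin = (b.megabytes - a.megabytes) / (b.minutes - a.minutes);
  if (ratePerMin <= 0) return Infinity; // not growing: the limit is never reached
  return (limitMb - b.megabytes) / ratePerMin;
}

// cgroup.memory.current samples from the table; 1Gi approximated as 1024 MB.
const eta = minutesUntilLimit(
  { minutes: 90, megabytes: 690 },
  { minutes: 90.5, megabytes: 693 },
  1024,
);
console.log(`~${eta.toFixed(0)} min of headroom at the sampled rate`);
```

At 3 MB per 30 s the rate is 6 MB/min, so the remaining 331 MB would last a little under an hour if growth stayed linear.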

What

`process.on('SIGUSR2', () => v8.writeHeapSnapshot('/tmp/...'))` so we can dump on demand:

```
POD=$(kubectl get pod -n summaries-prod -l app.kubernetes.io/component=api -o name | head -1)
kubectl exec -n summaries-prod "$POD" -- kill -USR2 1

# wait for the "snapshot written" log line, then:
kubectl cp "summaries-prod/$(basename "$POD"):/tmp/.heapsnapshot" ./lobu.heapsnapshot

# open in Chrome DevTools → Memory → Load
```

Notes

  • `writeHeapSnapshot` is synchronous and blocks the event loop for a time proportional to heap size (typically several seconds). Only trigger it manually when investigating.
  • SIGUSR2 is free on Node (SIGUSR1 is the one Node reserves for the inspector).
  • No security risk — triggering requires `kubectl exec` access.

Test plan

  • `make typecheck` clean
  • `make build-packages` builds
  • After merge: deploy, send SIGUSR2 once, confirm a .heapsnapshot lands in /tmp and is openable in DevTools.

Summary by CodeRabbit

  • Chores
    • Server now supports on-demand heap snapshot diagnostics when enabled.
    • Snapshots are written to a temporary location and logged on success or failure.
    • Additional snapshot requests are ignored while a snapshot is in progress to avoid conflicts.


Adds a `process.on('SIGUSR2', ...)` handler that calls
`v8.writeHeapSnapshot('/tmp/lobu-<pid>-<ts>.heapsnapshot')`. Lets us
profile the leak that survived lobu#767: post-fix the app pod still
grows ~3 MB/30s toward the 1Gi limit even with the queue healthy.

Usage:

    POD=$(kubectl get pod -n summaries-prod \
      -l app.kubernetes.io/component=api -o name | head -1)
    kubectl exec -n summaries-prod $POD -- kill -USR2 1
    # wait for "snapshot written" log line, then copy out:
    kubectl cp summaries-prod/$(basename $POD):/tmp/<file>.heapsnapshot \
      ./lobu.heapsnapshot
    # open in Chrome DevTools → Memory → Load

Notes:

* writeHeapSnapshot is synchronous and blocks the event loop for several
  seconds proportional to heap size — only trigger manually when
  investigating, never wire to an automated source.
* SIGUSR2 is free on Node (SIGUSR1 is the one reserved for the
  inspector).
* Snapshot goes to /tmp which is the container's writable tmpfs.
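The `/tmp/lobu-<pid>-<ts>.heapsnapshot` naming from this commit could be built along the following lines. This is only a sketch: the helper name `snapshotPath` and the timestamp format are assumptions, since the commit message does not say how `<ts>` is rendered.

```typescript
// Hypothetical helper. The lobu-<pid>-<ts> scheme follows the commit message;
// the ISO-based timestamp format is an assumption.
function snapshotPath(pid: number, now: Date, dir: string = "/tmp"): string {
  // Replace ':' and '.' so the ISO timestamp is safe to use in a filename.
  const ts = now.toISOString().replace(/[:.]/g, "-");
  return `${dir}/lobu-${pid}-${ts}.heapsnapshot`;
}

// PID 1 matches the `kill -USR2 1` usage; the epoch date keeps the output fixed.
const examplePath = snapshotPath(1, new Date(0));
console.log(examplePath);
// prints "/tmp/lobu-1-1970-01-01T00-00-00-000Z.heapsnapshot"
```

A unique name per dump keeps earlier snapshots around for comparison, at the cost of unbounded growth in `/tmp` if the signal is sent repeatedly.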

coderabbitai Bot commented May 16, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 5ebb98b and 4614db3.

📒 Files selected for processing (1)
  • packages/server/src/server.ts

📝 Walkthrough

Adds an environment-gated SIGUSR2 handler (when ALLOW_HEAP_SNAPSHOT=1) that imports Node's v8 and writes a heap snapshot to /tmp/lobu.heapsnapshot, with logging and a guard to prevent concurrent snapshots.

Changes

Heap Snapshot Signal Handler

| Layer / File(s) | Summary |
|---|---|
| Import v8 and SIGUSR2 handler (`packages/server/src/server.ts`) | Adds ESM import of `node:v8` and registers a SIGUSR2 handler (gated by `ALLOW_HEAP_SNAPSHOT=1`) that writes a heap snapshot to the fixed path `/tmp/lobu.heapsnapshot`, prevents concurrent writes with an `inProgress` flag, and logs received/ignored/start/success/error events. |
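The behaviour summarised in the walkthrough could look roughly like the sketch below. It is not the merged code: the factory `createSnapshotHandler`, its injectable `write` and `log` parameters, and the exact log wording are assumptions made so the guard logic can be exercised without dumping a real heap. Only `v8.writeHeapSnapshot`, the `ALLOW_HEAP_SNAPSHOT=1` gate, the fixed path, and the in-progress flag come from the PR.

```typescript
import { writeHeapSnapshot } from "node:v8";

type Log = (msg: string) => void;

// Hypothetical factory: `write` and `log` are injectable for testing.
function createSnapshotHandler(
  path: string,
  write: (p: string) => string = writeHeapSnapshot,
  log: Log = console.log,
): () => void {
  let inProgress = false; // single-flight guard
  return () => {
    if (inProgress) {
      log("heap snapshot already in progress; ignoring SIGUSR2");
      return;
    }
    inProgress = true;
    log("heap snapshot starting");
    try {
      const file = write(path); // synchronous: blocks the event loop
      log(`snapshot written: ${file}`);
    } catch (err) {
      log(`heap snapshot failed: ${String(err)}`);
    } finally {
      inProgress = false; // always release the guard, even on failure
    }
  };
}

// Register only when the operator has explicitly opted in.
if (process.env.ALLOW_HEAP_SNAPSHOT === "1") {
  process.on("SIGUSR2", createSnapshotHandler("/tmp/lobu.heapsnapshot"));
}
```

Because the write is synchronous, a second signal normally cannot interleave with it; in this sketch the flag mainly protects against re-entrant calls and documents the intent. Reusing one fixed path means each dump overwrites the previous one.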

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A whistle in the server night,
SIGUSR2 brings memory light,
I press my paw, the snapshot's cast,
/tmp/lobu.heapsnapshot holds the past 🥕📸

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main change: adding a SIGUSR2 signal handler to write V8 heap snapshots, which is the primary focus of the PR. |
| Description check | ✅ Passed | The description includes all required sections: a comprehensive 'Why' explaining the memory growth issue, a 'What' detailing the implementation with usage instructions, and a partially completed test plan. Two test items are checked (typecheck and build-packages), though post-merge deployment testing is noted as pending. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.


Three findings from pi on PR #768; all addressed:

1. **Secrets in snapshots** — gate the SIGUSR2 handler behind
   ALLOW_HEAP_SNAPSHOT=1. Default off in prod. Operator must
   explicitly opt the pod in, capture, then unset and roll. Workers
   run under the same UID (Dockerfile sets no separate USER), so
   on-disk snapshots aren't isolated from a same-UID exec path.

2. **No rate limit / cleanup** — single-flight via an in-progress
   flag; subsequent SIGUSR2s during a write are dropped with a log
   line. Use a single rolling path /tmp/lobu.heapsnapshot so a
   stuck-on flag can't fill the writable layer.

3. **Probe interaction** — documented in the handler comment:
   trigger needs cgroup-limit headroom (writeHeapSnapshot allocates
   ~heap size while running) and blocks /health/ready (DB SELECT 1).
   Caller-side; nothing programmatic to fix without an
   already-multi-replica deploy.

buremba commented May 16, 2026

pi review — addressed

Three findings, all in 4614db3:

  1. Secrets in snapshots — handler now gated on `ALLOW_HEAP_SNAPSHOT=1`. Default off in prod. Operator opts the pod in, captures, copies out, then unsets and rolls. Workers run as the same UID (Dockerfile sets no separate `USER`), so on-disk snapshots aren't isolated from same-UID processes — gating is the only mitigation that holds.

  2. No rate limit / cleanup — single-flight (in-progress flag drops subsequent signals with a log line) + fixed rolling path `/tmp/lobu.heapsnapshot` (overwrite each time, no growth). Stuck-on flag can't fill the writable layer.

  3. Probe interaction — documented in the handler comment. `writeHeapSnapshot` needs ~heap-size extra memory while running and blocks `/health/ready` (the new DB `SELECT 1`). Caller-side concern; only programmatic fix would be temporarily marking unready, which requires multi-replica to avoid 503-on-the-service. Today's prod is 1-replica, so the operator playbook is: bump memory headroom / scale to 2 replicas first, then dump.
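The headroom concern in item 3 can be turned into a quick pre-flight check. The sketch below assumes cgroup v2 mounted at `/sys/fs/cgroup` and takes the "~heap size extra memory" estimate from the review note as a rough bound; the helper names `hasSnapshotHeadroom` and `readCgroupBytes` are hypothetical and not part of the PR.

```typescript
import { readFileSync } from "node:fs";

// Pure check: is there room for roughly one extra heap's worth of memory?
function hasSnapshotHeadroom(
  currentBytes: number,
  limitBytes: number,
  heapUsedBytes: number,
): boolean {
  return currentBytes + heapUsedBytes <= limitBytes;
}

// Assumption: cgroup v2 layout. Returns null when the file is unreadable
// or holds a non-numeric value such as "max" (no limit set).
function readCgroupBytes(file: string): number | null {
  try {
    const raw = readFileSync(`/sys/fs/cgroup/${file}`, "utf8").trim();
    const n = Number(raw);
    return Number.isFinite(n) ? n : null;
  } catch {
    return null;
  }
}

const current = readCgroupBytes("memory.current");
const limit = readCgroupBytes("memory.max");
if (current !== null && limit !== null) {
  const ok = hasSnapshotHeadroom(current, limit, process.memoryUsage().heapUsed);
  console.log(ok ? "headroom looks sufficient" : "bump the limit or scale out first");
}
```

A check like this only tells the operator whether to raise the limit first; it does nothing about the readiness probe being blocked while the snapshot is written.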


@codex-approver codex-approver Bot left a comment


Auto-approved: Codex left a 👍 reaction (no suggestions).

@buremba buremba merged commit e5c93a3 into main May 16, 2026
23 of 24 checks passed
@buremba buremba deleted the perf/heap-snapshot-on-sigusr2 branch May 16, 2026 17:36

2 participants