Skip to content

fix(shell): bound pipe read to MaxOutputChars before truncating (#1293)#1298

Merged
Aaronontheweb merged 6 commits into
netclaw-dev:devfrom
Aaronontheweb:fix/1293-shell-bounded-drain
Jun 3, 2026
Merged

fix(shell): bound pipe read to MaxOutputChars before truncating (#1293)#1298
Aaronontheweb merged 6 commits into
netclaw-dev:devfrom
Aaronontheweb:fix/1293-shell-bounded-drain

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Collaborator

Summary

  • shell_execute previously called ReadToEndAsync on both stdout and stderr, buffering the entire child output before any size limit was applied. A command emitting hundreds of MB (e.g. kubectl logs, a large curl response) would allocate the full payload — plus several derived copies for StringBuilder, ToString(), Redact(), and the base64 inflation for LOH purposes — before TruncateOutput ran. The 32k MaxOutputChars cap protected the LLM's context window, not the process.
  • Replaces ReadToEndAsync with BoundedDrainAsync: reads into a head + tail ring-buffer window capped at MaxOutputChars. Chars beyond the cap are discarded on read but the pipe continues to drain so a running child never deadlocks on a full pipe buffer (the existing deadlock-prevention comment and CancellationToken.None rationale are preserved).
  • Redaction and result assembly now operate on the already-bounded strings — not the raw full output.
  • The model sees a head+tail view (...\n...\n...) when truncated, so tail-of-log content (error summaries, final exit status) survives instead of being silently dropped by head-only truncation.

Test plan

  • All existing ShellToolTests pass (30 total, including the updated Output_truncation_applies assertion)
  • 6 new BoundedDrain_* unit tests covering: short output returned verbatim, empty input, output exactly at cap, long output with distinct head/tail markers, even head/tail split, disabled cap falls back to unbounded read
  • dotnet slopwatch analyze — 0 issues

Closes #1293.

netclawd builds on the Web SDK, which defaults ServerGarbageCollection to
true. The daemon is a single-tenant, low-concurrency process typically run
in a memory-limited container, where Server GC inflates peak RSS and is slow
to return memory to the OS — turning transient allocation spikes into cgroup
OOM kills. Switch to Workstation GC with background collection.

Verified System.GC.Server=false in the generated netclawd.runtimeconfig.json.
…law-dev#1293)

Previously ReadToEndAsync drained the entire child stdout/stderr into
memory before TruncateOutput applied the cap. A command emitting 300MB
of output (kubectl logs, curl of a large response) would allocate the
full payload plus multiple derived copies — all on the LOH — before
anything was discarded. The 32k cap protected the LLM context, not the
process.

Replace ReadToEndAsync with BoundedDrainAsync, which reads into a
head+tail ring-buffer window capped at MaxOutputChars. Chars beyond the
cap are discarded on read but the pipe continues to drain so a running
child never deadlocks on a full pipe buffer. Redaction and assembly now
operate on the already-bounded strings instead of the raw full output.

The model sees a head+tail view with a "..." separator when output is
truncated, so tail-of-log content (error summaries, final status) is
preserved instead of being silently dropped by head-only truncation.
@Aaronontheweb Aaronontheweb force-pushed the fix/1293-shell-bounded-drain branch from e838662 to a648cc9 Compare June 2, 2026 15:09
…flow + hygiene hardening (netclaw-dev#1293)

The initial bounded-drain fix capped what the model sees but not what the
drain allocates. A BenchmarkDotNet harness (benchmarks/Netclaw.Benchmarks)
isolating the drain loop surfaced that allocation still scaled with total
output (~1MB for 50M chars) plus two regressions on the common path.

Rework the drain so allocation is O(cap), not O(total output):

  - allocate the tail ring lazily, only once the head fills, so the common
    case (output under the cap) allocates head-only;
  - read through ReadAsync(Memory<char>), which returns a non-allocating
    ValueTask when the pipe already has data buffered, and pool the scratch
    read buffer so it never lands on the LOH;
  - write into the ring with block copies (at most two per chunk) instead
    of a per-char modulo loop.

Result: allocation is flat at ~188KB whether the child prints 1M or 50M
chars, zero Gen2/LOH collections, and the large-output path is ~26x faster.
The common small-output case is faster and lighter than the original
ReadToEnd path.

Hardening from review:

  - long totalChars and overflow-safe headCap so a multi-GB flood (which the
    bounded drain now lets the process keep producing) can't wrap the
    truncation check, and a near-int.MaxValue MaxOutputChars can't overflow
    headCap to a negative StringBuilder capacity;
  - Return the pooled buffer with clearArray:true so raw stdout/stderr
    (possibly secrets) is wiped from the shared pool;
  - reconstruct the result in place on the head StringBuilder, dropping a
    second builder and a head-sized copy on the truncation path;
  - correct the disabled-cap comment (0 is an explicit opt-out, not "previous
    behaviour") and the benchmark reader/csproj comments.

Add a chunked-reader test exercising the ring's wraparound + start-advance
path, which the StringReader-based tests (single 4KB read) did not cover.
@Aaronontheweb Aaronontheweb added tools Issues related to agent tools: file_read, web_search, shell_execute, image processing, etc. performance Memory, CPU, and I/O optimization issues. labels Jun 3, 2026
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) June 3, 2026 03:58
@Aaronontheweb

Copy link
Copy Markdown
Collaborator Author

Benchmark results

BenchmarkDotNet ([MemoryDiagnoser]) over the benchmarks/Netclaw.Benchmarks harness added in this PR. The benchmark isolates the drain loop — it feeds BoundedDrainAsync a synthetic in-memory TextReader (no process spawn, no real pipe I/O), so the numbers reflect only the cost of turning a stream into a bounded, redaction-ready string. Cap is the production default MaxOutputChars = 32_000. Baseline is the old path (ReadToEndAsync + TruncateOutput).

Run config: --job short (3 iterations) on the optimized drain. Allocation figures (MemoryDiagnoser) are deterministic; the time error bars are wide on the small inputs due to the short job, but the order-of-magnitude story holds. The post-review hardening (in-place reconstruction on the truncated path) only reduces allocation further than shown here.

TotalChars Implementation Mean Allocated Gen2
1,000 ReadToEnd + Truncate (old) 1,971 ns 10.53 KB
1,000 BoundedDrainAsync (new) 521 ns 10.23 KB
32,000 ReadToEnd + Truncate (old) 12,409 ns 127.00 KB
32,000 BoundedDrainAsync (new) 8,146 ns 156.76 KB
1,000,000 ReadToEnd + Truncate (old) 1,502,686 ns 4,005.85 KB 498
1,000,000 BoundedDrainAsync (new) 51,219 ns 188.41 KB
50,000,000 ReadToEnd + Truncate (old) 54,004,217 ns 197,197.51 KB 2,600
50,000,000 BoundedDrainAsync (new) 2,042,472 ns 188.41 KB

What the numbers show

  • Allocation is bounded by the cap, not the output. The new path allocates 188.41 KB at both 1M and 50M chars — identical to the byte. The old path scales with total output (193 MB at 50M chars). That O(cap) flatness is the property that stops the cgroup OOM-kill on a memory-limited daemon (shell_execute buffers entire child output into memory before truncating (LOH/OOM risk) #1293).
  • Zero Gen2 / LOH collections on the new path at every size. The old path forces 2,600 Gen2 collections per 1k ops at 50M chars — that GC pressure was the kill mechanism.
  • Faster across the board, despite doing more bookkeeping: ~26× at 50M chars (54 ms → 2 ms), ~29× at 1M.
  • The common small-output case is faster and lighter than the old ReadToEnd path (521 ns / 10.23 KB vs 1,971 ns / 10.53 KB at 1,000 chars).

The one spot slightly above baseline is exactly at the cap (32k: 156.76 KB vs 127 KB) — head and tail buffers both allocate right at the boundary where nothing is truncated. It's bounded (~30 KB), only at that precise size, and the path is still ~34% faster there.

Reproduce: dotnet run -c Release --project benchmarks/Netclaw.Benchmarks -- --filter '*ShellDrain*' --job short

Output_truncation_applies used `python -c` to generate output on Windows,
but the Windows CI runner resolves `python` to the Microsoft Store stub,
which writes its message to stderr. With the new per-stream truncation
markers the truncation then lands on stderr ("[stderr truncated"), so the
test's "[stdout truncated" assertion failed on windows-latest only.

Use `echo` with a long literal instead — a builtin on both bash and
cmd.exe that deterministically writes >50 chars to stdout, with no
interpreter dependency.
@Aaronontheweb Aaronontheweb merged commit cfa1550 into netclaw-dev:dev Jun 3, 2026
14 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/1293-shell-bounded-drain branch June 3, 2026 06:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Memory, CPU, and I/O optimization issues. tools Issues related to agent tools: file_read, web_search, shell_execute, image processing, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

shell_execute buffers entire child output into memory before truncating (LOH/OOM risk)

1 participant