fix(shell): bound pipe read to MaxOutputChars before truncating (#1293) #1298
netclawd builds on the Web SDK, which defaults ServerGarbageCollection to true. The daemon is a single-tenant, low-concurrency process typically run in a memory-limited container, where Server GC inflates peak RSS and is slow to return memory to the OS — turning transient allocation spikes into cgroup OOM kills. Switch to Workstation GC with background collection. Verified System.GC.Server=false in the generated netclawd.runtimeconfig.json.
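Besides reading netclawd.runtimeconfig.json, the effective GC flavour can also be checked from inside a running process. The following is a minimal sketch and not code from this PR; `DescribeGc` is a hypothetical helper name, and only the standard `System.Runtime.GCSettings` properties are used.

```csharp
using System;
using System.Runtime;

// Hypothetical helper (not part of netclawd): reports which GC the runtime selected.
static string DescribeGc()
{
    // IsServerGC is true only when the runtime is actually running Server GC.
    string flavour = GCSettings.IsServerGC ? "Server" : "Workstation";
    // LatencyMode is Interactive when concurrent (background) collection is
    // enabled and Batch when it is disabled.
    return $"GC: {flavour}, latency mode: {GCSettings.LatencyMode}";
}

Console.WriteLine(DescribeGc());
```

With Workstation GC and background collection enabled, this would be expected to report Workstation with the Interactive latency mode.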
…law-dev#1293)

Previously ReadToEndAsync drained the entire child stdout/stderr into memory before TruncateOutput applied the cap. A command emitting 300MB of output (kubectl logs, curl of a large response) would allocate the full payload plus multiple derived copies (all on the LOH) before anything was discarded. The 32k cap protected the LLM context, not the process.

Replace ReadToEndAsync with BoundedDrainAsync, which reads into a head+tail ring-buffer window capped at MaxOutputChars. Chars beyond the cap are discarded on read but the pipe continues to drain so a running child never deadlocks on a full pipe buffer. Redaction and assembly now operate on the already-bounded strings instead of the raw full output.

The model sees a head+tail view with a "..." separator when output is truncated, so tail-of-log content (error summaries, final status) is preserved instead of being silently dropped by head-only truncation.
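The head+tail idea described in this commit can be sketched roughly as follows. This is an illustrative, deliberately naive version and not the PR's `BoundedDrainAsync`: the name `DrainHeadTailAsync`, the half/half split, the `"\n...\n"` separator, and the 4096-char chunk size are all assumptions, and the disabled-cap case is left out.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Hypothetical sketch: keep the first and last chars of a stream, drop the middle.
static async Task<(string Text, bool Truncated)> DrainHeadTailAsync(TextReader reader, int maxChars)
{
    if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));

    int headCap = maxChars / 2;
    int tailCap = maxChars - headCap;   // at least 1 because maxChars >= 1
    var head = new StringBuilder();
    var tail = new char[tailCap];       // ring buffer holding the most recent chars
    int tailStart = 0, tailLen = 0;
    long total = 0;
    var chunk = new char[4096];

    int read;
    // Read until EOF even after the cap is reached, so the writer on the
    // other end of the pipe never blocks on a full pipe buffer.
    while ((read = await reader.ReadAsync(chunk, 0, chunk.Length)) > 0)
    {
        total += read;
        for (int i = 0; i < read; i++)
        {
            if (head.Length < headCap) { head.Append(chunk[i]); continue; }
            // Once the ring is full, this position is the oldest char.
            tail[(tailStart + tailLen) % tailCap] = chunk[i];
            if (tailLen < tailCap) tailLen++;
            else tailStart = (tailStart + 1) % tailCap;
        }
    }

    bool truncated = total > maxChars;
    if (truncated) head.Append("\n...\n");
    // Append the ring contents in logical (oldest to newest) order.
    for (int i = 0; i < tailLen; i++) head.Append(tail[(tailStart + i) % tailCap]);
    return (head.ToString(), truncated);
}

var demo = await DrainHeadTailAsync(new StringReader(new string('a', 10) + new string('z', 10)), 8);
Console.WriteLine(demo.Truncated);
Console.WriteLine(demo.Text);
```

Memory held by this sketch is bounded by `maxChars` plus the chunk buffer, regardless of how much the reader produces; only the counter `total` grows with the input.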
e838662 to a648cc9
…flow + hygiene hardening (netclaw-dev#1293)

The initial bounded-drain fix capped what the model sees but not what the drain allocates. A BenchmarkDotNet harness (benchmarks/Netclaw.Benchmarks) isolating the drain loop surfaced that allocation still scaled with total output (~1MB for 50M chars) plus two regressions on the common path.

Rework the drain so allocation is O(cap), not O(total output):
- allocate the tail ring lazily, only once the head fills, so the common case (output under the cap) allocates head-only;
- read through ReadAsync(Memory<char>), which returns a non-allocating ValueTask when the pipe already has data buffered, and pool the scratch read buffer so it never lands on the LOH;
- write into the ring with block copies (at most two per chunk) instead of a per-char modulo loop.

Result: allocation is flat at ~188KB whether the child prints 1M or 50M chars, zero Gen2/LOH collections, and the large-output path is ~26x faster. The common small-output case is faster and lighter than the original ReadToEnd path.

Hardening from review:
- long totalChars and overflow-safe headCap so a multi-GB flood (which the bounded drain now lets the process keep producing) can't wrap the truncation check, and a near-int.MaxValue MaxOutputChars can't overflow headCap to a negative StringBuilder capacity;
- return the pooled buffer with clearArray:true so raw stdout/stderr (possibly secrets) is wiped from the shared pool;
- reconstruct the result in place on the head StringBuilder, dropping a second builder and a head-sized copy on the truncation path;
- correct the disabled-cap comment (0 is an explicit opt-out, not "previous behaviour") and the benchmark reader/csproj comments.

Add a chunked-reader test exercising the ring's wraparound + start-advance path, which the StringReader-based tests (single 4KB read) did not cover.
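To make the ring-related points of this commit concrete, here is a rough sketch of a tail-only drain that writes each chunk with at most two block copies, reads through `ReadAsync(Memory<char>)`, and returns its pooled scratch buffer with `clearArray: true`. It is an assumption-laden illustration and not the PR's code: `RingWrite`, `RingToString`, and `DrainTailAsync` are hypothetical names, the head buffer and lazy tail allocation are omitted, and the 4096-char rent size is arbitrary.

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: append src to the ring using at most two block copies.
static void RingWrite(char[] ring, ref int start, ref int len, ReadOnlySpan<char> src)
{
    int cap = ring.Length;
    if (cap == 0 || src.IsEmpty) return;
    if (src.Length >= cap)
    {
        // Only the last cap chars of an oversized chunk can survive.
        src.Slice(src.Length - cap).CopyTo(ring);
        start = 0;
        len = cap;
        return;
    }
    int writePos = (int)(((long)start + len) % cap);
    int first = Math.Min(src.Length, cap - writePos);
    src.Slice(0, first).CopyTo(ring.AsSpan(writePos)); // up to the end of the array
    src.Slice(first).CopyTo(ring);                     // wrapped remainder, may be empty
    long overwritten = (long)len + src.Length - cap;   // old chars that were replaced
    if (overwritten > 0) { start = (int)((start + overwritten) % cap); len = cap; }
    else len += src.Length;
}

// Hypothetical helper: read the ring back in oldest-to-newest order.
static string RingToString(char[] ring, int start, int len)
{
    int firstLen = Math.Min(len, ring.Length - start);
    return string.Concat(
        new ReadOnlySpan<char>(ring, start, firstLen),
        new ReadOnlySpan<char>(ring, 0, len - firstLen));
}

static async Task<(string Tail, long Total)> DrainTailAsync(TextReader reader, int tailCap)
{
    var ring = new char[tailCap];
    int start = 0, len = 0;
    long total = 0; // long, so a multi-GB stream cannot wrap the counter
    char[] scratch = ArrayPool<char>.Shared.Rent(4096);
    try
    {
        int read;
        while ((read = await reader.ReadAsync(scratch.AsMemory())) > 0)
        {
            total += read;
            RingWrite(ring, ref start, ref len, new ReadOnlySpan<char>(scratch, 0, read));
        }
    }
    finally
    {
        // Wipe raw output (possibly secrets) before the array returns to the shared pool.
        ArrayPool<char>.Shared.Return(scratch, clearArray: true);
    }
    return (RingToString(ring, start, len), total);
}

var tailDemo = await DrainTailAsync(new StringReader("abcdefghij"), 4);
Console.WriteLine($"{tailDemo.Tail} ({tailDemo.Total} chars read)");
```

The wraparound and start-advance path is only reached when a chunk is shorter than the ring and crosses its end, which is consistent with the commit's note that a single large `StringReader` read does not exercise it.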
Benchmark results
What the numbers show
The one spot slightly above baseline is exactly at the cap (32k: 156.76 KB vs 127 KB): head and tail buffers both allocate right at the boundary where nothing is truncated. It's bounded (~30 KB), only at that precise size, and the path is still ~34% faster there.
Output_truncation_applies used `python -c` to generate output on Windows,
but the Windows CI runner resolves `python` to the Microsoft Store stub,
which writes its message to stderr. With the new per-stream truncation
markers the truncation then lands on stderr ("[stderr truncated"), so the
test's "[stdout truncated" assertion failed on windows-latest only.
Use `echo` with a long literal instead — a builtin on both bash and
cmd.exe that deterministically writes >50 chars to stdout, with no
interpreter dependency.
Summary
- `shell_execute` previously called `ReadToEndAsync` on both stdout and stderr, buffering the entire child output before any size limit was applied. A command emitting hundreds of MB (e.g. `kubectl logs`, a large `curl` response) would allocate the full payload, plus several derived copies for `StringBuilder`, `ToString()`, `Redact()`, and the base64 inflation for LOH purposes, before `TruncateOutput` ran. The 32k `MaxOutputChars` cap protected the LLM's context window, not the process.
- Replace `ReadToEndAsync` with `BoundedDrainAsync`: reads into a head + tail ring-buffer window capped at `MaxOutputChars`. Chars beyond the cap are discarded on read but the pipe continues to drain so a running child never deadlocks on a full pipe buffer (the existing deadlock-prevention comment and `CancellationToken.None` rationale are preserved).
- The model sees a head + tail view (`...\n...\n...`) when truncated, so tail-of-log content (error summaries, final exit status) survives instead of being silently dropped by head-only truncation.

Test plan
- `ShellToolTests` pass (30 total, including the updated `Output_truncation_applies` assertion)
- `BoundedDrain_*` unit tests covering: short output returned verbatim, empty input, output exactly at cap, long output with distinct head/tail markers, even head/tail split, disabled cap falls back to unbounded read
- `dotnet slopwatch analyze`: 0 issues

Closes #1293.