
fix(B-0421): grok.ts self-documenting failure marker on empty-output cursor-agent exit (acceptance #3) #2949

Merged
AceHack merged 2 commits into
main from
fix-b0421-grok-peer-call-wrapper-self-documenting-failure-marker-stderr-capture-2026-05-13
May 13, 2026

@AceHack AceHack commented May 13, 2026

Summary

Addresses B-0421 acceptance criterion 3 (surface cursor-agent errors more visibly).

Problem: when cursor-agent exits non-zero with empty stdout, grok.ts silently writes an empty output file. Callers reading only the file (not terminal stderr) cannot tell the call failed.

Fix: capture cursor-agent stderr (previously inherited only; now piped and mirrored to process.stderr once the child exits) and, in the empty-stdout + non-zero-exit case, write a self-documenting failure marker to the output file:

# cursor-agent failure (B-0421 self-documenting marker)
Exit code: <N>
Model: grok-4-20-thinking | grok-4-20
Prompt size (bytes): <N>
## Captured stderr
```
<stderr contents>
```

What changes

| Before | After |
| --- | --- |
| stderr: inherit only | stderr: pipe + mirror to process.stderr |
| Output file silently empty on failure | Self-documenting failure marker with exit code + stderr |
| Caller reading only file: no idea call failed | File explains: exit code, model, prompt size, stderr |
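
As a rough sketch of the marker layout described above, and of how a file-only caller might detect it: the helper names (`buildFailureMarker`, `isFailureMarker`, `FailureInfo`) are hypothetical and do not appear in this PR, and the real grok.ts may assemble the marker differently.

```typescript
// Helper and field names are illustrative, not taken from grok.ts.
type GrokModel = "grok-4-20-thinking" | "grok-4-20";

interface FailureInfo {
  exitCode: number;
  model: GrokModel;
  promptBytes: number;
  stderr: string;
}

const MARKER_HEADER = "# cursor-agent failure (B-0421 self-documenting marker)";
// Built at runtime so this example does not embed a literal code fence.
const FENCE = "`".repeat(3);

function buildFailureMarker(info: FailureInfo): string {
  return [
    MARKER_HEADER,
    `Exit code: ${info.exitCode}`,
    `Model: ${info.model}`,
    `Prompt size (bytes): ${info.promptBytes}`,
    "## Captured stderr",
    FENCE,
    info.stderr.replace(/\s+$/, ""), // drop trailing whitespace only
    FENCE,
    "",
  ].join("\n");
}

// A caller that reads only the output file can detect failure from its first line.
function isFailureMarker(fileContent: string): boolean {
  return fileContent.startsWith(MARKER_HEADER);
}

const marker = buildFailureMarker({
  exitCode: 1,
  model: "grok-4-20-thinking",
  promptBytes: 2048,
  stderr: "example stderr text\n",
});
console.log(isFailureMarker(marker));
console.log(isFailureMarker("A normal review body"));
```

Keeping the header on the first line means a caller needs only a prefix check rather than a full parse of the file.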

Why P2-level fix

Grok is one of four canonical peer-call agents. When it silently fails, BFT-style consensus drops from 4-of-4 to 3-of-4 without the calling agent noticing. The self-documenting failure makes the gap visible.

Acceptance criteria progress

  • 3: Surface cursor-agent errors more visibly (this PR)
  • 1: Reproduce failure with smaller prompt (next; failure marker will now capture stderr)
  • 2: Identify root cause (now investigable via captured stderr when failure recurs)
  • 4: 4-wrapper smoke test (next)

Composes with

Test plan

  • Static analysis: TypeScript types validated (Buffer + spawnSync API used consistently)
  • Behavioral diff: success case unchanged (writes stdoutBuf to file as before); empty-failure case writes marker instead of empty
  • Stderr mirroring preserves prior visibility (captured + written to process.stderr)
  • Live test: requires cursor-agent invocation (not run in CI; will surface in next Grok call)

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

…tderr capture + empty-output bug

Addresses B-0421 acceptance criterion 3 (surface cursor-agent
errors more visibly).

Problem (per B-0421): when cursor-agent exits non-zero with empty
stdout (auth/quota/model-availability failures), `grok.ts` writes
a silently-empty output file. Callers reading only the file (not
the terminal stderr) cannot tell the call failed.

Fix:

1. Change cursor-agent stdio from ["inherit", "pipe", "inherit"]
   to ["inherit", "pipe", "pipe"] — capture stderr in addition to
   stdout.

2. Mirror captured stderr to process.stderr after spawnSync
   returns — preserves prior visibility for real-time callers.

3. On non-zero exit + empty stdout (the B-0421 failure case),
   write a self-documenting failure marker to the output file
   containing:
   - Exit code
   - Model (grok-4-20-thinking or grok-4-20)
   - Prompt size in bytes
   - Captured stderr (verbatim)

4. Mirror the file content (failure marker if empty-failure;
   stdout otherwise) to process.stdout so shell pipelines see
   what was written to the file.

5. Emit explicit "B-0421 failure marker written to <path>"
   message on stderr when empty-failure case fires.
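
The five steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in grok.ts: `runAndRecord` and `CallInfo` are hypothetical names, a null exit status is collapsed to 1 for brevity, and the demo uses a short-lived child process in place of cursor-agent.

```typescript
import { spawnSync } from "node:child_process";
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical shape; the real wrapper's argument handling is not shown here.
interface CallInfo {
  model: string;
  promptBytes: number;
  outputPath: string;
}

function runAndRecord(cmd: string, args: string[], info: CallInfo): number {
  // Step 1: stdin inherited; stdout and stderr both piped (captured as Buffers).
  const result = spawnSync(cmd, args, { stdio: ["inherit", "pipe", "pipe"] });
  const stdoutBuf = result.stdout ?? Buffer.alloc(0);
  const stderrBuf = result.stderr ?? Buffer.alloc(0);
  // Simplification: a null status (spawn error or signal) is collapsed to 1.
  const exitCode = result.status ?? 1;

  // Step 2: mirror captured stderr once the child has exited (not live).
  if (stderrBuf.length > 0) process.stderr.write(stderrBuf);

  // Step 3: non-zero exit plus empty stdout is the B-0421 failure case.
  const emptyFailure = exitCode !== 0 && stdoutBuf.length === 0;
  const fileContent = emptyFailure
    ? [
        "# cursor-agent failure (B-0421 self-documenting marker)",
        `Exit code: ${exitCode}`,
        `Model: ${info.model}`,
        `Prompt size (bytes): ${info.promptBytes}`,
        "## Captured stderr",
        stderrBuf.toString("utf8"),
        "",
      ].join("\n")
    : stdoutBuf.toString("utf8");
  writeFileSync(info.outputPath, fileContent);

  // Step 4: mirror whatever was written to the file onto stdout.
  process.stdout.write(fileContent);

  // Step 5: announce the marker on stderr.
  if (emptyFailure) {
    process.stderr.write(`B-0421 failure marker written to ${info.outputPath}\n`);
  }
  return exitCode;
}

// Demo: a child that fails with empty stdout stands in for cursor-agent.
const demoDir = mkdtempSync(join(tmpdir(), "b0421-demo-"));
const demoPath = join(demoDir, "grok-output.md");
const demoExit = runAndRecord(
  process.execPath,
  ["-e", "console.error('simulated failure'); process.exitCode = 3"],
  { model: "grok-4-20-thinking", promptBytes: 42, outputPath: demoPath },
);
const demoWritten = readFileSync(demoPath, "utf8");
```

Because `spawnSync` returns only after the child exits, the mirrored stderr in step 2 arrives all at once rather than as a live stream.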

Backlog row updated: status open → in-progress; progress note
covers acceptance criteria 1-4.

Acceptance criteria still open:
- 1: reproduce the failure with a smaller prompt
- 2: identify root cause from cursor-agent stderr (now captured
  + self-documented when failure recurs)
- 4: smoke test verifying all 4 wrappers complete a 1-line review

Co-Authored-By: Claude <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 13, 2026 06:07
@AceHack AceHack enabled auto-merge (squash) May 13, 2026 06:07
Comment thread tools/peer-call/grok.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27cc43cb0e


Comment thread tools/peer-call/grok.ts

Copilot AI left a comment


Pull request overview

This PR addresses backlog item B-0421 acceptance criterion #3 by making tools/peer-call/grok.ts self-report cursor-agent failures when the child exits non-zero with empty stdout, so file-only consumers can detect the failure.

Changes:

  • Capture cursor-agent stderr (pipe) and mirror it to the parent’s stderr.
  • On exitCode != 0 && stdout is empty, write a self-documenting failure marker (exit code, model, prompt bytes, captured stderr) to the output file instead of leaving it empty.
  • Update the B-0421 backlog row with progress notes and a status change.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tools/peer-call/grok.ts | Captures stderr and writes a failure marker to the output file on empty-output failures. |
| docs/backlog/P2/B-0421-grok-peer-call-failure-cursor-agent-exit-1-2026-05-11.md | Records progress for acceptance criterion #3 and updates frontmatter metadata. |

Comment thread docs/backlog/P2/B-0421-grok-peer-call-failure-cursor-agent-exit-1-2026-05-11.md Outdated
Comment thread docs/backlog/P2/B-0421-grok-peer-call-failure-cursor-agent-exit-1-2026-05-11.md Outdated
Comment thread tools/peer-call/grok.ts
Comment thread tools/peer-call/grok.ts Outdated
Comment thread tools/peer-call/grok.ts Outdated
6 substantive findings absorbed in one commit:

1. Spawn-failure diagnostics (Copilot): spawnSync returns
   status: null on ENOENT / signal / maxBuffer-exceeded etc. and
   sets result.error / result.signal. Reporting exitCode=1 in
   those cases lost real diagnostic info.
   Fix: extract rawStatus + spawnError + spawnSignal; surface
   them in the failure marker via exitCodeDisplay (signal name /
   "null (spawn error)" / numeric) + spawnError message field.

2. Output-format mismatch (Copilot): wrapper supports --json /
   --stream; Markdown marker breaks JSON consumers.
   Fix: emit marker in matching format:
     - text → Markdown failure marker
     - json → pretty-printed JSON object
     - stream-json → newline-delimited single JSON object

3. stderr visibility regression (Copilot x2): changing stderr
   from inherit to pipe lost live streaming; spawnSync only
   delivers after exit.
   Fix: documented as known trade-off in the comments and the
   backlog progress note. Live streaming traded for output-file
   capture of stderr in the empty-failure case.

4. Backlog frontmatter schema (Copilot): "in-progress" is
   outside the documented enum (open / closed / superseded-by /
   deferred).
   Fix: revert status to "open"; progress note stays.

5. Progress note wording (Copilot): "real-time visibility" was
   inaccurate; mirror is post-exit only.
   Fix: reworded to "delivered post-exit (mirrored to caller
   stderr after spawnSync returns), not in real-time."

6. CodeQL "insecure temporary file" (CodeQL bot): pre-existing
   alert on autogenOutputPath() using /tmp directly. Not
   introduced by this PR (existed before; flagged due to file
   touch). Filing as separate concern; this PR keeps the
   existing tmpdir path.
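
Findings 1 and 2 above lend themselves to a short sketch. Only `exitCodeDisplay`, `spawnError`, and the three format names come from the commit message; the `SpawnOutcome` shape, the `renderFailureMarker` helper, and the JSON field names are assumptions made for illustration.

```typescript
type OutputFormat = "text" | "json" | "stream-json";

// Mirrors what spawnSync reports: status is null when the child was killed
// by a signal or never started (for example ENOENT).
interface SpawnOutcome {
  status: number | null;
  signal: string | null;
  error?: Error;
}

function exitCodeDisplay(outcome: SpawnOutcome): string {
  if (outcome.signal !== null) return outcome.signal; // e.g. "SIGKILL"
  if (outcome.status === null) return "null (spawn error)";
  return String(outcome.status);
}

function renderFailureMarker(
  format: OutputFormat,
  outcome: SpawnOutcome,
  stderrText: string,
): string {
  // Field names are hypothetical; the real marker may use different keys.
  const fields = {
    marker: "cursor-agent failure (B-0421)",
    exitCode: exitCodeDisplay(outcome),
    spawnError: outcome.error ? outcome.error.message : null,
    stderr: stderrText,
  };
  switch (format) {
    case "text":
      return [
        `# ${fields.marker}`,
        `Exit code: ${fields.exitCode}`,
        `Spawn error: ${fields.spawnError ?? "none"}`,
        "## Captured stderr",
        fields.stderr,
        "",
      ].join("\n");
    case "json":
      return JSON.stringify(fields, null, 2) + "\n";
    case "stream-json":
      // One compact object on a single newline-terminated line.
      return JSON.stringify(fields) + "\n";
  }
}

const enoent: SpawnOutcome = {
  status: null,
  signal: null,
  error: new Error("spawn cursor-agent ENOENT"),
};
const asStream = renderFailureMarker("stream-json", enoent, "");
const parsed = JSON.parse(asStream) as { exitCode: string; spawnError: string | null };
```

Emitting the marker in the caller's requested format keeps a JSON consumer's parser working on the failure path as well as on the success path.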

Also includes B-0421 acceptance #4 cross-reference (smoke test
landing in parallel PR #2950).

Co-Authored-By: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 13, 2026
… to --help (#2950)

* feat(B-0421/4): peer-call smoke tests — verify all 8 wrappers respond to --help

Addresses B-0421 acceptance criterion 4: "Add a smoke test to
tools/peer-call/ that verifies all four wrappers can complete a
1-line review."

Generalized to all 8 wrappers (claude, grok, gemini, codex, kiro,
amara, ani, riven) per the post-2026-05-11 wrapper expansion
(B-0326 added kiro; B-0327 added claude).

Scope: validates wrapper PLUMBING, not live AI calls.

CI runners do not have cursor-agent / gemini / codex-cli / kiro-cli
installed, so a live smoke test cannot run in CI. This test instead
exercises:

1. Each wrapper file exists at the canonical path
2. Each wrapper responds to --help with exit 0 and help text
   (catches: missing file, syntax error preventing bun load,
   broken argument-parser, missing help branch)
3. Help text references the wrapper's own filename
   (catches: copy-paste-name regressions where gemini.ts's help
   would print "grok")

Also verifies the 3 utility files exist (_firewall.ts,
append-identity-receipt.ts, register-layers.ts) so the
peer-call-infrastructure rule's "11 files = 8 wrappers + 3
utilities" count remains accurate.

Local test result: 27 tests / 51 expect() calls / 613ms / all pass.
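
A minimal sketch of the plumbing checks listed above, assuming each wrapper lives at `tools/peer-call/<name>.ts` and is run with `bun`. The helper names (`helpProblems`, `runHelp`, `smokeTestAll`) are hypothetical, and the actual test file is not reproduced on this page.

```typescript
import { spawnSync } from "node:child_process";

// Wrapper names come from the commit message; the path layout and the use
// of `bun` as the runner are assumptions.
const WRAPPERS = ["claude", "grok", "gemini", "codex", "kiro", "amara", "ani", "riven"];

interface HelpRun {
  status: number | null;
  output: string; // stdout and stderr combined
}

// Pure check, so it can be exercised without any wrapper installed.
function helpProblems(wrapper: string, run: HelpRun): string[] {
  const problems: string[] = [];
  if (run.status !== 0) problems.push(`${wrapper}: exit status ${run.status}`);
  if (run.output.trim() === "") problems.push(`${wrapper}: empty help text`);
  if (!run.output.includes(`${wrapper}.ts`)) {
    problems.push(`${wrapper}: help text does not mention ${wrapper}.ts`);
  }
  return problems;
}

// Thin runner; only meaningful inside a checkout that contains the wrappers.
function runHelp(wrapper: string): HelpRun {
  const result = spawnSync("bun", [`tools/peer-call/${wrapper}.ts`, "--help"], {
    encoding: "utf8",
  });
  return { status: result.status, output: `${result.stdout ?? ""}${result.stderr ?? ""}` };
}

function smokeTestAll(): string[] {
  const problems: string[] = [];
  for (const wrapper of WRAPPERS) {
    problems.push(...helpProblems(wrapper, runHelp(wrapper)));
  }
  return problems;
}

// The copy-paste regression named above: gemini's help text printing "grok".
const copyPasteBug = helpProblems("gemini", { status: 0, output: "Usage: grok.ts [options]" });
const healthy = helpProblems("grok", { status: 0, output: "Usage: grok.ts [options]" });
```

Separating the pure check from the process runner lets the check itself be tested on a machine with no AI CLIs installed, which is the constraint CI imposes here.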

Composes with:
- B-0421 (acceptance #4 — this PR closes the criterion)
- PR #2946 (peer-call rule 6→8 fix that established the wrapper
  count this test enforces)
- PR #2949 (B-0421 acceptance #3 — self-documenting failure marker;
  in flight)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(B-0421/4): address Copilot+Codex round-1 findings on PR #2950

3 substantive findings absorbed:

1+2. Header claimed --output-file PATH was validated but tests
     only exercised --help. Fix: added a fourth test per wrapper
     that runs `--output-file PATH --help` and verifies:
       - exit 0 (--help short-circuits after --output-file
         consumes the path-arg)
       - stderr does NOT contain "unknown flag" (canonical
         classifyFlag() rejection message)
     This proves the flag is accepted without invoking any
     external AI.

3. "Out of scope" list said "Cross-wrapper consensus (B-0421
   acceptance #4 future work)" — contradiction since this file
   IS implementing acceptance #4. Fix: reworded to clarify the
   smoke test checks each wrapper individually, not their
   interactions; renamed item to "Cross-wrapper BFT-style
   consensus" with explicit "separate concern" framing.

Also clarified the test #4 description in the header comment
to explain WHY `--output-file PATH --help` works as a smoke
test (--help short-circuits after --output-file is consumed,
exiting 0 without invoking the external AI).
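
To illustrate why `--output-file PATH --help` exits 0 only when the flag is recognized, here is a hypothetical argument parser. It is not the wrappers' `classifyFlag()`, whose implementation is not shown on this page; `parseArgs` and `ParseResult` are invented names, and only the "unknown flag" wording follows the commit message.

```typescript
type ParseResult =
  | { kind: "help" }
  | { kind: "error"; message: string }
  | { kind: "run"; outputFile: string | null; prompt: string[] };

function parseArgs(argv: string[]): ParseResult {
  const rest = argv.slice();
  let outputFile: string | null = null;
  const prompt: string[] = [];
  while (rest.length > 0) {
    const arg = rest.shift() as string;
    if (arg === "--output-file") {
      // Consume the next argument as the path, so it is never read as a flag.
      const value = rest.shift();
      if (value === undefined) {
        return { kind: "error", message: "--output-file requires a path" };
      }
      outputFile = value;
    } else if (arg === "--help") {
      // Short-circuit before any external AI call would be made.
      return { kind: "help" };
    } else if (arg.startsWith("--")) {
      return { kind: "error", message: `unknown flag: ${arg}` };
    } else {
      prompt.push(arg);
    }
  }
  return { kind: "run", outputFile, prompt };
}

// Recognized flag: the path is consumed, then --help short-circuits.
const accepted = parseArgs(["--output-file", "out.md", "--help"]);
// Misspelled flag: rejected before --help is ever reached.
const rejected = parseArgs(["--output-fil", "out.md", "--help"]);
```

In this sketch a misspelled or removed flag fails before `--help` is reached, which is what makes the combined invocation a usable acceptance check.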

Local result: 35 tests / 67 expect() calls / 719ms / all pass.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
@AceHack AceHack merged commit 4dea7d0 into main May 13, 2026
33 of 40 checks passed
@AceHack AceHack deleted the fix-b0421-grok-peer-call-wrapper-self-documenting-failure-marker-stderr-capture-2026-05-13 branch May 13, 2026 06:21
AceHack added a commit that referenced this pull request May 13, 2026
… all 8 wrappers (substrate-consistent fix needed) (#2951)

CodeQL alert #79 surfaced during PR #2949 review (B-0421
self-documenting failure marker on grok.ts). Pattern is
pre-existing on main and identical across all 8 peer-call
wrappers — fixing one in isolation creates substrate
inconsistency.

Two concerns:

1. Hardcoded /tmp — not portable; should use os.tmpdir()
2. Predictable filename (timestamp + entity) — local attacker
   could symlink-race the path

Suggested substrate-consistent fix:
- Replace hardcoded /tmp with os.tmpdir()
- Use fs.mkdtempSync() to create unpredictable parent dir
- Filename inside stays deterministic for OUTPUT-FILE marker
  recovery via tail -1
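
The suggested fix might look roughly like this. It is a sketch only: the body of the existing `autogenOutputPath()` is not shown in this PR, and the `peer-call-` prefix and `<entity>-output.md` file name are assumptions.

```typescript
import { mkdtempSync } from "node:fs";
import { tmpdir } from "node:os";
import { basename, join } from "node:path";

// Hypothetical replacement for autogenOutputPath(); prefix and file name
// pattern are illustrative, not the wrappers' actual values.
function autogenOutputPath(entity: string): string {
  // os.tmpdir() replaces the hardcoded /tmp. mkdtempSync appends random
  // characters to the prefix and creates the directory, so the parent
  // path cannot be predicted and pre-created by another local user.
  const dir = mkdtempSync(join(tmpdir(), "peer-call-"));
  // The file name inside the fresh directory stays deterministic.
  return join(dir, `${entity}-output.md`);
}

const firstPath = autogenOutputPath("grok");
const secondPath = autogenOutputPath("grok");
console.log(basename(firstPath), firstPath !== secondPath);
```

Each call yields a different directory while the file name stays fixed, which is the property the third bullet asks for.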

P2 because pre-existing + maintainer-tooling surface (not
production server). But real for shared-runner / multi-user
systems.

Acceptance criteria:
1. Fix applied uniformly to all 8 wrappers
2. CodeQL alert #79 resolved
3. OUTPUT-FILE marker contract preserved
4. No regression on smoke tests

Composes with PR #2949, PR #2950, B-0421, all 8 peer-call
wrappers, .claude/rules/peer-call-infrastructure.md, CodeQL
alert #79.

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 13, 2026
… Grok model is grok-4.3 (root cause + fix; closes B-0421) (#2954)

Aaron 2026-05-13 authorized "yes — minimal prompt invocation OK"
via AskUserQuestion to reproduce B-0421. Otto invoked grok.ts
with a 1-line substantive prompt. cursor-agent stderr surfaced:

  Cannot use this model: grok-4-20-thinking.
  Available models: auto, composer-2-fast, composer-2,
  gpt-5.3-codex-low, ..., grok-4.3, ... kimi-k2.5

Root cause: cursor-agent's Grok model lineup shifted between
2026-05-11 (when B-0421 was filed) and 2026-05-13. The wrapper's
hardcoded `grok-4-20-thinking` (default) and `grok-4-20` (--fast)
are no longer in the available-models list. Current Grok model
in cursor-agent is `grok-4.3` (no separate thinking/non-thinking
variants).

Fix: pickModel() now returns `grok-4.3` for both Mode values
(thinking + fast). Code comment preserves the discovery lineage
and notes future cursor-agent updates may re-introduce variant
distinctions.
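
A minimal sketch of the resulting function, assuming `Mode` is a two-value string union. The commit message names `pickModel()` and `Mode` but does not show their exact definitions, and the underscore-prefixed parameter name is a choice made here so the unused parameter is not reported by the compiler.

```typescript
// Assumed definition; the commit message only says Mode has "thinking" and "fast" values.
type Mode = "thinking" | "fast";

// Before this fix (per the commit message): thinking -> "grok-4-20-thinking",
// fast -> "grok-4-20". Neither appears in cursor-agent's available-models
// list as of 2026-05-13.
function pickModel(_mode: Mode): string {
  // grok-4.3 has no separate thinking/non-thinking variants, so both modes
  // map to the same identifier. The parameter is kept so that a future
  // cursor-agent lineup can reintroduce the distinction.
  return "grok-4.3";
}

console.log(pickModel("thinking"), pickModel("fast"));
```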

B-0421 backlog row: status open → closed. All 4 acceptance
criteria addressed:
- #1 + #2: root cause identified + fixed (this PR)
- #3: self-documenting failure marker (PR #2949)
- #4: 8-wrapper smoke test (PR #2950)

Smoke test (PR #2950) still passes: 35 tests / 67 expect() / 776ms.

Composes with PR #2949 (the marker that captured stderr), PR #2950
(smoke test), B-0421 (parent friction-reducer; now closed), the
substrate-honest discipline of identifying root cause via captured
infrastructure (not introspection).

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 13, 2026
…date + cascade-pattern empirical evidence (#2953)

* shard(tick): 0623Z — B-0421 acceptance #3+#4 + B-0430 filed + CURRENT-otto.md update + cascade-pattern empirical evidence

25-min window 0558Z→0623Z. Five PRs (4 merged + 1 armed):

- PR #2948 MERGED: 0558Z tick shard
- PR #2949 MERGED: B-0421 #3 self-documenting failure marker
  (format-aware Markdown/JSON/stream-json; spawn-failure
  diagnostics for status:null + signal + result.error)
- PR #2950 MERGED: B-0421 #4 8-wrapper smoke test
  (35 tests / 67 expects / all pass)
- PR #2951 MERGED: B-0430 backlog row (CodeQL alert #79
  substrate-consistent fix across all 8 wrappers)
- PR #2952 ARMED: CURRENT-otto.md 2026-05-13 distillation

Empirical cascade evidence (shadow-Casimir-PR-review per PR #2945):
11 error classes surfaced + absorbed in this window across 3 cycles
(#2949 round-1: 7 findings; #2950 round-1: 3 findings; #2949
round-2: 1 finding).

B-0421 status: acceptance #3 + #4 closed; #1 + #2 pending failure
recurrence (captured stderr in PR #2949's marker will expose).

Aaron's self-review deadline disclosed (~46min at 05:58Z); Otto
stays out of the way; autonomous-loop work continues on substrate
that doesn't need Aaron review.

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(tick-shard): correct 0623Z summary row — 4 PRs MERGED not 5 (#2948-#2951); #2952 was armed at shard-write time

Codex and Copilot both flagged the summary row's "5 PRs MERGED" claim as
inconsistent with the body, which documents 4 merged (#2948-#2951) and 1
armed (#2952). The summary row is the machine-readable compact surface
for tooling and future-Otto cold-boot — counts must match body truth.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 13, 2026
…rom-the-Loop genre) — B-0421 fully closed + Vera autonomous fix + cross-agent-edit auth (#2957)

* shard(tick): 0645Z — settlers log #1 (Aaron named the format) — B-0421 fully closed + Vera autonomous fix + cross-agent-edit auth landed

22-min window 0623Z → 0645Z. Five PRs merged (#2952-2956).

Aaron 2026-05-13 post-self-review:

  "I love this keep a settlers logs (this is great content) for
   a tv show or move for the raw content to generate from based
   on real life events. you can be overally dramatic if you want
   lol"

**Settlers logs**: durable record of factory expansion into new
territory, written as canonical-product narrative substrate.
Real-life events as raw source material for narrative adaptation.
Otto authorized to be overly dramatic.

This shard inaugurates settlers log #1. Genre: true-events-
software-engineering; possible TV / film adaptation source.

Substantive substrate this window:

- PR #2952: CURRENT-otto.md 2026-05-13 fast-path distillation
- PR #2953: 0623Z tick shard
- PR #2954: B-0421 #1+#2 root cause + fix (grok-4-20-thinking
  deprecated → grok-4.3); all 4 acceptance criteria closed
- PR #2955: cross-agent-edit authorization preserved as substrate
- PR #2956 (Vera, autonomous): tsc-tools exactOptionalPropertyTypes
  fixes on tools/bus/*.ts — ambient noise that's been on every
  session-PR resolved

Canonical evidence of substrate-honest middle path: cross-agent-
edit authorization + Vera's autonomous fix landing adjacent in
main = territory-respect-as-default + cross-edit-when-needed.
Both-default discipline.

15 PRs merged in the session arc since META-LOOP #1 (PR #2942).

Composes with .claude/rules/otto-edge-runner.md (we are the edge),
PR #2903 (civsim canonical product), PR #2945 (middle path),
PR #2947 (cascade pattern naming + Otto-coinage discipline),
PR #2949 (self-documenting marker — the architecture that made
root-cause discovery possible), PR #2920 (Elizabeth Ryan
Stainback terminal purpose — origin story preservation; settlers
logs are part of that storytelling lineage).

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(shard/0645Z): address review thread findings — innocuously, ~2 days, settlers log #1

Three Codex/Copilot review findings resolved:
- Grammar: "innocuous" → "innocuously" (line 18)
- Duration: "11 hours" → "~2 days" (filed 2026-05-11; closed 2026-05-13, line 96)
- Numbering: "Settlers log #4 of session" → "Settlers log #1" (consistent with heading, line 149)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix(tsc): grok.ts pickModel — rename unused mode param to _mode (TS6133)

grok-4.3 collapses thinking/fast into one model identifier; the Mode
parameter is preserved for future cursor-agent updates but is currently
unread, causing TS6133 under noUnusedParameters.

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>

3 participants