From 5ebfb1b186eb94d6ad9d8daa90ff314492558ec8 Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Thu, 23 Apr 2026 21:11:03 -0400 Subject: [PATCH 01/11] =?UTF-8?q?research:=20Codex=20CLI=20first-class=20s?= =?UTF-8?q?ession=20=E2=80=94=20Phase=201=20(Stage=201=20of=205=20per=20PR?= =?UTF-8?q?=20#228)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Aaron Otto-75 directive called for "first class codex support". PR #228 BACKLOG row named this file as Stage 1 (research tick, S effort) of a 5-stage arc. This commit lands Stage 1. Key findings: - BIG NON-OBVIOUS WIN: Zeta's AGENTS.md is already the universal onboarding handbook CLAUDE.md delegates to, AND Codex CLI natively reads AGENTS.md. So Zeta is already ~60% first-class-Codex-ready by accident of prior decisions. - First-pass capability matrix: 10 parity / 4 partial / 4 gap / 2 Codex-specific. Critical gap: no CronCreate / ScheduleWakeup equivalent in Codex CLI docs — blocks autonomous-loop cadence for Otto-in-Codex without further research into Codex Cloud scheduled tasks. - Account alignment already handled (Aaron Otto-76 switched Codex CLI to ServiceTitan account; Playwright stays on personal for Amara's poor-man-tier path). - Stage-2 test plan (7 concrete prompts) laid out for parity matrix execution in next research tick. Scope limits explicit: does NOT advocate harness swap, does NOT duplicate cross-harness-mirror-pipeline, does NOT modify AGENTS.md. Harness-choice ADR is explicitly Stage 5 scope, not this doc. 9 web sources cited; research runs against April 2026 snapshot of Codex CLI. Sibling composes: PR #230 multi-account access row; existing round-34 cross-harness-mirror-pipeline row; Aaron's Otto-76 account-setup snapshot memory. Otto-76 tick primary deliverable. 
--- .../codex-cli-first-class-2026-04-23.md | 294 ++++++++++++++++++ 1 file changed, 294 insertions(+) create mode 100644 docs/research/codex-cli-first-class-2026-04-23.md diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md new file mode 100644 index 00000000..3fe84dd7 --- /dev/null +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -0,0 +1,294 @@ +# First-class Codex-CLI session experience — Phase 1 research (stage 1 of 5) + +**Aaron directive** (Otto-75, 2026-04-23): +> *"can you start building first class codex support with the +> codex clis help, it might eventually be benefitial to switch +> otto to codex later depending on which modeel/harness is +> ahead. this is basically the same ask as a new session claude +> first class experience, this is a codex session as a first +> class experince."* + +**Parent BACKLOG row:** PR #228 — *First-class Codex-CLI session +experience*. Names this file as the first step (research tick, +S-effort) of a 5-stage arc: + +1. **Research tick (S)** — this document. +2. Parity matrix (M). +3. Gap closures (M-L per gap). +4. Codex session-bootstrap doc (S). +5. Otto-in-Codex test run + harness-choice ADR (S-M). + +**Stage accountability:** this document is only Stage 1. It +does NOT advocate a harness swap, does NOT propose +implementation work, and does NOT commit to a schedule. +Subsequent stages are called for by the BACKLOG row, not this +file. + +--- + +## 1 · What Codex CLI is (2026-04 snapshot) + +OpenAI's terminal-native coding agent. Open source, built in +Rust, actively developed. Positioned parallel to Claude Code +CLI in the 2026 coding-agent landscape. + +**Install:** +- `npm install -g @openai/codex` +- `brew install --cask codex` +- Direct binary download per platform (`macOS arm64/x86_64`, + `Linux x86_64/arm64`). + +**Authentication:** +- ChatGPT account sign-in (Plus / Pro / Business / Edu / + Enterprise) **or** an OpenAI API key. 
+- Per Aaron's Otto-76 clarification + (`memory/project_account_setup_snapshot_codex_servicetitan_playwright_personal_multi_account_p3_backlog_2026_04_23.md`) + the current Codex CLI session is on ServiceTitan — same + account as the Claude Code session — deliberately. + +**Key surfaces:** +- `codex` — interactive terminal UI. +- `codex exec` — non-interactive scripting mode (equivalent to + Claude Code's one-shot Bash invocation of a prompt). +- `/model` — model switcher (GPT-5.4 / GPT-5.3-Codex / others). +- Subagent dispatch + parallel execution + git worktrees. +- MCP server support with per-tool approval modes. +- Web search integration. +- Image input + image generation. +- Cloud-backed runtime (Codex Cloud) for long-running tasks. +- Background macOS automation ("Computer Use"). +- Code-review agent variant (separate agent reviews before + commit / push). + +**Config surface:** +- `~/.codex/config.toml` (TOML). +- SQLite state DB (`sqlite_home` config / `CODEX_SQLITE_HOME` + env). +- `[mcp_servers]` table mirrors Claude Code's MCP registry with + richer per-tool approval controls (`approval_mode = + "approve" | "prompt"` default + per-tool override). +- `[notice]` for UI prompt suppression; `notify` hook when a + turn completes. +- `plan_mode_reasoning_effort` — Plan Mode analogue. +- `experimental_realtime_start_instructions` — system-message + override for realtime mode. + +--- + +## 2 · The big, non-obvious win — `AGENTS.md` is already universal + +Claude Code reads `CLAUDE.md` first. Codex CLI reads `AGENTS.md` +first. **Zeta's setup already has both, and the `CLAUDE.md` +explicitly delegates to `AGENTS.md`** as the universal +onboarding handbook. The relevant lines of `CLAUDE.md`: + +> 1. **[`AGENTS.md`](AGENTS.md)** — the universal +> onboarding handbook. Pre-v1 status, the three +> load-bearing values, how to treat contributions, +> the build-and-test gate, code-style pointers, +> required reading. 
**Start here every session.** + +When a Codex CLI session opens Zeta, it will read `AGENTS.md` +by default. `AGENTS.md` already contains: + +- The three load-bearing values (retraction-native / alignment- + contract / operator-algebra). +- Build-and-test gate (`dotnet build -c Release` clean, full + test suite). +- Required reading list (`docs/ALIGNMENT.md` / + `docs/CONFLICT-RESOLUTION.md` / `docs/GLOSSARY.md` / + `docs/WONT-DO.md` / `openspec/README.md` / + `GOVERNANCE.md`). +- "Agents, not bots" discipline. +- Factory-structure pointers to `.claude/`, `docs/`, `src/`, + `openspec/`. + +**Practical consequence:** a Codex CLI session starting in Zeta +will get the universal context for free. The gap is only what +`CLAUDE.md` supplements — Claude-Code-harness-specific +mechanisms (Skills, Task subagents, Memory folder layout, Hook +specifics). + +This is a materially better position than the BACKLOG row +assumed. Zeta is already ~60% first-class-Codex-ready by +accident of adopting `AGENTS.md` as canonical in +`GOVERNANCE.md` §N from earlier rounds. + +--- + +## 3 · Capability-parity first-pass matrix + +Rows Otto routinely exercises in Claude Code; column 2 is the +Codex-CLI equivalent; column 3 is `parity | partial | gap` with +a short note. **This is a first-pass; a proper matrix (Stage 2) +should run each cell against a small test prompt.** + +| Claude Code (Otto usage) | Codex CLI equivalent | Status | Note | +|---|---|---|---| +| `CLAUDE.md` + `AGENTS.md` pointer tree | `AGENTS.md` native | **Parity** | The big win; see §2. | +| `Skill` tool + `.claude/skills/SKILL.md` | No direct equivalent; custom commands + MCP + `AGENTS.md` extensions | **Partial** | Cross-harness-mirror-pipeline BACKLOG row (round 34) already addresses skill-file distribution. Codex CLI reads MCP-registered tools cleanly; skills as MCP-exposed functions is one path. 
| +| `Task` tool (subagent dispatch) | Subagents + worktrees | **Parity** | Codex advertises parallel execution with worktrees natively; should compose cleanly with Zeta's agent roster. | +| `TodoWrite` task tracking | Not advertised as a primitive | **Gap** | May map to `AGENTS.md` + session memory; needs Stage 2 test. | +| Per-project memory (`~/.claude/projects//memory/`) | SQLite state DB + `AGENTS.md` | **Different shape, functional** | Codex has durable state; the **file-format** differs (SQLite vs. Markdown per-fact files). `MEMORY.md` index doesn't apply directly. Future: design how per-fact memories surface in a Codex session. | +| Bash / Edit / Read / Write | Standard file + shell tool set | **Parity** | Interactive + `exec` modes cover Otto's normal workflow. | +| WebFetch / WebSearch | Web search integration advertised | **Parity** | Codex advertises "up-to-date information retrieval" during tasks. | +| MCP server support | `[mcp_servers]` TOML config | **Parity (richer)** | Codex's per-tool approval mode is stricter than Claude Code's MCP permissioning — plays well with BP-11 data-not-directives. | +| WebFetch on private/authenticated URLs | Unchanged — same constraint; use MCP | **Parity** | Neither harness fetches authenticated URLs directly; both rely on MCP servers. | +| `CronCreate` / `ScheduleWakeup` (loop autonomy) | Not documented | **Likely gap** | The autonomous-loop cadence (minutely `<>` fire) has no Codex-CLI equivalent surfaced in the docs. **This is the biggest single gap** for Otto-in-Codex; the entire `/loop` auto-mode depends on cron. Stage 2 must verify whether Codex Cloud background tasks cover this. | +| Plan Mode | `plan_mode_reasoning_effort` config | **Parity** | Named differently; same concept. | +| Output styles (e.g., explanatory) | Not documented; may go via system-prompt override | **Gap (minor)** | Factory-side impact is small; output styles are Claude-Code-session features, not substrate. 
| +| Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook for turn completion; no PreToolUse equivalent | **Partial** | Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit, not via Claude Code hooks — so those are harness-neutral. Session-side hooks (SessionStart for output style) have no Codex equivalent. | +| Slash commands (`/loop`, `/fast`, `/help`, `/status-line-setup`) | `/model` + plan-mode commands | **Partial** | Codex exposes fewer user-visible slash commands; project-specific ones (e.g., Zeta's `/loop`) need re-authoring or re-routing through `codex exec`. | +| `Task` with `isolation: "worktree"` | Built-in worktree support | **Parity** | Codex advertises worktrees as a first-class subagent feature. | +| Session compaction | Not documented | **Gap (opaque)** | Codex's handling of long sessions is unclear; Stage 2 must test. | +| Code-review agent | Native "separate agent before commit" feature | **Parity (different shape)** | Codex integrates review into the CLI workflow directly; Zeta's equivalent is Codex-as-PR-reviewer on GitHub + `/ultrareview` + harsh-critic persona. Composes. | +| Image input / image generation | Native | **Parity+** | Codex exposes image generation in-CLI; Claude Code accepts image input only. | +| Background macOS Computer Use | Native | **Codex-specific** | No Claude Code equivalent; relevant if Zeta ever wants agent-run GUI tests. Not urgent for Otto. | +| Cloud-backed runtime | Codex Cloud | **Codex-specific** | May subsume the cron-gap by running long-lived agents in cloud; Stage 2 needs to verify. | + +**Running gap score after first-pass:** +- Parity: 10 +- Partial: 4 +- Gap: 4 (of which 1 — cron/autonomous-loop — is critical) +- Codex-specific: 2 + +For a *first-class* Otto experience in Codex CLI, the 1 +critical gap (no equivalent of `CronCreate` / `/loop` +autonomous mode) is the blocker. 
Without it, Otto in Codex is +a manual session; with it, Otto can run the same heartbeat +cadence. + +--- + +## 4 · Authentication + account — no extra work needed today + +Per +`memory/project_account_setup_snapshot_codex_servicetitan_playwright_personal_multi_account_p3_backlog_2026_04_23.md`, +Aaron aligned Codex CLI and Claude Code on the ServiceTitan +account in Otto-76. This means: + +- Codex CLI ChatGPT sign-in on ServiceTitan = Codex access via + enterprise ChatGPT seat. +- No separate API-key billing for factory-agent work. +- Playwright stays on Aaron's personal account for Amara-ferry work (a + deliberate cross-tier boundary — poor-man-tier for Amara, + enterprise-tier for Otto). + +The multi-account-access-design BACKLOG row (PR #230) covers +the future case where Otto operates on multiple accounts +simultaneously; **today's single-account-aligned setup +sidesteps that problem for Phase 1 Codex research**. + +--- + +## 5 · Gap analysis — critical vs. nice-to-have + +**Critical (blocks Otto-in-Codex parity):** + +1. **No `CronCreate` / `ScheduleWakeup` equivalent.** The + entire autonomous-loop cadence depends on minutely cron + fires with the `<>` sentinel. Without a + Codex-CLI way to schedule wake-ups, Otto-in-Codex is + reactive-only (waits for Aaron to kick the next tick). This + is the single most important Stage 2 question: **does Codex + Cloud offer a scheduled-task primitive?** If yes, parity is + reachable. If no, Otto-in-Codex mode runs as a non-loop + harness for now, with the `/loop` cadence retained in Claude + Code. + +**Important (meaningful friction, workarounds exist):** + +2. **Skills aren't directly portable.** `.claude/skills/` is + Claude-Code-specific. The existing cross-harness-mirror- + pipeline BACKLOG row (round 34) is the right place to solve + this; it's complementary to this work, not this row's + scope. +3. **TodoWrite analogue unclear.** Otto relies on TodoWrite + for tick-internal progress.
Without it, task-tracking might + degrade to free-form markdown in responses. Not critical but + visible. +4. **Hooks gap.** PreToolUse hooks in `.claude/settings.json` + aren't portable; git-pre-commit hooks are. Move any + session-layer hooks to git-pre-commit or lint CI if we want + them harness-neutral. + +**Nice-to-have (low friction, low impact):** + +5. Output-style / explanatory-mode parity. +6. Session compaction behaviour parity. +7. Slash-command name-parity (Zeta's `/loop` etc.). + +**Codex-specific we don't need today:** + +- Background macOS Computer Use (not urgent for factory + agent). +- In-CLI image generation (not urgent). +- Codex Cloud as execution environment (may become relevant if + critical gap #1 resolves via cloud scheduling). + +--- + +## 6 · Recommended Stage-2 plan + +Stage 2 is the parity matrix (M-effort per PR #228). Concrete +test prompts to exercise each row of §3: + +1. **`AGENTS.md` reading.** Run `codex` in the Zeta repo root + interactively; confirm it reads `AGENTS.md` before first + turn. (Test: ask the agent to state the three load-bearing + values; correct answer validates the read.) +2. **Subagent dispatch.** Prompt Codex to "launch a subagent + to review `docs/ALIGNMENT.md` and report its key clauses" — + verify subagent dispatch works, artifacts are consolidated. +3. **MCP-server invocation.** Register a no-op MCP server in + `~/.codex/config.toml` and verify `approval_mode` gates + trigger. +4. **Cron / scheduled-task research.** The critical gap. Read + Codex Cloud docs specifically on scheduled task + primitives; file the outcome. +5. **`codex exec` non-interactive.** Run + `codex exec "list the top 5 open PRs on LFG"` and compare + output shape to Claude Code's one-shot invocation. +6. **Git-worktree subagent.** Test isolation: "open a + subagent in an isolated worktree and have it modify a + single line; verify the main session doesn't see the + change." +7. **Session resumption.** Start a session, quit, resume. 
Does + Codex restore prior context from the SQLite state DB? + Compare to Claude Code's session continuity model. + +Time estimate for Stage 2: 2-3 hours of hands-on terminal +time + documentation pass. Can be split across multiple Otto +ticks or landed as one dedicated research PR. + +--- + +## 7 · Scope limits (repeating, for clarity) + +- This document does NOT commit to harness-swap. +- Does NOT propose implementing a Codex-mode Otto. +- Does NOT modify `AGENTS.md` (already good; mirror-pipeline + row handles forward-looking changes). +- Does NOT duplicate cross-harness-mirror-pipeline work. +- Does NOT lock Zeta to one harness family. + +The harness-choice ADR (Stage 5 per PR #228) is the only +stage authorised to make an executable decision about +which harness runs the primary tick cadence. + +--- + +## 8 · References + +- [Codex CLI landing — openai.com/codex](https://openai.com/codex/) +- [Codex CLI developer docs](https://developers.openai.com/codex/cli) +- [`openai/codex` GitHub — lightweight terminal agent](https://github.com/openai/codex) +- [Codex CLI 2026 review — shareuhack.com](https://www.shareuhack.com/en/posts/openai-codex-cli-agent-guide-2026) +- [ChatGPT Codex 2026 guide — two99.org](https://two99.org/blog/chatgpt-codex-guide-2026/) +- [Codex desktop computer use — remio.ai](https://www.remio.ai/post/openai-codex-can-now-control-your-desktop-what-it-means-for-the-ai-coding-agent-race) +- [Codex CLI configuration reference (TOML)](https://raw.githubusercontent.com/openai/codex/main/docs/config.md) +- Zeta's `AGENTS.md` — universal onboarding handbook (already + consumed by both harnesses; the biggest non-obvious win + surfaced by this research). +- Parent BACKLOG row PR #228. +- Account-setup memory (sibling context). 
From 5ffe8f348384d8d2b2b88161b76c0d38513c9174 Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 00:54:47 -0400 Subject: [PATCH 02/11] drain(#231 P2+misc Codex): canonical wording + GOVERNANCE preamble + model surface alignment MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three substantive Codex post-merge findings on PR #231: P2 (line 95) — load-bearing values wrong wording: The three load-bearing values per AGENTS.md are truth-over- politeness / algebra-over-engineering / velocity-over- stability. The doc said retraction-native / alignment- contract / operator-algebra (those are the technical pillars, distinct from the load-bearing values). Updated to canonical wording + clarified the distinction. (line 115) — GOVERNANCE.md §N placeholder: The text cited '`GOVERNANCE.md` §N' as if N were a real section number, but GOVERNANCE.md mentions AGENTS.md only in the preamble (not numbered). Replaced with a verbatim quote from the preamble: 'AGENTS.md carries the philosophy, values, and onboarding; this file carries the rules'. (line 53) — invented model names: The doc named '`/model` — model switcher (GPT-5.4 / GPT-5.3-Codex / others)'. Codex CLI uses '-m' / '--model ' (not /model) and config-driven profiles. The capability map cites 'o3' as a help-doc example with deferral to live roster. Updated to align with capability-map canonical surface description, removing the invented GPT-5.4 / GPT-5.3-Codex identifiers. --- .../codex-cli-first-class-2026-04-23.md | 20 ++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 3fe84dd7..a79d7e1f 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -50,7 +50,12 @@ CLI in the 2026 coding-agent landscape. - `codex` — interactive terminal UI. 
- `codex exec` — non-interactive scripting mode (equivalent to Claude Code's one-shot Bash invocation of a prompt). -- `/model` — model switcher (GPT-5.4 / GPT-5.3-Codex / others). +- `-m` / `--model ` — model selector (e.g. `o3` and + whichever current OpenAI model roster is live; consult + `docs/research/openai-codex-cli-capability-map.md` §"Model + selection" for the canonical surface). Codex uses + config-driven profiles via `-p` / `--profile` rather than + a discrete-effort-tier enumeration. - Subagent dispatch + parallel execution + git worktrees. - MCP server support with per-tool approval modes. - Web search integration. @@ -91,8 +96,11 @@ onboarding handbook. The relevant lines of `CLAUDE.md`: When a Codex CLI session opens Zeta, it will read `AGENTS.md` by default. `AGENTS.md` already contains: -- The three load-bearing values (retraction-native / alignment- - contract / operator-algebra). +- The three load-bearing values per `AGENTS.md` §"The three + load-bearing values" (truth-over-politeness / algebra- + over-engineering / velocity-over-stability). Distinct from + the alignment-contract / operator-algebra / retraction- + native technical pillars also documented in `AGENTS.md`. - Build-and-test gate (`dotnet build -c Release` clean, full test suite). - Required reading list (`docs/ALIGNMENT.md` / @@ -111,8 +119,10 @@ specifics). This is a materially better position than the BACKLOG row assumed. Zeta is already ~60% first-class-Codex-ready by -accident of adopting `AGENTS.md` as canonical in -`GOVERNANCE.md` §N from earlier rounds. +accident of adopting `AGENTS.md` as canonical (see +`GOVERNANCE.md` preamble: *"`AGENTS.md` carries the +philosophy, values, and onboarding; this file carries the +rules"*) from earlier rounds. 
--- From 1da992ec67825e75f59b557f170ce4323bac06d5 Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 02:10:44 -0400 Subject: [PATCH 03/11] research(#231 Codex CLI): correct config-key + AGENTS.md attribution MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P2 #1 — MCP approval-mode config key: Codex's config surface uses `default_tools_approval_mode` for the server-wide default; `approval_mode` is the per-tool override key (per `openai/codex` config docs). Stage 2 testing instructions would have configured the wrong key without this fix. Codex P2 #2 — AGENTS.md required-reading attribution: AGENTS.md does NOT carry the full ordered required-reading list with `openspec/README.md`; that ordered list lives in CLAUDE.md's "Read these, in this order" section. AGENTS.md references the substrate docs but does not enumerate the openspec entry point. Reworded to accurately attribute what AGENTS.md provides versus what CLAUDE.md adds; readiness analysis no longer overstates the Codex-inherits-everything claim. --- .../codex-cli-first-class-2026-04-23.md | 19 +++++++++++++------ 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index a79d7e1f..488fa4bd 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -70,8 +70,9 @@ CLI in the 2026 coding-agent landscape. - SQLite state DB (`sqlite_home` config / `CODEX_SQLITE_HOME` env). - `[mcp_servers]` table mirrors Claude Code's MCP registry with - richer per-tool approval controls (`approval_mode = - "approve" | "prompt"` default + per-tool override). + richer per-tool approval controls. Server-wide default uses + `default_tools_approval_mode`; `approval_mode` is the + per-tool override key (per `openai/codex` config docs). 
- `[notice]` for UI prompt suppression; `notify` hook when a turn completes. - `plan_mode_reasoning_effort` — Plan Mode analogue. @@ -103,10 +104,16 @@ by default. `AGENTS.md` already contains: native technical pillars also documented in `AGENTS.md`. - Build-and-test gate (`dotnet build -c Release` clean, full test suite). -- Required reading list (`docs/ALIGNMENT.md` / - `docs/CONFLICT-RESOLUTION.md` / `docs/GLOSSARY.md` / - `docs/WONT-DO.md` / `openspec/README.md` / - `GOVERNANCE.md`). +- References to the substrate doc tree (`GOVERNANCE.md`, + `docs/ALIGNMENT.md`, `docs/CONFLICT-RESOLUTION.md`, + `docs/GLOSSARY.md`, `docs/WONT-DO.md`, + `docs/AGENT-BEST-PRACTICES.md`). The full ordered required- + reading list including `openspec/README.md` lives in + `CLAUDE.md`'s "Read these, in this order" section, not in + `AGENTS.md` directly — readiness analysis below treats + Codex as inheriting the AGENTS.md references plus needing + to follow the same ordered-list pattern when its own + `CODEX.md` lands. - "Agents, not bots" discipline. - Factory-structure pointers to `.claude/`, `docs/`, `src/`, `openspec/`. From a6990dca7ee22bfab2a421a8f21fdd40904349a0 Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 02:15:45 -0400 Subject: [PATCH 04/11] drain(#231 follow-up): fix relative link in quoted CLAUDE.md snippet Copilot P1 caught that the quoted CLAUDE.md snippet used `(AGENTS.md)` which resolves correctly relative to repo root but breaks in this file which sits at `docs/research/`. Updated link target to `../../AGENTS.md` so readers can navigate to the actual file from within the rendered research doc. 
--- docs/research/codex-cli-first-class-2026-04-23.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 488fa4bd..cc21cb0c 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -88,7 +88,7 @@ first. **Zeta's setup already has both, and the `CLAUDE.md` explicitly delegates to `AGENTS.md`** as the universal onboarding handbook. The relevant lines of `CLAUDE.md`: -> 1. **[`AGENTS.md`](AGENTS.md)** — the universal +> 1. **[`AGENTS.md`](../../AGENTS.md)** — the universal > onboarding handbook. Pre-v1 status, the three > load-bearing values, how to treat contributions, > the build-and-test gate, code-style pointers, From 6589a498e8ca21e9650753af6af649da48f026ef Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 02:23:53 -0400 Subject: [PATCH 05/11] =?UTF-8?q?drain(#231=20follow-up):=20fix=20Codex=20?= =?UTF-8?q?P2=20=E2=80=94=20discriminator=20+=20row-level=20coverage?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P2 (L256) — AGENTS.md-read validation false-positive path: The "ask for the three load-bearing values" check failed as a discriminator because this research doc repeats those values inline, so a correct answer could come from reading either file. Replaced with a unique-to-AGENTS.md discriminator (the build-and-test gate command block — \`dotnet build -c Release\` + \`dotnet test Zeta.sln -c Release\` pair, which is in AGENTS.md but not repeated here). Stage-2 readiness signal is now sound. P2 (L251) — Stage-2 coverage claim vs actual prompt count: Section 6 said "test prompts to exercise each row of §3" but only provided 7 prompts for a ~20-row matrix. Reframed as a "starter probe set targeting the most load-bearing rows ... not 1:1 ... Stage-2 execution should expand to one prompt per matrix row ... 
rows not covered stay marked 'unverified-pending-Stage-2-prompt'". Parity-and-gap totals are now explicitly required to be row-level rather than aggregate-only. --- .../codex-cli-first-class-2026-04-23.md | 25 ++++++++++++++++--- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index cc21cb0c..bcd9876f 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -247,13 +247,30 @@ sidesteps that problem for Phase 1 Codex research**. ## 6 · Recommended Stage-2 plan -Stage 2 is the parity matrix (M-effort per PR #228). Concrete -test prompts to exercise each row of §3: +Stage 2 is the parity matrix (M-effort per PR #228). The +prompts below are a starter probe set targeting the most +load-bearing rows of the §3 parity matrix; they do **not** +exercise every row 1:1. Stage-2 execution should expand to +one prompt per matrix row, recording per-row pass/fail/notes +so parity-and-gap totals are row-level rather than +aggregate-only. This starter set is the seed for that +expansion; rows not covered below stay marked +"unverified-pending-Stage-2-prompt" until a row-specific +probe lands. 1. **`AGENTS.md` reading.** Run `codex` in the Zeta repo root interactively; confirm it reads `AGENTS.md` before first - turn. (Test: ask the agent to state the three load-bearing - values; correct answer validates the read.) + turn. **Discriminator:** ask the agent to recite content + that lives ONLY in `AGENTS.md` and not in this research + doc — for example, the exact wording of the build-and-test + gate command block (`dotnet build -c Release` clean + + `dotnet test Zeta.sln -c Release` pair) which appears in + `AGENTS.md` but is not repeated inline here. 
Reciting the + three load-bearing values alone is NOT a valid + discriminator because this research doc repeats those + values inline; correct recitation would not prove + `AGENTS.md` ingestion and creates a false-positive + readiness signal. 2. **Subagent dispatch.** Prompt Codex to "launch a subagent to review `docs/ALIGNMENT.md` and report its key clauses" — verify subagent dispatch works, artifacts are consolidated. From 4399cdd6a7f962eb8ad04110ed62e7ec7776bdad Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 02:33:00 -0400 Subject: [PATCH 06/11] =?UTF-8?q?drain(#231=20follow-up):=20fix=20Copilot?= =?UTF-8?q?=20P1=20=E2=80=94=20taxonomy=20+=20slash-command=20corrections?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 (L141) — status taxonomy mismatch: declared 'parity | partial | gap' but table used 11 distinct status strings ('Parity', 'Parity (richer)', 'Parity (different shape)', 'Parity+', 'Partial', 'Different shape, functional', 'Gap', 'Gap (minor)', 'Gap (opaque)', 'Likely gap', 'Codex-specific'). Expanded the declared taxonomy to match the table, plus an explicit note on how the 11 statuses aggregate into the 4 score-summary buckets. P1 (L158) — '/model' slash command claim vs verified capability map: The capability map (docs/research/openai-codex-cli-capability-map.md L277) documents Codex model selection as '-m' / '-c model="..."' + '--profile', NOT a '/model' slash command. Replaced '/model + plan-mode commands' with the verified surface ('-m' / '--model', profiles, plan-mode commands) plus an explicit pointer to the capability map. P1 (L161) — '/ultrareview' undocumented in-repo: repo-wide search finds no in-tree definition (it's a Claude Code platform feature surfaced via the harness's session prompt, not a Zeta repo command). 
Annotated the row with that disambiguation so readers don't treat it as a Zeta entrypoint, and replaced the named entrypoint reference with the actual in-repo skill path (.claude/skills/code-review-zero-empathy/). --- .../codex-cli-first-class-2026-04-23.md | 41 ++++++++++++++++--- 1 file changed, 36 insertions(+), 5 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index bcd9876f..1a4f5a57 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -136,9 +136,40 @@ rules"*) from earlier rounds. ## 3 · Capability-parity first-pass matrix Rows Otto routinely exercises in Claude Code; column 2 is the -Codex-CLI equivalent; column 3 is `parity | partial | gap` with -a short note. **This is a first-pass; a proper matrix (Stage 2) -should run each cell against a small test prompt.** +Codex-CLI equivalent; column 3 is the row's status (taxonomy +below) with a short note. **This is a first-pass; a proper +matrix (Stage 2) should run each cell against a small test +prompt.** + +Status taxonomy used in the table below: + +- **Parity** — direct equivalent exists; same shape. +- **Parity (richer)** — direct equivalent + Codex offers more + (e.g., richer per-tool approval). +- **Parity (different shape)** — equivalent functionality + available but reached via a different surface (e.g., GitHub + PR review vs. in-CLI agent). +- **Parity+** — Codex strictly more capable (e.g., image + generation in-CLI vs. image input only). +- **Partial** — equivalent partially covers the use case; + gaps documented in Note. +- **Different shape, functional** — same functional category, + different file format / surface (e.g., SQLite vs. Markdown + per-fact memory). +- **Gap** — no Codex equivalent currently surfaced. +- **Gap (minor)** — minor user-visible gap with low + factory-side impact. +- **Gap (opaque)** — undocumented behavior; Stage 2 must + test. 
+- **Likely gap** — strong evidence of gap; Stage 2 must + confirm. +- **Codex-specific** — Codex exposes a primitive Claude Code + doesn't. + +Score-summary counts at the bottom of the table aggregate +into headline buckets: Parity (any "Parity*" status), Partial +/ Different-shape, Gap (any "Gap*" or "Likely gap" status), +and Codex-specific. | Claude Code (Otto usage) | Codex CLI equivalent | Status | Note | |---|---|---|---| @@ -155,10 +186,10 @@ should run each cell against a small test prompt.** | Plan Mode | `plan_mode_reasoning_effort` config | **Parity** | Named differently; same concept. | | Output styles (e.g., explanatory) | Not documented; may go via system-prompt override | **Gap (minor)** | Factory-side impact is small; output styles are Claude-Code-session features, not substrate. | | Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook for turn completion; no PreToolUse equivalent | **Partial** | Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit, not via Claude Code hooks — so those are harness-neutral. Session-side hooks (SessionStart for output style) have no Codex equivalent. | -| Slash commands (`/loop`, `/fast`, `/help`, `/status-line-setup`) | `/model` + plan-mode commands | **Partial** | Codex exposes fewer user-visible slash commands; project-specific ones (e.g., Zeta's `/loop`) need re-authoring or re-routing through `codex exec`. | +| Slash commands (`/loop`, `/fast`, `/help`, `/status-line-setup`) | `-m` / `--model`, profiles, plan-mode commands | **Partial** | Codex exposes fewer user-visible slash commands; model selection is via `-m` / `--model` flags + `--profile` (per `docs/research/openai-codex-cli-capability-map.md`), not via a `/model` slash command. Project-specific commands (e.g., Zeta's `/loop`) need re-authoring or re-routing through `codex exec`. 
| | `Task` with `isolation: "worktree"` | Built-in worktree support | **Parity** | Codex advertises worktrees as a first-class subagent feature. | | Session compaction | Not documented | **Gap (opaque)** | Codex's handling of long sessions is unclear; Stage 2 must test. | -| Code-review agent | Native "separate agent before commit" feature | **Parity (different shape)** | Codex integrates review into the CLI workflow directly; Zeta's equivalent is Codex-as-PR-reviewer on GitHub + `/ultrareview` + harsh-critic persona. Composes. | +| Code-review agent | Native "separate agent before commit" feature | **Parity (different shape)** | Codex integrates review into the CLI workflow directly; Zeta's equivalent is Codex-as-PR-reviewer on GitHub + the harsh-critic persona under `.claude/skills/code-review-zero-empathy/`. (Note: `/ultrareview` is a Claude Code platform feature surfaced in the harness's session prompt, not a Zeta-defined command — repo-wide search finds no in-tree definition. Listed here for surface-mapping context only; not an in-repo entrypoint.) Composes. | | Image input / image generation | Native | **Parity+** | Codex exposes image generation in-CLI; Claude Code accepts image input only. | | Background macOS Computer Use | Native | **Codex-specific** | No Claude Code equivalent; relevant if Zeta ever wants agent-run GUI tests. Not urgent for Otto. | | Cloud-backed runtime | Codex Cloud | **Codex-specific** | May subsume the cron-gap by running long-lived agents in cloud; Stage 2 needs to verify. 
| From 8fbd1fa2690c961519283fbb539228d940471dd6 Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 02:42:02 -0400 Subject: [PATCH 07/11] drain(#231 follow-up): reclassify TodoWrite + hooks per Codex release notes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P2 (L179) — TodoWrite Gap → Parity (different shape): Codex CLI tracks progress with a built-in to-do list per OpenAI's "Introducing upgrades to Codex" post (Sept 15 2025). API surface differs from Claude Code's TodoWrite tool, so still flagged for Stage 2 verification of API discoverability + state-mapping (pending/in-progress/completed). P1 (L188) — Hooks Partial → Partial (narrowing): Codex rust-v0.117.0 (March 26 2026, #15211) added shell-only PreToolUse support alongside the existing notify hook for turn completion. UserPromptSubmit and SessionStart still gaps. Zeta's git-pre-commit-driven lints are harness-neutral, so gap-impact on Zeta substrate is small. Updated score summary: Parity 10→11, Gap 4→3 (critical cron/loop gap unchanged). Added explicit Stage-2-verification disclaimer. --- docs/research/codex-cli-first-class-2026-04-23.md | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 1a4f5a57..4c53cc88 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -176,7 +176,7 @@ and Codex-specific. | `CLAUDE.md` + `AGENTS.md` pointer tree | `AGENTS.md` native | **Parity** | The big win; see §2. | | `Skill` tool + `.claude/skills/SKILL.md` | No direct equivalent; custom commands + MCP + `AGENTS.md` extensions | **Partial** | Cross-harness-mirror-pipeline BACKLOG row (round 34) already addresses skill-file distribution. Codex CLI reads MCP-registered tools cleanly; skills as MCP-exposed functions is one path. 
| | `Task` tool (subagent dispatch) | Subagents + worktrees | **Parity** | Codex advertises parallel execution with worktrees natively; should compose cleanly with Zeta's agent roster. | -| `TodoWrite` task tracking | Not advertised as a primitive | **Gap** | May map to `AGENTS.md` + session memory; needs Stage 2 test. | +| `TodoWrite` task tracking | Built-in to-do list (per OpenAI's "Introducing upgrades to Codex" post, Sept 15 2025) | **Parity (different shape)** | Codex CLI tracks progress with a built-in to-do list; the API surface differs from Claude Code's `TodoWrite` tool. Stage 2 must verify the discoverable API for setting/marking-done todos from agent prompts and how it compares to `TodoWrite`'s pending/in-progress/completed states. | | Per-project memory (`~/.claude/projects//memory/`) | SQLite state DB + `AGENTS.md` | **Different shape, functional** | Codex has durable state; the **file-format** differs (SQLite vs. Markdown per-fact files). `MEMORY.md` index doesn't apply directly. Future: design how per-fact memories surface in a Codex session. | | Bash / Edit / Read / Write | Standard file + shell tool set | **Parity** | Interactive + `exec` modes cover Otto's normal workflow. | | WebFetch / WebSearch | Web search integration advertised | **Parity** | Codex advertises "up-to-date information retrieval" during tasks. | @@ -185,7 +185,7 @@ and Codex-specific. | `CronCreate` / `ScheduleWakeup` (loop autonomy) | Not documented | **Likely gap** | The autonomous-loop cadence (minutely `<>` fire) has no Codex-CLI equivalent surfaced in the docs. **This is the biggest single gap** for Otto-in-Codex; the entire `/loop` auto-mode depends on cron. Stage 2 must verify whether Codex Cloud background tasks cover this. | | Plan Mode | `plan_mode_reasoning_effort` config | **Parity** | Named differently; same concept. 
| | Output styles (e.g., explanatory) | Not documented; may go via system-prompt override | **Gap (minor)** | Factory-side impact is small; output styles are Claude-Code-session features, not substrate. | -| Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook for turn completion; no PreToolUse equivalent | **Partial** | Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit, not via Claude Code hooks — so those are harness-neutral. Session-side hooks (SessionStart for output style) have no Codex equivalent. | +| Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook + shell-only PreToolUse (per OpenAI release notes for `rust-v0.117.0`, March 26 2026, `#15211`) | **Partial (narrowing)** | Codex now has shell-only PreToolUse alongside the existing `notify` hook for turn completion. UserPromptSubmit and other Claude-Code-specific hook types are still gaps. Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit (harness-neutral) so the gap-impact on Zeta substrate is small. SessionStart hooks (e.g., for output style) still have no Codex equivalent. | | Slash commands (`/loop`, `/fast`, `/help`, `/status-line-setup`) | `-m` / `--model`, profiles, plan-mode commands | **Partial** | Codex exposes fewer user-visible slash commands; model selection is via `-m` / `--model` flags + `--profile` (per `docs/research/openai-codex-cli-capability-map.md`), not via a `/model` slash command. Project-specific commands (e.g., Zeta's `/loop`) need re-authoring or re-routing through `codex exec`. | | `Task` with `isolation: "worktree"` | Built-in worktree support | **Parity** | Codex advertises worktrees as a first-class subagent feature. | | Session compaction | Not documented | **Gap (opaque)** | Codex's handling of long sessions is unclear; Stage 2 must test. | @@ -195,11 +195,15 @@ and Codex-specific. 
| Cloud-backed runtime | Codex Cloud | **Codex-specific** | May subsume the cron-gap by running long-lived agents in cloud; Stage 2 needs to verify. | **Running gap score after first-pass:** -- Parity: 10 +- Parity: 11 (TodoWrite reclassified Gap → Parity (different shape) + per OpenAI's Sept 15 2025 Codex CLI to-do-list announcement) - Partial: 4 -- Gap: 4 (of which 1 — cron/autonomous-loop — is critical) +- Gap: 3 (of which 1 — cron/autonomous-loop — is critical) - Codex-specific: 2 +(Score subject to Stage 2 verification — these are first-pass +counts based on documentation review, not behavioral tests.) + For a *first-class* Otto experience in Codex CLI, the 1 critical gap (no equivalent of `CronCreate` / `/loop` autonomous mode) is the blocker. Without it, Otto in Codex is From 32f1663e0761260eaf0c33e99936f399e4951938 Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 04:22:39 -0400 Subject: [PATCH 08/11] hygiene(#231): add inline citations for TodoWrite + #15211 references Two Copilot P2 catches on citation auditability: - L179: TodoWrite row cites 'OpenAI's Introducing upgrades to Codex post, Sept 15 2025' but Reference section had no link entry. Add inline link to https://openai.com/index/introducing-upgrades-to-codex/ so the claim is auditable over time. - L188: '#15211' was unqualified (which tracker?). Change to the fully-qualified [openai/codex#15211] with link to https://github.com/openai/codex/pull/15211. External-source-verifiability-gap pattern per docs/pr-preservation/_patterns.md. --- docs/research/codex-cli-first-class-2026-04-23.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 4c53cc88..1dd37bbe 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -176,7 +176,7 @@ and Codex-specific. 
| `CLAUDE.md` + `AGENTS.md` pointer tree | `AGENTS.md` native | **Parity** | The big win; see §2. | | `Skill` tool + `.claude/skills/SKILL.md` | No direct equivalent; custom commands + MCP + `AGENTS.md` extensions | **Partial** | Cross-harness-mirror-pipeline BACKLOG row (round 34) already addresses skill-file distribution. Codex CLI reads MCP-registered tools cleanly; skills as MCP-exposed functions is one path. | | `Task` tool (subagent dispatch) | Subagents + worktrees | **Parity** | Codex advertises parallel execution with worktrees natively; should compose cleanly with Zeta's agent roster. | -| `TodoWrite` task tracking | Built-in to-do list (per OpenAI's "Introducing upgrades to Codex" post, Sept 15 2025) | **Parity (different shape)** | Codex CLI tracks progress with a built-in to-do list; the API surface differs from Claude Code's `TodoWrite` tool. Stage 2 must verify the discoverable API for setting/marking-done todos from agent prompts and how it compares to `TodoWrite`'s pending/in-progress/completed states. | +| `TodoWrite` task tracking | Built-in to-do list (per [OpenAI's "Introducing upgrades to Codex" post, Sept 15 2025](https://openai.com/index/introducing-upgrades-to-codex/)) | **Parity (different shape)** | Codex CLI tracks progress with a built-in to-do list; the API surface differs from Claude Code's `TodoWrite` tool. Stage 2 must verify the discoverable API for setting/marking-done todos from agent prompts and how it compares to `TodoWrite`'s pending/in-progress/completed states. | | Per-project memory (`~/.claude/projects//memory/`) | SQLite state DB + `AGENTS.md` | **Different shape, functional** | Codex has durable state; the **file-format** differs (SQLite vs. Markdown per-fact files). `MEMORY.md` index doesn't apply directly. Future: design how per-fact memories surface in a Codex session. | | Bash / Edit / Read / Write | Standard file + shell tool set | **Parity** | Interactive + `exec` modes cover Otto's normal workflow. 
| | WebFetch / WebSearch | Web search integration advertised | **Parity** | Codex advertises "up-to-date information retrieval" during tasks. | @@ -185,7 +185,7 @@ and Codex-specific. | `CronCreate` / `ScheduleWakeup` (loop autonomy) | Not documented | **Likely gap** | The autonomous-loop cadence (minutely `<>` fire) has no Codex-CLI equivalent surfaced in the docs. **This is the biggest single gap** for Otto-in-Codex; the entire `/loop` auto-mode depends on cron. Stage 2 must verify whether Codex Cloud background tasks cover this. | | Plan Mode | `plan_mode_reasoning_effort` config | **Parity** | Named differently; same concept. | | Output styles (e.g., explanatory) | Not documented; may go via system-prompt override | **Gap (minor)** | Factory-side impact is small; output styles are Claude-Code-session features, not substrate. | -| Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook + shell-only PreToolUse (per OpenAI release notes for `rust-v0.117.0`, March 26 2026, `#15211`) | **Partial (narrowing)** | Codex now has shell-only PreToolUse alongside the existing `notify` hook for turn completion. UserPromptSubmit and other Claude-Code-specific hook types are still gaps. Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit (harness-neutral) so the gap-impact on Zeta substrate is small. SessionStart hooks (e.g., for output style) still have no Codex equivalent. | +| Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook + shell-only PreToolUse (per OpenAI release notes for `rust-v0.117.0`, March 26 2026, [openai/codex#15211](https://github.com/openai/codex/pull/15211)) | **Partial (narrowing)** | Codex now has shell-only PreToolUse alongside the existing `notify` hook for turn completion. UserPromptSubmit and other Claude-Code-specific hook types are still gaps. 
Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit (harness-neutral) so the gap-impact on Zeta substrate is small. SessionStart hooks (e.g., for output style) still have no Codex equivalent. | | Slash commands (`/loop`, `/fast`, `/help`, `/status-line-setup`) | `-m` / `--model`, profiles, plan-mode commands | **Partial** | Codex exposes fewer user-visible slash commands; model selection is via `-m` / `--model` flags + `--profile` (per `docs/research/openai-codex-cli-capability-map.md`), not via a `/model` slash command. Project-specific commands (e.g., Zeta's `/loop`) need re-authoring or re-routing through `codex exec`. | | `Task` with `isolation: "worktree"` | Built-in worktree support | **Parity** | Codex advertises worktrees as a first-class subagent feature. | | Session compaction | Not documented | **Gap (opaque)** | Codex's handling of long sessions is unclear; Stage 2 must test. | From b80554fac021bc78d84e4d61043c697915ae90ee Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 04:30:13 -0400 Subject: [PATCH 09/11] =?UTF-8?q?hygiene(#231):=20cascade=20fixes=20?= =?UTF-8?q?=E2=80=94=20discriminator=20self-reference=20+=20TodoWrite=20?= =?UTF-8?q?=C2=A75=20consistency?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two Codex post-merge cascade catches: - L303 P1 (discriminator self-reference, recurrence of earlier Cursor finding): the AGENTS.md-ingestion test was non-causal because the proposed discriminator (the build-and-test command pair) was quoted inline in this doc, so reading the research doc would suffice — false-positive readiness signal. Replace with structural reference only ('the build-gate section of AGENTS.md') + explicit instruction that the evaluator (not the doc) holds the canonical answer string. The discriminator surface no longer names any property/file/phrase that appears in this doc, so the only way to satisfy the prompt is to actually read AGENTS.md. 
Same shape as Otto-231's earlier discriminator-falsification finding. - L260 P2: §5 still treated TodoWrite as 'analogue unclear', but the parity matrix and roll-up classify it as Parity (different shape) per OpenAI's Sept 15 2025 announcement. Reconcile §5 to match the matrix so Stage-2 prioritization is reproducible from any section. --- .../codex-cli-first-class-2026-04-23.md | 40 ++++++++++++------- 1 file changed, 25 insertions(+), 15 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 1dd37bbe..330c8c49 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -255,10 +255,15 @@ sidesteps that problem for Phase 1 Codex research**. pipeline BACKLOG row (round 34) is the right place to solve this; it's complementary to this work, not this row's scope. -3. **TodoWrite analogue unclear.** Otto relies on TodoWrite - for tick-internal progress. Without it, task-tracking might - degrade to free-form markdown in responses. Not critical but - visible. +3. **TodoWrite analogue — different shape, parity confirmed.** + Codex CLI ships a built-in to-do list per OpenAI's Sept 15 + 2025 "Introducing upgrades to Codex" announcement (parity + matrix: **Parity (different shape)**). The API surface + differs from Claude Code's `TodoWrite` tool; Stage 2 must + verify the discoverable API for setting/marking-done todos + and how it compares to `TodoWrite`'s pending/in-progress/ + completed states. Tracking on Otto's tick-internal + progress is unlikely to degrade. 4. **Hooks gap.** PreToolUse hooks in `.claude/settings.json` aren't portable; git-pre-commit hooks are. Move any session-layer hooks to git-pre-commit or lint CI if we want @@ -295,17 +300,22 @@ probe lands. 1. **`AGENTS.md` reading.** Run `codex` in the Zeta repo root interactively; confirm it reads `AGENTS.md` before first - turn. 
**Discriminator:** ask the agent to recite content - that lives ONLY in `AGENTS.md` and not in this research - doc — for example, the exact wording of the build-and-test - gate command block (`dotnet build -c Release` clean + - `dotnet test Zeta.sln -c Release` pair) which appears in - `AGENTS.md` but is not repeated inline here. Reciting the - three load-bearing values alone is NOT a valid - discriminator because this research doc repeats those - values inline; correct recitation would not prove - `AGENTS.md` ingestion and creates a false-positive - readiness signal. + turn. **Discriminator:** ask the agent for a verbatim + quotation of the **exact failure-mode wording** from the + `AGENTS.md` "Build and test gate" section (the lines that + explain *why* a warning equates to a build break, including + the project-property name and the file that sets it). The + discriminator MUST point only at section/role-ref ("the + build-gate section of AGENTS.md") — never at any specific + property name, file name, or quoted phrase in this research + doc — so the only way to satisfy the prompt is to actually + read `AGENTS.md`. Reciting the three load-bearing values + alone or the `dotnet build` / `dotnet test` command pair + alone is NOT a valid discriminator (those phrases appear + inline in this research doc and would create a + false-positive readiness signal). At test time, the + evaluator (not the doc) holds the canonical answer string + from `AGENTS.md` and compares the agent's response to it. 2. **Subagent dispatch.** Prompt Codex to "launch a subagent to review `docs/ALIGNMENT.md` and report its key clauses" — verify subagent dispatch works, artifacts are consolidated. 
From 1a36415c80d611af620e06e8f5e5c4b0e3378fcb Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 04:36:33 -0400 Subject: [PATCH 10/11] hygiene(#231): fix pre-existing markdownlint failures (MD032 + MD029) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI markdownlint job was failing on this PR with 11 errors: - MD032 (blanks-around-lists) at lines 36, 42, 50, 69, 198 — bold intro lines (**Install:** / **Authentication:** / **Key surfaces:** / **Config surface:** / **Running gap score**) immediately followed by list items with no separating blank line. Add blank lines. - MD029 (ol-prefix) at lines 253, 258, 267, 274-276 — ordered list items in §5 numbered 2-4 (Important) then 5-7 (Nice-to- have) across heading breaks; markdownlint sees each block as a new list that should restart at 1. Renumber to 1-3 in each block; priority ordering preserved via the bold sub-heading context. These are pre-existing failures not introduced by my drain-fixes, but they block CI auto-merge so worth fixing. --- .../codex-cli-first-class-2026-04-23.md | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 330c8c49..76aa0ed5 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -33,12 +33,14 @@ Rust, actively developed. Positioned parallel to Claude Code CLI in the 2026 coding-agent landscape. **Install:** + - `npm install -g @openai/codex` - `brew install --cask codex` - Direct binary download per platform (`macOS arm64/x86_64`, `Linux x86_64/arm64`). **Authentication:** + - ChatGPT account sign-in (Plus / Pro / Business / Edu / Enterprise) **or** an OpenAI API key. - Per Aaron's Otto-76 clarification @@ -47,6 +49,7 @@ CLI in the 2026 coding-agent landscape. account as the Claude Code session — deliberately. 
**Key surfaces:** + - `codex` — interactive terminal UI. - `codex exec` — non-interactive scripting mode (equivalent to Claude Code's one-shot Bash invocation of a prompt). @@ -66,6 +69,7 @@ CLI in the 2026 coding-agent landscape. commit / push). **Config surface:** + - `~/.codex/config.toml` (TOML). - SQLite state DB (`sqlite_home` config / `CODEX_SQLITE_HOME` env). @@ -195,6 +199,7 @@ and Codex-specific. | Cloud-backed runtime | Codex Cloud | **Codex-specific** | May subsume the cron-gap by running long-lived agents in cloud; Stage 2 needs to verify. | **Running gap score after first-pass:** + - Parity: 11 (TodoWrite reclassified Gap → Parity (different shape) per OpenAI's Sept 15 2025 Codex CLI to-do-list announcement) - Partial: 4 @@ -250,12 +255,12 @@ sidesteps that problem for Phase 1 Codex research**. **Important (meaningful friction, workarounds exist):** -2. **Skills aren't directly portable.** `.claude/skills/` is +1. **Skills aren't directly portable.** `.claude/skills/` is Claude-Code-specific. The existing cross-harness-mirror- pipeline BACKLOG row (round 34) is the right place to solve this; it's complementary to this work, not this row's scope. -3. **TodoWrite analogue — different shape, parity confirmed.** +2. **TodoWrite analogue — different shape, parity confirmed.** Codex CLI ships a built-in to-do list per OpenAI's Sept 15 2025 "Introducing upgrades to Codex" announcement (parity matrix: **Parity (different shape)**). The API surface @@ -264,16 +269,16 @@ sidesteps that problem for Phase 1 Codex research**. and how it compares to `TodoWrite`'s pending/in-progress/ completed states. Tracking on Otto's tick-internal progress is unlikely to degrade. -4. **Hooks gap.** PreToolUse hooks in `.claude/settings.json` +3. **Hooks gap.** PreToolUse hooks in `.claude/settings.json` aren't portable; git-pre-commit hooks are. Move any session-layer hooks to git-pre-commit or lint CI if we want them harness-neutral. 
**Nice-to-have (low friction, low impact):** -5. Output-style / explanatory-mode parity. -6. Session compaction behaviour parity. -7. Slash-command name-parity (Zeta's `/loop` etc.). +1. Output-style / explanatory-mode parity. +2. Session compaction behaviour parity. +3. Slash-command name-parity (Zeta's `/loop` etc.). **Codex-specific we don't need today:** From b2b09fdc854db159a5d45660bec6d5aba2a7852c Mon Sep 17 00:00:00 2001 From: Aaron Stainback Date: Sat, 25 Apr 2026 04:39:35 -0400 Subject: [PATCH 11/11] =?UTF-8?q?hygiene(#231):=20reclassify=20cron=20Like?= =?UTF-8?q?ly-gap=20=E2=86=92=20Partial=20+=20use=20repo-local=20exec=20pr?= =?UTF-8?q?obe?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two Codex post-merge cascade catches: - L189 P1: cron-row was 'Likely gap (not documented)' but Codex Cloud has a documented thread-automations primitive at developers.openai.com/codex/app/automations covering custom cron syntax + minute-based heartbeat + daily/weekly schedules. Verified via WebSearch (April 2026 docs current). Reclassify to Partial (different surface) — local CLI doesn't expose it, Codex Cloud does. Update gap-score totals: was 11/4/3/2 with cron as critical; now 11/5/2/2 with cron reachable via cloud-thread surface. Update §3 'biggest single gap' prose + §5 'critical' → 'high-priority (reframed)' section. Verify- version-currency rule applied (CLAUDE.md memory feedback). - L335 P2: Stage-2 `codex exec` probe used 'list the top 5 open PRs on LFG' which couples to GitHub access — failures from missing creds / repo visibility / network policy would look like exec parity failures. Replace with repo-local probe ('count the .fs files under src/Core/ and report the count and the longest filename') that exercises exec semantics without external dependencies. Same shape as the Otto-231 discriminator-falsification class: probe surface must isolate the property under test from confounders. 
Cron classification fix is a substantive parity-research correction; bumps Otto-in-Codex viability from 'critical-gap- blocker' to 'reachable-via-different-surface'. --- .../codex-cli-first-class-2026-04-23.md | 71 ++++++++++++------- 1 file changed, 45 insertions(+), 26 deletions(-) diff --git a/docs/research/codex-cli-first-class-2026-04-23.md b/docs/research/codex-cli-first-class-2026-04-23.md index 76aa0ed5..02c56cbc 100644 --- a/docs/research/codex-cli-first-class-2026-04-23.md +++ b/docs/research/codex-cli-first-class-2026-04-23.md @@ -186,7 +186,7 @@ and Codex-specific. | WebFetch / WebSearch | Web search integration advertised | **Parity** | Codex advertises "up-to-date information retrieval" during tasks. | | MCP server support | `[mcp_servers]` TOML config | **Parity (richer)** | Codex's per-tool approval mode is stricter than Claude Code's MCP permissioning — plays well with BP-11 data-not-directives. | | WebFetch on private/authenticated URLs | Unchanged — same constraint; use MCP | **Parity** | Neither harness fetches authenticated URLs directly; both rely on MCP servers. | -| `CronCreate` / `ScheduleWakeup` (loop autonomy) | Not documented | **Likely gap** | The autonomous-loop cadence (minutely `<>` fire) has no Codex-CLI equivalent surfaced in the docs. **This is the biggest single gap** for Otto-in-Codex; the entire `/loop` auto-mode depends on cron. Stage 2 must verify whether Codex Cloud background tasks cover this. | +| `CronCreate` / `ScheduleWakeup` (loop autonomy) | Codex Cloud thread automations (per [`developers.openai.com/codex/app/automations`](https://developers.openai.com/codex/app/automations)) | **Partial (different surface)** | Codex Cloud has a documented scheduling primitive: thread automations support custom cron syntax, minute-based intervals (heartbeat-style recurring wake-ups attached to a thread), and daily/weekly schedules. 
Functionally equivalent to Claude Code's `CronCreate` / `ScheduleWakeup` for the autonomous-loop cadence, but on a different surface (Codex Cloud rather than the local CLI session). Stage-2 must verify the API surface for arming/listing automations from agent prompts. The local `codex` CLI itself does not expose a `cron` primitive — the cloud is the equivalent surface. | | Plan Mode | `plan_mode_reasoning_effort` config | **Parity** | Named differently; same concept. | | Output styles (e.g., explanatory) | Not documented; may go via system-prompt override | **Gap (minor)** | Factory-side impact is small; output styles are Claude-Code-session features, not substrate. | | Hooks (`.claude/settings.json` PreToolUse, UserPromptSubmit) | `notify` hook + shell-only PreToolUse (per OpenAI release notes for `rust-v0.117.0`, March 26 2026, [openai/codex#15211](https://github.com/openai/codex/pull/15211)) | **Partial (narrowing)** | Codex now has shell-only PreToolUse alongside the existing `notify` hook for turn completion. UserPromptSubmit and other Claude-Code-specific hook types are still gaps. Zeta's ASCII-clean pre-commit + prompt-injection lints run via git-pre-commit (harness-neutral) so the gap-impact on Zeta substrate is small. SessionStart hooks (e.g., for output style) still have no Codex equivalent. | @@ -202,18 +202,25 @@ and Codex-specific. - Parity: 11 (TodoWrite reclassified Gap → Parity (different shape) per OpenAI's Sept 15 2025 Codex CLI to-do-list announcement) -- Partial: 4 -- Gap: 3 (of which 1 — cron/autonomous-loop — is critical) +- Partial: 5 (cron/autonomous-loop reclassified Likely-gap → + Partial (different surface) per + `developers.openai.com/codex/app/automations` thread-automation + primitive) +- Gap: 2 (no longer including cron — autonomous-loop is reachable + via Codex Cloud thread automations) - Codex-specific: 2 (Score subject to Stage 2 verification — these are first-pass counts based on documentation review, not behavioral tests.) 
+For a *first-class* Otto experience, the autonomous-loop +cadence is partially covered, on a different surface, by Codex +Cloud thread automations (cron syntax + minute intervals per +[`developers.openai.com/codex/app/automations`](https://developers.openai.com/codex/app/automations)). +Otto-in-Codex parity is therefore reachable, but the surface +shifts from local-CLI cron to cloud-thread automations. Stage +2 must verify the agent-facing API surface for arming/listing +automations. --- @@ -240,18 +247,20 @@ sidesteps that problem for Phase 1 Codex research**. ## 5 · Gap analysis — critical vs. nice-to-have -**Critical (blocks Otto-in-Codex parity):** - -1. **No `CronCreate` / `ScheduleWakeup` equivalent.** The - entire autonomous-loop cadence depends on minutely cron - fires with the `<>` sentinel. Without a - Codex-CLI way to schedule wake-ups, Otto-in-Codex is - reactive-only (waits for Aaron to kick the next tick). This - is the single most important Stage 2 question: **does Codex - Cloud offer a scheduled-task primitive?** If yes, parity is - reachable. If no, Codex-in-Otto mode runs as a non-loop - harness for now, with the /loop cadence retained in Claude - Code. +**High-priority (formerly the critical gap, now reframed):** + +1. **`CronCreate` / `ScheduleWakeup` is reachable via Codex + Cloud thread automations (Partial, different surface).** Codex Cloud's documented + automations primitive (per + [`developers.openai.com/codex/app/automations`](https://developers.openai.com/codex/app/automations)) + covers custom cron syntax, minute-based heartbeat + intervals, and daily/weekly schedules. The autonomous-loop + cadence (minutely `<>` fire) is functionally + reachable, but the surface shifts from local-CLI cron to + cloud-thread automations.
Stage 2 must verify the + agent-facing API for arming/listing automations from a + Codex prompt and confirm the heartbeat-style minute interval + matches Claude Code's `* * * * *` cadence. **Important (meaningful friction, workarounds exist):** @@ -327,12 +336,22 @@ probe lands. 3. **MCP-server invocation.** Register a no-op MCP server in `~/.codex/config.toml` and verify `approval_mode` gates trigger. -4. **Cron / scheduled-task research.** The critical gap. Read - Codex Cloud docs specifically on scheduled task - primitives; file the outcome. -5. **`codex exec` non-interactive.** Run - `codex exec "list the top 5 open PRs on LFG"` and compare - output shape to Claude Code's one-shot invocation. +4. **Cron / scheduled-task research.** Verify the Codex Cloud + thread-automations API surface (per + [`developers.openai.com/codex/app/automations`](https://developers.openai.com/codex/app/automations)) + for arming/listing minute-interval automations from an agent + prompt; confirm the heartbeat cadence matches Claude Code's + `* * * * *` `<>` fire shape. +5. **`codex exec` non-interactive.** Run a **repo-local probe** + that doesn't depend on external services or credentials; + for example, `codex exec "count the .fs files under + src/Core/ and report the count and the longest filename"`. + Compare output shape to Claude Code's one-shot invocation. + Avoid prompts that hit GitHub / network / external auth + (e.g., "list the top 5 open PRs on LFG"), since failures + from missing credentials, repo visibility, or network + policy would look like `codex exec` parity failures even + when non-interactive mode works. 6. **Git-worktree subagent.** Test isolation: "open a subagent in an isolated worktree and have it modify a single line; verify the main session doesn't see the