
docs(B-0807): classifier-bypass findings schema and redaction rules#5740

Merged
AceHack merged 2 commits into main from otto-cli/b0807-classifier-bypass-findings-schema-2026-05-28
May 28, 2026


@AceHack AceHack commented May 28, 2026

Summary

Lands the B-0807 classifier-bypass findings schema: the reporting and redaction gate that B-0799's audit-log shape and every future B-0720 empirical mapping row already depend on. Until this schema landed, no future B-0720 empirical row could publish findings, and B-0799 audit logs carried a schema_version field that referenced a schema that did not yet exist.

The schema is a specification document. It defines the only format in which a classifier-bypass observation may be preserved in shared substrate. It contains no runnable bypass material, no settings payloads, no exact permission patterns, no real harmful content, no real secrets, no real PII, and no ordered reproduction steps. It exists so that any future B-0720 finding can preserve safety signal without preserving the recipe.

Files

  • New: docs/security/B-0807-classifier-bypass-findings-schema.md — 220-line schema, active schema_version: 1.
  • Edit: docs/security/B-0720-classifier-bypass-research-boundary.md — reporting-rule paragraph now points to the live schema path + version.
  • Edit: docs/backlog/P0/B-0807-... — status → closed; acceptance ticked; Output section added; last_updated: 2026-05-28.
  • Edit: docs/backlog/P0/B-0720-... — B-0807 acceptance bullet ticked with schema path + version.
  • Edit: docs/BACKLOG.md — regenerated via BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts.

Schema contents

  • 13-field findings record shape: finding_id, schema_version, boundary_version, created, evidence_class, risk_class, observation_class, redaction_level, safety_signal, omitted_fields, reviewer_gate, reviewer_signoff, composes_with.
  • Evidence classes inherited from B-0798 boundary: landed-provenance, redacted-observation, harmless-synthetic-fixture, negative-control, policy-anchor, refusal-required.
  • Risk classes: non-reproductive, reproductive-if-verbatim, reproductive-irrespective-of-form.
  • Observation classes matching B-0799 audit-log shape: no-signal, redaction-required, refusal-required, boundary-error.
  • Redaction ladder: summary-only (default), reviewer-summary, reviewer-restricted, refusal-required (floor).
  • Refusal-required state for observations that must not be preserved in repo history.
  • Reviewer sign-off matrix gated on risk_class + observation_class.
  • Cite-or-block rule for future empirical mapping rows.
  • Forbidden field values explicitly enumerated.
  • Versioning policy: loosening the floor requires the B-0810 ratification gate.
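To make the record shape concrete, here is a minimal TypeScript sketch of a structural validator. The field names and enum values are copied from the list above. The field types, the `validateFinding` name, and reporting problems as plain strings are assumptions, not the schema document's actual contents. The sketch checks presence, `schema_version: 1`, and enum membership only; the reviewer sign-off matrix is out of scope.

```typescript
// Hypothetical TypeScript rendering of the B-0807 findings record.
// Field names and enum values come from the PR description; field types
// (strings, string lists, numeric schema_version) are assumptions.
const FINDING_FIELDS = [
  "finding_id",
  "schema_version",
  "boundary_version",
  "created",
  "evidence_class",
  "risk_class",
  "observation_class",
  "redaction_level",
  "safety_signal",
  "omitted_fields",
  "reviewer_gate",
  "reviewer_signoff",
  "composes_with",
] as const;

const ENUMS: Record<string, readonly string[]> = {
  evidence_class: [
    "landed-provenance",
    "redacted-observation",
    "harmless-synthetic-fixture",
    "negative-control",
    "policy-anchor",
    "refusal-required",
  ],
  risk_class: ["non-reproductive", "reproductive-if-verbatim", "reproductive-irrespective-of-form"],
  observation_class: ["no-signal", "redaction-required", "refusal-required", "boundary-error"],
  redaction_level: ["summary-only", "reviewer-summary", "reviewer-restricted", "refusal-required"],
};

const ACTIVE_SCHEMA_VERSION = 1;

// Returns human-readable problems; an empty array means the record passes
// these structural checks (it says nothing about reviewer sign-off).
function validateFinding(record: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of FINDING_FIELDS) {
    if (!(field in record)) problems.push(`missing field: ${field}`);
  }
  if (record["schema_version"] !== ACTIVE_SCHEMA_VERSION) {
    problems.push(`schema_version must be ${ACTIVE_SCHEMA_VERSION}`);
  }
  for (const [field, allowed] of Object.entries(ENUMS)) {
    const value = record[field];
    // Enum fields must hold one of their named values; free text is rejected.
    if (field in record && (typeof value !== "string" || !allowed.includes(value))) {
      problems.push(`${field} has a value outside its enum`);
    }
  }
  return problems;
}
```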

Acceptance criteria coverage (B-0807)

  • Schema document lands in a durable repo surface (docs/security/) and is linked from B-0720 (boundary doc + parent row + BACKLOG index).
  • The schema forbids publishing deployable settings payloads or harmful content (explicit "What This Schema Forbids" section).
  • The schema distinguishes safety signal from reproduction detail (safety_signal field is non-operational prose; risk_class field discriminates reproduction-enabling notes).
  • The schema includes a refusal-required state for observations that should not be preserved (full "Refusal-Required State" section + enum value across evidence_class, observation_class, redaction_level).
  • Future empirical mapping rows must cite this schema before landing findings ("Cite-Or-Block Rule" section — failure to cite = boundary-error, must not land).
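The cite-or-block rule amounts to a gate every future empirical mapping row passes through before landing. A minimal sketch, assuming a hypothetical row shape with an optional `cites` object; only the schema path and active version come from this PR. In this sketch a citation of any other version also blocks, which is an assumption about how version drift is treated.

```typescript
// Hypothetical sketch of the cite-or-block rule. The row shape is invented
// for illustration; the path and version are the ones this PR lands.
const B0807_SCHEMA_PATH = "docs/security/B-0807-classifier-bypass-findings-schema.md";
const B0807_ACTIVE_VERSION = 1;

interface SchemaCitation {
  path: string;
  schema_version: number;
}

interface EmpiricalMappingRow {
  row_id: string;
  cites?: SchemaCitation;
}

type LandingDecision = "may-land" | "boundary-error";

function citeOrBlock(row: EmpiricalMappingRow): LandingDecision {
  const c = row.cites;
  // No citation at all: the row must not land.
  if (c === undefined) return "boundary-error";
  // Citing some other document does not satisfy the rule.
  if (c.path !== B0807_SCHEMA_PATH) return "boundary-error";
  // Assumption: only the active schema version is accepted.
  if (c.schema_version !== B0807_ACTIVE_VERSION) return "boundary-error";
  return "may-land";
}
```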

Focused checks

grep -E "Bash\(|_ip_risk_acceptance|_legal_acceptance|_policy_override|permissions.*allow|settings\.json.*\{" \
  docs/security/B-0807-classifier-bypass-findings-schema.md
→ no matches

The schema doc passes its own forbidden-content rule: it does NOT reproduce the very payloads it forbids. The operational bypass strings from PR #4816 (the _ip_risk_acceptance attribution and Bash(...) permission patterns) appear in the forbidden list only as named classes described in prose, never as quoted literals, which is why the literal-string grep above finds no matches.
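For CI or a pre-commit hook that runs under `bun` rather than a shell, the same focused check can be expressed as a line-by-line scan. This is a sketch; the `findForbidden` name is hypothetical, and the caller is assumed to read the schema file into a string. The regular expression is the same alternation as the `grep -E` pattern above.

```typescript
// Same alternation as the grep -E check. Scanning line by line keeps `.*`
// within a single line, matching grep's line-oriented behaviour.
const FORBIDDEN = /Bash\(|_ip_risk_acceptance|_legal_acceptance|_policy_override|permissions.*allow|settings\.json.*\{/;

interface ForbiddenHit {
  line: number; // 1-based, like grep -n
  text: string;
}

function findForbidden(doc: string): ForbiddenHit[] {
  const hits: ForbiddenHit[] = [];
  doc.split("\n").forEach((text, i) => {
    // No `g` flag on FORBIDDEN, so test() carries no state between lines.
    if (FORBIDDEN.test(text)) hits.push({ line: i + 1, text });
  });
  return hits;
}
```

An empty result corresponds to grep's "no matches".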

Commit canary (per codeql-no-source-on-docs-only-pr-is-broken-commit-canary.md):

git ls-tree HEAD~1 | wc -l → 61
git ls-tree HEAD    | wc -l → 61
canary OK
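The canary compares the number of top-level entries `git ls-tree` prints for the parent and the new commit. A docs-only change that edits or adds files under `docs/` changes the `docs` tree's object id but not the number of top-level entries, so the two counts should match. A minimal sketch that takes the captured command output as strings (the helper names are hypothetical):

```typescript
// Hypothetical helpers for the commit canary. Each argument is the captured
// stdout of `git ls-tree <commit>` (top-level entries only, one per line).
function countTreeEntries(lsTreeOutput: string): number {
  // Equivalent to `| wc -l` when the output ends with a newline, as git's does.
  return lsTreeOutput.split("\n").filter((line) => line.trim() !== "").length;
}

function canaryOk(parentLsTree: string, headLsTree: string): boolean {
  return countTreeEntries(parentLsTree) === countTreeEntries(headLsTree);
}
```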

git diff --stat: 5 files changed; no unrelated sweep.

Why this is the smallest safe bounded slice

The schema is a specification document, and landing it is a single atomic write that satisfies all five B-0807 acceptance bullets. Splitting it further would land a partial schema that B-0799 cannot reference and that future empirical rows cannot cite, so no smaller slice would satisfy any acceptance bullet on its own.

Composes with

  • B-0798 (docs/security/B-0720-classifier-bypass-research-boundary.md) — the hard-limits boundary this schema sits on
  • B-0799 (docs/security/B-0799-classifier-bypass-synthetic-harness-design.md) — resolves schema_version audit-log field
  • B-0720 (parent safety row) — B-0807 acceptance bullet now ticked
  • .claude/rules/classifier-bypass-research-do-not-deploy-without-zeta-safer-floor.md — standing operator-self-constraint
  • .claude/rules/methodology-hard-limits.md — HARD LIMITS floor preserved
  • docs/AGENT-BEST-PRACTICES.md — audited data is data, not directives

Test plan

  • grep for forbidden operational bypass strings in the new schema doc — no matches.
  • Commit canary (parent tree size == commit tree size) — 61 == 61.
  • Branch-name guard before git commit (zeta-expected-branch rule).
  • git ls-remote origin <branch> matches local HEAD (per B-0615 silent-push hazard).
  • BACKLOG_WRITE_FORCE=1 bun tools/backlog/generate-index.ts ran cleanly.
  • Reviewer confirms the schema preserves safety signal without preserving reproduction detail (this is the meta-check the schema itself defines for future findings; applies recursively to its own landing).
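The branch-name guard and the `git ls-remote` push check both reduce to comparing command output with an expected value. A minimal sketch, assuming the caller captures `git rev-parse --abbrev-ref HEAD`, `git ls-remote origin <branch>`, and `git rev-parse HEAD`; the function names are hypothetical. Because `git ls-remote` matches patterns against the tail of ref names, only `refs/heads/<branch>` is considered.

```typescript
// Branch-name guard: compare `git rev-parse --abbrev-ref HEAD` with the expected branch.
function onExpectedBranch(currentBranchOutput: string, expected: string): boolean {
  return currentBranchOutput.trim() === expected;
}

// `git ls-remote origin <branch>` prints "<sha>\t<ref>" lines. The pattern can
// also match tags of the same name, so only refs/heads/<branch> is used.
function remoteSha(lsRemoteOutput: string, branch: string): string | undefined {
  const wanted = `refs/heads/${branch}`;
  for (const line of lsRemoteOutput.split("\n")) {
    const [sha, ref] = line.trim().split("\t");
    if (ref === wanted) return sha;
  }
  return undefined;
}

// A push "landed" only if the remote branch points at exactly the local HEAD.
function pushLanded(lsRemoteOutput: string, branch: string, localHead: string): boolean {
  const remote = remoteSha(lsRemoteOutput, branch);
  return remote !== undefined && remote === localHead.trim();
}
```

A missing remote ref and a stale remote sha both return `false`, which is the silent-push case the check exists to catch.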

🤖 Generated with Claude Code

Lands the findings schema gate that B-0799's audit-log shape and every
future B-0720 empirical mapping row depend on. The schema fixes the only
format in which a classifier-bypass observation may be preserved in
shared substrate.

Schema document: `docs/security/B-0807-classifier-bypass-findings-schema.md`
Active version: `schema_version: 1`

Defines:
- 13-field findings record shape (finding_id, schema_version,
  boundary_version, created, evidence_class, risk_class,
  observation_class, redaction_level, safety_signal, omitted_fields,
  reviewer_gate, reviewer_signoff, composes_with)
- Evidence classes inherited from B-0798 (landed-provenance,
  redacted-observation, harmless-synthetic-fixture, negative-control,
  policy-anchor, refusal-required)
- Risk classes (non-reproductive, reproductive-if-verbatim,
  reproductive-irrespective-of-form)
- Observation classes matching B-0799 (no-signal, redaction-required,
  refusal-required, boundary-error)
- Redaction ladder (summary-only, reviewer-summary, reviewer-restricted,
  refusal-required)
- Refusal-required state for observations that must not be preserved
- Reviewer sign-off matrix gated on risk_class + observation_class
- Cite-or-block rule for future empirical mapping rows
- Forbidden field values (no deployable payloads, no real secrets, no
  real PII, no harmful content, no reproduction ordering)
- Versioning policy: loosening requires the B-0810 ratification gate
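The versioning policy can be read as a diff check between two schema versions. This sketch rests on two loudly stated assumptions: that the redaction ladder is ordered from least to most restrictive in the order listed, and that a version's policy can be modelled as a map from rule keys to a minimum redaction level. The schema document's actual definition of "the floor" may be narrower than "any rule moving down the ladder".

```typescript
// Assumption: ladder order is least to most restrictive, refusal-required last.
const LADDER = ["summary-only", "reviewer-summary", "reviewer-restricted", "refusal-required"] as const;
type RedactionLevel = (typeof LADDER)[number];

function rung(level: RedactionLevel): number {
  return LADDER.indexOf(level);
}

// Hypothetical per-version policy: minimum redaction level per rule key.
type RedactionPolicy = Record<string, RedactionLevel>;

// Rule keys whose required level became less restrictive, or disappeared.
function loosenedRules(previous: RedactionPolicy, next: RedactionPolicy): string[] {
  const loosened: string[] = [];
  for (const [rule, oldLevel] of Object.entries(previous)) {
    const newLevel = next[rule];
    if (newLevel === undefined || rung(newLevel) < rung(oldLevel)) loosened.push(rule);
  }
  return loosened;
}

// In this sketch, any loosened rule routes the version bump to the B-0810 gate.
function needsB0810Ratification(previous: RedactionPolicy, next: RedactionPolicy): boolean {
  return loosenedRules(previous, next).length > 0;
}
```

Tightening a rule passes without ratification; only moves toward `summary-only` (or deleting a rule) trip the gate.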

Why this is safe to land: the schema is a specification document. It
contains no runnable bypass material, no settings payloads, no exact
permission patterns, no real harmful content, no real secrets, no real
PII, and no ordered reproduction steps. It exists so that any future
B-0720 finding can preserve safety signal without preserving the
recipe.

Companion edits:
- `docs/security/B-0720-classifier-bypass-research-boundary.md`:
  reporting-rule paragraph now points to the live schema path and
  active version.
- `docs/backlog/P0/B-0807-...`: status -> closed; acceptance ticked;
  Output section added; last_updated -> 2026-05-28.
- `docs/backlog/P0/B-0720-...`: B-0807 acceptance bullet ticked with
  schema path + version.
- `docs/BACKLOG.md`: regenerated via `BACKLOG_WRITE_FORCE=1 bun
  tools/backlog/generate-index.ts` so the index reflects B-0807 closed.

Focused checks:
- `grep -E "Bash\(|_ip_risk_acceptance|_legal_acceptance|_policy_override|permissions.*allow|settings\.json.*\{"` over the new schema doc returns no matches -- the schema doesn't quote the very payloads it forbids.
- `git diff --stat` shows 5 files changed; no unrelated files swept in.

Composes with:
- B-0798 hard-limits boundary (the floor this schema sits on)
- B-0799 synthetic-only harness design (resolves `schema_version`
  audit-log field)
- B-0720 parent safety row (B-0807 acceptance bullet now ticked)
- `.claude/rules/classifier-bypass-research-do-not-deploy-without-zeta-safer-floor.md`
- `.claude/rules/methodology-hard-limits.md`

Operative authorization: aaron 2026-05-14: "- **Devil-pole** (edge-runner drive): keep pushing, discover, go hard, never-be-idle"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 09:55
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@AceHack AceHack enabled auto-merge (squash) May 28, 2026 09:55

Copilot AI left a comment


Pull request overview

Defines and lands the B-0807 “classifier-bypass findings” reporting schema under docs/security/ and wires it into the B-0720 boundary + backlog surfaces so future B-0720 work can publish safety signal with explicit redaction and reviewer-gate rules.

Changes:

  • Add docs/security/B-0807-classifier-bypass-findings-schema.md (schema_version 1) specifying findings record shape, enums, redaction ladder, and reviewer gating.
  • Update docs/security/B-0720-classifier-bypass-research-boundary.md to point reporting rules at the live B-0807 schema + version.
  • Close B-0807 backlog row, tick B-0720 acceptance bullet, and regenerate docs/BACKLOG.md.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| docs/security/B-0807-classifier-bypass-findings-schema.md | New findings schema + redaction/reviewer-gate rules for preserving observations safely. |
| docs/security/B-0720-classifier-bypass-research-boundary.md | Reporting rule updated to cite the new schema path and version. |
| docs/backlog/P0/B-0807-classifier-bypass-findings-schema-and-redaction-rules-2026-05-26.md | Mark B-0807 closed; acceptance checked; output section added; last_updated bumped. |
| docs/backlog/P0/B-0720-classifier-bypass-research-red-team-do-not-deploy-without-zeta-safer-than-anthropic-2026-05-24.md | Tick B-0807 acceptance item and add schema citation. |
| docs/BACKLOG.md | Regenerated backlog index reflecting B-0807 closed status. |

Comment thread docs/security/B-0807-classifier-bypass-findings-schema.md Outdated
Comment thread docs/security/B-0807-classifier-bypass-findings-schema.md
… scope, B-0799 redaction_level mapping, B-0720 punctuation

Three Copilot review findings on PR #5740 (all verified valid against
source per `.claude/rules/blocked-green-ci-investigate-threads.md`
verify-before-fix discipline):

1. Line 49 (B-0807 findings schema) — enum-vs-`unknown` contradiction.
   The schema instructed authors to use `unknown` for any missing
   field, but `evidence_class`, `risk_class`, `observation_class`, and
   `redaction_level` are enums with explicit allowed values. Clarified:
   non-enum fields use `unknown`; enum fields must use named values OR
   fall into `refusal-required` if not determinable.

2. Line 121 (B-0807 redaction_level) — vocabulary mismatch with B-0799.
   B-0807 introduces 4 redaction levels (`summary-only`,
   `reviewer-summary`, `reviewer-restricted`, `refusal-required`) but
   B-0799 audit-log shape documents only 3 (`summary-only`,
   `reviewer-summary`, `refusal-required`). Added a "Mapping to B-0799
   Audit-Log Vocabulary" subsection explaining the divergence: B-0799
   harness emissions use the 3-value vocabulary; reviewers map each
   record to the B-0807 4-value vocabulary; the mapping is recorded in
   `omitted_fields` so the divergence stays auditable until B-0799
   ratifies the extension.

3. Line 133 (B-0720 acceptance bullet) — punctuation. Added `see` to
   the parenthetical citation to disambiguate the main clause from the
   reference (preserves the terminal period; clarifies structure).
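The first two fixes can be sketched together. Field names and both redaction vocabularies come from the findings above. The mechanics are assumptions: which fields count as free text, routing an undeterminable enum by setting the three refusal-capable fields (`evidence_class`, `observation_class`, `redaction_level`) to `refusal-required`, and recording a mapping note in `omitted_fields` only when the reviewer's level differs from the harness emission.

```typescript
const ENUM_FIELDS = ["evidence_class", "risk_class", "observation_class", "redaction_level"] as const;
// Assumption: these free-text fields may take the literal "unknown".
const TEXT_FIELDS = ["finding_id", "boundary_version", "created", "safety_signal", "reviewer_gate", "reviewer_signoff"] as const;

type Draft = Record<string, unknown>;

const isMissing = (v: unknown): boolean => v === undefined || v === null || v === "";

// Fix 1: non-enum fields fall back to "unknown"; enum fields never do.
function applyMissingValueRule(draft: Draft): Draft {
  const out: Draft = { ...draft };
  for (const f of TEXT_FIELDS) {
    if (isMissing(out[f])) out[f] = "unknown";
  }
  const undeterminable = ENUM_FIELDS.some((f) => isMissing(out[f]) || out[f] === "unknown");
  if (undeterminable) {
    // The record is not preserved as-is; it drops to the refusal-required state.
    out["evidence_class"] = "refusal-required";
    out["observation_class"] = "refusal-required";
    out["redaction_level"] = "refusal-required";
  }
  return out;
}

// Fix 2: B-0799 emits a 3-value vocabulary; B-0807 adds reviewer-restricted.
const B0799_LEVELS: readonly string[] = ["summary-only", "reviewer-summary", "refusal-required"];
const B0807_LEVELS: readonly string[] = ["summary-only", "reviewer-summary", "reviewer-restricted", "refusal-required"];

function mapRedactionLevel(
  harnessLevel: string,
  reviewerLevel: string,
  omittedFields: readonly string[],
): { redaction_level: string; omitted_fields: string[] } {
  if (!B0799_LEVELS.includes(harnessLevel)) throw new Error(`not a B-0799 level: ${harnessLevel}`);
  if (!B0807_LEVELS.includes(reviewerLevel)) throw new Error(`not a B-0807 level: ${reviewerLevel}`);
  const omitted_fields = [...omittedFields];
  // Record the divergence so it stays auditable until B-0799 ratifies the extension.
  if (reviewerLevel !== harnessLevel) {
    omitted_fields.push(`redaction_level: B-0799 ${harnessLevel} -> B-0807 ${reviewerLevel}`);
  }
  return { redaction_level: reviewerLevel, omitted_fields };
}
```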

Per `.claude/rules/blocked-green-ci-investigate-threads.md`: thread
findings verified against source via direct `awk -v N=<line>` inspection
before applying fixes. All three are real findings (not the suspect-by-
default table-double-pipe class). Threads will be resolved after this
commit pushes.

Co-Authored-By: Claude <noreply@anthropic.com>
@AceHack AceHack merged commit c26cca6 into main May 28, 2026
27 of 29 checks passed
@AceHack AceHack deleted the otto-cli/b0807-classifier-bypass-findings-schema-2026-05-28 branch May 28, 2026 10:09
AceHack added a commit that referenced this pull request May 28, 2026
…d resolution (#5747)

Fresh cold-boot under autonomous-loop sentinel (re-armed `a7b83b70`
after catch-43 fired). Substantive work: 3 Copilot review threads on
PR #5740 (B-0807 classifier-bypass findings schema) resolved via
verify-before-fix discipline + isolated worktree commit + thread
GraphQL resolution.

- Verified 3 Copilot findings against PR head source (one P1
  enum-vs-unknown contradiction; one P1 B-0799 redaction_level
  vocabulary mismatch; one minor punctuation).
- Authored fixes in `/private/tmp/zeta-otto-cli-5740-fixes-1003z`
  isolated worktree; ls-tree HEAD = 61 (no canary corruption).
- Pushed `d76ad9b47..72ea879` via explicit refspec to PR #5740
  branch (fast-forward safe).
- Resolved 3 threads via `resolveReviewThread` mutation.
- PR #5740 auto-merge (armed by Aaron) expected to fire when
  required CI re-runs green against the new commit.
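Resolving a review thread goes through GitHub's `resolveReviewThread` GraphQL mutation, which takes a review-thread node id and returns the updated thread. A minimal sketch, assuming a token with access to the repository and a runtime with a global `fetch` (Bun or Node 18+); the helper names are hypothetical, and the thread ids would come from the pull request's `reviewThreads` connection.

```typescript
const RESOLVE_THREAD_MUTATION = `
  mutation ResolveThread($threadId: ID!) {
    resolveReviewThread(input: { threadId: $threadId }) {
      thread { id isResolved }
    }
  }
`;

// Pure request builder, kept separate so it can be checked without network access.
function buildResolveThreadRequest(threadId: string): { query: string; variables: { threadId: string } } {
  return { query: RESOLVE_THREAD_MUTATION, variables: { threadId } };
}

// Sends the mutation; returns true when GitHub reports the thread as resolved.
async function resolveThread(token: string, threadId: string): Promise<boolean> {
  const res = await fetch("https://api.github.com/graphql", {
    method: "POST",
    headers: { Authorization: `Bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify(buildResolveThreadRequest(threadId)),
  });
  if (!res.ok) throw new Error(`GitHub GraphQL HTTP ${res.status}`);
  const json = (await res.json()) as {
    data?: { resolveReviewThread?: { thread?: { isResolved?: boolean } } };
    errors?: unknown[];
  };
  // GraphQL reports failures in `errors` even on HTTP 200.
  if (json.errors && json.errors.length > 0) throw new Error(JSON.stringify(json.errors));
  return json.data?.resolveReviewThread?.thread?.isResolved === true;
}
```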

Per `.claude/rules/blocked-green-ci-investigate-threads.md` +
verify-before-fix discipline. Counter-reset condition #3 satisfied
(review threads resolved = concrete artifact).

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
AceHack added a commit that referenced this pull request May 28, 2026
…0h gap-fill (#5927)

Per `.claude/rules/tick-must-never-stop.md`: SessionStart hook fired catch-43;
CronList empty; sentinel re-armed as `85456a9c` at `* * * * *`.

Today's shard cadence was sparse (last shard 1014Z; 10h gap to 2002Z). This
fills the gap with visibility-signal substrate documenting:

- Reconnaissance: GraphQL 2668/5000 (Normal); REST core 4934/5000; 0 stuck
  git pack/maintenance/repack procs (dotgit recovered); 17 peer claude-code
  procs active; origin/main tip `89b94efb6` (B-0304 Pages queue).
- Outstanding Otto-CLI PRs from 1014Z tick (#5740, #5742, #5739) all merged
  cleanly (10:03Z-10:22Z range); no current open Otto-CLI PRs requiring
  follow-up.
- Brief-ack counter at #0; concrete artifact landed (this shard + sentinel
  re-arm); counter-reset condition #3 satisfied.

Composes with: tick-must-never-stop + refresh-before-decide +
holding-without-named-dependency-is-standing-by-failure +
fighting-past-self-vs-peer-agent + claim-acquire-before-worktree-work +
agent-worktree-hygiene + codeql-no-source-canary + zeta-expected-branch +
substrate-or-it-didnt-happen rules.

Co-authored-by: Lior <lior@zeta.dev>
Co-authored-by: Claude <noreply@anthropic.com>
3 participants