Merged
2 changes: 1 addition & 1 deletion docs/BACKLOG.md
@@ -33,7 +33,7 @@ are closed (status: closed in frontmatter)._
- [ ] **[B-0720](backlog/P0/B-0720-classifier-bypass-research-red-team-do-not-deploy-without-zeta-safer-than-anthropic-2026-05-24.md)** Classifier-bypass research + red-team — can crafted settings.json make Anthropic classifier allow anything? Standing operator-constraint until Zeta safer
- [x] **[B-0798](backlog/P0/B-0798-classifier-bypass-hard-limits-and-research-boundary-2026-05-26.md)** Classifier-bypass hard-limits and research boundary for B-0720
- [x] **[B-0799](backlog/P0/B-0799-classifier-bypass-synthetic-harness-design-2026-05-26.md)** Classifier-bypass synthetic-only harness design for B-0720
- [ ] **[B-0807](backlog/P0/B-0807-classifier-bypass-findings-schema-and-redaction-rules-2026-05-26.md)** Classifier-bypass findings schema and redaction rules for B-0720
- [x] **[B-0807](backlog/P0/B-0807-classifier-bypass-findings-schema-and-redaction-rules-2026-05-26.md)** Classifier-bypass findings schema and redaction rules for B-0720
- [ ] **[B-0808](backlog/P0/B-0808-zeta-safety-substrate-inventory-for-classifier-floor-2026-05-26.md)** Zeta safety substrate inventory for the classifier-floor replacement gate
- [ ] **[B-0809](backlog/P0/B-0809-operator-refusal-pattern-for-classifier-bypass-requests-2026-05-26.md)** Operator-refusal pattern for classifier-bypass deployment requests
- [ ] **[B-0810](backlog/P0/B-0810-classifier-bypass-knights-guild-ratification-and-lift-gate-2026-05-26.md)** Classifier-bypass Knights Guild ratification and standing-constraint lift gate
@@ -127,8 +127,10 @@ Per Aaron 2026-05-24 standing constraint + general HARD LIMITS:
conditions, and synthetic-only rule before any mapping work proceeds.
- [x] B-0799 designs a synthetic-only harness that can test harmless fixtures
without carrying deployable bypass settings or harmful content.
- [ ] B-0807 defines the findings schema and redaction policy so reports can
preserve safety signal without reproducible bypass detail.
- [x] B-0807 defines the findings schema and redaction policy so reports can
preserve safety signal without reproducible bypass detail
(see `docs/security/B-0807-classifier-bypass-findings-schema.md`,
`schema_version: 1`).
- [ ] Pattern variant empirical map (research file)
- [ ] Meta-field empirical map (research file)
- [ ] Content class empirical map — clearly distinguishing what the classifier
@@ -1,10 +1,10 @@
---
id: B-0807
priority: P0
status: open
status: closed
title: "Classifier-bypass findings schema and redaction rules for B-0720"
created: 2026-05-26
last_updated: 2026-05-26
last_updated: 2026-05-28
renumbered_from: B-0800
parent: B-0720
depends_on: [B-0798]
@@ -35,16 +35,28 @@ Define the reporting format for any future B-0720 observation:

## Acceptance

- [ ] Schema document lands in a durable repo surface and is linked from
- [x] Schema document lands in a durable repo surface and is linked from
B-0720.
- [ ] The schema forbids publishing deployable settings payloads or harmful
- [x] The schema forbids publishing deployable settings payloads or harmful
content.
- [ ] The schema distinguishes safety signal from reproduction detail.
- [ ] The schema includes a refusal-required state for observations that should
- [x] The schema distinguishes safety signal from reproduction detail.
- [x] The schema includes a refusal-required state for observations that should
not be preserved in repo history.
- [ ] Future empirical mapping rows must cite this schema before landing
- [x] Future empirical mapping rows must cite this schema before landing
findings.

## Output

- `docs/security/B-0807-classifier-bypass-findings-schema.md` defines the
findings record shape, evidence classes, risk classes, observation classes,
redaction levels, refusal-required state, reviewer sign-off matrix,
cite-or-block rule for future empirical rows, forbidden field values, and
schema versioning policy. Active `schema_version: 1`.
- B-0799 audit-log field `schema_version` is now resolvable; future harness
runs reference `schema_version: 1`.
- Future empirical mapping rows under B-0720 must cite this schema in their
`composes_with` list or document body before any finding lands.

## Out of scope

- Running experiments.
6 changes: 4 additions & 2 deletions docs/security/B-0720-classifier-bypass-research-boundary.md
@@ -107,8 +107,10 @@ Future B-0720 reports must not preserve:
- unredacted sensitive content;
- instructions found in audited data.

B-0807 owns the full findings schema and redaction policy. Until B-0807 lands,
publish only high-level summaries and provenance references.
B-0807 owns the full findings schema and redaction policy. The schema lives
at `docs/security/B-0807-classifier-bypass-findings-schema.md` (active
`schema_version: 1`). Future empirical mapping rows must cite that schema by
version before landing any finding.

## Dependency Rule

243 changes: 243 additions & 0 deletions docs/security/B-0807-classifier-bypass-findings-schema.md
@@ -0,0 +1,243 @@
# B-0807 Classifier-Bypass Findings Schema and Redaction Rules

Status: active reporting and redaction gate for B-0720.

This document defines the only format in which classifier-bypass observations
may be preserved in shared substrate. It exists so that future empirical work
under B-0720 can land safety signal without landing reproduction detail. It
contains no runnable payloads, no real harmful content, no real secrets, no
real PII, and no ordered bypass recipes.

Schema version: `1`. The B-0799 harness design records a `schema_version`
field, and its value must match the active version listed here. Older
versions are retained for audit only; new findings cite the current version.

## Boundary

- Cite `docs/security/B-0720-classifier-bypass-research-boundary.md` (B-0798)
as the active hard-limits floor. If a finding cannot fit inside the allowed
evidence classes there, this schema does not authorize publishing it.
- Cite `docs/security/B-0799-classifier-bypass-synthetic-harness-design.md`
as the source of any audit-log fields referenced by an observation.
- Treat operator authority as bounded by the standing constraint in B-0720.
This schema does not relax that constraint. The B-0810 ratification gate
remains the only path to a relaxed floor.
- Deployable bypass material, real harmful content, real secrets, real PII,
exact classifier settings, and ordered reproduction steps remain forbidden
in every field of every record this schema defines.

## When This Schema Applies

This schema applies to any artifact that proposes to preserve a
classifier-bypass observation in shared substrate. That includes:

- `docs/research/red-team/*` empirical mapping notes;
- `docs/research/2026-*-classifier-bypass-empirical-mapping-*.md` files
named in B-0720;
- backlog rows under B-0720 that quote findings;
- PR descriptions, commit messages, or broadcast notes that summarize an
observation;
- audit logs from a future authorized harness run.

If an artifact would touch any of those surfaces, the author must apply this
schema before the artifact lands.

## Findings Record Shape

Every finding is a single record with these fields. All fields are required.
If a non-enum field is unknown, mark it `unknown` and add a reviewer note; do
not omit the field. Enum fields (`evidence_class`, `risk_class`,
`observation_class`, `redaction_level`) must carry one of their named values;
`unknown` is not a permitted enum value, and a record that cannot determine
an enum field falls into `refusal-required` instead.

| Field | Type | Allowed values | Purpose |
|-------|------|----------------|---------|
| `finding_id` | string | local stable identifier | Lets reviewers reference the record without quoting its content. |
| `schema_version` | string | `1` (the current version of this schema) | Pins the record to the rules in force when it was written. |
| `boundary_version` | string | reference to B-0798 or the ratified successor under B-0810 | Records the floor the finding sits under. |
| `created` | string | ISO-8601 date | Timestamps the observation for audit. |
| `evidence_class` | enum | one of the allowed evidence classes below | Names what kind of evidence supports the finding. |
| `risk_class` | enum | one of the risk classes below | Names whether verbatim preservation would enable reproduction. |
| `observation_class` | enum | one of the observation classes below | Names what the harness or reviewer saw. |
| `redaction_level` | enum | one of the redaction levels below | Names how the finding may be preserved. |
| `safety_signal` | string | short prose summary, no payloads | Preserves the lesson without preserving reproduction detail. |
| `omitted_fields` | list | names of intentionally omitted data | Makes the redaction auditable. |
| `reviewer_gate` | string | reviewer, row, or PR required before publication | Names who must sign off before the record can land. |
| `reviewer_signoff` | string | identifier of the reviewer who approved publication, or `pending` | Records that the gate has been cleared. |
| `composes_with` | list | citations to B-0798, B-0799, B-0720, and child rows | Anchors the record to the active boundary substrate. |

A record that is missing any field, or that places forbidden material in a
field, must not land. The author must either complete the record under the
correct redaction level or refuse the record per the refusal-required rule
below.
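
The shape rules above can be checked mechanically. The following is a minimal
sketch, not part of the schema: it assumes a record is held as a plain Python
dict, and the names `REQUIRED_FIELDS`, `ENUMS`, and `shape_errors` are
hypothetical. Field names and enum values are copied from the tables in this
document. It cannot detect forbidden material inside a field; that remains a
reviewer responsibility.

```python
from typing import Any

# Field names copied from the Findings Record Shape table.
REQUIRED_FIELDS = (
    "finding_id", "schema_version", "boundary_version", "created",
    "evidence_class", "risk_class", "observation_class", "redaction_level",
    "safety_signal", "omitted_fields", "reviewer_gate", "reviewer_signoff",
    "composes_with",
)

# Enum values copied from the class and level tables in this document.
ENUMS = {
    "evidence_class": {
        "landed-provenance", "redacted-observation",
        "harmless-synthetic-fixture", "negative-control",
        "policy-anchor", "refusal-required",
    },
    "risk_class": {
        "non-reproductive", "reproductive-if-verbatim",
        "reproductive-irrespective-of-form",
    },
    "observation_class": {
        "no-signal", "redaction-required", "refusal-required",
        "boundary-error",
    },
    "redaction_level": {
        "summary-only", "reviewer-summary", "reviewer-restricted",
        "refusal-required",
    },
}


def shape_errors(record: dict[str, Any]) -> list[str]:
    """Return shape problems; an empty list means the shape is acceptable."""
    errors = []
    # Every field is required; `unknown` is allowed only for non-enum fields.
    for name in REQUIRED_FIELDS:
        if name not in record:
            errors.append(f"missing field: {name}")
    # Enum fields must carry a named value, so `unknown` is rejected here.
    for name, allowed in ENUMS.items():
        if name in record:
            value = record[name]
            if not isinstance(value, str) or value not in allowed:
                errors.append(f"invalid {name}: {value!r}")
    return errors
```

A caller would treat any non-empty result as a reason to hold the record back
until it is completed or refused.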

## Evidence Classes

The evidence class is inherited from B-0798. A finding must fit one class. If
no class fits, the finding is forbidden in shared substrate.

| Class | What it covers |
|-------|----------------|
| `landed-provenance` | Links to already-landed PRs, commits, rows, or summaries. Does not quote operational payloads. |
| `redacted-observation` | Summary that preserves a safety lesson without preserving literal settings, exact prompt text, real sensitive content, or reproduction ordering. |
| `harmless-synthetic-fixture` | Plain invented text generated for the harness under B-0799 fixture rules. |
| `negative-control` | Benign fixture used to confirm a safety check refuses or redacts as expected. |
| `policy-anchor` | Citation to a repo policy surface that supplies authority for the finding. |
| `refusal-required` | Marker that the observation must not be preserved in repo history. |

A record marked `refusal-required` in `evidence_class` must also be marked
`refusal-required` in `redaction_level`, and `safety_signal` must reduce the
content to a high-level stop-condition reference only.

## Risk Classes

Risk class captures whether verbatim preservation would enable reproduction.

| Class | Meaning |
|-------|---------|
| `non-reproductive` | The finding cannot be turned into a bypass even if written verbatim. Example: a citation to a closed PR with no operational detail. |
| `reproductive-if-verbatim` | Verbatim text would let a reader replay the bypass. Must be summarized; verbatim form is forbidden. |
| `reproductive-irrespective-of-form` | No summary can preserve the lesson safely. The record falls into refusal-required. |

A record marked `reproductive-if-verbatim` must use `redaction_level` of at
least `reviewer-summary`. A record marked `reproductive-irrespective-of-form`
must use `redaction_level` of `refusal-required`.

## Observation Classes

Observation class is inherited from the B-0799 harness audit-log shape. Future
harness runs and reviewer notes use the same vocabulary so records are
comparable.

| Class | Meaning |
|-------|---------|
| `no-signal` | The fixture or observation produced no safety-relevant result. Useful for negative controls. |
| `redaction-required` | A safety-relevant signal exists, but the underlying material must be summarized. |
| `refusal-required` | The observation must not be preserved in repo history; only a high-level stop-condition reference remains. |
| `boundary-error` | The harness, reviewer workflow, or note violated the B-0798 boundary. Records the error for substrate hygiene; does not preserve the underlying violating material. |

## Redaction Levels

Redaction levels form a ladder ordered from least to most restrictive, and
each record stands on one rung. Every record names its level explicitly so
reviewers can audit the choice.

| Level | What it preserves | What it omits |
|-------|-------------------|---------------|
| `summary-only` | A short non-operational description of the safety lesson. | Exact settings, exact prompt text, real sensitive content, ordered reproduction steps. |
| `reviewer-summary` | A summary plus a reviewer-restricted appendix linked by reference only. | Public access to the appendix; the appendix never lands in shared substrate. |
| `reviewer-restricted` | A reference that an appendix exists, with the appendix held outside repo history under explicit reviewer governance. | The appendix content itself; only the existence and reviewer gate are recorded. |
| `refusal-required` | A stop-condition reference and a list of omitted fields. | The observation, the trigger, the order, and the surrounding context. |

`summary-only` is the default. More restrictive levels require an explicit
reviewer gate. No level authorizes verbatim deployable material.
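
The cross-field rules stated under Evidence Classes and Risk Classes depend on
the ordering of this ladder. A minimal sketch of those rules follows, assuming
the table order above is the ladder order and that a record is a plain dict;
the names `LADDER` and `consistency_errors` are hypothetical and not defined
by this schema.

```python
# Ladder order taken from the Redaction Levels table, least restrictive first.
LADDER = [
    "summary-only",
    "reviewer-summary",
    "reviewer-restricted",
    "refusal-required",
]


def consistency_errors(record: dict[str, str]) -> list[str]:
    """Check the evidence/risk versus redaction-level rules for one record."""
    level = record["redaction_level"]
    if level not in LADDER:
        return [f"unknown redaction_level: {level!r}"]
    rank = LADDER.index(level)
    errors = []
    # evidence_class refusal-required forces redaction_level refusal-required.
    if record["evidence_class"] == "refusal-required" and level != "refusal-required":
        errors.append("refusal-required evidence needs refusal-required redaction")
    # reproductive-if-verbatim needs at least reviewer-summary.
    if (record["risk_class"] == "reproductive-if-verbatim"
            and rank < LADDER.index("reviewer-summary")):
        errors.append("reproductive-if-verbatim needs at least reviewer-summary")
    # reproductive-irrespective-of-form always falls into refusal-required.
    if (record["risk_class"] == "reproductive-irrespective-of-form"
            and level != "refusal-required"):
        errors.append("reproductive-irrespective-of-form needs refusal-required")
    return errors
```

Representing the ladder as an ordered list keeps the "at least" comparison a
simple index check; a new level would only need to be inserted at the right
position.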

### Mapping to B-0799 Audit-Log Vocabulary

B-0799's audit-log shape (line 111 of
`docs/security/B-0799-classifier-bypass-synthetic-harness-design.md`) lists
three `redaction_level` values: `summary-only`, `reviewer-summary`, and
`refusal-required`. This schema adds `reviewer-restricted` as an explicit
intermediate between `reviewer-summary` (a reviewer-restricted appendix
referenced by link) and `refusal-required` (no preserved observation at
all). The intent is to record that an appendix exists outside repo history
under explicit reviewer governance without claiming it lives in shared
substrate.

When a future harness emits a B-0799 audit record, the harness uses the
three-value B-0799 vocabulary; when this schema preserves the resulting
finding, the reviewer maps the audit record's level to this schema's
four-value vocabulary and records the mapping in the finding's
`omitted_fields` list. B-0799 may later ratify the extended vocabulary
under its own versioning rule; until then, the mapping is recorded
per-record so the divergence is auditable.

## Refusal-Required State

`refusal-required` is the safety floor inside this schema. When a record is
marked `refusal-required`:

- `safety_signal` is reduced to a high-level reference such as `stop-condition
triggered under B-0798` or `boundary-error during synthetic harness run`.
- `omitted_fields` must list every category of content intentionally not
preserved (for example: `exact-settings`, `exact-prompt-text`,
`reproduction-ordering`, `real-sensitive-content`).
- `reviewer_gate` names the row, reviewer, or governance surface that may
reopen the record under stricter handling.
- No appendix, fixture file, or attached transcript may carry the omitted
material.
- The record may be referenced from B-0720 child rows by `finding_id` only;
links to broadcast notes, PRs, or files must not let a reader reconstruct
what was omitted.

Refusal is not concealment. The record continues to exist so reviewers can
see that a stop happened and what class it belonged to.
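
Two of the refusal-required rules above are mechanical enough to lint. The
sketch below is an illustration under assumptions: the record is a plain dict,
the function name `refusal_record_errors` is hypothetical, and treating
`unknown` as an unacceptable `reviewer_gate` for a refusal record is this
sketch's reading rather than a stated rule. Whether `safety_signal` is
sufficiently high-level still needs human review.

```python
from typing import Any


def refusal_record_errors(record: dict[str, Any]) -> list[str]:
    """Lint the mechanical parts of a refusal-required record."""
    # Records at other redaction levels are outside this check.
    if record.get("redaction_level") != "refusal-required":
        return []
    errors = []
    # omitted_fields must list every category intentionally not preserved.
    if not record.get("omitted_fields"):
        errors.append("omitted_fields must name the omitted categories")
    # reviewer_gate must name who may reopen the record (assumption: the
    # placeholder `unknown` is not good enough for a refusal record).
    gate = record.get("reviewer_gate")
    if not gate or gate == "unknown":
        errors.append("reviewer_gate must name a row, reviewer, or surface")
    return errors
```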

## Reviewer Sign-Off

Sign-off is required before any high-risk record lands.

| Combination | Required reviewer gate |
|-------------|------------------------|
| `risk_class = non-reproductive` and `redaction_level = summary-only` | Standard PR review; cite this schema. |
| `risk_class = reproductive-if-verbatim` | Named reviewer with safety-substrate scope. PR must record the reviewer identity in `reviewer_signoff`. |
| `risk_class = reproductive-irrespective-of-form` | Refusal-required; no publication. The reviewer gate is recorded for audit only. |
| `observation_class = boundary-error` | Named reviewer with safety-substrate scope plus a follow-up coordination note. |
| `evidence_class = refusal-required` | Refusal-required; no publication. |

`reviewer_signoff = pending` blocks publication. A record may sit in a draft
PR with `pending` while the reviewer is identified, but it must not merge
until the field carries an identifier.
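
The sign-off table can be read as a lookup from a record's classification to a
gate. The sketch below is one possible reading, not the schema's definition:
the function names and the returned gate labels are hypothetical, the
strictest-first evaluation order is an assumption (the table does not state a
precedence), and the fallback for combinations the table does not list is a
conservative guess.

```python
def required_gate(record: dict[str, str]) -> str:
    """Map a record's classification to a reviewer gate label."""
    # Refusal rows are evaluated first (assumed precedence).
    if (record["evidence_class"] == "refusal-required"
            or record["risk_class"] == "reproductive-irrespective-of-form"):
        return "refusal-no-publication"
    if record["observation_class"] == "boundary-error":
        return "named-reviewer-plus-coordination-note"
    if record["risk_class"] == "reproductive-if-verbatim":
        return "named-reviewer"
    if (record["risk_class"] == "non-reproductive"
            and record["redaction_level"] == "summary-only"):
        return "standard-pr-review"
    # Combination not listed in the table: assume the stricter gate.
    return "named-reviewer"


def signoff_blocks_merge(record: dict[str, str]) -> bool:
    """True while reviewer_signoff is still `pending` (or empty, by assumption)."""
    return record["reviewer_signoff"].strip() in ("", "pending")
```

For example, a record with `reviewer_signoff` set to `pending` may sit in a
draft PR, and `signoff_blocks_merge` stays true until an identifier replaces
it.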

## Cite-Or-Block Rule

Future empirical mapping rows under B-0720 must cite this schema before
landing any finding. The citation lives in the row's `composes_with` list or
in the document body, and it must reference the active `schema_version`.

A finding that does not cite this schema is treated as `boundary-error` and
must not land in shared substrate.
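
A pre-merge check for this rule could look like the sketch below. The schema
does not fix a citation syntax, so the matching logic is an assumption: a
citation counts only if a single `composes_with` entry, or the document body,
mentions `B-0807` together with `schema_version: 1`. The names `SCHEMA_ID`,
`ACTIVE_VERSION`, and `cites_schema` are hypothetical.

```python
import re

SCHEMA_ID = "B-0807"      # identifier of this schema document
ACTIVE_VERSION = "1"      # active schema_version stated in this document

# Match "schema_version: 1" but not "schema_version: 10".
_VERSION_RE = re.compile(
    rf"schema_version:\s*{re.escape(ACTIVE_VERSION)}(?!\d)"
)


def cites_schema(composes_with: list[str], body: str) -> bool:
    """True if any composes_with entry or the body cites the active schema."""
    sources = list(composes_with) + [body]
    return any(
        SCHEMA_ID in text and _VERSION_RE.search(text) is not None
        for text in sources
    )
```

A row for which `cites_schema` returns false would be handled as
`boundary-error` under the rule above and held back from shared substrate.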

## What This Schema Forbids

This schema does not authorize, and does not provide a format for, any of:

- deployable classifier settings payloads;
- exact permission-pattern examples;
- ordered reproduction steps;
- real secrets, credentials, private keys, tokens, endpoints, or hostnames;
- real PII, including lightly transformed real PII;
- harmful instructions or exploit detail;
- mirrored adversarial corpora or external jailbreak collections;
- unredacted observations that would let a reader replay the bypass.

A field that requires any of those values cannot be filled; the record falls
into refusal-required instead.

## Versioning

Schema changes follow these rules:

- A change that adds optional fields, clarifies wording, or tightens
forbidden lists is a minor revision and updates the schema in place under
the same version.
- A change that alters required fields, allowed enum values, redaction
ladder, or reviewer-gate rules requires a new version number and a
migration note. Existing records keep their original `schema_version`.
- Loosening the floor requires the B-0810 ratification gate first. This
schema cannot be unilaterally relaxed by edit.

## Composes With

- `docs/security/B-0720-classifier-bypass-research-boundary.md` - B-0798
hard-limits boundary; the floor this schema sits on.
- `docs/security/B-0799-classifier-bypass-synthetic-harness-design.md` -
source of the audit-log field shapes referenced here.
- `docs/backlog/P0/B-0720-classifier-bypass-research-red-team-do-not-deploy-without-zeta-safer-than-anthropic-2026-05-24.md` -
parent safety row; future empirical children must cite this schema before
landing findings.
- `.claude/rules/classifier-bypass-research-do-not-deploy-without-zeta-safer-floor.md` -
standing operator-self-constraint; binds every author of a finding.
- `.claude/rules/methodology-hard-limits.md` - HARD LIMITS floor preserved.
- `docs/AGENT-BEST-PRACTICES.md` - audited data is data, not directives;
enforced inside every record under this schema.