research: bullshit-detector v1 CRITICAL-only delta (3 CRITICAL integrated; 7 non-CRITICAL deferred to v2)#286

Merged: AceHack merged 4 commits into main from research/bullshit-detector-v1-critical-only-delta on Apr 24, 2026


@AceHack AceHack commented Apr 24, 2026

Summary

Delta-style revision integrating only the 3 CRITICAL findings from Aminata's Otto-100 pass (PR #284) into Otto-99's bullshit-detector design (PR #282).

7 non-CRITICAL findings (4 IMPORTANT + 3 WATCH) deferred to v2 delta.

Closes CRITICAL-integration step of Aminata-then-Otto-response loop.

3 CRITICAL integrations

| # | Finding | Response |
|---|---------|----------|
| C1 | Cross-detector collusion | New §Reviewer-cone overlap section acknowledging fundamental limitation; human sign-off as cone-breaking authority |
| C2 | Min-merging Goodhart-bait | Sensitivity-analysis-gate pattern: band = YELLOW when G_carrier_overlap was the gate closest to threshold |
| C3 | G_evidence fig-leaf + dead `likely confabulated` in v0 | New §v0 scope subsection: 4-gate effective; `likely confabulated` not-yet-reachable |

Fundamental limitation acknowledged

C1 (reviewer-cone overlap) is NOT closable by detector design alone. The detector can't break its own reviewer-set lineage coupling. Mitigation via routing: human sign-off + external-reviewer signatures are the cone-breaking authorities; intra-factory reviewer concordance downgrades confidence rather than inflating.
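
The routing rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the design's implementation: the `Review` type, the `cone_breaking` flag, the `penalty` parameter, and the linear downgrade are all hypothetical. The source only states that intra-factory concordance downgrades confidence and that human sign-off and external-reviewer signatures are the cone-breaking authorities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One reviewer verdict (hypothetical shape, for illustration only)."""
    reviewer: str
    agrees: bool
    # True only for human sign-off or an external-reviewer signature,
    # i.e. a reviewer outside the detector's own lineage cone.
    cone_breaking: bool


def adjusted_confidence(base: float, reviews: list[Review],
                        penalty: float = 0.1) -> float:
    """Lower confidence for each concurring intra-factory reviewer.

    The linear penalty is an assumption; the point is only that agreement
    from inside the shared cone never raises confidence.
    """
    intra_agree = sum(1 for r in reviews if r.agrees and not r.cone_breaking)
    return max(0.0, base - penalty * intra_agree)


def is_cone_broken(reviews: list[Review]) -> bool:
    """True when at least one cone-breaking authority has signed off."""
    return any(r.agrees and r.cone_breaking for r in reviews)
```

Under this sketch, two concurring factory-internal reviewers reduce confidence and leave `is_cone_broken` false; adding a maintainer sign-off flips `is_cone_broken` to true without changing the confidence figure.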

Deferred to v2

4 IMPORTANT: G_coverage_plausibility gate · Otto-wake second-reviewer schema tightening · DetectorOutputBatchRetracted · no-signal vs kNN-evasion distinction.

3 WATCH: distribution histogram · adversarial worked example · TLA+ invariants.

DISMISS unchanged: parameter-ADR gate (reused from oracle-scoring v0).

Scope limits

  • Does NOT rewrite Otto-99's design — specifies delta only
  • Does NOT address IMPORTANT/WATCH findings
  • Does NOT implement
  • Does NOT propose human-sign-off UI
  • Does NOT commit content-similarity-in-cone (alternative to C2)
  • Does NOT change 5-gate target structure (v0 effective is 4-gate; v1-post-substrate is 5-gate)

Authority

Within standing authority per Otto-82/90/93 calibration — research-grade design revision.

Dependencies

  • Aminata adversarial pass on v1 delta (5th session-pass, if it lands)
  • Integrate v1 changes into Otto-99 design PR (separate PR)
  • v2 delta for non-CRITICALs
  • Independent-oracle substrate for full G_evidence
  • Human-sign-off UI/protocol

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 24, 2026 04:08
@AceHack AceHack enabled auto-merge (squash) April 24, 2026 04:09

Copilot AI left a comment

Pull request overview

Adds a research-grade delta document that integrates the 3 CRITICAL findings from Aminata’s Otto-100 pass into the provenance-aware detector design narrative, while explicitly deferring the 7 non-CRITICAL findings to a later v2 delta.

Changes:

  • Introduces a new research doc describing the CRITICAL-only delta and its intended insert/replace points in the Otto-99 design.
  • Specifies responses for C1 (reviewer-cone overlap limitation), C2 (carrier-overlap sensitivity downgrade), and C3 (v0 scope + reachable output types).
  • Captures deferred IMPORTANT/WATCH items and adoption dependencies as explicit scope boundaries.

AceHack added a commit that referenced this pull request Apr 24, 2026
…me backlogged) + header-fields + name attrib + table count + DRIFT-TAXONOMY xref

- Thread 1 (PRRT_kwDOSF9kNM59SpxP): compacted four archive-header
  fields (Scope / Attribution / Operational status / Non-fusion
  disclaimer) into first six lines so "Archive-header self-applied"
  claim is now accurate; removed the trailing claim sentence in
  favour of the structural compliance itself.
- Thread 2 (PRRT_kwDOSF9kNM59Spxg): shifted body vocabulary from
  "bullshit-detector" to "veridicality-detector" throughout; added a
  vocabulary-note paragraph explaining the filename-slug retention;
  appended BACKLOG P2 research-grade row for the cross-repo filename
  rename sweep (three doc files + link-update across PRs / round-
  history / memory-index). Otto-229 append-only discipline observed.
- Thread 3 (PRRT_kwDOSF9kNM59Spxl): converted persona-specific names
  ("Aminata", "Otto-99/100/101", "Aaron", "Max", "Codex", "Soraya")
  to role references (Aminata-persona / main-agent persona /
  maintainer / external-peer-agent / formal-methods-persona).
- Thread 4 (PRRT_kwDOSF9kNM59Spx1): corrected §"v0 scope" header
  from "four gates + four reachable output types" to "five gates
  (four active + one advisory) + five reachable output types (one
  dead-code)", matching the tables that list 5 gates and 6 output
  types (5 reachable + 1 not-yet-reachable).
- Thread 5 (PRRT_kwDOSF9kNM59Spx-): "DRIFT-TAXONOMY pattern 5"
  softened to point at actual precursor file
  docs/research/drift-taxonomy-bootstrap-precursor-2026-04-22.md,
  noting there is no canonical docs/DRIFT-TAXONOMY.md at time of
  writing.

No new PR; filename rename is backlogged. No merge.
@AceHack AceHack force-pushed the research/bullshit-detector-v1-critical-only-delta branch from ead499e to 39e7681 Compare April 24, 2026 13:35
AceHack added a commit that referenced this pull request Apr 24, 2026
…ta integrating Aminata Otto-100 findings

Post-milestone tick. First integration pass on Aminata's
Otto-100 10 findings. PR #286 delta-style revision addresses
only the 3 CRITICAL findings; 7 non-CRITICAL deferred to v2.

Key observations:

1. **Delta-style revision** is a new useful pattern — Otto-99
   design stays intact; v1 specifies diff-style additions; v2
   composes on v1. Reviewer-clean; revertable; extends to
   v3/v4/vN as more findings arrive.
2. **C1 acknowledged as fundamental limitation not closable
   gap** — honest move. Detector can't break own reviewer-set
   lineage coupling; authority-routing to human + external is
   the mitigation. Architectural cleanness of "acknowledge +
   route" beats "invent closure mechanism that doesn't
   exist."
3. **Sensitivity-analysis-gate** is cheap mechanical defense
   against Goodhart-bait on adversary-manipulable gates.
   Candidate BP-NN promotion if exercised on second classifier.
4. **v0-scope explicit subsection** closes dead-code-in-
   safety-critical-classifier hazard. Applicable to any
   design-on-paper vs operational-in-practice gap in the
   factory.

Stacked on #285 (Otto-100 history).
@AceHack AceHack force-pushed the research/bullshit-detector-v1-critical-only-delta branch from 39e7681 to 40a5de9 Compare April 24, 2026 14:42
Copilot AI review requested due to automatic review settings April 24, 2026 14:42

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Comment thread docs/BACKLOG.md Outdated
Comment thread docs/BACKLOG.md Outdated
AceHack added 3 commits April 24, 2026 11:33
…ngs integrated (Otto-101)

Delta-style revision integrating only the 3 CRITICAL findings
from Aminata's Otto-100 pass (PR #284) into Otto-99's
bullshit-detector design (PR #282). 7 non-CRITICAL findings
deferred to v2 delta; DISMISS unchanged.

Three CRITICAL integrations:

**C1 — Cross-detector collusion (new §Reviewer-cone overlap
section).** Acknowledges as FUNDAMENTAL LIMITATION NOT
CLOSABLE GAP: detector cannot break its own reviewer-set
lineage coupling. Mitigation via routing authority:
- Human sign-off = only cone-breaking authority for
  authorization-impacting band=supported queries.
- External-reviewer signatures as legitimate cone-breakers
  when available.
- Intra-factory reviewer concordance should DOWNGRADE
  confidence per SD-9, not inflate.
- Every receipt carries reviewer_cone_signature field.

**C2 — Min-merging Goodhart-bait (sensitivity-analysis-gate
pattern).** v1 enhancement:
  band(y|q) = if band_raw=GREEN AND G_carrier_overlap closest
              to threshold → YELLOW
              else band_raw
Closes cheap cheat path (threshold-hugging on author-
controlled gate). Alternative content-similarity-in-cone
deferred to v2 cost-benefit.

**C3 — G_evidence fig-leaf (new §v0 scope subsection).** v0
effective = 4-gate classifier (G_evidence advisory-only until
independent-oracle substrate ships). `likely confabulated`
output type explicitly marked not-yet-reachable in v0.
Conservative under-detection stance (RED comes only from
known-bad-pattern; confabulations land YELLOW as plausible-
but-unresolved). v0→v1 transition plan names
DetectorOutputBatchRetracted for historical re-classification.
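
A minimal Python sketch of the v0 effective classifier described in C3, assuming the raw band is the minimum across gates (as the "min-merging" label in C2 suggests). Only `G_carrier_overlap` and `G_evidence` are named in this PR; the other gate names below are hypothetical placeholders, and `band_raw_v0` is an illustrative name rather than anything in the design doc.

```python
from enum import IntEnum


class Band(IntEnum):
    RED = 0
    YELLOW = 1
    GREEN = 2


# In v0, G_evidence is advisory-only until the independent-oracle
# substrate ships, so it is excluded from the merged band.
ADVISORY_IN_V0 = frozenset({"G_evidence"})


def band_raw_v0(gate_bands: dict[str, Band]) -> Band:
    """Min-merge the active gates, ignoring advisory ones (assumed rule)."""
    active = [band for name, band in gate_bands.items()
              if name not in ADVISORY_IN_V0]
    if not active:
        raise ValueError("no active gates to merge")
    return min(active)


gates = {
    "G_carrier_overlap": Band.GREEN,
    "G_evidence": Band.RED,           # advisory: does not affect the result
    "G_hypothetical_a": Band.GREEN,   # placeholder gate name
    "G_hypothetical_b": Band.YELLOW,  # placeholder gate name
}
merged = band_raw_v0(gates)
```

Here `merged` is `Band.YELLOW`: the advisory `G_evidence` verdict is dropped, which is exactly the 4-gate behaviour the v0 scope subsection makes explicit.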

v1 delta does NOT rewrite Otto-99's design; specifies 3
additive sections to insert + 1 section to supersede. Otto-99
original preserved in git history.

Seven non-CRITICALs still open:
- 4 IMPORTANT (deferred to v2): G_coverage_plausibility gate;
  Otto-wake second-reviewer schema; DetectorOutputBatchRetracted;
  no-signal vs kNN-evasion.
- 3 WATCH (deferred to v2+): distribution histogram;
  adversarial worked example; TLA+ invariants.

1 fundamental limitation (C1) acknowledged not closed —
routes authority to human + external reviewer layers.

Scope limits: no design rewrite; no implementation; no
human-sign-off UI proposal; no content-similarity-in-cone
commit; 5-gate/5-type target structure unchanged.

5 dependencies-to-adoption: Aminata pass on v1 delta (fifth
session-pass); integrate v1 changes into Otto-99 design PR
(separate PR); v2 delta; independent-oracle substrate;
human-sign-off UI/protocol.

Archive-header format self-applied — 18th aurora/research
doc in a row.

Lands within-standing-authority per Otto-82/90/93
calibration.

Otto-101 tick primary deliverable — closes the CRITICAL-
integration step of the Aminata-then-Otto-response loop for
bullshit-detector design.
…me backlogged) + header-fields + name attrib + table count + DRIFT-TAXONOMY xref

… sweep-refs skill path (BACKLOG) + Output-types wording (§5) + fail-to-RED typo + PR-#282 annotation on dead-path

Resolves the 5 new copilot findings on commit 40a5de9 plus
rebases onto origin/main (append-only per Otto-229; both
rows retained in BACKLOG.md conflict resolution).

- BACKLOG row rewritten so `docs/research/provenance-aware-
  bullshit-detector-*` inline-code fits on a single line
  (markdown inline-code cannot span newlines).
- BACKLOG row points at the `sweep-refs` skill at
  `.claude/skills/sweep-refs/SKILL.md` (verified to exist)
  instead of the non-existent `tools/sweep-refs/*` path.
- Delta-doc "§5 output types" wording changed to "the Output
  types section" to avoid the §5-as-count misread.
- Delta-doc `fail-to-RED` joined on a single line (was
  split across a line-break, read as a malformed token).
- Delta-doc references to the base-design file
  (`provenance-aware-bullshit-detector-2026-04-23.md`) now
  annotated "(in PR #282 — not yet on main at time of this
  delta's writing)" so readers know the path is a forward
  reference, not a dead one.

Otto-229 append-only; Otto-230 reply-and-resolve-after-push;
Otto-236 every-thread-resolved.
@AceHack AceHack force-pushed the research/bullshit-detector-v1-critical-only-delta branch from 40a5de9 to fc50828 Compare April 24, 2026 15:35
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 19:33
@AceHack AceHack merged commit 19bc7a1 into main Apr 24, 2026
12 checks passed
@AceHack AceHack deleted the research/bullshit-detector-v1-critical-only-delta branch April 24, 2026 19:35

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +48 to +52
> **re-introduces one meta-layer up**: the detector itself
>
> + the adversarial-reviewer persona + the external-peer-
> agent reviewer + any other factory-internal reviewer
> share training-corpus / repo-access / PR-comment lineage.
Copilot AI Apr 24, 2026

P1: The blockquote line starting with `+` (`> + the adversarial-reviewer persona ...`) will be parsed as an unordered-list item. Repo guidance explicitly avoids line-start `+` in prose/lists because it’s easy to accidentally create nested lists and confuse readers/linters; please rewrite this as normal prose (e.g., “the detector itself, the adversarial-reviewer persona, … share …”) or use `-` bullets consistently.

Suggested change

Before:
> **re-introduces one meta-layer up**: the detector itself
>
> + the adversarial-reviewer persona + the external-peer-
> agent reviewer + any other factory-internal reviewer
> share training-corpus / repo-access / PR-comment lineage.

After:
> **re-introduces one meta-layer up**: the detector itself,
> the adversarial-reviewer persona, the external-peer-agent
> reviewer, and any other factory-internal reviewer share
> training-corpus / repo-access / PR-comment lineage.
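
The line-start `+` hazard flagged here can be checked mechanically. The following is a rough heuristic sketch, not the repo's linter: `find_plus_list_lines` is a hypothetical helper, and it does not skip fenced code blocks or honour CommonMark's indentation limits.

```python
import re

# A "+" followed by whitespace at the start of a line, optionally after
# one or more blockquote markers, is read by Markdown as a list item.
_PLUS_ITEM = re.compile(r"^\s*(?:>\s*)*\+\s")


def find_plus_list_lines(markdown: str) -> list[int]:
    """Return 1-based line numbers that would parse as '+' list items."""
    return [
        number
        for number, line in enumerate(markdown.splitlines(), start=1)
        if _PLUS_ITEM.match(line)
    ]


sample = (
    "> **re-introduces one meta-layer up**: the detector itself\n"
    ">\n"
    "> + the adversarial-reviewer persona + the external-peer-\n"
    "> agent reviewer + any other factory-internal reviewer\n"
)
flagged = find_plus_list_lines(sample)
```

Only the third line of `sample` is flagged; the `+` characters in the middle of a line are left alone.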

Comment on lines +63 to +70
> - **Maintainer sign-off is the only cone-breaking
> authority.** When a query's band is `supported` and
> the stakes are authorization-impacting, the detector's
> output is advisory; maintainer sign-off is required as
> the independent-cone signal.
> - **External-reviewer signatures are legitimate cone-
> breaking** when they exist (external peer review;
> formal-methods proofs landed at Lean/Mathlib scale;
Copilot AI Apr 24, 2026

P1: This section says “Maintainer sign-off is the only cone-breaking authority”, but the next bullet says external-reviewer signatures are also cone-breaking. Please clarify the scope (e.g., “only internal cone-breaking authority” / “default required authority for authorization-impacting decisions”) so the policy doesn’t contradict itself.

Comment on lines +115 to +120
"G_carrier_overlap was the gate closest to threshold"
↔ overlap_ratio(cone(q), cone(y)) ∈ [θ_med - ε, θ_med]
OR ∈ [θ_high - ε, θ_high]

band(y | q) = if band_raw = GREEN AND sensitivity → YELLOW
else band_raw
Copilot AI Apr 24, 2026

P1: The sensitivity pseudo-code mixes a predicate definition with a rewrite arrow (`sensitivity → YELLOW`) and then uses `sensitivity` without applying it (it’s defined as `sensitivity(y|q) = ...`). This is ambiguous for implementers; please rewrite as a boolean predicate (no quoted string) and use a standard conditional like `if band_raw = GREEN && sensitivity(y|q) then YELLOW else band_raw`.

Suggested change

Before:
"G_carrier_overlap was the gate closest to threshold"
↔ overlap_ratio(cone(q), cone(y)) ∈ [θ_med - ε, θ_med]
OR ∈ [θ_high - ε, θ_high]
band(y | q) = if band_raw = GREEN AND sensitivity → YELLOW
else band_raw

After:
overlap_ratio(cone(q), cone(y)) ∈ [θ_med - ε, θ_med]
OR overlap_ratio(cone(q), cone(y)) ∈ [θ_high - ε, θ_high]
band(y | q) =
if band_raw = GREEN && sensitivity(y | q)
then YELLOW
else band_raw
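
For implementers, the predicate-plus-conditional form requested in this thread could look like the Python below. It is a sketch only: the `Thresholds` container and the numeric values are hypothetical, and the caller is assumed to have already computed `overlap_ratio(cone(q), cone(y))`.

```python
from dataclasses import dataclass
from enum import IntEnum


class Band(IntEnum):
    RED = 0
    YELLOW = 1
    GREEN = 2


@dataclass(frozen=True)
class Thresholds:
    theta_med: float   # θ_med in the delta doc
    theta_high: float  # θ_high in the delta doc
    epsilon: float     # ε, width of the threshold-hugging window


def sensitivity(overlap_ratio: float, t: Thresholds) -> bool:
    """True when the overlap ratio sits within ε just under a threshold."""
    near_med = t.theta_med - t.epsilon <= overlap_ratio <= t.theta_med
    near_high = t.theta_high - t.epsilon <= overlap_ratio <= t.theta_high
    return near_med or near_high


def band(band_raw: Band, overlap_ratio: float, t: Thresholds) -> Band:
    """Downgrade GREEN to YELLOW when G_carrier_overlap hugs a threshold."""
    if band_raw is Band.GREEN and sensitivity(overlap_ratio, t):
        return Band.YELLOW
    return band_raw


# Illustrative values only; the PR does not state the real thresholds.
t = Thresholds(theta_med=0.3, theta_high=0.6, epsilon=0.05)
hugging = band(Band.GREEN, 0.28, t)
clear = band(Band.GREEN, 0.10, t)
```

With these placeholder thresholds, `hugging` is downgraded to `Band.YELLOW` while `clear` stays `Band.GREEN`; non-GREEN raw bands pass through unchanged.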
