9 changes: 9 additions & 0 deletions README.md
@@ -8,11 +8,20 @@
[![Python](https://img.shields.io/pypi/pyversions/bicameral-mcp)](https://pypi.org/project/bicameral-mcp/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![CI](https://img.shields.io/github/actions/workflow/status/BicameralAI/bicameral-mcp/test-mcp-regression.yml?branch=main&label=tests)](https://github.com/BicameralAI/bicameral-mcp/actions)
[![Lint + Types](https://img.shields.io/github/actions/workflow/status/BicameralAI/bicameral-mcp/lint-and-typecheck.yml?branch=main&label=lint%2Btypes)](https://github.com/BicameralAI/bicameral-mcp/actions/workflows/lint-and-typecheck.yml)
[![Secret scan](https://img.shields.io/github/actions/workflow/status/BicameralAI/bicameral-mcp/secret-scan.yml?branch=main&label=secret-scan)](https://github.com/BicameralAI/bicameral-mcp/actions/workflows/secret-scan.yml)

AI agents ship code fast. They forget what your team agreed — and requirement gaps surfaced mid-implementation are buried under thousands of lines of code.

Bicameral MCP is a **spec compliance layer** for AI-assisted engineering. It is local-first and runs as an [MCP server](https://spec.modelcontextprotocol.io/). It ingests your meeting transcripts, PRDs, and Slack threads, captures any mid-implementation decision that was never discussed so your product owner can ratify it asynchronously, and pins each decision to the code that implements it. Your agent finds out the moment it drifts from either the written spec or the spoken one.

| | |
|---|---|
| **Maturity** | Published on PyPI; local-first MCP server; Solo + Team modes (`setup` wizard picks at install) |
| **Footprint** | Embedded SurrealDB in-process — no separate server, no daemon; install via `uv` or `pip` |
| **Trust boundary** | The OS user account. Code, decisions, and transcripts stay on your machine unless you opt into Team mode (which shares an append-only event file via a substrate *you* own) |
| **Assurance** | Phase-gated regression suite on real adapters (`memory://`); sociable handler/ledger tests; lint+types and secret-scan CI gates. Broader security/governance gates tracked in [#557](https://github.com/BicameralAI/bicameral-mcp/issues/557) |

---

## Quickstart
88 changes: 55 additions & 33 deletions tests/README.md
@@ -1,54 +1,76 @@
# MCP Regression Tests

Tests are gated by phase. Each phase gate is an env var. Run only what's implemented.
The suite is **phase-gated**: each phase layers on the previous one and is toggled
by an environment variable, so you can run only what is wired up locally. All
phases run against **real adapters** — the legacy mock layer is retired (see
`mocks/README.md` for history). In tests the embedded SurrealDB runs in-process
via `SURREAL_URL=memory://` (no server, no persistence).

## Running tests
## Quickstart

```bash
source .venv/bin/activate # or: python -m pytest directly via .venv/bin/pytest
source .venv/bin/activate # or call .venv/bin/pytest directly

# Packaging / startup smoke
# Packaging / startup smoke — registers and lists every MCP tool
bicameral-mcp --smoke-test

# Phase 0 — always green (mocks only, no dependencies)
pytest tests/test_phase0_mocks.py -v

# Phase 1 — requires real code locator (Silong's work)
USE_REAL_CODE_LOCATOR=1 REPO_PATH=/path/to/repo pytest tests/test_phase1_code_locator.py -v

# Phase 2 — embedded SurrealDB path for tests
USE_REAL_LEDGER=1 SURREAL_URL=memory:// pytest tests/test_phase2_ledger.py -v

# Phase 3 — full integration (requires both)
USE_REAL_CODE_LOCATOR=1 USE_REAL_LEDGER=1 SURREAL_URL=memory:// REPO_PATH=/path/to/repo pytest tests/test_phase3_integration.py -v

# All phases at once (use for CI once all phases are complete)
pytest tests/ -v
# Full suite, the way CI runs it
SURREAL_URL=memory:// pytest tests/ -v
```

## Phase status
## Phase gates

| File | Passes without dependencies | Unblocked by |
|------|-----------------------------|--------------|
| `test_phase0_mocks.py` | YES | — |
| `test_phase1_code_locator.py` | NO | real code locator index + provider credentials |
| `test_phase2_ledger.py` | NO | `USE_REAL_LEDGER=1` + `memory://` or SurrealDB URL |
| `test_phase3_integration.py` | NO | Both Phase 1 + Phase 2 complete |
| Phase | File | Gate (env) | Validates |
|---|---|---|---|
| 1 | `test_phase1_code_locator.py` | `USE_REAL_CODE_LOCATOR=1` + `REPO_PATH=…` | Code-locator correctness: located paths exist on disk, symbols are real repo names, confidence in range |
| 2 | `test_phase2_ledger.py` | `USE_REAL_LEDGER=1` + `SURREAL_URL=memory://` | Ledger correctness: idempotent ingest, BM25 search relevance, file→decision reverse traversal, `link_commit` status updates |
| 3 | `test_phase3_integration.py` | Both of the above | End-to-end: ingest transcript → code locator → graph store → query-back coheres |

## What each phase validates
```bash
# Phase 1
USE_REAL_CODE_LOCATOR=1 REPO_PATH=/path/to/repo pytest tests/test_phase1_code_locator.py -v

**Phase 0**: Contract shapes. Do all 4 tools return valid Pydantic types? Are all required fields present?
# Phase 2
USE_REAL_LEDGER=1 SURREAL_URL=memory:// pytest tests/test_phase2_ledger.py -v

**Phase 1**: Code locator correctness. Do located file paths exist on disk? Are symbols real names from the repo? Is confidence in the expected range?
# Phase 3 (full integration — needs both gates)
USE_REAL_CODE_LOCATOR=1 USE_REAL_LEDGER=1 SURREAL_URL=memory:// REPO_PATH=/path/to/repo \
pytest tests/test_phase3_integration.py -v
```

**Phase 2**: Ledger correctness. Is ingestion idempotent? Does BM25 search return relevant results? Does reverse traversal (file → decisions) work? Does `link_commit` update statuses correctly?
## Environment variables

**Phase 3**: End-to-end pipeline. Does ingesting a sample transcript + running code locator + storing in graph + querying back produce a coherent result?
| Var | Default | Effect |
|---|---|---|
| `SURREAL_URL` | `memory://` | Ledger URL for tests (in-process, no persistence). Override when exercising a persistent SurrealKV path. |
| `USE_REAL_CODE_LOCATOR` | unset | Gate phase-1/3 code-locator tests on a real tree-sitter index. |
| `USE_REAL_LEDGER` | unset | Gate phase-2/3 tests on a real embedded SurrealDB adapter. |
| `REPO_PATH` | `.` | Repo the code locator indexes. |
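
As a rough illustration of how the variables in this table combine, the sketch below resolves the gates and defaults from an environment mapping. The function names `phase_gates` and `resolved_settings` are hypothetical, and the sketch assumes a gate counts as set only when its value is exactly `1`; the real suite may accept other values.

```python
import os


def phase_gates(env=None):
    """Map each phase number to whether its gate is open.

    Assumption: a gate is "set" only when the variable equals "1".
    """
    env = os.environ if env is None else env
    locator = env.get("USE_REAL_CODE_LOCATOR") == "1"
    ledger = env.get("USE_REAL_LEDGER") == "1"
    return {
        1: locator,             # phase 1 needs the real code locator
        2: ledger,              # phase 2 needs the real ledger
        3: locator and ledger,  # phase 3 needs both gates
    }


def resolved_settings(env=None):
    """Apply the defaults listed in the table above."""
    env = os.environ if env is None else env
    return {
        "SURREAL_URL": env.get("SURREAL_URL", "memory://"),
        "REPO_PATH": env.get("REPO_PATH", "."),
    }
```

In a pytest module, a result like `phase_gates()[3]` could feed `pytest.mark.skipif` so that ungated phases are skipped rather than failed.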

## Packaging smoke

The installable package surface is now the first startup check:
The installable surface is the first startup check:

1. `pip install -r requirements.txt`
1. `pip install -e ".[test]"`
2. `bicameral-mcp --smoke-test`
3. Verify the command prints the 5 registered tool names
3. It prints the server name/version and **every registered MCP tool name** — 20
today (18 `bicameral.*` ledger/session tools + the 2 code-locator primitives
`validate_symbols` and `get_neighbors`). The asserted source of truth is
`EXPECTED_TOOL_NAMES` in `server.py`; the smoke test fails if the live registry
drifts from it. The user-facing subset is documented in the root `README.md`
§ MCP Tools Reference.
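
The drift check described in step 3 amounts to a set comparison between the expected names and the live registry. A minimal sketch, assuming both sides are plain iterables of tool names; `registry_drift` and the `bicameral.example_tool` name are hypothetical, and the actual assertion in `server.py` may be structured differently:

```python
def registry_drift(expected, live):
    """Report tool names that differ between the expected set and the live registry."""
    expected, live = set(expected), set(live)
    return {
        "missing": sorted(expected - live),     # expected but not registered
        "unexpected": sorted(live - expected),  # registered but not expected
    }


def registry_matches(expected, live):
    """True when the live registry has exactly the expected tool names."""
    drift = registry_drift(expected, live)
    return not drift["missing"] and not drift["unexpected"]


# Illustrative names only: the two code-locator primitives come from this README,
# "bicameral.example_tool" is a placeholder, not a real tool.
expected_names = ["validate_symbols", "get_neighbors", "bicameral.example_tool"]
live_names = ["validate_symbols", "get_neighbors"]
drift = registry_drift(expected_names, live_names)
```

Reporting both directions separately makes a failing smoke test actionable: a missing name points at a registration bug, an unexpected one at a stale expected list.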

## Sociable testing

Handler and ledger tests default to **sociable** units (real `memory://` adapter,
`SimpleNamespace` ctx) — not mocks. The full contract and the reference patterns
are in the repo-root `CLAUDE.md` § "Sociable Testing for UX Paths".
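
To show the shape of a sociable test, here is a minimal sketch in which a handler talks to a working in-memory ledger through a `SimpleNamespace` context and is checked for idempotent ingest. `InMemoryLedger` and `handle_ingest` are hypothetical stand-ins written for this example; the real tests use the embedded SurrealDB `memory://` adapter and the project's own handlers.

```python
import hashlib
from types import SimpleNamespace


class InMemoryLedger:
    """Hypothetical stand-in for the real memory:// adapter."""

    def __init__(self):
        self.decisions = {}

    def ingest(self, text):
        # Key by content hash so re-ingesting the same text is a no-op.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        created = key not in self.decisions
        self.decisions.setdefault(key, text)
        return key, created


def handle_ingest(ctx, text):
    """Hypothetical handler that reads its collaborators off ctx."""
    key, created = ctx.ledger.ingest(text)
    return {"id": key, "created": created}


# The context is a plain namespace holding a working collaborator, not a mock.
ctx = SimpleNamespace(ledger=InMemoryLedger())
first = handle_ingest(ctx, "Use BM25 for decision search")
second = handle_ingest(ctx, "Use BM25 for decision search")
```

Because the collaborator actually stores data, the test can assert on observable state (one stored decision after two ingests) instead of on which methods were called.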

## What CI runs

`.github/workflows/test-mcp-regression.yml` runs the phase suites plus the ledger,
schema-recovery, replay-determinism, extractor-parity, shadow-dispatch, and
dashboard tests in a single `pytest` invocation against `SURREAL_URL=memory://`,
then uploads JUnit XML + a self-contained HTML report as artifacts. The end-to-end
user-flow suite is separate and currently shelved to manual dispatch — see
`tests/e2e/README.md`.
33 changes: 21 additions & 12 deletions tests/e2e/README.md
@@ -1,14 +1,20 @@
# v0 user flow e2e

End-to-end validation of `BicameralAI/bicameral#108`'s six canonical user
flows, driven by **real Claude Code CLI sessions** with `bicameral-mcp`
registered as an MCP server. Test fixture: a pinned commit of
`github.com/desktop/desktop`, with `docs/process/roadmap.md` as ingest
content.

This is the canonical CI test for the spec. The handler-replay simulation
at `scripts/sim_issue_108_flows.py` complements it for fast local iteration
on handler logic without burning Claude API calls.
End-to-end validation of the canonical user flows in
`BicameralAI/bicameral#108`, driven by **real Claude Code CLI sessions** with
`bicameral-mcp` registered as an MCP server. Five flows (1–5) are automated.
Test fixture: a pinned commit of `github.com/desktop/desktop`, with
`docs/process/roadmap.md` as ingest content.

> **Status: shelved to manual dispatch (#556).** This suite is no longer a PR
> gate. The harness accumulated maintenance debt — API-key credit exhaustion,
> agent-budget non-determinism (#272), and twice-reworked auth (#528, #540) —
> that blocked PRs without actionable signal. The test code and prompts are
> preserved; run it manually via **Actions → v0 user flow e2e → Run workflow**.
> A replacement validation strategy is tracked in RFQ #555.

The handler-replay simulation at `scripts/sim_issue_108_flows.py` is the fast
local path for iterating on handler logic without burning Claude API calls.

## What it tests

@@ -66,10 +72,13 @@ per flow.

## CI

GitHub Actions workflow: `.github/workflows/v0-user-flow-e2e.yml`.
GitHub Actions workflow: `.github/workflows/v0-user-flow-e2e.yml` —
**dispatch-only (shelved, #556)**.

- Triggers on PRs touching `tests/e2e/**`, `handlers/**`, `ledger/**`,
`contracts.py`, `skills/bicameral-*/**`, or the workflow itself.
- **No PR trigger.** Run manually: Actions → *v0 user flow e2e* → *Run workflow*.
(It previously triggered on PRs touching `tests/e2e/**`, `handlers/**`,
`ledger/**`, `contracts.py`, or `skills/bicameral-*/**`.)
- Replacement validation strategy: RFQ #555.
- Runs in the `ci-test` GitHub environment for `ANTHROPIC_API_KEY`
(switched from `production` + `CLAUDE_CODE_OAUTH_TOKEN` in #528 after the
org subscription was disabled).