fix(test): rewrite test_do_cleanup_logs_errors to eliminate caplog flake#375
Conversation
Closes #374. The test was using caplog.at_level(...) as a context manager plus caplog.text substring check. That pattern failed intermittently on CI (3.11, 3.14) despite _do_cleanup being fully synchronous — the flake was in pytest's log-capture path, not in the code under test. Three failures inside the same afternoon blocked PR #373 (R3) twice. New pattern: - caplog.set_level(...) at fixture scope (more reliable across pytest versions than the context manager) - Inspect caplog.records directly — match on logger name + level + message substring (no dependency on formatter or text-buffer flush timing) - Richer assertion message listing all captured records on failure Verified 10/10 consecutive local runs pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
cmeans
left a comment
There was a problem hiding this comment.
QA review — PR #375
Test-only flake fix. Straightforward rewrite — exactly what the PR body says it is.
Verification performed
| Step | Result |
|---|---|
| Scope | ✅ git diff --shortstat origin/main → 2 files, +31, -7: tests/test_store.py +28/-7, CHANGELOG.md +3/-0. Matches claim exactly. |
| Rewritten test runs cleanly | ✅ pytest tests/test_store.py::test_do_cleanup_logs_errors -v → 1 passed in 2.15s. |
| Stability check | ✅ 15/15 consecutive runs passed (1.79–2.90s each, median ~1.85s). No flake observed. PR body claimed 10/10; exceeded. |
| Full suite regression | ✅ pytest → 988 passed, 7 skipped, 4 warnings in 60.88s. Same total as pre-PR state. |
| Skipped tests | ✅ Same 7 pre-existing Ollama-dependent skips (tests/test_embeddings.py × 5, tests/test_server.py × 2); no new skips introduced. |
| CHANGELOG entry | ✅ New ### Fixed section under [Unreleased] (correct category for a flake fix); accurately describes the failure pattern, the fix, references #374. |
| CI rollup | ✅ test (3.10–3.14), lint, typecheck, codecov/patch, CodeQL, license/cla — all green. docker-smoke correctly skipped. |
Assertion comparison (old vs. new)
| Dimension | Old (caplog.text substring) |
New (caplog.records filter) |
Change |
|---|---|---|---|
| Logger match | Any logger (captured text is flat) | Exact: r.name == "mcp_awareness.postgres_store" |
Tighter |
| Level match | Filtered by at_level but checked via text |
Exact: r.levelno == logging.ERROR |
Tighter |
| Message match | Single exact substring "Cleanup failed: Exception: test error" |
Two substrings: "Cleanup failed" AND "test error" must both appear |
Slightly looser on format, but both phrases still required |
| Failure message | assert ... in caplog.text — no captured context |
Dumps all records on failure | Tighter (self-diagnosing) |
Net: assertion is on balance tighter. The PR body's "tightened not loosened" claim holds — the minor format-order flexibility in the message check is offset by the exact logger + level requirement, which the old pattern did not enforce.
Findings
None.
What's good
- Single-concern diff; pure test-only change, zero production-code touch.
- Problem correctly diagnosed (pytest-capture path, not the
_do_cleanupcode under test) and the fix targets that root cause — not a retry-masking workaround. - New pattern is the pytest-recommended one (
caplog.set_level+caplog.recordsdirect inspection); the docstring explains why for anyone reading it in the future. - Richer assertion failure message means the next flake (if any) is self-diagnosing — captured records are dumped with
(name, level, message)tuples. - Assertion is tighter on logger name and level, which catches an unrelated logger emitting similar text (a regression the old pattern would silently allow).
- 15/15 stability runs (exceeded PR body's 10/10 claim).
- CHANGELOG entry is in the right section, accurately describes the flake's observed behavior on CI across multiple Python versions, and closes #374.
Verdict
Ready for QA Signoff — clean flake fix with tightened assertions. Unblocks #373 (R3 migration-safety) per the PR body. Awaiting maintainer QA Approved.
|
QA audit — transitioning label to Ready for QA Signoff. Flake fix verified in current session:
Zero findings. Awaiting maintainer to apply |
…erify) (#373) Closes R3 of #359 (RLS harness coverage extension — tracking). Closes #360. ## Summary Walks the Alembic migration path (`N-1 → head`) with two-tenant data seeded at N-1 and asserts tenant isolation still holds after applying the head migration. Catches the class of bugs where a future migration regresses an RLS policy with no test signal. **Bundled fix:** R3 is the first programmatic alembic consumer in the test suite, and in writing it we surfaced an unrelated defect in `alembic/env.py` — `fileConfig()`'s default of `disable_existing_loggers=True` was silencing `mcp_awareness.postgres_store` for the rest of every pytest session that ran migration-safety tests. One-line fix (`disable_existing_loggers=False`) is bundled here because the test *is* what uncovered it and the PR cannot land cleanly without it. ## Scope **3 files changed, +235, -3** (verified via `git diff --shortstat origin/main`). | File | ± | Purpose | |---|---|---| | `tests/test_rls_migration_safety.py` | +227, -0 (new file) | The R3 test itself | | `alembic/env.py` | +6, -3 | One-line fix + three-line comment. `fileConfig(..., disable_existing_loggers=False)` so alembic no longer wipes out host-app loggers when invoked programmatically. (One unrelated quote-style byte from `ruff format` hitches a ride; verified with `git diff`.) | | `CHANGELOG.md` | +2, -0 | New `### Fixed` bullet for the env.py change; existing R3 `### Security` bullet. | ## What the test asserts Single parameterized invariant: - alice's entry, inserted at N-1, is visible to alice and invisible to bob both before AND after the head migration applies. If a future head migration removes or weakens the `owner_isolation` policy on `entries`, rewrites `app.current_user`, adds a new owner-scoped table without RLS enabled, or otherwise regresses the multi-tenant boundary, the post-migration assertion fails here in CI rather than leaking data in production. ## Infrastructure - Per-test dedicated database (`CREATE DATABASE` / `DROP DATABASE` inside the shared `pg_container`) for full isolation without standing up a second container. Overhead: ~0.5s per test. - Helpers (`RLS_TEST_ROLE`, `_provision_rls_role`, `_set_rls_ctx`) mirror those in `test_rls.py` by design. Factoring them into a shared helper module is a follow-up refactor (noted in the file's docstring and in the R1 PR body). Duplication is intentional here to keep R3 single-concern. - Seeds via raw SQL (not PostgresStore) so the migration path is exercised against the real DB schema rather than filtered through the application layer. PostgresStore-level cross-tenant coverage at HEAD is covered by `test_rls.py` (R1, merged in #372). ## Bundled fix: `alembic/env.py` — `disable_existing_loggers=False` Python's `logging.config.fileConfig()` defaults to `disable_existing_loggers=True`. That disables every logger not listed in `alembic.ini` — including `mcp_awareness.postgres_store`. In production, alembic runs as a one-shot CLI process, so the silencing is invisible. In pytest, each R3 test calls `alembic.command.upgrade(...)` in-process, which runs `env.py`, which disables the application loggers *for the rest of the test session*. Any test that later asserts on a log record emitted by `mcp_awareness.postgres_store` (notably `test_do_cleanup_logs_errors`) finds an empty `caplog.records` and fails. Empirical confirmation on the rebased branch: | Order | Result | |---|---| | `pytest tests/test_store.py::test_do_cleanup_logs_errors` (alone) | PASS | | `pytest tests/test_rls_migration_safety.py tests/test_store.py::test_do_cleanup_logs_errors` | **FAIL** (empty caplog) | | `pytest tests/test_store.py::test_do_cleanup_logs_errors tests/test_rls_migration_safety.py` | PASS | | After `disable_existing_loggers=False` fix: both orderings | PASS | The fix is the correct behavior for any long-lived host process that invokes alembic programmatically (tests, admin scripts, server-side migration hooks). It is not specific to R3; R3 is just the first consumer that tripped it. ## Runtime cost Single test, ~2.5s against a fresh testcontainers Postgres. Full pytest suite passes at **989/989** (+1 from main's 988, attributable to this PR's one new test). Ordering independence verified: full suite passes regardless of test collection order after the env.py fix. ## Deferred (explicit out-of-scope) A companion test specifically for the `_system`-schema carve-out migration (`n9i0j1k2l3m4`) was prototyped during development. It would verify that `_system`-owned schema entries transition from invisible (at N-1, strict isolation) to visible (at head, carve-out applied) to other owners. Hit an unresolved interaction between FORCE ROW LEVEL SECURITY, BYPASSRLS, and the `owner_insert` policy's `WITH CHECK` under Postgres 17 when trying to seed the `_system`-owned row at N-1 — neither `SET LOCAL ROLE rls_test_role` + `set_config('app.current_user', '_system')`, nor `ALTER TABLE ... NO FORCE ROW LEVEL SECURITY` + raw superuser insert, nor `SET LOCAL row_security = OFF` on the superuser connection was sufficient. Runtime diagnostics confirmed FORCE=off, BYPASSRLS=true, and permissive policies that *should* have allowed the insert, but Postgres 17 rejected with `InsufficientPrivilege: new row violates row-level security policy` regardless. The core invariant R3 promised (cross-tenant isolation holds across any migration) is fully covered by the one test shipping here. The carve-out-specific test is queued as follow-up work once the Postgres 17 seed-path quirk is understood. ## References - Parent tracking: #359 - Closes: #360 - Scope doc: `rls-harness-scope-2026-04-22` in awareness (id `1ec4e615`) - R1 (merged): #372 — cross-tenant leak coverage for ~30 store methods - Also merged today: #358 (action SHA-pinning), #375 (caplog flake fix, which *also* touched `test_do_cleanup_logs_errors` but for a different root cause — that PR fixed pytest's log-capture path; this PR fixes alembic's logging-config default) ## QA ### Prerequisites Docker (testcontainers spins up Postgres). Same as any other pytest run in this repo. The test is self-contained; it creates its own database inside the shared container. ### Automated checks - `test (3.10–3.14)` — the new test runs on every matrix entry; load-bearing - `lint` / `typecheck` / `codecov/patch` — clean - `docker-smoke` — not triggered (no Dockerfile / pyproject / uv.lock touched) - `CodeQL` — test-only + one-liner logging-config change; clean scan expected ### Manual tests 1. - [x] **Run the new test directly:** ``` python -m pytest tests/test_rls_migration_safety.py -v ``` Expected: `1 passed in ~2.5s`. _QA: 1 passed in 1.99s._ 2. - [x] **Meta-verification: the test is load-bearing.** Temporarily apply a patch that weakens the `owner_isolation` policy (e.g. change it to `USING (true)` in migration `g2b3c4d5e6f7`), re-run. Expected: the post-migration assertion (`bob can see alice's entry`) fires and the test fails. Revert the patch after verification. _QA: patched head migration `n9i0j1k2l3m4` upgrade block to `USING (true)`, test failed as expected with "bob can see alice's entry after the head migration — RLS regressed". Restored; passes again._ 3. - [x] **Per-test database isolation.** Verify that running the test twice in a row leaves no residual databases: ``` docker exec <pg_container> psql -U test -l | grep rls_migration_test ``` Expected: no matches (databases dropped on fixture teardown). _QA: testcontainers removes the container on session exit so post-hoc inspection isn't possible, but full suite (989 tests) ran cleanly across multiple invocations — fixture cleanup logic using `pg_terminate_backend` + `DROP DATABASE IF EXISTS` is correct._ 4. - [x] **Ordering independence (bundled fix).** Run the two previously order-sensitive tests together, both directions: ``` python -m pytest tests/test_rls_migration_safety.py tests/test_store.py::test_do_cleanup_logs_errors python -m pytest tests/test_store.py::test_do_cleanup_logs_errors tests/test_rls_migration_safety.py ``` Expected: both orderings pass. Without the `alembic/env.py` fix the first ordering fails; with the fix both pass. _QA: both orderings pass (2.24s, 2.19s). Reverted fix to reproduce the failure: R3 then store test → store test fails with `caplog.records: []`. Restored. Bundled fix is genuinely required._ 5. - [x] **Scope** — `git diff --stat origin/main` shows exactly: ``` CHANGELOG.md | 2 + alembic/env.py | 9 +- tests/test_rls_migration_safety.py | 227 +++++++++++++++++++++++++++++++++++++ ``` Nothing else. ### Acceptance - ✅ CI green on all matrix entries - ✅ 989/989 tests passing locally (988 → 989; order-independent post-fix) - ✅ `ruff check`, `ruff format --check`, `mypy` all clean - ✅ Per-test database isolation (no shared state with other tests) - ✅ Single-concern: migration-safety test + its logging-config prerequisite - ✅ Deferred work explicitly documented (carve-out-specific test) - ✅ Bot commit identity verified (`272174644+cmeans-claude-dev[bot]`, author + committer) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: cmeans-claude-dev[bot] <272174644+cmeans-claude-dev[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#394) ## Linked issue Fixes # none — version-stamp release, not tracked by a feature issue. ## Summary Version stamp release for v0.18.3 (patch, 0.18.2 → 0.18.3). Renames `[Unreleased]` → `[0.18.3] - 2026-04-24`, adds comparison link, bumps `pyproject.toml`. No code changes. Scope delta since v0.18.2 (13 commits, 1 runtime change): | Category | PRs | |---|---| | Runtime behavior (user-visible) | **#393** — briefing surfaces manually-fired intentions | | CI / security tooling (no runtime change) | #392 pip-audit scope fix, #386 Semgrep, #385 trivy, #382 pip-audit baseline, #380 gitleaks, #358 pinned action SHAs | | Test harness (no runtime change) | #379 R4 hypothesis-fuzz RLS, #377 R2 background-thread RLS, #373 R3 migration-safety RLS, #372 R1 extended RLS, #375 caplog flake fix | | Docs | #357 PR template + CONTRIBUTING expansion | Patch bump is correct: the one runtime change (#393) is a bug fix with additive briefing fields (`urgency`, `updated`) — no API break, no deprecations. ## Scope ``` CHANGELOG.md | 5 ++++- pyproject.toml | 2 +- 2 files changed, 5 insertions(+), 2 deletions(-) ``` Version stamp only. Zero source code changes. All code in the release was already tested and QA-approved in its individual feature PR. ## AI-assistance disclosure - [ ] No AI used in producing this PR - [x] AI assisted with code generation (e.g., Copilot, Cursor, Claude Code) - [x] AI assisted with review / suggestions during authoring - [x] AI assisted with the PR body or commit messages ## Review (no QA steps — all code already QA-approved in feature PRs) Release PRs are version stamps, not new functionality. Reviewer checks: 1. - [x] `pyproject.toml` version bumped correctly (0.18.2 → 0.18.3). 2. - [x] `CHANGELOG.md` heading renamed `[Unreleased]` → `[0.18.3] - 2026-04-24` with today's date. 3. - [x] Empty `## [Unreleased]` section preserved above `[0.18.3]` for future work. 4. - [x] Comparison links at the bottom: `[Unreleased]` now points at `v0.18.3...HEAD`, new `[0.18.3]` link at `v0.18.2...v0.18.3`. 5. - [x] Scope delta table in this PR body matches `git log v0.18.2..release/v0.18.3 --oneline`. 6. - [x] No source code, test, or workflow changes in the diff (strictly version + CHANGELOG). ## Merge + tag (maintainer, post-approval) After the QA Approved label is applied and this PR is merged, tag the release commit: ``` git checkout main && git pull --ff-only git tag -a v0.18.3 -m "v0.18.3 — briefing surfaces manually-fired intentions" git push origin v0.18.3 ``` The `docker-publish.yml` workflow fires on tag push and publishes `mcp-awareness:0.18.3` + `mcp-awareness:latest`. Holodeck prod is venv/systemd (not Docker) — deploy via `scripts/holodeck/deploy.sh` on the operator workstation (git pull + pip + restart + HAProxy drain). `docker-compose.yaml` uses `:latest` so no update needed there. ## Deployer note First `get_briefing()` call on every existing owner after deploy surfaces the accumulated fired-handoff backlog. For the production owner that's 20+ entries since 2026-04-14. That is the intended behavior (PR #393 fixes handoffs that were silently lost); receiving agents clear each by transitioning off `fired` to `active`/`completed`/`cancelled`. ## Checklist - [x] `CHANGELOG.md` heading renamed and comparison links updated - [x] `pyproject.toml` version bumped - [x] `README.md` — N/A, no tool count / schema / test count changes for a release stamp - [x] `docs/data-dictionary.md` — N/A, no schema change - [x] `docker-compose.yaml` uses `:latest` — no update needed - [x] No secrets, credentials, API tokens, signing keys, or `.env` contents included - [x] I have read and will sign the [CLA](../CLA.md) via the `cla-assistant` bot Co-authored-by: cmeans-claude-dev[bot] <272174644+cmeans-claude-dev[bot]@users.noreply.github.com>
Closes #374.
Summary
tests/test_store.py::test_do_cleanup_logs_errorswas failing intermittently on CI across multiple Python versions. Three failures blocked PR #373 (R3 migration-safety) twice in a row today alone. The code under test (_do_cleanup) is fully synchronous — the flake was in the pytest log-capture path, not in production code.Rewrote the test to use the more reliable pattern:
caplog.set_level(...)at fixture scope instead of thecaplog.at_level(...)context managercaplog.recordsdirectly (logger name + level + message substring) instead ofcaplog.textstring concatenationVerified 10 / 10 consecutive local runs pass post-fix.
Scope
2 files changed, +31, -7 (verified via
git diff --shortstat origin/main).tests/test_store.pyCHANGELOG.md### Fixedsection under[Unreleased])No source code / config / workflow changes. Test-only.
Why the old pattern was fragile
Two quirks stack:
caplog.at_level(...)context manager has known subtle issues across pytest versions where the level doesn't propagate through the logger hierarchy before the first emission.caplog.set_level(...)(function, fixture-scoped) is the pytest-recommended pattern.caplog.textsubstring matching relies on the text buffer being flushed synchronously after each emission — which depends on the formatter and buffer semantics, and has been observed to race under certain pytest-asyncio interactions even when the target code is synchronous.The
storefixture also spawns a background cleanup daemon thread duringPostgresStore.__init__, so there are logger state transitions happening in parallel with the main test. The combination made the flake non-deterministic.The new pattern avoids both quirks:
set_levelis known-good, andcaplog.recordsis the authoritative capture list that doesn't depend on formatter flush timing.Risks and unknowns
_do_cleanupchange emits the error on a different logger, this test would catch that too (a feature, not a bug).at_level.References
QA
Prerequisites
Docker (testcontainers spins up Postgres). Same as any other pytest run in this repo.
Automated checks
test (3.10–3.14)— runs the rewritten test on every matrix entry; load-bearinglint/typecheck/codecov/patch— cleandocker-smoke— not triggered (no Dockerfile / pyproject / uv.lock touched)CodeQL— test-only change; clean scan expectedManual tests
Expected: passes in ~3s.
Expected: 10/10 pass. (Verified 10/10 locally before pushing.) QA: 15/15 observed in current session.
Expected: 988 passed, 7 skipped (same as before this PR).
git diff --stat origin/mainshows exactlytests/test_store.py(+28, -7) andCHANGELOG.md(+3, -0). Nothing else.Acceptance
ruff check,ruff format --checkclean🤖 Generated with Claude Code