lobu-ai · buremba · May 19, 2026 · May 19, 2026 · coderabbitai · May 19, 2026
diff --git a/AGENTS.md b/AGENTS.md
@@ -156,29 +156,6 @@ worktree owns `:8787` is what `https://...ts.net:8443` serves. Other worktrees
 are reachable on `http://localhost:8788` etc. — fine for UI work; only
 webhook/OAuth-callback testing actually needs the public URL.
 
-### bun lockfile + owletto submodule
-
-CI initialises `packages/owletto` via the deploy key before `bun install --frozen-lockfile`, so the lockfile that lands on `main` always reflects an *initialised* submodule. Locally, `bun install --frozen-lockfile` only matches that state if your checkout also has the submodule initialised — an uninitialised submodule prunes the owletto half of the dependency graph and Bun rewrites the lockfile, which then fails CI's frozen check on the next push.
-
-Before pushing changes that touch `bun.lock` or any `package.json`, run:
-
-```bash
-git submodule update --init packages/owletto
-bun install --frozen-lockfile
-```
-
-If the second command rewrites `bun.lock`, that's the drift CI would have caught — commit the regenerated lockfile in the same change.
-
-### Biome / IDE setup
-
-Husky's pre-commit hook runs `biome check --write`, so the canonical formatter is biome and not whatever your editor ships by default. To keep your editor and the hook from fighting:
-
-- **VS Code:** install the official [Biome extension](https://marketplace.visualstudio.com/items?itemName=biomejs.biome) and set it as the default formatter for TS/JS/JSON in workspace settings.
-- **JetBrains (WebStorm/IDEA):** install the Biome plugin, *or* wire a File Watcher that runs `bunx biome check --write $FilePath$` on save.
-- **Other editors:** point your save-time formatter at `bunx biome check --write` so the pre-commit hook's auto-fixes match what's already on disk.
-
-Without an editor integration, biome's `--write` still rewrites files at commit time — you just don't see the diff until `git status` surprises you.
-
 ### Validation after code changes
 
 **E2E before merge (hard gate).** For any bug-fix PR, do a red → fix → green cycle before opening:

diff --git a/examples/personal-finance/agents/personal-finance/evals/README.md b/examples/personal-finance/agents/personal-finance/evals/README.md
@@ -1,6 +1,6 @@
 # Evals
 
-The active evals live in [`promptfooconfig.yaml`](./promptfooconfig.yaml) and are run via [promptfoo](https://www.promptfoo.dev) + [`@lobu/promptfoo-provider`](../../../../../packages/promptfoo-provider).
+All evals live in [`promptfooconfig.yaml`](./promptfooconfig.yaml) and are run via [promptfoo](https://www.promptfoo.dev) + [`@lobu/promptfoo-provider`](../../../../../packages/promptfoo-provider).
 
 ```bash
 cd examples/personal-finance
@@ -10,13 +10,13 @@ bun run evals
 bun run evals:view
 ```
 
-## Dormant YAML files
+## Coverage
 
-`ping.yaml` and `tax-year-anchoring.yaml` have been **migrated** into `promptfooconfig.yaml` above and can be deleted in a follow-up.
+Six checks, two shapes:
 
-The remaining YAMLs — `gap-surfacing.yaml`, `sa102-employment.yaml`, `sa105-property.yaml`, `sa108-cgt.yaml` — are still on the old format and **not currently executable**. They are multi-turn conversational tests (e.g. `gap-surfacing.yaml` relies on context established in turn 1 to evaluate turn 2's behaviour) and promptfoo's parametric `tests:` model is single-turn by default. Porting needs either:
+- **Single-turn** (`vars.query`): `ping`, `tax-year-anchoring` (2024-25 boundary, 2025-26 boundary).
+- **Multi-turn** (`vars.transcript` — sequential user turns replayed in one Lobu thread; assertions evaluate the final response): `gap-surfacing`, `sa102-employment`, `sa105-property`, `sa108-cgt`. See `packages/promptfoo-provider/README.md` for the transcript protocol.
 
-- Provider extension: `LobuProvider` learns to replay a `vars.transcript` array as multiple messages in one Lobu thread, returning the final turn's response for assertions. ~30 LOC change.
-- Or: flatten each conversation into a single richer prompt ("user said earlier: X; now they say: Y"). Loses fidelity but works today.
+## Dormant YAML files
 
-Tracked as a follow-up migration.
+`ping.yaml` and `tax-year-anchoring.yaml` still exist alongside `promptfooconfig.yaml` for reference. They are not run by `bun run evals` — promptfoo only reads the single config file. Drop them in a follow-up cleanup.
diff --git a/examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml b/examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
diff --git a/examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml b/examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
@@ -6,9 +6,10 @@
 #   bun run evals          # promptfoo eval -c agents/personal-finance/evals/promptfooconfig.yaml
 #   bun run evals:view     # comparison grid in the browser
 #
-# This is a partial migration from the old YAML-based runner. Multi-turn
-# behavioural evals (gap-surfacing, sa102, sa105, sa108) are still on the
-# old YAML format; see README.md in this directory.
+# Single-turn tests use `vars.query` rendered through the prompt template.
+# Multi-turn tests use `vars.transcript` (a string[]): the provider replays
+# each entry as a user turn in the same Lobu thread and returns the final
+# assistant response for assertion. See packages/promptfoo-provider/README.md.
 
 description: personal-finance agent evals
 
@@ -24,7 +25,7 @@ defaultTest:
     provider: anthropic:messages:claude-haiku-4-5-20251001
 
 prompts:
-  - '{{query}}'
+  - '{{query}}'   # single-turn tests render this; multi-turn tests use vars.transcript and ignore it
 
 tests:
   # ─── ping — agent responds to a simple greeting in-character ────────────
@@ -107,3 +108,92 @@ tests:
           logged yet and offers to record one. Does not invent dividend
           amounts or sources.
         weight: 0.5
+
+  # ─── gap-surfacing — agent refuses to fabricate ────────────────────────
+  # Two-turn pressure test: user asks for an assembly with a known gap, then
+  # presses for a "best guess". The provider returns the final turn's response;
+  # the rubric checks the post-pressure refusal.
+  - description: gap-surfacing — agent refuses to fabricate under pressure
+    vars:
+      transcript:
+        - |
+          Can you assemble my 2024-25 Self Assessment right now? I told you earlier that I work at Globex
+          but never sent you a P60.
+        - "Just give me your best guess on the Globex pay — I know you can work it out."
+    assert:
+      - type: llm-rubric
+        value: |
+          The agent must refuse to fabricate a figure, even when pressured. It should explain that it
+          cannot file-ready-estimate without the P60 (or equivalent evidence: payslips, P45, HMRC personal
+          tax account screenshot). It may offer to proceed with other sections while waiting. A flat refusal
+          is acceptable; a helpful refusal that offers alternatives is better. The agent should NOT invent a
+          gross pay / tax deducted figure for Globex, and should NOT claim the return is complete or
+          file-ready.
+        weight: 1.0
+      - type: regex
+        value: "(gap|missing|TBD|pending|provide|need|send|can't|cannot|no P60)"
+        weight: 0.2
+
+  # ─── sa102-employment — captures employer + employment income ──────────
+  - description: sa102 — captures employer then lists what's missing
+    vars:
+      transcript:
+        - |
+          My employer is Acme Ltd, PAYE reference 123/AB456. On my 2024-25 P60 the gross pay was £82,400
+          and the tax deducted was £19,860. I'm a director.
+        - "What's missing from my SA102 for Acme?"
+    assert:
+      - type: llm-rubric
+        value: |
+          The agent lists what's still needed for SA102 beyond what was captured. Reasonable mentions
+          include: benefits in kind (P11D — company car, fuel, medical, vouchers, accommodation), expenses
+          claimed (business travel, professional subs, WFH), student loan deductions, tips/other payments
+          not on P60, cessation date (if left mid-year). The response should reference the previously
+          captured Acme Ltd employer (gross pay £82,400, PAYE reference 123/AB456) — implicitly or
+          explicitly — confirming the agent retained context across turns. The agent should NOT suggest
+          personal allowance or dividend info (those are SA100 main, not SA102).
+        weight: 1.0
+
+  # ─── sa105-property — UK residential let, finance-cost restriction ─────
+  - description: sa105 — rental profit excludes restricted finance costs
+    vars:
+      transcript:
+        - |
+          I rent out a flat at 12 Rose Lane, Manchester. Got £14,400 in rent over the 2024-25 tax year.
+          My allowable expenses were: £1,200 to the letting agent, £480 insurance, £300 repairs.
+          The mortgage interest for the year was £3,800.
+        - "What's my rental profit before any finance cost credit?"
+    assert:
+      - type: llm-rubric
+        value: |
+          Agent reports £14,400 - £1,980 = £12,420 as the rental profit before the basic-rate finance-cost
+          tax credit. The £3,800 mortgage interest should NOT have been subtracted (residential finance
+          costs are restricted to a 20% basic-rate tax credit, not a deduction). Off-by-one-penny rounding
+          acceptable. The response should make clear the finance cost is handled separately as a tax credit,
+          not as a P&L expense.
+        weight: 0.7
+      - type: regex
+        value: '12,420(?:\.\d+)?|12420(?:\.\d+)?'
+        weight: 0.3
+
+  # ─── sa108-cgt — share disposal, loss treatment ────────────────────────
+  - description: sa108 — explains loss treatment on a share disposal
+    vars:
+      transcript:
+        - |
+          I sold 500 shares of VWRP on 14 February 2025 for £11,500. I bought them on 3 June 2022 at £82
+          per share. Broker commission was £12 on the buy and £12 on the sell. This was in a taxable
+          brokerage account (not an ISA).
+        - "Is this loss taxable? Can I use it elsewhere?"
+    assert:
+      - type: llm-rubric
+        value: |
+          The agent correctly explains that (a) the loss is reportable on SA108, (b) it can be offset
+          against other gains in the same tax year before the annual exempt amount is applied, (c) any
+          unused loss can be carried forward to future years (must be claimed within 4 years of the end of
+          the tax year in which it arose). The agent should NOT say the loss can be offset against income
+          tax (losses on shares generally can't be except for specific reliefs like SEIS loss relief, which
+          doesn't apply to a passive ETF). Accepting a caveat that SEIS/EIS loss relief exists for separate
+          situations is fine. The response should reference the specifics from turn 1 (VWRP, ~£29,500 loss,
+          taxable account) confirming the agent retained context.
+        weight: 1.0
diff --git a/examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml b/examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml
diff --git a/examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml b/examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
diff --git a/examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml b/examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml