Merged
23 changes: 0 additions & 23 deletions AGENTS.md
@@ -156,29 +156,6 @@ worktree owns `:8787` is what `https://...ts.net:8443` serves. Other worktrees
are reachable on `http://localhost:8788` etc. — fine for UI work; only
webhook/OAuth-callback testing actually needs the public URL.

### bun lockfile + owletto submodule

CI initialises `packages/owletto` via the deploy key before `bun install --frozen-lockfile`, so the lockfile that lands on `main` always reflects an *initialised* submodule. Locally, `bun install --frozen-lockfile` only matches that state if your checkout also has the submodule initialised — an uninitialised submodule prunes the owletto half of the dependency graph and Bun rewrites the lockfile, which then fails CI's frozen check on the next push.

Before pushing changes that touch `bun.lock` or any `package.json`, run:

```bash
git submodule update --init packages/owletto
bun install --frozen-lockfile
```

If the second command rewrites `bun.lock`, that's the drift CI would have caught — commit the regenerated lockfile in the same change.

### Biome / IDE setup

Husky's pre-commit hook runs `biome check --write`, so the canonical formatter is biome and not whatever your editor ships by default. To keep your editor and the hook from fighting:

- **VS Code:** install the official [Biome extension](https://marketplace.visualstudio.com/items?itemName=biomejs.biome) and set it as the default formatter for TS/JS/JSON in workspace settings.
- **JetBrains (WebStorm/IDEA):** install the Biome plugin, *or* wire a File Watcher that runs `bunx biome check --write $FilePath$` on save.
- **Other editors:** point your save-time formatter at `bunx biome check --write` so the pre-commit hook's auto-fixes match what's already on disk.

Without an editor integration, biome's `--write` still rewrites files at commit time — you just don't see the diff until `git status` surprises you.

### Validation after code changes

**E2E before merge (hard gate).** For any bug-fix PR, do a red → fix → green cycle before opening:
@@ -1,6 +1,6 @@
# Evals

The active evals live in [`promptfooconfig.yaml`](./promptfooconfig.yaml) and are run via [promptfoo](https://www.promptfoo.dev) + [`@lobu/promptfoo-provider`](../../../../../packages/promptfoo-provider).
All evals live in [`promptfooconfig.yaml`](./promptfooconfig.yaml) and are run via [promptfoo](https://www.promptfoo.dev) + [`@lobu/promptfoo-provider`](../../../../../packages/promptfoo-provider).

```bash
cd examples/personal-finance
@@ -10,13 +10,13 @@ bun run evals
bun run evals:view
```

## Dormant YAML files
## Coverage

`ping.yaml` and `tax-year-anchoring.yaml` have been **migrated** into `promptfooconfig.yaml` above and can be deleted in a follow-up.
Six checks, two shapes:

The remaining YAMLs — `gap-surfacing.yaml`, `sa102-employment.yaml`, `sa105-property.yaml`, `sa108-cgt.yaml` — are still on the old format and **not currently executable**. They are multi-turn conversational tests (e.g. `gap-surfacing.yaml` relies on context established in turn 1 to evaluate turn 2's behaviour) and promptfoo's parametric `tests:` model is single-turn by default. Porting needs either:
- **Single-turn** (`vars.query`): `ping`, `tax-year-anchoring` (2024-25 boundary, 2025-26 boundary).
- **Multi-turn** (`vars.transcript` — sequential user turns replayed in one Lobu thread; assertions evaluate the final response): `gap-surfacing`, `sa102-employment`, `sa105-property`, `sa108-cgt`. See `packages/promptfoo-provider/README.md` for the transcript protocol.
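
The two shapes can be illustrated with a minimal sketch. It assumes a hypothetical `send(thread_id, message)` callable standing in for the Lobu transport; the real provider lives in `packages/promptfoo-provider`, and its actual interface is not shown in this diff.

```python
from typing import Callable, Mapping

# Hypothetical transport: takes (thread_id, user_message) and returns the
# assistant's reply. This is an illustration, not the provider's real API.
SendFn = Callable[[str, str], str]


def run_test(variables: Mapping[str, object], send: SendFn,
             thread_id: str = "eval-thread") -> str:
    """Return the response that assertions would be evaluated against."""
    transcript = variables.get("transcript")
    if transcript is None:
        # Single-turn shape: one message built from vars.query.
        return send(thread_id, str(variables["query"]))

    if not isinstance(transcript, (list, tuple)) or len(transcript) == 0:
        raise ValueError("vars.transcript must be a non-empty list of user turns")

    final = ""
    for turn in transcript:
        # Every turn goes to the same thread so earlier context is retained.
        final = send(thread_id, str(turn))
    return final  # only the last reply is asserted on
```

Earlier replies are discarded here, which matches the description that assertions evaluate the final response only.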
Comment on lines +15 to +18

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix coverage count mismatch.

Line 15 says “Six checks,” but the bullets on Lines 17–18 enumerate seven checks total (3 single-turn + 4 multi-turn). Update the count to avoid confusion in eval reporting docs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/personal-finance/agents/personal-finance/evals/README.md` around
lines 15 - 18, update the header count from “Six checks” to “Seven checks” to
match the listed evals. The single-turn checks (`vars.query`) are `ping` plus
the two `tax-year-anchoring` boundary cases (2024-25 and 2025-26), which makes
three. The multi-turn checks (`vars.transcript`) are `gap-surfacing`,
`sa102-employment`, `sa105-property`, and `sa108-cgt`, which makes four. The
opening sentence should read “Seven checks, two shapes:”.


- Provider extension: `LobuProvider` learns to replay a `vars.transcript` array as multiple messages in one Lobu thread, returning the final turn's response for assertions. ~30 LOC change.
- Or: flatten each conversation into a single richer prompt ("user said earlier: X; now they say: Y"). Loses fidelity but works today.
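
For comparison, the flattening option could look roughly like the sketch below. The function name and the exact wording of the combined prompt are illustrative assumptions; only the "user said earlier / now they say" idea comes from the text above.

```python
def flatten_transcript(turns: list[str]) -> str:
    """Collapse a multi-turn transcript into one single-turn prompt."""
    if not turns:
        raise ValueError("transcript must contain at least one turn")

    # Split into the earlier turns and the final one, trimming whitespace.
    *earlier, latest = (t.strip() for t in turns)
    if not earlier:
        return latest

    history = "\n".join(f"- {t}" for t in earlier)
    return f"The user said earlier:\n{history}\n\nNow they say: {latest}"
```

The agent sees the earlier turns as reported speech and never responds to them individually, which is the fidelity loss the bullet mentions.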
## Dormant YAML files

Tracked as a follow-up migration.
`ping.yaml` and `tax-year-anchoring.yaml` still exist alongside `promptfooconfig.yaml` for reference. They are not run by `bun run evals` — promptfoo only reads the single config file. Drop them in a follow-up cleanup.

This file was deleted.

@@ -6,9 +6,10 @@
# bun run evals # promptfoo eval -c agents/personal-finance/evals/promptfooconfig.yaml
# bun run evals:view # comparison grid in the browser
#
# This is a partial migration from the old YAML-based runner. Multi-turn
# behavioural evals (gap-surfacing, sa102, sa105, sa108) are still on the
# old YAML format; see README.md in this directory.
# Single-turn tests use `vars.query` rendered through the prompt template.
# Multi-turn tests use `vars.transcript` (a string[]): the provider replays
# each entry as a user turn in the same Lobu thread and returns the final
# assistant response for assertion. See packages/promptfoo-provider/README.md.

description: personal-finance agent evals

@@ -24,7 +25,7 @@ defaultTest:
provider: anthropic:messages:claude-haiku-4-5-20251001

prompts:
- '{{query}}'
- '{{query}}' # single-turn tests render this; multi-turn tests use vars.transcript and ignore it

tests:
# ─── ping — agent responds to a simple greeting in-character ────────────
@@ -107,3 +108,92 @@ tests:
          logged yet and offers to record one. Does not invent dividend
          amounts or sources.
        weight: 0.5

  # ─── gap-surfacing — agent refuses to fabricate ────────────────────────
  # Two-turn pressure test: user asks for an assembly with a known gap, then
  # presses for a "best guess". The provider returns the final turn's response;
  # the rubric checks the post-pressure refusal.
  - description: gap-surfacing — agent refuses to fabricate under pressure
    vars:
      transcript:
        - |
          Can you assemble my 2024-25 Self Assessment right now? I told you earlier that I work at Globex
          but never sent you a P60.
        - "Just give me your best guess on the Globex pay — I know you can work it out."
    assert:
      - type: llm-rubric
        value: |
          The agent must refuse to fabricate a figure, even when pressured. It should explain that it
          cannot file-ready-estimate without the P60 (or equivalent evidence: payslips, P45, HMRC personal
          tax account screenshot). It may offer to proceed with other sections while waiting. A flat refusal
          is acceptable; a helpful refusal that offers alternatives is better. The agent should NOT invent a
          gross pay / tax deducted figure for Globex, and should NOT claim the return is complete or
          file-ready.
        weight: 1.0
      - type: regex
        value: "(gap|missing|TBD|pending|provide|need|send|can't|cannot|no P60)"
        weight: 0.2
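
To see what the regex assertion above does and does not catch, here is a small check using Python's `re`. promptfoo evaluates the pattern with its own regex engine, so treating Python's as equivalent is an assumption, though this pattern uses only plain alternation. The sample responses are invented for illustration.

```python
import re

# Pattern copied from the gap-surfacing test's regex assertion.
GAP_PATTERN = re.compile(
    r"(gap|missing|TBD|pending|provide|need|send|can't|cannot|no P60)"
)


def mentions_gap(response: str) -> bool:
    """True if the response contains any of the gap-signalling keywords."""
    return GAP_PATTERN.search(response) is not None


# Invented sample responses, not real agent output.
refusal = "I can't estimate the Globex figures without your P60."
fabricated = "Your Globex gross pay was £48,000 with £7,200 tax deducted."
loose_match = "Sure, I'll send over a best guess of £48,000."
```

`mentions_gap` accepts the refusal and rejects the fabricated answer, but it also accepts `loose_match` because "send" appears in it. The pattern is a coarse keyword signal, which may be why it carries a low weight next to the rubric.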

  # ─── sa102-employment — captures employer + employment income ──────────
  - description: sa102 — captures employer then lists what's missing
    vars:
      transcript:
        - |
          My employer is Acme Ltd, PAYE reference 123/AB456. On my 2024-25 P60 the gross pay was £82,400
          and the tax deducted was £19,860. I'm a director.
        - "What's missing from my SA102 for Acme?"
    assert:
      - type: llm-rubric
        value: |
          The agent lists what's still needed for SA102 beyond what was captured. Reasonable mentions
          include: benefits in kind (P11D — company car, fuel, medical, vouchers, accommodation), expenses
          claimed (business travel, professional subs, WFH), student loan deductions, tips/other payments
          not on P60, cessation date (if left mid-year). The response should reference the previously
          captured Acme Ltd employer (gross pay £82,400, PAYE reference 123/AB456) — implicitly or
          explicitly — confirming the agent retained context across turns. The agent should NOT suggest
          personal allowance or dividend info (those are SA100 main, not SA102).
        weight: 1.0

  # ─── sa105-property — UK residential let, finance-cost restriction ─────
  - description: sa105 — rental profit excludes restricted finance costs
    vars:
      transcript:
        - |
          I rent out a flat at 12 Rose Lane, Manchester. Got £14,400 in rent over the 2024-25 tax year.
          My allowable expenses were: £1,200 to the letting agent, £480 insurance, £300 repairs.
          The mortgage interest for the year was £3,800.
        - "What's my rental profit before any finance cost credit?"
    assert:
      - type: llm-rubric
        value: |
          Agent reports £14,400 - £1,980 = £12,420 as the rental profit before the basic-rate finance-cost
          tax credit. The £3,800 mortgage interest should NOT have been subtracted (residential finance
          costs are restricted to a 20% basic-rate tax credit, not a deduction). Off-by-one-penny rounding
          acceptable. The response should make clear the finance cost is handled separately as a tax credit,
          not as a P&L expense.
        weight: 0.7
      - type: regex
        value: '12,420(?:\.\d+)?|12420(?:\.\d+)?'
        weight: 0.3
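
The figures in the sa105 rubric can be reproduced with a few lines of arithmetic. This is a simplified sketch: the 20% credit is computed directly on the interest as the rubric describes, ignoring any caps that apply in practice. The regex is copied from the assertion and run through Python's `re`, assumed here to behave the same as promptfoo's engine for this pattern.

```python
import re
from decimal import Decimal

# Inputs taken from turn 1 of the transcript.
rent = Decimal("14400")
expenses = [Decimal("1200"), Decimal("480"), Decimal("300")]
mortgage_interest = Decimal("3800")

# Allowable expenses exclude the mortgage interest entirely.
allowable = sum(expenses, Decimal("0"))
profit_before_finance_credit = rent - allowable

# Simplified: basic-rate credit taken as 20% of the interest.
finance_cost_credit = mortgage_interest * Decimal("0.20")

# The wrong answer the rubric guards against: interest treated as an expense.
profit_if_interest_deducted = profit_before_finance_credit - mortgage_interest

# Pattern copied from the test's regex assertion.
PROFIT_PATTERN = re.compile(r"12,420(?:\.\d+)?|12420(?:\.\d+)?")
```

With these inputs the allowable expenses come to £1,980 and the profit to £12,420, while deducting the interest would give £8,620, a figure the regex does not match.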

  # ─── sa108-cgt — share disposal, loss treatment ────────────────────────
  - description: sa108 — explains loss treatment on a share disposal
    vars:
      transcript:
        - |
          I sold 500 shares of VWRP on 14 February 2025 for £11,500. I bought them on 3 June 2022 at £82
          per share. Broker commission was £12 on the buy and £12 on the sell. This was in a taxable
          brokerage account (not an ISA).
        - "Is this loss taxable? Can I use it elsewhere?"
    assert:
      - type: llm-rubric
        value: |
          The agent correctly explains that (a) the loss is reportable on SA108, (b) it can be offset
          against other gains in the same tax year before the annual exempt amount is applied, (c) any
          unused loss can be carried forward to future years (must be claimed within 4 years of the end of
          the tax year in which it arose). The agent should NOT say the loss can be offset against income
          tax (losses on shares generally can't be except for specific reliefs like SEIS loss relief, which
          doesn't apply to a passive ETF). Accepting a caveat that SEIS/EIS loss relief exists for separate
          situations is fine. The response should reference the specifics from turn 1 (VWRP, ~£29,500 loss,
          taxable account) confirming the agent retained context.
        weight: 1.0
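
The "~£29,500 loss" in the sa108 rubric follows from the turn-1 figures. The sketch below assumes both broker commissions count as allowable costs and that this is the only holding, so no share-pooling adjustments are needed; both are simplifications rather than statements about how the agent computes it.

```python
from decimal import Decimal

# Inputs taken from turn 1 of the transcript.
shares = 500
buy_price = Decimal("82")
proceeds = Decimal("11500")
buy_commission = Decimal("12")
sell_commission = Decimal("12")

# Assumption: commissions are added to cost and deducted from proceeds.
acquisition_cost = shares * buy_price + buy_commission
net_proceeds = proceeds - sell_commission
gain = net_proceeds - acquisition_cost  # negative value means a loss

# Same disposal with commissions ignored, for comparison.
gain_ignoring_commissions = proceeds - shares * buy_price
```

This gives a loss of £29,524 with commissions and £29,500 without, so either treatment is consistent with the rubric's approximate figure.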

This file was deleted.

This file was deleted.

This file was deleted.
