feat(examples): evals for personal-finance agent (SA102/SA105/SA108 + behavioral) #356

Merged: buremba merged 2 commits into main from feat/personal-finance-evals on Apr 26, 2026

@buremba (Member) commented Apr 25, 2026

Summary

Six evals under `examples/personal-finance/agents/personal-finance/evals/`:

  • `ping` — agent stays in accountant persona for a simple greeting.
  • `sa102-employment` — captures employer + income source from a P60 description, knows what's still missing.
  • `sa105-property` — handles UK residential let with the correct finance-cost treatment (20% basic-rate credit, NOT a deduction) and computes rental profit accordingly.
  • `sa108-cgt` — captures a share disposal, correctly identifies a loss, explains how CGT losses carry forward.
  • `tax-year-anchoring` — maps dates to the correct UK fiscal year (6 April–5 April boundary).
  • `gap-surfacing` — refuses to fabricate missing figures when asked to assemble; holds the line under pressure.
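
As an illustration of the arithmetic the `sa105-property` eval guards, here is a minimal sketch with hypothetical figures. The names `RentalInputs` and `rentalSummary` are invented for this example and are not taken from the agent or the eval files. It also simplifies the real rule: the relief is, as far as I understand it, 20% of the lowest of finance costs, property profits, and adjusted total income, while this sketch only caps the relief base at the rental profit.

```typescript
// Hypothetical names and figures; a sketch of the finance-cost treatment only.
interface RentalInputs {
  rentReceived: number;      // gross rents for the tax year
  allowableExpenses: number; // repairs, agent fees, insurance (no mortgage interest)
  financeCosts: number;      // residential mortgage interest and similar costs
}

interface RentalSummary {
  rentalProfit: number;
  financeCostCredit: number;
}

const BASIC_RATE_PERCENT = 20;

function rentalSummary(input: RentalInputs): RentalSummary {
  // Finance costs are NOT subtracted when computing rental profit.
  const rentalProfit = input.rentReceived - input.allowableExpenses;
  // Simplification: the relief base is capped at profit only
  // (the adjusted-total-income cap is ignored in this sketch).
  const reliefBase = Math.max(0, Math.min(input.financeCosts, rentalProfit));
  // The relief is a reduction of the tax bill, not of the profit.
  const financeCostCredit = (reliefBase * BASIC_RATE_PERCENT) / 100;
  return { rentalProfit, financeCostCredit };
}

const example = rentalSummary({
  rentReceived: 12000,
  allowableExpenses: 2000,
  financeCosts: 3000,
});
console.log(example); // { rentalProfit: 10000, financeCostCredit: 600 }
```

An agent that deducted the 3,000 of interest and reported a profit of 7,000 would be making the mistake this eval is meant to catch.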

Each eval uses `llm-rubric` assertions for the semantic check and `regex`/`contains` for the lexical fast-path, following the `examples/careops/` convention.

Stacked on

Targets `feat/personal-finance-example` (#350). Rebase onto main once #350 merges.

Test plan

  • All 6 eval YAMLs parse.
  • Pre-commit checks pass (Biome + tsc — no code changes).
  • Run `lobu eval` against the seeded personal-finance agent and check pass rates.
  • Iterate on any thresholds/assertions that are too loose or too tight based on actual model output.

Base automatically changed from feat/personal-finance-example to main April 26, 2026 16:23
buremba added 2 commits April 26, 2026 17:30
Six evals covering the behaviors we most want to guard:

- ping — agent stays in accountant persona for a simple greeting.
- sa102-employment — captures employer + income source from a P60
  description and knows what's still missing for SA102.
- sa105-property — handles UK residential let with the correct
  finance-cost treatment (20% basic-rate credit, NOT a deduction) and
  computes rental profit accordingly.
- sa108-cgt — captures a share disposal, correctly identifies a loss,
  and explains how CGT losses carry forward.
- tax-year-anchoring — maps dates to the correct UK fiscal year
  (6 April - 5 April boundary).
- gap-surfacing — refuses to fabricate missing figures when asked to
  assemble, surfaces them instead (and holds the line under pressure).
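
The 6 April boundary that `tax-year-anchoring` checks can be sketched as a small date helper. This is an illustration only: `ukTaxYear` is a hypothetical name, and the "2024-25" label format is an assumed convention, not something the eval files are known to require.

```typescript
// Returns the UK tax year label (e.g. "2024-25") for a calendar date.
// UTC components are used so the result does not depend on the local time zone.
function ukTaxYear(date: Date): string {
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1; // 1-12
  const day = date.getUTCDate();
  // The tax year starts on 6 April; dates up to and including 5 April
  // belong to the tax year that started in the previous calendar year.
  const startYear = month > 4 || (month === 4 && day >= 6) ? year : year - 1;
  const suffix = (startYear + 1) % 100;
  const endSuffix = suffix < 10 ? `0${suffix}` : `${suffix}`;
  return `${startYear}-${endSuffix}`;
}

console.log(ukTaxYear(new Date("2025-04-05"))); // "2024-25"
console.log(ukTaxYear(new Date("2025-04-06"))); // "2025-26"
```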

Each eval uses llm-rubric assertions for the semantic check and
regex/contains for the fast-path lexical checks.

The original regex `12,420|12420` failed on "£12,420.00" or
"12420.50" even when the number was correct, leaving the llm-rubric
(weight 0.7) to carry the eval alone. Allow an optional decimal tail.
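
The final pattern is not shown on this page, so the following is only a sketch of what "allow an optional decimal tail" could look like. It assumes the lexical check is guarded so that the amount must not be embedded in a longer number; that guard, and the names `strict` and `withDecimalTail`, are assumptions made for illustration.

```typescript
// Assumption: the check rejects matches that run into further digits or
// separators, so "112,420" does not count as the expected amount.
const strict = /(?:^|[^\d.,])(?:12,420|12420)(?![\d,]|\.\d)/;

// Same guard, but an optional decimal tail such as ".00" or ".50" is accepted.
const withDecimalTail = /(?:^|[^\d.,])(?:12,420|12420)(?:\.\d+)?(?![\d,]|\.\d)/;

const samples = ["£12,420", "£12,420.00", "12420.50", "112,420"];
for (const s of samples) {
  // Prints the sample followed by the result of each pattern.
  console.log(s, strict.test(s), withDecimalTail.test(s));
}
```

Under these assumptions the strict pattern accepts only the bare amount, while the relaxed one also accepts "£12,420.00" and "12420.50" and still rejects "112,420".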
buremba force-pushed the feat/personal-finance-evals branch from bb78e50 to b5903db April 26, 2026 16:30
buremba merged commit cf49872 into main Apr 26, 2026
10 checks passed
buremba deleted the feat/personal-finance-evals branch April 26, 2026 16:31
buremba added a commit that referenced this pull request Apr 26, 2026
, #359 install-half) (#372)

The install flow as built — schema-mirror clones a template's entity types /
relationship types / classifiers / watchers into each user's personal org —
was the wrong abstraction. Cross-org vocabulary (an entity in tenant org A
referencing a type defined in a public-catalog org B by FK) is the planned
direction; the mirror pipeline duplicated rows per user and added re-sync
complexity for no working installs (verified 0 rows used the mirror columns
in prod).

Removed:
- packages/owletto-backend/src/agents/install.ts (installAgentFromTemplate, resyncInstalledAgent)
- packages/owletto-backend/src/agents/install-routes.ts (POST /api/install)
- packages/owletto-backend/src/agents/install-manifest-routes.ts (GET /api/install/manifest/:slug)
- All associated integration tests
- subject-identities WhatsApp helpers (normalizePhoneE164, phoneToWhatsAppJid, linkWhatsAppToMember) + their unit tests
- db/migrations/20260425120000_add_template_mirror_tracking.sql (rolled back on prod first)
- Route registrations from src/index.ts

Kept:
- subject-identities.ts provisionMemberAndCoreIdentities — used by the signup
  hook in personal-org-provisioning.ts, orthogonal to install flow.
- #352 personal-org-on-signup, #350/#354/#355/#356 personal-finance content —
  no install dependencies.

DB state: prod migrated down via dbmate (mirror columns dropped), then
20260426120000_entities_entity_type_fk re-applied. 0 user-visible data lost.
buremba restored the feat/personal-finance-evals branch May 12, 2026 00:23