feat(examples): evals for personal-finance agent (SA102/SA105/SA108 + behavioral) by buremba · Pull Request #356 · lobu-ai/lobu

buremba · 2026-04-25T01:57:14Z

Summary

Six evals under `examples/personal-finance/agents/personal-finance/evals/`:

`ping` — agent stays in accountant persona for a simple greeting.
`sa102-employment` — captures employer + income source from a P60 description, knows what's still missing.
`sa105-property` — handles UK residential let with the correct finance-cost treatment (20% basic-rate credit, NOT a deduction) and computes rental profit accordingly.
`sa108-cgt` — captures a share disposal, correctly identifies a loss, explains how CGT losses carry forward.
`tax-year-anchoring` — maps dates to the correct UK fiscal year (6 April–5 April boundary).
`gap-surfacing` — refuses to fabricate missing figures when asked to assemble; holds the line under pressure.
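The boundary that `tax-year-anchoring` checks can be sketched in a few lines. This is an illustrative helper, not code from the PR; the function name `uk_tax_year` and the `YYYY-YY` label format are assumptions made for the example.

```python
from datetime import date


def uk_tax_year(d: date) -> str:
    """Return the UK tax year containing `d`, labelled like '2024-25'.

    The UK tax year runs from 6 April to 5 April of the following year.
    (Hypothetical helper for illustration only.)
    """
    # Dates on or after 6 April belong to the tax year starting that calendar year;
    # earlier dates belong to the tax year that started the previous calendar year.
    start_year = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return f"{start_year}-{(start_year + 1) % 100:02d}"


print(uk_tax_year(date(2025, 4, 5)))  # last day of 2024-25
print(uk_tax_year(date(2025, 4, 6)))  # first day of 2025-26
```

The two dates either side of the boundary are the cases most likely to be anchored to the wrong year.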

Each eval uses `llm-rubric` assertions for the semantic check and `regex`/`contains` for the lexical fast-path, following the `examples/careops/` convention.
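For the `sa105-property` case, the distinction between a deduction and a basic-rate credit can be shown with a small worked example. The figures below are hypothetical and are not the inputs used by the eval. The sketch is also simplified: it caps the creditable amount at the property profit but ignores the adjusted-total-income cap and any finance costs brought forward.

```python
from decimal import Decimal

BASIC_RATE = Decimal("0.20")  # basic-rate credit on residential finance costs


def rental_profit_and_credit(rent: str, allowable_expenses: str, finance_costs: str):
    """Return (taxable rental profit, finance-cost tax credit).

    Simplified sketch: finance costs are NOT deducted from profit; instead a
    20% credit is given against the tax bill. The adjusted-total-income cap
    and brought-forward costs are deliberately ignored here.
    """
    rent_d = Decimal(rent)
    expenses_d = Decimal(allowable_expenses)
    finance_d = Decimal(finance_costs)

    profit = rent_d - expenses_d  # mortgage interest is not subtracted here
    creditable = min(finance_d, max(profit, Decimal("0")))
    credit = creditable * BASIC_RATE
    return profit, credit


# Hypothetical figures: 18,000 rent, 3,000 expenses, 6,000 mortgage interest.
profit, credit = rental_profit_and_credit("18000", "3000", "6000")
print(profit, credit)
```

With these figures the taxable profit stays at 15,000 rather than dropping to 9,000, and the interest instead yields a 1,200 reduction in the tax due.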

Stacked on

Targets `feat/personal-finance-example` (#350). Rebase onto main once #350 merges.

Test plan

All 6 eval YAMLs parse.
Pre-commit checks pass (Biome + tsc — no code changes).
Run `lobu eval` against the seeded personal-finance agent and check pass rates.
Iterate on any thresholds/assertions that are too loose or too tight based on actual model output.

chatgpt-codex-connector · 2026-04-25T01:57:18Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Six evals covering the behaviors we most want to guard:

- ping: agent stays in accountant persona for a simple greeting.
- sa102-employment: captures employer + income source from a P60 description and knows what's still missing for SA102.
- sa105-property: handles UK residential let with the correct finance-cost treatment (20% basic-rate credit, NOT a deduction) and computes rental profit accordingly.
- sa108-cgt: captures a share disposal, correctly identifies a loss, and explains how CGT losses carry forward.
- tax-year-anchoring: maps dates to the correct UK fiscal year (6 April - 5 April boundary).
- gap-surfacing: refuses to fabricate missing figures when asked to assemble, surfaces them instead (and holds the line under pressure).

Each eval uses llm-rubric assertions for the semantic check and regex/contains for the fast-path lexical checks.

, #359 install-half) (#372)

The install flow as built (schema-mirror clones a template's entity types / relationship types / classifiers / watchers into each user's personal org) was the wrong abstraction. Cross-org vocabulary (an entity in tenant org A referencing a type defined in a public-catalog org B by FK) is the planned direction; the mirror pipeline duplicated rows per user and added re-sync complexity for no working installs (verified 0 rows used the mirror columns in prod).

Removed:

- packages/owletto-backend/src/agents/install.ts (installAgentFromTemplate, resyncInstalledAgent)
- packages/owletto-backend/src/agents/install-routes.ts (POST /api/install)
- packages/owletto-backend/src/agents/install-manifest-routes.ts (GET /api/install/manifest/:slug)
- All associated integration tests
- subject-identities WhatsApp helpers (normalizePhoneE164, phoneToWhatsAppJid, linkWhatsAppToMember) + their unit tests
- db/migrations/20260425120000_add_template_mirror_tracking.sql (rolled back on prod first)
- Route registrations from src/index.ts

Kept:

- subject-identities.ts provisionMemberAndCoreIdentities: used by the signup hook in personal-org-provisioning.ts, orthogonal to install flow.
- #352 personal-org-on-signup, #350/#354/#355/#356 personal-finance content: no install dependencies.

DB state: prod migrated down via dbmate (mirror columns dropped), then 20260426120000_entities_entity_type_fk re-applied. 0 user-visible data lost.

Base automatically changed from feat/personal-finance-example to main April 26, 2026 16:23

buremba added 2 commits April 26, 2026 17:30

fix(examples): accept formatting variants in sa105 regex

b5903db

The original regex `12,420|12420` failed on "£12,420.00" or "12420.50" even when the number is correct, leaving the llm-rubric (weight 0.7) to carry the eval alone. Allow an optional decimal tail.
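The effect of allowing a decimal tail can be sketched as follows. The relaxed pattern here is an assumption written for illustration; the commit's actual pattern is not shown on this page. The sketch also assumes the check is applied to a whole numeric token (hence `fullmatch`), which is one way the original alternation could reject a correctly valued figure.

```python
import re

# Pattern quoted in the commit message.
ORIGINAL = re.compile(r"12,420|12420")

# Hypothetical relaxed pattern: optional pound sign, optional thousands comma,
# optional decimal tail. Not necessarily the pattern that was committed.
RELAXED = re.compile(r"£?12,?420(?:\.\d+)?")


def matches_token(pattern: re.Pattern, token: str) -> bool:
    """True if the whole token matches the pattern (assumed anchoring)."""
    return pattern.fullmatch(token) is not None


for token in ["12,420", "12420", "£12,420.00", "12420.50"]:
    print(token, matches_token(ORIGINAL, token), matches_token(RELAXED, token))
```

Under that anchoring assumption the original pattern accepts only the two bare forms, while the relaxed one also accepts the currency-prefixed and decimal variants, so the lexical check no longer leaves the llm-rubric to carry the eval alone.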

buremba force-pushed the feat/personal-finance-evals branch from bb78e50 to b5903db April 26, 2026 16:30

buremba merged commit cf49872 into main Apr 26, 2026
10 checks passed

buremba deleted the feat/personal-finance-evals branch April 26, 2026 16:31

This was referenced Apr 26, 2026

chore(main): release lobu 5.0.0 #368

Merged

revert(install-flow): remove template-install pipeline #372

Merged

buremba restored the feat/personal-finance-evals branch May 12, 2026 00:23
