feat(examples): evals for personal-finance agent (SA102/SA105/SA108 + behavioral) #356
Merged
Six evals covering the behaviors we most want to guard:

- ping: agent stays in accountant persona for a simple greeting.
- sa102-employment: captures employer + income source from a P60 description and knows what's still missing for SA102.
- sa105-property: handles UK residential let with the correct finance-cost treatment (20% basic-rate credit, NOT a deduction) and computes rental profit accordingly.
- sa108-cgt: captures a share disposal, correctly identifies a loss, and explains how CGT losses carry forward.
- tax-year-anchoring: maps dates to the correct UK fiscal year (6 April - 5 April boundary).
- gap-surfacing: refuses to fabricate missing figures when asked to assemble, surfaces them instead (and holds the line under pressure).

Each eval uses llm-rubric assertions for the semantic check and regex/contains for the fast-path lexical checks.
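For reference, the 6 April - 5 April boundary that tax-year-anchoring checks can be sketched as below. This is only an illustration of the rule, not code from the PR; the function name `uk_tax_year` and the "YYYY/YY" label format are assumptions.

```python
from datetime import date


def uk_tax_year(d: date) -> str:
    """Return the UK tax year containing `d` as a 'YYYY/YY' label (hypothetical helper)."""
    # The UK tax year starts on 6 April; anything up to and including
    # 5 April still belongs to the year that started the previous April.
    start_year = d.year if (d.month, d.day) >= (4, 6) else d.year - 1
    return f"{start_year}/{(start_year + 1) % 100:02d}"
```

Under this sketch, 5 April 2025 falls in 2024/25 and 6 April 2025 falls in 2025/26, which is the kind of edge case the eval is meant to catch.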
The original regex `12,420|12420` failed on "£12,420.00" or "12420.50" even when the number is correct, leaving the llm-rubric (weight 0.7) to carry the eval alone. Allow an optional decimal tail.
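A minimal sketch of the relaxed pattern, assuming the lexical check has to cover the whole numeric token (the commit message does not say how the eval harness applies the regex). The `RELAXED` pattern and the `whole_token_match` helper are illustrative, not the exact code in the eval files.

```python
import re

# The alternation quoted in the commit message.
ORIGINAL = re.compile(r"12,420|12420")
# Assumed shape of the fix: optional thousands comma plus an optional decimal tail.
RELAXED = re.compile(r"12,?420(?:\.\d+)?")


def whole_token_match(pattern: re.Pattern, token: str) -> bool:
    """True when `pattern` covers the entire token, ignoring a leading pound sign."""
    return pattern.fullmatch(token.lstrip("£")) is not None
```

With whole-token matching, `ORIGINAL` rejects "£12,420.00" while `RELAXED` accepts it, and both still accept the bare "12,420".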
bb78e50 to b5903db
This was referenced Apr 26, 2026
buremba added a commit that referenced this pull request Apr 26, 2026
, #359 install-half) (#372)

The install flow as built (schema-mirror clones a template's entity types / relationship types / classifiers / watchers into each user's personal org) was the wrong abstraction. Cross-org vocabulary (an entity in tenant org A referencing a type defined in a public-catalog org B by FK) is the planned direction; the mirror pipeline duplicated rows per user and added re-sync complexity for no working installs (verified 0 rows used the mirror columns in prod).

Removed:

- packages/owletto-backend/src/agents/install.ts (installAgentFromTemplate, resyncInstalledAgent)
- packages/owletto-backend/src/agents/install-routes.ts (POST /api/install)
- packages/owletto-backend/src/agents/install-manifest-routes.ts (GET /api/install/manifest/:slug)
- All associated integration tests
- subject-identities WhatsApp helpers (normalizePhoneE164, phoneToWhatsAppJid, linkWhatsAppToMember) + their unit tests
- db/migrations/20260425120000_add_template_mirror_tracking.sql (rolled back on prod first)
- Route registrations from src/index.ts

Kept:

- subject-identities.ts provisionMemberAndCoreIdentities: used by the signup hook in personal-org-provisioning.ts, orthogonal to install flow.
- #352 personal-org-on-signup, #350/#354/#355/#356 personal-finance content: no install dependencies.

DB state: prod migrated down via dbmate (mirror columns dropped), then 20260426120000_entities_entity_type_fk re-applied. 0 user-visible data lost.
Summary
Six evals under `examples/personal-finance/agents/personal-finance/evals/`:
Each eval uses `llm-rubric` assertions for the semantic check and `regex`/`contains` for the lexical fast-path, following the `examples/careops/` convention.
Stacked on
Targets `feat/personal-finance-example` (#350). Rebase onto main once #350 merges.
Test plan