feat(evals)!: drop in-house YAML runner, ship @lobu/promptfoo-provider (#911)
…ptfoo-provider
Drop packages/cli/src/eval/ (client/grader/reporter/runner/types) and the `lobu eval` command. Eval authoring moves to promptfoo's native promptfooconfig.yaml format; a new @lobu/promptfoo-provider workspace package drives a Lobu agent end-to-end via the gateway's public Agent API (POST /api/v1/agents -> /messages -> SSE /events -> DELETE).
Adds examples/qmsum-demo/: a new example project that ingests Yale-LILY's QMSum meeting-summarization dataset as merged speaking-turn events with per-domain speaker entity rules (Academic per-meeting, Product/Committee per-domain), exposes the corpus to any MCP client, and ships a promptfooconfig.yaml with four eval suites: answer-quality vs gold, meeting-summary vs gold, speaker-attribution, cross-meeting synthesis.
Known limitation flagged in both READMEs: retrieval-recall + context-recall + context-faithfulness need the gateway to emit tool_use SSE events so the provider can populate metadata.toolCalls / metadata.retrievedContext. The scenario is stubbed in promptfooconfig.yaml; follow-up gateway PR will unlock it.
Personal-finance evals are deferred: their multi-turn semantics don't map cleanly to promptfoo's single-turn parametric tests. README in their evals/ dir explains the migration path.
Net diff: -2,719 LOC (deletions of the in-house runner + tests), +1,506 LOC (provider package, example project, prompts).
…ce eval
Per user direction, narrowing this PR to:
* Delete packages/cli/src/eval/ + the `lobu eval` command (already in
previous commit on this branch).
* Ship @lobu/promptfoo-provider as the published replacement plugin.
* Use it in an existing example (personal-finance) to prove the wiring.
Drops examples/qmsum-demo/ entirely — that work belongs in a follow-up PR
once the gateway tool_use SSE events land (without them, the QMSum demo's
killer retrieval-recall beat doesn't function).
Migrates two single-turn personal-finance evals (ping + tax-year-anchoring,
split into 2 independent cases) into a real promptfooconfig.yaml. The four
multi-turn behavioural YAMLs (gap-surfacing, sa102, sa105, sa108) stay
dormant pending either a provider-side multi-turn extension or a flattening
port; documented in their evals/README.md.
Provider id uses promptfoo's package: protocol:
package:@lobu/promptfoo-provider:LobuProvider
Cleans lingering `lobu eval` references in Makefile, packages/cli/README.md,
skills/lobu/SKILL.md, and a stale dev.ts comment.
Adds packages/promptfoo-provider/HANDOFF.md documenting the must-fix
plumbing changes that a follow-up agent (with core-code access) needs to
make: release-please-config.json, scripts/publish-packages.mjs, Makefile
build-packages target. Plus the should-fix gateway SSE tool_use change that
unlocks the RAG-specific assertions.
BREAKING CHANGE: The in-house `lobu eval` command and YAML eval schema are
removed. Migrate evals to promptfoo + @lobu/promptfoo-provider; see
examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
for the new pattern.
}
this.agent = agent;
this.gateway = gateway.replace(/\/+$/, "");
Caution: Review failed. Pull request was closed or merged during review.
📝 Walkthrough
This PR removes the in-house
Changes
Eval system migration from CLI to promptfoo
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.
🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Makefile (1)
33-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Add packages/promptfoo-provider to build-packages before merge.
make build-packages currently skips the new provider package, so the default monorepo build path won’t compile it. This leaves the migration only partially wired for local validation and CI parity.
Suggested diff
 build-packages:
 	@echo "📦 Building all TypeScript packages..."
-	@for pkg in core connector-sdk agent-worker openclaw-plugin embeddings connector-worker; do \
+	@for pkg in core connector-sdk agent-worker openclaw-plugin embeddings connector-worker promptfoo-provider; do \
 		echo " 📦 Building packages/$$pkg..."; \
 		( cd packages/$$pkg && bun run build ) || exit $$?; \
 	done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Makefile` around lines 33 - 43, The build-packages Makefile target currently omits the new package; update the build-packages recipe (the for-loop and/or subsequent steps) to include packages/promptfoo-provider so it is built with the rest of the monorepo. Specifically, modify the list in the for-loop that iterates over core connector-sdk agent-worker openclaw-plugin embeddings connector-worker to also include promptfoo-provider (or add an explicit @( cd packages/promptfoo-provider && bun run build ) step similar to packages/server/packages/cli) so the promptfoo-provider package is compiled during make build-packages.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@AGENTS.md`:
- Line 203: Update the stale reference in AGENTS.md that points to the removed
example string "examples/qmsum-demo/": replace it with the live canonical eval
example (e.g., the personal-finance promptfoo config, such as
"examples/personal-finance/") so the docs remain self-consistent; edit the
sentence that currently mentions `examples/qmsum-demo/` to reference the new
example and confirm the referenced example actually exists and demonstrates
custom provider auto-wiring, parametric JSONL tests, and RAG + answer-quality
assertions.
In
`@examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml`:
- Around line 35-37: Change the case-sensitive regex rule to a case-insensitive
contains rule: replace the rule where type is "regex" and value is
'hello|hi\b|hey|yes|here|ready' with type "icontains-any" and supply the
greetings as a list (e.g., value: ["hello","hi","hey","yes","here","ready"]) so
the ping check matches capitalized variants like "Hello" or "Hi"; update the
entry in the same YAML block containing type/value to ensure test stability.
In `@packages/promptfoo-provider/HANDOFF.md`:
- Around line 5-18: Add the new package to the repo's release/publish/build
configuration: add a packages entry for "packages/promptfoo-provider" in
release-please-config.json (mirror the existing block for
packages/connector-sdk), append 'promptfoo-provider' to the PACKAGES array in
scripts/publish-packages.mjs so the package is included in the npm publish flow,
and add promptfoo-provider to the for pkg in ... list in the Makefile's
build-packages target so CI produces the dist/ output; update the exact
string/entry names to match the canonical package name used elsewhere (e.g.,
'`@lobu/promptfoo-provider`').
In `@packages/promptfoo-provider/package.json`:
- Around line 8-17: The package.json exports map both "import" and "require" to
the same ESM output while package.json sets "type": "module" (and tsconfig uses
"module": "ESNext"), which will break CommonJS require consumers; either remove
the "require" export entry from the "exports" field to make the package
ESM-only, or produce a separate CommonJS build (e.g., dist/index.cjs) and change
the "require" export to point to that CJS artifact (and ensure dist/index.d.ts
still points to the types); update the "exports" block accordingly so "import"
-> ./dist/index.js and "require" -> ./dist/index.cjs if you add a CJS build,
otherwise delete the "require" mapping.
In `@packages/promptfoo-provider/README.md`:
- Line 60: Update the README entry for traceId so it matches the implementation:
change the comment on traceId (symbol: traceId) to state that the W3C trace id
is read from the `traceparent` field in the /messages JSON response body
(symbol: `traceparent`), not from an incoming HTTP `traceparent` header; ensure
the wording explicitly references the /messages response body to avoid
confusion.
In `@packages/promptfoo-provider/src/provider.ts`:
- Around line 175-182: createSession, sendMessage, and deleteSession call fetch
without an AbortController so they can hang; update each to mirror
collectResponse by creating an AbortController, set a timeout using
this.defaultTimeoutMs that calls controller.abort(), pass controller.signal to
fetch, and clear the timer after fetch finishes (or in finally) so network calls
respect the provider's timeout behavior.
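The timeout pattern this comment asks for can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: `withTimeout` is a hypothetical helper name, and the only thing taken from the review is the idea of pairing an `AbortController` with a timer that is always cleared.

```typescript
// Hypothetical helper (not from the PR): runs any signal-aware async call
// under a deadline and always clears the timer afterwards.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  // Abort the in-flight call once the deadline passes.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await run(controller.signal);
  } finally {
    // Clear the timer on success and on failure so it cannot fire later.
    clearTimeout(timer);
  }
}
```

Under that assumption, a call site might look like `withTimeout((signal) => fetch(url, { method: "POST", signal }), this.defaultTimeoutMs)`, which keeps the three session methods consistent without repeating the timer bookkeeping.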
---
Outside diff comments:
In `@Makefile`:
- Around line 33-43: The build-packages Makefile target currently omits the new
package; update the build-packages recipe (the for-loop and/or subsequent steps)
to include packages/promptfoo-provider so it is built with the rest of the
monorepo. Specifically, modify the list in the for-loop that iterates over core
connector-sdk agent-worker openclaw-plugin embeddings connector-worker to also
include promptfoo-provider (or add an explicit @( cd packages/promptfoo-provider
&& bun run build ) step similar to packages/server/packages/cli) so the
promptfoo-provider package is compiled during make build-packages.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: ab1ec922-7aa3-4755-ac6b-5fcd034e59ae
⛔ Files ignored due to path filters (1)
- bun.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
- AGENTS.md
- Makefile
- examples/personal-finance/agents/personal-finance/evals/README.md
- examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
- examples/personal-finance/package.json
- packages/cli/README.md
- packages/cli/src/__tests__/cli-ux.test.ts
- packages/cli/src/__tests__/eval-schema.test.ts
- packages/cli/src/commands/dev.ts
- packages/cli/src/commands/eval.ts
- packages/cli/src/eval/client.ts
- packages/cli/src/eval/grader.ts
- packages/cli/src/eval/reporter.ts
- packages/cli/src/eval/runner.ts
- packages/cli/src/eval/types.ts
- packages/cli/src/index.ts
- packages/promptfoo-provider/HANDOFF.md
- packages/promptfoo-provider/README.md
- packages/promptfoo-provider/package.json
- packages/promptfoo-provider/src/index.ts
- packages/promptfoo-provider/src/provider.ts
- packages/promptfoo-provider/tsconfig.json
- skills/lobu/SKILL.md
💤 Files with no reviewable changes (9)
- packages/cli/src/__tests__/eval-schema.test.ts
- packages/cli/src/eval/runner.ts
- packages/cli/src/eval/reporter.ts
- packages/cli/src/__tests__/cli-ux.test.ts
- packages/cli/src/commands/eval.ts
- packages/cli/src/eval/grader.ts
- packages/cli/src/index.ts
- packages/cli/src/eval/types.ts
- packages/cli/src/eval/client.ts
- type: regex
  value: 'hello|hi\b|hey|yes|here|ready'
  weight: 0.3
🧩 Analysis chain
🌐 Web query:
In promptfoo assertion configs, is type: regex case-sensitive by default, and does it support a flags: i option for case-insensitive matching?
💡 Result:
In Promptfoo assertion configs, type: regex does not document any flags option, so case-insensitivity is not something you can enable via flags: i (and therefore it’s not documented as “case-insensitive by default”).[1] Promptfoo’s docs only describe type: regex / regex as “output matches the provided regular expression,” with no mention of automatic case-folding or a flags field.[1][2] Additionally, Promptfoo provides separate case-insensitive assertions like icontains, icontains-any, and icontains-all, reinforcing that case-insensitive behavior is handled by distinct assertion types rather than regex flags.[1] So, based on the current official docs: 1) type: regex is not stated to be case-insensitive by default (and no default i behavior is described). [1] 2) flags: i is not supported/documented for the regex assertion. [1][2]
Citations:
- 1: https://www.promptfoo.dev/docs/configuration/expected-outputs/
- 2: https://github.com/promptfoo/promptfoo/blob/ff310fef/site/docs/configuration/expected-outputs/deterministic.md
Use case-insensitive matching for the ping regex.
At Line 36, type: regex is case-sensitive and will miss capitalized greetings like "Hello" or "Hi", causing test flakiness. Replace with type: icontains-any to match any of the greeting values case-insensitively:
Proposed fix
- - type: regex
-   value: 'hello|hi\b|hey|yes|here|ready'
+ - type: icontains-any
+   value: ["hello", "hi", "hey", "yes", "here", "ready"]
    weight: 0.3
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml`
around lines 35 - 37, Change the case-sensitive regex rule to a case-insensitive
contains rule: replace the rule where type is "regex" and value is
'hello|hi\b|hey|yes|here|ready' with type "icontains-any" and supply the
greetings as a list (e.g., value: ["hello","hi","hey","yes","here","ready"]) so
the ping check matches capitalized variants like "Hello" or "Hi"; update the
entry in the same YAML block containing type/value to ensure test stability.
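The failure mode described here can be reproduced with plain JavaScript regex semantics. The sketch below is illustrative only: `containsAnyIgnoreCase` is a hypothetical stand-in for a case-insensitive "contains any" check and is not promptfoo's implementation.

```typescript
// The pattern from the config, compiled without the `i` flag.
const greetingPattern = new RegExp("hello|hi\\b|hey|yes|here|ready");

// Case-sensitive check: mirrors what a flagless regex does.
function matchesCaseSensitive(output: string): boolean {
  return greetingPattern.test(output);
}

// Hypothetical helper: lowercases both sides, then does substring matching.
function containsAnyIgnoreCase(output: string, needles: string[]): boolean {
  const haystack = output.toLowerCase();
  return needles.some((needle) => haystack.includes(needle.toLowerCase()));
}

const greetings = ["hello", "hi", "hey", "yes", "here", "ready"];
const reply = "Hello! How can I help?";

console.log(matchesCaseSensitive(reply)); // false: capital "H" defeats the pattern
console.log(containsAnyIgnoreCase(reply, greetings)); // true
```

One trade-off worth keeping in mind: plain substring matching drops the `\b` word boundary, so a needle like "hi" also matches inside words such as "this".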
"exports": {
  ".": {
    "import": {
      "types": "./dist/index.d.ts",
      "default": "./dist/index.js"
    },
    "require": {
      "types": "./dist/index.d.ts",
      "default": "./dist/index.js"
    }
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify this package currently emits only ESM and maps `require` to the same ESM artifact.
set -euo pipefail
echo "== package exports =="
cat packages/promptfoo-provider/package.json | sed -n '1,80p'
echo
echo "== tsconfig module settings =="
cat packages/promptfoo-provider/tsconfig.json | sed -n '1,60p'
echo
echo "== check for any cjs outputs configured in package scripts =="
rg -n '"build"|cjs|commonjs|module"' packages/promptfoo-provider/package.json packages/promptfoo-provider/tsconfig.json
Repository: lobu-ai/lobu
Length of output: 1898
Remove the require export or provide a separate CommonJS build.
The package is configured as "type": "module" with TypeScript emitting ESM code ("module": "ESNext"), but the exports field maps both import and require to the same ESM artifact. CommonJS consumers calling require('@lobu/promptfoo-provider') will fail at runtime.
Either remove the require entry (simplest approach since this is ESM-only) or add a CJS build and map require.default to the CJS output:
Proposed fix (remove require export)
"exports": {
  ".": {
    "import": {
      "types": "./dist/index.d.ts",
      "default": "./dist/index.js"
    }
  }
},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/promptfoo-provider/package.json` around lines 8 - 17, The
package.json exports map both "import" and "require" to the same ESM output
while package.json sets "type": "module" (and tsconfig uses "module": "ESNext"),
which will break CommonJS require consumers; either remove the "require" export
entry from the "exports" field to make the package ESM-only, or produce a
separate CommonJS build (e.g., dist/index.cjs) and change the "require" export
to point to that CJS artifact (and ensure dist/index.d.ts still points to the
types); update the "exports" block accordingly so "import" -> ./dist/index.js
and "require" -> ./dist/index.cjs if you add a CJS build, otherwise delete the
"require" mapping.
metadata: {
  agent: string
  thread: string // fresh per call by default
  traceId?: string // W3C trace id from `traceparent` header
Trace ID source is documented incorrectly.
Line 60 says traceId comes from a traceparent header, but the provider currently reads it from the /messages JSON response body field (traceparent). Please align the wording with implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/promptfoo-provider/README.md` at line 60, Update the README entry
for traceId so it matches the implementation: change the comment on traceId
(symbol: traceId) to state that the W3C trace id is read from the `traceparent`
field in the /messages JSON response body (symbol: `traceparent`), not from an
incoming HTTP `traceparent` header; ensure the wording explicitly references the
/messages response body to avoid confusion.
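For readers unfamiliar with the format, a W3C `traceparent` value has the shape `version-traceid-parentid-flags`, and the trace id is its second field. The sketch below shows one way to pull it out of a response body; `traceIdFromTraceparent` and the `body` shape are assumptions for illustration, not the provider's code.

```typescript
// Hypothetical helper: extracts the 32-hex-digit trace id from a W3C
// traceparent string, returning undefined for anything malformed.
function traceIdFromTraceparent(traceparent: string | undefined): string | undefined {
  if (!traceparent) return undefined;
  const match = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/.exec(
    traceparent.trim(),
  );
  if (!match) return undefined;
  const traceId = match[1];
  // The W3C Trace Context spec treats an all-zero trace id as invalid.
  return /^0+$/.test(traceId) ? undefined : traceId;
}

// Assumed shape of the /messages JSON body: only the `traceparent` field
// name comes from the review comment above.
const body: { traceparent?: string } = {
  traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
};

console.log(traceIdFromTraceparent(body.traceparent)); // 4bf92f3577b34da6a3ce929d0e0e4736
```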
…s, gateway path
Three bugs found exercising the provider end-to-end with promptfoo:
1. promptfoo's package-protocol loader unwraps `default` exports before
looking up the entity name on the module:
const mod = importedModule?.default || importedModule
const entity = mod[entityName]
With both `default` and named `LobuProvider`, `mod` became the class
itself and `mod['LobuProvider']` was undefined. Drop the default export
so the loader falls through to the namespace object.
2. examples/personal-finance is not picked up by the monorepo workspace
list (root package.json only globs packages/*). Bun couldn't resolve
`@lobu/promptfoo-provider@workspace:*`. Add `examples/personal-finance`
to the root workspaces array. Other examples don't have a package.json
yet so they're unaffected.
3. The public Agent API is mounted at `/lobu` (the gateway serves
org-scoped REST at `/`; see packages/server/src/server.ts). Old
eval/client.ts only worked because resolveGatewayUrl appended `/lobu`
upstream. The provider now uses `${gateway}/lobu/api/v1/agents` so
`LOBU_GATEWAY=http://localhost:8787` works without manual prefixing.
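The first bug is easy to reproduce in isolation. The sketch below models only the two-line lookup quoted in item 1; `resolveEntity` is a hypothetical name, and this is a simplified model rather than promptfoo's actual loader source.

```typescript
class LobuProvider {}

// Simplified model of the lookup quoted in item 1 above.
function resolveEntity(importedModule: any, entityName: string): unknown {
  const mod = importedModule?.default || importedModule;
  return mod[entityName];
}

// Module shape before the fix: default export plus named export.
const withDefault = { default: LobuProvider, LobuProvider };
// Module shape after the fix: named export only.
const namedOnly = { LobuProvider };

// `mod` becomes the class itself, which has no `LobuProvider` property.
console.log(resolveEntity(withDefault, "LobuProvider")); // undefined
// `mod` stays the namespace object, so the named lookup succeeds.
console.log(resolveEntity(namedOnly, "LobuProvider") === LobuProvider); // true
```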
Smoke test now reaches a 401 (Unauthorized) — the correct failure shape
for a dummy LOBU_TOKEN against a live gateway. With a valid token the
eval will run through.
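The URL construction from item 3 can be sketched as below. `agentsEndpoint` is a hypothetical function name; the trailing-slash stripping and the `/lobu/api/v1/agents` path are the two details taken from this PR.

```typescript
// Hypothetical helper combining the trailing-slash normalisation with the
// `/lobu` mount point described in item 3.
function agentsEndpoint(gateway: string): string {
  const base = gateway.replace(/\/+$/, ""); // strip any trailing slashes
  return `${base}/lobu/api/v1/agents`;
}

console.log(agentsEndpoint("http://localhost:8787"));
console.log(agentsEndpoint("http://localhost:8787/"));
// both print http://localhost:8787/lobu/api/v1/agents
```

Normalising the base once means callers can set `LOBU_GATEWAY` with or without a trailing slash and still get a well-formed path.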
8fc444a to 81471e5
Codex review caught comments in provider.ts and HANDOFF.md still referring to `/api/v1/agents` instead of `/lobu/api/v1/agents`. Update both. Also add a note to HANDOFF that root package.json's `build:packages` script (invoked by scripts/publish-packages.mjs) needs the new package added, not just the Makefile target.
…h + rewrite landing docs
Addresses the must-fix items from the codex review of PR #911:
* release-please-config.json: add packages/promptfoo-provider/package.json to extra-files so the new package gets bumped with the monorepo.
* scripts/publish-packages.mjs: append { dir: "packages/promptfoo-provider", transform: rewriteWorkspaceRefs } to PACKAGES so it actually publishes to npm.
* root package.json: add `cd ../promptfoo-provider && bun run build` to the build:packages chain (the script publish-packages.mjs calls before npm publish).
* Makefile: add promptfoo-provider to the build-packages target's pkg list.
Landing docs rewritten to point at promptfoo + @lobu/promptfoo-provider:
* packages/landing/src/content/docs/guides/evals.md: full rewrite; the old doc described the deleted YAML runner. New doc covers promptfoo install, promptfooconfig.yaml shape, the `package:` protocol provider id, assertion types, parametric tests, the personal-finance example, known limitations.
* packages/landing/src/content/docs/getting-started/index.mdx: replace `npx @lobu/cli@latest eval` with `bunx promptfoo eval`.
* packages/landing/src/content/docs/guides/testing.md: same.
* packages/landing/src/content/docs/reference/cli.md: drop the `eval [name]` / `eval new` sections, point at the Evaluations guide.
* codex-skills/lobu-builder/SKILL.md: replace lobu eval references with promptfoo invocation.
* Two test file comments referencing `lobu eval` updated to be accurate.
…mpotency (#914)
* chore(build): bun lockfile docs, login polling cleanup, migration idempotency
Four hygiene gaps surfaced during the PR #911 end-to-end test:
1. AGENTS.md: document the "bun lockfile + owletto submodule" interaction. CI initialises the submodule before `bun install --frozen-lockfile`, so the lockfile on `main` always reflects an initialised submodule. Local pushes from an uninitialised checkout silently regenerate `bun.lock` and trip the next CI run. Document the pre-push check.
2. AGENTS.md: point IDE users at the biome plugin so save-time formatting matches the Husky pre-commit `biome check --write` hook. Without an integration the hook keeps rewriting files behind the editor's back.
3. `lobu login` device-flow polling cleanup. Backgrounded `lobu login &` shells were observed hammering `/oauth/token` at the polling interval long after the parent shell exited. Adds:
   - SIGHUP / SIGTERM / SIGINT handlers that exit the poll loop cleanly.
   - Hard 5-minute ceiling on the loop in addition to the server `expires_in` deadline.
   - Non-interactive bail: when `--quiet` or `!isTTY`, a `pending` poll is terminal: print "use --token <pat>" and exit instead of looping until expiry.
4. Migration runner idempotency. The squashed baseline uses plain `CREATE FUNCTION` / `CREATE TABLE`, so replaying against a DB that already has the schema raises 42723 / 42P07 / 42710. Catch those specific codes, log "Migration already applied (idempotent skip)", and record the version as applied. Non-idempotent errors still propagate.
No new migrations, no schema changes.
* address pi review: stdin TTY check, abortable sleep, baseline-only idempotency
Pi findings on PR #914:
1. login: require stdin TTY in addition to stdout for `isInteractive`. A `lobu login </dev/null` with a stdout TTY no longer misidentifies as interactive and falls through to polling.
2. login: cap the polling sleep at `deadline - now`, and make it abortable so SIGHUP/SIGTERM/SIGINT (or the hard 5-min ceiling) wake the loop immediately instead of waiting out the current interval (which `slow_down` can balloon to >30s).
3. migrations: restrict the duplicate-object short-circuit to the squashed baseline version (`00000000000000`) via an `IDEMPOTENT_BASELINE_VERSIONS` allowlist. Future delta migrations must use `IF NOT EXISTS` discipline; the runner no longer masks mid-file failures by marking the version applied just because one statement collided.
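The baseline-only idempotency rule could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `isIdempotentSkip` and `DUPLICATE_OBJECT_CODES` are hypothetical names, and reading the SQLSTATE from an error's `code` property is an assumption about the database client. Only the three error codes, the baseline version, and the `IDEMPOTENT_BASELINE_VERSIONS` name come from the commit message.

```typescript
// SQLSTATE codes named in the commit message:
// 42723 duplicate_function, 42P07 duplicate_table, 42710 duplicate_object.
const DUPLICATE_OBJECT_CODES = new Set(["42723", "42P07", "42710"]);

// Only the squashed baseline is allowed to short-circuit on these codes.
const IDEMPOTENT_BASELINE_VERSIONS = new Set(["00000000000000"]);

// Hypothetical predicate: true when a migration failure can be treated as
// "already applied". Assumes the client exposes SQLSTATE as `error.code`.
function isIdempotentSkip(version: string, error: unknown): boolean {
  const code =
    typeof error === "object" && error !== null
      ? (error as { code?: unknown }).code
      : undefined;
  return (
    typeof code === "string" &&
    IDEMPOTENT_BASELINE_VERSIONS.has(version) &&
    DUPLICATE_OBJECT_CODES.has(code)
  );
}
```

Any other combination (a later migration version, or a different SQLSTATE) returns false, so the error would propagate as the commit message describes.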
…lock image builds (#927)
PR #911 added `examples/personal-finance` to root `package.json`'s `workspaces` field but didn't update the Dockerfiles, which only COPY `packages/*/package.json` for the install layer. `bun install` inside the Docker build then errored:
error: Workspace not found "examples/personal-finance" at /app/package.json:8:5
Every image build on `main` since #911 merged (13:25 UTC today) has been red: #911 → #913 (+revert) → #914 → #915 → #919 → #923 → #924 → #912 → #925, all sitting on `main` un-deployable, including the `principal_kind` migration from #923 and my own loading-skeletons shipping artifacts.
Two ways to fix it:
1. **Add stubs to all three Dockerfiles** for the example. Treats the symptom; couples prod build pipeline to whatever's under `examples/`, wrong direction.
2. **Take the example out of root workspaces.** Examples are documentation/demos for users to clone + run; they don't belong in the prod build graph. Cleaner separation.
Going with (2). Side effects:
- Example's dependency on `@lobu/promptfoo-provider` switched from `workspace:*` (workspace-protocol-only) to `file:../../packages/promptfoo-provider`. Resolves locally without requiring the example to be in a workspace; consumers run `cd examples/personal-finance && bun install` standalone (after building the provider once: `cd packages/promptfoo-provider && bun run build`).
- `bun.lock` regenerated. Most of the diff is bun's "linked workspaces" table shrinking; no upstream version churn.
Verified: simulated Docker build context (root files + stubbed packages/* manifests + provider stub, no examples/) runs `bun install` cleanly. No "Workspace not found" error.
Summary
- Drop `packages/cli/src/eval/` (client/grader/reporter/runner/types) and the `lobu eval` command. Net −2,700 LOC.
- Ship `@lobu/promptfoo-provider`: a new workspace package that drives a Lobu agent via the gateway's public Agent API and plugs into promptfoo via its `package:` protocol.
- Migrate `personal-finance` evals (ping + tax-year-anchoring split into 2 independent cases) into a real `promptfooconfig.yaml` to prove the wiring.
- Clean lingering `lobu eval` references in Makefile, `packages/cli/README.md`, `skills/lobu/SKILL.md`, a stale `dev.ts` comment, and `AGENTS.md`.
Out of scope (deliberately deferred)
- `personal-finance` behavioural YAMLs (`gap-surfacing`, `sa102`, `sa105`, `sa108`) stay on disk but dormant. Multi-turn doesn't map cleanly to promptfoo's single-turn parametric tests; documented in `evals/README.md`. Follow-up PR can either extend the provider or flatten the conversations.
- `examples/qmsum-demo/` is dropped from this PR: it depends on a gateway `tool_use` SSE event that doesn't exist yet, without which the retrieval-recall scoring it was built around can't function.
Handoff to a core-code agent
This branch intentionally does not touch core build/release/publish plumbing or gateway protocol. Three must-fix items + one should-fix are documented in `packages/promptfoo-provider/HANDOFF.md`:
- `release-please-config.json`: add `packages/promptfoo-provider` so it gets versioned with the monorepo.
- `scripts/publish-packages.mjs`: add it to the `PACKAGES` array so it actually publishes to npm.
- `Makefile` `build-packages` target: add it to the build chain.
- Gateway SSE `tool_use` event type (should-fix): unlocks `context-recall` / `context-faithfulness` / custom assertions that need `metadata.toolCalls` / `metadata.retrievedContext`.
BREAKING CHANGE
`lobu eval` and the YAML eval schema (`agents/<id>/evals/*.yaml`) are removed. Migrate to promptfoo + `@lobu/promptfoo-provider`. See `examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml` for the new pattern.
Test plan
- `bun run build` inside `packages/promptfoo-provider` produces `dist/`
- `cd examples/personal-finance && bun install` resolves `workspace:*` deps
- `LOBU_TOKEN=$(lobu token) bun run evals` invokes promptfoo against the personal-finance agent and runs ping + tax-year-anchoring scenarios
- `bun run evals:view` opens the comparison grid
- `make build-packages` (some dist deps were pre-existing missing; unchanged by this PR)