
feat(evals)!: drop in-house YAML runner, ship @lobu/promptfoo-provider #911

Merged
buremba merged 6 commits into main from feat/promptfoo-evals
May 19, 2026
Conversation

@buremba
Member

@buremba buremba commented May 19, 2026

Summary

  • Delete the in-house YAML eval runner (packages/cli/src/eval/ — client/grader/reporter/runner/types) and the lobu eval command. Net −2,700 LOC.
  • Add @lobu/promptfoo-provider — a new workspace package that drives a Lobu agent via the gateway's public Agent API and plugs into promptfoo via its package: protocol.
  • Migrate two single-turn personal-finance evals (ping + tax-year-anchoring split into 2 independent cases) into a real promptfooconfig.yaml to prove the wiring.
  • Clean lingering lobu eval references in Makefile, packages/cli/README.md, skills/lobu/SKILL.md, a stale dev.ts comment, and AGENTS.md.

Out of scope (deliberately deferred)

  • The 4 multi-turn personal-finance behavioural YAMLs (gap-surfacing, sa102, sa105, sa108) stay on disk but dormant. Multi-turn doesn't map cleanly to promptfoo's single-turn parametric tests; documented in evals/README.md. Follow-up PR can either extend the provider or flatten the conversations.
  • A QMSum demo example (the original motivation) is dropped from this PR; it depends on a gateway tool_use SSE event that doesn't exist yet, without which the retrieval-recall scoring it was built around can't function.

Handoff to a core-code agent

This branch intentionally does not touch core build/release/publish plumbing or gateway protocol. Three must-fix items + one should-fix are documented in packages/promptfoo-provider/HANDOFF.md:

  1. release-please-config.json — add packages/promptfoo-provider so it gets versioned with the monorepo.
  2. scripts/publish-packages.mjs — add it to the PACKAGES array so it actually publishes to npm.
  3. Makefile build-packages target — add it to the build chain.
  4. (should-fix) gateway SSE tool_use event type — unlocks context-recall / context-faithfulness / custom assertions that need metadata.toolCalls / metadata.retrievedContext.

BREAKING CHANGE

lobu eval and the YAML eval schema (agents/<id>/evals/*.yaml) are removed. Migrate to promptfoo + @lobu/promptfoo-provider. See examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml for the new pattern.

Test plan

  • HANDOFF must-fixes land (release-please + publish-packages + Makefile)
  • bun run build inside packages/promptfoo-provider produces dist/
  • cd examples/personal-finance && bun install resolves workspace:* deps
  • LOBU_TOKEN=$(lobu token) bun run evals invokes promptfoo against the personal-finance agent and runs ping + tax-year-anchoring scenarios
  • bun run evals:view opens the comparison grid
  • CLI typecheck passes after make build-packages (some dist deps were already missing before this PR and are unchanged by it)

Summary by CodeRabbit

  • New Features

    • Added a Promptfoo provider package to enable running agent evaluations with promptfoo.
    • Added an example personal-finance eval config and scripts to run/view evals.
  • Deprecations

    • Removed the built-in CLI eval runner and related CLI eval commands; use promptfoo for evaluations.
  • Documentation

    • Updated docs and guides to instruct using promptfoo + the new provider and example config for evals.

Review Change Stack

buremba added 2 commits May 19, 2026 13:55
…ptfoo-provider

Drop packages/cli/src/eval/ (client/grader/reporter/runner/types) and the
`lobu eval` command. Eval authoring moves to promptfoo's native
promptfooconfig.yaml format; a new @lobu/promptfoo-provider workspace
package drives a Lobu agent end-to-end via the gateway's public Agent API
(POST /api/v1/agents -> /messages -> SSE /events -> DELETE).

Adds examples/qmsum-demo/: a new example project that ingests Yale-LILY's
QMSum meeting-summarization dataset as merged speaking-turn events with
per-domain speaker entity rules (Academic per-meeting, Product/Committee
per-domain), exposes the corpus to any MCP client, and ships a
promptfooconfig.yaml with four eval suites — answer-quality vs gold,
meeting-summary vs gold, speaker-attribution, cross-meeting synthesis.

Known limitation flagged in both READMEs: retrieval-recall +
context-recall + context-faithfulness need the gateway to emit tool_use
SSE events so the provider can populate metadata.toolCalls /
metadata.retrievedContext. The scenario is stubbed in
promptfooconfig.yaml; follow-up gateway PR will unlock it.

Personal-finance evals are deferred — their multi-turn semantics don't map
cleanly to promptfoo's single-turn parametric tests. README in their
evals/ dir explains the migration path.

Net diff: -2,719 LOC (deletions of the in-house runner + tests),
+1,506 LOC (provider package, example project, prompts).
…ce eval

Per user direction, narrowing this PR to:
  * Delete packages/cli/src/eval/ + the `lobu eval` command (already in
    previous commit on this branch).
  * Ship @lobu/promptfoo-provider as the published replacement plugin.
  * Use it in an existing example (personal-finance) to prove the wiring.

Drops examples/qmsum-demo/ entirely — that work belongs in a follow-up PR
once the gateway tool_use SSE events land (without them, the QMSum demo's
killer retrieval-recall beat doesn't function).

Migrates two single-turn personal-finance evals (ping + tax-year-anchoring,
split into 2 independent cases) into a real promptfooconfig.yaml. The four
multi-turn behavioural YAMLs (gap-surfacing, sa102, sa105, sa108) stay
dormant pending either a provider-side multi-turn extension or a flattening
port; documented in their evals/README.md.

Provider id uses promptfoo's package: protocol:
  package:@lobu/promptfoo-provider:LobuProvider

Cleans lingering `lobu eval` references in Makefile, packages/cli/README.md,
skills/lobu/SKILL.md, and a stale dev.ts comment.

Adds packages/promptfoo-provider/HANDOFF.md documenting the must-fix
plumbing changes that a follow-up agent (with core-code access) needs to
make: release-please-config.json, scripts/publish-packages.mjs, Makefile
build-packages target. Plus the should-fix gateway SSE tool_use change that
unlocks the RAG-specific assertions.

BREAKING CHANGE: The in-house `lobu eval` command and YAML eval schema are
removed. Migrate evals to promptfoo + @lobu/promptfoo-provider; see
examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
for the new pattern.
@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| packages/cli/src/commands/dev.ts | 0.00% | 1 Missing ⚠️ |


}

this.agent = agent;
this.gateway = gateway.replace(/\/+$/, "");
@coderabbitai

coderabbitai Bot commented May 19, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR removes the in-house lobu eval runner and related CLI eval modules/tests, adds a new @lobu/promptfoo-provider package that runs Lobu agents via the gateway (session + SSE), migrates the personal-finance example to Promptfoo, and updates docs/workspace/build/publish configs.

Changes

Eval system migration from CLI to promptfoo

| Layer | File(s) | Summary |
| --- | --- | --- |
| Decommission in-house CLI eval command | Makefile, packages/cli/src/index.ts, packages/cli/src/__tests__/cli-ux.test.ts, packages/cli/README.md, packages/cli/src/commands/dev.ts, codex-skills/* | Remove lobu eval Makefile target and help text, unregister CLI eval commands/subcommands, delete related CLI eval tests/schemas, and update README/help entries and minor comments. |
| Implement promptfoo provider package for Lobu gateway integration | packages/promptfoo-provider/package.json, packages/promptfoo-provider/tsconfig.json, packages/promptfoo-provider/src/provider.ts, packages/promptfoo-provider/src/index.ts, packages/promptfoo-provider/README.md, packages/promptfoo-provider/HANDOFF.md | Add @lobu/promptfoo-provider workspace package with exports and TS config; implement LobuProvider that creates agent sessions, posts messages, reads SSE events with timeout/cleanup, maps token usage and metadata, and deletes sessions; include docs and release/publish handoff. |
| Migrate personal-finance example to promptfoo evals | examples/personal-finance/package.json, examples/personal-finance/agents/personal-finance/evals/README.md, examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml | Add example package with eval scripts, a promptfooconfig.yaml containing ping and tax-year tests, and README describing migration and follow-ups for multi-turn YAML evals. |
| Update docs, workspace, and publish tooling | AGENTS.md, skills/*, packages/landing/*, package.json, release-please-config.json, scripts/publish-packages.mjs | Replace lobu eval guidance across docs with promptfoo instructions, add examples/personal-finance to workspaces, update build scripts to include promptfoo-provider, and include the package in publish/release automation. |
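
The "reads SSE events" step in the table above can be pictured with a small parser for the text/event-stream wire format. This is a sketch only, not the code in provider.ts: the event names (`delta`, `done`) and the JSON payloads in the sample are hypothetical, because the gateway's event schema is not shown in this PR.

```typescript
// Minimal text/event-stream parser (illustrative; not the provider's actual code).
interface SseEvent {
  event: string; // defaults to "message" when no event: field is present
  data: string; // multiple data: lines are joined with "\n"
}

function parseSseEvents(raw: string): SseEvent[] {
  const events: SseEvent[] = [];
  // Events are separated by a blank line; accept both \n and \r\n.
  for (const block of raw.split(/\r?\n\r?\n/)) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue; // comment or keep-alive
      const idx = line.indexOf(":");
      const field = idx === -1 ? line : line.slice(0, idx);
      let value = idx === -1 ? "" : line.slice(idx + 1);
      if (value.startsWith(" ")) value = value.slice(1); // drop one leading space
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}

// Hypothetical stream: event names and payload shape are assumptions.
const sampleStream =
  'event: delta\ndata: {"text":"Hel"}\n\n' +
  'event: delta\ndata: {"text":"lo"}\n\n' +
  ": ping\n\n" +
  "event: done\ndata: {}\n\n";

const parsedEvents = parseSseEvents(sampleStream);
const assembledText = parsedEvents
  .filter((e) => e.event === "delta")
  .map((e) => (JSON.parse(e.data) as { text: string }).text)
  .join("");
console.log(parsedEvents.length, assembledText); // 3 Hello
```

The sketch parses a complete string; a real streaming reader would also need to buffer partial chunks until a blank line arrives.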

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lobu-ai/lobu#820: Modifies the CLI eval client (SSE/reader cleanup) and grader; strongly related because this PR removes the CLI eval modules replaced by the new Promptfoo provider.

Poem

🐰 The eval runner hops away,
Promptfoo now guides the play,
Sessions stream and tokens show,
Docs and examples learn to flow,
Hooray for tests that prompt and stay!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title clearly describes the main change: removing the in-house YAML eval runner and shipping a new promptfoo provider package. |
| Description check | ✅ Passed | The PR description provides a comprehensive Summary section, identifies breaking changes, documents out-of-scope items, lists handoff requirements, and includes a detailed test plan. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Makefile (1)

33-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add packages/promptfoo-provider to build-packages before merge.

make build-packages currently skips the new provider package, so the default monorepo build path won’t compile it. This leaves the migration only partially wired for local validation and CI parity.

Suggested diff
 build-packages:
 	@echo "📦 Building all TypeScript packages..."
-	@for pkg in core connector-sdk agent-worker openclaw-plugin embeddings connector-worker; do \
+	@for pkg in core connector-sdk agent-worker openclaw-plugin embeddings connector-worker promptfoo-provider; do \
 		echo "   📦 Building packages/$$pkg..."; \
 		( cd packages/$$pkg && bun run build ) || exit $$?; \
 	done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 33 - 43, The build-packages Makefile target currently
omits the new package; update the build-packages recipe (the for-loop and/or
subsequent steps) to include packages/promptfoo-provider so it is built with the
rest of the monorepo. Specifically, modify the list in the for-loop that
iterates over core connector-sdk agent-worker openclaw-plugin embeddings
connector-worker to also include promptfoo-provider (or add an explicit @( cd
packages/promptfoo-provider && bun run build ) step similar to
packages/server/packages/cli) so the promptfoo-provider package is compiled
during make build-packages.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@AGENTS.md`:
- Line 203: Update the stale reference in AGENTS.md that points to the removed
example string "examples/qmsum-demo/": replace it with the live canonical eval
example (e.g., the personal-finance promptfoo config, such as
"examples/personal-finance/") so the docs remain self-consistent; edit the
sentence that currently mentions `examples/qmsum-demo/` to reference the new
example and confirm the referenced example actually exists and demonstrates
custom provider auto-wiring, parametric JSONL tests, and RAG + answer-quality
assertions.

In
`@examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml`:
- Around line 35-37: Change the case-sensitive regex rule to a case-insensitive
contains rule: replace the rule where type is "regex" and value is
'hello|hi\b|hey|yes|here|ready' with type "icontains-any" and supply the
greetings as a list (e.g., value: ["hello","hi","hey","yes","here","ready"]) so
the ping check matches capitalized variants like "Hello" or "Hi"; update the
entry in the same YAML block containing type/value to ensure test stability.

In `@packages/promptfoo-provider/HANDOFF.md`:
- Around line 5-18: Add the new package to the repo's release/publish/build
configuration: add a packages entry for "packages/promptfoo-provider" in
release-please-config.json (mirror the existing block for
packages/connector-sdk), append 'promptfoo-provider' to the PACKAGES array in
scripts/publish-packages.mjs so the package is included in the npm publish flow,
and add promptfoo-provider to the for pkg in ... list in the Makefile's
build-packages target so CI produces the dist/ output; update the exact
string/entry names to match the canonical package name used elsewhere (e.g.,
'`@lobu/promptfoo-provider`').

In `@packages/promptfoo-provider/package.json`:
- Around line 8-17: The package.json exports map both "import" and "require" to
the same ESM output while package.json sets "type": "module" (and tsconfig uses
"module": "ESNext"), which will break CommonJS require consumers; either remove
the "require" export entry from the "exports" field to make the package
ESM-only, or produce a separate CommonJS build (e.g., dist/index.cjs) and change
the "require" export to point to that CJS artifact (and ensure dist/index.d.ts
still points to the types); update the "exports" block accordingly so "import"
-> ./dist/index.js and "require" -> ./dist/index.cjs if you add a CJS build,
otherwise delete the "require" mapping.

In `@packages/promptfoo-provider/README.md`:
- Line 60: Update the README entry for traceId so it matches the implementation:
change the comment on traceId (symbol: traceId) to state that the W3C trace id
is read from the `traceparent` field in the /messages JSON response body
(symbol: `traceparent`), not from an incoming HTTP `traceparent` header; ensure
the wording explicitly references the /messages response body to avoid
confusion.

In `@packages/promptfoo-provider/src/provider.ts`:
- Around line 175-182: createSession, sendMessage, and deleteSession call fetch
without an AbortController so they can hang; update each to mirror
collectResponse by creating an AbortController, set a timeout using
this.defaultTimeoutMs that calls controller.abort(), pass controller.signal to
fetch, and clear the timer after fetch finishes (or in finally) so network calls
respect the provider's timeout behavior.
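
The timeout pattern requested for createSession, sendMessage, and deleteSession could look roughly like the helper below. It is a sketch under assumptions: `withTimeout` and `hangUntilAborted` are hypothetical names, and the actual collectResponse implementation it is meant to mirror is not shown in this conversation.

```typescript
// Run an abortable operation with a deadline. `op` receives the signal and is
// expected to forward it, e.g. (signal) => fetch(url, { method: "POST", signal }).
async function withTimeout<T>(
  op: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await op(controller.signal);
  } finally {
    // Always clear the timer: on success, on error, and after an abort.
    clearTimeout(timer);
  }
}

// Stand-in for a request that hangs: it never resolves and rejects on abort.
function hangUntilAborted(signal: AbortSignal): Promise<never> {
  return new Promise((_, reject) => {
    signal.addEventListener("abort", () => reject(new Error("aborted")));
  });
}
```

Each of the three calls would then wrap its fetch in `withTimeout(..., this.defaultTimeoutMs)`, so a stalled gateway surfaces as a rejected promise instead of a hung eval run.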

---

Outside diff comments:
In `@Makefile`:
- Around line 33-43: The build-packages Makefile target currently omits the new
package; update the build-packages recipe (the for-loop and/or subsequent steps)
to include packages/promptfoo-provider so it is built with the rest of the
monorepo. Specifically, modify the list in the for-loop that iterates over core
connector-sdk agent-worker openclaw-plugin embeddings connector-worker to also
include promptfoo-provider (or add an explicit @( cd packages/promptfoo-provider
&& bun run build ) step similar to packages/server/packages/cli) so the
promptfoo-provider package is compiled during make build-packages.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ab1ec922-7aa3-4755-ac6b-5fcd034e59ae

📥 Commits

Reviewing files that changed from the base of the PR and between ac81fd7 and d03ce7e.

⛔ Files ignored due to path filters (1)
  • bun.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • AGENTS.md
  • Makefile
  • examples/personal-finance/agents/personal-finance/evals/README.md
  • examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
  • examples/personal-finance/package.json
  • packages/cli/README.md
  • packages/cli/src/__tests__/cli-ux.test.ts
  • packages/cli/src/__tests__/eval-schema.test.ts
  • packages/cli/src/commands/dev.ts
  • packages/cli/src/commands/eval.ts
  • packages/cli/src/eval/client.ts
  • packages/cli/src/eval/grader.ts
  • packages/cli/src/eval/reporter.ts
  • packages/cli/src/eval/runner.ts
  • packages/cli/src/eval/types.ts
  • packages/cli/src/index.ts
  • packages/promptfoo-provider/HANDOFF.md
  • packages/promptfoo-provider/README.md
  • packages/promptfoo-provider/package.json
  • packages/promptfoo-provider/src/index.ts
  • packages/promptfoo-provider/src/provider.ts
  • packages/promptfoo-provider/tsconfig.json
  • skills/lobu/SKILL.md
💤 Files with no reviewable changes (9)
  • packages/cli/src/__tests__/eval-schema.test.ts
  • packages/cli/src/eval/runner.ts
  • packages/cli/src/eval/reporter.ts
  • packages/cli/src/__tests__/cli-ux.test.ts
  • packages/cli/src/commands/eval.ts
  • packages/cli/src/eval/grader.ts
  • packages/cli/src/index.ts
  • packages/cli/src/eval/types.ts
  • packages/cli/src/eval/client.ts

Comment thread AGENTS.md Outdated
Comment on lines +35 to +37
- type: regex
value: 'hello|hi\b|hey|yes|here|ready'
weight: 0.3
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

In promptfoo assertion configs, is type: regex case-sensitive by default, and does it support a flags: i option for case-insensitive matching?

💡 Result:

In Promptfoo assertion configs, type: regex does not document any flags option, so case-insensitivity is not something you can enable via flags: i (and therefore it’s not documented as “case-insensitive by default”).[1] Promptfoo’s docs only describe type: regex / regex as “output matches the provided regular expression,” with no mention of automatic case-folding or a flags field.[1][2] Additionally, Promptfoo provides separate case-insensitive assertions like icontains, icontains-any, and icontains-all, reinforcing that case-insensitive behavior is handled by distinct assertion types rather than regex flags.[1] So, based on the current official docs: 1) type: regex is not stated to be case-insensitive by default (and no default i behavior is described). [1] 2) flags: i is not supported/documented for the regex assertion. [1][2]

Citations:


Use case-insensitive matching for the ping regex.

At Line 36, type: regex is case-sensitive and will miss capitalized greetings like "Hello" or "Hi", causing test flakiness. Replace with type: icontains-any to match any of the greeting values case-insensitively:

Proposed fix
-      - type: regex
-        value: 'hello|hi\b|hey|yes|here|ready'
+      - type: icontains-any
+        value: ["hello", "hi", "hey", "yes", "here", "ready"]
         weight: 0.3
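
The difference between the two assertion styles can be checked in isolation. The snippet below is a sketch: `icontainsAny` is a hypothetical stand-in that mirrors the documented behaviour of promptfoo's icontains-any (case-insensitive substring match against any listed value), not its implementation.

```typescript
// The regex from the config, compiled without the `i` flag (case-sensitive).
const pingRegex = /hello|hi\b|hey|yes|here|ready/;

// Hypothetical helper: case-insensitive "contains any of these substrings".
function icontainsAny(output: string, needles: string[]): boolean {
  const haystack = output.toLowerCase();
  return needles.some((needle) => haystack.includes(needle.toLowerCase()));
}

const greetings = ["hello", "hi", "hey", "yes", "here", "ready"];

console.log(pingRegex.test("Hello.")); // false: capital H defeats the regex
console.log(icontainsAny("Hello.", greetings)); // true
console.log(pingRegex.test("Nothing to report")); // false
console.log(icontainsAny("Nothing to report", greetings)); // true: "hi" inside "Nothing"
```

The last line shows the trade-off: dropping the `\b` word boundary makes the substring check looser than the original regex, so short needles such as "hi" can match inside unrelated words.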
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml`
around lines 35 - 37, Change the case-sensitive regex rule to a case-insensitive
contains rule: replace the rule where type is "regex" and value is
'hello|hi\b|hey|yes|here|ready' with type "icontains-any" and supply the
greetings as a list (e.g., value: ["hello","hi","hey","yes","here","ready"]) so
the ping check matches capitalized variants like "Hello" or "Hi"; update the
entry in the same YAML block containing type/value to ensure test stability.

Comment thread packages/promptfoo-provider/HANDOFF.md
Comment on lines +8 to +17
"exports": {
".": {
"import": {
"types": "./dist/index.d.ts",
"default": "./dist/index.js"
},
"require": {
"types": "./dist/index.d.ts",
"default": "./dist/index.js"
}
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify this package currently emits only ESM and maps `require` to the same ESM artifact.
set -euo pipefail

echo "== package exports =="
cat packages/promptfoo-provider/package.json | sed -n '1,80p'

echo
echo "== tsconfig module settings =="
cat packages/promptfoo-provider/tsconfig.json | sed -n '1,60p'

echo
echo "== check for any cjs outputs configured in package scripts =="
rg -n '"build"|cjs|commonjs|module"' packages/promptfoo-provider/package.json packages/promptfoo-provider/tsconfig.json

Repository: lobu-ai/lobu

Length of output: 1898


Remove the require export or provide a separate CommonJS build.

The package is configured as "type": "module" with TypeScript emitting ESM code ("module": "ESNext"), but the exports field maps both import and require to the same ESM artifact. CommonJS consumers calling require('@lobu/promptfoo-provider') will fail at runtime.

Either remove the require entry (simplest approach since this is ESM-only) or add a CJS build and map require.default to the CJS output:

Proposed fix (remove require export)
   "exports": {
     ".": {
       "import": {
         "types": "./dist/index.d.ts",
         "default": "./dist/index.js"
       }
     }
   },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  "exports": {
-    ".": {
-      "import": {
-        "types": "./dist/index.d.ts",
-        "default": "./dist/index.js"
-      },
-      "require": {
-        "types": "./dist/index.d.ts",
-        "default": "./dist/index.js"
-      }
+  "exports": {
+    ".": {
+      "import": {
+        "types": "./dist/index.d.ts",
+        "default": "./dist/index.js"
+      }
+    }
+  },
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/package.json` around lines 8 - 17, The
package.json exports map both "import" and "require" to the same ESM output
while package.json sets "type": "module" (and tsconfig uses "module": "ESNext"),
which will break CommonJS require consumers; either remove the "require" export
entry from the "exports" field to make the package ESM-only, or produce a
separate CommonJS build (e.g., dist/index.cjs) and change the "require" export
to point to that CJS artifact (and ensure dist/index.d.ts still points to the
types); update the "exports" block accordingly so "import" -> ./dist/index.js
and "require" -> ./dist/index.cjs if you add a CJS build, otherwise delete the
"require" mapping.

metadata: {
agent: string
thread: string // fresh per call by default
traceId?: string // W3C trace id from `traceparent` header
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Trace ID source is documented incorrectly.

Line 60 says traceId comes from a traceparent header, but the provider currently reads it from the /messages JSON response body field (traceparent). Please align the wording with implementation.
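
For reference, extracting a trace id from a W3C `traceparent` value in a JSON body might look like the sketch below. How provider.ts actually parses the field is not shown here; the function name and the `body` shape are assumptions, while the `version-traceid-parentid-flags` layout comes from the W3C Trace Context format.

```typescript
// traceparent = version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags
const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Hypothetical helper: returns the trace id, or undefined for malformed input.
function traceIdFromTraceparent(traceparent: unknown): string | undefined {
  if (typeof traceparent !== "string") return undefined;
  const match = TRACEPARENT_PATTERN.exec(traceparent.trim());
  if (!match) return undefined;
  const traceId = match[2];
  // An all-zero trace id is invalid in the Trace Context format.
  if (!traceId || /^0+$/.test(traceId)) return undefined;
  return traceId;
}

// `body` stands in for the parsed /messages JSON response (shape assumed).
const body: { traceparent?: string } = {
  traceparent: "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
};
const traceId = traceIdFromTraceparent(body.traceparent);
console.log(traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```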

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/README.md` at line 60, Update the README entry
for traceId so it matches the implementation: change the comment on traceId
(symbol: traceId) to state that the W3C trace id is read from the `traceparent`
field in the /messages JSON response body (symbol: `traceparent`), not from an
incoming HTTP `traceparent` header; ensure the wording explicitly references the
/messages response body to avoid confusion.

Comment thread packages/promptfoo-provider/src/provider.ts Outdated
…s, gateway path

Three bugs found exercising the provider end-to-end with promptfoo:

1. promptfoo's package-protocol loader unwraps `default` exports before
   looking up the entity name on the module:
     const mod = importedModule?.default || importedModule
     const entity = mod[entityName]
   With both `default` and named `LobuProvider`, `mod` became the class
   itself and `mod['LobuProvider']` was undefined. Drop the default export
   so the loader falls through to the namespace object.

2. examples/personal-finance is not picked up by the monorepo workspace
   list (root package.json only globs packages/*). Bun couldn't resolve
   `@lobu/promptfoo-provider@workspace:*`. Add `examples/personal-finance`
   to the root workspaces array. Other examples don't have a package.json
   yet so they're unaffected.

3. The public Agent API is mounted at `/lobu` (the gateway serves
   org-scoped REST at `/`; see packages/server/src/server.ts). Old
   eval/client.ts only worked because resolveGatewayUrl appended `/lobu`
   upstream. The provider now uses `${gateway}/lobu/api/v1/agents` so
   `LOBU_GATEWAY=http://localhost:8787` works without manual prefixing.
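
The export-shape problem in item 1 can be reproduced without promptfoo. The sketch below applies the same two lookup steps quoted in the commit message to plain objects; `resolveEntity`, `ModuleLike`, and the empty `LobuProvider` class are illustrative stand-ins, not promptfoo or provider code.

```typescript
type ModuleLike = Record<string, unknown> & { default?: unknown };

// Same two steps as the loader snippet quoted in item 1.
function resolveEntity(importedModule: ModuleLike, entityName: string): unknown {
  const mod = (importedModule?.default || importedModule) as Record<string, unknown>;
  return mod[entityName];
}

class LobuProvider {}

// Before the fix: both a default export and a named export.
const withDefaultExport: ModuleLike = { default: LobuProvider, LobuProvider };
// After the fix: named export only.
const namedExportOnly: ModuleLike = { LobuProvider };

// `mod` becomes the class itself, which has no static "LobuProvider" property.
console.log(resolveEntity(withDefaultExport, "LobuProvider")); // undefined
// `mod` stays the namespace object, so the lookup finds the class.
console.log(resolveEntity(namedExportOnly, "LobuProvider") === LobuProvider); // true
```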

Smoke test now reaches a 401 (Unauthorized) — the correct failure shape
for a dummy LOBU_TOKEN against a live gateway. With a valid token the
eval will run through.
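
A minimal sketch of the URL construction described in item 3, assuming a helper named `agentsEndpoint` (hypothetical). The trailing-slash strip matches the `gateway.replace(/\/+$/, "")` line visible in provider.ts; the `/lobu/api/v1/agents` suffix is taken from the commit message.

```typescript
// Build the Agent API collection URL from the LOBU_GATEWAY value.
function agentsEndpoint(gateway: string): string {
  const base = gateway.replace(/\/+$/, ""); // tolerate trailing slashes
  return `${base}/lobu/api/v1/agents`;
}

console.log(agentsEndpoint("http://localhost:8787"));
// http://localhost:8787/lobu/api/v1/agents
console.log(agentsEndpoint("http://localhost:8787/"));
// http://localhost:8787/lobu/api/v1/agents
```

This sketch does not guard against a gateway value that already ends in `/lobu`; whether the provider handles that case is not stated in this PR.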
@buremba buremba force-pushed the feat/promptfoo-evals branch from 8fc444a to 81471e5 May 19, 2026 13:07
buremba added 2 commits May 19, 2026 14:09
Codex review caught comments in provider.ts and HANDOFF.md still referring
to `/api/v1/agents` instead of `/lobu/api/v1/agents`. Update both. Also
add a note to HANDOFF that root package.json's `build:packages` script
(invoked by scripts/publish-packages.mjs) needs the new package added,
not just the Makefile target.
…h + rewrite landing docs

Addresses the must-fix items from the codex review of PR #911:

* release-please-config.json: add packages/promptfoo-provider/package.json
  to extra-files so the new package gets bumped with the monorepo.
* scripts/publish-packages.mjs: append { dir: "packages/promptfoo-provider",
  transform: rewriteWorkspaceRefs } to PACKAGES so it actually publishes
  to npm.
* root package.json: add `cd ../promptfoo-provider && bun run build` to the
  build:packages chain (the script publish-packages.mjs calls before npm
  publish).
* Makefile: add promptfoo-provider to the build-packages target's pkg list.

Landing docs rewritten to point at promptfoo + @lobu/promptfoo-provider:

* packages/landing/src/content/docs/guides/evals.md: full rewrite — the old
  doc described the deleted YAML runner. New doc covers promptfoo install,
  promptfooconfig.yaml shape, the `package:` protocol provider id, assertion
  types, parametric tests, the personal-finance example, known limitations.
* packages/landing/src/content/docs/getting-started/index.mdx: replace
  `npx @lobu/cli@latest eval` with `bunx promptfoo eval`.
* packages/landing/src/content/docs/guides/testing.md: same.
* packages/landing/src/content/docs/reference/cli.md: drop the
  `eval [name]` / `eval new` sections, point at the Evaluations guide.
* codex-skills/lobu-builder/SKILL.md: replace lobu eval references with
  promptfoo invocation.
* Two test file comments referencing `lobu eval` updated to be accurate.
@buremba buremba added the skip-size-check label (Bypass PR size gate for intentionally large single-concern changes) May 19, 2026
@buremba buremba merged commit f8f087b into main May 19, 2026
23 of 25 checks passed
@buremba buremba deleted the feat/promptfoo-evals branch May 19, 2026 13:25
buremba added a commit that referenced this pull request May 19, 2026
…mpotency (#914)

* chore(build): bun lockfile docs, login polling cleanup, migration idempotency

Four hygiene gaps surfaced during the PR #911 end-to-end test:

1. AGENTS.md — document the "bun lockfile + owletto submodule" interaction.
   CI initialises the submodule before `bun install --frozen-lockfile`, so
   the lockfile on `main` always reflects an initialised submodule. Local
   pushes from an uninitialised checkout silently regenerate `bun.lock` and
   trip the next CI run. Document the pre-push check.

2. AGENTS.md — point IDE users at the biome plugin so save-time formatting
   matches the Husky pre-commit `biome check --write` hook. Without an
   integration the hook keeps rewriting files behind the editor's back.

3. `lobu login` device-flow polling cleanup. Backgrounded `lobu login &`
   shells were observed hammering `/oauth/token` at the polling interval
   long after the parent shell exited. Adds:
     - SIGHUP / SIGTERM / SIGINT handlers that exit the poll loop cleanly.
     - Hard 5-minute ceiling on the loop in addition to the server
       `expires_in` deadline.
     - Non-interactive bail: when `--quiet` or `!isTTY`, a `pending` poll
       is terminal — print "use --token <pat>" and exit instead of looping
       until expiry.

4. Migration runner idempotency. The squashed baseline uses plain
   `CREATE FUNCTION` / `CREATE TABLE`, so replaying against a DB that
   already has the schema raises 42723 / 42P07 / 42710. Catch those
   specific codes, log "Migration already applied (idempotent skip)", and
   record the version as applied. Non-idempotent errors still propagate.
   No new migrations, no schema changes.
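
The polling rules in item 3, together with the refinements listed in the follow-up commit message for #914, can be sketched as a few small helpers. All function names here are hypothetical and the real `lobu login` code is not shown in this conversation; only the 5-minute ceiling, the TTY and `--quiet` conditions, and the "cap sleep at deadline - now" rule come from the text.

```typescript
const HARD_CEILING_MS = 5 * 60 * 1000; // hard 5-minute ceiling on the poll loop

// Interactive only when stdin and stdout are both TTYs and --quiet is not set.
function isInteractive(stdinIsTTY: boolean, stdoutIsTTY: boolean, quiet: boolean): boolean {
  return stdinIsTTY && stdoutIsTTY && !quiet;
}

// The loop stops at whichever comes first: server expiry or the hard ceiling.
function pollDeadline(startMs: number, expiresInSec: number): number {
  return startMs + Math.min(expiresInSec * 1000, HARD_CEILING_MS);
}

// Never sleep past the deadline, even if slow_down has inflated the interval.
function nextSleepMs(intervalMs: number, deadlineMs: number, nowMs: number): number {
  return Math.max(0, Math.min(intervalMs, deadlineMs - nowMs));
}

// Sleep that resolves early when the signal aborts, e.g. from a SIGHUP handler
// that calls controller.abort().
function abortableSleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) {
      resolve();
      return;
    }
    const onDone = (): void => {
      clearTimeout(timer);
      signal.removeEventListener("abort", onDone);
      resolve();
    };
    const timer = setTimeout(onDone, ms);
    signal.addEventListener("abort", onDone);
  });
}
```

Keeping the deadline arithmetic in pure functions makes the ceiling and the sleep cap testable without signals or timers; only `abortableSleep` touches the event loop.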

* address pi review: stdin TTY check, abortable sleep, baseline-only idempotency

Pi findings on PR #914:

1. login: require stdin TTY in addition to stdout for `isInteractive`. A
   `lobu login </dev/null` with a stdout TTY no longer misidentifies as
   interactive and falls through to polling.
2. login: cap the polling sleep at `deadline - now`, and make it abortable
   so SIGHUP/SIGTERM/SIGINT (or the hard 5-min ceiling) wake the loop
   immediately instead of waiting out the current interval (which
   `slow_down` can balloon to >30s).
3. migrations: restrict the duplicate-object short-circuit to the squashed
   baseline version (`00000000000000`) via an `IDEMPOTENT_BASELINE_VERSIONS`
   allowlist. Future delta migrations must use `IF NOT EXISTS` discipline —
   the runner no longer masks mid-file failures by marking the version
   applied just because one statement collided.
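
The baseline-only skip rule can be expressed as a single predicate. This is a sketch, not the migration runner's code: `isIdempotentSkip` is a hypothetical name, and it assumes the database driver exposes the PostgreSQL SQLSTATE as a string `code` property on the thrown error. The three codes and the `IDEMPOTENT_BASELINE_VERSIONS` allowlist come from the commit messages.

```typescript
// PostgreSQL SQLSTATE codes named in the commit message.
const DUPLICATE_OBJECT_CODES = new Set([
  "42723", // duplicate_function
  "42P07", // duplicate_table
  "42710", // duplicate_object
]);

// Only the squashed baseline may be treated as "already applied".
const IDEMPOTENT_BASELINE_VERSIONS = new Set(["00000000000000"]);

// True when a failure should be logged as an idempotent skip instead of thrown.
function isIdempotentSkip(version: string, err: unknown): boolean {
  const code =
    typeof err === "object" && err !== null && "code" in err
      ? (err as { code?: unknown }).code
      : undefined;
  return (
    typeof code === "string" &&
    IDEMPOTENT_BASELINE_VERSIONS.has(version) &&
    DUPLICATE_OBJECT_CODES.has(code)
  );
}

console.log(isIdempotentSkip("00000000000000", { code: "42P07" })); // true
console.log(isIdempotentSkip("20260519000000", { code: "42P07" })); // false: not the baseline
console.log(isIdempotentSkip("00000000000000", { code: "23505" })); // false: not a duplicate-object code
```

The version `20260519000000` in the usage lines is an invented example of a later delta migration.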
buremba added a commit that referenced this pull request May 19, 2026
…lock image builds (#927)

PR #911 added `examples/personal-finance` to root `package.json`'s
`workspaces` field but didn't update the Dockerfiles, which only COPY
`packages/*/package.json` for the install layer. `bun install` inside
the Docker build then errored:

    error: Workspace not found "examples/personal-finance"
        at /app/package.json:8:5

Every image build on `main` since #911 merged (13:25 UTC today) has
been red: #911 → #913 (+revert) → #914 → #915 → #919 → #923 → #924 → #912 → #925, all sitting on `main` un-deployable, including the
`principal_kind` migration from #923 and my own loading-skeletons
shipping artifacts.

Two ways to fix it:

1. **Add stubs to all three Dockerfiles** for the example. Treats the
   symptom; couples prod build pipeline to whatever's under `examples/`,
   wrong direction.
2. **Take the example out of root workspaces.** Examples are
   documentation/demos for users to clone + run; they don't belong in
   the prod build graph. Cleaner separation.

Going with (2). Side effects:

- Example's dependency on `@lobu/promptfoo-provider` switched from
  `workspace:*` (workspace-protocol-only) to
  `file:../../packages/promptfoo-provider`. Resolves locally without
  requiring the example to be in a workspace; consumers run
  `cd examples/personal-finance && bun install` standalone (after
  building the provider once: `cd packages/promptfoo-provider && bun
  run build`).
- `bun.lock` regenerated. Most of the diff is bun's "linked
  workspaces" table shrinking — no upstream version churn.

Verified: simulated Docker build context (root files + stubbed
packages/* manifests + provider stub, no examples/) runs `bun install`
cleanly. No "Workspace not found" error.
Labels

skip-size-check Bypass PR size gate for intentionally large single-concern changes

3 participants