feat(promptfoo-provider): support vars.transcript for multi-turn evals#913
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (9)
📝 WalkthroughWalkthroughThis PR consolidates standalone YAML eval definitions into a unified ChangesMulti-turn eval consolidation and provider implementation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/promptfoo-provider/src/provider.ts (1)
113-177:⚠️ Potential issue | 🔴 CriticalRun
make build-packagesbefore merging.Changes to
packages/promptfoo-provider/src/require compilation todist/. The dist/ directory is missing, indicating the package has not been built. Per coding guidelines, workspace packages must be compiled from source;make devdoes not auto-rebuild them.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/promptfoo-provider/src/provider.ts` around lines 113 - 177, The package's compiled output (dist/) is missing for the updated provider implementation (see callApi in the provider.ts change), so run the repository build pipeline (run make build-packages) to compile packages/promptfoo-provider from source into dist/, then add/commit the generated dist/ files so consumers get the built artifacts; ensure the build succeeds and the distributed files reflect the changes in callApi/sendAndCollect/deleteSession and related exports.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/personal-finance/agents/personal-finance/evals/README.md`:
- Around line 15-18: Update the documentation header count from “Six checks” to
“Seven checks” to match the listed evals: single-turn checks (vars.query)
include ping and tax-year-anchoring (two items described as three? ensure you
count them correctly) and multi-turn checks (vars.transcript) include
gap-surfacing, sa102-employment, sa105-property, sa108-cgt; adjust the opening
sentence so it accurately reads “Seven checks, two shapes:” to match the seven
enumerated checks (ping, tax-year-anchoring, gap-surfacing, sa102-employment,
sa105-property, sa108-cgt) referenced in the README.
---
Outside diff comments:
In `@packages/promptfoo-provider/src/provider.ts`:
- Around line 113-177: The package's compiled output (dist/) is missing for the
updated provider implementation (see callApi in the provider.ts change), so run
the repository build pipeline (run make build-packages) to compile
packages/promptfoo-provider from source into dist/, then add/commit the
generated dist/ files so consumers get the built artifacts; ensure the build
succeeds and the distributed files reflect the changes in
callApi/sendAndCollect/deleteSession and related exports.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: b617f87e-ef17-4e10-b109-535d580c7cb4
📒 Files selected for processing (9)
examples/personal-finance/agents/personal-finance/evals/README.mdexamples/personal-finance/agents/personal-finance/evals/gap-surfacing.yamlexamples/personal-finance/agents/personal-finance/evals/promptfooconfig.yamlexamples/personal-finance/agents/personal-finance/evals/sa102-employment.yamlexamples/personal-finance/agents/personal-finance/evals/sa105-property.yamlexamples/personal-finance/agents/personal-finance/evals/sa108-cgt.yamlpackages/promptfoo-provider/README.mdpackages/promptfoo-provider/src/__tests__/provider.test.tspackages/promptfoo-provider/src/provider.ts
💤 Files with no reviewable changes (4)
- examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
- examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
- examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
- examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml
| Six checks, two shapes: | ||
|
|
||
| The remaining YAMLs — `gap-surfacing.yaml`, `sa102-employment.yaml`, `sa105-property.yaml`, `sa108-cgt.yaml` — are still on the old format and **not currently executable**. They are multi-turn conversational tests (e.g. `gap-surfacing.yaml` relies on context established in turn 1 to evaluate turn 2's behaviour) and promptfoo's parametric `tests:` model is single-turn by default. Porting needs either: | ||
| - **Single-turn** (`vars.query`): `ping`, `tax-year-anchoring` (2024-25 boundary, 2025-26 boundary). | ||
| - **Multi-turn** (`vars.transcript` — sequential user turns replayed in one Lobu thread; assertions evaluate the final response): `gap-surfacing`, `sa102-employment`, `sa105-property`, `sa108-cgt`. See `packages/promptfoo-provider/README.md` for the transcript protocol. |
There was a problem hiding this comment.
Fix coverage count mismatch.
Line 15 says “Six checks,” but the bullets on Lines 17–18 enumerate seven checks total (3 single-turn + 4 multi-turn). Update the count to avoid confusion in eval reporting docs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/personal-finance/agents/personal-finance/evals/README.md` around
lines 15 - 18, Update the documentation header count from “Six checks” to “Seven
checks” to match the listed evals: single-turn checks (vars.query) include ping
and tax-year-anchoring (two items described as three? ensure you count them
correctly) and multi-turn checks (vars.transcript) include gap-surfacing,
sa102-employment, sa105-property, sa108-cgt; adjust the opening sentence so it
accurately reads “Seven checks, two shapes:” to match the seven enumerated
checks (ping, tax-year-anchoring, gap-surfacing, sa102-employment,
sa105-property, sa108-cgt) referenced in the README.
ca66f6a to
c017b0b
Compare
…onal-finance evals @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. 5 provider tests pass (mock-fetch over the gateway endpoints) covering single-turn baseline, multi-turn ordering + session reuse + single cleanup, whitespace filter, non-array fallback, empty-array fallback. Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider files in main already include #918's additions; this commit is the strict multi-turn delta.
c017b0b to
8880f29
Compare
…onal-finance evals (#921) @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. Strictly additive atop current main (which already includes #918's tool_use SSE events). Re-do of #913 after #920 reverted that PR — the original landing accidentally undid #914 and #916 because of a bad rebase-and-soft-reset.
…lock image builds (#927) PR #911 added `examples/personal-finance` to root `package.json`'s `workspaces` field but didn't update the Dockerfiles, which only COPY `packages/*/package.json` for the install layer. `bun install` inside the Docker build then errored: error: Workspace not found "examples/personal-finance" at /app/package.json:8:5 Every image build on `main` since #911 merged (13:25 UTC today) has been red: #911 → #913 (+revert) → #914 → #915 → #919 → #923 → #924 → #912 → #925 — all sitting on `main` un-deployable, including the `principal_kind` migration from #923 and my own loading-skeletons shipping artifacts. Two ways to fix it: 1. **Add stubs to all three Dockerfiles** for the example. Treats the symptom; couples prod build pipeline to whatever's under `examples/`, wrong direction. 2. **Take the example out of root workspaces.** Examples are documentation/demos for users to clone + run; they don't belong in the prod build graph. Cleaner separation. Going with (2). Side effects: - Example's dependency on `@lobu/promptfoo-provider` switched from `workspace:*` (workspace-protocol-only) to `file:../../packages/promptfoo-provider`. Resolves locally without requiring the example to be in a workspace; consumers run `cd examples/personal-finance && bun install` standalone (after building the provider once: `cd packages/promptfoo-provider && bun run build`). - `bun.lock` regenerated. Most of the diff is bun's "linked workspaces" table shrinking — no upstream version churn. Verified: simulated Docker build context (root files + stubbed packages/* manifests + provider stub, no examples/) runs `bun install` cleanly. No "Workspace not found" error.
Summary
LobuProvider.callApinow replayscontext.vars.transcript(astring[]) as sequential user turns in one Lobu thread when set, and returns the final assistant response for assertion. Falls back to single-turn behaviour whentranscriptis missing, non-array, or empty.gap-surfacing,sa102-employment,sa105-property,sa108-cgt) fromexamples/personal-finance/agents/personal-finance/evals/into the activepromptfooconfig.yaml. The old.yamlfiles are deleted. All six evals are now executable viabun run evals.packages/promptfoo-provider/src/__tests__/provider.test.ts) mocksglobalThis.fetchto cover the gateway protocol: single-turn baseline, multi-turn ordering + session re-use, empty-entry filtering, non-array fallback, empty-array fallback.Design notes
transcriptignores the renderedpromptwhen set — the transcript is the source of truth for what the user said. Documented inpackages/promptfoo-provider/README.md.tests:shape is single-assertion-set-per-test; replicating per-turn semantics would require rewriting the test runner, and the agent's final response is what the user actually sees. If an intermediate turn matters, encode it as a rubric on the final response (see the sa102/sa105/sa108 entries which ask the rubric to verify the final response references earlier context).Test plan
make typecheck— clean.bun test packages/promptfoo-provider/src— 5/5 pass (single-turn, multi-turn ordering, whitespace filter, non-array fallback, empty-array fallback).bun run check— pre-existing unrelated warning inpackages/cli/src/__tests__/cli-ux.test.ts; no new violations.vars.transcriptshape.bun run evalsfromexamples/personal-finance/) — left to the operator with an activeLOBU_TOKEN+ the personal-finance agent deployed; the unit test exercises the gateway protocol shape.Summary by CodeRabbit
New Features
Documentation
Tests
Chores