feat(promptfoo-provider): support vars.transcript for multi-turn evals by buremba · Pull Request #913 · lobu-ai/lobu

buremba · 2026-05-19T14:21:26Z

Summary

- `LobuProvider.callApi` now replays `context.vars.transcript` (a `string[]`) as sequential user turns in one Lobu thread when set, and returns the final assistant response for assertion. It falls back to single-turn behaviour when `transcript` is missing, non-array, or empty.
- Migrates the four dormant multi-turn YAMLs (`gap-surfacing`, `sa102-employment`, `sa105-property`, `sa108-cgt`) from `examples/personal-finance/agents/personal-finance/evals/` into the active `promptfooconfig.yaml`. The old `.yaml` files are deleted. All six evals are now executable via `bun run evals`.
- New bun test (`packages/promptfoo-provider/src/__tests__/provider.test.ts`) mocks `globalThis.fetch` to cover the gateway protocol: single-turn baseline, multi-turn ordering + session re-use, empty-entry filtering, non-array fallback, empty-array fallback.
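
The fallback rules in the summary can be sketched as a small pure function. This is a minimal illustration, not the code from `provider.ts`: the `Vars` type and `resolveTurns` are hypothetical names, and two behaviours are assumptions rather than facts from the PR (non-string entries are dropped, and a transcript whose entries are all blank is treated like an empty one).

```typescript
// Hypothetical sketch of the transcript fallback rules; the real helper may differ.
type Vars = Record<string, unknown> | undefined;

/** Returns the cleaned transcript, or null when the caller should fall back to single-turn. */
function extractTranscript(vars: Vars): string[] | null {
  const raw = vars?.transcript;
  if (!Array.isArray(raw)) return null; // missing or non-array: fall back
  const turns = raw
    .filter((entry): entry is string => typeof entry === "string") // assumption: drop non-strings
    .filter((entry) => entry.trim().length > 0); // drop empty and whitespace-only entries
  return turns.length > 0 ? turns : null; // empty array: fall back
}

/** Picks the user turns to send: the transcript when usable, otherwise the rendered prompt. */
function resolveTurns(prompt: string, vars: Vars): string[] {
  return extractTranscript(vars) ?? [prompt];
}
```

Keeping the decision in a pure function like this makes the five cases listed for the unit test (baseline, ordering, filtering, and the two fallbacks) checkable without a gateway.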

Design notes

- When set, `transcript` takes precedence over the rendered prompt: the transcript is the source of truth for what the user said. Documented in `packages/promptfoo-provider/README.md`.
- Per-turn assertions are not exposed on purpose. Promptfoo's `tests:` shape is single-assertion-set-per-test; replicating per-turn semantics would require rewriting the test runner, and the agent's final response is what the user actually sees. If an intermediate turn matters, encode it as a rubric on the final response (see the sa102/sa105/sa108 entries, which ask the rubric to verify that the final response references earlier context).
- Whitespace-only / empty transcript entries are filtered so an accidental trailing newline in YAML doesn't send a blank turn.
- Bails on the first errored turn, since assertions against a broken thread would be meaningless.
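
The replay-and-bail behaviour described in these notes might look roughly like the loop below. It is a sketch under stated assumptions: `TurnResult`, `SendTurn`, and `replayTurns` are hypothetical names, and the injected `sendTurn` function stands in for whatever the provider actually uses to post a turn and collect the reply.

```typescript
// Hypothetical shapes; the real provider's types and error format may differ.
interface TurnResult {
  output?: string;
  error?: string;
}

type SendTurn = (sessionId: string, message: string) => Promise<TurnResult>;

/** Sends turns one at a time in the same session and returns the final turn's result. */
async function replayTurns(
  sessionId: string,
  turns: string[],
  sendTurn: SendTurn,
): Promise<TurnResult> {
  let last: TurnResult = { error: "transcript contained no turns" };
  let index = 0;
  for (const turn of turns) {
    index += 1;
    // Await each turn before sending the next so the thread sees them in order.
    last = await sendTurn(sessionId, turn);
    if (last.error) {
      // Stop here: later turns would run against a broken thread.
      return { error: `turn ${index} of ${turns.length} failed: ${last.error}` };
    }
  }
  return last; // only the final response is handed to the assertions
}
```

Injecting `sendTurn` keeps the ordering and early-exit logic testable without network access.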

Test plan

- `make typecheck`: clean.
- `bun test packages/promptfoo-provider/src`: 5/5 pass (single-turn, multi-turn ordering, whitespace filter, non-array fallback, empty-array fallback).
- `bun run check`: pre-existing unrelated warning in `packages/cli/src/__tests__/cli-ux.test.ts`; no new violations.
- YAML parses to 7 tests (3 single-turn + 4 multi-turn) with the right `vars.transcript` shape.
- End-to-end run against a live gateway (`bun run evals` from `examples/personal-finance/`): left to the operator with an active `LOBU_TOKEN` and the personal-finance agent deployed; the unit test exercises the gateway protocol shape.
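
For readers unfamiliar with the two test shapes, the sketch below mirrors them as TypeScript objects. Everything here is illustrative: the `EvalTest` and `EvalAssertion` interfaces are hypothetical stand-ins for the YAML structure, and the descriptions, transcript text, and rubric wording are invented examples, not entries from `promptfooconfig.yaml`. The only details taken from the PR are that single-turn tests use `vars.query`, multi-turn tests use `vars.transcript`, and assertions apply to the final response.

```typescript
// Hypothetical mirror of a promptfoo test entry; field names follow the YAML keys.
interface EvalAssertion {
  type: string;
  value: string;
}

interface EvalTest {
  description: string;
  vars: { query?: string; transcript?: string[] };
  assert: EvalAssertion[];
}

const tests: EvalTest[] = [
  {
    description: "single-turn example (illustrative)",
    vars: { query: "ping" },
    assert: [{ type: "llm-rubric", value: "Responds to the ping." }],
  },
  {
    description: "multi-turn example (illustrative)",
    vars: {
      transcript: [
        "I was employed full-time last tax year.",
        "What do I still need to provide?",
      ],
    },
    // The rubric checks the final response for context from the earlier turn.
    assert: [
      {
        type: "llm-rubric",
        value: "The final response refers back to the employment mentioned in the first turn.",
      },
    ],
  },
];

/** True when a test would take the transcript path rather than the single prompt. */
function isMultiTurn(test: EvalTest): boolean {
  return Array.isArray(test.vars.transcript) && test.vars.transcript.length > 0;
}
```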

Summary by CodeRabbit

New Features
- Added multi-turn evaluation transcript support for sequential conversation testing.
Documentation
- Updated README documenting multi-turn evaluation execution and configuration.
- Updated evaluation framework documentation with coverage details.
Tests
- Added tests for multi-turn transcript evaluation scenarios.
Chores
- Consolidated and updated evaluation configurations with new multi-turn test cases.

coderabbitai · 2026-05-19T14:21:48Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d1d9150c-d614-4d9f-ba69-91f21d5589c9

📥 Commits

Reviewing files that changed from the base of the PR and between c017b0b and 8880f29.

📒 Files selected for processing (9)

examples/personal-finance/agents/personal-finance/evals/README.md
examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml
examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
packages/promptfoo-provider/README.md
packages/promptfoo-provider/src/__tests__/provider.test.ts
packages/promptfoo-provider/src/provider.ts

📝 Walkthrough

This PR consolidates standalone YAML eval definitions into a unified promptfooconfig.yaml and implements multi-turn transcript-driven evaluation execution in the promptfoo provider. Four new two-turn behavioral test cases are added (gap-surfacing, SA102 employment, SA105 rental property, SA108 capital gains), supported by provider changes and comprehensive tests.

Changes

Multi-turn eval consolidation and provider implementation

| Layer / File(s) | Summary |
| --- | --- |
| Eval consolidation into promptfooconfig and README: `examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml`, `examples/personal-finance/agents/personal-finance/evals/README.md` | Removes four standalone YAML eval files (`gap-surfacing.yaml`, `sa102-employment.yaml`, `sa105-property.yaml`, `sa108-cgt.yaml`), consolidates all definitions into `promptfooconfig.yaml` with four new two-turn multi-turn behavioral test cases (gap-surfacing, SA102, SA105, SA108), and updates README to document the unified config structure and evaluation coverage. |
| LobuProvider multi-turn transcript implementation: `packages/promptfoo-provider/src/provider.ts` | `LobuProvider.callApi` now derives `turns` from `context.vars.transcript` (or falls back to single `prompt`), iterates `sendAndCollect` sequentially within the same Lobu session, and returns the final turn's response; `extractTranscript` helper validates and filters transcript entries. |
| Provider multi-turn transcript tests: `packages/promptfoo-provider/src/__tests__/provider.test.ts` | Comprehensive test suite with mock gateway/fetch/SSE covering single-turn behavior, multi-turn transcript replay across three turns, transcript filtering of empty/whitespace entries, and fallback behavior when transcript is unavailable or empty. |
| Provider documentation for multi-turn evals: `packages/promptfoo-provider/README.md` | README adds "Multi-turn evals" section documenting `vars.transcript` as `string[]`, single-thread replay behavior, final-response-only assertions, and fallback handling; includes YAML examples for single-turn and multi-turn rubric-based refusal tests. |
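
The test row above mentions a mocked gateway built on `fetch`. One common way to do that is to swap `globalThis.fetch` for a stub that records each request, sketched below. This is an assumption about the general technique, not the contents of `provider.test.ts`: `installFetchRecorder` and `RecordedCall` are hypothetical names, and the real tests also emulate SSE, which this sketch does not.

```typescript
// Hypothetical helper: records every fetch call so a test can assert on order and payloads.
interface RecordedCall {
  url: string;
  method: string;
  body: string | null;
}

function installFetchRecorder(respond: (call: RecordedCall) => Response) {
  const calls: RecordedCall[] = [];
  const original = globalThis.fetch;

  globalThis.fetch = (async (
    input: Parameters<typeof fetch>[0],
    init?: Parameters<typeof fetch>[1],
  ) => {
    const call: RecordedCall = {
      url: input instanceof Request ? input.url : String(input),
      method: init?.method ?? "GET",
      body: typeof init?.body === "string" ? init.body : null,
    };
    calls.push(call); // preserved in call order, so turn ordering can be checked
    return respond(call);
  }) as typeof fetch;

  // Callers should restore the real fetch after each test.
  return { calls, restore: () => { globalThis.fetch = original; } };
}
```

A test would install the recorder, run the provider, then inspect `calls` to confirm that turns were posted in transcript order against the same session before calling `restore()`.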

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

lobu-ai/lobu#911: Updates LobuProvider.callApi to support multi-turn transcript replay, directly preceding this PR's eval consolidation and testing.

Suggested labels

skip-size-check

Poem

🐰 From scattered YAMLs, a unified song,
Multi-turn chats flowing all along,
Transcripts replay in the same Lobu thread,
Tests ensure the final response is read! 📋✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding multi-turn eval support via vars.transcript to the promptfoo provider. |
| Description check | ✅ Passed | The description covers all template sections: Summary explains the key changes, Test plan lists completed validations, and Notes include design rationale and implementation details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

codecov-commenter · 2026-05-19T14:23:54Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
packages/cli/src/commands/init.ts	90.00%	1 Missing ⚠️

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

packages/promptfoo-provider/src/provider.ts (1)
113-177: ⚠️ Potential issue | 🔴 Critical

Run make build-packages before merging.

Changes to packages/promptfoo-provider/src/ require compilation to dist/. The dist/ directory is missing, indicating the package has not been built. Per coding guidelines, workspace packages must be compiled from source; make dev does not auto-rebuild them.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/src/provider.ts` around lines 113 - 177, The
package's compiled output (dist/) is missing for the updated provider
implementation (see callApi in the provider.ts change), so run the repository
build pipeline (run make build-packages) to compile packages/promptfoo-provider
from source into dist/, then add/commit the generated dist/ files so consumers
get the built artifacts; ensure the build succeeds and the distributed files
reflect the changes in callApi/sendAndCollect/deleteSession and related exports.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/personal-finance/agents/personal-finance/evals/README.md`:
- Around line 15-18: Update the documentation header count from “Six checks” to
“Seven checks” to match the listed evals: single-turn checks (vars.query)
include ping and tax-year-anchoring (two items described as three? ensure you
count them correctly) and multi-turn checks (vars.transcript) include
gap-surfacing, sa102-employment, sa105-property, sa108-cgt; adjust the opening
sentence so it accurately reads “Seven checks, two shapes:” to match the seven
enumerated checks (ping, tax-year-anchoring, gap-surfacing, sa102-employment,
sa105-property, sa108-cgt) referenced in the README.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b617f87e-ef17-4e10-b109-535d580c7cb4

📥 Commits

Reviewing files that changed from the base of the PR and between f8f087b and ca66f6a.

📒 Files selected for processing (9)

examples/personal-finance/agents/personal-finance/evals/README.md
examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml
examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
packages/promptfoo-provider/README.md
packages/promptfoo-provider/src/__tests__/provider.test.ts
packages/promptfoo-provider/src/provider.ts

💤 Files with no reviewable changes (4)

examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml

coderabbitai · 2026-05-19T14:26:27Z

+Six checks, two shapes:

-The remaining YAMLs — `gap-surfacing.yaml`, `sa102-employment.yaml`, `sa105-property.yaml`, `sa108-cgt.yaml` — are still on the old format and **not currently executable**. They are multi-turn conversational tests (e.g. `gap-surfacing.yaml` relies on context established in turn 1 to evaluate turn 2's behaviour) and promptfoo's parametric `tests:` model is single-turn by default. Porting needs either:
+- **Single-turn** (`vars.query`): `ping`, `tax-year-anchoring` (2024-25 boundary, 2025-26 boundary).
+- **Multi-turn** (`vars.transcript` — sequential user turns replayed in one Lobu thread; assertions evaluate the final response): `gap-surfacing`, `sa102-employment`, `sa105-property`, `sa108-cgt`. See `packages/promptfoo-provider/README.md` for the transcript protocol.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix coverage count mismatch.

Line 15 says “Six checks,” but the bullets on Lines 17–18 enumerate seven checks total (3 single-turn + 4 multi-turn). Update the count to avoid confusion in eval reporting docs.

buremba · 2026-05-19T14:55:42Z

Rebased onto feat/tool-use-sse to pre-resolve conflicts. Now mergeable AFTER #918 lands (the rebase incorporates #918's changes; merging #918 to main first then this PR will be clean).

If #918 lands a different version, redo this rebase against the merged commit.

…onal-finance evals @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. 5 provider tests pass (mock-fetch over the gateway endpoints) covering single-turn baseline, multi-turn ordering + session reuse + single cleanup, whitespace filter, non-array fallback, empty-array fallback. Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider files in main already include #918's additions; this commit is the strict multi-turn delta.

…e 4 personal-finance evals (#913)" (#920) This reverts commit 69151a9.

…onal-finance evals (#921) @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. Strictly additive atop current main (which already includes #918's tool_use SSE events). Re-do of #913 after #920 reverted that PR — the original landing accidentally undid #914 and #916 because of a bad rebase-and-soft-reset.

…lock image builds (#927)

PR #911 added `examples/personal-finance` to root `package.json`'s `workspaces` field but didn't update the Dockerfiles, which only COPY `packages/*/package.json` for the install layer. `bun install` inside the Docker build then errored:

error: Workspace not found "examples/personal-finance" at /app/package.json:8:5

Every image build on `main` since #911 merged (13:25 UTC today) has been red: #911 → #913 (+revert) → #914 → #915 → #919 → #923 → #924 → #912 → #925. All of them are sitting on `main` un-deployable, including the `principal_kind` migration from #923 and my own loading-skeletons shipping artifacts.

Two ways to fix it:

1. **Add stubs to all three Dockerfiles** for the example. Treats the symptom; couples prod build pipeline to whatever's under `examples/`, wrong direction.
2. **Take the example out of root workspaces.** Examples are documentation/demos for users to clone + run; they don't belong in the prod build graph. Cleaner separation.

Going with (2). Side effects:

- Example's dependency on `@lobu/promptfoo-provider` switched from `workspace:*` (workspace-protocol-only) to `file:../../packages/promptfoo-provider`. Resolves locally without requiring the example to be in a workspace; consumers run `cd examples/personal-finance && bun install` standalone (after building the provider once: `cd packages/promptfoo-provider && bun run build`).
- `bun.lock` regenerated. Most of the diff is bun's "linked workspaces" table shrinking; no upstream version churn.

Verified: simulated Docker build context (root files + stubbed packages/* manifests + provider stub, no examples/) runs `bun install` cleanly. No "Workspace not found" error.

coderabbitai Bot reviewed May 19, 2026

buremba force-pushed the feat/promptfoo-multiturn branch from ca66f6a to c017b0b Compare May 19, 2026 14:55

buremba force-pushed the feat/promptfoo-multiturn branch from c017b0b to 8880f29 Compare May 19, 2026 14:59

buremba merged commit 69151a9 into main May 19, 2026
3 checks passed

buremba deleted the feat/promptfoo-multiturn branch May 19, 2026 14:59

buremba mentioned this pull request May 19, 2026

fix: revert #913 to restore #914 + #916 changes lost in bad squash #920

Merged

4 tasks

buremba added a commit that referenced this pull request May 19, 2026

Revert "feat(promptfoo-provider): vars.transcript multi-turn + migrat…

b648924

…e 4 personal-finance evals (#913)" (#920) This reverts commit 69151a9.

This was referenced May 19, 2026

chore(main): release lobu 8.0.0 #897

Merged

feat(promptfoo-provider): vars.transcript multi-turn + migrate 4 personal-finance evals (redo) #921

Merged

buremba mentioned this pull request May 19, 2026

fix(build): drop examples/personal-finance from root workspaces — unblock image builds #927

Merged

2 tasks

buremba merged 1 commit into main from feat/promptfoo-multiturn
