feat(promptfoo-provider): support vars.transcript for multi-turn evals#913

Merged
buremba merged 1 commit into
mainfrom
feat/promptfoo-multiturn
May 19, 2026

@buremba buremba commented May 19, 2026

Summary

  • LobuProvider.callApi now replays context.vars.transcript (a string[]) as sequential user turns in one Lobu thread when set, and returns the final assistant response for assertion. Falls back to single-turn behaviour when transcript is missing, non-array, or empty.
  • Migrates the four dormant multi-turn YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) from examples/personal-finance/agents/personal-finance/evals/ into the active promptfooconfig.yaml. The old .yaml files are deleted. All six evals (seven test cases, because tax-year-anchoring covers two boundary cases) are now executable via bun run evals.
  • New bun test (packages/promptfoo-provider/src/__tests__/provider.test.ts) mocks globalThis.fetch to cover the gateway protocol: single-turn baseline, multi-turn ordering + session re-use, empty-entry filtering, non-array fallback, empty-array fallback.
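
The transcript handling in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in provider.ts: the `extractTranscript` name is taken from the review walkthrough, but its signature, the `turnsFor` helper, returning `null` to signal the fallback, and treating an all-blank array as empty are all assumptions.

```typescript
// Hypothetical sketch: signatures and helper names are illustrative, not the provider's real API.
type Vars = Record<string, unknown>;

/** Returns the usable transcript turns, or null when the caller should fall back to single-turn. */
function extractTranscript(vars: Vars | undefined): string[] | null {
  const raw = vars?.transcript;
  // Missing or non-array transcript: fall back.
  if (!Array.isArray(raw)) return null;
  // Keep only non-blank strings so a stray empty YAML entry does not become a blank turn.
  const turns = raw.filter(
    (entry): entry is string => typeof entry === "string" && entry.trim().length > 0,
  );
  // Assumption: an array that is empty after filtering is treated like an empty array.
  return turns.length > 0 ? turns : null;
}

/** Chooses the user turns to send: the transcript when usable, otherwise the rendered prompt. */
function turnsFor(prompt: string, vars?: Vars): string[] {
  return extractTranscript(vars) ?? [prompt];
}
```

With this shape, `turnsFor(prompt, context.vars)` yields one turn for existing single-turn tests and several for transcript tests, so the sending loop does not need to branch.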

Design notes

  • transcript ignores the rendered prompt when set — the transcript is the source of truth for what the user said. Documented in packages/promptfoo-provider/README.md.
  • Per-turn assertions are not exposed on purpose. Promptfoo's tests: shape is single-assertion-set-per-test; replicating per-turn semantics would require rewriting the test runner, and the agent's final response is what the user actually sees. If an intermediate turn matters, encode it as a rubric on the final response (see the sa102/sa105/sa108 entries which ask the rubric to verify the final response references earlier context).
  • Whitespace-only / empty transcript entries are filtered so an accidental trailing newline in YAML doesn't send a blank turn.
  • Bails on the first errored turn — assertions against a broken thread would be meaningless.
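
The replay-and-bail behaviour in these notes might look like the sketch below. `TurnResult`, `SendTurn`, and `replayTurns` are hypothetical names, and the injected `send` function stands in for whatever the provider actually uses to post a message into an existing Lobu thread; the error wording is also invented.

```typescript
// Hypothetical sketch: `TurnResult`, `SendTurn`, and `replayTurns` are illustrative names.
interface TurnResult {
  output?: string;
  error?: string;
}

// Stand-in for whatever sends one user message into an existing thread.
type SendTurn = (sessionId: string, text: string) => Promise<TurnResult>;

/** Sends turns in order within one session; stops at the first error, else returns the last result. */
async function replayTurns(
  sessionId: string,
  turns: string[],
  send: SendTurn,
): Promise<TurnResult> {
  let last: TurnResult = { error: "no turns to send" };
  let turnNumber = 0;
  for (const text of turns) {
    turnNumber += 1;
    // Await each turn before sending the next so ordering is preserved.
    last = await send(sessionId, text);
    if (last.error !== undefined) {
      // Bail: later turns would run against a broken thread.
      return { error: `turn ${turnNumber} failed: ${last.error}` };
    }
  }
  // Only the final assistant response is handed back for assertion.
  return last;
}
```

Returning only the last result matches the final-response-only assertion model; intermediate outputs are deliberately dropped in this sketch.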

Test plan

  • make typecheck — clean.
  • bun test packages/promptfoo-provider/src — 5/5 pass (single-turn, multi-turn ordering, whitespace filter, non-array fallback, empty-array fallback).
  • bun run check — pre-existing unrelated warning in packages/cli/src/__tests__/cli-ux.test.ts; no new violations.
  • YAML parses to 7 tests (3 single-turn + 4 multi-turn) with the right vars.transcript shape.
  • End-to-end run against a live gateway (bun run evals from examples/personal-finance/) — left to the operator with an active LOBU_TOKEN + the personal-finance agent deployed; the unit test exercises the gateway protocol shape.
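
The ordering and session re-use checks in the unit test can be approximated with a recording fake. The PR's test mocks `globalThis.fetch`; to stay self-contained, this sketch instead injects a fetch-like function. The `/sessions/.../messages` path, the `{ text }` payload, and every helper name here are invented for illustration and are not the gateway protocol.

```typescript
// Hypothetical sketch: endpoint path, payload shape, and helper names are invented.
interface RecordedCall {
  url: string;
  body: unknown;
}

type FetchLike = (
  url: string,
  init: { method: string; body: string },
) => Promise<{ ok: boolean }>;

/** Builds a fake fetch that records every request instead of touching the network. */
function createRecordingFetch(): { fetch: FetchLike; calls: RecordedCall[] } {
  const calls: RecordedCall[] = [];
  const fetch: FetchLike = async (url, init) => {
    calls.push({ url, body: JSON.parse(init.body) });
    return { ok: true };
  };
  return { fetch, calls };
}

/** Toy client under test: posts each turn to the same (made-up) session URL, in order. */
async function postTurns(
  fetchFn: FetchLike,
  sessionId: string,
  turns: string[],
): Promise<void> {
  for (const text of turns) {
    await fetchFn(`/sessions/${sessionId}/messages`, {
      method: "POST",
      body: JSON.stringify({ text }),
    });
  }
}
```

A test would then call `postTurns` with the recorder's `fetch` and assert that `calls` holds one entry per turn, all with the same URL, in transcript order.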

Summary by CodeRabbit

  • New Features

    • Added multi-turn evaluation transcript support for sequential conversation testing.
  • Documentation

    • Updated README documenting multi-turn evaluation execution and configuration.
    • Updated evaluation framework documentation with coverage details.
  • Tests

    • Added tests for multi-turn transcript evaluation scenarios.
  • Chores

    • Consolidated and updated evaluation configurations with new multi-turn test cases.

coderabbitai Bot commented May 19, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d1d9150c-d614-4d9f-ba69-91f21d5589c9

📥 Commits

Reviewing files that changed from the base of the PR and between c017b0b and 8880f29.

📒 Files selected for processing (9)
  • examples/personal-finance/agents/personal-finance/evals/README.md
  • examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
  • examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
  • packages/promptfoo-provider/README.md
  • packages/promptfoo-provider/src/__tests__/provider.test.ts
  • packages/promptfoo-provider/src/provider.ts

📝 Walkthrough

Walkthrough

This PR consolidates standalone YAML eval definitions into a unified promptfooconfig.yaml and implements multi-turn transcript-driven evaluation execution in the promptfoo provider. Four new two-turn behavioral test cases are added (gap-surfacing, SA102 employment, SA105 rental property, SA108 capital gains), supported by provider changes and comprehensive tests.

Changes

Multi-turn eval consolidation and provider implementation

| Layer | File(s) | Summary |
| --- | --- | --- |
| Eval consolidation into promptfooconfig and README | examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml, examples/personal-finance/agents/personal-finance/evals/README.md | Removes four standalone YAML eval files (gap-surfacing.yaml, sa102-employment.yaml, sa105-property.yaml, sa108-cgt.yaml), consolidates all definitions into promptfooconfig.yaml with four new two-turn multi-turn behavioral test cases (gap-surfacing, SA102, SA105, SA108), and updates README to document the unified config structure and evaluation coverage. |
| LobuProvider multi-turn transcript implementation | packages/promptfoo-provider/src/provider.ts | LobuProvider.callApi now derives turns from context.vars.transcript (or falls back to single prompt), iterates sendAndCollect sequentially within the same Lobu session, and returns the final turn's response; extractTranscript helper validates and filters transcript entries. |
| Provider multi-turn transcript tests | packages/promptfoo-provider/src/__tests__/provider.test.ts | Comprehensive test suite with mock gateway/fetch/SSE covering single-turn behavior, multi-turn transcript replay across three turns, transcript filtering of empty/whitespace entries, and fallback behavior when transcript is unavailable or empty. |
| Provider documentation for multi-turn evals | packages/promptfoo-provider/README.md | README adds "Multi-turn evals" section documenting vars.transcript as string[], single-thread replay behavior, final-response-only assertions, and fallback handling; includes YAML examples for single-turn and multi-turn rubric-based refusal tests. |
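
For orientation, a multi-turn test case with a rubric on the final response has roughly the shape below, written as a TypeScript literal rather than YAML. The local `Assertion` and `EvalTest` types and all of the sample wording are invented for illustration. Only `vars.transcript` as a `string[]`, `vars.query` for single-turn tests, and final-response-only assertions come from this PR; `llm-rubric` is promptfoo's model-graded assertion type.

```typescript
// Hypothetical sketch: local types and sample text are invented; they mirror the YAML shape only.
interface Assertion {
  type: string;
  value: string;
}

interface EvalTest {
  description: string;
  vars: { query?: string; transcript?: string[] };
  assert: Assertion[];
}

const multiTurnExample: EvalTest = {
  description: "second turn must build on context from the first",
  vars: {
    // Each entry is replayed as one user turn, in order, in a single thread.
    transcript: [
      "I had a salaried job for the whole tax year.",
      "What else do you need from me?",
    ],
  },
  assert: [
    {
      // The rubric is evaluated against the final assistant response only.
      type: "llm-rubric",
      value: "The final response refers back to the employment mentioned in the first turn.",
    },
  ],
};
```

Because intermediate responses are not asserted on, the rubric text carries the burden of checking that earlier context was used.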

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • lobu-ai/lobu#911: Updates LobuProvider.callApi to support multi-turn transcript replay, directly preceding this PR's eval consolidation and testing.

Suggested labels

skip-size-check

Poem

🐰 From scattered YAMLs, a unified song,
Multi-turn chats flowing all along,
Transcripts replay in the same Lobu thread,
Tests ensure the final response is read! 📋✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 23.08% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding multi-turn eval support via vars.transcript to the promptfoo provider. |
| Description check | ✅ Passed | The description covers all template sections: Summary explains the key changes, Test plan lists completed validations, and Notes include design rationale and implementation details. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

codecov-commenter commented May 19, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| packages/cli/src/commands/init.ts | 90.00% | 1 Missing ⚠️ |

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/promptfoo-provider/src/provider.ts (1)

113-177: ⚠️ Potential issue | 🔴 Critical

Run make build-packages before merging.

Changes to packages/promptfoo-provider/src/ require compilation to dist/. The dist/ directory is missing, indicating the package has not been built. Per coding guidelines, workspace packages must be compiled from source; make dev does not auto-rebuild them.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/src/provider.ts` around lines 113 - 177, The
package's compiled output (dist/) is missing for the updated provider
implementation (see callApi in the provider.ts change), so run the repository
build pipeline (run make build-packages) to compile packages/promptfoo-provider
from source into dist/, then add/commit the generated dist/ files so consumers
get the built artifacts; ensure the build succeeds and the distributed files
reflect the changes in callApi/sendAndCollect/deleteSession and related exports.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: b617f87e-ef17-4e10-b109-535d580c7cb4

📥 Commits

Reviewing files that changed from the base of the PR and between f8f087b and ca66f6a.

📒 Files selected for processing (9)
  • examples/personal-finance/agents/personal-finance/evals/README.md
  • examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
  • examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
  • packages/promptfoo-provider/README.md
  • packages/promptfoo-provider/src/__tests__/provider.test.ts
  • packages/promptfoo-provider/src/provider.ts
💤 Files with no reviewable changes (4)
  • examples/personal-finance/agents/personal-finance/evals/sa108-cgt.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa105-property.yaml
  • examples/personal-finance/agents/personal-finance/evals/gap-surfacing.yaml
  • examples/personal-finance/agents/personal-finance/evals/sa102-employment.yaml

Comment on lines +15 to +18
+Six checks, two shapes:
+
-The remaining YAMLs — `gap-surfacing.yaml`, `sa102-employment.yaml`, `sa105-property.yaml`, `sa108-cgt.yaml` — are still on the old format and **not currently executable**. They are multi-turn conversational tests (e.g. `gap-surfacing.yaml` relies on context established in turn 1 to evaluate turn 2's behaviour) and promptfoo's parametric `tests:` model is single-turn by default. Porting needs either:
+- **Single-turn** (`vars.query`): `ping`, `tax-year-anchoring` (2024-25 boundary, 2025-26 boundary).
+- **Multi-turn** (`vars.transcript` — sequential user turns replayed in one Lobu thread; assertions evaluate the final response): `gap-surfacing`, `sa102-employment`, `sa105-property`, `sa108-cgt`. See `packages/promptfoo-provider/README.md` for the transcript protocol.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix coverage count mismatch.

Line 15 says “Six checks,” but the bullets on Lines 17–18 enumerate seven checks total (3 single-turn + 4 multi-turn). Update the count to avoid confusion in eval reporting docs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/personal-finance/agents/personal-finance/evals/README.md` around
lines 15 - 18, update the documentation header count from “Six checks” to “Seven
checks” so it matches the listed test cases: the single-turn checks (vars.query)
are ping plus the two tax-year-anchoring boundary cases (2024-25 and 2025-26),
and the multi-turn checks (vars.transcript) are gap-surfacing,
sa102-employment, sa105-property, and sa108-cgt. The opening sentence should
read “Seven checks, two shapes:”.

@buremba buremba force-pushed the feat/promptfoo-multiturn branch from ca66f6a to c017b0b May 19, 2026 14:55

buremba commented May 19, 2026

Rebased onto feat/tool-use-sse to pre-resolve conflicts. Now mergeable AFTER #918 lands (the rebase incorporates #918's changes; merging #918 to main first then this PR will be clean).

If #918 lands a different version, redo this rebase against the merged commit.

…onal-finance evals

@lobu/promptfoo-provider gains vars.transcript: string[] support — replays
sequential turns in one Lobu thread, returns the final assistant response
for assertion. Single-turn callers via plain prompt are unchanged.

Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing,
sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml
using vars.transcript. Deletes the original YAML files.

5 provider tests pass (mock-fetch over the gateway endpoints) covering
single-turn baseline, multi-turn ordering + session reuse + single
cleanup, whitespace filter, non-array fallback, empty-array fallback.

Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider
files in main already include #918's additions; this commit is the
strict multi-turn delta.
@buremba buremba force-pushed the feat/promptfoo-multiturn branch from c017b0b to 8880f29 May 19, 2026 14:59
@buremba buremba merged commit 69151a9 into main May 19, 2026
3 checks passed
@buremba buremba deleted the feat/promptfoo-multiturn branch May 19, 2026 14:59
buremba added a commit that referenced this pull request May 19, 2026
…e 4 personal-finance evals (#913)" (#920)

This reverts commit 69151a9.
buremba added a commit that referenced this pull request May 19, 2026
…onal-finance evals (#921)

@lobu/promptfoo-provider gains vars.transcript: string[] support — replays
sequential turns in one Lobu thread, returns the final assistant response
for assertion. Single-turn callers via plain prompt are unchanged.

Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing,
sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml
using vars.transcript. Deletes the original YAML files.

Strictly additive atop current main (which already includes #918's tool_use
SSE events). Re-do of #913 after #920 reverted that PR — the original
landing accidentally undid #914 and #916 because of a bad rebase-and-soft-reset.
buremba added a commit that referenced this pull request May 19, 2026
…lock image builds (#927)

PR #911 added `examples/personal-finance` to root `package.json`'s
`workspaces` field but didn't update the Dockerfiles, which only COPY
`packages/*/package.json` for the install layer. `bun install` inside
the Docker build then errored:

    error: Workspace not found "examples/personal-finance"
        at /app/package.json:8:5

Every image build on `main` since #911 merged (13:25 UTC today) has
been red: #911 → #913 (+revert) → #914 → #915 → #919 → #923 → #924 → #912 → #925, all sitting on `main` un-deployable, including the
`principal_kind` migration from #923 and my own loading-skeletons
shipping artifacts.

Two ways to fix it:

1. **Add stubs to all three Dockerfiles** for the example. Treats the
   symptom; couples prod build pipeline to whatever's under `examples/`,
   wrong direction.
2. **Take the example out of root workspaces.** Examples are
   documentation/demos for users to clone + run; they don't belong in
   the prod build graph. Cleaner separation.

Going with (2). Side effects:

- Example's dependency on `@lobu/promptfoo-provider` switched from
  `workspace:*` (workspace-protocol-only) to
  `file:../../packages/promptfoo-provider`. Resolves locally without
  requiring the example to be in a workspace; consumers run
  `cd examples/personal-finance && bun install` standalone (after
  building the provider once: `cd packages/promptfoo-provider && bun
  run build`).
- `bun.lock` regenerated. Most of the diff is bun's "linked
  workspaces" table shrinking — no upstream version churn.

Verified: simulated Docker build context (root files + stubbed
packages/* manifests + provider stub, no examples/) runs `bun install`
cleanly. No "Workspace not found" error.