feat(gateway): tool_use SSE events for client-side trace inspection #918
Add a `tool_use` SSE event type so external clients (the @lobu/promptfoo-provider,
the Mac menubar, anything subscribed to `/lobu/api/v1/agents/<id>/events`) can
observe tool calls as they happen. Additive — no existing event renamed or
removed.
**Worker** (`packages/agent-worker/src/openclaw/`):
- Subscribe to pi-agent's `tool_execution_end` and emit a `tool_use` custom
event via the existing onProgress / sendCustomEvent path.
- `tool-use-events.ts` builds the payload: `{ toolCallId, name, input, isError,
result_summary? }` mirroring Anthropic tool-use blocks.
- For `search_memory` / `lobu_search_memory`, `result_summary` includes the
returned `event_ids` plus inline `snippets[]` (id + text) so clients can
compute `retrievedContext` without a server round-trip. Handles both raw
search_memory body and MCP CallToolResult wrapping.
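A minimal sketch of the payload shape described above. The field names come from this description; the `ToolEndEventSketch` type, the `summarizeSearchBody` helper, and the assumed `{ events: [{ id, text }] }` search body are illustrative guesses, not the contents of the real `tool-use-events.ts`.

```typescript
// Field names follow the PR description; all other names are illustrative.
interface ToolUseSnippet {
  id: number;
  text: string;
}

interface ToolUsePayload {
  toolCallId: string;
  name: string;
  input: Record<string, unknown>;
  isError: boolean;
  result_summary?: { event_ids: number[]; snippets: ToolUseSnippet[] };
}

// Hypothetical event shape; the real pi-agent event may differ.
interface ToolEndEventSketch {
  toolCallId: string;
  toolName: string;
  isError: boolean;
  result: unknown;
}

const RETRIEVAL_TOOLS = new Set(["search_memory", "lobu_search_memory"]);

// Assumption: the search body looks like { events: [{ id, text }] }.
function summarizeSearchBody(
  body: unknown
): ToolUsePayload["result_summary"] {
  if (!body || typeof body !== "object") return undefined;
  const events = (body as { events?: unknown }).events;
  if (!Array.isArray(events)) return undefined;
  const snippets: ToolUseSnippet[] = [];
  for (const e of events) {
    if (!e || typeof e !== "object") continue;
    const { id, text } = e as { id?: unknown; text?: unknown };
    if (typeof id === "number" && typeof text === "string") {
      snippets.push({ id, text });
    }
  }
  return { event_ids: snippets.map((s) => s.id), snippets };
}

function buildPayloadSketch(
  event: ToolEndEventSketch,
  input: Record<string, unknown>
): ToolUsePayload {
  const payload: ToolUsePayload = {
    toolCallId: event.toolCallId,
    name: event.toolName,
    input,
    isError: event.isError,
  };
  // Only retrieval tools get a result_summary; other tools stay lean.
  if (!event.isError && RETRIEVAL_TOOLS.has(event.toolName)) {
    const summary = summarizeSearchBody(event.result);
    if (summary) payload.result_summary = summary;
  }
  return payload;
}
```

Keeping `result_summary` optional means non-retrieval tools produce a small payload, while retrieval tools carry enough text for a client to assemble context locally.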
**Gateway**:
- No router change needed — `unified-thread-consumer.ts` already broadcasts
customEvents under `data.customEvent.name`, so the SSE event name lands as
`tool_use` for both the conversation channel and the cli-session id.
- Test coverage added for the customEvent → broadcast path.
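For a client subscribed to the events endpoint, picking out `tool_use` frames could look roughly like the sketch below. It assumes standard SSE framing (`event:` and `data:` lines separated by a blank line) and that each chunk contains only complete frames; `parseSseFrames` and `toolUseNames` are hypothetical helpers, not part of the gateway or provider.

```typescript
interface SseFrame {
  event: string;
  data: string;
}

// Splits a chunk of SSE text into frames. Assumes complete frames only;
// a real client would buffer partial frames across network reads.
function parseSseFrames(chunk: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default event name
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith("event:")) {
        event = line.slice("event:".length).trim();
      } else if (line.startsWith("data:")) {
        dataLines.push(line.slice("data:".length).replace(/^ /, ""));
      }
    }
    if (dataLines.length > 0) {
      frames.push({ event, data: dataLines.join("\n") });
    }
  }
  return frames;
}

// Returns the tool names from every tool_use frame, ignoring other events.
function toolUseNames(chunk: string): string[] {
  return parseSseFrames(chunk)
    .filter((frame) => frame.event === "tool_use")
    .map((frame) => (JSON.parse(frame.data) as { name: string }).name);
}
```

Because clients filter on the event name, a client that does not know `tool_use` simply never matches it, which is what makes the change additive.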
**Provider** (`@lobu/promptfoo-provider`):
- `collectResponse` handles the new event: accumulates calls into
`metadata.toolCalls`, joins retrieval-tool snippet text into
`metadata.retrievedContext`.
- README rewritten — drop the "Known limitations" disclaimer; document the
RAG-assertion pattern (`context-recall` + `contextTransform`) and the
`javascript`-assertion pattern over `toolCalls`.
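The accumulation step can be sketched as follows. The de-duplication by event id is mentioned in the PR summary; the `TraceAccumulator` class name and the blank-line separator used to join snippets are assumptions made for illustration and may not match `collectResponse`.

```typescript
interface Snippet {
  id: number;
  text: string;
}

interface ToolCallSketch {
  toolCallId: string;
  name: string;
  result_summary?: { event_ids: number[]; snippets?: Snippet[] };
}

const RETRIEVAL_TOOLS = new Set(["search_memory", "lobu_search_memory"]);

class TraceAccumulator {
  readonly toolCalls: ToolCallSketch[] = [];
  private readonly seenIds = new Set<number>();
  private readonly texts: string[] = [];

  // Records every call; only retrieval tools contribute snippet text.
  add(call: ToolCallSketch): void {
    this.toolCalls.push(call);
    if (!RETRIEVAL_TOOLS.has(call.name)) return;
    for (const snippet of call.result_summary?.snippets ?? []) {
      if (this.seenIds.has(snippet.id)) continue; // de-dupe by event id
      this.seenIds.add(snippet.id);
      this.texts.push(snippet.text);
    }
  }

  // Separator is an assumption; undefined when nothing was retrieved.
  get retrievedContext(): string | undefined {
    return this.texts.length > 0 ? this.texts.join("\n\n") : undefined;
  }
}
```

Returning `undefined` rather than an empty string lets callers omit `retrievedContext` from metadata when no retrieval tool ran.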
**Example**:
- `examples/personal-finance/.../promptfooconfig.yaml` gains a retrieval test
that asserts the agent called `search_memory` before answering, exercising
the new metadata path end-to-end.
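An assertion of that kind might be written as the function below. It is a sketch that assumes promptfoo hands a JavaScript assertion the model output and a context object, and that provider metadata sits on `context.providerResponse.metadata` (the shape a later commit in this PR settles on). `AssertionContextSketch` is a trimmed local stand-in, not promptfoo's exported type.

```typescript
// Trimmed stand-in for the assertion context; only the fields used here.
interface AssertionContextSketch {
  vars: Record<string, unknown>;
  providerResponse?: {
    metadata?: { toolCalls?: Array<{ name: string }> };
  };
}

// Passes when the agent called a retrieval tool at least once.
function calledSearchMemory(
  _output: string,
  context: AssertionContextSketch
): boolean {
  const calls = context.providerResponse?.metadata?.toolCalls ?? [];
  return calls.some(
    (call) => call.name === "search_memory" || call.name === "lobu_search_memory"
  );
}
```

Reading from `providerResponse` rather than `vars` matters because `vars` only holds the test's input variables.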
**Validation**:
- `make typecheck` (strict) clean.
- `bun test` for the three new test files: 13 pass.
- Pre-existing `apply-cmd-*` / `desired-state-*` failures in `make test-unit`
are unrelated (connector-sdk dist shape) and present on a clean main.
📝 Walkthrough
Worker emits structured `tool_use` custom events (built with `buildToolUseEventPayload`) on tool execution end; the gateway broadcasts them over SSE; the provider normalizes `tool_use` payloads, accumulates `toolCalls` and retrieval snippets, and exposes them as `metadata.toolCalls` and `metadata.retrievedContext` for promptfoo RAG assertions.
Changes
Tool-use event tracing across worker, gateway, and provider
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/agent-worker/src/openclaw/worker.ts (1)
1466-1490: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Wait for pending `tool_use` sends before resolving the turn.
At Lines 1468-1480, `onProgress(...)` is fire-and-forget. The turn can resolve on `agent_end` before these sends settle, so `tool_use` events may arrive late or be dropped when completion closes the stream.
Proposed fix
  // Wire events through progress processor with delta batching
  let pendingDelta = "";
+ const pendingCustomEventSends = new Set<Promise<void>>();
  const DELTA_BATCH_INTERVAL_MS = 150;
@@
  if (event.type === "tool_execution_end") {
    const payload = buildToolUseEventPayload(event);
-   onProgress({
+   const sendPromise = onProgress({
      type: "custom_event",
      data: {
        name: "tool_use",
        payload: payload as unknown as Record<string, unknown>,
      },
      timestamp: Date.now(),
-   }).catch((err) => {
-     logger.warn(
-       `Failed to emit tool_use custom event for ${event.toolName}:`,
-       err
-     );
-   });
+   });
+   pendingCustomEventSends.add(sendPromise);
+   sendPromise
+     .catch((err) => {
+       logger.warn(
+         `Failed to emit tool_use custom event for ${event.toolName}:`,
+         err
+       );
+     })
+     .finally(() => {
+       pendingCustomEventSends.delete(sendPromise);
+     });
  }
  if (event.type === "agent_end") {
-   flushDelta()
+   Promise.allSettled([...pendingCustomEventSends])
+     .then(() => flushDelta())
      .then(() => resolveTurnDone?.())
      .catch((err) => {
        logger.error("Failed to flush final delta:", err);
        resolveTurnDone?.();
      });
  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/agent-worker/src/openclaw/worker.ts` around lines 1466 - 1490, The tool_use onProgress call is fire-and-forget and can still be pending when agent_end resolves; change the logic to track and await pending onProgress promises for tool_use before calling resolveTurnDone. Specifically, when event.type === "tool_execution_end" (and using buildToolUseEventPayload and onProgress), push the returned promise (with a catch that logs via logger.warn) into a local pending array; then when handling event.type === "agent_end" (and calling flushDelta and resolveTurnDone), await Promise.all on that pending array (handling any rejections with logging) before invoking resolveTurnDone so all tool_use sends settle first. Ensure the pending array is cleared after awaiting to avoid memory leaks.
packages/promptfoo-provider/src/provider.ts (1)
153-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Preserve tool-call metadata on error responses.
On Line 153, the error return omits `toolCalls` / `retrievedContext`, even though `collectResponse()` may already have captured `tool_use` events. This drops trace data exactly on failure paths.
Suggested fix
  if (response.error) {
    return {
      output: response.text,
      error: response.error,
      metadata: {
        agent: this.agent,
        thread,
        traceId: response.traceId,
+       ...(response.toolCalls && response.toolCalls.length > 0
+         ? { toolCalls: response.toolCalls }
+         : {}),
+       ...(response.retrievedContext
+         ? { retrievedContext: response.retrievedContext }
+         : {}),
      },
    };
  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/promptfoo-provider/src/provider.ts` around lines 153 - 162, When handling an error from response (the branch checking response.error), preserve any tool-call metadata collected by collectResponse() by including toolCalls and retrievedContext in the returned metadata object; update the error return to include metadata: { agent: this.agent, thread, traceId: response.traceId, toolCalls: response.toolCalls || [], retrievedContext: response.retrievedContext || null } (or pull these values from whatever local variables collectResponse() populates) so tool_use events aren't dropped on failure paths.
🧹 Nitpick comments (1)
packages/promptfoo-provider/src/__tests__/provider.test.ts (1)
61-161: ⚡ Quick win
Add a regression test for `tool_use` + `error` finalization.
Current cases only assert success completion. Add one case where a `tool_use` event is emitted before an `error` event, and assert metadata still includes captured `toolCalls` (and retrieval context when present).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/promptfoo-provider/src/__tests__/provider.test.ts` around lines 61 - 161, Add a regression test that emits a tool_use SSE event followed by an error/complete error finalization and asserts captured toolCalls (and retrievedContext for search_memory) are still present: create a new test in provider.test.ts that uses sseEvent to emit a tool_use (e.g. toolCallId "tc-err", name "search_memory" with result_summary/snippets) then an sseEvent("error", { messageId: "msg-1", error: ... }) (or an error finalization your SSE stub uses), stub fetch with createFetchStub, call LobuProvider.callApi, and assert result.metadata.toolCalls contains the tool call and result_summary.event_ids and that result.metadata.retrievedContext equals the concatenated snippets when applicable; mirror existing tests’ structure (use LobuProvider, callApi, expect on metadata.toolCalls and retrievedContext) to ensure behavior on error finalization is covered.
📒 Files selected for processing (9)
- examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
- packages/agent-worker/src/__tests__/tool-use-events.test.ts
- packages/agent-worker/src/openclaw/tool-use-events.ts
- packages/agent-worker/src/openclaw/worker.ts
- packages/promptfoo-provider/HANDOFF.md
- packages/promptfoo-provider/README.md
- packages/promptfoo-provider/src/__tests__/provider.test.ts
- packages/promptfoo-provider/src/provider.ts
- packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts
if (type === "text" && typeof text === "string") {
  try {
    return JSON.parse(text);
  } catch {
    // Plain text result — nothing to summarise.
    return null;
  }
Don’t stop scanning MCP content after the first unparsable text block.
At Lines 126-128, returning null inside catch exits early and can miss a later valid JSON content part, which drops result_summary unexpectedly.
Proposed fix
function extractSearchMemoryBody(raw: unknown): unknown {
if (!raw || typeof raw !== "object") return null;
@@
const mcpContent = (raw as { content?: unknown }).content;
if (Array.isArray(mcpContent)) {
for (const part of mcpContent) {
if (!part || typeof part !== "object") continue;
const type = (part as { type?: unknown }).type;
const text = (part as { text?: unknown }).text;
if (type === "text" && typeof text === "string") {
try {
return JSON.parse(text);
} catch {
- // Plain text result — nothing to summarise.
- return null;
+ // Not JSON; keep scanning other parts.
+ continue;
}
}
}
+ return null;
}
// Already the search_memory body.
return raw;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/agent-worker/src/openclaw/tool-use-events.ts` around lines 123 -
129, The current handler in openclaw/tool-use-events.ts returns null inside the
catch when JSON.parse(text) fails, which aborts scanning and can miss later
valid JSON content (dropping result_summary); change the logic in the block that
checks if (type === "text" && typeof text === "string") so that on JSON.parse
failure you do not return null but instead skip this text entry and continue
scanning subsequent parts (e.g., use continue or otherwise ignore this entry),
allowing later valid JSON content to be parsed and result_summary to be set;
ensure any eventual fallback still returns null only after all parts are
examined.
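A self-contained version of the function with the proposed fix applied might look like this. It is reconstructed from the diff in this comment; the lines the diff elides (`@@`) are omitted rather than guessed, so treat it as a sketch of the scanning logic and not as the full source of `tool-use-events.ts`.

```typescript
function extractSearchMemoryBody(raw: unknown): unknown {
  if (!raw || typeof raw !== "object") return null;
  // (The real function has further lines here that the diff elides.)
  const mcpContent = (raw as { content?: unknown }).content;
  if (Array.isArray(mcpContent)) {
    for (const part of mcpContent) {
      if (!part || typeof part !== "object") continue;
      const type = (part as { type?: unknown }).type;
      const text = (part as { text?: unknown }).text;
      if (type === "text" && typeof text === "string") {
        try {
          return JSON.parse(text);
        } catch {
          // Not JSON; keep scanning the remaining parts.
          continue;
        }
      }
    }
    // Every part was examined and none held JSON.
    return null;
  }
  // Already the search_memory body.
  return raw;
}
```

The only behavioural change from the original is that a plain-text part no longer ends the scan, so a JSON part later in the same `content` array is still found.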
expect(conversationBroadcast?.[2]).toMatchObject({
  toolCallId: "tc-1",
  name: "search_memory",
  result_summary: {
    event_ids: [42],
  },
  messageId: "m1",
  timestamp: 1000,
});
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Verify that result_summary.snippets, input, and isError are included in the broadcast.
The test constructs a payload containing input, isError, and result_summary.snippets but only verifies a subset of these fields in the broadcast assertion. Since the PR objectives emphasize that retrieval tools include "event_ids and inline snippets," the test should confirm that snippets are actually broadcast to prevent regressions where critical fields might be accidentally dropped.
🧪 Expand assertions to cover the complete payload
expect(conversationBroadcast?.[2]).toMatchObject({
toolCallId: "tc-1",
name: "search_memory",
+ input: { query: "rent" },
+ isError: false,
result_summary: {
event_ids: [42],
+ snippets: [{ id: 42, text: "Rent is due 1st" }],
},
messageId: "m1",
timestamp: 1000,
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts` around
lines 111 - 119, The test's assertion for conversationBroadcast (variable
conversationBroadcast) only checks a subset of the payload; update the assertion
that checks the broadcasted message for toolCallId "tc-1" and name
"search_memory" to also assert result_summary.snippets, input, and isError are
present and match the constructed payload (alongside existing event_ids,
messageId "m1" and timestamp 1000). Modify the
expect(conversationBroadcast?.[2]) check in unified-thread-consumer.test.ts to
include result_summary.snippets, input, and isError (either by expanding the
toMatchObject expected object or adding explicit expects) so the test validates
the full broadcast payload.
Address four real concerns raised by `pi -p` on the previous commit:
1. **tool_use payload was losing `input` args.** pi-agent's
`tool_execution_end` event omits `args` (those only fire on the matching
`tool_execution_start`). worker.ts now tracks a `Map<toolCallId, args>` from
the start event and merges them into the payload at end. Test comment
updated to call this out so future readers don't regress it.
2. **`result_summary` would never populate for live `search_memory` calls.**
The MCP proxy returns formatted markdown to workers by default; my extractor
expected JSON-stringified bodies. Two changes:
- `callMcpTool` now sends `x-mcp-format: json` for known retrieval tools
(`search_memory`, `lobu_search_memory`) only — keeps existing markdown
behaviour for every other tool the agent sees.
- MCP proxy's `handleCallTool` forwards the caller's `x-mcp-format` header
to upstream via the new `extraHeaders` param on `sendUpstreamRequest`.
Internal MCPs (embedded lobu-memory) respect it; external MCPs ignore
unknown headers.
3. **Race between `tool_use` SSE emit and `complete`.** The subscriber was
firing `onProgress(...).catch(...)` un-awaited, so a slow custom-event POST
could land after `complete` (and the promptfoo provider returns on
`complete`, losing the trace). worker.ts now tracks in-flight tool-use
promises in a Set and waits for them at `agent_end` before resolving the
turn, ensuring SSE ordering matches semantic ordering.
4. **Eval + README javascript assertions used the wrong shape.** promptfoo's
`context.vars` holds test vars, not provider metadata; metadata is on
`context.providerResponse.metadata`. And `contextTransform` is for context
assertions (`context-recall` etc.), not `javascript`. Both fixed:
- `examples/personal-finance/.../promptfooconfig.yaml` JS assertion now
reads `context.providerResponse.metadata.toolCalls`.
- README example shows the `context-recall` + `contextTransform` pattern
(correct shape) alongside a separate `javascript` example reading
`context.providerResponse?.metadata`.
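Items 1 and 3 above can be sketched together as a small subscriber. The event shapes, `createToolUseSubscriber`, and the `emit` and `onTurnDone` callbacks are hypothetical stand-ins for the pi-agent events, `onProgress`, and `resolveTurnDone`; the real `worker.ts` also flushes deltas, which is left out here.

```typescript
// Hypothetical event shapes; the real pi-agent events may carry more fields.
type AgentEvent =
  | { type: "tool_execution_start"; toolCallId: string; args: Record<string, unknown> }
  | { type: "tool_execution_end"; toolCallId: string; toolName: string; isError: boolean }
  | { type: "agent_end" };

interface ToolUsePayload {
  toolCallId: string;
  name: string;
  input: Record<string, unknown>;
  isError: boolean;
}

type Emit = (payload: ToolUsePayload) => Promise<void>;

function createToolUseSubscriber(emit: Emit, onTurnDone: () => void) {
  // Args only arrive on the start event, so remember them per call id.
  const argsByCallId = new Map<string, Record<string, unknown>>();
  // Sends that have been started but have not settled yet.
  const inFlight = new Set<Promise<void>>();

  return function handle(event: AgentEvent): void {
    if (event.type === "tool_execution_start") {
      argsByCallId.set(event.toolCallId, event.args);
      return;
    }
    if (event.type === "tool_execution_end") {
      const input = argsByCallId.get(event.toolCallId) ?? {};
      argsByCallId.delete(event.toolCallId);
      const send: Promise<void> = emit({
        toolCallId: event.toolCallId,
        name: event.toolName,
        input,
        isError: event.isError,
      })
        .catch((err: unknown) => {
          console.warn("tool_use emit failed:", err);
        })
        .finally(() => {
          inFlight.delete(send);
        });
      inFlight.add(send);
      return;
    }
    // agent_end: wait for every pending send before finishing the turn.
    void Promise.allSettled([...inFlight]).then(onTurnDone);
  };
}
```

Using `Promise.allSettled` means one failed send cannot block turn completion, and deleting each promise in `finally` keeps the set from growing across turns.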
Self-review via
Actionable comments posted: 1
📒 Files selected for processing (6)
- examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
- packages/agent-worker/src/__tests__/tool-use-events.test.ts
- packages/agent-worker/src/openclaw/worker.ts
- packages/agent-worker/src/shared/tool-implementations.ts
- packages/promptfoo-provider/README.md
- packages/server/src/gateway/auth/mcp/proxy.ts
✅ Files skipped from review due to trivial changes (1)
- packages/promptfoo-provider/README.md
  // Forward the caller's `x-mcp-format` opt-in so internal MCPs (the
  // embedded lobu-memory server) can return raw JSON instead of formatted
  // markdown. The worker uses this for retrieval tools to surface
  // structured `result_summary` (event ids + snippet text) through the
  // `tool_use` SSE event.
  const callerFormat = c.req.header("x-mcp-format");
  const extraHeaders = callerFormat
    ? { "x-mcp-format": callerFormat }
    : undefined;

  let response = await this.sendUpstreamRequest(
    httpServer,
    agentId,
    mcpId,
    "POST",
    jsonRpcBody,
    scopeKey,
-   auth.token
+   auth.token,
+   extraHeaders
  );
Preserve x-mcp-format on stale-session retry.
extraHeaders are forwarded on the first call but dropped on the 404 reinitialize/retry call, so retrieval tools can silently fall back to markdown on exactly the retry path.
🔧 Suggested fix
response = await this.sendUpstreamRequest(
httpServer,
agentId,
mcpId,
"POST",
jsonRpcBody,
scopeKey,
- auth.token
+ auth.token,
+ extraHeaders
);
Also applies to: 822-830
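The intent of the fix, passing the same extra headers on both attempts, can be illustrated with a small wrapper. `sendWithStaleSessionRetry`, `SendFn`, and the `reinitialize` callback are hypothetical names for this sketch; the real proxy calls `sendUpstreamRequest` with more arguments.

```typescript
type ExtraHeaders = Record<string, string>;
type SendFn = (extraHeaders?: ExtraHeaders) => Promise<{ status: number }>;

// Sends once; on a 404 (stale session) reinitializes and retries with the
// same extra headers so the caller's format preference is not lost.
async function sendWithStaleSessionRetry(
  send: SendFn,
  reinitialize: () => Promise<void>,
  extraHeaders?: ExtraHeaders
): Promise<{ status: number }> {
  let response = await send(extraHeaders);
  if (response.status === 404) {
    await reinitialize();
    response = await send(extraHeaders);
  }
  return response;
}
```

Routing both attempts through one function makes it hard to forward a header on the first call and forget it on the retry.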
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/gateway/auth/mcp/proxy.ts` around lines 743 - 762, The
retry path drops the forwarded x-mcp-format header because extraHeaders (derived
from callerFormat) is passed to the first sendUpstreamRequest call but not to
the stale-session reinitialize/retry call; update the retry logic so that the
same extraHeaders (or callerFormat) is passed into the subsequent
sendUpstreamRequest retry. Locate where callerFormat and extraHeaders are
created and ensure the retry branch that calls this.sendUpstreamRequest(...)
includes the extraHeaders argument (same shape as the first call) so
x-mcp-format is preserved on retry (also apply the same change to the similar
block around the 822–830 area).
Codecov Report
✅ All modified and coverable lines are covered by tests.
…onal-finance evals
@lobu/promptfoo-provider gains vars.transcript: string[] support: replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged.
Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files.
5 provider tests pass (mock-fetch over the gateway endpoints) covering single-turn baseline, multi-turn ordering + session reuse + single cleanup, whitespace filter, non-array fallback, empty-array fallback.
Rebased cleanly atop #918 (tool_use SSE): the agent-worker / provider files in main already include #918's additions; this commit is the strict multi-turn delta.
…onal-finance evals (#913)
@lobu/promptfoo-provider gains vars.transcript: string[] support: replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged.
Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files.
5 provider tests pass (mock-fetch over the gateway endpoints) covering single-turn baseline, multi-turn ordering + session reuse + single cleanup, whitespace filter, non-array fallback, empty-array fallback.
Rebased cleanly atop #918 (tool_use SSE): the agent-worker / provider files in main already include #918's additions; this commit is the strict multi-turn delta.
…onal-finance evals (#921)
@lobu/promptfoo-provider gains vars.transcript: string[] support: replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged.
Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files.
Strictly additive atop current main (which already includes #918's tool_use SSE events). Re-do of #913 after #920 reverted that PR: the original landing accidentally undid #914 and #916 because of a bad rebase-and-soft-reset.
Summary
- Add a `tool_use` SSE event type so external clients (the `@lobu/promptfoo-provider`, the Mac menubar, anything subscribed to `/lobu/api/v1/agents/<id>/events`) can observe tool calls as they happen. Fully additive: no rename, no removal of `output` / `complete` / `error`.
- `LobuProvider` consumes the new event, populates `metadata.toolCalls`, and (for `search_memory` / `lobu_search_memory`) joins returned snippet text into `metadata.retrievedContext` so promptfoo RAG assertions (`context-recall`, `context-faithfulness`, custom `javascript`) work end-to-end.
- `packages/promptfoo-provider/HANDOFF.md`.
Change shape
**Worker** (`packages/agent-worker/src/openclaw/`):
- `worker.ts` `session.subscribe` handles pi-agent's `tool_execution_end` and emits a `tool_use` custom event via the existing `onProgress` / `sendCustomEvent` path.
- `tool-use-events.ts` (new) builds the payload `{ toolCallId, name, input, isError, result_summary? }` mirroring Anthropic tool-use blocks. For retrieval tools, `result_summary` includes `event_ids` plus inline `snippets[]` (id + text) so clients can compute `retrievedContext` without a round-trip. Handles both raw `search_memory` body and the MCP `CallToolResult` wrapping.
**Gateway**:
- `unified-thread-consumer.ts` already broadcasts customEvents under `data.customEvent.name`, so the SSE event name lands as `tool_use` for both the conversation channel and the cli-session id. New test coverage in `unified-thread-consumer.test.ts`.
**Provider** (`@lobu/promptfoo-provider`):
- `collectResponse` accumulates calls into `metadata.toolCalls` and joins retrieval-tool snippet text into `metadata.retrievedContext` (de-duped by event id).
- `LobuToolCall` exported from `provider.ts`.
- `context-recall` / `contextTransform` pattern and `javascript`-assertion shape over `toolCalls`.
**Example**:
- `examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml` gains a retrieval test that asserts the agent called `search_memory` before answering, exercising the new metadata path end-to-end.
SSE payload
Test plan
- `make typecheck` (strict): clean.
- `bun test` for new test files (13 pass total):
  - `packages/agent-worker/src/__tests__/tool-use-events.test.ts` (6 tests covering payload shape, MCP wrapping, error path, both retrieval tool names)
  - `packages/promptfoo-provider/src/__tests__/provider.test.ts` (3 tests covering happy path, messageId filtering, non-retrieval tools)
  - `packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts` (new test for `tool_use` customEvent broadcast to both conversation + cli session)
- `make test-unit`: 910 pass, 4 pre-existing failures unrelated to this PR (connector-sdk `AutoCreateWhenRule` export issue in `apply-cmd-*` / `desired-state-*` tests; reproduces on a clean checkout).
Compatibility
Additive. Existing SSE clients (CLI eval, Mac menubar, etc.) ignore unknown event types. `output` / `complete` / `error` are untouched.