feat(gateway): tool_use SSE events for client-side trace inspection by buremba · Pull Request #918 · lobu-ai/lobu

buremba · 2026-05-19T14:29:42Z

Summary

New tool_use SSE event type so external clients (the @lobu/promptfoo-provider, the Mac menubar, anything subscribed to /lobu/api/v1/agents/<id>/events) can observe tool calls as they happen. Fully additive — no rename, no removal of output / complete / error.
LobuProvider consumes the new event, populates metadata.toolCalls, and (for search_memory / lobu_search_memory) joins returned snippet text into metadata.retrievedContext so promptfoo RAG assertions (context-recall, context-faithfulness, custom javascript) work end-to-end.
Unblocks should-fix item 4 in packages/promptfoo-provider/HANDOFF.md.

Change shape

Worker (packages/agent-worker/src/openclaw/):

worker.ts session.subscribe handles pi-agent's tool_execution_end and emits a tool_use custom event via the existing onProgress / sendCustomEvent path.
tool-use-events.ts (new) builds the payload { toolCallId, name, input, isError, result_summary? } mirroring Anthropic tool-use blocks. For retrieval tools, result_summary includes event_ids plus inline snippets[] (id + text) so clients can compute retrievedContext without a round-trip. Handles both raw search_memory body and the MCP CallToolResult wrapping.

Gateway:

No router change. unified-thread-consumer.ts already broadcasts customEvents under data.customEvent.name, so the SSE event name lands as tool_use for both the conversation channel and the cli-session id. New test coverage in unified-thread-consumer.test.ts.

Provider (@lobu/promptfoo-provider):

collectResponse accumulates calls into metadata.toolCalls and joins retrieval-tool snippet text into metadata.retrievedContext (de-duped by event id).
LobuToolCall exported from provider.ts.
README rewritten — drops the "Known limitations" disclaimer and documents the context-recall / contextTransform pattern and javascript-assertion shape over toolCalls.

Example:

examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml gains a retrieval test that asserts the agent called search_memory before answering, exercising the new metadata path end-to-end.

SSE payload

event: tool_use
data: {
  "toolCallId": "tc-abc",
  "name": "search_memory",
  "input": { "query": "rent due dates" },
  "isError": false,
  "result_summary": {
    "event_ids": [42, 43],
    "snippets": [
      { "id": 42, "text": "Tenant Acme pays £1,200 on the 1st" },
      { "id": 43, "text": "Lease ends Dec 2027" }
    ]
  },
  "messageId": "msg-1",
  "timestamp": 1737302400000
}

Test plan

make typecheck (strict) — clean.
bun test for new test files (13 pass total):
- packages/agent-worker/src/__tests__/tool-use-events.test.ts (6 tests covering payload shape, MCP wrapping, error path, both retrieval tool names)
- packages/promptfoo-provider/src/__tests__/provider.test.ts (3 tests covering happy path, messageId filtering, non-retrieval tools)
- packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts (new test for tool_use customEvent broadcast to both conversation + cli session)
make test-unit — 910 pass, 4 pre-existing failures unrelated to this PR (connector-sdk AutoCreateWhenRule export issue in apply-cmd-* / desired-state-* tests; reproduces on a clean checkout).
End-to-end run of the personal-finance retrieval eval against a live gateway — deferred; new assertion shape is unit-tested but the live signal will land in the next eval run.

Compatibility

Additive. Existing SSE clients (CLI eval, Mac menubar, etc.) ignore unknown event types. output / complete / error are untouched.

Summary by CodeRabbit

New Features
- Tool execution events are streamed end-to-end and exposed in provider response metadata.
- Retrieval tool results are summarized (event IDs + snippet text) and surfaced as retrieved context for RAG assertions.
- Promptfoo tests can assert that search/retrieval tools were invoked and that agent responses are grounded or explicitly state no data.
Documentation
- Provider docs updated to describe toolCalls, retrievedContext, and SSE behavior.
Tests
- New test suites covering tool-use payloads, SSE handling, provider parsing, and promptfoo evals.

Add a `tool_use` SSE event type so external clients (the @lobu/promptfoo-provider, the Mac menubar, anything subscribed to `/lobu/api/v1/agents/<id>/events`) can observe tool calls as they happen. Additive — no existing event renamed or removed. **Worker** (`packages/agent-worker/src/openclaw/`): - Subscribe to pi-agent's `tool_execution_end` and emit a `tool_use` custom event via the existing onProgress / sendCustomEvent path. - `tool-use-events.ts` builds the payload: `{ toolCallId, name, input, isError, result_summary? }` mirroring Anthropic tool-use blocks. - For `search_memory` / `lobu_search_memory`, `result_summary` includes the returned `event_ids` plus inline `snippets[]` (id + text) so clients can compute `retrievedContext` without a server round-trip. Handles both raw search_memory body and MCP CallToolResult wrapping. **Gateway**: - No router change needed — `unified-thread-consumer.ts` already broadcasts customEvents under `data.customEvent.name`, so the SSE event name lands as `tool_use` for both the conversation channel and the cli-session id. - Test coverage added for the customEvent → broadcast path. **Provider** (`@lobu/promptfoo-provider`): - `collectResponse` handles the new event: accumulates calls into `metadata.toolCalls`, joins retrieval-tool snippet text into `metadata.retrievedContext`. - README rewritten — drop the "Known limitations" disclaimer; document the RAG-assertion pattern (`context-recall` + `contextTransform`) and the `javascript`-assertion pattern over `toolCalls`. **Example**: - `examples/personal-finance/.../promptfooconfig.yaml` gains a retrieval test that asserts the agent called `search_memory` before answering, exercising the new metadata path end-to-end. **Validation**: - `make typecheck` (strict) clean. - `bun test` for the three new test files: 13 pass. - Pre-existing `apply-cmd-*` / `desired-state-*` failures in `make test-unit` are unrelated (connector-sdk dist shape) and present on a clean main.

coderabbitai · 2026-05-19T14:30:01Z

📝 Walkthrough

Walkthrough

Worker emits structured tool_use custom events (built with buildToolUseEventPayload) on tool execution end; the gateway broadcasts them over SSE; the provider normalizes tool_use payloads, accumulates toolCalls and retrieval snippets, and exposes them as metadata.toolCalls and metadata.retrievedContext for promptfoo RAG assertions.

Changes

Tool-use event tracing across worker, gateway, and provider

Layer / File(s)	Summary
Worker: payload types and builder `packages/agent-worker/src/openclaw/tool-use-events.ts`, `packages/agent-worker/src/__tests__/tool-use-events.test.ts`	Defines `ToolUseEventPayload`/`ToolUseResultSummary`, implements `buildToolUseEventPayload` plus helpers (`summarizeSearchMemoryResult`, `extractSearchMemoryBody`, `extractErrorMessage`), and adds tests for payload shaping, retrieval snippet extraction, MCP unwrapping, empty-results handling, and error propagation.
Worker: capture args, emit tool_use events `packages/agent-worker/src/openclaw/worker.ts`	Captures tool-call `args` on `tool_execution_start`, builds payloads on `tool_execution_end`, emits `custom_event` `"tool_use"` via `onProgress` (with timestamp), logs emission failures, and waits for in-flight emits before resolving `agent_end`.
MCP JSON request opt-in & proxy header forwarding `packages/agent-worker/src/shared/tool-implementations.ts`, `packages/server/src/gateway/auth/mcp/proxy.ts`	Adds `TOOLS_REQUESTING_JSON_FORMAT` allowlist and sends `x-mcp-format: json` for matching MCP tool calls; proxy forwards incoming `x-mcp-format` upstream via `sendUpstreamRequest` extraHeaders merge.
Gateway: tool_use broadcast test `packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts`	Test verifies `UnifiedThreadResponseConsumer` broadcasts `tool_use` custom events to both conversation and CLI session targets with payload containing `toolCallId`, `name`, `result_summary.event_ids`, `messageId`, and `timestamp`.
Provider: types, SSE collection, normalize and finalize `packages/promptfoo-provider/src/provider.ts`	Adds exported `LobuToolCall` and enriches `LobuProviderResponse.metadata` with `toolCalls?: LobuToolCall[]` and `retrievedContext?: string`; extends `collectResponse` to normalize `tool_use` SSE events, accumulate/dedupe retrieval snippets, introduce `finalize()`, and ensure toolCalls/retrievedContext are returned.
Provider tests for SSE handling `packages/promptfoo-provider/src/__tests__/provider.test.ts`	Stubs `fetch`/SSE to validate `callApi` handles `tool_use` events: populates `metadata.toolCalls`, derives `metadata.retrievedContext` from snippets, extracts `traceId`, aggregates token usage, ignores foreign `messageId` events, and records non-retrieval tool calls.
Docs and eval config `packages/promptfoo-provider/HANDOFF.md`, `packages/promptfoo-provider/README.md`, `examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml`	HANDOFF/README updated to document `tool_use` SSE behavior and metadata fields; personal-finance promptfoo config adds a retrieval-focused eval asserting `search_memory` (or `lobu_search_memory`) was invoked and responses are grounded.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

lobu-ai/lobu#911: Main PR implements gateway/worker tool_use SSE payload handling and retrieval-context extraction, building on prior provider groundwork.

Suggested labels

skip-size-check

Poem

A rabbit hops through SSE streams so bright, 🐰
Tool calls traced from worker's height to provider's sight,
Snippets danced, metadata sang clear,
RAG checks now find what once was mere,
Hooray — the trace trails bring insight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'feat(gateway): tool_use SSE events for client-side trace inspection' accurately summarizes the main change: adding a new SSE event type for tool tracing.
Description check	✅ Passed	The PR description is comprehensive and well-structured, covering Summary, Change shape, SSE payload, Test plan, and Compatibility sections.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/tool-use-sse

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

packages/agent-worker/src/openclaw/worker.ts (1)

1466-1490: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for pending tool_use sends before resolving the turn.

At Lines 1468-1480, onProgress(...) is fire-and-forget. The turn can resolve on agent_end before these sends settle, so tool_use events may arrive late or be dropped when completion closes the stream.

Proposed fix

       // Wire events through progress processor with delta batching
       let pendingDelta = "";
+      const pendingCustomEventSends = new Set<Promise<void>>();
       const DELTA_BATCH_INTERVAL_MS = 150;
@@
         if (event.type === "tool_execution_end") {
           const payload = buildToolUseEventPayload(event);
-          onProgress({
+          const sendPromise = onProgress({
             type: "custom_event",
             data: {
               name: "tool_use",
               payload: payload as unknown as Record<string, unknown>,
             },
             timestamp: Date.now(),
-          }).catch((err) => {
-            logger.warn(
-              `Failed to emit tool_use custom event for ${event.toolName}:`,
-              err
-            );
-          });
+          });
+          pendingCustomEventSends.add(sendPromise);
+          sendPromise
+            .catch((err) => {
+              logger.warn(
+                `Failed to emit tool_use custom event for ${event.toolName}:`,
+                err
+              );
+            })
+            .finally(() => {
+              pendingCustomEventSends.delete(sendPromise);
+            });
         }
 
         if (event.type === "agent_end") {
-          flushDelta()
+          Promise.allSettled([...pendingCustomEventSends])
+            .then(() => flushDelta())
             .then(() => resolveTurnDone?.())
             .catch((err) => {
               logger.error("Failed to flush final delta:", err);
               resolveTurnDone?.();
             });
         }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/agent-worker/src/openclaw/worker.ts` around lines 1466 - 1490, The
tool_use onProgress call is fire-and-forget and can still be pending when
agent_end resolves; change the logic to track and await pending onProgress
promises for tool_use before calling resolveTurnDone. Specifically, when
event.type === "tool_execution_end" (and using buildToolUseEventPayload and
onProgress), push the returned promise (with a catch that logs via logger.warn)
into a local pending array; then when handling event.type === "agent_end" (and
calling flushDelta and resolveTurnDone), await Promise.all on that pending array
(handling any rejections with logging) before invoking resolveTurnDone so all
tool_use sends settle first. Ensure the pending array is cleared after awaiting
to avoid memory leaks.

packages/promptfoo-provider/src/provider.ts (1)

153-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve tool-call metadata on error responses.

On Line 153, the error return omits toolCalls/retrievedContext, even though collectResponse() may already have captured tool_use events. This drops trace data exactly on failure paths.

Suggested fix

       if (response.error) {
         return {
           output: response.text,
           error: response.error,
           metadata: {
             agent: this.agent,
             thread,
             traceId: response.traceId,
+            ...(response.toolCalls && response.toolCalls.length > 0
+              ? { toolCalls: response.toolCalls }
+              : {}),
+            ...(response.retrievedContext
+              ? { retrievedContext: response.retrievedContext }
+              : {}),
           },
         };
       }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/src/provider.ts` around lines 153 - 162, When
handling an error from response (the branch checking response.error), preserve
any tool-call metadata collected by collectResponse() by including toolCalls and
retrievedContext in the returned metadata object; update the error return to
include metadata: { agent: this.agent, thread, traceId: response.traceId,
toolCalls: response.toolCalls || [], retrievedContext: response.retrievedContext
|| null } (or pull these values from whatever local variables collectResponse()
populates) so tool_use events aren't dropped on failure paths.

🧹 Nitpick comments (1)

packages/promptfoo-provider/src/__tests__/provider.test.ts (1)

61-161: ⚡ Quick win

Add a regression test for tool_use + error finalization.

Current cases only assert success completion. Add one case where a tool_use event is emitted before an error event, and assert metadata still includes captured toolCalls (and retrieval context when present).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/src/__tests__/provider.test.ts` around lines 61 -
161, Add a regression test that emits a tool_use SSE event followed by an
error/complete error finalization and asserts captured toolCalls (and
retrievedContext for search_memory) are still present: create a new test in
provider.test.ts that uses sseEvent to emit a tool_use (e.g. toolCallId
"tc-err", name "search_memory" with result_summary/snippets) then an
sseEvent("error", { messageId: "msg-1", error: ... }) (or an error finalization
your SSE stub uses), stub fetch with createFetchStub, call LobuProvider.callApi,
and assert result.metadata.toolCalls contains the tool call and
result_summary.event_ids and that result.metadata.retrievedContext equals the
concatenated snippets when applicable; mirror existing tests’ structure (use
LobuProvider, callApi, expect on metadata.toolCalls and retrievedContext) to
ensure behavior on error finalization is covered.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/agent-worker/src/openclaw/tool-use-events.ts`:
- Around line 123-129: The current handler in openclaw/tool-use-events.ts
returns null inside the catch when JSON.parse(text) fails, which aborts scanning
and can miss later valid JSON content (dropping result_summary); change the
logic in the block that checks if (type === "text" && typeof text === "string")
so that on JSON.parse failure you do not return null but instead skip this text
entry and continue scanning subsequent parts (e.g., use continue or otherwise
ignore this entry), allowing later valid JSON content to be parsed and
result_summary to be set; ensure any eventual fallback still returns null only
after all parts are examined.

In `@packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts`:
- Around line 111-119: The test's assertion for conversationBroadcast (variable
conversationBroadcast) only checks a subset of the payload; update the assertion
that checks the broadcasted message for toolCallId "tc-1" and name
"search_memory" to also assert result_summary.snippets, input, and isError are
present and match the constructed payload (alongside existing event_ids,
messageId "m1" and timestamp 1000). Modify the
expect(conversationBroadcast?.[2]) check in unified-thread-consumer.test.ts to
include result_summary.snippets, input, and isError (either by expanding the
toMatchObject expected object or adding explicit expects) so the test validates
the full broadcast payload.

---

Outside diff comments:
In `@packages/agent-worker/src/openclaw/worker.ts`:
- Around line 1466-1490: The tool_use onProgress call is fire-and-forget and can
still be pending when agent_end resolves; change the logic to track and await
pending onProgress promises for tool_use before calling resolveTurnDone.
Specifically, when event.type === "tool_execution_end" (and using
buildToolUseEventPayload and onProgress), push the returned promise (with a
catch that logs via logger.warn) into a local pending array; then when handling
event.type === "agent_end" (and calling flushDelta and resolveTurnDone), await
Promise.all on that pending array (handling any rejections with logging) before
invoking resolveTurnDone so all tool_use sends settle first. Ensure the pending
array is cleared after awaiting to avoid memory leaks.

In `@packages/promptfoo-provider/src/provider.ts`:
- Around line 153-162: When handling an error from response (the branch checking
response.error), preserve any tool-call metadata collected by collectResponse()
by including toolCalls and retrievedContext in the returned metadata object;
update the error return to include metadata: { agent: this.agent, thread,
traceId: response.traceId, toolCalls: response.toolCalls || [],
retrievedContext: response.retrievedContext || null } (or pull these values from
whatever local variables collectResponse() populates) so tool_use events aren't
dropped on failure paths.

---

Nitpick comments:
In `@packages/promptfoo-provider/src/__tests__/provider.test.ts`:
- Around line 61-161: Add a regression test that emits a tool_use SSE event
followed by an error/complete error finalization and asserts captured toolCalls
(and retrievedContext for search_memory) are still present: create a new test in
provider.test.ts that uses sseEvent to emit a tool_use (e.g. toolCallId
"tc-err", name "search_memory" with result_summary/snippets) then an
sseEvent("error", { messageId: "msg-1", error: ... }) (or an error finalization
your SSE stub uses), stub fetch with createFetchStub, call LobuProvider.callApi,
and assert result.metadata.toolCalls contains the tool call and
result_summary.event_ids and that result.metadata.retrievedContext equals the
concatenated snippets when applicable; mirror existing tests’ structure (use
LobuProvider, callApi, expect on metadata.toolCalls and retrievedContext) to
ensure behavior on error finalization is covered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6681e6e2-83d2-4a4b-8e8a-b32a8198feeb

📥 Commits

Reviewing files that changed from the base of the PR and between f8f087b and 0dba284.

📒 Files selected for processing (9)

examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
packages/agent-worker/src/__tests__/tool-use-events.test.ts
packages/agent-worker/src/openclaw/tool-use-events.ts
packages/agent-worker/src/openclaw/worker.ts
packages/promptfoo-provider/HANDOFF.md
packages/promptfoo-provider/README.md
packages/promptfoo-provider/src/__tests__/provider.test.ts
packages/promptfoo-provider/src/provider.ts
packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts

coderabbitai · 2026-05-19T14:37:10Z

+      if (type === "text" && typeof text === "string") {
+        try {
+          return JSON.parse(text);
+        } catch {
+          // Plain text result — nothing to summarise.
+          return null;
+        }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t stop scanning MCP content after the first unparsable text block.

At Lines 126-128, returning null inside catch exits early and can miss a later valid JSON content part, which drops result_summary unexpectedly.

Proposed fix

function extractSearchMemoryBody(raw: unknown): unknown { if (!raw || typeof raw !== "object") return null; @@ const mcpContent = (raw as { content?: unknown }).content; if (Array.isArray(mcpContent)) { for (const part of mcpContent) { if (!part || typeof part !== "object") continue; const type = (part as { type?: unknown }).type; const text = (part as { text?: unknown }).text; if (type === "text" && typeof text === "string") { try { return JSON.parse(text); } catch { - // Plain text result — nothing to summarise. - return null; + // Not JSON; keep scanning other parts. + continue; } } } + return null; } // Already the search_memory body. return raw; }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/agent-worker/src/openclaw/tool-use-events.ts` around lines 123 - 129, The current handler in openclaw/tool-use-events.ts returns null inside the catch when JSON.parse(text) fails, which aborts scanning and can miss later valid JSON content (dropping result_summary); change the logic in the block that checks if (type === "text" && typeof text === "string") so that on JSON.parse failure you do not return null but instead skip this text entry and continue scanning subsequent parts (e.g., use continue or otherwise ignore this entry), allowing later valid JSON content to be parsed and result_summary to be set; ensure any eventual fallback still returns null only after all parts are examined.

coderabbitai · 2026-05-19T14:37:10Z

+    expect(conversationBroadcast?.[2]).toMatchObject({
+      toolCallId: "tc-1",
+      name: "search_memory",
+      result_summary: {
+        event_ids: [42],
+      },
+      messageId: "m1",
+      timestamp: 1000,
+    });


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Verify that result_summary.snippets, input, and isError are included in the broadcast.

The test constructs a payload containing input, isError, and result_summary.snippets but only verifies a subset of these fields in the broadcast assertion. Since the PR objectives emphasize that retrieval tools include "event_ids and inline snippets," the test should confirm that snippets are actually broadcast to prevent regressions where critical fields might be accidentally dropped.

🧪 Expand assertions to cover the complete payload

expect(conversationBroadcast?.[2]).toMatchObject({ toolCallId: "tc-1", name: "search_memory", + input: { query: "rent" }, + isError: false, result_summary: { event_ids: [42], + snippets: [{ id: 42, text: "Rent is due 1st" }], }, messageId: "m1", timestamp: 1000, });

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

expect(conversationBroadcast?.[2]).toMatchObject({

toolCallId: "tc-1",

name: "search_memory",

result_summary: {

event_ids: [42],

},

messageId: "m1",

timestamp: 1000,

});

expect(conversationBroadcast?.[2]).toMatchObject({

toolCallId: "tc-1",

name: "search_memory",

input: { query: "rent" },

isError: false,

result_summary: {

event_ids: [42],

snippets: [{ id: 42, text: "Rent is due 1st" }],

},

messageId: "m1",

timestamp: 1000,

});

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts` around lines 111 - 119, The test's assertion for conversationBroadcast (variable conversationBroadcast) only checks a subset of the payload; update the assertion that checks the broadcasted message for toolCallId "tc-1" and name "search_memory" to also assert result_summary.snippets, input, and isError are present and match the constructed payload (alongside existing event_ids, messageId "m1" and timestamp 1000). Modify the expect(conversationBroadcast?.[2]) check in unified-thread-consumer.test.ts to include result_summary.snippets, input, and isError (either by expanding the toMatchObject expected object or adding explicit expects) so the test validates the full broadcast payload.

Address four real concerns raised by `pi -p` on the previous commit: 1. **tool_use payload was losing `input` args.** pi-agent's `tool_execution_end` event omits `args` (those only fire on the matching `tool_execution_start`). worker.ts now tracks a `Map<toolCallId, args>` from the start event and merges them into the payload at end. Test comment updated to call this out so future readers don't regress it. 2. **`result_summary` would never populate for live `search_memory` calls.** The MCP proxy returns formatted markdown to workers by default; my extractor expected JSON-stringified bodies. Two changes: - `callMcpTool` now sends `x-mcp-format: json` for known retrieval tools (`search_memory`, `lobu_search_memory`) only — keeps existing markdown behaviour for every other tool the agent sees. - MCP proxy's `handleCallTool` forwards the caller's `x-mcp-format` header to upstream via the new `extraHeaders` param on `sendUpstreamRequest`. Internal MCPs (embedded lobu-memory) respect it; external MCPs ignore unknown headers. 3. **Race between `tool_use` SSE emit and `complete`.** The subscriber was firing `onProgress(...).catch(...)` un-awaited, so a slow custom-event POST could land after `complete` (and the promptfoo provider returns on `complete`, losing the trace). worker.ts now tracks in-flight tool-use promises in a Set and waits for them at `agent_end` before resolving the turn, ensuring SSE ordering matches semantic ordering. 4. **Eval + README javascript assertions used the wrong shape.** promptfoo's `context.vars` holds test vars, not provider metadata; metadata is on `context.providerResponse.metadata`. And `contextTransform` is for context assertions (`context-recall` etc.), not `javascript`. Both fixed: - `examples/personal-finance/.../promptfooconfig.yaml` JS assertion now reads `context.providerResponse.metadata.toolCalls`. - README example shows the `context-recall` + `contextTransform` pattern (correct shape) alongside a separate `javascript` example reading `context.providerResponse?.metadata`.

buremba · 2026-05-19T14:42:34Z

Self-review via `pi -p` — findings applied in `80d9abb`

pi -p flagged four real concerns on the initial cut:

tool_use.input lost args — pi-agent's tool_execution_end event omits args (start-only). Fixed by tracking Map<toolCallId, args> from tool_execution_start and merging at end.
result_summary wouldn't populate live — MCP proxy returns markdown by default; my extractor expected JSON. Fixed by having callMcpTool send x-mcp-format: json only for retrieval tools (search_memory / lobu_search_memory) and forwarding the header through handleCallTool → sendUpstreamRequest to internal MCPs. External MCPs / non-retrieval tools keep existing markdown behaviour.
Race between tool_use SSE emit and complete — emits were .catch(...) un-awaited, so a slow POST could land after complete (provider returns on complete, losing the trace). Fixed: in-flight tool_use promises are tracked in a Set and awaited at agent_end before resolving the turn.
Eval/README JS assertion used wrong shape — context.vars.metadata doesn't exist (vars = test vars); metadata is on context.providerResponse.metadata. And contextTransform is for context assertions, not javascript. Both fixed.

All four already had local repros via the new unit tests (the worker test still validates the merged shape; the provider test stays green because the SSE event payload is the same). make typecheck + bun test for the new files still clean (13/13).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/auth/mcp/proxy.ts`:
- Around line 743-762: The retry path drops the forwarded x-mcp-format header
because extraHeaders (derived from callerFormat) is passed to the first
sendUpstreamRequest call but not to the stale-session reinitialize/retry call;
update the retry logic so that the same extraHeaders (or callerFormat) is passed
into the subsequent sendUpstreamRequest retry. Locate where callerFormat and
extraHeaders are created and ensure the retry branch that calls
this.sendUpstreamRequest(...) includes the extraHeaders argument (same shape as
the first call) so x-mcp-format is preserved on retry (also apply the same
change to the similar block around the 822–830 area).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2783b9da-528d-4bf5-bc34-0aa2e883a705

📥 Commits

Reviewing files that changed from the base of the PR and between 0dba284 and 80d9abb.

📒 Files selected for processing (6)

examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
packages/agent-worker/src/__tests__/tool-use-events.test.ts
packages/agent-worker/src/openclaw/worker.ts
packages/agent-worker/src/shared/tool-implementations.ts
packages/promptfoo-provider/README.md
packages/server/src/gateway/auth/mcp/proxy.ts

✅ Files skipped from review due to trivial changes (1)

packages/promptfoo-provider/README.md

coderabbitai · 2026-05-19T14:46:15Z

+      // Forward the caller's `x-mcp-format` opt-in so internal MCPs (the
+      // embedded lobu-memory server) can return raw JSON instead of formatted
+      // markdown. The worker uses this for retrieval tools to surface
+      // structured `result_summary` (event ids + snippet text) through the
+      // `tool_use` SSE event.
+      const callerFormat = c.req.header("x-mcp-format");
+      const extraHeaders = callerFormat
+        ? { "x-mcp-format": callerFormat }
+        : undefined;
+
      let response = await this.sendUpstreamRequest(
        httpServer,
        agentId,
        mcpId,
        "POST",
        jsonRpcBody,
        scopeKey,
-        auth.token
+        auth.token,
+        extraHeaders
      );


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve x-mcp-format on stale-session retry.

extraHeaders are forwarded on the first call but dropped on the 404 reinitialize/retry call, so retrieval tools can silently fall back to markdown on exactly the retry path.

🔧 Suggested fix

response = await this.sendUpstreamRequest( httpServer, agentId, mcpId, "POST", jsonRpcBody, scopeKey, - auth.token + auth.token, + extraHeaders );

Also applies to: 822-830

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/auth/mcp/proxy.ts` around lines 743 - 762, The retry path drops the forwarded x-mcp-format header because extraHeaders (derived from callerFormat) is passed to the first sendUpstreamRequest call but not to the stale-session reinitialize/retry call; update the retry logic so that the same extraHeaders (or callerFormat) is passed into the subsequent sendUpstreamRequest retry. Locate where callerFormat and extraHeaders are created and ensure the retry branch that calls this.sendUpstreamRequest(...) includes the extraHeaders argument (same shape as the first call) so x-mcp-format is preserved on retry (also apply the same change to the similar block around the 822–830 area).

codecov-commenter · 2026-05-19T14:46:20Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

…onal-finance evals @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. 5 provider tests pass (mock-fetch over the gateway endpoints) covering single-turn baseline, multi-turn ordering + session reuse + single cleanup, whitespace filter, non-array fallback, empty-array fallback. Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider files in main already include #918's additions; this commit is the strict multi-turn delta.

…onal-finance evals (#913) @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. 5 provider tests pass (mock-fetch over the gateway endpoints) covering single-turn baseline, multi-turn ordering + session reuse + single cleanup, whitespace filter, non-array fallback, empty-array fallback. Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider files in main already include #918's additions; this commit is the strict multi-turn delta.

…onal-finance evals (#921) @lobu/promptfoo-provider gains vars.transcript: string[] support — replays sequential turns in one Lobu thread, returns the final assistant response for assertion. Single-turn callers via plain prompt are unchanged. Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing, sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml using vars.transcript. Deletes the original YAML files. Strictly additive atop current main (which already includes #918's tool_use SSE events). Re-do of #913 after #920 reverted that PR — the original landing accidentally undid #914 and #916 because of a bad rebase-and-soft-reset.

buremba added 2 commits May 19, 2026 15:29

chore: biome formatting

0dba284

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

buremba mentioned this pull request May 19, 2026

feat(promptfoo-provider): support vars.transcript for multi-turn evals #913

Merged

5 tasks

buremba merged commit dcb5b1d into main May 19, 2026
26 checks passed

buremba deleted the feat/tool-use-sse branch May 19, 2026 14:57

buremba mentioned this pull request May 19, 2026

chore(main): release lobu 8.0.0 #897

Merged

This was referenced May 19, 2026

fix: revert #913 to restore #914 + #916 changes lost in bad squash #920

Merged

feat(promptfoo-provider): vars.transcript multi-turn + migrate 4 personal-finance evals (redo) #921

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(gateway): tool_use SSE events for client-side trace inspection#918

feat(gateway): tool_use SSE events for client-side trace inspection#918
buremba merged 3 commits into
mainfrom
feat/tool-use-sse

buremba commented May 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

buremba commented May 19, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 19, 2026

Uh oh!

codecov-commenter commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

buremba commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change shape

SSE payload

Test plan

Compatibility

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

buremba commented May 19, 2026

Self-review via pi -p — findings applied in 80d9abb

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 19, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Self-review via `pi -p` — findings applied in `80d9abb`