Skip to content

feat(gateway): tool_use SSE events for client-side trace inspection#918

Merged
buremba merged 3 commits into
mainfrom
feat/tool-use-sse
May 19, 2026
Merged

feat(gateway): tool_use SSE events for client-side trace inspection#918
buremba merged 3 commits into
mainfrom
feat/tool-use-sse

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 19, 2026

Summary

  • New tool_use SSE event type so external clients (the @lobu/promptfoo-provider, the Mac menubar, anything subscribed to /lobu/api/v1/agents/<id>/events) can observe tool calls as they happen. Fully additive — no rename, no removal of output / complete / error.
  • LobuProvider consumes the new event, populates metadata.toolCalls, and (for search_memory / lobu_search_memory) joins returned snippet text into metadata.retrievedContext so promptfoo RAG assertions (context-recall, context-faithfulness, custom javascript) work end-to-end.
  • Unblocks should-fix item 4 in packages/promptfoo-provider/HANDOFF.md.

Change shape

Worker (packages/agent-worker/src/openclaw/):

  • worker.ts session.subscribe handles pi-agent's tool_execution_end and emits a tool_use custom event via the existing onProgress / sendCustomEvent path.
  • tool-use-events.ts (new) builds the payload { toolCallId, name, input, isError, result_summary? } mirroring Anthropic tool-use blocks. For retrieval tools, result_summary includes event_ids plus inline snippets[] (id + text) so clients can compute retrievedContext without a round-trip. Handles both raw search_memory body and the MCP CallToolResult wrapping.

Gateway:

  • No router change. unified-thread-consumer.ts already broadcasts customEvents under data.customEvent.name, so the SSE event name lands as tool_use for both the conversation channel and the cli-session id. New test coverage in unified-thread-consumer.test.ts.

Provider (@lobu/promptfoo-provider):

  • collectResponse accumulates calls into metadata.toolCalls and joins retrieval-tool snippet text into metadata.retrievedContext (de-duped by event id).
  • LobuToolCall exported from provider.ts.
  • README rewritten — drops the "Known limitations" disclaimer and documents the context-recall / contextTransform pattern and javascript-assertion shape over toolCalls.

Example:

  • examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml gains a retrieval test that asserts the agent called search_memory before answering, exercising the new metadata path end-to-end.

SSE payload

event: tool_use
data: {
  "toolCallId": "tc-abc",
  "name": "search_memory",
  "input": { "query": "rent due dates" },
  "isError": false,
  "result_summary": {
    "event_ids": [42, 43],
    "snippets": [
      { "id": 42, "text": "Tenant Acme pays £1,200 on the 1st" },
      { "id": 43, "text": "Lease ends Dec 2027" }
    ]
  },
  "messageId": "msg-1",
  "timestamp": 1737302400000
}

Test plan

  • make typecheck (strict) — clean.
  • bun test for new test files (13 pass total):
    • packages/agent-worker/src/__tests__/tool-use-events.test.ts (6 tests covering payload shape, MCP wrapping, error path, both retrieval tool names)
    • packages/promptfoo-provider/src/__tests__/provider.test.ts (3 tests covering happy path, messageId filtering, non-retrieval tools)
    • packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts (new test for tool_use customEvent broadcast to both conversation + cli session)
  • make test-unit — 910 pass, 4 pre-existing failures unrelated to this PR (connector-sdk AutoCreateWhenRule export issue in apply-cmd-* / desired-state-* tests; reproduces on a clean checkout).
  • End-to-end run of the personal-finance retrieval eval against a live gateway — deferred; new assertion shape is unit-tested but the live signal will land in the next eval run.

Compatibility

Additive. Existing SSE clients (CLI eval, Mac menubar, etc.) ignore unknown event types. output / complete / error are untouched.

Summary by CodeRabbit

  • New Features

    • Tool execution events are streamed end-to-end and exposed in provider response metadata.
    • Retrieval tool results are summarized (event IDs + snippet text) and surfaced as retrieved context for RAG assertions.
    • Promptfoo tests can assert that search/retrieval tools were invoked and that agent responses are grounded or explicitly state no data.
  • Documentation

    • Provider docs updated to describe toolCalls, retrievedContext, and SSE behavior.
  • Tests

    • New test suites covering tool-use payloads, SSE handling, provider parsing, and promptfoo evals.

Review Change Stack

buremba added 2 commits May 19, 2026 15:29
Add a `tool_use` SSE event type so external clients (the @lobu/promptfoo-provider,
the Mac menubar, anything subscribed to `/lobu/api/v1/agents/<id>/events`) can
observe tool calls as they happen. Additive — no existing event renamed or
removed.

**Worker** (`packages/agent-worker/src/openclaw/`):
- Subscribe to pi-agent's `tool_execution_end` and emit a `tool_use` custom
  event via the existing onProgress / sendCustomEvent path.
- `tool-use-events.ts` builds the payload: `{ toolCallId, name, input, isError,
  result_summary? }` mirroring Anthropic tool-use blocks.
- For `search_memory` / `lobu_search_memory`, `result_summary` includes the
  returned `event_ids` plus inline `snippets[]` (id + text) so clients can
  compute `retrievedContext` without a server round-trip. Handles both raw
  search_memory body and MCP CallToolResult wrapping.

**Gateway**:
- No router change needed — `unified-thread-consumer.ts` already broadcasts
  customEvents under `data.customEvent.name`, so the SSE event name lands as
  `tool_use` for both the conversation channel and the cli-session id.
- Test coverage added for the customEvent → broadcast path.

**Provider** (`@lobu/promptfoo-provider`):
- `collectResponse` handles the new event: accumulates calls into
  `metadata.toolCalls`, joins retrieval-tool snippet text into
  `metadata.retrievedContext`.
- README rewritten — drop the "Known limitations" disclaimer; document the
  RAG-assertion pattern (`context-recall` + `contextTransform`) and the
  `javascript`-assertion pattern over `toolCalls`.

**Example**:
- `examples/personal-finance/.../promptfooconfig.yaml` gains a retrieval test
  that asserts the agent called `search_memory` before answering, exercising
  the new metadata path end-to-end.

**Validation**:
- `make typecheck` (strict) clean.
- `bun test` for the three new test files: 13 pass.
- Pre-existing `apply-cmd-*` / `desired-state-*` failures in `make test-unit`
  are unrelated (connector-sdk dist shape) and present on a clean main.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

📝 Walkthrough

Walkthrough

Worker emits structured tool_use custom events (built with buildToolUseEventPayload) on tool execution end; the gateway broadcasts them over SSE; the provider normalizes tool_use payloads, accumulates toolCalls and retrieval snippets, and exposes them as metadata.toolCalls and metadata.retrievedContext for promptfoo RAG assertions.

Changes

Tool-use event tracing across worker, gateway, and provider

Layer / File(s) Summary
Worker: payload types and builder
packages/agent-worker/src/openclaw/tool-use-events.ts, packages/agent-worker/src/__tests__/tool-use-events.test.ts
Defines ToolUseEventPayload/ToolUseResultSummary, implements buildToolUseEventPayload plus helpers (summarizeSearchMemoryResult, extractSearchMemoryBody, extractErrorMessage), and adds tests for payload shaping, retrieval snippet extraction, MCP unwrapping, empty-results handling, and error propagation.
Worker: capture args, emit tool_use events
packages/agent-worker/src/openclaw/worker.ts
Captures tool-call args on tool_execution_start, builds payloads on tool_execution_end, emits custom_event "tool_use" via onProgress (with timestamp), logs emission failures, and waits for in-flight emits before resolving agent_end.
MCP JSON request opt-in & proxy header forwarding
packages/agent-worker/src/shared/tool-implementations.ts, packages/server/src/gateway/auth/mcp/proxy.ts
Adds TOOLS_REQUESTING_JSON_FORMAT allowlist and sends x-mcp-format: json for matching MCP tool calls; proxy forwards incoming x-mcp-format upstream via sendUpstreamRequest extraHeaders merge.
Gateway: tool_use broadcast test
packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts
Test verifies UnifiedThreadResponseConsumer broadcasts tool_use custom events to both conversation and CLI session targets with payload containing toolCallId, name, result_summary.event_ids, messageId, and timestamp.
Provider: types, SSE collection, normalize and finalize
packages/promptfoo-provider/src/provider.ts
Adds exported LobuToolCall and enriches LobuProviderResponse.metadata with toolCalls?: LobuToolCall[] and retrievedContext?: string; extends collectResponse to normalize tool_use SSE events, accumulate/dedupe retrieval snippets, introduce finalize(), and ensure toolCalls/retrievedContext are returned.
Provider tests for SSE handling
packages/promptfoo-provider/src/__tests__/provider.test.ts
Stubs fetch/SSE to validate callApi handles tool_use events: populates metadata.toolCalls, derives metadata.retrievedContext from snippets, extracts traceId, aggregates token usage, ignores foreign messageId events, and records non-retrieval tool calls.
Docs and eval config
packages/promptfoo-provider/HANDOFF.md, packages/promptfoo-provider/README.md, examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
HANDOFF/README updated to document tool_use SSE behavior and metadata fields; personal-finance promptfoo config adds a retrieval-focused eval asserting search_memory (or lobu_search_memory) was invoked and responses are grounded.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lobu-ai/lobu#911: Main PR implements gateway/worker tool_use SSE payload handling and retrieval-context extraction, building on prior provider groundwork.

Suggested labels

skip-size-check

Poem

A rabbit hops through SSE streams so bright, 🐰
Tool calls traced from worker's height to provider's sight,
Snippets danced, metadata sang clear,
RAG checks now find what once was mere,
Hooray — the trace trails bring insight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(gateway): tool_use SSE events for client-side trace inspection' accurately summarizes the main change: adding a new SSE event type for tool tracing.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering Summary, Change shape, SSE payload, Test plan, and Compatibility sections.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/tool-use-sse

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/agent-worker/src/openclaw/worker.ts (1)

1466-1490: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for pending tool_use sends before resolving the turn.

At Lines 1468-1480, onProgress(...) is fire-and-forget. The turn can resolve on agent_end before these sends settle, so tool_use events may arrive late or be dropped when completion closes the stream.

Proposed fix
       // Wire events through progress processor with delta batching
       let pendingDelta = "";
+      const pendingCustomEventSends = new Set<Promise<void>>();
       const DELTA_BATCH_INTERVAL_MS = 150;
@@
         if (event.type === "tool_execution_end") {
           const payload = buildToolUseEventPayload(event);
-          onProgress({
+          const sendPromise = onProgress({
             type: "custom_event",
             data: {
               name: "tool_use",
               payload: payload as unknown as Record<string, unknown>,
             },
             timestamp: Date.now(),
-          }).catch((err) => {
-            logger.warn(
-              `Failed to emit tool_use custom event for ${event.toolName}:`,
-              err
-            );
-          });
+          });
+          pendingCustomEventSends.add(sendPromise);
+          sendPromise
+            .catch((err) => {
+              logger.warn(
+                `Failed to emit tool_use custom event for ${event.toolName}:`,
+                err
+              );
+            })
+            .finally(() => {
+              pendingCustomEventSends.delete(sendPromise);
+            });
         }
 
         if (event.type === "agent_end") {
-          flushDelta()
+          Promise.allSettled([...pendingCustomEventSends])
+            .then(() => flushDelta())
             .then(() => resolveTurnDone?.())
             .catch((err) => {
               logger.error("Failed to flush final delta:", err);
               resolveTurnDone?.();
             });
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/agent-worker/src/openclaw/worker.ts` around lines 1466 - 1490, The
tool_use onProgress call is fire-and-forget and can still be pending when
agent_end resolves; change the logic to track and await pending onProgress
promises for tool_use before calling resolveTurnDone. Specifically, when
event.type === "tool_execution_end" (and using buildToolUseEventPayload and
onProgress), push the returned promise (with a catch that logs via logger.warn)
into a local pending array; then when handling event.type === "agent_end" (and
calling flushDelta and resolveTurnDone), await Promise.all on that pending array
(handling any rejections with logging) before invoking resolveTurnDone so all
tool_use sends settle first. Ensure the pending array is cleared after awaiting
to avoid memory leaks.
packages/promptfoo-provider/src/provider.ts (1)

153-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve tool-call metadata on error responses.

On Line 153, the error return omits toolCalls/retrievedContext, even though collectResponse() may already have captured tool_use events. This drops trace data exactly on failure paths.

Suggested fix
       if (response.error) {
         return {
           output: response.text,
           error: response.error,
           metadata: {
             agent: this.agent,
             thread,
             traceId: response.traceId,
+            ...(response.toolCalls && response.toolCalls.length > 0
+              ? { toolCalls: response.toolCalls }
+              : {}),
+            ...(response.retrievedContext
+              ? { retrievedContext: response.retrievedContext }
+              : {}),
           },
         };
       }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/src/provider.ts` around lines 153 - 162, When
handling an error from response (the branch checking response.error), preserve
any tool-call metadata collected by collectResponse() by including toolCalls and
retrievedContext in the returned metadata object; update the error return to
include metadata: { agent: this.agent, thread, traceId: response.traceId,
toolCalls: response.toolCalls || [], retrievedContext: response.retrievedContext
|| null } (or pull these values from whatever local variables collectResponse()
populates) so tool_use events aren't dropped on failure paths.
🧹 Nitpick comments (1)
packages/promptfoo-provider/src/__tests__/provider.test.ts (1)

61-161: ⚡ Quick win

Add a regression test for tool_use + error finalization.

Current cases only assert success completion. Add one case where a tool_use event is emitted before an error event, and assert metadata still includes captured toolCalls (and retrieval context when present).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/promptfoo-provider/src/__tests__/provider.test.ts` around lines 61 -
161, Add a regression test that emits a tool_use SSE event followed by an
error/complete error finalization and asserts captured toolCalls (and
retrievedContext for search_memory) are still present: create a new test in
provider.test.ts that uses sseEvent to emit a tool_use (e.g. toolCallId
"tc-err", name "search_memory" with result_summary/snippets) then an
sseEvent("error", { messageId: "msg-1", error: ... }) (or an error finalization
your SSE stub uses), stub fetch with createFetchStub, call LobuProvider.callApi,
and assert result.metadata.toolCalls contains the tool call and
result_summary.event_ids and that result.metadata.retrievedContext equals the
concatenated snippets when applicable; mirror existing tests’ structure (use
LobuProvider, callApi, expect on metadata.toolCalls and retrievedContext) to
ensure behavior on error finalization is covered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/agent-worker/src/openclaw/tool-use-events.ts`:
- Around line 123-129: The current handler in openclaw/tool-use-events.ts
returns null inside the catch when JSON.parse(text) fails, which aborts scanning
and can miss later valid JSON content (dropping result_summary); change the
logic in the block that checks if (type === "text" && typeof text === "string")
so that on JSON.parse failure you do not return null but instead skip this text
entry and continue scanning subsequent parts (e.g., use continue or otherwise
ignore this entry), allowing later valid JSON content to be parsed and
result_summary to be set; ensure any eventual fallback still returns null only
after all parts are examined.

In `@packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts`:
- Around line 111-119: The test's assertion for conversationBroadcast (variable
conversationBroadcast) only checks a subset of the payload; update the assertion
that checks the broadcasted message for toolCallId "tc-1" and name
"search_memory" to also assert result_summary.snippets, input, and isError are
present and match the constructed payload (alongside existing event_ids,
messageId "m1" and timestamp 1000). Modify the
expect(conversationBroadcast?.[2]) check in unified-thread-consumer.test.ts to
include result_summary.snippets, input, and isError (either by expanding the
toMatchObject expected object or adding explicit expects) so the test validates
the full broadcast payload.

---

Outside diff comments:
In `@packages/agent-worker/src/openclaw/worker.ts`:
- Around line 1466-1490: The tool_use onProgress call is fire-and-forget and can
still be pending when agent_end resolves; change the logic to track and await
pending onProgress promises for tool_use before calling resolveTurnDone.
Specifically, when event.type === "tool_execution_end" (and using
buildToolUseEventPayload and onProgress), push the returned promise (with a
catch that logs via logger.warn) into a local pending array; then when handling
event.type === "agent_end" (and calling flushDelta and resolveTurnDone), await
Promise.all on that pending array (handling any rejections with logging) before
invoking resolveTurnDone so all tool_use sends settle first. Ensure the pending
array is cleared after awaiting to avoid memory leaks.

In `@packages/promptfoo-provider/src/provider.ts`:
- Around line 153-162: When handling an error from response (the branch checking
response.error), preserve any tool-call metadata collected by collectResponse()
by including toolCalls and retrievedContext in the returned metadata object;
update the error return to include metadata: { agent: this.agent, thread,
traceId: response.traceId, toolCalls: response.toolCalls || [],
retrievedContext: response.retrievedContext || null } (or pull these values from
whatever local variables collectResponse() populates) so tool_use events aren't
dropped on failure paths.

---

Nitpick comments:
In `@packages/promptfoo-provider/src/__tests__/provider.test.ts`:
- Around line 61-161: Add a regression test that emits a tool_use SSE event
followed by an error/complete error finalization and asserts captured toolCalls
(and retrievedContext for search_memory) are still present: create a new test in
provider.test.ts that uses sseEvent to emit a tool_use (e.g. toolCallId
"tc-err", name "search_memory" with result_summary/snippets) then an
sseEvent("error", { messageId: "msg-1", error: ... }) (or an error finalization
your SSE stub uses), stub fetch with createFetchStub, call LobuProvider.callApi,
and assert result.metadata.toolCalls contains the tool call and
result_summary.event_ids and that result.metadata.retrievedContext equals the
concatenated snippets when applicable; mirror existing tests’ structure (use
LobuProvider, callApi, expect on metadata.toolCalls and retrievedContext) to
ensure behavior on error finalization is covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6681e6e2-83d2-4a4b-8e8a-b32a8198feeb

📥 Commits

Reviewing files that changed from the base of the PR and between f8f087b and 0dba284.

📒 Files selected for processing (9)
  • examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
  • packages/agent-worker/src/__tests__/tool-use-events.test.ts
  • packages/agent-worker/src/openclaw/tool-use-events.ts
  • packages/agent-worker/src/openclaw/worker.ts
  • packages/promptfoo-provider/HANDOFF.md
  • packages/promptfoo-provider/README.md
  • packages/promptfoo-provider/src/__tests__/provider.test.ts
  • packages/promptfoo-provider/src/provider.ts
  • packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts

Comment on lines +123 to +129
if (type === "text" && typeof text === "string") {
try {
return JSON.parse(text);
} catch {
// Plain text result — nothing to summarise.
return null;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t stop scanning MCP content after the first unparsable text block.

At Lines 126-128, returning null inside catch exits early and can miss a later valid JSON content part, which drops result_summary unexpectedly.

Proposed fix
 function extractSearchMemoryBody(raw: unknown): unknown {
   if (!raw || typeof raw !== "object") return null;
@@
   const mcpContent = (raw as { content?: unknown }).content;
   if (Array.isArray(mcpContent)) {
     for (const part of mcpContent) {
       if (!part || typeof part !== "object") continue;
       const type = (part as { type?: unknown }).type;
       const text = (part as { text?: unknown }).text;
       if (type === "text" && typeof text === "string") {
         try {
           return JSON.parse(text);
         } catch {
-          // Plain text result — nothing to summarise.
-          return null;
+          // Not JSON; keep scanning other parts.
+          continue;
         }
       }
     }
+    return null;
   }
 
   // Already the search_memory body.
   return raw;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/agent-worker/src/openclaw/tool-use-events.ts` around lines 123 -
129, The current handler in openclaw/tool-use-events.ts returns null inside the
catch when JSON.parse(text) fails, which aborts scanning and can miss later
valid JSON content (dropping result_summary); change the logic in the block that
checks if (type === "text" && typeof text === "string") so that on JSON.parse
failure you do not return null but instead skip this text entry and continue
scanning subsequent parts (e.g., use continue or otherwise ignore this entry),
allowing later valid JSON content to be parsed and result_summary to be set;
ensure any eventual fallback still returns null only after all parts are
examined.

Comment on lines +111 to +119
expect(conversationBroadcast?.[2]).toMatchObject({
toolCallId: "tc-1",
name: "search_memory",
result_summary: {
event_ids: [42],
},
messageId: "m1",
timestamp: 1000,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Verify that result_summary.snippets, input, and isError are included in the broadcast.

The test constructs a payload containing input, isError, and result_summary.snippets but only verifies a subset of these fields in the broadcast assertion. Since the PR objectives emphasize that retrieval tools include "event_ids and inline snippets," the test should confirm that snippets are actually broadcast to prevent regressions where critical fields might be accidentally dropped.

🧪 Expand assertions to cover the complete payload
 expect(conversationBroadcast?.[2]).toMatchObject({
   toolCallId: "tc-1",
   name: "search_memory",
+  input: { query: "rent" },
+  isError: false,
   result_summary: {
     event_ids: [42],
+    snippets: [{ id: 42, text: "Rent is due 1st" }],
   },
   messageId: "m1",
   timestamp: 1000,
 });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
expect(conversationBroadcast?.[2]).toMatchObject({
toolCallId: "tc-1",
name: "search_memory",
result_summary: {
event_ids: [42],
},
messageId: "m1",
timestamp: 1000,
});
expect(conversationBroadcast?.[2]).toMatchObject({
toolCallId: "tc-1",
name: "search_memory",
input: { query: "rent" },
isError: false,
result_summary: {
event_ids: [42],
snippets: [{ id: 42, text: "Rent is due 1st" }],
},
messageId: "m1",
timestamp: 1000,
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/__tests__/unified-thread-consumer.test.ts` around
lines 111 - 119, The test's assertion for conversationBroadcast (variable
conversationBroadcast) only checks a subset of the payload; update the assertion
that checks the broadcasted message for toolCallId "tc-1" and name
"search_memory" to also assert result_summary.snippets, input, and isError are
present and match the constructed payload (alongside existing event_ids,
messageId "m1" and timestamp 1000). Modify the
expect(conversationBroadcast?.[2]) check in unified-thread-consumer.test.ts to
include result_summary.snippets, input, and isError (either by expanding the
toMatchObject expected object or adding explicit expects) so the test validates
the full broadcast payload.

Address four real concerns raised by `pi -p` on the previous commit:

1. **tool_use payload was losing `input` args.** pi-agent's
   `tool_execution_end` event omits `args` (those only fire on the matching
   `tool_execution_start`). worker.ts now tracks a `Map<toolCallId, args>` from
   the start event and merges them into the payload at end. Test comment
   updated to call this out so future readers don't regress it.

2. **`result_summary` would never populate for live `search_memory` calls.**
   The MCP proxy returns formatted markdown to workers by default; my extractor
   expected JSON-stringified bodies. Two changes:
   - `callMcpTool` now sends `x-mcp-format: json` for known retrieval tools
     (`search_memory`, `lobu_search_memory`) only — keeps existing markdown
     behaviour for every other tool the agent sees.
   - MCP proxy's `handleCallTool` forwards the caller's `x-mcp-format` header
     to upstream via the new `extraHeaders` param on `sendUpstreamRequest`.
     Internal MCPs (embedded lobu-memory) respect it; external MCPs ignore
     unknown headers.

3. **Race between `tool_use` SSE emit and `complete`.** The subscriber was
   firing `onProgress(...).catch(...)` un-awaited, so a slow custom-event POST
   could land after `complete` (and the promptfoo provider returns on
   `complete`, losing the trace). worker.ts now tracks in-flight tool-use
   promises in a Set and waits for them at `agent_end` before resolving the
   turn, ensuring SSE ordering matches semantic ordering.

4. **Eval + README javascript assertions used the wrong shape.** promptfoo's
   `context.vars` holds test vars, not provider metadata; metadata is on
   `context.providerResponse.metadata`. And `contextTransform` is for context
   assertions (`context-recall` etc.), not `javascript`. Both fixed:
   - `examples/personal-finance/.../promptfooconfig.yaml` JS assertion now
     reads `context.providerResponse.metadata.toolCalls`.
   - README example shows the `context-recall` + `contextTransform` pattern
     (correct shape) alongside a separate `javascript` example reading
     `context.providerResponse?.metadata`.
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 19, 2026

Self-review via pi -p — findings applied in 80d9abb

pi -p flagged four real concerns on the initial cut:

  1. tool_use.input lost args — pi-agent's tool_execution_end event omits args (start-only). Fixed by tracking Map<toolCallId, args> from tool_execution_start and merging at end.
  2. result_summary wouldn't populate live — MCP proxy returns markdown by default; my extractor expected JSON. Fixed by having callMcpTool send x-mcp-format: json only for retrieval tools (search_memory / lobu_search_memory) and forwarding the header through handleCallToolsendUpstreamRequest to internal MCPs. External MCPs / non-retrieval tools keep existing markdown behaviour.
  3. Race between tool_use SSE emit and complete — emits were .catch(...) un-awaited, so a slow POST could land after complete (provider returns on complete, losing the trace). Fixed: in-flight tool_use promises are tracked in a Set and awaited at agent_end before resolving the turn.
  4. Eval/README JS assertion used wrong shapecontext.vars.metadata doesn't exist (vars = test vars); metadata is on context.providerResponse.metadata. And contextTransform is for context assertions, not javascript. Both fixed.

All four already had local repros via the new unit tests (the worker test still validates the merged shape; the provider test stays green because the SSE event payload is the same). make typecheck + bun test for the new files still clean (13/13).

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/server/src/gateway/auth/mcp/proxy.ts`:
- Around line 743-762: The retry path drops the forwarded x-mcp-format header
because extraHeaders (derived from callerFormat) is passed to the first
sendUpstreamRequest call but not to the stale-session reinitialize/retry call;
update the retry logic so that the same extraHeaders (or callerFormat) is passed
into the subsequent sendUpstreamRequest retry. Locate where callerFormat and
extraHeaders are created and ensure the retry branch that calls
this.sendUpstreamRequest(...) includes the extraHeaders argument (same shape as
the first call) so x-mcp-format is preserved on retry (also apply the same
change to the similar block around the 822–830 area).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 2783b9da-528d-4bf5-bc34-0aa2e883a705

📥 Commits

Reviewing files that changed from the base of the PR and between 0dba284 and 80d9abb.

📒 Files selected for processing (6)
  • examples/personal-finance/agents/personal-finance/evals/promptfooconfig.yaml
  • packages/agent-worker/src/__tests__/tool-use-events.test.ts
  • packages/agent-worker/src/openclaw/worker.ts
  • packages/agent-worker/src/shared/tool-implementations.ts
  • packages/promptfoo-provider/README.md
  • packages/server/src/gateway/auth/mcp/proxy.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/promptfoo-provider/README.md

Comment on lines +743 to 762
// Forward the caller's `x-mcp-format` opt-in so internal MCPs (the
// embedded lobu-memory server) can return raw JSON instead of formatted
// markdown. The worker uses this for retrieval tools to surface
// structured `result_summary` (event ids + snippet text) through the
// `tool_use` SSE event.
const callerFormat = c.req.header("x-mcp-format");
const extraHeaders = callerFormat
? { "x-mcp-format": callerFormat }
: undefined;

let response = await this.sendUpstreamRequest(
httpServer,
agentId,
mcpId,
"POST",
jsonRpcBody,
scopeKey,
auth.token
auth.token,
extraHeaders
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve x-mcp-format on stale-session retry.

extraHeaders are forwarded on the first call but dropped on the 404 reinitialize/retry call, so retrieval tools can silently fall back to markdown on exactly the retry path.

🔧 Suggested fix
       response = await this.sendUpstreamRequest(
         httpServer,
         agentId,
         mcpId,
         "POST",
         jsonRpcBody,
         scopeKey,
-        auth.token
+        auth.token,
+        extraHeaders
       );

Also applies to: 822-830

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/gateway/auth/mcp/proxy.ts` around lines 743 - 762, The
retry path drops the forwarded x-mcp-format header because extraHeaders (derived
from callerFormat) is passed to the first sendUpstreamRequest call but not to
the stale-session reinitialize/retry call; update the retry logic so that the
same extraHeaders (or callerFormat) is passed into the subsequent
sendUpstreamRequest retry. Locate where callerFormat and extraHeaders are
created and ensure the retry branch that calls this.sendUpstreamRequest(...)
includes the extraHeaders argument (same shape as the first call) so
x-mcp-format is preserved on retry (also apply the same change to the similar
block around the 822–830 area).

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@buremba buremba merged commit dcb5b1d into main May 19, 2026
26 checks passed
@buremba buremba deleted the feat/tool-use-sse branch May 19, 2026 14:57
buremba added a commit that referenced this pull request May 19, 2026
…onal-finance evals

@lobu/promptfoo-provider gains vars.transcript: string[] support — replays
sequential turns in one Lobu thread, returns the final assistant response
for assertion. Single-turn callers via plain prompt are unchanged.

Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing,
sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml
using vars.transcript. Deletes the original YAML files.

5 provider tests pass (mock-fetch over the gateway endpoints) covering
single-turn baseline, multi-turn ordering + session reuse + single
cleanup, whitespace filter, non-array fallback, empty-array fallback.

Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider
files in main already include #918's additions; this commit is the
strict multi-turn delta.
buremba added a commit that referenced this pull request May 19, 2026
…onal-finance evals (#913)

@lobu/promptfoo-provider gains vars.transcript: string[] support — replays
sequential turns in one Lobu thread, returns the final assistant response
for assertion. Single-turn callers via plain prompt are unchanged.

Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing,
sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml
using vars.transcript. Deletes the original YAML files.

5 provider tests pass (mock-fetch over the gateway endpoints) covering
single-turn baseline, multi-turn ordering + session reuse + single
cleanup, whitespace filter, non-array fallback, empty-array fallback.

Rebased cleanly atop #918 (tool_use SSE) — the agent-worker / provider
files in main already include #918's additions; this commit is the
strict multi-turn delta.
buremba added a commit that referenced this pull request May 19, 2026
…onal-finance evals (#921)

@lobu/promptfoo-provider gains vars.transcript: string[] support — replays
sequential turns in one Lobu thread, returns the final assistant response
for assertion. Single-turn callers via plain prompt are unchanged.

Migrates the 4 dormant personal-finance behavioural YAMLs (gap-surfacing,
sa102-employment, sa105-property, sa108-cgt) into promptfooconfig.yaml
using vars.transcript. Deletes the original YAML files.

Strictly additive atop current main (which already includes #918's tool_use
SSE events). Re-do of #913 after #920 reverted that PR — the original
landing accidentally undid #914 and #916 because of a bad rebase-and-soft-reset.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants