[Agent Builder] [AI Infra] Adds product documentation tool and task evals #248370

Merged
spong merged 10 commits into elastic:main from spong:product-doc-evals
Jan 9, 2026

@spong (Member) commented Jan 9, 2026

Note

The baseline evals still need iteration (the two suites are pretty much the same right now), but I wanted to check this in and get it working on CI since we're adding a new package here. I'll tune the baseline evals for each suite so that they're somewhat useful; the intent here is to get something in place that makes further feedback cycles quicker and easier.

Summary

This PR adds two complementary evaluation specs for the Product Documentation experience:

Agent Builder tool-behavior evals
  • Verifies the agent calls only the platform.core.product_documentation tool and follows grounding/insufficiency rules.
  • File: x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/evals/product_documentation/product_documentation.spec.ts
AI Infra retriever-task evals (llm_tasks)
  • Evaluates the llmTasks.retrieveDocumentation task itself (retriever + token reduction).
  • File: x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/evals/retrieve_documentation/retrieve_documentation.spec.ts
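
As a rough illustration of the "calls only the platform.core.product_documentation tool" check described above, here is a minimal sketch. The `ToolCall` shape and both helper names are hypothetical and are not taken from the PR's spec file; the sketch also assumes that a run with no tool calls at all should count as a failure, which the PR description does not state.

```typescript
// Hypothetical shape of one recorded tool call; the real eval harness may differ.
interface ToolCall {
  toolId: string;
}

// Tool id as named in the PR description.
const PRODUCT_DOCUMENTATION_TOOL = 'platform.core.product_documentation';

// Returns true when at least one tool call was made and every call targeted
// the allowed tool. Treating "no calls" as a failure is an assumption.
function onlyCallsAllowedTool(
  calls: ToolCall[],
  allowedToolId: string = PRODUCT_DOCUMENTATION_TOOL
): boolean {
  return calls.length > 0 && calls.every((call) => call.toolId === allowedToolId);
}

// Lists the distinct ids of any other tools that were called, so a failing
// check can report what went wrong.
function unexpectedTools(
  calls: ToolCall[],
  allowedToolId: string = PRODUCT_DOCUMENTATION_TOOL
): string[] {
  const others = calls.map((call) => call.toolId).filter((id) => id !== allowedToolId);
  return others.filter((id, index) => others.indexOf(id) === index);
}

// Example: 'some.other.tool' is a made-up id used only for illustration.
const exampleCalls: ToolCall[] = [
  { toolId: PRODUCT_DOCUMENTATION_TOOL },
  { toolId: 'some.other.tool' },
];
console.log(onlyCallsAllowedTool(exampleCalls), unexpectedTools(exampleCalls));
```

A check like this only covers tool selection; the grounding/insufficiency rules would need a separate, likely judge-based, evaluator.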

Key implementation details

  • New eval suite package for ai-infra tasks: @kbn/evals-suite-llm-tasks
    • Path: x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/
  • New product_documentation eval spec in existing agent-builder/kbn-evals-suite-agent-builder suite

Test Instructions

Start the Scout server in another terminal and keep it running:

node scripts/scout.js start-server --stateful

Start Phoenix in another terminal and keep it running:

node scripts/phoenix

Then run the desired suite:

  1. Agent Builder: product documentation tool eval
EVALUATION_REPETITIONS=1 \
EVALUATION_CONNECTOR_ID="gemini-3-pro" \
node scripts/playwright test \
--config x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/playwright.config.ts \
x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/evals/product_documentation/product_documentation.spec.ts \
--project gemini-3-pro
  2. ai-infra: llm_tasks retrieveDocumentation retriever-task eval
EVALUATION_REPETITIONS=1 \
EVALUATION_CONNECTOR_ID="gemini-3-pro" \
node scripts/playwright test \
--config x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/playwright.config.ts \
x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/evals/retrieve_documentation/retrieve_documentation.spec.ts \
--project gemini-3-pro

Note

Replace the --project value (gemini-3-pro) with the ID of the connector you want to run against, and set EVALUATION_CONNECTOR_ID to the ID of the judge connector.
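
To make the two connector roles in the note above easier to tell apart, here is a small sketch that assembles the same command from named parts. The `EvalRunOptions` type, the `buildEvalCommand` helper, and the example connector ids are hypothetical; only the environment variable names, the flags, and the paths come from the test instructions.

```typescript
// Hypothetical options bag; field names are illustrative only.
interface EvalRunOptions {
  suiteDir: string; // package directory that contains playwright.config.ts
  specPath: string; // spec file path, relative to suiteDir
  projectConnectorId: string; // connector under evaluation (--project)
  judgeConnectorId: string; // judge connector (EVALUATION_CONNECTOR_ID)
  repetitions?: number; // EVALUATION_REPETITIONS, defaults to 1 here
}

// Builds the multi-line shell command used in the test instructions.
function buildEvalCommand(opts: EvalRunOptions): string {
  const { suiteDir, specPath, projectConnectorId, judgeConnectorId, repetitions = 1 } = opts;
  return [
    `EVALUATION_REPETITIONS=${repetitions}`,
    `EVALUATION_CONNECTOR_ID="${judgeConnectorId}"`,
    'node scripts/playwright test',
    `--config ${suiteDir}/playwright.config.ts`,
    `${suiteDir}/${specPath}`,
    `--project ${projectConnectorId}`,
  ].join(' \\\n');
}

// Example with made-up connector ids for the llm_tasks suite.
console.log(
  buildEvalCommand({
    suiteDir: 'x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks',
    specPath: 'evals/retrieve_documentation/retrieve_documentation.spec.ts',
    projectConnectorId: 'my-model-connector',
    judgeConnectorId: 'my-judge-connector',
  })
);
```

Keeping the two ids as separate named fields makes it harder to swap the model under test and the judge by accident.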

PR developed with Cursor + GPT 5.2

@spong requested a review from a team January 9, 2026 00:32
@spong self-assigned this Jan 9, 2026
@spong requested a review from a team as a code owner January 9, 2026 00:32
@spong added the release_note:skip (Skip the PR/issue when compiling release notes), backport:version (Backport to applied version labels), v9.3.0, and v9.4.0 labels Jan 9, 2026
async ({ uiSettings, log }, use) => {
// Ensure AgentBuilder API is enabled before running the evaluation.
// Using Scout's uiSettings fixture is more robust than calling /internal/kibana/settings directly.
await uiSettings.set({ ['agentBuilder:enabled']: true });

FYI this feature flag is being removed soon #248050

@joemcelroy (Member) left a comment

LGTM. It will be interesting when skills come in; I wonder if product_documentation.spec will change to evaluate the skill rather than an agent with a single tool.

@qn895 (Member) left a comment

LGTM 🎉

@elasticmachine (Contributor) commented

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/evals | 134 | 160 | +26 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/evals | 153 | 179 | +26 |

History

cc @spong

@spong merged commit 16e3505 into elastic:main Jan 9, 2026
15 checks passed
@spong deleted the product-doc-evals branch January 9, 2026 21:54
@kibanamachine (Contributor) commented

Starting backport for target branches: 9.3

https://github.com/elastic/kibana/actions/runs/20866660846

@kibanamachine (Contributor) commented

💔 All backports failed

| Status | Branch | Result |
| --- | --- | --- |
| | 9.3 | Backport failed because of merge conflicts |

Manual backport

To create the backport manually run:

node scripts/backport --pr 248370

Questions?

Please refer to the Backport tool documentation

@spong (Member, Author) commented Jan 9, 2026

Will just keep this on main/9.4, as the 9.3 backport has a bunch of conflicts with the OneChat -> Agent Builder (AB) rename. We won't be running these directly in 9.3, so it's fine to just have them from here forward.

@spong removed the backport:version (Backport to applied version labels) label Jan 9, 2026
@spong removed the v9.3.0 label Jan 9, 2026
@kibanamachine added the backport:skip (This PR does not require backporting) label Jan 9, 2026
devamanv pushed a commit to devamanv/kibana that referenced this pull request Jan 12, 2026
…vals (elastic#248370)


Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
mbondyra added a commit to mbondyra/kibana that referenced this pull request Jan 12, 2026
* commit 'c4304e27736c62f17af20d145770b2ae9d3fae30': (418 commits)
  skip failing suite (elastic#89079)
  [ES|QL] Update grammars (elastic#248600)
  skip failing test suite (elastic#248579)
  [ES|QL] Update function metadata (elastic#248601)
  skip failing test suite (elastic#248554)
  Fix flaky test runner serverless flag for Search solution (elastic#248559)
  [Security Solution][Attacks/Alerts][Attacks page][Table section] Remember last selected attack details tab (Summary or Alerts) (elastic#247519) (elastic#247988)
  Fix ES health check poller (elastic#248496)
  Fix collector schema ownership (elastic#241292)
  [api-docs] 2026-01-10 Daily api_docs build (elastic#248574)
  Update dependency cssstyle to v5.3.5 (main) (elastic#237637)
  Update dependency @octokit/rest to v22.0.1 (main) (elastic#243102)
  skip failing test suite (elastic#248504)
  skip failing test suite (elastic#247685)
  Remove broken ecommerce_dashboard journeys (elastic#248162)
  [Obs AI] Hide AI Insight component when there are no connectors (elastic#248542)
  skip failing suite (elastic#248433)
  [Security Solution][Attacks/Alerts][Attacks page][Table section] Hide tabs for generic attack groups (elastic#248444)
  [Agent Builder] [AI Infra] Adds product documentation tool and task evals (elastic#248370)
  [Controls Anywhere] Keep controls focused when creating + editing other panels (elastic#248021)
  ...
Labels

backport:skip (This PR does not require backporting), release_note:skip (Skip the PR/issue when compiling release notes), v9.4.0

6 participants