[Agent Builder] [AI Infra] Adds product documentation tool and task evals #248370
spong merged 10 commits into elastic:main
async ({ uiSettings, log }, use) => {
  // Ensure AgentBuilder API is enabled before running the evaluation.
  // Using Scout's uiSettings fixture is more robust than calling /internal/kibana/settings directly.
  await uiSettings.set({ ['agentBuilder:enabled']: true });
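The setup step in this hunk follows the usual fixture shape: establish the precondition, hand control to the test through `use`, then clean up. Below is a minimal, dependency-free sketch of that shape. `FakeUiSettings`, `agentBuilderEnabled`, and the teardown through `unset` are hypothetical stand-ins for illustration and are not Scout's actual API; only the `uiSettings.set({ ['agentBuilder:enabled']: true })` call comes from the diff.

```typescript
// Hypothetical stand-in for a uiSettings client; not Scout's real implementation.
class FakeUiSettings {
  private readonly values = new Map<string, unknown>();

  async set(settings: Record<string, unknown>): Promise<void> {
    for (const key of Object.keys(settings)) {
      this.values.set(key, settings[key]);
    }
  }

  async unset(...keys: string[]): Promise<void> {
    for (const key of keys) {
      this.values.delete(key);
    }
  }

  get(key: string): unknown {
    return this.values.get(key);
  }
}

type Fixtures = { uiSettings: FakeUiSettings; log: (message: string) => void };

// Fixture body: enable the flag, run the test via `use`, then restore state.
async function agentBuilderEnabled(
  { uiSettings, log }: Fixtures,
  use: () => Promise<void>
): Promise<void> {
  await uiSettings.set({ ['agentBuilder:enabled']: true });
  log('agentBuilder:enabled set to true');
  try {
    await use();
  } finally {
    // Teardown is an assumption of this sketch; the diff only shows the setup step.
    await uiSettings.unset('agentBuilder:enabled');
  }
}
```

Wrapping `use()` in `try`/`finally` means the setting is restored even when the evaluation throws, which is the main reason to prefer a fixture over an ad hoc request at the top of a spec.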
joemcelroy left a comment
LGTM. It will be interesting to see what happens when skills come in; I wonder whether `product_documentation.spec` will shift toward evaluating the skill rather than an agent with a single tool.
💛 Build succeeded, but was flaky
Failed CI Steps
Metrics [docs]
Public APIs missing comments
History
cc @spong
Starting backport for target branches: 9.3 https://github.com/elastic/kibana/actions/runs/20866660846
💔 All backports failed
Manual backport
To create the backport manually run:
Questions?
Please refer to the Backport tool documentation
Will just keep to
…vals (elastic#248370)

> [!NOTE]
> Need to iterate on actual baseline evals (they're pretty much the same now), but wanted to check in and get working on CI since we're adding a new package here. Will tune baseline evals for each so that they're somewhat useful, but the intent here is to get something in place to make further feedback cycles quicker/easier.

## Summary

This PR adds two complementary evaluation specs for the Product Documentation experience:

##### Agent Builder tool-behavior evals
* Verifies the agent calls only the `platform.core.product_documentation` tool and follows grounding/insufficiency rules.
* File: `x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/evals/product_documentation/product_documentation.spec.ts`

##### AI Infra retriever-task evals (llm_tasks)
* Evaluates the `llmTasks.retrieveDocumentation` task itself (retriever + token reduction)
* File: `x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/evals/retrieve_documentation/retrieve_documentation.spec.ts`

#### Key implementation details
* New eval suite package for ai-infra tasks: `@kbn/evals-suite-llm-tasks`
  * Path: `x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/`
* New `product_documentation` eval spec in existing `agent-builder/kbn-evals-suite-agent-builder` suite

#### Test Instructions

Start Scout server in another terminal and keep it running:

```
scripts/scout.js start-server --stateful
```

Start phoenix in another terminal and keep it running:

```
node scripts/phoenix
```

Then run desired suite

1) Agent Builder: product documentation tool eval

```
EVALUATION_REPETITIONS=1 \
EVALUATION_CONNECTOR_ID="gemini-3-pro" \
node scripts/playwright test \
  --config x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/playwright.config.ts \
  x-pack/platform/packages/shared/agent-builder/kbn-evals-suite-agent-builder/evals/product_documentation/product_documentation.spec.ts \
  --project gemini-3-pro
```

<img width="2293" height="958" alt="image" src="https://github.com/user-attachments/assets/d372526e-bdca-4847-b4e2-b0343b7ab390" />

2) ai-infra: llm_tasks retrieveDocumentation retriever-task eval

```
EVALUATION_REPETITIONS=1 \
EVALUATION_CONNECTOR_ID="gemini-3-pro" \
node scripts/playwright test \
  --config x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/playwright.config.ts \
  x-pack/platform/packages/shared/ai-infra/kbn-evals-suite-llm-tasks/evals/retrieve_documentation/retrieve_documentation.spec.ts \
  --project gemini-3-pro
```

<img width="1146" height="396" alt="image" src="https://github.com/user-attachments/assets/345aa83b-fbf7-407d-87bc-18d2c64879e6" />

> [!NOTE]
> Replace `--project gemini-3-pro` with the connector id you want to run against, and EVALUATION_CONNECTOR_ID with the judge connector id.

_PR developed with Cursor + GPT 5.2_

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
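The tool-behavior rule stated in the summary (the agent calls only the `platform.core.product_documentation` tool) can be pictured as a small code-based check. The following is a rough sketch, not the evaluator this PR ships: the `ToolCall` and `ToolUsageResult` shapes and the `evaluateToolUsage` name are hypothetical, and treating "never called the tool at all" as a failure is an assumption of this sketch.

```typescript
// Hypothetical shape of a recorded tool call; the real eval suite's types may differ.
interface ToolCall {
  toolId: string;
}

interface ToolUsageResult {
  score: number; // 1 when the rule holds, 0 otherwise
  unexpectedTools: string[]; // distinct tool ids other than the expected one
}

const PRODUCT_DOCUMENTATION_TOOL = 'platform.core.product_documentation';

// Passes only if the agent called the expected tool at least once and no other tool.
// Requiring at least one call is an assumption of this sketch.
function evaluateToolUsage(
  calls: ToolCall[],
  expectedToolId: string = PRODUCT_DOCUMENTATION_TOOL
): ToolUsageResult {
  const unexpectedTools = Array.from(
    new Set(calls.map((c) => c.toolId).filter((id) => id !== expectedToolId))
  );
  const calledExpected = calls.some((c) => c.toolId === expectedToolId);
  const score = calledExpected && unexpectedTools.length === 0 ? 1 : 0;
  return { score, unexpectedTools };
}
```

Returning the list of unexpected tool ids alongside the score keeps a failing run easy to diagnose. The grounding and insufficiency rules from the summary would need a separate, likely LLM-judged, evaluator and are not covered here.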