[1chat] Evaluation Suite #229929

SrdjanLL · 2025-07-30T09:20:47Z

Summary

Relates to: #227997

Adding a new @kbn/evals-suite-onechat package that provides automated evaluation testing for OneChat functionality.

The suite is built on top of the existing @kbn/evals framework to use Phoenix/Playwright for evaluations and follows the same patterns as other evaluation packages (@kbn/evals-suite-obs-ai-assistant).

Besides the boilerplate evaluation framework setup, this change adds the following:

- OneChat API Client: API client for handling chat interactions and conversation management (using the sync API).
- Evaluation: Playwright test extension with a OneChat-specific fixture and a first, very basic response evaluator.
- Sample Test Suite: Example evaluation tests for knowledge base (KB) scenarios (from this doc).
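For readers unfamiliar with what a "very basic response evaluator" could look like, here is a minimal sketch. It is not the code from this PR and does not use the `@kbn/evals` API: the `EvaluationExample` and `EvaluationResult` shapes and the `evaluateResponse` function are hypothetical names chosen for illustration, and keyword matching is an assumed scoring strategy rather than the one the suite necessarily uses.

```typescript
// Hypothetical shapes for illustration only; they are not part of @kbn/evals.
interface EvaluationExample {
  question: string;
  // Terms we assume a correct answer should mention.
  expectedKeywords: string[];
}

interface EvaluationResult {
  // Fraction of expected keywords found in the response, in [0, 1].
  score: number;
  // Expected keywords that the response did not contain.
  missing: string[];
}

// Scores a chat response by case-insensitive keyword containment.
function evaluateResponse(example: EvaluationExample, response: string): EvaluationResult {
  const normalized = response.toLowerCase();
  const missing = example.expectedKeywords.filter(
    (keyword) => !normalized.includes(keyword.toLowerCase())
  );
  const total = example.expectedKeywords.length;
  // An example with no expected keywords is treated as trivially satisfied.
  const score = total === 0 ? 1 : (total - missing.length) / total;
  return { score, missing };
}

// Usage: one of the two expected keywords appears in the response.
const result = evaluateResponse(
  { question: 'What is Kibana?', expectedKeywords: ['kibana', 'elasticsearch'] },
  'Kibana is a UI.'
);
console.log(result.score, result.missing);
```

An evaluator of this kind scores low whenever the model cannot see the underlying data, which matches the low scores reported in this PR while dataset loading is missing.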

What's not added (yet)

- Dataset loading: Knowledge base tests currently don't preload datasets. This should be addressed in a follow-up, as it requires configuring the OneChat HF datasets within our data loaders package.

How it looks:

Running the test suite (screenshot below). Experiment results are available in Phoenix (link). Note: Scores are low because dataset loading is not implemented yet, so the model can't answer any data-related questions.

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios
The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

elasticmachine · 2025-07-31T16:55:08Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 85ead23

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #120 / Cloud Security Posture GET /internal/cloud_security_posture/stats KSPM Compliance Dashboard Stats API should return KSPM benchmarks V2

Metrics [docs]

✅ unchanged

SrdjanLL requested a review from joemcelroy July 30, 2025 09:20

SrdjanLL added the release_note:skip, backport:skip, and backport:version labels Jul 30, 2025

joemcelroy approved these changes Jul 30, 2025

SrdjanLL and others added 7 commits July 31, 2025 09:00

Create kbn-evals-suite-onechat package

9a7aad9

Set up Playwright and Phoenix. Add OneChat API client and a sample test

73e61fa

[CI] Auto-commit changed files from 'node scripts/eslint_all_files --no-cache --fix'

f81074e

[CI] Auto-commit changed files from 'node scripts/generate codeowners'

62f6e76

[CI] Auto-commit changed files from 'node scripts/yarn_deduplicate'

632f987

Fix jest config root dir

b3f01f0

SrdjanLL force-pushed the one-chat-eval-suite branch from 44038bc to b3f01f0 July 31, 2025 08:01

SrdjanLL enabled auto-merge (squash) July 31, 2025 08:09

SrdjanLL added 2 commits July 31, 2025 11:32

Merge branch 'main' into one-chat-eval-suite

7182e89

Merge branch 'main' into one-chat-eval-suite

85ead23

SrdjanLL merged commit 24e3181 into elastic:main Jul 31, 2025
13 checks passed

kibanamachine added the v9.2.0 label Jul 31, 2025

wildemat mentioned this pull request Aug 7, 2025

pr 230826 #231022

Closed

10 tasks
