@SrdjanLL SrdjanLL commented Jul 30, 2025

Summary

Relates to: #227997

Adding a new @kbn/evals-suite-onechat package that provides automated evaluation testing for OneChat functionality.

The suite is built on top of the existing @kbn/evals framework, using Phoenix and Playwright for evaluations, and follows the same patterns as other evaluation packages such as @kbn/evals-suite-obs-ai-assistant.

Besides the boilerplate evaluation framework setup, this change adds the following:

  • OneChat API Client: API client that handles chat interactions and conversation management (using the sync API).
  • Evaluation: Playwright test extension with a OneChat-specific fixture and a first, very basic response evaluator.
  • Sample Test Suite: Example evaluation tests for knowledge base (KB) scenarios (from this doc: https://docs.google.com/document/d/1dG4dsImCtTEHeiqXxuC6wXDj497X_fmhJtJnF-jKmQ8/edit?tab=t.0)
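
To make the client and evaluator roles above concrete, here is a rough, self-contained sketch. It is not the code in this PR: every name in it (ChatClient, StubChatClient, evaluateResponse, runExample) is hypothetical, the stub replaces any real HTTP call, and the keyword-overlap scoring is only an assumed stand-in for "a first, very basic response evaluator".

```typescript
// Illustrative sketch only: every name below is hypothetical and is not
// taken from @kbn/evals or @kbn/evals-suite-onechat.

interface ChatResponse {
  conversationId: string;
  message: string;
}

// Minimal client contract: send one message and wait for the complete reply
// (mirroring a synchronous, non-streaming API).
interface ChatClient {
  converse(input: string, conversationId?: string): Promise<ChatResponse>;
}

interface EvaluationResult {
  score: number; // fraction of expected terms found, between 0 and 1
  explanation: string;
}

// Very basic evaluator: case-insensitive check for expected terms in the reply.
function evaluateResponse(response: string, expectedTerms: string[]): EvaluationResult {
  if (expectedTerms.length === 0) {
    return { score: 0, explanation: 'No expected terms were provided.' };
  }
  const haystack = response.toLowerCase();
  const missing = expectedTerms.filter((term) => !haystack.includes(term.toLowerCase()));
  const score = (expectedTerms.length - missing.length) / expectedTerms.length;
  const explanation =
    missing.length === 0 ? 'All expected terms found.' : `Missing terms: ${missing.join(', ')}`;
  return { score, explanation };
}

// Stub client standing in for a real HTTP call, so the sketch runs offline.
class StubChatClient implements ChatClient {
  private readonly cannedReply: string;

  constructor(cannedReply: string) {
    this.cannedReply = cannedReply;
  }

  async converse(_input: string, conversationId?: string): Promise<ChatResponse> {
    // Always returns the canned reply; the input is ignored on purpose.
    return { conversationId: conversationId ?? 'stub-conversation', message: this.cannedReply };
  }
}

// Ask one question through the client and score the reply.
async function runExample(
  client: ChatClient,
  question: string,
  expectedTerms: string[]
): Promise<EvaluationResult> {
  const { message } = await client.converse(question);
  return evaluateResponse(message, expectedTerms);
}
```

For example, runExample(new StubChatClient('The index has 3 documents.'), 'How many documents?', ['3 documents']) resolves to a score of 1. Keeping the evaluator a pure function of the response text means it can be unit tested without a running Kibana instance; a real suite would swap the stub for a client that talks to the OneChat API.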

What's not added (yet)

  • Dataset loading. Knowledge base tests currently don't preload datasets. This should be addressed in a follow-up, as it requires configuring the OneChat HF datasets within our data loaders package.

How it looks:

Running the test suite (screenshot below). Experiment results are available in Phoenix (link). Note: scores are low because dataset loading is not happening yet, so the model can't respond to any data-related questions.
[Screenshot: test suite run]

Checklist

Check that the PR satisfies the following conditions.

Reviewers should verify this PR satisfies this list as well.

  • Documentation was added for features that require explanation or tutorials
  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

@SrdjanLL SrdjanLL requested a review from joemcelroy July 30, 2025 09:20
@SrdjanLL SrdjanLL added the release_note:skip, backport:skip, and backport:version labels Jul 30, 2025
SrdjanLL and others added 7 commits July 31, 2025 09:00
…t --include-path /api/status --include-path /api/alerting/rule/ --include-path /api/alerting/rules --include-path /api/actions --include-path /api/security/role --include-path /api/spaces --include-path /api/streams --include-path /api/fleet --include-path /api/dashboards --include-path /api/saved_objects/_import --include-path /api/saved_objects/_export --include-path /api/maintenance_window --update'
@SrdjanLL SrdjanLL force-pushed the one-chat-eval-suite branch from 44038bc to b3f01f0 Compare July 31, 2025 08:01
@SrdjanLL SrdjanLL enabled auto-merge (squash) July 31, 2025 08:09
@SrdjanLL SrdjanLL merged commit 24e3181 into elastic:main Jul 31, 2025
13 checks passed
@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #120 / Cloud Security Posture GET /internal/cloud_security_posture/stats KSPM Compliance Dashboard Stats API should return KSPM benchmarks V2

Metrics [docs]

✅ unchanged

History

delanni pushed a commit to delanni/kibana that referenced this pull request Aug 5, 2025
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
@wildemat wildemat mentioned this pull request Aug 7, 2025
10 tasks
NicholasPeretti pushed a commit to NicholasPeretti/kibana that referenced this pull request Aug 18, 2025
Labels

backport:skip, backport:version, release_note:skip, v9.2.0


4 participants