feat(evals): create @kbn/evals-extensions foundation package by patrykkopycinski · Pull Request #258775 · elastic/kibana

patrykkopycinski · 2026-03-20T08:45:56Z

Summary

Creates the foundation package @kbn/evals-extensions for advanced evaluation capabilities. This package will house features ported from cursor-plugin-evals and serve as the home for Phases 3-5 of the evals roadmap.

Architecture

One-way dependency:

✅ kbn-evals-extensions depends on kbn-evals
❌ kbn-evals has NO dependency on kbn-evals-extensions

Evaluation suites opt in by importing from extensions directly.
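The one-way rule above can be sketched as a small check over a package dependency map. This is only an illustration: the `DependencyGraph` type, the `dependsOn` and `checkOneWayRule` helpers, and the `my-eval-suite` entry are hypothetical names, not part of this PR or of Kibana's actual circular-dependency tooling.

```typescript
// Hypothetical dependency map: package name -> packages it imports from.
type DependencyGraph = Record<string, readonly string[]>;

const graph: DependencyGraph = {
  '@kbn/evals': [],
  '@kbn/evals-extensions': ['@kbn/evals'],
  // An evaluation suite opts in by importing from both packages directly.
  'my-eval-suite': ['@kbn/evals', '@kbn/evals-extensions'],
};

/** Returns true if `from` reaches `to` by following dependency edges (directly or transitively). */
function dependsOn(g: DependencyGraph, from: string, to: string): boolean {
  const seen = new Set<string>();
  const stack: string[] = [...(g[from] ?? [])];
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (current === to) return true;
    if (seen.has(current)) continue;
    seen.add(current);
    stack.push(...(g[current] ?? []));
  }
  return false;
}

/** Lists violations of the rule "core must never depend on the extension package". */
function checkOneWayRule(g: DependencyGraph, core: string, extension: string): string[] {
  const violations: string[] = [];
  if (dependsOn(g, core, extension)) {
    violations.push(`${core} must not depend on ${extension}`);
  }
  return violations;
}

const violations = checkOneWayRule(graph, '@kbn/evals', '@kbn/evals-extensions');
```

With the map above, `violations` is empty; adding `'@kbn/evals-extensions'` to the `'@kbn/evals'` entry would produce one violation.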

What's Included

✅ Package structure and build configuration
✅ Comprehensive documentation
✅ 5 passing unit tests
✅ CODEOWNERS entry
✅ No functional changes

Validation

✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
✅ No circular dependencies

Roadmap

This enables PRs #2-10 for cost tracking, dataset management, safety evaluators, UI components, DX enhancements, analytics, A/B testing, human-in-the-loop, and IDE integration.

- Add "Vision Alignment" section to README with strategic principles (trace-first, Elastic-native, shared layer boundaries, ownership)
- Add module-level JSDoc to index.ts explaining architecture boundaries
- Document trace-first evaluator contract in Evaluator and EvaluationResult types
- Export createTraceBasedEvaluator and TraceBasedEvaluatorConfig from barrel to promote trace-first pattern as the primary building block
- Add JSDoc to all new evaluator factories (security, trajectory, similarity, multi-judge, conversation-coherence) explaining purpose and parameters
- Add trace-first migration path annotation to security evaluators module

Addresses vision alignment concerns:

- Section 5.2.1 (trace-first evaluator contract)
- Section 5.2.3 (shared evaluation layer boundaries)
- Section 4.5 (ownership model)
- CI metrics: reduces public API documentation gap

…cution

Two framework bugs prevented Playwright workers from executing @kbn/evals test suites:

1. `.text` file imports crash workers: packages like @kbn/evals import `.text` files (LLM prompt templates) that need a require hook to convert them to CommonJS modules. The hook was registered in the main process via @kbn/babel-register, but Playwright workers use their own module resolution. Added a `dot_text_setup.ts` require hook in @kbn/scout (mirroring the existing peggy_setup pattern).
2. `NO_COLOR` env warning kills workers: Playwright sets `FORCE_COLOR` while `NO_COLOR` may also be in the environment. Node emits a warning for this conflict, and `exit_on_warning.js` terminates the process on any unrecognized warning. Added this specific warning to the ignore list.

Also adds an initial agentic alert triage eval suite with 5 test cases for the skill migration validation.
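The two fixes described in this commit message can be sketched in isolation. This is a minimal illustration under assumptions, not the contents of `dot_text_setup.ts` or `exit_on_warning.js`: the helper names (`textToCommonJs`, `evaluateCommonJs`, `shouldExitOnWarning`) are hypothetical, and the warning string is an assumed wording of Node's message that should be checked against the real output.

```typescript
/**
 * Turns the raw contents of a `.text` file into CommonJS source that
 * exports those contents as a string. A require hook would hand this
 * source to the module loader instead of the raw file.
 */
function textToCommonJs(contents: string): string {
  return `module.exports = ${JSON.stringify(contents)};`;
}

/** Evaluates generated CommonJS source roughly as a loader would and returns `module.exports`. */
function evaluateCommonJs(source: string): unknown {
  const mod: { exports: unknown } = { exports: {} };
  new Function('module', 'exports', source)(mod, mod.exports);
  return mod.exports;
}

// Assumed wording of the Node warning emitted when both variables are set.
const IGNORED_WARNINGS: readonly string[] = [
  "The 'NO_COLOR' env is ignored due to the 'FORCE_COLOR' env being set.",
];

/** Returns false for warnings on the ignore list, true for anything unrecognized. */
function shouldExitOnWarning(
  message: string,
  ignored: readonly string[] = IGNORED_WARNINGS
): boolean {
  return !ignored.some((text) => message.includes(text));
}

const prompt = 'You are a judge.\nRate the answer: "{{answer}}"';
const loaded = evaluateCommonJs(textToCommonJs(prompt));
```

Here `loaded` is the original prompt string, which is what a consumer importing a `.text` file would receive once the hook is registered in the worker process.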

…vals execution" This reverts commit 5add16c.

This establishes the structure for advanced evaluation capabilities ported from cursor-plugin-evals and serves as the home for Phases 3-5 of the evals roadmap.

## Architecture

The package is designed to be completely independent from @kbn/evals:

```
Evaluation Suites
├──> @kbn/evals (core)
└──> @kbn/evals-extensions (advanced features)
     └──> depends on @kbn/evals
```

**Dependency Rule:**
- ✅ kbn-evals-extensions CAN import from kbn-evals
- ❌ kbn-evals MUST NOT import from kbn-evals-extensions

## This PR

**What's included:**
- Package structure (package.json, kibana.jsonc, tsconfig.json)
- Placeholder exports (no functional changes)
- Test infrastructure (5 passing tests)
- Comprehensive documentation

**What's NOT included:**
- No functional features (placeholder exports only)
- No changes to @kbn/evals package
- No changes to evaluation suite behavior

## Validation

✅ Bootstrap completed successfully
✅ Type check passed
✅ All tests passing (5/5)
✅ ESLint passed
✅ No circular dependencies
✅ check_changes.ts passed

## Roadmap

This foundation enables parallel development of:
- PR #2: Cost tracking & metadata enrichment
- PR #3: Dataset management utilities
- PR #4: Safety evaluators (toxicity, PII, bias, etc.)
- PR #5: UI components (run comparison, example explorer)
- PR #6: DX enhancements (watch mode, caching, parallel)
- PR #7: Advanced analytics
- PR #8: A/B testing & active learning
- PR #9: Human-in-the-loop workflows
- PR elastic#10: IDE integration

## Related Issues

- Closes part of elastic#257821 (Epic: Extend @kbn/evals)
- Enables elastic#257823 (Phase 2: CI Quality Gates)
- Enables elastic#257824 (Phase 3: Red-Teaming)
- Enables elastic#257825 (Phase 4: Lens Dashboards)
- Enables elastic#257826 (Phase 5: Auto-Generation)
- Addresses elastic#255820 (kbn/evals <-> Agent Builder completeness)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

elasticmachine · 2026-03-20T08:49:58Z

Pinging @elastic/appex-ai-infra (Team:AI Infra)

coderabbitai · 2026-03-20T09:23:35Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

- `@coderabbitai resume` to resume automatic reviews.
- `@coderabbitai review` to trigger a single review.

📝 Walkthrough

This pull request introduces a new @kbn/evals-extensions package within the Kibana monorepo's shared packages directory. The package serves as a standalone extension layer for the core @kbn/evals evaluation framework, with strict unidirectional dependency boundaries. The PR includes complete package scaffolding (manifest files, configuration, build setup), comprehensive README documentation with architectural guidelines and a multi-phase roadmap, placeholder type and utility exports, and test coverage. Additionally, documentation and exports for trace-based evaluators are added to the core @kbn/evals package, along with JSDoc comments for existing evaluator implementations.

Suggested labels

backport:all-open

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introduction of a new foundation package `@kbn/evals-extensions`, which is the primary objective of the PR. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment `@coderabbitai help` to get the list of available commands and usage tips.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts (1)

60-72: Consider filtering non-finite scores.

The code checks for score != null but doesn't filter NaN or Infinity values, which could produce unexpected aggregation results if a judge returns such values.

♻️ Proposed fix to filter invalid scores

         if (result.status === 'fulfilled') {
           judgeResults.push({ name: judges[i].name, result: result.value });
-          if (result.value.score != null) {
+          if (result.value.score != null && Number.isFinite(result.value.score)) {
             scores.push(result.value.score);
           }
         } else {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts`
around lines 60 - 72, The current loop in results.forEach pushes any non-null
score which can include NaN/Infinity; update the fulfilled branch (inside
results.forEach where judgeResults and scores are updated for judges[i]) to only
push numeric scores that are finite by checking
Number.isFinite(result.value.score) (or otherwise coercing and validating
finiteness) before adding to scores, and keep failedJudges/logger logic
unchanged.
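The suggested guard can be sketched as a standalone helper. This is an illustration under assumptions, not the code in `multi_judge/index.ts`: the `JudgeResult` and `Settled` types and the `collectScores` function are hypothetical, with `Settled` mirroring the shape that `Promise.allSettled` returns.

```typescript
// Hypothetical result shape: a judge may return no score at all.
interface JudgeResult {
  score?: number | null;
}

// Mirrors the shape of entries returned by Promise.allSettled.
type Settled<T> =
  | { status: 'fulfilled'; value: T }
  | { status: 'rejected'; reason: unknown };

/**
 * Splits settled judge results into usable scores and failed judge names.
 * Only finite numbers are kept, so null, undefined, NaN and +/-Infinity
 * never reach the aggregation step.
 */
function collectScores(
  judgeNames: readonly string[],
  results: readonly Settled<JudgeResult>[]
): { scores: number[]; failedJudges: string[] } {
  const scores: number[] = [];
  const failedJudges: string[] = [];
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      const { score } = result.value;
      if (typeof score === 'number' && Number.isFinite(score)) {
        scores.push(score);
      }
    } else {
      failedJudges.push(judgeNames[i]);
    }
  });
  return { scores, failedJudges };
}

const collected = collectScores(
  ['strict', 'lenient', 'flaky', 'broken'],
  [
    { status: 'fulfilled', value: { score: 0.5 } },
    { status: 'fulfilled', value: { score: NaN } },
    { status: 'rejected', reason: new Error('timeout') },
    { status: 'fulfilled', value: { score: Infinity } },
  ]
);
```

In real use the second argument would come from `await Promise.allSettled(...)` over the judges' evaluation calls. `Number.isFinite` does not coerce its argument, so it also rejects non-numeric values that a loosely typed judge might return.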

x-pack/platform/packages/shared/kbn-evals-extensions/index.ts (1)

62-68: Placeholder interface is acceptable for foundation PR.

Consider adding a TODO or @internal annotation to signal this interface will be expanded, preventing consumers from relying on its current minimal shape.

📝 Suggested annotation

+/**
+ * `@internal` Placeholder - will be expanded in future PRs
+ */
 export interface ExtensionConfig {
   /**
    * Configuration for extension features
    * Will be expanded as features are added
    */
   placeholder?: string;
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@x-pack/platform/packages/shared/kbn-evals-extensions/index.ts` around lines
62 - 68, Add a JSDoc annotation to the ExtensionConfig interface to indicate it
is a placeholder and will expand (so consumers don't rely on its current shape);
update the doc comment on the exported interface ExtensionConfig to include
either an `@internal` tag or a TODO/@todo note stating it is temporary and subject
to change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts`:
- Around line 20-24: The test incorrectly wraps an async function in expect() —
change the assertion to pass the Promise returned by import('..') directly to
resolves; e.g. replace await expect(async () => { await import('..');
}).resolves.not.toThrow(); with await
expect(import('..')).resolves.toBeDefined(); so the import Promise is asserted
correctly (look for the import('..') usage in the test body).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f6bdda89-339a-4bc3-92cd-680ab7d8bc02

📥 Commits

Reviewing files that changed from the base of the PR and between 5ab22d7 and 4e51bf5.

⛔ Files ignored due to path filters (1)

yarn.lock is excluded by !**/yarn.lock, !**/*.lock

📒 Files selected for processing (23)

.github/CODEOWNERS
package.json
tsconfig.base.json
x-pack/platform/packages/shared/kbn-evals-extensions/.gitignore
x-pack/platform/packages/shared/kbn-evals-extensions/README.md
x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts
x-pack/platform/packages/shared/kbn-evals-extensions/index.ts
x-pack/platform/packages/shared/kbn-evals-extensions/jest.config.js
x-pack/platform/packages/shared/kbn-evals-extensions/kibana.jsonc
x-pack/platform/packages/shared/kbn-evals-extensions/moon.yml
x-pack/platform/packages/shared/kbn-evals-extensions/package.json
x-pack/platform/packages/shared/kbn-evals-extensions/src/index.ts
x-pack/platform/packages/shared/kbn-evals-extensions/src/types/index.ts
x-pack/platform/packages/shared/kbn-evals-extensions/src/utils/index.ts
x-pack/platform/packages/shared/kbn-evals-extensions/tsconfig.json
x-pack/platform/packages/shared/kbn-evals/README.md
x-pack/platform/packages/shared/kbn-evals/index.ts
x-pack/platform/packages/shared/kbn-evals/src/evaluators/conversation_coherence/index.ts
x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts
x-pack/platform/packages/shared/kbn-evals/src/evaluators/security/index.ts
x-pack/platform/packages/shared/kbn-evals/src/evaluators/similarity/index.ts
x-pack/platform/packages/shared/kbn-evals/src/evaluators/trajectory/index.ts
x-pack/platform/packages/shared/kbn-evals/src/types.ts

- Use `export type *` for type-only re-exports (consistent-type-exports)
- Remove redundant scripts/dependencies from package.json to fix jest CI reporter expecting --config arg

patrykkopycinski · 2026-03-25T21:24:05Z

/ci

patrykkopycinski · 2026-03-25T21:49:22Z

/ci

patrykkopycinski · 2026-03-26T00:22:26Z

/ci

patrykkopycinski · 2026-03-26T01:51:08Z

/ci

patrykkopycinski · 2026-03-26T04:03:24Z

/ci

patrykkopycinski · 2026-03-26T08:52:50Z

/ci

…y test .resolves.not.toThrow() expects a promise but received a function. Replaced with a direct dynamic import assertion.
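The difference between the failing and the fixed assertion comes down to what is handed to the matcher. The sketch below shows that distinction without Jest. `loadPackage` is a hypothetical stand-in for `import('..')`, and `isThenable` only approximates the kind of check a `.resolves` matcher performs; Jest's actual implementation may differ in detail.

```typescript
// Hypothetical stand-in for `import('..')`: resolves to a module-like object.
const loadPackage = async (): Promise<{ name: string }> => ({ name: 'placeholder' });

/** True when the value has a callable `then`, which is what a promise-based matcher needs. */
function isThenable(value: unknown): value is PromiseLike<unknown> {
  return (
    (typeof value === 'object' || typeof value === 'function') &&
    value !== null &&
    typeof (value as { then?: unknown }).then === 'function'
  );
}

// What the original test passed: an async function that has not been called yet.
const wrapped = async (): Promise<void> => {
  await loadPackage();
};

// What the fix passes: the promise itself.
const direct = loadPackage();

const wrappedIsThenable = isThenable(wrapped); // false: it is a function, not a promise
const directIsThenable = isThenable(direct); // true
```

An async function only produces a promise once it is invoked, so passing the promise returned by the dynamic import directly is the simpler form.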

patrykkopycinski · 2026-03-26T21:26:56Z

/ci

…undation

patrykkopycinski · 2026-03-26T21:28:07Z

/ci

…s-foundation

spong

Overall structure and foundation LGTM! 👍

Pushed a couple small fixes from initial review, but this is good to merge as-is if you'd like! 😀

Glad to have an extensions area to expand evals like this, good stuff @patrykkopycinski! 🎉

elasticmachine · 2026-03-27T06:53:32Z

💚 Build Succeeded

Buildkite Build
Commit: 4643b7e

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
|---|---|---|---|
| `@kbn/evals` | 284 | 275 | -9 |
| `@kbn/evals-extensions` | - | 22 | +22 |
| total | | | +13 |

Unknown metric groups

API count

| id | before | after | diff |
|---|---|---|---|
| `@kbn/evals` | 326 | 339 | +13 |
| `@kbn/evals-extensions` | - | 30 | +30 |
| total | | | +43 |

History

💔 Build #417606 failed 0e9e572

cc @patrykkopycinski

…#258775)

## Summary

Creates the foundation package `@kbn/evals-extensions` for advanced evaluation capabilities. This package will house features ported from cursor-plugin-evals and serve as the home for Phases 3-5 of the evals roadmap.

## Architecture

**One-way dependency:**
- ✅ kbn-evals-extensions depends on kbn-evals
- ❌ kbn-evals has NO dependency on kbn-evals-extensions

Evaluation suites opt-in by importing from extensions directly.

## What's Included

✅ Package structure and build configuration
✅ Comprehensive documentation
✅ 5 passing unit tests
✅ CODEOWNERS entry
✅ No functional changes

## Validation

✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
✅ No circular dependencies

## Roadmap

This enables PRs #2-10 for cost tracking, dataset management, safety evaluators, UI components, DX enhancements, analytics, A/B testing, human-in-the-loop, and IDE integration.

## Related

- Part of elastic#257821
- Enables elastic#257823, elastic#257824, elastic#257825, elastic#257826
- Addresses elastic#255820

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Garrett Spong <garrett.spong@elastic.co>

patrykkopycinski and others added 4 commits March 17, 2026 20:42

Revert "fix(evals): resolve Playwright worker crashes blocking @kbn/e…

9abcea5

…vals execution" This reverts commit 5add16c.

patrykkopycinski added the release_note:skip and Team:AI Infra labels Mar 20, 2026

patrykkopycinski marked this pull request as ready for review March 20, 2026 08:49

patrykkopycinski requested review from a team as code owners March 20, 2026 08:49

patrykkopycinski self-assigned this Mar 20, 2026

patrykkopycinski added the backport:skip label Mar 20, 2026

kibanamachine added 4 commits March 20, 2026 09:05

Changes from node scripts/lint_ts_projects --fix

bf1e95c

Changes from node scripts/lint_packages --fix

8467f10

Changes from node scripts/generate codeowners

29401c3

Changes from node scripts/regenerate_moon_projects.js --update

4e51bf5

coderabbitai bot reviewed Mar 20, 2026

x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts (resolved)

patrykkopycinski and others added 3 commits March 20, 2026 19:49

Merge branch 'main' into evals-extensions-foundation

3334694

fix(evals): resolve CI failures in @kbn/evals-extensions

b459fdc

- Use `export type *` for type-only re-exports (consistent-type-exports)
- Remove redundant scripts/dependencies from package.json to fix jest CI reporter expecting --config arg

Changes from node scripts/lint.js --fix

4b03027

Changes from node scripts/regenerate_moon_projects.js --update

0160158

fix(evals-extensions): fix jest matcher error in package importabilit…

9e09605

…y test .resolves.not.toThrow() expects a promise but received a function. Replaced with a direct dynamic import assertion.

Merge remote-tracking branch 'upstream/main' into evals-extensions-fo…

fffd948

…undation

elastic deleted a comment from elasticmachine Mar 26, 2026

spong and others added 3 commits March 26, 2026 23:23

First pass review fixes

4840a8e

Merge branch 'main' of github.com:elastic/kibana into evals-extension…

0e9e572

…s-foundation

Changes from node scripts/lint.js --fix

c8720ad

spong approved these changes Mar 27, 2026

Changes from node scripts/lint.js --fix

4643b7e

patrykkopycinski merged commit ab24b48 into elastic:main Mar 27, 2026
16 checks passed

patrykkopycinski deleted the evals-extensions-foundation branch March 27, 2026 10:09

kibanamachine added the v9.4.0 label Mar 27, 2026
