
feat(evals): create @kbn/evals-extensions foundation package#258775

Merged
patrykkopycinski merged 18 commits into elastic:main from patrykkopycinski:evals-extensions-foundation
Mar 27, 2026

Conversation

@patrykkopycinski
Contributor

@patrykkopycinski patrykkopycinski commented Mar 20, 2026

Summary

Creates the foundation package @kbn/evals-extensions for advanced evaluation capabilities. This package will house features ported from cursor-plugin-evals and serve as the home for Phases 3-5 of the evals roadmap.

Architecture

One-way dependency:

  • ✅ kbn-evals-extensions depends on kbn-evals
  • ❌ kbn-evals has NO dependency on kbn-evals-extensions

Evaluation suites opt-in by importing from extensions directly.

What's Included

✅ Package structure and build configuration
✅ Comprehensive documentation
✅ 5 passing unit tests
✅ CODEOWNERS entry
✅ No functional changes

Validation

✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
✅ No circular dependencies

Roadmap

This enables PRs #2-10 for cost tracking, dataset management, safety evaluators, UI components, DX enhancements, analytics, A/B testing, human-in-the-loop, and IDE integration.

Related

Co-Authored-By: Claude Sonnet 4.5 (1M context) noreply@anthropic.com

patrykkopycinski and others added 4 commits March 17, 2026 20:42
- Add "Vision Alignment" section to README with strategic principles
  (trace-first, Elastic-native, shared layer boundaries, ownership)
- Add module-level JSDoc to index.ts explaining architecture boundaries
- Document trace-first evaluator contract in Evaluator and EvaluationResult types
- Export createTraceBasedEvaluator and TraceBasedEvaluatorConfig from barrel
  to promote trace-first pattern as the primary building block
- Add JSDoc to all new evaluator factories (security, trajectory, similarity,
  multi-judge, conversation-coherence) explaining purpose and parameters
- Add trace-first migration path annotation to security evaluators module

Addresses vision alignment concerns:
- Section 5.2.1 (trace-first evaluator contract)
- Section 5.2.3 (shared evaluation layer boundaries)
- Section 4.5 (ownership model)
- CI metrics: reduces public API documentation gap
…cution

Two framework bugs prevented Playwright workers from executing @kbn/evals
test suites:

1. `.text` file imports crash workers — packages like @kbn/evals import
   `.text` files (LLM prompt templates) that need a require hook to
   convert them to CommonJS modules. The hook was registered in the main
   process via @kbn/babel-register but Playwright workers use their own
   module resolution. Added a `dot_text_setup.ts` require hook in
   @kbn/scout (mirroring the existing peggy_setup pattern).

2. `NO_COLOR` env warning kills workers — Playwright sets `FORCE_COLOR`
   while `NO_COLOR` may also be in the environment. Node emits a warning
   for this conflict, and `exit_on_warning.js` terminates the process on
   any unrecognized warning. Added this specific warning to the ignore
   list.
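
The two fixes above can be sketched roughly as follows. This is a minimal illustration, not the actual `dot_text_setup.ts` or `exit_on_warning.js` code: `textToCommonJs`, `loadGenerated`, `isIgnoredWarning`, and the warning pattern are hypothetical names and values chosen for the example.

```typescript
// Hypothetical helper: turn the raw contents of a `.text` file into
// CommonJS source whose export is the original string.
function textToCommonJs(content: string): string {
  return `module.exports = ${JSON.stringify(content)};`;
}

// Evaluate the generated source roughly the way a require hook would,
// using a stand-in `module` object instead of Node's real module loader.
function loadGenerated(source: string): unknown {
  const fakeModule: { exports: unknown } = { exports: {} };
  new Function('module', source)(fakeModule);
  return fakeModule.exports;
}

// Hypothetical ignore list. The pattern is an assumption about the wording
// of the NO_COLOR/FORCE_COLOR warning, not a copy of the real list.
const IGNORED_WARNINGS: RegExp[] = [/NO_COLOR.*FORCE_COLOR/i];

// Returns true when a warning message should not terminate the process.
function isIgnoredWarning(message: string): boolean {
  return IGNORED_WARNINGS.some((pattern) => pattern.test(message));
}
```

In this sketch a worker would register `textToCommonJs` for the `.text` extension, and the warning handler would consult `isIgnoredWarning` before exiting.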

Also adds an initial agentic alert triage eval suite with 5 test cases
for the skill migration validation.
This establishes the structure for advanced evaluation capabilities
ported from cursor-plugin-evals and serves as the home for Phases 3-5
of the evals roadmap.

## Architecture

The core @kbn/evals package stays completely independent of this package; only the extensions package depends on the core:

```
Evaluation Suites
     ├──> @kbn/evals (core)
     └──> @kbn/evals-extensions (advanced features)
              └──> depends on @kbn/evals
```

**Dependency Rule:**
- ✅ kbn-evals-extensions CAN import from kbn-evals
- ❌ kbn-evals MUST NOT import from kbn-evals-extensions
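
The rule can be checked mechanically. A minimal sketch, assuming a hypothetical in-memory dependency map rather than Kibana's real package graph or tooling (`DepGraph` and `dependsOn` are illustrative names):

```typescript
// Hypothetical dependency map: package name -> packages it imports.
type DepGraph = Record<string, string[]>;

// Returns true if `from` can reach `to` by following imports transitively.
function dependsOn(
  graph: DepGraph,
  from: string,
  to: string,
  seen: Set<string> = new Set<string>()
): boolean {
  if (seen.has(from)) return false; // already visited, avoid infinite loops
  seen.add(from);
  const deps = graph[from] ?? [];
  return deps.some((dep) => dep === to || dependsOn(graph, dep, to, seen));
}

const graph: DepGraph = {
  'evaluation-suites': ['@kbn/evals', '@kbn/evals-extensions'],
  '@kbn/evals-extensions': ['@kbn/evals'],
  '@kbn/evals': [],
};

// Allowed direction: extensions -> core.
const extensionsUsesCore = dependsOn(graph, '@kbn/evals-extensions', '@kbn/evals');
// Forbidden direction: core -> extensions.
const coreUsesExtensions = dependsOn(graph, '@kbn/evals', '@kbn/evals-extensions');
```

With the map above, only the allowed direction is reachable, which is the property the "no circular dependencies" validation relies on.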

## This PR

**What's included:**
- Package structure (package.json, kibana.jsonc, tsconfig.json)
- Placeholder exports (no functional changes)
- Test infrastructure (5 passing tests)
- Comprehensive documentation

**What's NOT included:**
- No functional features (placeholder exports only)
- No changes to @kbn/evals package
- No changes to evaluation suite behavior

## Validation

✅ Bootstrap completed successfully
✅ Type check passed
✅ All tests passing (5/5)
✅ ESLint passed
✅ No circular dependencies
✅ check_changes.ts passed

## Roadmap

This foundation enables parallel development of:
- PR #2: Cost tracking & metadata enrichment
- PR #3: Dataset management utilities
- PR #4: Safety evaluators (toxicity, PII, bias, etc.)
- PR #5: UI components (run comparison, example explorer)
- PR #6: DX enhancements (watch mode, caching, parallel)
- PR #7: Advanced analytics
- PR #8: A/B testing & active learning
- PR #9: Human-in-the-loop workflows
- PR #10: IDE integration

## Related Issues

- Closes part of elastic#257821 (Epic: Extend @kbn/evals)
- Enables elastic#257823 (Phase 2: CI Quality Gates)
- Enables elastic#257824 (Phase 3: Red-Teaming)
- Enables elastic#257825 (Phase 4: Lens Dashboards)
- Enables elastic#257826 (Phase 5: Auto-Generation)
- Addresses elastic#255820 (kbn/evals <-> Agent Builder completeness)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
@patrykkopycinski patrykkopycinski added release_note:skip Skip the PR/issue when compiling release notes Team:AI Infra Platform AppEx AI Infrastructure Team t// labels Mar 20, 2026
@patrykkopycinski patrykkopycinski marked this pull request as ready for review March 20, 2026 08:49
@patrykkopycinski patrykkopycinski requested review from a team as code owners March 20, 2026 08:49
@elasticmachine
Contributor

Pinging @elastic/appex-ai-infra (Team:AI Infra)

@patrykkopycinski patrykkopycinski self-assigned this Mar 20, 2026
@patrykkopycinski patrykkopycinski added the backport:skip This PR does not require backporting label Mar 20, 2026
@coderabbitai
Contributor

coderabbitai bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

This pull request introduces a new @kbn/evals-extensions package within the Kibana monorepo's shared packages directory. The package serves as a standalone extension layer for the core @kbn/evals evaluation framework, with strict unidirectional dependency boundaries. The PR includes complete package scaffolding (manifest files, configuration, build setup), comprehensive README documentation with architectural guidelines and a multi-phase roadmap, placeholder type and utility exports, and test coverage. Additionally, documentation and exports for trace-based evaluators are added to the core @kbn/evals package, along with JSDoc comments for existing evaluator implementations.

Suggested labels

backport:all-open

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 70.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introduction of a new foundation package @kbn/evals-extensions, which is the primary objective of the PR. |


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts (1)

60-72: Consider filtering non-finite scores.

The code checks for score != null but doesn't filter NaN or Infinity values, which could produce unexpected aggregation results if a judge returns such values.

♻️ Proposed fix to filter invalid scores
         if (result.status === 'fulfilled') {
           judgeResults.push({ name: judges[i].name, result: result.value });
-          if (result.value.score != null) {
+          if (result.value.score != null && Number.isFinite(result.value.score)) {
             scores.push(result.value.score);
           }
         } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts`
around lines 60 - 72, The current loop in results.forEach pushes any non-null
score which can include NaN/Infinity; update the fulfilled branch (inside
results.forEach where judgeResults and scores are updated for judges[i]) to only
push numeric scores that are finite by checking
Number.isFinite(result.value.score) (or otherwise coercing and validating
finiteness) before adding to scores, and keep failedJudges/logger logic
unchanged.
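
To illustrate why the finiteness check matters, here is a minimal sketch. `meanScore` and the sample values are hypothetical; the actual aggregation strategy in multi_judge/index.ts is not shown in this thread, so an arithmetic mean is assumed.

```typescript
// Hypothetical aggregation: arithmetic mean of the collected judge scores.
function meanScore(scores: number[]): number {
  return scores.reduce((sum, s) => sum + s, 0) / scores.length;
}

// Sample judge outputs, including one NaN and one missing score.
const raw: Array<number | null> = [0.8, NaN, null, 0.6];

// Original check: only null/undefined are dropped, so NaN gets through.
const nonNull = raw.filter((s): s is number => s != null);

// Proposed check: also drop NaN and +/-Infinity.
const finite = raw.filter((s): s is number => s != null && Number.isFinite(s));

// NaN propagates through the sum, so the whole aggregate becomes NaN.
const poisoned = meanScore(nonNull);
// Only 0.8 and 0.6 remain, so the aggregate is their mean.
const clean = meanScore(finite);
```

A single non-finite judge score is enough to invalidate the aggregate under the original check, which is the case the suggested `Number.isFinite` guard prevents.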
x-pack/platform/packages/shared/kbn-evals-extensions/index.ts (1)

62-68: Placeholder interface is acceptable for foundation PR.

Consider adding a TODO or @internal annotation to signal this interface will be expanded, preventing consumers from relying on its current minimal shape.

📝 Suggested annotation
+/**
+ * `@internal` Placeholder - will be expanded in future PRs
+ */
 export interface ExtensionConfig {
   /**
    * Configuration for extension features
    * Will be expanded as features are added
    */
   placeholder?: string;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@x-pack/platform/packages/shared/kbn-evals-extensions/index.ts` around lines
62 - 68, Add a JSDoc annotation to the ExtensionConfig interface to indicate it
is a placeholder and will expand (so consumers don't rely on its current shape);
update the doc comment on the exported interface ExtensionConfig to include
either an `@internal` tag or a TODO/@todo note stating it is temporary and subject
to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts`:
- Around line 20-24: The test incorrectly wraps an async function in expect() —
change the assertion to pass the Promise returned by import('..') directly to
resolves; e.g. replace await expect(async () => { await import('..');
}).resolves.not.toThrow(); with await
expect(import('..')).resolves.toBeDefined(); so the import Promise is asserted
correctly (look for the import('..') usage in the test body).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: f6bdda89-339a-4bc3-92cd-680ab7d8bc02

📥 Commits

Reviewing files that changed from the base of the PR and between 5ab22d7 and 4e51bf5.

⛔ Files ignored due to path filters (1)
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (23)
  • .github/CODEOWNERS
  • package.json
  • tsconfig.base.json
  • x-pack/platform/packages/shared/kbn-evals-extensions/.gitignore
  • x-pack/platform/packages/shared/kbn-evals-extensions/README.md
  • x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts
  • x-pack/platform/packages/shared/kbn-evals-extensions/index.ts
  • x-pack/platform/packages/shared/kbn-evals-extensions/jest.config.js
  • x-pack/platform/packages/shared/kbn-evals-extensions/kibana.jsonc
  • x-pack/platform/packages/shared/kbn-evals-extensions/moon.yml
  • x-pack/platform/packages/shared/kbn-evals-extensions/package.json
  • x-pack/platform/packages/shared/kbn-evals-extensions/src/index.ts
  • x-pack/platform/packages/shared/kbn-evals-extensions/src/types/index.ts
  • x-pack/platform/packages/shared/kbn-evals-extensions/src/utils/index.ts
  • x-pack/platform/packages/shared/kbn-evals-extensions/tsconfig.json
  • x-pack/platform/packages/shared/kbn-evals/README.md
  • x-pack/platform/packages/shared/kbn-evals/index.ts
  • x-pack/platform/packages/shared/kbn-evals/src/evaluators/conversation_coherence/index.ts
  • x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts
  • x-pack/platform/packages/shared/kbn-evals/src/evaluators/security/index.ts
  • x-pack/platform/packages/shared/kbn-evals/src/evaluators/similarity/index.ts
  • x-pack/platform/packages/shared/kbn-evals/src/evaluators/trajectory/index.ts
  • x-pack/platform/packages/shared/kbn-evals/src/types.ts

patrykkopycinski and others added 3 commits March 20, 2026 19:49
- Use `export type *` for type-only re-exports (consistent-type-exports)
- Remove redundant scripts/dependencies from package.json to fix jest
  CI reporter expecting --config arg
@patrykkopycinski
Contributor Author

/ci

@patrykkopycinski
Contributor Author

/ci

4 similar comments

…y test

.resolves.not.toThrow() expects a promise but received a function.
Replaced with a direct dynamic import assertion.
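
A minimal sketch of why the original assertion failed, assuming the matcher's guard is roughly a thenable check. `isThenable` and `loadPackage` are hypothetical stand-ins, not Jest's actual implementation or the real package import.

```typescript
// Hypothetical stand-in for the guard a `.resolves` matcher applies:
// the received value must expose a callable `then`.
function isThenable(value: unknown): boolean {
  return (
    value != null &&
    (typeof value === 'object' || typeof value === 'function') &&
    typeof (value as { then?: unknown }).then === 'function'
  );
}

// Stand-in for `import('..')`: any call that returns a promise.
const loadPackage = (): Promise<{ loaded: boolean }> => Promise.resolve({ loaded: true });

// Before: an async arrow function was passed. It is a function, not a promise.
const wrapped = async (): Promise<void> => {
  await loadPackage();
};

// After: the promise itself is passed.
const direct = loadPackage();

const wrappedIsThenable = isThenable(wrapped);
const directIsThenable = isThenable(direct);
```

The wrapped function only produces a promise once it is called, so passing it uncalled gives the matcher nothing to await.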
@patrykkopycinski
Contributor Author

/ci

@patrykkopycinski
Contributor Author

/ci

@elastic elastic deleted a comment from elasticmachine Mar 26, 2026
Member

@spong spong left a comment


Overall structure and foundation LGTM! 👍

Pushed a couple small fixes from initial review, but this is good to merge as-is if you'd like! 😀

Glad to have an extensions area to expand evals like this, good stuff @patrykkopycinski! 🎉

@elasticmachine
Contributor

💚 Build Succeeded

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/evals | 284 | 275 | -9 |
| @kbn/evals-extensions | - | 22 | +22 |
| total | | | +13 |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/evals | 326 | 339 | +13 |
| @kbn/evals-extensions | - | 30 | +30 |
| total | | | +43 |

History

cc @patrykkopycinski

@patrykkopycinski patrykkopycinski merged commit ab24b48 into elastic:main Mar 27, 2026
16 checks passed
@patrykkopycinski patrykkopycinski deleted the evals-extensions-foundation branch March 27, 2026 10:09
kelvtanv pushed a commit to kelvtanv/kibana that referenced this pull request Mar 27, 2026
…#258775)

## Summary

Creates the foundation package `@kbn/evals-extensions` for advanced
evaluation capabilities. This package will house features ported from
cursor-plugin-evals and serve as the home for Phases 3-5 of the evals
roadmap.

## Architecture

**One-way dependency:**
- ✅ kbn-evals-extensions depends on kbn-evals
- ❌ kbn-evals has NO dependency on kbn-evals-extensions

Evaluation suites opt-in by importing from extensions directly.

## What's Included

✅ Package structure and build configuration
✅ Comprehensive documentation
✅ 5 passing unit tests
✅ CODEOWNERS entry
✅ No functional changes

## Validation

✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
✅ No circular dependencies

## Roadmap

This enables PRs #2-10 for cost tracking, dataset management, safety
evaluators, UI components, DX enhancements, analytics, A/B testing,
human-in-the-loop, and IDE integration.

## Related

- Part of elastic#257821
- Enables elastic#257823, elastic#257824, elastic#257825, elastic#257826
- Addresses elastic#255820

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Garrett Spong <garrett.spong@elastic.co>

Labels

backport:skip This PR does not require backporting release_note:skip Skip the PR/issue when compiling release notes Team:AI Infra Platform AppEx AI Infrastructure Team t// v9.4.0

4 participants