feat(evals): create @kbn/evals-extensions foundation package #258775
patrykkopycinski merged 18 commits into elastic:main
- Add "Vision Alignment" section to README with strategic principles (trace-first, Elastic-native, shared layer boundaries, ownership)
- Add module-level JSDoc to index.ts explaining architecture boundaries
- Document trace-first evaluator contract in Evaluator and EvaluationResult types
- Export createTraceBasedEvaluator and TraceBasedEvaluatorConfig from barrel to promote trace-first pattern as the primary building block
- Add JSDoc to all new evaluator factories (security, trajectory, similarity, multi-judge, conversation-coherence) explaining purpose and parameters
- Add trace-first migration path annotation to security evaluators module

Addresses vision alignment concerns:
- Section 5.2.1 (trace-first evaluator contract)
- Section 5.2.3 (shared evaluation layer boundaries)
- Section 4.5 (ownership model)
- CI metrics: reduces public API documentation gap
…cution

Two framework bugs prevented Playwright workers from executing @kbn/evals test suites:

1. `.text` file imports crash workers: packages like @kbn/evals import `.text` files (LLM prompt templates) that need a require hook to convert them to CommonJS modules. The hook was registered in the main process via @kbn/babel-register but Playwright workers use their own module resolution. Added a `dot_text_setup.ts` require hook in @kbn/scout (mirroring the existing peggy_setup pattern).
2. `NO_COLOR` env warning kills workers: Playwright sets `FORCE_COLOR` while `NO_COLOR` may also be in the environment. Node emits a warning for this conflict, and `exit_on_warning.js` terminates the process on any unrecognized warning. Added this specific warning to the ignore list.

Also adds an initial agentic alert triage eval suite with 5 test cases for the skill migration validation.
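The `.text` fix in this commit relies on turning a text file into a CommonJS module. The sketch below illustrates only that conversion step under simplifying assumptions: `textToCommonJs` and `evaluateCommonJs` are hypothetical names, and this is not the code in `dot_text_setup.ts`, which is not shown on this page.

```typescript
// Hypothetical helper: turn the contents of a `.text` file into CommonJS source.
// JSON.stringify escapes quotes, backslashes, and newlines, so the result is a
// valid JavaScript string literal.
function textToCommonJs(contents: string): string {
  return `module.exports = ${JSON.stringify(contents)};`;
}

// Hypothetical helper: evaluate the generated source against a fresh `module`
// object, roughly the way a module loader would.
function evaluateCommonJs(source: string): unknown {
  const mod: { exports: unknown } = { exports: {} };
  new Function('module', 'exports', source)(mod, mod.exports);
  return mod.exports;
}

// A prompt template with characters that need escaping.
const prompt = 'You are a "triage" assistant.\nAnswer in one line.';
const loaded = evaluateCommonJs(textToCommonJs(prompt));
```

In a real hook, the generated source would be handed to the module system for files ending in `.text`; that wiring is omitted here because it depends on the loader in use.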
…vals execution" This reverts commit 5add16c.
This establishes the structure for advanced evaluation capabilities
ported from cursor-plugin-evals and serves as the home for Phases 3-5
of the evals roadmap.
## Architecture
@kbn/evals is designed to stay completely independent of this package; the dependency runs one way, from @kbn/evals-extensions to @kbn/evals:
```
Evaluation Suites
├──> @kbn/evals (core)
└──> @kbn/evals-extensions (advanced features)
     └──> depends on @kbn/evals
```
**Dependency Rule:**
- ✅ kbn-evals-extensions CAN import from kbn-evals
- ❌ kbn-evals MUST NOT import from kbn-evals-extensions
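The dependency rule above can be expressed as a simple check over import edges. This is a minimal sketch, assuming a hypothetical `ImportEdge` shape and a hypothetical `findViolations` helper; it is not how Kibana's own circular-dependency tooling works.

```typescript
// Hypothetical representation of "package A imports package B".
type ImportEdge = { from: string; to: string };

const CORE = '@kbn/evals';
const EXTENSIONS = '@kbn/evals-extensions';

// Returns the edges that break the rule: core must never import extensions.
function findViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter((edge) => edge.from === CORE && edge.to === EXTENSIONS);
}

// 'my-eval-suite' is an illustrative consumer, not a real package.
const edges: ImportEdge[] = [
  { from: EXTENSIONS, to: CORE }, // allowed: extensions build on core
  { from: 'my-eval-suite', to: CORE }, // allowed
  { from: 'my-eval-suite', to: EXTENSIONS }, // allowed: suites opt in directly
  { from: CORE, to: EXTENSIONS }, // forbidden: would create a cycle
];

const violations = findViolations(edges);
```

Only the last edge is reported, which is the case the rule is meant to prevent.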
## This PR
**What's included:**
- Package structure (package.json, kibana.jsonc, tsconfig.json)
- Placeholder exports (no functional changes)
- Test infrastructure (5 passing tests)
- Comprehensive documentation
**What's NOT included:**
- No functional features (placeholder exports only)
- No changes to @kbn/evals package
- No changes to evaluation suite behavior
## Validation
✅ Bootstrap completed successfully
✅ Type check passed
✅ All tests passing (5/5)
✅ ESLint passed
✅ No circular dependencies
✅ check_changes.ts passed
## Roadmap
This foundation enables parallel development of:
- PR #2: Cost tracking & metadata enrichment
- PR #3: Dataset management utilities
- PR #4: Safety evaluators (toxicity, PII, bias, etc.)
- PR #5: UI components (run comparison, example explorer)
- PR #6: DX enhancements (watch mode, caching, parallel)
- PR #7: Advanced analytics
- PR #8: A/B testing & active learning
- PR #9: Human-in-the-loop workflows
- PR #10: IDE integration
## Related Issues
- Closes part of elastic#257821 (Epic: Extend @kbn/evals)
- Enables elastic#257823 (Phase 2: CI Quality Gates)
- Enables elastic#257824 (Phase 3: Red-Teaming)
- Enables elastic#257825 (Phase 4: Lens Dashboards)
- Enables elastic#257826 (Phase 5: Auto-Generation)
- Addresses elastic#255820 (kbn/evals <-> Agent Builder completeness)
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Pinging @elastic/appex-ai-infra (Team:AI Infra)
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts (1)
60-72: Consider filtering non-finite scores.
The code checks for `score != null` but doesn't filter `NaN` or `Infinity` values, which could produce unexpected aggregation results if a judge returns such values.
♻️ Proposed fix to filter invalid scores

     if (result.status === 'fulfilled') {
       judgeResults.push({ name: judges[i].name, result: result.value });
    -  if (result.value.score != null) {
    +  if (result.value.score != null && Number.isFinite(result.value.score)) {
         scores.push(result.value.score);
       }
     } else {

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts` around lines 60 - 72, The current loop in results.forEach pushes any non-null score which can include NaN/Infinity; update the fulfilled branch (inside results.forEach where judgeResults and scores are updated for judges[i]) to only push numeric scores that are finite by checking Number.isFinite(result.value.score) (or otherwise coercing and validating finiteness) before adding to scores, and keep failedJudges/logger logic unchanged.
x-pack/platform/packages/shared/kbn-evals-extensions/index.ts (1)
62-68: Placeholder interface is acceptable for foundation PR.
Consider adding a `TODO` or `@internal` annotation to signal this interface will be expanded, preventing consumers from relying on its current minimal shape.
📝 Suggested annotation

    +/**
    + * `@internal` Placeholder - will be expanded in future PRs
    + */
     export interface ExtensionConfig {
       /**
        * Configuration for extension features
        * Will be expanded as features are added
        */
       placeholder?: string;
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@x-pack/platform/packages/shared/kbn-evals-extensions/index.ts` around lines 62 - 68, Add a JSDoc annotation to the ExtensionConfig interface to indicate it is a placeholder and will expand (so consumers don't rely on its current shape); update the doc comment on the exported interface ExtensionConfig to include either an `@internal` tag or a TODO/@todo note stating it is temporary and subject to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts`:
- Around line 20-24: The test incorrectly wraps an async function in expect() —
change the assertion to pass the Promise returned by import('..') directly to
resolves; e.g. replace await expect(async () => { await import('..');
}).resolves.not.toThrow(); with await
expect(import('..')).resolves.toBeDefined(); so the import Promise is asserted
correctly (look for the import('..') usage in the test body).
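The inline comment above turns on the difference between an async function and the promise it returns. The sketch below shows that distinction without Jest; `isPromiseLike` is a hypothetical stand-in for the kind of check a `.resolves`-style matcher performs, and `loadModule` stands in for `import('..')`.

```typescript
// Hypothetical stand-in for the promise check behind a `.resolves`-style matcher.
function isPromiseLike(value: unknown): value is PromiseLike<unknown> {
  return (
    value !== null &&
    (typeof value === 'object' || typeof value === 'function') &&
    typeof (value as { then?: unknown }).then === 'function'
  );
}

// Stand-in for `import('..')`: resolves to a module-like object.
const loadModule = (): Promise<{ placeholder: string }> =>
  Promise.resolve({ placeholder: 'ok' });

// Wrapping the call in an async arrow yields a function, not a promise.
const wrapped = async (): Promise<void> => {
  await loadModule();
};

// The original test passed `wrapped`; the fix passes the promise itself.
const wrappedIsPromise = isPromiseLike(wrapped);
const directIsPromise = isPromiseLike(loadModule());
```

Here `wrappedIsPromise` is `false` and `directIsPromise` is `true`, which matches the later commit note that `.resolves.not.toThrow()` "expects a promise but received a function".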
---
Nitpick comments:
In `@x-pack/platform/packages/shared/kbn-evals-extensions/index.ts`:
- Around line 62-68: Add a JSDoc annotation to the ExtensionConfig interface to
indicate it is a placeholder and will expand (so consumers don't rely on its
current shape); update the doc comment on the exported interface ExtensionConfig
to include either an `@internal` tag or a TODO/@todo note stating it is temporary
and subject to change.
In
`@x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts`:
- Around line 60-72: The current loop in results.forEach pushes any non-null
score which can include NaN/Infinity; update the fulfilled branch (inside
results.forEach where judgeResults and scores are updated for judges[i]) to only
push numeric scores that are finite by checking
Number.isFinite(result.value.score) (or otherwise coercing and validating
finiteness) before adding to scores, and keep failedJudges/logger logic
unchanged.
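The multi-judge nitpick comes down to one extra guard when collecting scores. The following is a minimal sketch of that guard in isolation; `JudgeResult`, `Settled`, and `collectFiniteScores` are hypothetical names chosen for illustration and do not reproduce the evaluator's actual types.

```typescript
// Hypothetical, simplified shapes; the real evaluator types are not shown here.
interface JudgeResult {
  score?: number | null;
}

type Settled<T> =
  | { status: 'fulfilled'; value: T }
  | { status: 'rejected'; reason: unknown };

// Keep only scores that are present and finite, so NaN and Infinity
// cannot skew a later aggregation such as a mean.
function collectFiniteScores(results: Array<Settled<JudgeResult>>): number[] {
  const scores: number[] = [];
  for (const result of results) {
    if (result.status === 'fulfilled') {
      const { score } = result.value;
      if (score != null && Number.isFinite(score)) {
        scores.push(score);
      }
    }
  }
  return scores;
}

const sample: Array<Settled<JudgeResult>> = [
  { status: 'fulfilled', value: { score: 0.5 } },
  { status: 'fulfilled', value: { score: NaN } },
  { status: 'fulfilled', value: { score: Infinity } },
  { status: 'fulfilled', value: { score: null } },
  { status: 'rejected', reason: new Error('judge failed') },
  { status: 'fulfilled', value: { score: 1 } },
];

const finiteScores = collectFiniteScores(sample);
```

With the guard in place only `0.5` and `1` survive; without it, a single `NaN` would make any arithmetic mean `NaN` as well.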
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: f6bdda89-339a-4bc3-92cd-680ab7d8bc02
⛔ Files ignored due to path filters (1)
- yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (23)
- .github/CODEOWNERS
- package.json
- tsconfig.base.json
- x-pack/platform/packages/shared/kbn-evals-extensions/.gitignore
- x-pack/platform/packages/shared/kbn-evals-extensions/README.md
- x-pack/platform/packages/shared/kbn-evals-extensions/__tests__/package.test.ts
- x-pack/platform/packages/shared/kbn-evals-extensions/index.ts
- x-pack/platform/packages/shared/kbn-evals-extensions/jest.config.js
- x-pack/platform/packages/shared/kbn-evals-extensions/kibana.jsonc
- x-pack/platform/packages/shared/kbn-evals-extensions/moon.yml
- x-pack/platform/packages/shared/kbn-evals-extensions/package.json
- x-pack/platform/packages/shared/kbn-evals-extensions/src/index.ts
- x-pack/platform/packages/shared/kbn-evals-extensions/src/types/index.ts
- x-pack/platform/packages/shared/kbn-evals-extensions/src/utils/index.ts
- x-pack/platform/packages/shared/kbn-evals-extensions/tsconfig.json
- x-pack/platform/packages/shared/kbn-evals/README.md
- x-pack/platform/packages/shared/kbn-evals/index.ts
- x-pack/platform/packages/shared/kbn-evals/src/evaluators/conversation_coherence/index.ts
- x-pack/platform/packages/shared/kbn-evals/src/evaluators/multi_judge/index.ts
- x-pack/platform/packages/shared/kbn-evals/src/evaluators/security/index.ts
- x-pack/platform/packages/shared/kbn-evals/src/evaluators/similarity/index.ts
- x-pack/platform/packages/shared/kbn-evals/src/evaluators/trajectory/index.ts
- x-pack/platform/packages/shared/kbn-evals/src/types.ts
- Use `export type *` for type-only re-exports (consistent-type-exports)
- Remove redundant scripts/dependencies from package.json to fix jest CI reporter expecting --config arg
|
/ci |
…y test .resolves.not.toThrow() expects a promise but received a function. Replaced with a direct dynamic import assertion.
|
/ci |
spong left a comment
Overall structure and foundation LGTM! 👍
Pushed a couple small fixes from initial review, but this is good to merge as-is if you'd like! 😀
Glad to have an extensions area to expand evals like this, good stuff @patrykkopycinski! 🎉
💚 Build Succeeded
Metrics [docs]
Public APIs missing comments
Unknown metric groups
API count
…#258775)

## Summary
Creates the foundation package `@kbn/evals-extensions` for advanced evaluation capabilities. This package will house features ported from cursor-plugin-evals and serve as the home for Phases 3-5 of the evals roadmap.

## Architecture
**One-way dependency:**
- ✅ kbn-evals-extensions depends on kbn-evals
- ❌ kbn-evals has NO dependency on kbn-evals-extensions

Evaluation suites opt-in by importing from extensions directly.

## What's Included
✅ Package structure and build configuration
✅ Comprehensive documentation
✅ 5 passing unit tests
✅ CODEOWNERS entry
✅ No functional changes

## Validation
✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
✅ No circular dependencies

## Roadmap
This enables PRs #2-10 for cost tracking, dataset management, safety evaluators, UI components, DX enhancements, analytics, A/B testing, human-in-the-loop, and IDE integration.

## Related
- Part of elastic#257821
- Enables elastic#257823, elastic#257824, elastic#257825, elastic#257826
- Addresses elastic#255820

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Garrett Spong <garrett.spong@elastic.co>
Summary
Creates the foundation package `@kbn/evals-extensions` for advanced evaluation capabilities. This package will house features ported from cursor-plugin-evals and serve as the home for Phases 3-5 of the evals roadmap.
Architecture
One-way dependency:
- ✅ kbn-evals-extensions depends on kbn-evals
- ❌ kbn-evals has NO dependency on kbn-evals-extensions
Evaluation suites opt-in by importing from extensions directly.
What's Included
✅ Package structure and build configuration
✅ Comprehensive documentation
✅ 5 passing unit tests
✅ CODEOWNERS entry
✅ No functional changes
Validation
✅ Bootstrap, type check, tests, eslint, check_changes.ts all passed
✅ No circular dependencies
Roadmap
This enables PRs #2-10 for cost tracking, dataset management, safety evaluators, UI components, DX enhancements, analytics, A/B testing, human-in-the-loop, and IDE integration.
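Related
- Part of elastic#257821
- Enables elastic#257823, elastic#257824, elastic#257825, elastic#257826
- Addresses elastic#255820
Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>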