[SecuritySolution] [Dashboard Migrations] Add security automatic migrations evaluation suite#261568
  /** Panel-level ground truth */
  panels: ExpectedPanel[];
  /** Category for conditional evaluator logic */
  category: 'standard' | 'complex' | 'edge_case';
Could you please elaborate, how does this help?
The category field on DashboardExpected drives conditional evaluator logic. For example, edge_case dashboards might use relaxed scoring thresholds or skip certain evaluators (like index pattern matching). This avoids hardcoding per-dashboard exceptions.
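A minimal sketch of what such category-driven logic could look like. This is an illustration only, not the suite's code: the policy table, the thresholds, and the helper names `shouldRunEvaluator` and `isPassing` are assumptions.

```typescript
type DashboardCategory = 'standard' | 'complex' | 'edge_case';

interface EvaluatorPolicy {
  /** Minimum score for an example to count as passing. */
  passThreshold: number;
  /** Names of evaluators to skip for this category. */
  skippedEvaluators: readonly string[];
}

// Hypothetical policy table: thresholds and evaluator names are illustrative.
const POLICY_BY_CATEGORY: Record<DashboardCategory, EvaluatorPolicy> = {
  standard: { passThreshold: 0.8, skippedEvaluators: [] },
  complex: { passThreshold: 0.7, skippedEvaluators: [] },
  edge_case: { passThreshold: 0.5, skippedEvaluators: ['index_pattern_validity'] },
};

// Decide per category whether an evaluator should run at all.
function shouldRunEvaluator(category: DashboardCategory, evaluatorName: string): boolean {
  return !POLICY_BY_CATEGORY[category].skippedEvaluators.includes(evaluatorName);
}

// Apply the category's (possibly relaxed) threshold to a score.
function isPassing(category: DashboardCategory, score: number): boolean {
  return score >= POLICY_BY_CATEGORY[category].passThreshold;
}
```

Keeping the exceptions in one table keyed by category means a new edge-case dashboard only needs the right `category` value, not a new special case in an evaluator.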
}

export interface DashboardMetadata {
  category: 'standard' | 'complex' | 'edge_case';
Is there a difference between DashboardExpected['category'] and DashboardMetadata['category']?
No, they share the same type ('standard' | 'complex' | 'edge_case'). The duplication is intentional as they serve different roles, but I'm happy to DRY it up by having DashboardMetadata reference DashboardExpected['category'] if preferred.
It is okay. I just wanted to know the purpose of keeping them separate.
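For reference, the DRY variant mentioned in this thread could be sketched with a TypeScript indexed access type. The interfaces below are trimmed to the `category` field only; everything else about their shape is omitted, not assumed.

```typescript
interface DashboardExpected {
  // Other fields (for example `panels`) are omitted in this sketch.
  category: 'standard' | 'complex' | 'edge_case';
}

interface DashboardMetadata {
  // Indexed access type: follows DashboardExpected['category'] automatically.
  category: DashboardExpected['category'];
}

// Values flow between the two shapes without casts because the types are identical.
const metadata: DashboardMetadata = { category: 'edge_case' };
const expected: DashboardExpected = { category: metadata.category };
```

With this form, adding a fourth category to `DashboardExpected` updates `DashboardMetadata` at the same time.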
import type { MigrationResult } from '../migration_client';
import { extractEsqlQueries } from '../helpers';

export const createEsqlSyntaxValidityEvaluator = (): Evaluator<
This mostly looks like an ES|QL query completeness check rather than a syntax check.
It might be worth splitting them into two.
I will rename this to better reflect its role as a "completeness" check.
Full ES|QL syntax validation (via endpoint or parser) is planned as a follow-up. Would you prefer I split the logic now or track it in an issue?
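To make the distinction concrete, a completeness check in this sense might look roughly like the sketch below. It only verifies that a query is non-empty, starts with a FROM clause, and has no leftover placeholders; it does not parse ES|QL. The placeholder syntax (`[indexPattern]`, `[macro:...]`, `[lookup:...]`) and the name `checkEsqlCompleteness` are assumptions for illustration.

```typescript
interface CompletenessResult {
  complete: boolean;
  issues: string[];
}

// Assumption: unresolved placeholders look like `[indexPattern]`, `[macro:name]` or `[lookup:name]`.
const PLACEHOLDER_PATTERN = /\[(?:indexPattern|macro:[^\]]+|lookup:[^\]]+)\]/;

// Completeness only: this is not a syntax check and accepts many invalid queries.
function checkEsqlCompleteness(query: string): CompletenessResult {
  const issues: string[] = [];
  const trimmed = query.trim();

  if (trimmed.length === 0) {
    issues.push('query is empty');
  } else if (!/^FROM\s+\S/i.test(trimmed)) {
    // Simplification: other ES|QL source commands are not considered here.
    issues.push('query does not start with a FROM clause');
  }

  if (PLACEHOLDER_PATTERN.test(trimmed)) {
    issues.push('query contains an unresolved placeholder');
  }

  return { complete: issues.length === 0, issues };
}
```

A real syntax evaluator would sit next to this one and hand the query to a parser or a validation endpoint, as proposed for the follow-up.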
  [key: string]: unknown;
}

export class DashboardMigrationClient {
I would prefer this to be a graph instance, since we will need to assert the graph state as well, especially in the case of tool calls.
Agreed. Running the graph directly allows access to intermediate states (inline_query, nl_query) needed for deep evaluation. I'll create a follow-up issue to expose the graph invocation endpoint.
logeekal
left a comment
Okay, so overall the PR looks good. Thank you @enriquesanchez-elastic. Apart from my minor comments, I would like to highlight one important thing that needs to be changed.
- We need to run the graph directly instead of the whole migration.
- I think the easiest way is to create an endpoint to run the Migrations graph, which can take the inputs below. Basically, it will simply call this ( )
  - graph name
  - graph input
  - invocation config
This will impact how we do some evaluations where we need to access the internal state of the graph.
Let me know what you think.
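As a starting point for discussion, the request body for such an endpoint could be sketched as follows. No such endpoint exists in this PR; the type `RunGraphRequest`, its field names, and the validator `parseRunGraphRequest` are hypothetical and only mirror the three inputs listed above.

```typescript
// Hypothetical request shape mirroring the three proposed inputs.
interface RunGraphRequest {
  graphName: string;
  graphInput: Record<string, unknown>;
  invocationConfig: Record<string, unknown>;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Validates an untrusted body and fills in an empty invocation config when none is given.
function parseRunGraphRequest(body: unknown): RunGraphRequest {
  if (!isPlainObject(body)) {
    throw new Error('request body must be an object');
  }
  const { graphName, graphInput, invocationConfig } = body;
  if (typeof graphName !== 'string' || graphName.length === 0) {
    throw new Error('graphName must be a non-empty string');
  }
  if (!isPlainObject(graphInput)) {
    throw new Error('graphInput must be an object');
  }
  let config: Record<string, unknown> = {};
  if (invocationConfig !== undefined) {
    if (!isPlainObject(invocationConfig)) {
      throw new Error('invocationConfig must be an object when provided');
    }
    config = invocationConfig;
  }
  return { graphName, graphInput, invocationConfig: config };
}
```

An evaluator could then inspect the graph's returned state (for example the intermediate `inline_query` and `nl_query` values mentioned in the thread above) instead of only the final migration output.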
e2c4eee to 0dd61b9
logeekal
left a comment
Thanks @enriquesanchez-elastic for starting this up.
💛 Build succeeded, but was flaky
This commit introduces the `@kbn/evals-suite-security-automatic-migrations` package, which includes a new evaluation suite for the Splunk-to-Kibana dashboard migration AI pipeline. The suite features various evaluators to assess the migration quality, including checks for lookup joins, ES|QL syntax validity, and translation fidelity.
Key changes:
- Added new package with necessary configuration files.
- Implemented evaluators and dataset handling for dashboard migration.
- Created test specifications to validate the migration process.
This enhancement aims to improve the accuracy and reliability of dashboard migrations from Splunk to Kibana.
Defines RuleExample, RuleInput, RuleExpected, and RuleMetadata types that model the dataset shape for Splunk SPL and QRadar rule migration evaluations, following the existing dashboards dataset pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements the HTTP client for the SIEM rules migration API, following the same patterns as DashboardMigrationClient: create migration, upload rule and resources, start migration, poll until complete (max 30 min), fetch translated result, and always cleanup in a finally block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
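The create, start, poll, fetch, and always-cleanup flow described in this commit can be sketched as below. This is not the client from the PR: the `MigrationApi` interface, its method names, and the default polling interval are assumptions, and the upload step is left out for brevity. Only the 30-minute ceiling and the cleanup in a `finally` block come from the commit message.

```typescript
type MigrationStatus = 'running' | 'finished' | 'failed';

// Hypothetical transport-agnostic API; the real client talks HTTP.
interface MigrationApi {
  create(): Promise<string>;
  start(id: string): Promise<void>;
  getStatus(id: string): Promise<MigrationStatus>;
  getResult(id: string): Promise<unknown>;
  remove(id: string): Promise<void>;
}

interface PollOptions {
  timeoutMs?: number;
  intervalMs?: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runMigration(api: MigrationApi, options: PollOptions = {}): Promise<unknown> {
  const timeoutMs = options.timeoutMs ?? 30 * 60 * 1000; // max 30 min, per the commit message
  const intervalMs = options.intervalMs ?? 5000; // assumed polling interval
  const id = await api.create();
  try {
    await api.start(id);
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      const status = await api.getStatus(id);
      if (status === 'finished') return await api.getResult(id);
      if (status === 'failed') throw new Error(`Migration ${id} failed`);
      await sleep(intervalMs);
    }
    throw new Error(`Migration ${id} timed out after ${timeoutMs} ms`);
  } finally {
    // Runs on success, failure, and timeout, so no migration is left behind.
    await api.remove(id);
  }
}
```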
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements three code-based evaluators for rule migration quality:
- esql_validity: checks FROM clause and placeholder resolution
- lookup_join_preservation: verifies LOOKUP JOIN presence matches expectations
- unsupported_pattern_detection: validates untranslatable rules are not hallucinated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements 4 CODE-kind evaluators: custom query accuracy (Levenshtein similarity), integration match, prebuilt rule match, and translation result for rule migration evaluation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
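For readers unfamiliar with the metric, a Levenshtein-based similarity score can be computed as in the sketch below. The commit only says "Levenshtein similarity"; normalizing by the longer string's length is an assumption here, and the function names are illustrative.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshteinDistance(a: string, b: string): number {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitutionCost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + substitutionCost // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Assumed normalization: 1 means identical, 0 means nothing in common.
function levenshteinSimilarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshteinDistance(a, b) / longest;
}
```

A score like this is sensitive to whitespace and casing, so an evaluator would probably normalize both queries before comparing them.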
Wires all rule evaluators into a factory function that runs shared evaluators for both Splunk and QRadar, plus QRadar-only evaluators, tracking per-dataset success/failure stats. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extends the base evaluate fixture with rule migration client, rule dataset evaluator, rule display options, display groups, and rule skip summary reporting alongside the existing dashboard ones. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates the splunk rules dataset (3 placeholder examples covering simple, lookup-based, and unsupported patterns) and the corresponding splunk_rule_migration.spec.ts evaluation spec that exercises evaluateRuleDataset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds placeholder QRadar rules dataset (simple event rule, reference set rule, unsupported sequence rule) and corresponding evaluation spec following the same structure as the Splunk SPL dataset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ormatting
- Add dataset re-exports (splunkRules, qradarRules) to datasets/rules/index.ts
- Fix @typescript-eslint/no-shadow lint error in helpers.ts (rename shadowed _ params)
- Apply eslint --fix formatting to evaluate.ts, evaluate_dataset.ts, migration_client.ts, and evaluators
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d evaluators
- Add empty-array guard before accessing rules[0] in migration client
- Move TranslationResult evaluator from QRadar-only to shared (applies to both vendors)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pecs
Add { tag: tags.stateful.classic } to both rule migration specs to match
the dashboard spec pattern. Also add empty-dataset guards and progress
logging consistent with the dashboard spec.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename and refactor rule dataset summary functions for clarity and consistency.
- Update dashboard metadata to enable markdown panels.
- Adjust evaluation logic to handle new dataset structure.
… add queries to metadata The evaluator checks for unresolved placeholders, not actual syntax parsing. Rename to reflect its true purpose and include generated ES|QL queries in the evaluator result metadata for debugging visibility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Changed CODEOWNERS entry for `kbn-evals-suite-security-automatic-migrations` to assign ownership to `@elastic/security-threat-hunting`.
- Refactored `index_pattern_validity.ts` to improve handling of actual index patterns, allowing for multiple index patterns per panel title.
- Adjusted regex in `helpers.ts` for better query matching.
- Modified the regex in `helpers.ts` to allow for optional backticks around index patterns in the FROM clause of queries, enhancing the accuracy of index pattern extraction from panels.
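The idea of tolerating optional backticks in the FROM clause can be illustrated with the sketch below. The pattern is hypothetical and is not the regex from `helpers.ts`; `extractIndexPatterns` is an illustrative name.

```typescript
// Hypothetical pattern: a comma-separated list of index patterns after FROM,
// each optionally wrapped in backticks.
const FROM_CLAUSE = /\bFROM\s+(`?[^\s|`,]+`?(?:\s*,\s*`?[^\s|`,]+`?)*)/i;

function extractIndexPatterns(query: string): string[] {
  const match = FROM_CLAUSE.exec(query);
  if (!match) return [];
  return match[1]
    .split(',')
    .map((pattern) => pattern.trim().replace(/^`|`$/g, '')) // strip surrounding backticks
    .filter((pattern) => pattern.length > 0);
}
```

With this sketch, `` FROM `logs-*` `` and `FROM logs-*` both yield the same pattern, so the comparison against expected index patterns does not depend on quoting.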
…andalone lookup splHasLookups returned false for SPL queries containing both a standalone `lookup` and an `inputlookup`/`outputlookup` (e.g. `"lookup users | inputlookup extra.csv"`). The global exclusion regex short-circuited the first branch, and the fallback only matched piped lookups. Now iterates matches of `(?<![a-zA-Z])lookup\s+\w+` and filters out `input`/`output`-prefixed ones. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
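A rough sketch of the per-match approach described in this commit, using the quoted pattern. It is not the code from the PR: the name `hasStandaloneLookup` and the exact prefix filter are assumptions.

```typescript
// Pattern quoted in the commit message, with the global flag so it can be iterated.
const LOOKUP_CANDIDATE = /(?<![a-zA-Z])lookup\s+\w+/g;

function hasStandaloneLookup(spl: string): boolean {
  return Array.from(spl.matchAll(LOOKUP_CANDIDATE)).some((match) => {
    const start = match.index ?? 0;
    const before = spl.slice(Math.max(0, start - 6), start);
    // The lookbehind already rejects `inputlookup` and `outputlookup`;
    // this explicit filter mirrors the commit description and is redundant here.
    return !/(?:input|output)$/.test(before);
  });
}
```

Checking each match on its own avoids the earlier failure mode, where one `inputlookup` anywhere in the query hid a standalone `lookup` elsewhere.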
…gainst missing eai:data The previous guard only checked `result` truthiness. If `result` existed but lacked `eai:data`, `sourceSpl.slice(0, 3000)` threw a TypeError outside the LLM try/catch. Now narrows the guard to the actual field used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
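The narrowed guard can be sketched as follows. The type `SplunkResult` and the helper `getSourceSplExcerpt` are hypothetical; only the `eai:data` field and the `slice(0, 3000)` call come from the commit message.

```typescript
// Hypothetical result shape: `eai:data` may be absent even when the object exists.
interface SplunkResult {
  'eai:data'?: string;
  [key: string]: unknown;
}

// Guard on the field that is actually used, not on the containing object.
function getSourceSplExcerpt(result: SplunkResult | undefined, maxLength = 3000): string | undefined {
  const sourceSpl = result?.['eai:data'];
  if (typeof sourceSpl !== 'string') return undefined;
  return sourceSpl.slice(0, maxLength);
}
```

A caller can then skip the LLM-based evaluation when the helper returns `undefined`, instead of hitting a TypeError outside the try/catch.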
…o threat-hunting Per review feedback, moves the kbn-evals-suite-security-automatic-migrations package owner from security-generative-ai to security-threat-hunting. CODEOWNERS regenerated from kibana.jsonc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9e770e8 to 14ac71b
💛 Build succeeded, but was flaky
SrdjanLL
left a comment
@kbn/evals changes and suite setup LGTM!
Please note that the eval suite won't run in the weekly automated run against the golden cluster, unless it's added here - that's okay for early stage suites so we don't bump the token usage, but if/when you think it's ready, you're welcome to add it.
Summary
This PR introduces the `@kbn/evals-suite-security-automatic-migrations` package, which includes a new evaluation suite for the Splunk-to-Kibana dashboard migration AI pipeline. The suite features various evaluators to assess the migration quality, including checks for lookup joins, ES|QL syntax validity, and translation fidelity.
Key changes:
- Added new package with necessary configuration files.
- Implemented evaluators and dataset handling for dashboard migration.
- Created test specifications to validate the migration process.
This enhancement aims to improve the accuracy and reliability of dashboard migrations from Splunk to Kibana.