Spike: Data Source Catalog for Detection Engineering #260159
Closed
patrykkopycinski wants to merge 31 commits into elastic:main from
Adds the initial package scaffold for @kbn/data-source-catalog, a shared-server package that will provide a unified, queryable catalog of data source metadata for Security Solution AI features, rule creation, and SIEM migration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces core TypeScript interfaces (DataSourceEntry, FieldMetadata, IntegrationMetadata, DataSourceStats, DataSourceSemantic, CatalogQueryParams, CatalogQueryResult) and shared constants (CATALOG_INDEX_NAME, FRESHNESS_THRESHOLDS, DEFAULT_SECURITY_PATTERNS, CATALOG_VERSION) for the kbn-data-source-catalog package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the Elasticsearch index mapping for the .kibana-data-source-catalog system index using MappingTypeMapping from @elastic/elasticsearch. Updates the package entry point to export types, constants, and the mapping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements CatalogClient class wrapping .kibana-data-source-catalog index with ensureIndex, bulkUpsert, and deleteAll operations. Adds 6 unit tests covering all public methods. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d match Implements CatalogQuery class that builds typed Elasticsearch queries from CatalogQueryParams (wildcard name, term type/package, nested field existence, activeOnly, multi_match full-text) and returns CatalogQueryResult. Exports from package index. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
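For reviewers who want a concrete picture of the query builder described in this commit, here is a minimal sketch of how such params could map onto an Elasticsearch bool query. The interface `CatalogQueryParamsSketch`, the function `buildCatalogQuery`, and every document field path (`name`, `type`, `integration.package_name`, `fields.name`, `stats.is_active`, `semantic.summary`) are assumptions made for illustration; the PR's actual `CatalogQueryParams` shape and mapping are not shown on this page.

```typescript
// Hypothetical subset of CatalogQueryParams; the PR's actual interface is not shown.
interface CatalogQueryParamsSketch {
  name?: string; // may contain * wildcards
  type?: string;
  packageName?: string;
  hasField?: string;
  activeOnly?: boolean;
  text?: string;
}

type QueryClause = Record<string, unknown>;

// All document field paths used below are assumed for illustration only.
function buildCatalogQuery(params: CatalogQueryParamsSketch): QueryClause {
  const filter: QueryClause[] = [];
  const must: QueryClause[] = [];

  // Exact and wildcard constraints go into the non-scoring filter context.
  if (params.name) filter.push({ wildcard: { name: { value: params.name } } });
  if (params.type) filter.push({ term: { type: params.type } });
  if (params.packageName) {
    filter.push({ term: { 'integration.package_name': params.packageName } });
  }
  if (params.hasField) {
    // Field existence is checked through a nested query on the fields array.
    filter.push({
      nested: { path: 'fields', query: { term: { 'fields.name': params.hasField } } },
    });
  }
  if (params.activeOnly) filter.push({ term: { 'stats.is_active': true } });

  // Full-text input is the only scoring clause.
  if (params.text) {
    must.push({ multi_match: { query: params.text, fields: ['name', 'semantic.summary'] } });
  }

  if (filter.length === 0 && must.length === 0) return { match_all: {} };
  return { bool: { filter, must } };
}
```

Keeping the structured constraints in `filter` means they do not affect relevance scoring and can be cached by Elasticsearch, while the free-text clause alone determines ranking.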
Implements discoverIndexMetadata function that resolves indices/data streams/aliases via ES resolveIndex, flattens nested mapping properties into FieldMetadata[], computes ECS field coverage via ecsFieldMap, limits stored fields to DEFAULT_FIELD_LIMIT (500) prioritizing ECS fields, and matches index templates by pattern to attach _meta. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
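The flattening and ECS-first limiting steps in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the trimmed `FieldMetadata` shape, the `MappingProperty` type, and the helper names `flattenMappingProperties` and `limitFields` are hypothetical, and the ECS lookup is reduced to a plain list of field names rather than `ecsFieldMap`.

```typescript
// Hypothetical, simplified shapes; the real FieldMetadata may carry more attributes.
interface FieldMetadata {
  name: string;
  type: string;
}

interface MappingProperty {
  type?: string;
  properties?: Record<string, MappingProperty>;
}

// Walk the mapping tree depth-first, emitting one dotted-path entry per leaf field.
function flattenMappingProperties(
  properties: Record<string, MappingProperty>,
  prefix: string = ''
): FieldMetadata[] {
  const fields: FieldMetadata[] = [];
  for (const key of Object.keys(properties)) {
    const prop = properties[key];
    const path = prefix ? `${prefix}.${key}` : key;
    if (prop.properties) {
      // Object field: recurse into its children.
      fields.push(...flattenMappingProperties(prop.properties, path));
    } else if (prop.type) {
      fields.push({ name: path, type: prop.type });
    }
  }
  return fields;
}

// Keep at most `limit` fields, placing fields listed in `ecsFieldNames` first.
function limitFields(
  fields: FieldMetadata[],
  ecsFieldNames: string[],
  limit: number
): FieldMetadata[] {
  const ecs = fields.filter((f) => ecsFieldNames.indexOf(f.name) !== -1);
  const custom = fields.filter((f) => ecsFieldNames.indexOf(f.name) === -1);
  return ecs.concat(custom).slice(0, limit);
}
```

With a limit of 500 (the `DEFAULT_FIELD_LIMIT` named in the commit), custom fields are only dropped once all ECS fields have been kept.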
…adata) Implements fetchIntegrationMetadata function that maps installed Fleet package data streams to IntegrationMetadata. Uses a minimal PackageClientLike interface to avoid a hard dependency on the Fleet plugin. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds refreshCatalog function that orchestrates index discovery, Fleet integration metadata merging, and bulk persistence via CatalogClient. Exports refreshCatalog and PackageClientLike from the package public API. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and Fleet hooks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ess, size) Implements fetchIndexStats which combines indices.stats (doc count + store size) and msearch max-@timestamp aggregations to compute is_active and freshness_category (live/recent/stale/empty) per index. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
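As a rough illustration of the freshness classification this commit describes, the sketch below derives a category from a doc count and the latest `@timestamp`. The threshold values (1 hour for live, 24 hours for recent) and the rule that `is_active` means live or recent are assumptions; the PR's `FRESHNESS_THRESHOLDS` values are not shown on this page, and the function names are hypothetical.

```typescript
type FreshnessCategory = 'live' | 'recent' | 'stale' | 'empty';

// Assumed thresholds for illustration; the PR's FRESHNESS_THRESHOLDS may differ.
const ASSUMED_FRESHNESS_THRESHOLDS_MS = {
  live: 60 * 60 * 1000, // 1 hour
  recent: 24 * 60 * 60 * 1000, // 24 hours
};

// Classify an index by the age of its newest document.
function classifyFreshness(
  docCount: number,
  latestTimestampMs: number | null,
  nowMs: number
): FreshnessCategory {
  // No documents, or no @timestamp aggregation result, counts as empty.
  if (docCount === 0 || latestTimestampMs === null) return 'empty';
  const ageMs = nowMs - latestTimestampMs;
  if (ageMs <= ASSUMED_FRESHNESS_THRESHOLDS_MS.live) return 'live';
  if (ageMs <= ASSUMED_FRESHNESS_THRESHOLDS_MS.recent) return 'recent';
  return 'stale';
}

// Assumption: an index is "active" when it is live or recent.
function isActiveCategory(category: FreshnessCategory): boolean {
  return category === 'live' || category === 'recent';
}
```

Passing `nowMs` in explicitly keeps the classification pure, which makes it straightforward to unit test without mocking the clock.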
Add `includeStats` param to `RefreshCatalogParams`; when true, calls `fetchIndexStats` after index/integration discovery and attaches stats to entries before persisting via `bulkUpsert`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… interval) Registers a recurring task that refreshes the data source catalog with stats every 6 hours, and passes includeStats: true on both initial and on-demand refreshes so doc counts and freshness metadata are always populated. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a pure helper that converts DataSourceEntry[] into a human-readable markdown block suitable for injection into LLM prompts, surfacing integration metadata, doc count / freshness stats, and top ECS fields for each data source. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Discovery prompt Thread an optional dataSourceContext string through the Attack Discovery graph so the LLM can receive context about available security data sources in the environment. When empty (default), behavior is identical to current. This prepares the plumbing for injecting catalog summaries from SecurityCatalogService. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pper, use operations)
…pe ({items} not array)
…ummaries and MITRE mapping
Adds the semantic enrichment pipeline that runs at catalog refresh time:
- Extends ES index mapping with dense_vector (384-dim, cosine) for future embedding support
- Implements heuristic summary generator that produces summaries, topic tags, and MITRE ATT&CK technique IDs from field names and integration metadata (no LLM required)
- Wires summary generation as Step 5 in refreshCatalog, running for every entry
- Exports generateHeuristicSummary from the package public API
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gration Supplements the ELSER-based integration retriever with data source catalog metadata (field counts, freshness, ECS coverage) to enrich integration matching during SIEM rule migration. The catalog enrichment is appended to the LLM context alongside the ELSER results, with graceful degradation if the catalog is unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ook and badge component Add useCatalogDataSources hook that fetches catalog entries via the data plugin search service, and CatalogDataSourceBadge component that renders integration name and freshness metadata alongside data source names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ainst catalog Add validateRequiredFields function that checks whether a rule's required_fields exist in catalog entries for the target index patterns. Returns advisory results without blocking rule activation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
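A minimal sketch of the advisory check described in this commit follows. It assumes the catalog entries matching the rule's index pattern have already been fetched, and it treats a field as found when any matching entry contains it. The types `CatalogEntrySketch` and `RequiredFieldsAdvisory` and the function name are hypothetical; the PR's actual signature and result shape are not shown here.

```typescript
// Hypothetical, trimmed-down catalog entry; only what this check needs.
interface CatalogEntrySketch {
  name: string;
  fields: Array<{ name: string }>;
}

// Advisory result: callers may warn on missingFields but should not block activation.
interface RequiredFieldsAdvisory {
  missingFields: string[];
  checkedSources: string[];
}

function validateRequiredFieldsSketch(
  requiredFields: string[],
  matchingEntries: CatalogEntrySketch[]
): RequiredFieldsAdvisory {
  // Build one lookup of every field name seen across the matching entries.
  const known: Record<string, boolean> = {};
  for (const entry of matchingEntries) {
    for (const field of entry.fields) {
      known[field.name] = true;
    }
  }
  return {
    // Assumption: a field counts as present if any matching entry has it.
    missingFields: requiredFields.filter((f) => known[f] !== true),
    checkedSources: matchingEntries.map((e) => e.name),
  };
}
```

Because the lookup is built once from a single set of entries, the cost does not grow with one catalog query per required field.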
…semantic search Adds optional `embedding` parameter to `CatalogQueryParams` and implements hybrid search (kNN + keyword filters) when an embedding vector is provided, using the `semantic.embedding` dense_vector field (384 dims, cosine). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
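One way the hybrid search body could be assembled is sketched below, using the top-level `knn` option of the Elasticsearch search API with the keyword filters applied as a pre-filter. The field name `semantic.embedding` and the 384-dimension size come from the commit message; the function name, the default `k`, and the `num_candidates` heuristic are assumptions.

```typescript
// Dimension count taken from the commit message (384 dims, cosine).
const EMBEDDING_DIMS = 384;

// Build a search body that runs kNN over the embedding field, restricted by keyword filters.
function buildHybridKnnSearch(
  embedding: number[],
  filters: Array<Record<string, unknown>>,
  k: number = 10
) {
  if (embedding.length !== EMBEDDING_DIMS) {
    throw new Error(`expected ${EMBEDDING_DIMS} dimensions, got ${embedding.length}`);
  }
  return {
    knn: {
      field: 'semantic.embedding',
      query_vector: embedding,
      k,
      // Assumed heuristic: oversample candidates per shard to improve recall.
      num_candidates: Math.max(k * 10, 100),
      // Filters are applied during the vector search, so k results still match them.
      filter: { bool: { filter: filters } },
    },
  };
}
```

Validating the vector length up front gives a clearer error than the one Elasticsearch would return for a dimension mismatch.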
… UI data source picker Add CatalogSuggestions component that displays available data sources from the catalog below the DataViewSelectorField combo box, showing freshness badges and integration metadata for non-empty sources. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nent Client-side component that validates required fields against the data source catalog and displays warnings when fields are not found in catalog entries matching the selected index patterns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntegrations from catalog in AI Rule Creation The DISCOVER_DATA_SOURCES node now extracts ECS fields and integration metadata from catalog entries, storing them as suggestedRequiredFields and suggestedRelatedIntegrations on the state. catalogDataSources is also persisted for downstream nodes. Integrations are deduplicated by package name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
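The deduplication by package name mentioned in this commit could look roughly like the sketch below. The entry and integration shapes, the `package_name` property, and the function name are assumptions for illustration; entries without integration metadata are skipped by a plain guard, so no non-null assertion is needed.

```typescript
// Hypothetical shapes; the PR's IntegrationMetadata and DataSourceEntry may differ.
interface IntegrationSketch {
  package_name: string;
  version?: string;
}

interface EntryWithIntegration {
  name: string;
  integration?: IntegrationSketch;
}

// Collect integrations from catalog entries, keeping the first occurrence per package.
function collectRelatedIntegrations(entries: EntryWithIntegration[]): IntegrationSketch[] {
  const seen: Record<string, boolean> = {};
  const result: IntegrationSketch[] = [];
  for (const entry of entries) {
    const integration = entry.integration;
    // Skip entries with no integration metadata or an already-seen package.
    if (!integration || seen[integration.package_name] === true) continue;
    seen[integration.package_name] = true;
    result.push(integration);
  }
  return result;
}
```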
…tterns configurable SecurityCatalogService.start() now accepts a refreshInterval param (e.g. '1h', '6h') that overrides the DEFAULT_REFRESH_INTERVAL when scheduling the background task. configPatterns was already supported; this completes the constructor-param-based config override approach for spike-level configurability without a full Kibana config schema. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…talog # Conflicts: # x-pack/solutions/security/plugins/security_solution/tsconfig.json
… schema entry Replace inline formatting logic in DataSourceCatalogTool with a call to the existing formatCatalogContextForPrompt helper. Add DataSourceCatalogTool to the toolsInvoked type and schema in event_based_telemetry.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ondition, N+1 queries, non-null assertions)
- Extract globToRegex() helper that escapes all regex metacharacters before converting * wildcards, fixing potential ReDoS in catalog_refresh and index_metadata_provider
- Wrap indices.create in try/catch to silently absorb resource_already_exists_exception, eliminating the TOCTOU race condition in ensureIndex
- Remove conflicting number_of_replicas: 0 setting now that auto_expand_replicas: '0-1' fully controls replica count
- Refactor validateRequiredFields to issue a single batch catalog query instead of N per-field queries; change indexPatterns: string[] to indexPattern: string to make the single-pattern contract explicit
- Replace the filter+map chain that relied on non-null assertions for entry.integration with a flatMap guard in the discover_data_sources node
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
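For context on the glob-handling fix, a helper of this kind could be written as in the sketch below: escape every regex metacharacter except `*`, then expand `*` into `.*` and anchor the pattern. This is an illustrative version, not the PR's `globToRegex()` implementation, which is not shown on this page.

```typescript
// Convert an index glob such as "logs-*" into an anchored RegExp.
function globToRegex(glob: string): RegExp {
  // Escape all regex metacharacters except "*", so "." or "+" in a pattern match literally.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  // Only now turn each "*" wildcard into ".*".
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`);
}
```

Escaping before expanding the wildcard matters for patterns like `.alerts-security.*`, where an unescaped `.` would match any character and user-supplied metacharacters could otherwise change the meaning of the expression.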
Summary
Spike/PoC for a unified Data Source Catalog (`@kbn/data-source-catalog`) that provides queryable metadata about available Elasticsearch indices, data streams, and Fleet integrations. This directly addresses how the Automatic Index Summarization (SML) concept could be applied to Detection Engineering in Elastic Security.
Problem
Security Solution has 15+ independent mechanisms for discovering index metadata. Each AI feature (AI Assistant, Attack Discovery, AI Rule Creation, SIEM Migration) builds its own ad-hoc view of available data sources using different ES APIs. This creates:
- `logs-endpoint.events.process-default` are opaque to LLMs
- `required_fields` and `related_integrations` on rules are informational-only
Solution
A shared Kibana package that maintains a unified, queryable catalog of data source metadata in a `.kibana-data-source-catalog` Elasticsearch index, refreshed on Fleet integration events and periodically via TaskManager. Three tiers of metadata, all built in this spike.
What this spike builds
Tier 1: `@kbn/data-source-catalog` package (static metadata)
- `DataSourceEntry`, `FieldMetadata`, `IntegrationMetadata` with strict ES mapping
- `.kibana-data-source-catalog` index (race-condition safe via `resource_already_exists_exception` handling)
- `resolveIndex` + `getMapping` + `getIndexTemplate`
Tier 2: Stats and consumer integrations
- `indices.stats` + `msearch`
- `security:data-source-catalog:refresh-stats`: configurable periodic refresh (default 6h)
- `DataSourceCatalogTool`: LangChain tool for querying catalog from conversations (with telemetry)
- `dataSourceContext` parameter threaded through graph → injected into LLM prompt
- `DISCOVER_DATA_SOURCES` graph node: queries catalog, pre-populates `suggestedRequiredFields` + `suggestedRelatedIntegrations`
Tier 3: Semantic layer + MITRE mapping
- `dense_vector` (384 dims, cosine), `CatalogQuery` supports kNN hybrid search
Consumer integrations
- `DataSourceCatalogTool` with shared formatting and telemetry schema
- `DISCOVER_DATA_SOURCES` node with `suggestedRequiredFields` + `suggestedRelatedIntegrations`
- `CatalogIntegrationEnricher` supplements ELSER matches with catalog metadata (freshness, fields, topics)
- `useCatalogDataSources` hook + `CatalogSuggestions` panel + `CatalogDataSourceBadge` (freshness badges)
- `validateRequiredFields` server-side validation + `RequiredFieldsCatalogWarnings` UI component
E2E verification results
Tested on a live Kibana + Elasticsearch instance:
- `.kibana-data-source-catalog` index created on startup
- `logs-*` sources
- `.alerts-security.*`, `.siem-signals-*`
Code review findings addressed
- `globToRegex` utility with proper metacharacter escaping
- `ensureIndex` race condition (multi-node TOCTOU) → `resource_already_exists_exception`
- `number_of_replicas`, kept `auto_expand_replicas`
- `DataSourceCatalogTool` to `event_based_telemetry.ts`
- `entry.integration!` → `.flatMap` narrowing pattern
- `formatCatalogContextForPrompt`
- `indexPatterns[0]` silent truncation → `indexPattern: string`
Files changed (86 files, +4,937 lines)
- New package: `x-pack/platform/packages/shared/kbn-data-source-catalog/` (24 files)
- Security Solution consumer: `server/lib/data_source_catalog/` (10 files)
- Attack Discovery integration: `elastic_assistant/server/lib/attack_discovery/` (7 files)
- AI Rule Creation integration: `ai_rule_creation/agent/` (5 files)
- SIEM Migration integration: `siem_migrations/rules/task/` (8 files)
- Rule Creation UI: `public/detection_engine/rule_creation_ui/` (5 files)
- Plugin wiring: `server/plugin.ts` + telemetry
Test coverage
- `@kbn/data-source-catalog` package
Related
Test plan
🤖 Generated with Claude Code
Production-Readiness Checklist — Agent Skills Ecosystem
Generated against [Epic] Creation of the Agent Skills Ecosystem for Elastic Security.
Narrative role: Shared context layer for Detection Engineering (and by extension every other skill that needs "what security data exists?"). Applies the epic's "Entity Analytics as a horizontal enrichment layer" pattern to data sources.
Must-do before this can ship
- `.kibana-data-source-catalog`) that 15+ features will depend on; review scope is larger than a code review
- `resource_already_exists_exception` handling for catalog-index bootstrap under concurrent Kibana nodes
Follow-ups (post-merge)
- `required_fields` and `related_integrations` on detection rules validatable against the catalog (today informational only)