
Spike: Data Source Catalog for Detection Engineering#260159

Closed
patrykkopycinski wants to merge 31 commits into elastic:main from patrykkopycinski:feat/data-source-catalog

Conversation

@patrykkopycinski (Contributor) commented Mar 29, 2026

Summary

Spike/PoC for a unified Data Source Catalog (@kbn/data-source-catalog) that provides queryable metadata about available Elasticsearch indices, data streams, and Fleet integrations. This directly addresses how the Automatic Index Summarization (SML) concept could be applied to Detection Engineering in Elastic Security.

Problem

Security Solution has 15+ independent mechanisms for discovering index metadata. Each AI feature (AI Assistant, Attack Discovery, AI Rule Creation, SIEM Migration) builds its own ad-hoc view of available data sources using different ES APIs. This creates:

  • No shared understanding — AI features can't answer "what security data exists?" without expensive runtime discovery
  • No semantic context — index names like logs-endpoint.events.process-default are opaque to LLMs
  • No validation — required_fields and related_integrations on rules are informational only
  • Duplicated effort — SIEM Migration, ES|QL tools, and Timeline UI each build their own index discoverer

Solution

A shared Kibana package that maintains a unified, queryable catalog of data source metadata in a .kibana-data-source-catalog Elasticsearch index, refreshed on Fleet integration events and periodically via TaskManager. Three tiers of metadata — all built in this spike.


What this spike builds

Tier 1 — @kbn/data-source-catalog package (static metadata)

Components:
  • Types & Schema: DataSourceEntry, FieldMetadata, IntegrationMetadata with strict ES mapping
  • CatalogClient: CRUD for the .kibana-data-source-catalog index (race-condition safe via resource_already_exists_exception handling)
  • CatalogQuery: typed search — filter by name pattern, integration, field existence, full-text, kNN vector search
  • IndexMetadataProvider: discovers indices via resolveIndex + getMapping + getIndexTemplate
  • IntegrationProvider: enriches entries with Fleet package metadata (name, description, data streams)
  • CatalogRefresh: orchestrates discover → merge integrations → stats → heuristic summaries → persist
  • globToRegex utility: safe glob-to-regex conversion with proper metacharacter escaping
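
The glob-to-regex conversion mentioned above can be sketched as follows. This is an illustrative reconstruction, not the package's actual implementation: the function name matches the PR, but the exact escaping set and anchoring behavior are assumptions.

```typescript
// Hypothetical sketch of the globToRegex utility: escape every regex
// metacharacter first, then translate the glob wildcard `*` into `.*`.
// Escaping before wildcard expansion is what prevents ReDoS from
// attacker-influenced index patterns.
function globToRegex(glob: string): RegExp {
  const escaped = glob
    // Escape all regex metacharacters except `*`, which is handled below.
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    // Translate the glob wildcard into a regex wildcard.
    .replace(/\*/g, '.*');
  // Anchor so `logs-*` matches whole index names only.
  return new RegExp(`^${escaped}$`);
}
```

For example, `globToRegex('logs-*')` matches `logs-endpoint.events.process-default` but not `metrics-system.cpu-default`, and a literal dot in the pattern stays literal.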

Tier 2 — Stats and consumer integrations

Components:
  • IndexStatsProvider: doc counts, store size, freshness (live/recent/stale/empty) via indices.stats + msearch
  • TaskManager refresh: security:data-source-catalog:refresh-stats — configurable periodic refresh (default 6h)
  • AI Assistant tool: DataSourceCatalogTool — LangChain tool for querying the catalog from conversations (with telemetry)
  • Attack Discovery: dataSourceContext parameter threaded through the graph → injected into the LLM prompt
  • AI Rule Creation: DISCOVER_DATA_SOURCES graph node — queries the catalog, pre-populates suggestedRequiredFields + suggestedRelatedIntegrations
  • formatCatalogContextForPrompt: shared formatter for LLM consumption (used by the tool + consumers)
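
A minimal sketch of what a shared prompt formatter like formatCatalogContextForPrompt might look like. The entry shape and output layout here are assumptions for illustration; the real helper lives in the package and may format differently.

```typescript
// Assumed, simplified entry shape for the sketch only.
interface CatalogEntrySketch {
  name: string;
  integration?: { title: string };
  stats?: { docCount: number; freshness: 'live' | 'recent' | 'stale' | 'empty' };
  topFields?: string[];
}

// Turn catalog entries into a compact, human-readable block suitable
// for injection into an LLM prompt (one bullet per data source).
function formatCatalogContextForPrompt(entries: CatalogEntrySketch[]): string {
  const lines = entries.map((e) => {
    const integration = e.integration ? ` (integration: ${e.integration.title})` : '';
    const stats = e.stats ? ` [${e.stats.docCount} docs, ${e.stats.freshness}]` : '';
    const fields = e.topFields?.length ? `\n  fields: ${e.topFields.join(', ')}` : '';
    return `- ${e.name}${integration}${stats}${fields}`;
  });
  return ['Available security data sources:', ...lines].join('\n');
}
```

The point of centralizing this is the "Duplicated formatting logic" finding below: the tool and every consumer render the same entries the same way.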

Tier 3 — Semantic layer + MITRE mapping

Components:
  • Heuristic Summary Provider: generates summaries without an LLM, using field names, integration metadata, and ECS coverage
  • Topic inference: automatically detects process execution, network, file activity, authentication, cloud, etc.
  • MITRE ATT&CK mapping: infers techniques from field names (T1059, T1071, T1078, T1083, T1112, etc.)
  • Dense vector + kNN search: index mapping includes dense_vector (384 dims, cosine); CatalogQuery supports kNN hybrid search
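
The topic and MITRE inference above can be sketched as a keyword-prefix table over field names. The prefix lists and topic-to-technique pairings below are illustrative assumptions; the actual heuristic tables in the package are richer.

```typescript
// Hypothetical rule table: a topic (and optionally a MITRE technique ID)
// fires when any field name starts with one of its prefixes.
const TOPIC_RULES: Array<{ topic: string; technique?: string; prefixes: string[] }> = [
  { topic: 'process execution', technique: 'T1059', prefixes: ['process.'] },
  { topic: 'network', technique: 'T1071', prefixes: ['network.', 'destination.', 'source.'] },
  { topic: 'authentication', technique: 'T1078', prefixes: ['user.', 'auth.'] },
  { topic: 'file activity', technique: 'T1083', prefixes: ['file.'] },
  { topic: 'registry', technique: 'T1112', prefixes: ['registry.'] },
];

// No LLM required: a pure function from field names to topics + techniques.
function inferTopics(fieldNames: string[]): { topics: string[]; techniques: string[] } {
  const topics = new Set<string>();
  const techniques = new Set<string>();
  for (const rule of TOPIC_RULES) {
    if (fieldNames.some((f) => rule.prefixes.some((p) => f.startsWith(p)))) {
      topics.add(rule.topic);
      if (rule.technique) techniques.add(rule.technique);
    }
  }
  return { topics: [...topics], techniques: [...techniques] };
}
```

For instance, an index exposing process.name and file.path would be tagged with process execution and file activity, and techniques T1059 and T1083.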

Consumer integrations

Consumers:
  • AI Assistant: DataSourceCatalogTool with shared formatting and telemetry schema
  • Attack Discovery: environmental data source context injected into the LLM prompt
  • AI Rule Creation: DISCOVER_DATA_SOURCES node with suggestedRequiredFields + suggestedRelatedIntegrations
  • SIEM Migration: CatalogIntegrationEnricher supplements ELSER matches with catalog metadata (freshness, fields, topics)
  • Rule Creation UI: useCatalogDataSources hook + CatalogSuggestions panel + CatalogDataSourceBadge (freshness badges)
  • required_fields: validateRequiredFields server-side validation + RequiredFieldsCatalogWarnings UI component

E2E verification results

Tested on a live Kibana + Elasticsearch instance:

Verifications:
  • .kibana-data-source-catalog index created on startup — ✅ strict mapping with all fields
  • Catalog populated from ES + Fleet — 16 entries in 1.8 seconds
  • Fleet integration matching (Elastic Defend, System) — ✅ 3 Endpoint + 2 System data streams matched
  • Freshness computed (live/stale/empty) — ✅ correct for all entries
  • CatalogQuery: filter by integration — ✅ found 3 Elastic Defend streams
  • CatalogQuery: wildcard name pattern — ✅ found 8 logs-* sources
  • Security aliases discovered — .alerts-security.*, .siem-signals-*

Code review findings addressed

Findings and fixes:
  • ReDoS vulnerability in glob-to-regex → extracted globToRegex utility with proper metacharacter escaping
  • ensureIndex race condition (multi-node TOCTOU) → catch resource_already_exists_exception
  • Conflicting index settings → removed number_of_replicas, kept auto_expand_replicas
  • Missing telemetry schema → added DataSourceCatalogTool to event_based_telemetry.ts
  • Non-null assertions on entry.integration! → replaced with .flatMap narrowing pattern
  • Duplicated formatting logic → tool now calls shared formatCatalogContextForPrompt
  • N+1 sequential queries in validation → batched into single catalog query + Set lookup
  • indexPatterns[0] silent truncation → changed parameter to explicit indexPattern: string
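
The TOCTOU fix for ensureIndex can be sketched like this. In a multi-node Kibana cluster, two nodes can both observe "index missing" and both call create; the loser should swallow only the benign already-exists error. Client and error shapes below are simplified assumptions following the ES JS client's convention of nesting the error type under the response body.

```typescript
// Assumed error-shape helper: the ES client surfaces the server error type
// at err.meta.body.error.type (older shapes used err.body.error.type).
function isAlreadyExistsError(err: any): boolean {
  const type = err?.meta?.body?.error?.type ?? err?.body?.error?.type;
  return type === 'resource_already_exists_exception';
}

// Race-safe bootstrap: create unconditionally, absorb only the
// "someone else won the race" error, rethrow everything else.
async function ensureIndex(
  esClient: { indices: { create: (params: { index: string }) => Promise<unknown> } },
  index: string
): Promise<void> {
  try {
    await esClient.indices.create({ index });
  } catch (err) {
    if (!isAlreadyExistsError(err)) {
      throw err;
    }
  }
}
```

This replaces the check-then-create pattern, which is unsound across nodes because the existence check and the create are not atomic.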

Files changed (86 files, +4,937 lines)

  • New package: x-pack/platform/packages/shared/kbn-data-source-catalog/ (24 files)
  • Security Solution consumer: server/lib/data_source_catalog/ (10 files)
  • Attack Discovery integration: elastic_assistant/server/lib/attack_discovery/ (7 files)
  • AI Rule Creation integration: ai_rule_creation/agent/ (5 files)
  • SIEM Migration integration: siem_migrations/rules/task/ (8 files)
  • Rule Creation UI: public/detection_engine/rule_creation_ui/ (5 files)
  • Plugin wiring: server/plugin.ts + telemetry

Test coverage

Suites and test counts:
  • @kbn/data-source-catalog package: 34
  • Security Solution catalog consumers: 20
  • AI Rule Creation discover node: 8
  • SIEM Migration enricher: 4
  • Attack Discovery (existing, all pass): 369
  • Defend Insights (existing, all pass): 125
  • Total new tests: 66

Related

Test plan

  • All 66 new unit tests pass across 13 test files
  • All 494 existing tests in affected areas pass (Attack Discovery, Defend Insights, AI Rule Creation)
  • E2E: catalog index created and populated on live Kibana startup (16 entries)
  • E2E: Fleet integration metadata correctly matched to data streams
  • E2E: CatalogQuery returns correct results for filter, wildcard, and integration searches
  • E2E: Heuristic summaries generate topics and MITRE techniques
  • Type check passes with zero errors in the new package
  • ESLint passes with zero errors
  • Code review findings from Garrett + Viduni reviews addressed

🤖 Generated with Claude Code

Production-Readiness Checklist — Agent Skills Ecosystem

Generated against [Epic] Creation of the Agent Skills Ecosystem for Elastic Security.

Narrative role: Shared context layer for Detection Engineering (and by extension every other skill that needs "what security data exists?"). Applies the epic's "Entity Analytics as a horizontal enrichment layer" pattern to data sources.

Must-do before this can ship

  • File a formal RFC. This spike proposes a new system-of-record (.kibana-data-source-catalog) that 15+ features will depend on — review scope is larger than a code review
  • Decide and document the embedding model for kNN search (ELSER vs E5 vs user-supplied) and publish the cost model
  • Widen the consumer contract before shipping: AESOP (#258980), SIEM Migration, AI Assistant, ES|QL tools — not just Detection Engineering. Define read API and versioning now
  • Stress-test the resource_already_exists_exception handling for catalog-index bootstrap under concurrent Kibana nodes
  • Task Manager refresh cadence + kill switch + backoff on Fleet-event storms
  • Privacy / access control: catalog contains index names and field metadata — restrict to users with the relevant index privileges (do not leak schema of indices the caller cannot read)

Follow-ups (post-merge)

  • Expose the catalog as an Agent Builder read-tool so skills (Triage, Hunt, DE) can call it instead of re-implementing index discovery
  • Make required_fields and related_integrations on detection rules validatable against the catalog (today informational only)

patrykkopycinski and others added 19 commits March 26, 2026 16:57
Adds the initial package scaffold for @kbn/data-source-catalog, a shared-server
package that will provide a unified, queryable catalog of data source metadata
for Security Solution AI features, rule creation, and SIEM migration.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Introduces core TypeScript interfaces (DataSourceEntry, FieldMetadata,
IntegrationMetadata, DataSourceStats, DataSourceSemantic, CatalogQueryParams,
CatalogQueryResult) and shared constants (CATALOG_INDEX_NAME, FRESHNESS_THRESHOLDS,
DEFAULT_SECURITY_PATTERNS, CATALOG_VERSION) for the kbn-data-source-catalog package.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the Elasticsearch index mapping for the .kibana-data-source-catalog
system index using MappingTypeMapping from @elastic/elasticsearch. Updates
the package entry point to export types, constants, and the mapping.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements CatalogClient class wrapping .kibana-data-source-catalog index
with ensureIndex, bulkUpsert, and deleteAll operations. Adds 6 unit tests
covering all public methods.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d match

Implements CatalogQuery class that builds typed Elasticsearch queries from
CatalogQueryParams (wildcard name, term type/package, nested field existence,
activeOnly, multi_match full-text) and returns CatalogQueryResult. Exports
from package index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements discoverIndexMetadata function that resolves indices/data streams/aliases
via ES resolveIndex, flattens nested mapping properties into FieldMetadata[], computes
ECS field coverage via ecsFieldMap, limits stored fields to DEFAULT_FIELD_LIMIT (500)
prioritizing ECS fields, and matches index templates by pattern to attach _meta.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…adata)

Implements fetchIntegrationMetadata function that maps installed Fleet package
data streams to IntegrationMetadata. Uses a minimal PackageClientLike interface
to avoid a hard dependency on the Fleet plugin.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds refreshCatalog function that orchestrates index discovery, Fleet integration metadata merging, and bulk persistence via CatalogClient. Exports refreshCatalog and PackageClientLike from the package public API.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and Fleet hooks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ess, size)

Implements fetchIndexStats which combines indices.stats (doc count + store
size) and msearch max-@timestamp aggregations to compute is_active and
freshness_category (live/recent/stale/empty) per index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
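
The freshness classification this commit describes can be sketched as a pure function over doc count and the max-@timestamp aggregation. The thresholds below are guesses for illustration; the real values come from the package's FRESHNESS_THRESHOLDS constants.

```typescript
type Freshness = 'live' | 'recent' | 'stale' | 'empty';

const HOUR_MS = 60 * 60 * 1000;

// Classify an index from its doc count and newest event timestamp.
// Assumed thresholds: live <= 1h old, recent <= 24h, otherwise stale;
// no docs (or no timestamp) means empty.
function freshnessCategory(
  docCount: number,
  maxTimestampMs: number | null,
  nowMs: number = Date.now()
): Freshness {
  if (docCount === 0 || maxTimestampMs === null) return 'empty';
  const ageMs = nowMs - maxTimestampMs;
  if (ageMs <= 1 * HOUR_MS) return 'live';
  if (ageMs <= 24 * HOUR_MS) return 'recent';
  return 'stale';
}
```

Keeping this pure makes it trivial to unit-test against fixed clocks, which matters for the periodic TaskManager refresh described later.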
Add `includeStats` param to `RefreshCatalogParams`; when true, calls
`fetchIndexStats` after index/integration discovery and attaches stats
to entries before persisting via `bulkUpsert`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… interval)

Registers a recurring task that refreshes the data source catalog with stats every 6 hours, and passes includeStats: true on both initial and on-demand refreshes so doc counts and freshness metadata are always populated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a pure helper that converts DataSourceEntry[] into a human-readable markdown block suitable for injection into LLM prompts, surfacing integration metadata, doc count / freshness stats, and top ECS fields for each data source.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Discovery prompt

Thread an optional dataSourceContext string through the Attack Discovery graph
so the LLM can receive context about available security data sources in the
environment. When empty (default), behavior is identical to current. This
prepares the plumbing for injecting catalog summaries from SecurityCatalogService.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

patrykkopycinski and others added 10 commits March 29, 2026 08:03
…ummaries and MITRE mapping

Adds the semantic enrichment pipeline that runs at catalog refresh time:
- Extends ES index mapping with dense_vector (384-dim, cosine) for future embedding support
- Implements heuristic summary generator that produces summaries, topic tags, and MITRE
  ATT&CK technique IDs from field names and integration metadata — no LLM required
- Wires summary generation as Step 5 in refreshCatalog, running for every entry
- Exports generateHeuristicSummary from the package public API

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gration

Supplements the ELSER-based integration retriever with data source catalog
metadata (field counts, freshness, ECS coverage) to enrich integration
matching during SIEM rule migration. The catalog enrichment is appended
to the LLM context alongside the ELSER results, with graceful degradation
if the catalog is unavailable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ook and badge component

Add useCatalogDataSources hook that fetches catalog entries via the data
plugin search service, and CatalogDataSourceBadge component that renders
integration name and freshness metadata alongside data source names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ainst catalog

Add validateRequiredFields function that checks whether a rule's
required_fields exist in catalog entries for the target index patterns.
Returns advisory results without blocking rule activation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
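
The batched validation this commit (and the N+1 review finding) describes reduces to one catalog query for the rule's index pattern, then a Set lookup per required field. The types below are simplified assumptions; the real validateRequiredFields also carries advisory metadata.

```typescript
// Assumed, simplified shape of a batched catalog query result: every
// field known for the entries matching the rule's index pattern.
interface CatalogFieldsResult {
  fields: Array<{ name: string }>;
}

// One O(fields) pass instead of N per-field catalog queries: build a Set
// of known field names once, then check each required field against it.
function validateRequiredFields(
  requiredFields: string[],
  catalogResult: CatalogFieldsResult
): { missing: string[] } {
  const known = new Set(catalogResult.fields.map((f) => f.name));
  return { missing: requiredFields.filter((f) => !known.has(f)) };
}
```

The result stays advisory, matching the commit's note that validation does not block rule activation.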
…semantic search

Adds optional `embedding` parameter to `CatalogQueryParams` and implements
hybrid search (kNN + keyword filters) when an embedding vector is provided,
using the `semantic.embedding` dense_vector field (384 dims, cosine).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
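
A hybrid kNN + keyword request like the one this commit describes could be assembled as below. The parameter names mirror the Elasticsearch search API's knn option; the field name follows the mapping described above, while k and num_candidates are illustrative choices.

```typescript
// Build a hybrid search body: vector similarity over the assumed
// `semantic.embedding` dense_vector field (384 dims, cosine), optionally
// pre-filtered by a wildcard on the entry name.
function buildHybridQuery(embedding: number[], namePattern?: string) {
  return {
    knn: {
      field: 'semantic.embedding',
      query_vector: embedding,
      k: 10,
      num_candidates: 100,
      // Filters inside `knn` restrict the candidate set before scoring,
      // so keyword constraints compose with vector similarity.
      ...(namePattern ? { filter: { wildcard: { name: { value: namePattern } } } } : {}),
    },
  };
}
```

When no embedding is provided, CatalogQuery falls back to the keyword-only path described earlier in the PR.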
… UI data source picker

Add CatalogSuggestions component that displays available data sources from
the catalog below the DataViewSelectorField combo box, showing freshness
badges and integration metadata for non-empty sources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…nent

Client-side component that validates required fields against the data source
catalog and displays warnings when fields are not found in catalog entries
matching the selected index patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntegrations from catalog in AI Rule Creation

The DISCOVER_DATA_SOURCES node now extracts ECS fields and integration
metadata from catalog entries, storing them as suggestedRequiredFields
and suggestedRelatedIntegrations on the state. catalogDataSources is
also persisted for downstream nodes. Integrations are deduplicated by
package name.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tterns configurable

SecurityCatalogService.start() now accepts a refreshInterval param
(e.g. '1h', '6h') that overrides the DEFAULT_REFRESH_INTERVAL when
scheduling the background task. configPatterns was already supported;
this completes the constructor-param-based config override approach
for spike-level configurability without a full Kibana config schema.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…talog

# Conflicts:
#	x-pack/solutions/security/plugins/security_solution/tsconfig.json
patrykkopycinski and others added 2 commits March 29, 2026 12:23
… schema entry

Replace inline formatting logic in DataSourceCatalogTool with a call to
the existing formatCatalogContextForPrompt helper. Add DataSourceCatalogTool
to the toolsInvoked type and schema in event_based_telemetry.ts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ondition, N+1 queries, non-null assertions)

- Extract globToRegex() helper that escapes all regex metacharacters before converting * wildcards, fixing potential ReDoS in catalog_refresh and index_metadata_provider
- Wrap indices.create in try/catch to silently absorb resource_already_exists_exception, eliminating the TOCTOU race condition in ensureIndex
- Remove conflicting number_of_replicas: 0 setting now that auto_expand_replicas: '0-1' fully controls replica count
- Refactor validateRequiredFields to issue a single batch catalog query instead of N per-field queries; change indexPatterns: string[] to indexPattern: string to make the single-pattern contract explicit
- Replace filter+map with non-null assertions for entry.integration with flatMap guard in discover_data_sources node

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>