[Agent Builder] [SML] Define schema + mappings by Apmats · Pull Request #268485 · elastic/kibana

Apmats · 2026-05-08T15:28:02Z

Summary

Resolves search-team#14362. Adds the schema fields the SML team converged on, refactors title / description / content to support BM25 + prefix typeahead + a single unified vector retrieval surface, and rewires buildSmlSearchQuery to match.

Schema changes inevitably touch the query path (the field references in multi_match and the match clauses against content / description move). This PR includes the minimum query rewiring needed for the schema to function as intended. Search-API follow-up (search-team#14363) refines from there — splitting autocomplete from full retrieval, field weighting, etc.

This is the first of three planned PRs:

1. This PR: schema + base mappings + the query rewiring required.
2. Autocomplete: adds discovery_labels and the @ menu query path.
3. Search: refines sml_search retrieval (codepath split, field weighting, hybrid scoring tuning).

Fields added

| Field | Mapping | Purpose |
| --- | --- | --- |
| `tags` | `keyword[]` | Free-form labels for filtering and discovery. |
| `payload` | `flattened` | Type-specific opaque data. Sub-path keyword filtering, no mapping explosion, no per-type schema registry. |
| `title_autocomplete` | `search_as_you_type` | `copy_to` target of `title`. Powers @ menu / typeahead via the SAYT subfields (`._2gram`, `._3gram`, `._index_prefix`). |
| `unified_semantic` | `semantic_text` | `copy_to` target of `title`, `description`, and `content`. Single vector index per record: one inference pass, single recall surface. |

Field changes

| Field | Was | Now | Why |
| --- | --- | --- | --- |
| `title` | `search_as_you_type` | `text` with `copy_to: ['title_autocomplete', 'unified_semantic']` | Three-way fan-out: BM25 on `title`, typeahead on `title_autocomplete`, semantic on `unified_semantic`. Each independently tunable. |
| `description` | `semantic_text` | `text` with `copy_to: 'unified_semantic'` | Contributes to the unified vector field instead of having its own. |
| `content` | `semantic_text` | `text` with `copy_to: 'unified_semantic'` | Same. |

_source.description and _source.content still hold the producer text. _source.unified_semantic does not exist (copy_to populates the inverted/vector index, not _source); the field is queryable via match but not retrievable.
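For readers who want to see the fan-out in one place, here is a minimal sketch of the mapping the two tables describe, written as a plain Elasticsearch mapping object. The names `FieldMappingSketch`, `smlMappingSketch`, and `sourcesOf` are hypothetical and exist only for illustration; the PR itself uses the `types.*` helpers, not this literal.

```typescript
// Hypothetical shape for illustration only; not the Kibana `types.*` helper API.
interface FieldMappingSketch {
  type: string;
  copy_to?: string[];
}

// Plain mapping properties mirroring the "Fields added" and "Field changes" tables.
const smlMappingSketch: { properties: Record<string, FieldMappingSketch> } = {
  properties: {
    title: { type: 'text', copy_to: ['title_autocomplete', 'unified_semantic'] },
    description: { type: 'text', copy_to: ['unified_semantic'] },
    content: { type: 'text', copy_to: ['unified_semantic'] },
    title_autocomplete: { type: 'search_as_you_type' },
    unified_semantic: { type: 'semantic_text' },
    tags: { type: 'keyword' },
    payload: { type: 'flattened' },
  },
};

// Lists the producer-set fields that feed a given copy_to target.
function sourcesOf(target: string): string[] {
  const props = smlMappingSketch.properties;
  return Object.keys(props).filter((name) => {
    const copyTo = props[name].copy_to;
    return copyTo !== undefined && copyTo.indexOf(target) !== -1;
  });
}
```

Calling `sourcesOf('unified_semantic')` lists the three text fields that contribute to the vector field, which is also why that field never appears in `_source`.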

Why one unified semantic field

- One inference pass per record instead of three. Three independent semantic_text fields would each run inference on overlapping content; one unified field runs inference once on the union.
- Single vector recall surface. A unified semantic field doesn't fragment retrieval across per-field vector spaces.
- BM25 stays per-field with boosts available. multi_match: { fields: ['title^3', 'description^2', 'content'] } remains possible; each text field is individually scored.
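As a rough illustration of the per-field boosting mentioned above, the sketch below builds a `multi_match` clause from a boost table. `buildBoostedBm25` is a hypothetical helper, not something this PR adds, and the boost values are the example ones from the text rather than tuned numbers.

```typescript
// Hypothetical helper: turns a field -> boost table into a multi_match clause.
// A boost of 1 is emitted as the bare field name, matching the `content` example above.
function buildBoostedBm25(query: string, boosts: Record<string, number>) {
  const fields = Object.keys(boosts).map((field) =>
    boosts[field] === 1 ? field : `${field}^${boosts[field]}`
  );
  return { multi_match: { query, fields } };
}

// Example boosts taken from the PR text; not tuned values.
const boostedClause = buildBoostedBm25('cpu usage', {
  title: 3,
  description: 2,
  content: 1,
});
```

Because the boosts live in the query rather than the mapping, they can be changed in the search follow-up without touching the schema.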

Why three fields for title

semantic_text cannot be a multi-field, so the BM25 / typeahead / semantic split has to live at the top level. title is the canonical text; title_autocomplete and unified_semantic are derived via copy_to. Each field is independently tunable — analyzer for BM25, shingles for autocomplete, inference model for semantic.

Decisions

`payload` mapping: `flattened`

Sub-path keyword filtering, no mapping explosion. disabled is the fallback if even keyword filtering turns out unneeded. object rejected — dynamic: true invites mapping explosion, dynamic: 'strict' would require a per-type schema registry that the team explicitly doesn't want.
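To make "sub-path keyword filtering" concrete, here is a small sketch of a filter against a key inside the `flattened` payload. The helper name `payloadTermFilter` and the sub-path `panel.kind` are assumptions for illustration; the PR does not define any payload keys.

```typescript
// Hypothetical helper: exact-match filter on one key inside the flattened `payload`.
// Flattened fields index leaf values as keywords, addressable with dotted paths.
function payloadTermFilter(subPath: string, value: string) {
  return { term: { [`payload.${subPath}`]: value } };
}

// `panel.kind` is an invented sub-path; type writers own the real payload shape.
const payloadFilter = payloadTermFilter('panel.kind', 'lens');
```

Since every leaf value is treated as a keyword, this supports exact matching but not analyzed full-text search or numeric range semantics on payload values.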

`tags` mapping: plain `keyword`

Multi-value. Add a search_as_you_type subfield later only if tag-name autocomplete becomes a real need.

Query path changes

buildSmlSearchQuery's should block now contains:

- multi_match: bool_prefix against SML_SEARCH_AS_YOU_TYPE_FIELDS: typeahead/prefix on title_autocomplete.* and type.autocomplete.*. (Previously this list also included bare title; dropped because title is no longer a SAYT field.)
- match: { title }: BM25 on the canonical text field (added; previously title contributed via the SAYT bool_prefix).
- match: { description } and match: { content }: BM25 on the per-field text. (Previously these were match clauses against semantic_text fields, i.e. semantic retrieval. Now they're BM25 since the fields became text.)
- match: { unified_semantic }: semantic vector retrieval. Replaces the prior semantic matches on content and description; coverage is the same (the unified field aggregates both via copy_to) with one inference output rather than two.
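The should block described above can be sketched roughly as follows. This is not the actual `buildSmlSearchQuery` implementation: `buildShouldClausesSketch` and `SAYT_FIELDS_SKETCH` are hypothetical names, and the exact contents of `SML_SEARCH_AS_YOU_TYPE_FIELDS` are assumed from the description (the root SAYT fields plus their `_2gram` / `_3gram` subfields).

```typescript
// Assumed stand-in for SML_SEARCH_AS_YOU_TYPE_FIELDS; the real constant may differ.
const SAYT_FIELDS_SKETCH: string[] = [
  'title_autocomplete',
  'title_autocomplete._2gram',
  'title_autocomplete._3gram',
  'type.autocomplete',
  'type.autocomplete._2gram',
  'type.autocomplete._3gram',
];

// Hypothetical sketch of the should clauses listed above, in the same order.
function buildShouldClausesSketch(query: string) {
  return [
    // Typeahead / prefix matching on the search_as_you_type fields.
    { multi_match: { query, type: 'bool_prefix', fields: SAYT_FIELDS_SKETCH } },
    // BM25 on the canonical text fields.
    { match: { title: query } },
    { match: { description: query } },
    { match: { content: query } },
    // Semantic retrieval against the single copy_to-populated vector field.
    { match: { unified_semantic: query } },
  ];
}

const shouldClauses = buildShouldClausesSketch('cpu usage');
```

The clauses would sit under `bool.should` in the final query; scoring interplay between the lexical and semantic clauses is left to the search follow-up.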

Existing type writers (dashboard, visualization, connector, rule, etc.) need no changes — they keep returning the same SmlChunk shape; ES copy_to handles the fan-out.

Out of scope

- discovery_labels (categorical/nickname terms backing the @ menu): autocomplete follow-up PR.
- Splitting the query into autocomplete-only vs full-retrieval codepaths, field-weighted BM25, hybrid scoring tuning: search-team#14363.
- Per-type payload schema registry / validation: explicit no.
- last_accessed_at / access_count: not added. "Accessed" semantics are undefined, and bumping a counter on every search hit is write-amplification on a hot path. ES mapping append is non-breaking, so these can be added later when there is a defined writer.
- sml_read / lookup endpoint: search-team#14365.
- Telemetry: search-team#14366.
- Migration / mapping versioning / recrawl: search-team#14367 (note: data-loss concern present but not addressed here).

Test plan

- node scripts/eslint --fix on touched files.
- node scripts/type_check --project x-pack/platform/plugins/shared/agent_context_layer/tsconfig.json.
- node scripts/jest x-pack/platform/plugins/shared/agent_context_layer/server/services/sml/.
- FTR agent_builder_api_integration SML smoke test exercises the smlElasticsearchIndexMappings export end-to-end.


Builds on Peter's merged elastic#266573 by adding the schema fields the team converged on and refactoring title/description/content for BM25 + a single unified vector retrieval surface.

Fields added:
- tags (keyword[], free-form labels)
- payload (flattened, type-specific opaque data)
- title_autocomplete (search_as_you_type, copy target of title)
- unified_semantic (semantic_text, copy target of title/description/content)

Behavior changes:
- title becomes `text` with copy_to fanning into title_autocomplete and unified_semantic. Three retrieval modes (BM25/lexical, prefix/typeahead, semantic) from one producer-set field.
- description and content become `text` with copy_to: 'unified_semantic'. One inference pass per record instead of three; recall doesn't fragment across overlapping content.
- buildSmlSearchQuery: SAYT field paths move from title.* to title_autocomplete.*; the should-block uses match: { unified_semantic } in place of separate matches on content and description.

origin_id is unchanged. An earlier draft of this PR added a parallel `origin` URI field per the gist's vision but it was dropped: the URI form is computable on the fly from type + origin_id, Sean's merged buildTypeFilters targets origin_id directly, and adding a parallel representation would just be redundant.

Type writers need no changes: they keep returning the same SmlChunk shape; ES copy_to handles the fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kderusso

Nice starting point, @Apmats! I added some initial comments. As this is a draft PR, I didn't look at tests, just the schema.

kderusso · 2026-05-11T17:57:21Z

@@ -23,12 +23,16 @@ const smlStorageSchemaProperties = {
      autocomplete: types.search_as_you_type({}),
    },
  }),
-  title: types.search_as_you_type({}),
+  title: types.text({ copy_to: ['title_autocomplete', 'unified_semantic'] }),
+  title_autocomplete: types.search_as_you_type({}),


Could we reconsider making title a multi-field? semantic_text does now support multi-fields, see: https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text-ingestions#use-multi-fields

Noted, use multi-fields when able.

kderusso · 2026-05-11T18:01:23Z

-  user_id: types.keyword({}),
+  content: types.text({ copy_to: 'unified_semantic' }),
+  description: types.text({ copy_to: 'unified_semantic' }),
+  unified_semantic: types.semantic_text({}),


I realize that you're trying to optimize for inference calls here, but I have some questions about whether we really want to break these up in practice.

Inference costs should scale with total length, and since we send inference calls in bulk, the per-call overhead shouldn't be something we need to worry about.

There is a potential for relevance pitfalls here, for example if we want to boost title content more highly or use description as a boost field only in a hybrid query, etc. This means we can't experiment with any of those optimizations.

I'd at least like to test these out before we commit.

Actually I'm with you, and I wanted to bring this up in review. I was trying to read the room, though, and from discussions it seemed like almost everyone assumed we would accumulate all text into a single semantic field. E.g. your message here.

The way I see it, keeping them separate allows us to do all the experiments you just described, e.g. dropping title if it turns out not to be friendly to semantic search, since for certain objects the title is essentially system-provided.

The inference argument is weak, IMO. It's just Claude filling in the gaps.
My real concern is the HNSW overhead. Every separate semantic-searchable field getting its own graph means more memory usage, and depending on how far SML goes we might be needlessly introducing scaling problems: for simple SML objects that wouldn't chunk, we're going from [1 vector entry into 1 graph] to [N (now 3: title, content, description; maybe more down the line) into N graphs].

I've struggled with scaling HNSW search before in multiple cases, so this probably needs to be a conscious decision.
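A back-of-envelope sketch of the 1-versus-N concern, for the simple case where objects do not chunk. Everything here is an assumption for illustration: the helper `vectorFootprint`, the record count, and the 384-dimension float32 embeddings. It counts raw vector bytes only and ignores graph links, quantization, and segment effects.

```typescript
// Hypothetical estimator: counts vectors and raw float32 bytes, nothing else.
function vectorFootprint(
  records: number,
  semanticFields: number,
  chunksPerField: number,
  dims: number
) {
  const vectors = records * semanticFields * chunksPerField;
  return {
    vectorFields: semanticFields, // each semantic field is indexed separately
    vectors,
    rawFloat32Bytes: vectors * dims * 4, // 4 bytes per float32 component
  };
}

// Assumed workload: 1M non-chunking records, 384-dim embeddings.
const unifiedFootprint = vectorFootprint(1_000_000, 1, 1, 384);
const perFieldFootprint = vectorFootprint(1_000_000, 3, 1, 384);
```

Under these assumptions the per-field layout stores three times as many vectors as the unified one; how much that matters in practice depends on the index type chosen.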

Fair point 😅 I have some recency bias, recently talked to some people who backed themselves into a relevance corner doing that.

Whatever we do, we should default to DiskBBQ and not HNSW, though.

Noted on DiskBBQ. Since the primary users of this interface are agents, they're also less sensitive to latency. I do need to catch up on the details around resource utilization, latency, etc. for DiskBBQ myself, though.

kderusso · 2026-05-11T18:02:17Z

+  content: types.text({ copy_to: 'unified_semantic' }),
+  description: types.text({ copy_to: 'unified_semantic' }),
+  unified_semantic: types.semantic_text({}),
+  tags: types.keyword({}),


Should we add some common-sense normalizers here, like lowercase?

Good question. On tags you mean specifically? Lowercase probably makes sense, maybe some folding. Will figure out.

++ was talking about tags, it could be worth doing a pass over the schema and seeing if other fields would benefit, but tags would definitely benefit the most I think.
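One way the lowercase-plus-folding idea for `tags` could look, as a sketch: a custom keyword normalizer using the built-in `lowercase` and `asciifolding` filters. The normalizer name `sml_tag_normalizer` is hypothetical, and `normalizeTagLocally` is only a rough client-side approximation (it strips combining marks, which covers less than `asciifolding` does).

```typescript
// Index settings + mapping fragment; `sml_tag_normalizer` is an invented name.
const tagsNormalizerSketch = {
  settings: {
    analysis: {
      normalizer: {
        sml_tag_normalizer: {
          type: 'custom',
          filter: ['lowercase', 'asciifolding'],
        },
      },
    },
  },
  mappings: {
    properties: {
      tags: { type: 'keyword', normalizer: 'sml_tag_normalizer' },
    },
  },
};

// Rough local approximation: decompose, drop combining marks, lowercase.
function normalizeTagLocally(tag: string): string {
  return tag
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}
```

A keyword normalizer is applied at both index time and query time, so term filters on `tags` would match regardless of the casing a producer used.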

kderusso · 2026-05-11T18:04:06Z

+   * keyword-searchable for sub-path filtering. SML treats this opaquely;
+   * type writers own its shape.
+   */
+  payload?: Record<string, unknown>;


I'm good with flattened here for the reasons you laid out, the other option would be nested but that comes with its own dragons.

kderusso · 2026-05-11T18:06:28Z

  /** Owner or last-modifier user id when known */
  user_id?: string;
-  /** Other SML chunk ids this item references */
+  /** Other SML chunk ids (URI form preferred, e.g. `dashboard://abc`) this item references */


This feels like a "here be dragons" - as a future note, we should probably make sure that we add validators for an appropriate URI format here, to ensure that we don't pollute the contents of this field.

I recall that we had some discussions here on additional data that we might want - we may need more than just the URI but have the ability to explain exactly what the relationship and correlation here is. Where did we end up with that investigation, and should we consider making references more explicit here?
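A minimal sketch of the kind of validator suggested above, assuming the reference format is `<type>://<id>` as in the `dashboard://abc` example. The pattern and the names `SML_REFERENCE_URI` and `isSmlReferenceUri` are hypothetical; the thread does not settle on a format.

```typescript
// Assumed format: lowercase scheme, "://", then a non-empty id without whitespace.
const SML_REFERENCE_URI = /^[a-z][a-z0-9_-]*:\/\/\S+$/;

// Hypothetical validator for entries in the references field.
function isSmlReferenceUri(value: string): boolean {
  return SML_REFERENCE_URI.test(value);
}

// Splits a list of candidate references into accepted and rejected values.
function partitionReferences(values: string[]) {
  return {
    valid: values.filter((v) => isSmlReferenceUri(v)),
    invalid: values.filter((v) => !isSmlReferenceUri(v)),
  };
}

const partitioned = partitionReferences(['dashboard://abc', 'abc', 'rule://']);
```

If references later need to carry relationship metadata as well, a check like this would apply to the URI part of a richer object rather than to a bare string.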

Replace the single `unified_semantic` copy_to aggregator with per-field semantic_text mirrors (`title_semantic`, `description_semantic`, `content_semantic`). The search query now uses the `rrf` retriever with `query`/`fields` for the semantic side and a `standard` sub-retriever for BM25 + SAYT prefix matching. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apmats force-pushed the apmats/sml-schema-mappings branch 7 times, most recently from 810f1df to c97a1a5 Compare May 9, 2026 10:57

Apmats force-pushed the apmats/sml-schema-mappings branch from c97a1a5 to 4902b1b Compare May 11, 2026 00:08

Apmats mentioned this pull request May 11, 2026

[Agent Builder] [SML] Add discovery_labels + autocomplete query path Apmats/kibana#1

Draft

8 tasks

kderusso reviewed May 11, 2026


Apmats mentioned this pull request May 14, 2026

[Agent Builder] [SML] Retrieval layer: schema, autocomplete, hybrid search #269277

Draft
