
[Agent Builder] [SML] Define schema + mappings#268485

Draft
Apmats wants to merge 2 commits into
elastic:mainfrom
Apmats:apmats/sml-schema-mappings
Contributor

@Apmats Apmats commented May 8, 2026

Summary

Resolves search-team#14362. Adds the schema fields the SML team converged on, refactors title / description / content to support BM25 + prefix typeahead + a single unified vector retrieval surface, and rewires buildSmlSearchQuery to match.

Schema changes inevitably touch the query path (the field references in multi_match and the match clauses against content / description move). This PR includes the minimum query rewiring needed for the schema to function as intended. Search-API follow-up (search-team#14363) refines from there — splitting autocomplete from full retrieval, field weighting, etc.

This is the first of three planned PRs:

  1. This PR — schema + base mappings + the query rewiring required.
  2. Autocomplete — adds discovery_labels and the @ menu query path.
  3. Search — refines sml_search retrieval (codepath split, field weighting, hybrid scoring tuning).

Fields added

| Field | Mapping | Purpose |
|---|---|---|
| tags | keyword[] | Free-form labels for filtering and discovery. |
| payload | flattened | Type-specific opaque data. Sub-path keyword filtering, no mapping explosion, no per-type schema registry. |
| title_autocomplete | search_as_you_type | copy_to target of title. Powers @ menu / typeahead via the SAYT subfields (._2gram, ._3gram, ._index_prefix). |
| unified_semantic | semantic_text | copy_to target of title, description, and content. Single vector index per record: one inference pass, single recall surface. |

Field changes

| Field | Was | Now | Why |
|---|---|---|---|
| title | search_as_you_type | text with copy_to: ['title_autocomplete', 'unified_semantic'] | Three-way fan-out: BM25 on title, typeahead on title_autocomplete, semantic on unified_semantic. Each independently tunable. |
| description | semantic_text | text with copy_to: 'unified_semantic' | Contributes to the unified vector field instead of having its own. |
| content | semantic_text | text with copy_to: 'unified_semantic' | Same as description. |

_source.description and _source.content still hold the producer text. _source.unified_semantic does not exist (copy_to populates the inverted/vector index, not _source); the field is queryable via match but not retrievable.
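
A minimal sketch of the mapping described by the two tables above, written as a raw Elasticsearch mappings body rather than with the `types.*` helpers the PR uses. The constant name and the `sourcesOf` helper are hypothetical and exist only to make the copy_to fan-out visible:

```typescript
interface FieldMappingSketch {
  type: string;
  copy_to?: string[];
}

// Assumption: plain mapping JSON, not the PR's actual storage schema definition.
const smlMappingSketch: { properties: Record<string, FieldMappingSketch> } = {
  properties: {
    title: { type: 'text', copy_to: ['title_autocomplete', 'unified_semantic'] },
    title_autocomplete: { type: 'search_as_you_type' },
    description: { type: 'text', copy_to: ['unified_semantic'] },
    content: { type: 'text', copy_to: ['unified_semantic'] },
    unified_semantic: { type: 'semantic_text' },
    tags: { type: 'keyword' },
    payload: { type: 'flattened' },
  },
};

// Hypothetical helper: list the producer-set fields that feed a copy_to target.
function sourcesOf(target: string): string[] {
  const props = smlMappingSketch.properties;
  return Object.keys(props).filter((name) =>
    (props[name].copy_to ?? []).some((t) => t === target)
  );
}
```

Calling `sourcesOf('unified_semantic')` lists the three text fields that feed the unified vector field, while `_source` keeps only the producer-set fields.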

Why one unified semantic field

  • One inference pass per record instead of three. Three independent semantic_text fields would each run inference on overlapping content; one unified field runs inference once on the union.
  • Single vector recall surface. A unified semantic field doesn't fragment retrieval across per-field vector spaces.
  • BM25 stays per-field with boosts available. multi_match: { fields: ['title^3', 'description^2', 'content'] } remains possible — each text field is individually scored.
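
The per-field boosting mentioned in the last bullet could look roughly like this. It is a sketch only: `toBoostedFields` and `buildBoostedBm25` are hypothetical names, and the boost values are the ones quoted above, not tuned weights:

```typescript
// Turn a { field: boost } map into multi_match field specs such as 'title^3'.
function toBoostedFields(boosts: Record<string, number>): string[] {
  return Object.keys(boosts).map((field) => {
    const boost = boosts[field];
    // A boost of 1 is the default, so the suffix is omitted.
    return boost === 1 ? field : `${field}^${boost}`;
  });
}

// Hypothetical builder for the lexical (BM25) clause across the three text fields.
function buildBoostedBm25(queryText: string, boosts: Record<string, number>) {
  return {
    multi_match: {
      query: queryText,
      fields: toBoostedFields(boosts),
    },
  };
}

const bm25Clause = buildBoostedBm25('cpu dashboard', {
  title: 3,
  description: 2,
  content: 1,
});
```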

Why three fields for title

semantic_text cannot be a multi-field, so the BM25 / typeahead / semantic split has to live at the top level. title is the canonical text; title_autocomplete and unified_semantic are derived via copy_to. Each field is independently tunable — analyzer for BM25, shingles for autocomplete, inference model for semantic.

Decisions

payload mapping: flattened

Sub-path keyword filtering, no mapping explosion. disabled is the fallback if even keyword filtering turns out unneeded. object rejected — dynamic: true invites mapping explosion, dynamic: 'strict' would require a per-type schema registry that the team explicitly doesn't want.
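
As an illustration of sub-path keyword filtering on a flattened field, a term filter can address a key inside payload with dot notation. The helper name and the `panel_type` key below are hypothetical; the PR does not define any payload keys:

```typescript
// Build a term filter against one key inside the flattened `payload` field.
// Flattened fields index leaf values as keywords, so matching is exact.
function payloadTermFilter(subPath: string, value: string) {
  return { term: { [`payload.${subPath}`]: value } };
}

// Hypothetical usage: restrict results to chunks whose payload.panel_type is 'lens'.
const payloadFilterExample = payloadTermFilter('panel_type', 'lens');
```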

tags mapping: plain keyword

Multi-value. Add a search_as_you_type subfield later only if tag-name autocomplete becomes a real need.

Query path changes

buildSmlSearchQuery's should block now contains:

  • multi_match: bool_prefix against SML_SEARCH_AS_YOU_TYPE_FIELDS — typeahead/prefix on title_autocomplete.* and type.autocomplete.*. (Previously this list also included bare title — dropped because title is no longer a SAYT field.)
  • match: { title } — BM25 on the canonical text field (added; previously title contributed via the SAYT bool_prefix).
  • match: { description } and match: { content } — BM25 on the per-field text. (Previously these were match clauses against semantic_text fields, i.e. semantic retrieval. Now they're BM25 since the fields became text.)
  • match: { unified_semantic } — semantic vector retrieval. Replaces the prior semantic matches on content and description; coverage is the same (the unified field aggregates both via copy_to) with one inference output rather than two.

Existing type writers (dashboard, visualization, connector, rule, etc.) need no changes — they keep returning the same SmlChunk shape; ES copy_to handles the fan-out.
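
The should block described above can be sketched as follows. This is not the PR's `buildSmlSearchQuery`: the function name is hypothetical, and the contents of `SML_SEARCH_AS_YOU_TYPE_FIELDS` are not shown in this description, so the field list here is an assumption based on the usual search_as_you_type subfields:

```typescript
type QueryClauseSketch = Record<string, unknown>;

// Assumption: stand-in for SML_SEARCH_AS_YOU_TYPE_FIELDS.
const SAYT_FIELDS_SKETCH: string[] = [
  'title_autocomplete',
  'title_autocomplete._2gram',
  'title_autocomplete._3gram',
  'type.autocomplete',
  'type.autocomplete._2gram',
  'type.autocomplete._3gram',
];

// Hypothetical builder returning the five should clauses listed above.
function buildShouldClausesSketch(queryText: string): QueryClauseSketch[] {
  return [
    // Typeahead / prefix matching on the SAYT fields.
    { multi_match: { query: queryText, type: 'bool_prefix', fields: SAYT_FIELDS_SKETCH } },
    // BM25 on the canonical text fields.
    { match: { title: queryText } },
    { match: { description: queryText } },
    { match: { content: queryText } },
    // Semantic retrieval on the copy_to aggregate.
    { match: { unified_semantic: queryText } },
  ];
}

const shouldClausesExample = buildShouldClausesSketch('cpu dashboard');
```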

Out of scope

  • discovery_labels (categorical/nickname terms backing the @ menu) — autocomplete follow-up PR.
  • Splitting the query into autocomplete-only vs full-retrieval codepaths, field-weighted BM25, hybrid scoring tuning — search-team#14363.
  • Per-type payload schema registry / validation — explicit no.
  • last_accessed_at / access_count — not added: "accessed" semantics are undefined and bumping a counter on every search hit is write-amplification on a hot path. ES mapping append is non-breaking, so these can be added later when there is a defined writer.
  • sml_read / lookup endpoint — search-team#14365.
  • Telemetry — search-team#14366.
  • Migration / mapping versioning / recrawl — search-team#14367 (note: data-loss concern present but not addressed here).

Test plan

  • node scripts/eslint --fix on touched files.
  • node scripts/type_check --project x-pack/platform/plugins/shared/agent_context_layer/tsconfig.json.
  • node scripts/jest x-pack/platform/plugins/shared/agent_context_layer/server/services/sml/.
  • FTR agent_builder_api_integration SML smoke test exercises the smlElasticsearchIndexMappings export end-to-end.


@Apmats Apmats force-pushed the apmats/sml-schema-mappings branch 7 times, most recently from 810f1df to c97a1a5 Compare May 9, 2026 10:57
Builds on Peter's merged elastic#266573 by adding the schema fields the team
converged on and refactoring title/description/content for BM25 + a
single unified vector retrieval surface.

Fields added:
- tags (keyword[], free-form labels)
- payload (flattened, type-specific opaque data)
- title_autocomplete (search_as_you_type, copy target of title)
- unified_semantic (semantic_text, copy target of title/description/content)

Behavior changes:
- title becomes `text` with copy_to fanning into title_autocomplete and
  unified_semantic. Three retrieval modes (BM25/lexical, prefix/typeahead,
  semantic) from one producer-set field.
- description and content become `text` with copy_to: 'unified_semantic'.
  One inference pass per record instead of three; recall doesn't
  fragment across overlapping content.
- buildSmlSearchQuery: SAYT field paths move from title.* to
  title_autocomplete.*; the should-block uses match: { unified_semantic }
  in place of separate matches on content and description.

origin_id is unchanged. An earlier draft of this PR added a parallel
`origin` URI field per the gist's vision but it was dropped: the URI
form is computable on the fly from type + origin_id, Sean's merged
buildTypeFilters targets origin_id directly, and adding a parallel
representation would just be redundant.

Type writers need no changes: they keep returning the same SmlChunk
shape; ES copy_to handles the fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Member

@kderusso kderusso left a comment

Nice starting point @Apmats ! I added some initial comments - as this is a draft PR I didn't look at tests, just the schema.

@@ -23,12 +23,16 @@ const smlStorageSchemaProperties = {
 autocomplete: types.search_as_you_type({}),
 },
 }),
-title: types.search_as_you_type({}),
+title: types.text({ copy_to: ['title_autocomplete', 'unified_semantic'] }),
+title_autocomplete: types.search_as_you_type({}),

Member

Could we reconsider making title a multi-field? semantic_text does now support multi-fields, see: https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text-ingestions#use-multi-fields

Contributor Author

Noted, use multi-fields when able.

user_id: types.keyword({}),
content: types.text({ copy_to: 'unified_semantic' }),
description: types.text({ copy_to: 'unified_semantic' }),
unified_semantic: types.semantic_text({}),

Member

I realize that you're trying to optimize for inference calls here, but I have some questions about whether we really want to break these up in practice.

  • Inference costs should scale with total length, and as we send inference calls in bulk, the overhead shouldn't be something that we worry about
  • There is a potential for relevance pitfalls here, for example if we want to boost title content more highly or use description as a boost field only in a hybrid query, etc. This means we can't experiment with any of those optimizations.

I'd at least like to test these out before we commit.

Contributor Author

Actually I'm with you, and I wanted to bring this up in review. I was trying to read the room, though, and from discussions it seemed like almost everyone assumed we would accumulate all text into a single semantic field, e.g. your message here.

The way I see it, keeping them separate allows us to do all the experiments you just described, e.g. dropping title if titles turn out not to be friendly to semantic search, since for certain objects the title is essentially system-provided.

The inference argument IMO is ass; it's just Claude filling in the gaps.
My real concern is the HNSW overhead. Every separate semantic-searchable field getting its own graph means more memory usage, and depending on how far SML goes we might be needlessly introducing scaling problems: for simple SML objects that wouldn't chunk, we're going from [1 vector entry into 1 graph] to [N (now 3: title, content, description; maybe more down the line) into N graphs].

I've struggled with scaling HNSW search before in multiple cases, so this probably needs to be a conscious decision.
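
The "1 into 1" versus "N into N" comparison above, as rough arithmetic. This sketch assumes every record fits in a single chunk per field and that each semantic field gets its own vector graph per segment; the function name and the one-million-record figure are illustrative only, not measurements:

```typescript
interface VectorFootprintSketch {
  vectors: number;
  graphsPerSegment: number;
}

// Count vectors and graphs for a given number of semantic fields per record.
function estimateVectorFootprint(
  recordCount: number,
  semanticFields: number
): VectorFootprintSketch {
  return {
    vectors: recordCount * semanticFields,
    graphsPerSegment: semanticFields,
  };
}

// One unified field versus separate title/description/content fields.
const unifiedFootprint = estimateVectorFootprint(1_000_000, 1);
const perFieldFootprint = estimateVectorFootprint(1_000_000, 3);
```

Under these assumptions the per-field layout stores three times as many vectors across three graphs instead of one. Actual memory use would also depend on dimensions, quantization, and index type.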

Member

Fair point 😅 I have some recency bias, recently talked to some people who backed themselves into a relevance corner doing that.

Member

Whatever we do we should default to disk BBQ and not HNSW though

Contributor Author

Noted on diskBBQ. Since the primary users of this interface are agents, they're also less sensitive to latency. I do need to catch up on the details around resource utilization, latency, etc. for diskBBQ myself, though.

content: types.text({ copy_to: 'unified_semantic' }),
description: types.text({ copy_to: 'unified_semantic' }),
unified_semantic: types.semantic_text({}),
tags: types.keyword({}),

Member

Should we add some common-sense normalizers here, like lowercase?

Contributor Author

Good question. On tags you mean specifically? Lowercase probably makes sense, maybe some folding. Will figure out.

Member

++ was talking about tags, it could be worth doing a pass over the schema and seeing if other fields would benefit, but tags would definitely benefit the most I think.
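
One way the tags normalization discussed in this thread could be expressed is a custom keyword normalizer with the built-in lowercase and asciifolding filters. The normalizer name is hypothetical and nothing here is in the PR. `normalizeTagLocally` is only a client-side approximation for illustration, not what Elasticsearch executes:

```typescript
// Assumption: raw index settings + mappings, not the PR's `types.*` helpers.
const tagsIndexSketch = {
  settings: {
    analysis: {
      normalizer: {
        sml_tag_normalizer: {
          type: 'custom',
          filter: ['lowercase', 'asciifolding'],
        },
      },
    },
  },
  mappings: {
    properties: {
      tags: { type: 'keyword', normalizer: 'sml_tag_normalizer' },
    },
  },
};

// Rough local approximation: strip combining diacritics, then lowercase.
function normalizeTagLocally(tag: string): string {
  return tag
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}
```

A keyword normalizer applies at index time and to term-level queries on the field, so `Dashboard` and `dashboard` would land on the same tag value.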

* keyword-searchable for sub-path filtering. SML treats this opaquely;
* type writers own its shape.
*/
payload?: Record<string, unknown>;

Member

I'm good with flattened here for the reasons you laid out, the other option would be nested but that comes with its own dragons.

 /** Owner or last-modifier user id when known */
 user_id?: string;
-/** Other SML chunk ids this item references */
+/** Other SML chunk ids (URI form preferred, e.g. `dashboard://abc`) this item references */

Member

This feels like a "here be dragons" - as a future note, we should probably make sure that we add validators for an appropriate URI format here, to ensure that we don't pollute the contents of this field.

I recall that we had some discussions here about additional data that we might want: we may need more than just the URI, including the ability to explain exactly what the relationship and correlation is. Where did we end up with that investigation, and should we consider making references more explicit here?
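
A validator along the lines suggested above might look like the sketch below. The accepted grammar (`<type>://<id>`, lowercase type, non-empty id without whitespace) is an assumption extrapolated from the `dashboard://abc` example; the PR does not specify a URI format, and the names are hypothetical:

```typescript
// Assumed shape: lowercase type, '://', then a non-empty id with no whitespace.
const SML_REFERENCE_URI_SKETCH = /^([a-z][a-z0-9_-]*):\/\/(\S+)$/;

interface SmlReferenceSketch {
  type: string;
  id: string;
}

// Parse a reference URI, returning null when it does not match the assumed shape.
function parseSmlReference(ref: string): SmlReferenceSketch | null {
  const match = SML_REFERENCE_URI_SKETCH.exec(ref);
  return match ? { type: match[1], id: match[2] } : null;
}

// Convenience predicate for write-time validation.
function isSmlReferenceUri(ref: string): boolean {
  return parseSmlReference(ref) !== null;
}
```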

Replace the single `unified_semantic` copy_to aggregator with per-field
semantic_text mirrors (`title_semantic`, `description_semantic`,
`content_semantic`). The search query now uses the `rrf` retriever with
`query`/`fields` for the semantic side and a `standard` sub-retriever for
BM25 + SAYT prefix matching.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
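
A rough sketch of the request shape this commit message describes: an outer `rrf` retriever combining a semantic side, which uses the `query`/`fields` form over the per-field mirrors, with a `standard` sub-retriever for BM25 plus SAYT prefix matching. The function name, the exact nesting, and the SAYT field list are assumptions; the commit's actual query is not shown here and this body has not been run against a cluster:

```typescript
// Hypothetical builder; mirrors the commit message, not the commit's code.
function buildRrfRetrieverSketch(queryText: string) {
  return {
    retriever: {
      rrf: {
        retrievers: [
          // Semantic side: one query fanned out over the semantic_text mirrors.
          {
            rrf: {
              query: queryText,
              fields: ['title_semantic', 'description_semantic', 'content_semantic'],
            },
          },
          // Lexical side: BM25 on the text fields plus prefix matching.
          {
            standard: {
              query: {
                bool: {
                  should: [
                    {
                      multi_match: {
                        query: queryText,
                        type: 'bool_prefix',
                        // Assumption: SAYT field list, as in the earlier commit.
                        fields: ['title_autocomplete', 'title_autocomplete._2gram', 'title_autocomplete._3gram'],
                      },
                    },
                    { match: { title: queryText } },
                    { match: { description: queryText } },
                    { match: { content: queryText } },
                  ],
                  minimum_should_match: 1,
                },
              },
            },
          },
        ],
      },
    },
  };
}

const rrfBodyExample = buildRrfRetrieverSketch('cpu dashboard');
```

Keeping the semantic fields separate in this shape is what allows the per-field experiments raised in review, such as dropping `title_semantic` from `fields`.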