[Agent Builder] [SML] Define schema + mappings #268485
810f1df to c97a1a5
Builds on Peter's merged elastic#266573 by adding the schema fields the team converged on and refactoring title/description/content for BM25 + a single unified vector retrieval surface.

Fields added:
- tags (keyword[], free-form labels)
- payload (flattened, type-specific opaque data)
- title_autocomplete (search_as_you_type, copy target of title)
- unified_semantic (semantic_text, copy target of title/description/content)

Behavior changes:
- title becomes `text` with copy_to fanning into title_autocomplete and unified_semantic. Three retrieval modes (BM25/lexical, prefix/typeahead, semantic) from one producer-set field.
- description and content become `text` with copy_to: 'unified_semantic'. One inference pass per record instead of three; recall doesn't fragment across overlapping content.
- buildSmlSearchQuery: SAYT field paths move from title.* to title_autocomplete.*; the should-block uses match: { unified_semantic } in place of separate matches on content and description.

origin_id is unchanged. An earlier draft of this PR added a parallel `origin` URI field per the gist's vision but it was dropped: the URI form is computable on the fly from type + origin_id, Sean's merged buildTypeFilters targets origin_id directly, and adding a parallel representation would just be redundant.

Type writers need no changes: they keep returning the same SmlChunk shape; ES copy_to handles the fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
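For readers who think in raw Elasticsearch mappings, the fan-out described in this commit message could look roughly like the sketch below. It is a plain-object approximation, not the PR's `types.*` storage schema; `smlMappingSketch`, `FieldDef`, and `sourcesOf` are hypothetical names introduced only for illustration.

```typescript
// Hypothetical plain-mapping view of the fields named in the commit message.
type FieldDef = { type: string; copy_to?: string | string[] };

const smlMappingSketch: { properties: Record<string, FieldDef> } = {
  properties: {
    // Canonical producer-set text; copy_to feeds the two derived fields.
    title: { type: 'text', copy_to: ['title_autocomplete', 'unified_semantic'] },
    title_autocomplete: { type: 'search_as_you_type' },
    description: { type: 'text', copy_to: 'unified_semantic' },
    content: { type: 'text', copy_to: 'unified_semantic' },
    unified_semantic: { type: 'semantic_text' },
    tags: { type: 'keyword' },
    payload: { type: 'flattened' },
  },
};

// List the fields whose copy_to includes the given target field.
function sourcesOf(target: string): string[] {
  return Object.entries(smlMappingSketch.properties)
    .filter(([, def]) => ([] as string[]).concat(def.copy_to ?? []).includes(target))
    .map(([name]) => name);
}
```

Calling `sourcesOf('unified_semantic')` on this sketch lists the three producer text fields, which is the "one inference pass per record" surface the message refers to.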
c97a1a5 to 4902b1b
@@ -23,12 +23,16 @@ const smlStorageSchemaProperties = {
       autocomplete: types.search_as_you_type({}),
     },
   }),
-  title: types.search_as_you_type({}),
+  title: types.text({ copy_to: ['title_autocomplete', 'unified_semantic'] }),
+  title_autocomplete: types.search_as_you_type({}),
There was a problem hiding this comment.
Could we reconsider making title a multi-field? semantic_text does now support multi-fields, see: https://www.elastic.co/docs/reference/elasticsearch/mapping-reference/semantic-text-ingestions#use-multi-fields
There was a problem hiding this comment.
Noted, use multi-fields when able.
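A rough sketch of what the multi-field alternative might look like, written as a plain Elasticsearch-style mapping object. It assumes the `semantic_text` multi-field support the reviewer links to is available on the target cluster; the sub-field names `autocomplete` and `semantic`, and the helper `subFieldPaths`, are hypothetical.

```typescript
// Hypothetical multi-field layout: one `title` field with derived sub-fields
// instead of top-level copy_to targets. Assumes semantic_text may be a sub-field.
const titleMultiFieldSketch = {
  title: {
    type: 'text',
    fields: {
      autocomplete: { type: 'search_as_you_type' },
      semantic: { type: 'semantic_text' },
    },
  },
};

// Derive the query paths for a field's sub-fields, e.g. `title.semantic`.
function subFieldPaths(parent: string, def: { fields?: Record<string, unknown> }): string[] {
  return Object.keys(def.fields ?? {}).map((sub) => `${parent}.${sub}`);
}
```

With this layout queries would target `title.autocomplete` and `title.semantic` rather than separate top-level fields.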
  user_id: types.keyword({}),
  content: types.text({ copy_to: 'unified_semantic' }),
  description: types.text({ copy_to: 'unified_semantic' }),
  unified_semantic: types.semantic_text({}),
There was a problem hiding this comment.
I realize that you're trying to optimize for inference calls here, but I have some questions about whether we really want to break these up in practice.
- Inference costs should scale with total length, and as we send inference calls in bulk, the overhead shouldn't be something that we worry about
- There is a potential for relevance pitfalls here, for example if we want to boost title content more highly or use description as a boost field only in a hybrid query, etc. This means we can't experiment with any of those optimizations.
I'd at least like to test these out before we commit.
There was a problem hiding this comment.
Actually I'm with you and I wanted to bring this up in review. I was trying to read the room though, and from discussions it seemed like most everyone assumed we would accumulate all text for a semantic field. E.g. your message here
The way I see it, keeping them separate allows us to do all the experiments you just described, e.g. drop title if titles are not friendly to semantic search because they are essentially system-provided for certain objects.
The inference argument IMO is ass; it's just Claude filling in the gaps.
My real concern is the HNSW overhead. Every separate semantic-searchable field getting its own HNSW graph means more memory usage, and depending on how far SML goes we might be needlessly introducing scaling problems: for simple SML objects that wouldn't chunk, we're going from [1 vector entry into 1 graph] to [N (now 3: title, content, description; maybe more down the line) into N graphs].
I've struggled with scaling HNSW search before in multiple cases; this probably needs to be a conscious decision.
There was a problem hiding this comment.
Fair point 😅 I have some recency bias, recently talked to some people who backed themselves into a relevance corner doing that.
There was a problem hiding this comment.
Whatever we do we should default to disk BBQ and not HNSW though
There was a problem hiding this comment.
Noted on DiskBBQ. Since the primary users of this interface are agents, they're also less sensitive to latency. I do need to catch up on the details around resource utilization, latency, etc. for DiskBBQ myself though.
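The scaling concern in this thread can be put as a back-of-envelope count. The sketch below only multiplies records by semantic fields and chunks; it ignores shards, segments, quantization, and the index type actually chosen, and `vectorFootprint` is a hypothetical helper, not part of the PR.

```typescript
// Rough count of vector entries and separate vector indices ("graphs") for a
// given number of records, assuming each semantic field yields `chunksPerField`
// embeddings per record.
function vectorFootprint(
  records: number,
  semanticFields: number,
  chunksPerField = 1
): { graphs: number; vectors: number } {
  return {
    graphs: semanticFields,
    vectors: records * semanticFields * chunksPerField,
  };
}

// One unified field versus three per-field mirrors, for records that do not chunk.
const unified = vectorFootprint(1_000_000, 1);
const perField = vectorFootprint(1_000_000, 3);
```

Under these assumptions the per-field layout stores three times the vectors across three indices, which is the trade-off against the relevance flexibility discussed above.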
  content: types.text({ copy_to: 'unified_semantic' }),
  description: types.text({ copy_to: 'unified_semantic' }),
  unified_semantic: types.semantic_text({}),
  tags: types.keyword({}),
There was a problem hiding this comment.
Should we add some common-sense normalizers here, like lowercase?
There was a problem hiding this comment.
Good question. On tags you mean specifically? Lowercase probably makes sense, maybe some folding. Will figure out.
There was a problem hiding this comment.
++ was talking about tags, it could be worth doing a pass over the schema and seeing if other fields would benefit, but tags would definitely benefit the most I think.
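One way the lowercase-plus-folding idea could be expressed is a custom keyword normalizer, sketched below as plain Elasticsearch index settings and mappings. The normalizer name `sml_tag_normalizer` is hypothetical, and whether the SML storage schema can carry custom analysis settings is an assumption. `normalizeTag` is a client-side approximation of the same behaviour for illustration only; it is not identical to the `asciifolding` filter.

```typescript
// Hypothetical index settings + mapping: keyword tags normalized at index and query time.
const tagsIndexSketch = {
  settings: {
    analysis: {
      normalizer: {
        sml_tag_normalizer: { type: 'custom', filter: ['lowercase', 'asciifolding'] },
      },
    },
  },
  mappings: {
    properties: {
      tags: { type: 'keyword', normalizer: 'sml_tag_normalizer' },
    },
  },
};

// Approximate the normalizer locally: strip combining accents, then lowercase.
function normalizeTag(tag: string): string {
  return tag
    .normalize('NFKD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}
```

With a normalizer on the field, `Prod` and `prod` would index to the same term, so tag filters stop depending on how each type writer capitalizes its labels.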
   * keyword-searchable for sub-path filtering. SML treats this opaquely;
   * type writers own its shape.
   */
  payload?: Record<string, unknown>;
There was a problem hiding this comment.
I'm good with flattened here for the reasons you laid out, the other option would be nested but that comes with its own dragons.
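For context on what `flattened` buys here: leaf values are indexed as keywords and can be filtered by dotted sub-path. A minimal sketch of such a filter follows; `payloadTermFilter` and the example key `panel.type` are hypothetical, since the payload shape is owned by each type writer.

```typescript
// Build a term filter against a sub-path of the flattened `payload` field.
function payloadTermFilter(path: string, value: string): { term: Record<string, string> } {
  return { term: { [`payload.${path}`]: value } };
}

// Example: restrict results to chunks whose payload has panel.type = "lens".
const lensOnly = payloadTermFilter('panel.type', 'lens');
```

Because every leaf is treated as a keyword, this supports exact-match filtering only; numeric ranges or full-text search over payload values would need a different mapping.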
   /** Owner or last-modifier user id when known */
   user_id?: string;
-  /** Other SML chunk ids this item references */
+  /** Other SML chunk ids (URI form preferred, e.g. `dashboard://abc`) this item references */
There was a problem hiding this comment.
This feels like a "here be dragons" - as a future note, we should probably make sure that we add validators for an appropriate URI format here, to ensure that we don't pollute the contents of this field.
I recall that we had some discussions here on additional data that we might want - we may need more than just the URI but have the ability to explain exactly what the relationship and correlation here is. Where did we end up with that investigation, and should we consider making references more explicit here?
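A validator along the lines suggested here might start as small as the sketch below. The accepted format (lowercase scheme, `://`, non-empty id without whitespace) is an assumption extrapolated from the `dashboard://abc` example; the names `SML_REFERENCE_URI`, `isValidReference`, and `partitionReferences` are hypothetical.

```typescript
// Assumed reference format: <lowercase-type>://<non-empty id without whitespace>
const SML_REFERENCE_URI = /^[a-z][a-z0-9_-]*:\/\/\S+$/;

function isValidReference(ref: string): boolean {
  return SML_REFERENCE_URI.test(ref);
}

// Split a references array so invalid entries can be rejected or logged before indexing.
function partitionReferences(refs: string[]): { valid: string[]; invalid: string[] } {
  const valid: string[] = [];
  const invalid: string[] = [];
  for (const ref of refs) {
    (isValidReference(ref) ? valid : invalid).push(ref);
  }
  return { valid, invalid };
}
```

If references later need to carry relationship metadata, as raised above, a string-only check like this would have to give way to a structured type.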
Replace the single `unified_semantic` copy_to aggregator with per-field semantic_text mirrors (`title_semantic`, `description_semantic`, `content_semantic`). The search query now uses the `rrf` retriever with `query`/`fields` for the semantic side and a `standard` sub-retriever for BM25 + SAYT prefix matching.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
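The commit message above does not show the resulting request, so the following is only one plausible shape. It assumes the semantic side is a nested `rrf` retriever using the `query`/`fields` form and that it is fused with a `standard` lexical retriever; the exact nesting, the SAYT field list, and the name `buildSmlRetrieverSketch` are assumptions, not the PR's code.

```typescript
// Hypothetical request body: lexical (BM25 + prefix) and semantic retrievers fused with RRF.
function buildSmlRetrieverSketch(queryText: string): Record<string, unknown> {
  const lexical = {
    standard: {
      query: {
        bool: {
          should: [
            // Typeahead / prefix matching on the search_as_you_type field.
            {
              multi_match: {
                query: queryText,
                type: 'bool_prefix',
                fields: ['title_autocomplete', 'title_autocomplete._2gram', 'title_autocomplete._3gram'],
              },
            },
            // BM25 on the canonical text fields.
            { multi_match: { query: queryText, fields: ['title', 'description', 'content'] } },
          ],
        },
      },
    },
  };
  // Semantic side: per-field semantic_text mirrors queried together.
  const semantic = {
    rrf: { query: queryText, fields: ['title_semantic', 'description_semantic', 'content_semantic'] },
  };
  return { retriever: { rrf: { retrievers: [lexical, semantic] } } };
}
```

Keeping the semantic fields separate in the `fields` list is what leaves room for the per-field boosting experiments raised in review.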
Summary
Resolves search-team#14362. Adds the schema fields the SML team converged on, refactors `title`/`description`/`content` to support BM25 + prefix typeahead + a single unified vector retrieval surface, and rewires `buildSmlSearchQuery` to match.

Schema changes inevitably touch the query path (the field references in `multi_match` and the `match` clauses against `content`/`description` move). This PR includes the minimum query rewiring needed for the schema to function as intended. Search-API follow-up (search-team#14363) refines from there: splitting autocomplete from full retrieval, field weighting, etc.

This is the first of three planned PRs:
- `discovery_labels` and the @ menu query path.
- `sml_search` retrieval (codepath split, field weighting, hybrid scoring tuning).

Fields added
- `tags`: `keyword[]`
- `payload`: `flattened`
- `title_autocomplete`: `search_as_you_type`. `copy_to` target of `title`. Powers @ menu / typeahead via the SAYT subfields (`._2gram`, `._3gram`, `._index_prefix`).
- `unified_semantic`: `semantic_text`. `copy_to` target of `title`, `description`, and `content`. Single vector index per record: one inference pass, single recall surface.

Field changes
- `title`: `search_as_you_type` → `text` with `copy_to: ['title_autocomplete', 'unified_semantic']`. BM25 on `title`, typeahead on `title_autocomplete`, semantic on `unified_semantic`. Each independently tunable.
- `description`: `semantic_text` → `text` with `copy_to: 'unified_semantic'`
- `content`: `semantic_text` → `text` with `copy_to: 'unified_semantic'`

`_source.description` and `_source.content` still hold the producer text. `_source.unified_semantic` does not exist (`copy_to` populates the inverted/vector index, not `_source`); the field is queryable via `match` but not retrievable.

Why one unified semantic field
`multi_match: { fields: ['title^3', 'description^2', 'content'] }` remains possible: each text field is individually scored.

Why three fields for title
`semantic_text` cannot be a multi-field, so the BM25 / typeahead / semantic split has to live at the top level. `title` is the canonical text; `title_autocomplete` and `unified_semantic` are derived via `copy_to`. Each field is independently tunable: analyzer for BM25, shingles for autocomplete, inference model for semantic.

Decisions
`payload` mapping: `flattened`
Sub-path keyword filtering, no mapping explosion. `disabled` is the fallback if even keyword filtering turns out unneeded. `object` rejected: `dynamic: true` invites mapping explosion, `dynamic: 'strict'` would require a per-type schema registry that the team explicitly doesn't want.

`tags` mapping: plain `keyword`
Multi-value. Add a `search_as_you_type` subfield later only if tag-name autocomplete becomes a real need.

Query path changes
`buildSmlSearchQuery`'s `should` block now contains:
- `multi_match: bool_prefix` against `SML_SEARCH_AS_YOU_TYPE_FIELDS`: typeahead/prefix on `title_autocomplete.*` and `type.autocomplete.*`. (Previously this list also included bare `title`; dropped because `title` is no longer a SAYT field.)
- `match: { title }`: BM25 on the canonical text field (added; previously `title` contributed via the SAYT bool_prefix).
- `match: { description }` and `match: { content }`: BM25 on the per-field text. (Previously these were `match` clauses against semantic_text fields, i.e. semantic retrieval. Now they're BM25 since the fields became `text`.)
- `match: { unified_semantic }`: semantic vector retrieval. Replaces the prior semantic matches on `content` and `description`; coverage is the same (the unified field aggregates both via `copy_to`) with one inference output rather than two.

Existing type writers (dashboard, visualization, connector, rule, etc.) need no changes: they keep returning the same `SmlChunk` shape; ES `copy_to` handles the fan-out.

Out of scope
- `discovery_labels` (categorical/nickname terms backing the @ menu): autocomplete follow-up PR.
- `last_accessed_at` / `access_count`: not added. "Accessed" semantics are undefined, and bumping a counter on every search hit is write-amplification on a hot path. ES mapping append is non-breaking, so these can be added later when there is a defined writer.
- `sml_read` / lookup endpoint: search-team#14365.

Test plan
- `node scripts/eslint --fix` on touched files.
- `node scripts/type_check --project x-pack/platform/plugins/shared/agent_context_layer/tsconfig.json`.
- `node scripts/jest x-pack/platform/plugins/shared/agent_context_layer/server/services/sml/`.
- `agent_builder_api_integration` SML smoke test exercises the `smlElasticsearchIndexMappings` export end-to-end.
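
As an illustration of the `should` block described in this summary, a minimal sketch follows. It is not the PR's `buildSmlSearchQuery`: the function name `buildShouldClausesSketch` and the expanded field list `SAYT_FIELDS_SKETCH` are hypothetical stand-ins (the summary only names `title_autocomplete.*` and `type.autocomplete.*`).

```typescript
// Hypothetical stand-in for SML_SEARCH_AS_YOU_TYPE_FIELDS after this PR.
const SAYT_FIELDS_SKETCH: string[] = [
  'title_autocomplete',
  'title_autocomplete._2gram',
  'title_autocomplete._3gram',
  'type.autocomplete',
  'type.autocomplete._2gram',
  'type.autocomplete._3gram',
];

// Build the should clauses in the order the summary lists them.
function buildShouldClausesSketch(queryText: string): Array<Record<string, unknown>> {
  return [
    // Typeahead / prefix matching on the search_as_you_type fields.
    { multi_match: { query: queryText, type: 'bool_prefix', fields: SAYT_FIELDS_SKETCH } },
    // BM25 on the canonical text fields.
    { match: { title: queryText } },
    { match: { description: queryText } },
    { match: { content: queryText } },
    // Semantic retrieval on the copy_to aggregate.
    { match: { unified_semantic: queryText } },
  ];
}
```

Each clause contributes independently to the score, so the lexical and semantic sides can be weighted or split later without changing what type writers emit.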