Skip to content

[Agent Builder] [SML] Retrieval layer: schema, autocomplete, hybrid search #269277

Draft
Apmats wants to merge 11 commits into
elastic:mainfrom
Apmats:apmats/sml-search-api
Draft

[Agent Builder] [SML] Retrieval layer: schema, autocomplete, hybrid search #269277
Apmats wants to merge 11 commits into
elastic:mainfrom
Apmats:apmats/sml-search-api

Conversation

@Apmats
Copy link
Copy Markdown
Contributor

@Apmats Apmats commented May 14, 2026

Summary

This PR consolidates three stacked PRs into a single reviewable diff:

Apmats and others added 11 commits May 11, 2026 03:08
Builds on Peter's merged elastic#266573 by adding the schema fields the team
converged on and refactoring title/description/content for BM25 + a
single unified vector retrieval surface.

Fields added:
- tags (keyword[], free-form labels)
- payload (flattened, type-specific opaque data)
- title_autocomplete (search_as_you_type, copy target of title)
- unified_semantic (semantic_text, copy target of title/description/content)

Behavior changes:
- title becomes `text` with copy_to fanning into title_autocomplete and
  unified_semantic. Three retrieval modes (BM25/lexical, prefix/typeahead,
  semantic) from one producer-set field.
- description and content become `text` with copy_to: 'unified_semantic'.
  One inference pass per record instead of three; recall doesn't
  fragment across overlapping content.
- buildSmlSearchQuery: SAYT field paths move from title.* to
  title_autocomplete.*; the should-block uses match: { unified_semantic }
  in place of separate matches on content and description.

origin_id is unchanged. An earlier draft of this PR added a parallel
`origin` URI field per the gist's vision but it was dropped: the URI
form is computable on the fly from type + origin_id, Sean's merged
buildTypeFilters targets origin_id directly, and adding a parallel
representation would just be redundant.

Type writers need no changes: they keep returning the same SmlChunk
shape; ES copy_to handles the fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single `unified_semantic` copy_to aggregator with per-field
semantic_text mirrors (`title_semantic`, `description_semantic`,
`content_semantic`). The search query now uses the `rrf` retriever with
`query`/`fields` for the semantic side and a `standard` sub-retriever for
BM25 + SAYT prefix matching.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the @ menu / typeahead query path on top of the schema PR. The
autocomplete surface is unified through `discovery_labels` — title and
type are no longer special-cased.

discovery_labels: nested { value, kind }
  - `value.autocomplete` is a `text` field with a custom edge-ngram analyzer
    (NOT `search_as_you_type` — see WHY below).
  - The indexer auto-prepends `{value: chunk.title, kind: 'title'}` and
    `{value: chunk.type, kind: 'type'}` to every record, plus any entries
    the producer provides (taglines, nicknames, categories, etc.). Type
    writers don't change their SmlChunk shape.

Drops `title_autocomplete` (top-level SAYT field) and the `.autocomplete`
subfield from `type`. Title's and type's autocomplete reach now goes
through their auto-prepended discovery_labels entries; `type` stays as
plain keyword for exact filtering.

POST /internal/agent_context_layer/sml/_autocomplete — dedicated route
for the @ menu, separate from /sml/_search.

buildSmlAutocompleteQuery: a single nested query against
discovery_labels.value.autocomplete with `match` + `operator: and` (every
typed token must match). `inner_hits` returns the matched entries with
their `kind` and an ES `unified`-highlighter snippet wrapping the
matched word(s) in <em>...</em>.

Response: { id, type, title, origin_id, matched_discovery_labels? }.
No content, no description, no score — order is sufficient; UI gets
exactly what it needs to render a typeahead row with badges.

filterResultsByPermissions is generic over the result type and shared
with the search path.

SAYT (`type: 'search_as_you_type'`) is the obvious first choice for an
autocomplete field. It does not produce useful highlights for prefix
queries in a nested context. Confirmed in two ways:

  1. Empirical: tested unified, plain, fvh highlighters; tested
     `require_field_match: false`, `highlight_query: prefix`, term_vector
     with offsets, top-level vs inner_hits highlight — none returned a
     snippet for nested + SAYT + bool_prefix on prefix-only matches.
  2. Documented: this is the bug behind elastic/elasticsearch#53744
     ("search_as_you_type, prefix queries and broken highlighting"),
     open since 2020-03, explicitly low-priority per the ES Search team
     lead at the time. Root cause documented in elastic#49069 ("Highlight with
     index_prefixes"): SAYT's `_index_prefix` subfield indexes
     position-spanning ngrams with offsets that span from the prefix
     start to the end of the entire field, which the highlighter can't
     wrap per-word. The ES-community workaround (see
     elastic/elasticsearch#53744 comment by peterbe) is exactly what
     this PR does: use a regular text field with a custom analyzer.

The custom analyzer: standard tokenizer + lowercase + edge_ngram filter
(1–20) at index time; standard + lowercase at search time. Per-token
edge_ngrams preserve each word's offsets, so the highlighter wraps the
matched word correctly — verified live for queries like "git" →
"<em>github</em>", "github c" → "<em>GitHub</em> <em>Connector</em>",
"prod logs ana" → "<em>production</em> <em>logs</em> <em>analytics</em>".

Trade-offs vs SAYT:
  - Smaller inverted index (one field, not four).
  - Comparable query latency (posting-list lookup per token).
  - Loses SAYT's _2gram/_3gram shingle subfields used for phrase-prefix
    scoring. BM25 on edge_ngram tokens is good enough for typeahead
    ranking.

Adds an optional `analysis` field to `IndexStorageSettings`, threaded
through to the index template's `settings.analysis` block. Required so
that consumers (like SML here) can declare custom analyzers without
bypassing the storage adapter's bootstrap flow. Backwards-compatible —
field is optional. Ignored on serverless ES (no index-level settings).

Owners: @elastic/observability-ui, @elastic/kibana-core.

- Service test: query shape (match + operator: and + nested + inner_hits
  + highlight), inner_hits mapping into matched_discovery_labels,
  highlight passthrough.
- Route test: response shape, no permissions/spaces/content leak.
- FTR (real-ES): /sml/_autocomplete against an indexed record with
  title / type / tagline discovery_labels; asserts matched_discovery_labels,
  kind, and that the highlight wraps the matched word.
- 121/121 jest in agent_context_layer; 14/14 jest in kbn-storage-adapter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stacks on 03da4f4 (which introduced custom edge_ngram analyzer for
discovery_labels.value.autocomplete to get nested-context highlights
working). The edge_ngram approach highlights correctly but is loose:
every typed token behaves as a prefix at index time, so "github c"
matches "Githubster Cup" (because "github" is an indexed edge_ngram of
"githubster"). Verified live.

This commit pivots discovery_labels.value to `search_as_you_type` with
`multi_match bool_prefix operator:and`. SAYT's native query type gives
tight per-token semantics: all-but-last analyzed tokens require exact
indexed-term match (against ES's auto-generated _2gram/_3gram shingle
subfields); only the last (trailing partial) is treated as a prefix
(against _index_prefix). Net: "github c" no longer matches "Githubster
Cup".

Trade-off: highlights. SAYT + bool_prefix + nested + inner_hits is the
exact case behind ES bug elastic/elasticsearch#53744 (open since 2020),
so `matched_discovery_labels` entries return without `highlighted`. The
route still configures the highlighter (forward-compat when the bug
lands); UI falls back to rendering plain `value` when `highlighted` is
absent.

Alternative considered (not taken in this PR): hybrid AND —
  bool { must: [match(edge_ngram), match_bool_prefix(plain text)] }
Live-tested to give tight semantics AND working highlights (from
the edge_ngram clause). Costs an extra plain-text subfield on
`discovery_labels.value` (~5% index footprint with index_options:freqs
norms:false) and still has the "git c" mid-typing gap where
match_bool_prefix demands a complete indexed token. Slotted for PR3 if
user feedback warrants.

Removes the custom edge_ngram analyzers and the storageSettings
`analysis` block — SAYT needs no custom analysis.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nt connector filter support

Two changes that ship together because they touch the same FE surface:

1. Wire the @ menu typeahead to POST /sml/_autocomplete (instead of
   the previously used POST /sml/_search). Adds:
   - `smlService.autocomplete({ query, size, filters? })`
   - `queryKeys.sml.autocomplete(query, filters?)`
   - `useSmlAutocomplete(query, { filters? })` hook
   - `render_sml_highlight.tsx` helper that parses ES highlight snippets
     (`<em>...</em>`) into React fragments — currently inert on the SAYT
     path because of ES bug elastic#53744, forward-compatible when it lands.
   - `Sml` component switches to the new hook and renders per-row
     `matched_discovery_labels` (`kind`-aware) instead of client-side
     `EuiHighlight` on `${type}/${title}`.
   - Drops `sml_command_menu_highlight.{ts,test.ts}` — the old
     client-side highlight parser is no longer needed.

2. Restore agent-centric connector scoping (Sean's
   `agent-centric connectors` feature, elastic#267333) on the autocomplete
   route. The earlier autocomplete commit (03da4f4) added the route
   without a `filters` parameter, which silently broke the @ menu's
   respect for `agent.configuration.connector_ids`. This commit:
   - BE: adds `filters?: SmlSearchFilters` to the autocomplete route
     schema and threads it through `SmlService.autocomplete` and
     `autocompleteSml`, which now calls the existing `buildTypeFilters`
     helper (shared with the search path) so the filter clause is
     identical between routes.
   - FE: `useSmlAutocomplete` accepts `{ filters? }`,
     `usePrefetchSml(filters?)` regains the filter param, the
     command-menu orchestrator re-prefetches SML when the agent's
     filters change after the async agent fetch resolves.
   - Tests: new BE unit test asserting filters flow into the ES filter
     clauses via `buildTypeFilters`; new FE hook test covering the
     filters arg; existing `sml.test.tsx` + `use_prefetch_sml.test.tsx`
     updated to the new mocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…@ menu

The previous commit on this branch (b46c879) rewrote the @ menu
rendering to use the server's `matched_discovery_labels` and a custom
`<em>...</em>` parser. That assumed working server-side highlights —
true for the edge_ngram iteration, but the SAYT pivot (202ce94) makes
ES return no highlight snippets (elastic#53744), so the new path renders
plain text with no visual indication of what matched.

The new `SmlAutocompleteHttpResultItem` still has `type` and `title`
top-level, which is everything the original client-side EuiHighlight
rendering needed. Restoring it gets us back the "bold the typed query
in the displayed row" UX with no schema or route changes:

- `sml.tsx` back to two `<EuiHighlight>` blocks over `type` and `title`,
  using the same `getSmlMenuHighlightSearchStrings` parser.
- Restored `sml_command_menu_highlight.{ts,test.ts}`.
- Dropped `render_sml_highlight.tsx` — no longer used. (Forward-compat
  for elastic#53744 isn't load-bearing; whoever revives server-side highlights
  can write it then.)
- `matched_discovery_labels` is still returned by the BE and still on
  the FE result type — FE engineers can switch the rendering to use it
  whenever they want kind-aware badges.

Hook + filter wiring (useSmlAutocomplete with agent connector filters)
is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier autocomplete commit on this branch added an optional
`analysis` field to `IndexStorageSettings` so the (then-planned) custom
edge_ngram analyzer could be declared through the storage adapter. The
SAYT pivot on the following commit dropped the custom analyzer
entirely — SAYT brings its own auto-generated subfields and doesn't
need index-level `analysis`.

`grep` confirms no SML file references `analysis` today, and no other
consumer in the tree was planning to use the hook. Reverting the two
storage-adapter files to their main-branch state removes ~18 lines of
unused code and drops `@elastic/observability-ui` / `@elastic/kibana-core`
from the PR's required reviewers, since this PR no longer touches their
package.

If a future consumer needs to declare custom analyzers through the
storage adapter, the hook can be reintroduced then with a real motivating
use case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… abandoned iterations

Two doc-comment blocks (one in sml_storage's discovery_labels mapping,
one in sml_service's buildSmlAutocompleteQuery) carried a tail referring
to the custom edge_ngram approach and hybrid-AND alternative — iterations
we tried during this PR and did not ship. These comments rot the moment
the PR closes: the "PR description" reference becomes opaque, and the
"earlier commit" comparison no longer maps to anything visible in the
file.

The technical content stays: SAYT mechanics, the elastic#53744 limitation, what
`matched_discovery_labels.highlighted` means. That's still useful for
future readers and isn't visible from the code alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lastic#14363)

Land the natural-language search API over SML, replacing the transitional
bool.should mash with a proper RRF retriever and introducing the
trust-boundary split between runtime-imposed scoping and
agent-discoverable filters.

Decisions per the implementation prompt §6:

- §6.1 Retrieval: RRF over BM25 (title^2, description, content via
  best_fields multi_match) and semantic (unified_semantic). Filters
  mirrored to each child retriever so RRF can't pull unauthorized docs
  into the fused top-k. Defaults rank_constant=60, rank_window_size=50.
- §6.2 Filter merging: option (2). Service signature splits
  `scoping?: SmlSearchScoping` (runtime-imposed, unconditional) from
  `filters?: SmlSearchFilters` (caller refinement). Trust boundary
  visible at the API layer; merge happens in one place. Sean's per-type
  id-allowlist shape (elastic#267333) survives unchanged as `SmlSearchScoping`.
- §6.3 Agent-supplied dimensions: `types?: string[]`, `tags?: string[]`,
  each lowering into a `terms` clause. Time-range and free-form payload
  filters deferred.
- §6.4 Compact LLM response: SmlSearchResult drops the `content` blob,
  full `payload`, dates, and discovery_labels; gains `more_content`
  (true when the indexed record has non-empty content worth fetching
  via sml_read). `permissions` retained internally for post-hoc
  filtering, stripped from the HTTP response.
- §6.5 Reference resolution: URI strings only — SML does not
  dereference. Inline-expansion deferred until eval data justifies it
  (Reading 1; Sean's "SML shouldn't turn pointers into blobs").
- §6.6 Tool description rewrite with when-to-use guidance and four
  worked examples (plain query, types+tags filter, connector
  restriction, wildcard inventory). Runtime scoping is not exposed in
  the LLM input schema.

Fixes the `total` bug: post-permission-filter `results.length` no
longer overwrites ES `hits.total.value`.

Frontend / HTTP route updates:
- POST /sml/_search body accepts `scoping?` and `filters?` (replaces
  the old `filters` per-type id-allowlist field and `skip_content`).
- POST /sml/_autocomplete renames `filters` → `scoping`; no
  agent-discoverable filters on this route (no LLM in the path).
- Frontend SmlService, useSmlSearch, useSmlAutocomplete, usePrefetchSml,
  queryKeys, and buildSmlScopingFromAgent (formerly
  buildSmlFiltersFromAgent) all follow the new naming.

Tests:
- New unit tests for RRF body, filter threading through child
  retrievers, vocabulary-mismatch over the semantic leg, more_content
  on/off, compact result mapping, agent-filter builder, and tool
  wrapper scoping/filters mapping.
- FTR + Scout integration tests updated to drop content/skip_content
  assertions and lock in the no-content invariant.

Out of scope (deferred per epic): reranker, kneedle, per-type
retriever profiles, HyDE/query expansion, reference inline-expansion,
telemetry instrumentation (elastic#14366), recrawl/versioning (elastic#14367).

Open coordination points (not blocking this PR):
- Inference model on unified_semantic still relies on defaults
  (Kathleen Mar 19: pin Jina V5). Belongs to A.
- Kathleen review on A's PR proposes splitting unified_semantic into
  per-field semantic_text fields. If A reverts unified, the semantic
  leg in this PR fans out from 1 to 2-3 child retrievers — small
  retriever-only change.
- Pierre's configuration_override-aware tool handler context
  (elastic#267333 review) — wrapper currently reads agentConfiguration from
  the tool handler context, which `runner.ts` already resolves with
  overrides applied via `resolveConfiguration`. Note in code.

Stacked on top of apmats/sml-autocomplete-followup (elastic#14364). Once
autocomplete merges, this PR rebases to main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ers on autocomplete

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… scoping semantics

Switch all three semantic_text mirrors (title_semantic, description_semantic,
content_semantic) from the default HNSW to disk-based BBQ quantization. Agents
are latency-tolerant and the SML index can grow large — bbq_disk avoids the
per-field HNSW graph memory overhead that comes with N separate semantic fields.

Also adds a comment to SmlSearchScoping documenting the cross-type OR semantics
and the id-allowlist complexity ceiling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@infra-vault-gh-plugin-prod
Copy link
Copy Markdown

🤖 Jobs for this PR can be triggered through checkboxes. 🚧

ℹ️ To trigger the CI, please tick the checkbox below 👇

  • Click to trigger kibana-pull-request for this PR!
  • Click to trigger kibana-deploy-project-from-pr for this PR!
  • Click to trigger kibana-deploy-cloud-from-pr for this PR!
  • Click to trigger kibana-entity-store-performance-from-pr for this PR!
  • Click to trigger kibana-storybooks-from-pr for this PR!

@Apmats Apmats changed the title [Agent Builder] [SML] Retrieval layer: schema, autocomplete, hybrid search + PIT loop [Agent Builder] [SML] Retrieval layer: schema, autocomplete, hybrid search May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant