Description
Is your feature request related to a problem? Please describe
The current execution path is very straightforward: we take the DSL, run it through SearchModule’s registered QueryParsers, and end up with a QueryBuilder tree. We then call toQuery() on that tree to build Lucene Query objects, hand them to Lucene, and Lucene decides execution order at the scorer level.
Lucene’s scorer ordering comes from Weight.scorer() and ScorerSupplier.cost(), but those signals are local to each scorer and don’t give us cross-clause control. There’s no point in the pipeline where we normalize the query tree, estimate cost across clauses, or choose an execution strategy before Lucene starts pulling scorers.
That means:
- We can’t ensure that the most selective filters run first.
- We can’t prefer vector-first or term-first for hybrid queries.
- We have no central place to collapse redundant clauses or apply systematic rewrites. Current rewrites happen in `QueryPhase.preProcess()` and we should ideally centralise them.
- We can’t explain to ourselves or to users why a query was executed in a certain order.
Why this is a real problem
The simplest example:
```json
{
  "bool": {
    "filter": [
      {"knn": {"embedding": {"vector": [...], "k": 10}}},
      {"term": {"status": "error"}},
      {"range": {"timestamp": {"gte": "2023"}}}
    ]
  }
}
```
If Lucene’s scorer chooses to lead with the range, then term, then KNN, the KNN work is done on a huge candidate set. If we flipped it to KNN → term → range, we’d do far less work for the same result.
In practice, this problem shows up in:
- Monitoring/log queries: big time ranges + filters + vector matches for anomaly search.
- Security queries: IP/network ranges with expensive scripted scoring + selective terms.
- Multi-field disjunctions where some fields are sparse and others are dense.
Describe the solution you'd like
What we add: a planning layer
The idea is to insert one clear stage between “DSL parsed” and “Lucene query built”:
1. Rewriting – One place to apply all query simplifications and normalizations. Instead of relying on each `QueryBuilder`’s ad hoc `rewrite()` method, we have a `QueryRewriteService` that walks the tree and applies a sequence of pluggable rewriters (a sketch of one rewriter follows below). Examples:
   - Flatten nested bools: `{"bool": {"filter": [{"bool": {"filter": [ ... ]}}]}}` → a single `bool` with all filters at one level.
   - Merge `term` clauses on the same field into a `terms` query.
   - Drop no-op `match_all` in `must` or `filter` contexts.
   - Add fallback terms for KNN queries (e.g., centroid tokens) if configured.
Each rewriter is a small class:
```java
interface QueryRewriter {
    QueryBuilder rewrite(QueryBuilder qb, RewriteContext ctx);
}
```
Rewriters are chained, and order matters. They operate only on the builder tree; no Lucene objects exist yet.
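For illustration, here is a rough sketch of one such rewriter, the term-merging one. It assumes the `QueryRewriter`/`RewriteContext` interfaces proposed above plus the existing `BoolQueryBuilder`/`TermQueryBuilder`/`TermsQueryBuilder` accessors (package names are indicative, and the rebuild deliberately skips copying boost, `minimum_should_match`, etc.):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.index.query.TermsQueryBuilder;

// Sketch: merge sibling `term` filters on the same field into one `terms` filter.
class MergeTermsRewriter implements QueryRewriter {
    @Override
    public QueryBuilder rewrite(QueryBuilder qb, RewriteContext ctx) {
        if (!(qb instanceof BoolQueryBuilder)) {
            return qb; // only bool queries have sibling clauses to merge
        }
        BoolQueryBuilder bool = (BoolQueryBuilder) qb;

        // Bucket term values by field; keep non-term clauses untouched.
        Map<String, List<Object>> valuesByField = new LinkedHashMap<>();
        List<QueryBuilder> others = new ArrayList<>();
        for (QueryBuilder clause : bool.filter()) {
            if (clause instanceof TermQueryBuilder) {
                TermQueryBuilder term = (TermQueryBuilder) clause;
                valuesByField.computeIfAbsent(term.fieldName(), f -> new ArrayList<>())
                             .add(term.value());
            } else {
                others.add(clause);
            }
        }

        // Rebuild: one term/terms query per field, everything else as-is.
        BoolQueryBuilder rebuilt = new BoolQueryBuilder();
        bool.must().forEach(rebuilt::must);
        bool.mustNot().forEach(rebuilt::mustNot);
        bool.should().forEach(rebuilt::should);
        others.forEach(rebuilt::filter);
        valuesByField.forEach((field, values) -> rebuilt.filter(
            values.size() == 1
                ? new TermQueryBuilder(field, values.get(0))
                : new TermsQueryBuilder(field, values.toArray())));
        return rebuilt;
    }
}
```
Because rewriters are chained, running a flatten-bools rewriter before this one means more `term` clauses end up as siblings and get merged.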
2. Logical plan – After rewrite, we walk the builder tree into our own lightweight plan structure. This is where we identify clause type, field, nesting, and assign hints.
Example node:
```java
class LogicalQueryNode {
    QueryBuilder clause;
    ClauseType type;       // TERM, RANGE, KNN, SCRIPT, EXISTS, etc.
    String field;
    boolean isNested;
    ExecutionHint hint;    // FILTER, VECTOR, SCRIPT
    double estimatedCost;
    List<LogicalQueryNode> children;
}
```
This is not meant for execution — it’s for reasoning about the query before we commit to Lucene’s representation.
A `bool.filter` with 5 clauses becomes 5 sibling `LogicalQueryNode`s. A `nested` query becomes a parent with its own children. For `should` clauses we record their grouping but can still estimate each branch’s cost.
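A sketch of what that walk might look like, assuming the `LogicalQueryNode` class above plus a compound `BOOL` value in `ClauseType` (full clause coverage and nested handling are elided):

```java
import java.util.ArrayList;

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.RangeQueryBuilder;
import org.elasticsearch.index.query.TermQueryBuilder;

class LogicalPlanBuilder {
    LogicalQueryNode build(QueryBuilder qb) {
        LogicalQueryNode node = new LogicalQueryNode();
        node.clause = qb;
        node.children = new ArrayList<>();
        if (qb instanceof BoolQueryBuilder) {
            BoolQueryBuilder bool = (BoolQueryBuilder) qb;
            node.type = ClauseType.BOOL; // assumed compound type
            // filter and must clauses become sibling children; should
            // grouping is recorded separately and elided here.
            for (QueryBuilder child : bool.filter()) node.children.add(build(child));
            for (QueryBuilder child : bool.must()) node.children.add(build(child));
        } else if (qb instanceof TermQueryBuilder) {
            node.type = ClauseType.TERM;
            node.field = ((TermQueryBuilder) qb).fieldName();
            node.hint = ExecutionHint.FILTER;
        } else if (qb instanceof RangeQueryBuilder) {
            node.type = ClauseType.RANGE;
            node.field = ((RangeQueryBuilder) qb).fieldName();
            node.hint = ExecutionHint.FILTER;
        }
        // ... KNN, SCRIPT, EXISTS classification elided
        return node;
    }
}
```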
3. Cost estimation – We now attach estimated costs to each logical node. The goal isn’t perfect precision; it’s “cheap vs expensive” in a way that lets us order intelligently.
Inputs for cost estimation:
- `FieldStats` from index mappings/statistics (doc count, cardinality, min/max for numerics).
- Lucene `IndexReader.docFreq()` for term queries if stats aren’t available. An alternative is `IndexSearcher.collectionStatistics()`, since `IndexReader.docFreq()` can be expensive.
- Fixed costs for types like KNN or script where we can’t derive them from stats.
Example heuristic table:
| Type | Cost |
| --- | --- |
| term (low-card keyword) | 10_000 |
| term (high-card text) | 500_000 |
| range on timestamp | 1_000_000 |
| exists | 100_000 |
| knn | 500_000 |
| script | 50_000_000 |

These numbers are just units for comparison — the scale doesn’t matter as long as it’s consistent.
We can improve this later with:
- Per-field dynamic baselines (from real query logs).
- Cost models that factor in `size`, `from`, and aggregation load.
- ML-predicted costs from query features.
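As a strawman, the first estimator can be little more than a table lookup with a stats override for terms; here `docCountByField` is a placeholder for whatever `FieldStats` source gets wired in:

```java
import java.util.Map;

class HeuristicCostEstimator {
    // Placeholder stats source: per-field matching-doc counts, populated from
    // index statistics or IndexSearcher.collectionStatistics().
    private final Map<String, Long> docCountByField;

    HeuristicCostEstimator(Map<String, Long> docCountByField) {
        this.docCountByField = docCountByField;
    }

    double estimate(LogicalQueryNode node) {
        switch (node.type) {
            case TERM:
                // Real stats when available, otherwise the heuristic unit.
                Long docCount = docCountByField.get(node.field);
                return docCount != null ? docCount : 500_000;
            case RANGE:  return 1_000_000;
            case EXISTS: return 100_000;
            case KNN:    return 500_000;
            case SCRIPT: return 50_000_000;
            default:
                // Compound nodes: annotate children and sum their costs.
                double sum = 0;
                for (LogicalQueryNode child : node.children) {
                    child.estimatedCost = estimate(child);
                    sum += child.estimatedCost;
                }
                return sum;
        }
    }
}
```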
4. Physical plan – Now we decide:
- Clause execution order (e.g., sort filters by ascending cost).
- Strategy:
- TERM_FIRST — run selective term/range filters, then vector.
- VECTOR_FIRST — run KNN first, post-filter results.
- PARALLEL — evaluate branches independently (disjunction-heavy queries).
- STAGED — KNN → lexical filter → script rerank.
- Scorer choice: `BooleanScorer2` (good for disjunctions), `ConjunctionDISI` (for conjunction-heavy queries), `ImpactScorer` (if impact-based skipping helps).
- Collector choice: `TopDocsCollector`, `TotalHitCountCollector`, or a multi-collector for staged execution.
- Early termination: enable if topK is requested and the plan predicts low result volume.
Plan example:
{ "clauses": [ {"field": "status", "type": "term", "cost": 10000}, {"field": "embedding", "type": "knn", "cost": 500000}, {"field": "timestamp", "type": "range", "cost": 1000000} ], "execution_order": ["status", "embedding", "timestamp"], "strategy": "vector_first", "scorer": "BooleanScorer2", "early_termination": true }
Where it fits
This entire flow sits inside `SearchService` before the call to `SearchExecutionContext.toQuery()`. We rewrite builders, build a logical plan, attach costs, sort clauses, decide strategy, and only then convert to Lucene Query objects. That way, Lucene sees exactly the structure we want it to execute.
No changes are needed in Lucene itself — we’re just controlling our side of the API. All changes are internal to query execution. Existing queries continue to work identically with potential performance improvements.
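Concretely, the call sequence might look like the sketch below; everything except `SearchExecutionContext.toQuery()` is a proposed component, and `reassemble()` (turning the ordered plan back into a `QueryBuilder`) is hypothetical glue:

```java
// Sketch of the planning pipeline inside SearchService, run before toQuery().
ParsedQuery planAndBuild(QueryBuilder parsed, RewriteContext ctx) {
    QueryBuilder rewritten = rewriteService.rewrite(parsed, ctx);  // 1. rewrite pass
    LogicalQueryNode logical = planBuilder.build(rewritten);       // 2. logical plan
    logical.estimatedCost = costEstimator.estimate(logical);       // 3. attach costs
    Strategy strategy = physicalPlanner.plan(logical);             // 4. order clauses, pick strategy
    // strategy feeds scorer/collector selection further downstream;
    // only now do we build the Lucene Query objects.
    return searchExecutionContext.toQuery(reassemble(logical));
}
```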
Rollout plan
- Phase 1: Implement `QueryRewriteService` and a few basic rewriters (flatten bools, merge terms, drop `match_all`).
- Phase 2: Build the `LogicalPlanBuilder` and plug in cost estimation.
- Phase 3: Implement the physical planner with clause ordering and basic strategy selection.
- Later: shard-aware planning (take shard-level stats into account), join planning, ML-driven strategies, adaptive replanning mid-query.
Why this is worth it
Even just Phase 3 with a simple heuristic cost model will improve hybrid search and filter-heavy queries noticeably. We’ll also finally be able to debug clause order and scorer selection, which is currently a black box. This sets the stage for more advanced features — streaming aggregations, join optimizations, vector reranking — without bolting them onto an uncontrolled execution path.
This is a standard pattern: statistics-based cost optimisers exist in systems like Postgres and Spark.
The `QueryRewriter` and cost estimation interfaces allow plugins to contribute domain-specific optimizations.
If folks agree this is a good direction, I can get a branch up with the rewrite service and logical plan scaffolding so we can try it against real workloads before worrying about cost model tuning.
Related component
No response
Describe alternatives you've considered
No response
Additional context
No response