Skip to content

Add lucene query for wildcards on high cardinality keyword fields.#139746

Merged
martijnvg merged 3 commits intoelastic:mainfrom
martijnvg:SlowCustomBinaryDocValuesWildcardQuery
Dec 18, 2025
Merged

Add lucene query for wildcards on high cardinality keyword fields.#139746
martijnvg merged 3 commits intoelastic:mainfrom
martijnvg:SlowCustomBinaryDocValuesWildcardQuery

Conversation

@martijnvg
Copy link
Member

@martijnvg martijnvg commented Dec 18, 2025

Speeds to wildcard queries on high cardinality keyword fields up by roughly 50% compared to using StringScriptFieldWildcardQuery.

Also added a base class (AbstractBinaryDocValuesQuery ) for both SlowCustomBinaryDocValuesTermQuery and SlowCustomBinaryDocValuesWildcardQuery .

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

assertEquals("Cannot search on field [field] since it is not indexed nor has doc values.", e.getMessage());
}

public void testTermQueryHighCardinality() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to the change, but this was missing from the previous change.

);
}

public void testWildcardQueryHighCardinality() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that 395_binary_doc_values_search.yml handles integration testing for this change, but this unit test just checks that we use the right lucene query if cardinality is set to high.

return new SlowCustomBinaryDocValuesWildcardQuery(name(), value, caseInsensitive);
}

if (caseInsensitive == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why not using SlowCustomBinaryDocValuesWildcardQuery when case sensitive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is only used if storedInBinaryDocValues() return true and if code ends up here then we don't use binary doc values.

Currently the WildcardQuery(term, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, MultiTermQuery.DOC_VALUES_REWRITE) can't be used if caseInsensitive is true. I do think this is possible, but the CaseInsensitiveWildcardQuery needs to be extended to work with doc values rewrite?

Copy link
Contributor

@salvatore-campagna salvatore-campagna Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am asking this since I see in buildByteRunAutomaton the automaton is built based on the caseInsensitive flag which suggests this query can handle both based on the boolean value of caseInsensitive. This pattern could potentially be applied to the SortedSetDocValues path as well resulting in just 2 query classes (one per doc values format), each taking a caseInsensitive boolean, rather than 4 separate code paths. But maybe this is what you are planning to do as a followup.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

import static org.hamcrest.Matchers.equalTo;
import static org.hamcrest.Matchers.greaterThanOrEqualTo;

public class SlowCustomBinaryDocValuesWildcardQueryTests extends ESTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to test also with a multi-value field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In testBasics() a multi-valued randomly gets added.
I will do something similar in testAgainstWildcardQuery()

@martijnvg martijnvg enabled auto-merge (squash) December 18, 2025 14:59
@martijnvg martijnvg merged commit 56fe77e into elastic:main Dec 18, 2025
35 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Dec 19, 2025
* upstream/main: (253 commits)
  Adds ST_SIMPLIFY geo spatial function (elastic#136309)
  Take control of max clause count verification in Lucene searcher (elastic#139752)
  [ML] Unmute Inference Test (elastic#139765)
  Parameterize the vector operation benchmark tests (elastic#139735)
  Fix node reduction pushdown tests for release tests (elastic#139548)
  Fix flakiness in TSDataGenerationHelper (elastic#139759)
  CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596)
  Add release notes for v9.1.9 release (elastic#139674)
  Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746)
  Suppress Tika entitlement warnings from AWT (elastic#139711)
  Check field storage when synthetic source is enabled, in tests (elastic#139715)
  Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678)
  Merge fixes from patch branch back into main (elastic#139721)
  Define native bulk operations for vector square distance (elastic#139198)
  Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657)
  ESQL: Add javadoc that explains version-aware planning (elastic#139706)
  Add helper to pick node for reindex relocation (elastic#139081)
  Fix auth serialization randomized version test (elastic#139182)
  ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540)
  Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants