Add lucene query for wildcards on high cardinality keyword fields. by martijnvg · Pull Request #139746 · elastic/elasticsearch

martijnvg · 2025-12-18T13:13:03Z

Speeds up wildcard queries on high cardinality keyword fields by roughly 50% compared to using StringScriptFieldWildcardQuery.

Also added a base class (AbstractBinaryDocValuesQuery) for both SlowCustomBinaryDocValuesTermQuery and SlowCustomBinaryDocValuesWildcardQuery.

elasticsearchmachine · 2025-12-18T13:13:28Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

martijnvg · 2025-12-18T13:13:36Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldTypeTests.java

        assertEquals("Cannot search on field [field] since it is not indexed nor has doc values.", e.getMessage());
    }

+    public void testTermQueryHighCardinality() {


Not related to the change, but this was missing from the previous change.

martijnvg · 2025-12-18T13:14:53Z

server/src/test/java/org/elasticsearch/index/mapper/KeywordFieldTypeTests.java

        );
    }

+    public void testWildcardQueryHighCardinality() {


Note that 395_binary_doc_values_search.yml handles integration testing for this change; this unit test just checks that we use the right Lucene query if cardinality is set to high.

salvatore-campagna · 2025-12-18T13:30:00Z

server/src/main/java/org/elasticsearch/index/mapper/KeywordFieldMapper.java

+                    return new SlowCustomBinaryDocValuesWildcardQuery(name(), value, caseInsensitive);
+                }
+
+                if (caseInsensitive == false) {


Question: why not use SlowCustomBinaryDocValuesWildcardQuery when case sensitive?

That is only used if storedInBinaryDocValues() returns true; if the code ends up here, then we don't use binary doc values.

Currently, WildcardQuery(term, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, MultiTermQuery.DOC_VALUES_REWRITE) can't be used if caseInsensitive is true. I do think this is possible, but CaseInsensitiveWildcardQuery would probably need to be extended to work with the doc values rewrite.
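The branch order described in this reply can be sketched as a small decision function. This is a minimal sketch and not the mapper code: `WildcardPathSketch`, `Path`, and `choose` are hypothetical names, and `OTHER` stands for whatever the mapper does for case-insensitive patterns without binary doc values, which this thread does not show.

```java
// Hypothetical sketch of the branch order visible in the diff excerpt above.
// None of these names exist in Elasticsearch; they only label the code paths.
final class WildcardPathSketch {

    enum Path {
        // new SlowCustomBinaryDocValuesWildcardQuery(name(), value, caseInsensitive)
        BINARY_DOC_VALUES_WILDCARD,
        // new WildcardQuery(term, ..., MultiTermQuery.DOC_VALUES_REWRITE)
        DOC_VALUES_REWRITE_WILDCARD,
        // Case-insensitive without binary doc values: not shown in this thread.
        OTHER
    }

    static Path choose(boolean storedInBinaryDocValues, boolean caseInsensitive) {
        if (storedInBinaryDocValues) {
            // The new query takes the flag itself, so both cases share one path.
            return Path.BINARY_DOC_VALUES_WILDCARD;
        }
        if (caseInsensitive == false) {
            return Path.DOC_VALUES_REWRITE_WILDCARD;
        }
        return Path.OTHER;
    }
}
```

The first branch ignores `caseInsensitive` because the binary doc values query handles both cases internally; only the remaining path still splits on the flag.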

I am asking because in buildByteRunAutomaton the automaton is built based on the caseInsensitive flag, which suggests this query can handle both cases depending on that boolean. This pattern could potentially be applied to the SortedSetDocValues path as well, resulting in just 2 query classes (one per doc values format), each taking a caseInsensitive boolean, rather than 4 separate code paths. But maybe this is what you are planning to do as a follow-up.
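To illustrate the "one class, one caseInsensitive boolean" idea without Lucene, here is a minimal sketch of a matcher that is checked against every value of a document. It is not the PR's implementation: `WildcardSketch` and its methods are hypothetical, it compares `char`s with `Character.toLowerCase` instead of building an automaton, and it does not handle the `\` escape.

```java
import java.util.List;

// Hypothetical stand-in for a doc-values wildcard matcher: the flag is fixed
// at construction time, so one class serves both case modes.
final class WildcardSketch {
    private final String pattern;
    private final boolean caseInsensitive;

    WildcardSketch(String pattern, boolean caseInsensitive) {
        this.pattern = pattern;
        this.caseInsensitive = caseInsensitive;
    }

    private boolean sameChar(char a, char b) {
        return caseInsensitive
            ? Character.toLowerCase(a) == Character.toLowerCase(b)
            : a == b;
    }

    // '*' matches any run of characters, '?' matches exactly one character.
    boolean matchesValue(String value) {
        int p = 0, v = 0;
        int starP = -1, starV = -1; // position of the last '*' and where it started matching
        while (v < value.length()) {
            if (p < pattern.length() && pattern.charAt(p) == '*') {
                starP = p++;
                starV = v;
            } else if (p < pattern.length()
                && (pattern.charAt(p) == '?' || sameChar(pattern.charAt(p), value.charAt(v)))) {
                p++;
                v++;
            } else if (starP >= 0) {
                // Backtrack: let the last '*' absorb one more character.
                p = starP + 1;
                v = ++starV;
            } else {
                return false;
            }
        }
        // Only trailing '*' may remain in the pattern.
        while (p < pattern.length() && pattern.charAt(p) == '*') {
            p++;
        }
        return p == pattern.length();
    }

    // A document matches if any one of its values matches.
    boolean matchesDoc(List<String> values) {
        for (String value : values) {
            if (matchesValue(value)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, `new WildcardSketch("FOO*", true).matchesDoc(List.of("bar", "foobar"))` matches, while the same call with `false` does not.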

romseygeek

Nice!

salvatore-campagna · 2025-12-18T13:51:34Z

.../test/java/org/elasticsearch/lucene/queries/SlowCustomBinaryDocValuesWildcardQueryTests.java

+import static org.hamcrest.Matchers.equalTo;
+import static org.hamcrest.Matchers.greaterThanOrEqualTo;
+
+public class SlowCustomBinaryDocValuesWildcardQueryTests extends ESTestCase {


Do we need to test also with a multi-value field?

In testBasics(), a multi-valued field randomly gets added.
I will do something similar in testAgainstWildcardQuery().
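The kind of check discussed here, comparing a candidate matcher against a trusted reference over random multi-valued documents, can be sketched in plain Java. This is only an illustration of the differential-testing idea: `DifferentialWildcardCheck` and its methods are hypothetical, a `java.util.regex` translation plays the role that WildcardQuery plays in the real test, and the alphabet and sizes are arbitrary.

```java
import java.util.Random;
import java.util.regex.Pattern;

// Hypothetical differential check: candidate glob matcher vs. a regex reference.
final class DifferentialWildcardCheck {

    // Candidate under test: '*' matches any run of characters, '?' exactly one.
    static boolean glob(String p, int i, String v, int j) {
        if (i == p.length()) {
            return j == v.length();
        }
        char c = p.charAt(i);
        if (c == '*') {
            // Either '*' matches nothing, or it consumes one character and stays.
            return glob(p, i + 1, v, j) || (j < v.length() && glob(p, i, v, j + 1));
        }
        return j < v.length() && (c == '?' || c == v.charAt(j)) && glob(p, i + 1, v, j + 1);
    }

    // Reference: translate the wildcard pattern into an equivalent regex.
    static Pattern toRegex(String p) {
        StringBuilder sb = new StringBuilder();
        for (char c : p.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    static String randomString(Random r, String alphabet, int maxLen) {
        int len = r.nextInt(maxLen + 1);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(alphabet.charAt(r.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    // Returns how many random documents the two matchers disagree on.
    static int countMismatches(long seed, int docs) {
        Random r = new Random(seed);
        int mismatches = 0;
        for (int d = 0; d < docs; d++) {
            String pattern = randomString(r, "ab*?", 4);
            Pattern reference = toRegex(pattern);
            int numValues = 1 + r.nextInt(3); // 1 to 3 values: some docs are multi-valued
            boolean candidateHit = false;
            boolean referenceHit = false;
            for (int n = 0; n < numValues; n++) {
                String value = randomString(r, "ab", 5);
                candidateHit |= glob(pattern, 0, value, 0);
                referenceHit |= reference.matcher(value).matches();
            }
            if (candidateHit != referenceHit) {
                mismatches++;
            }
        }
        return mismatches;
    }
}
```

A small alphabet keeps the hit rate high enough that both the matching and non-matching branches are exercised.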

* upstream/main: (253 commits)
  Adds ST_SIMPLIFY geo spatial function (elastic#136309)
  Take control of max clause count verification in Lucene searcher (elastic#139752)
  [ML] Unmute Inference Test (elastic#139765)
  Parameterize the vector operation benchmark tests (elastic#139735)
  Fix node reduction pushdown tests for release tests (elastic#139548)
  Fix flakiness in TSDataGenerationHelper (elastic#139759)
  CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596)
  Add release notes for v9.1.9 release (elastic#139674)
  Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746)
  Suppress Tika entitlement warnings from AWT (elastic#139711)
  Check field storage when synthetic source is enabled, in tests (elastic#139715)
  Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678)
  Merge fixes from patch branch back into main (elastic#139721)
  Define native bulk operations for vector square distance (elastic#139198)
  Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657)
  ESQL: Add javadoc that explains version-aware planning (elastic#139706)
  Add helper to pick node for reindex relocation (elastic#139081)
  Fix auth serialization randomized version test (elastic#139182)
  ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540)
  Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741
  ...

Add lucene query for wildcards on high cardinality keyword fields.

57e26f2

martijnvg requested a review from romseygeek December 18, 2025 13:13

martijnvg added >non-issue :StorageEngine/Mapping The storage related side of mappings labels Dec 18, 2025

elasticsearchmachine added v9.4.0 Team:StorageEngine labels Dec 18, 2025

martijnvg commented Dec 18, 2025

[CI] Auto commit changes from spotless

97c4f61

salvatore-campagna reviewed Dec 18, 2025

romseygeek approved these changes Dec 18, 2025

salvatore-campagna reviewed Dec 18, 2025

iter

2a29940

martijnvg enabled auto-merge (squash) December 18, 2025 14:59

martijnvg merged commit 56fe77e into elastic:main Dec 18, 2025
35 checks passed

martijnvg merged 3 commits into elastic:main from martijnvg:SlowCustomBinaryDocValuesWildcardQuery
