Add lucene query for wildcards on high cardinality keyword fields.#139746
Add lucene query for wildcards on high cardinality keyword fields.#139746martijnvg merged 3 commits intoelastic:mainfrom
Conversation
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
| assertEquals("Cannot search on field [field] since it is not indexed nor has doc values.", e.getMessage()); | ||
| } | ||
|
|
||
| public void testTermQueryHighCardinality() { |
There was a problem hiding this comment.
Not related to the change, but this was missing from the previous change.
| ); | ||
| } | ||
|
|
||
| public void testWildcardQueryHighCardinality() { |
There was a problem hiding this comment.
Note that 395_binary_doc_values_search.yml handles integration testing for this change, but this unit test just checks that we use the right lucene query if cardinality is set to high.
| return new SlowCustomBinaryDocValuesWildcardQuery(name(), value, caseInsensitive); | ||
| } | ||
|
|
||
| if (caseInsensitive == false) { |
There was a problem hiding this comment.
question: why not using SlowCustomBinaryDocValuesWildcardQuery when case sensitive?
There was a problem hiding this comment.
That is only used if storedInBinaryDocValues() return true and if code ends up here then we don't use binary doc values.
Currently the WildcardQuery(term, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, MultiTermQuery.DOC_VALUES_REWRITE) can't be used if caseInsensitive is true. I do think this is possible, but the CaseInsensitiveWildcardQuery needs to be extended to work with doc values rewrite?
There was a problem hiding this comment.
I am asking this since I see in buildByteRunAutomaton the automaton is built based on the caseInsensitive flag which suggests this query can handle both based on the boolean value of caseInsensitive. This pattern could potentially be applied to the SortedSetDocValues path as well resulting in just 2 query classes (one per doc values format), each taking a caseInsensitive boolean, rather than 4 separate code paths. But maybe this is what you are planning to do as a followup.
| import static org.hamcrest.Matchers.equalTo; | ||
| import static org.hamcrest.Matchers.greaterThanOrEqualTo; | ||
|
|
||
| public class SlowCustomBinaryDocValuesWildcardQueryTests extends ESTestCase { |
There was a problem hiding this comment.
Do we need to test also with a multi-value field?
There was a problem hiding this comment.
In testBasics() a multi-valued randomly gets added.
I will do something similar in testAgainstWildcardQuery()
* upstream/main: (253 commits) Adds ST_SIMPLIFY geo spatial function (elastic#136309) Take control of max clause count verification in Lucene searcher (elastic#139752) [ML] Unmute Inference Test (elastic#139765) Parameterize the vector operation benchmark tests (elastic#139735) Fix node reduction pushdown tests for release tests (elastic#139548) Fix flakiness in TSDataGenerationHelper (elastic#139759) CPS: Copy existing resolved index expressions when constructing a new `SearchRequest` from an existing one (elastic#139596) Add release notes for v9.1.9 release (elastic#139674) Add lucene query for wildcards on high cardinality keyword fields. (elastic#139746) Suppress Tika entitlement warnings from AWT (elastic#139711) Check field storage when synthetic source is enabled, in tests (elastic#139715) Refactor VectorSimilarityType to know about its corresponding Function (elastic#139678) Merge fixes from patch branch back into main (elastic#139721) Define native bulk operations for vector square distance (elastic#139198) Use LongUpDownCounter for Linked Project Error Metrics (elastic#139657) ESQL: Add javadoc that explains version-aware planning (elastic#139706) Add helper to pick node for reindex relocation (elastic#139081) Fix auth serialization randomized version test (elastic#139182) ES|QL - Add parsing, preanalysis and analysis timing information to profile (elastic#139540) Mute org.elasticsearch.persistent.ClusterPersistentTasksCustomMetadataTests testMinVersionSerialization elastic#139741 ...
Speeds to wildcard queries on high cardinality keyword fields up by roughly 50% compared to using
StringScriptFieldWildcardQuery.Also added a base class (
AbstractBinaryDocValuesQuery) for bothSlowCustomBinaryDocValuesTermQueryandSlowCustomBinaryDocValuesWildcardQuery.