ES|QL: Optimize MMR by reducing cache size and lookup #145014
ioanatia merged 5 commits into elastic:main
Pinging @elastic/es-search-relevance (Team:Search Relevance)
// compute MMR scores for remaining searchHits
float highestSimilarityScoreToSelected = getHighestSimilarityScoreToSelectedVectors(
selectedDocRanks,
float similarityToLastSelected = getVectorComparisonScore(
I think I see what the optimization is here: you're only comparing the current document to the last selected document, correct? (The original implementation, and the implementation in the paper, compute MMR with respect to all the previously selected documents)...
I think this will work, but I'm still unsure whether it will produce the most correct results...
We still compute MMR wrt all previously selected documents.
We keep an array of the computed max similarity between each doc and the selected set.
Then, each time we select a new diversified doc, we iterate through the remaining docs to find the next candidate:
- We compute the similarity between the current doc (the one we are calculating MMR for) and the doc that was added to the selected docs in the previous iteration.
- We update `maxSimilarityToSelected[docRank - 1]` for the current doc.
- We compute the MMR score using the `maxSimilarityToSelected` value for the current doc.
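To make the steps above concrete, here is a minimal, self-contained sketch of the incremental update. It is not the code from this PR: the class `IncrementalMmrSketch`, the `select` and `cosine` helpers, and the 0-based doc indices are all hypothetical, and the score uses the textbook MMR form `lambda * relevance - (1 - lambda) * maxSimilarity`, which is an assumption here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the Elasticsearch implementation.
final class IncrementalMmrSketch {

    // Cosine similarity; stands in for whatever vector comparison the real code uses.
    static float cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0f : (float) (dot / Math.sqrt(normA * normB));
    }

    // Returns the indices of the selected docs, in selection order.
    static List<Integer> select(float[][] vectors, float[] relevance, float lambda, int k) {
        int n = vectors.length;
        // Running max similarity between each doc and the selected set.
        float[] maxSimilarityToSelected = new float[n];
        Arrays.fill(maxSimilarityToSelected, Float.NEGATIVE_INFINITY);
        boolean[] selected = new boolean[n];
        List<Integer> result = new ArrayList<>();
        int lastSelected = -1;

        while (result.size() < Math.min(k, n)) {
            int best = -1;
            float bestScore = Float.NEGATIVE_INFINITY;
            for (int doc = 0; doc < n; doc++) {
                if (selected[doc]) {
                    continue;
                }
                if (lastSelected >= 0) {
                    // Only one new comparison per doc: against the doc picked last round.
                    float sim = cosine(vectors[doc], vectors[lastSelected]);
                    maxSimilarityToSelected[doc] = Math.max(maxSimilarityToSelected[doc], sim);
                }
                // Nothing selected yet means no diversity penalty.
                float penalty = maxSimilarityToSelected[doc] == Float.NEGATIVE_INFINITY
                    ? 0f
                    : maxSimilarityToSelected[doc];
                float mmr = lambda * relevance[doc] - (1 - lambda) * penalty;
                if (mmr > bestScore) {
                    bestScore = mmr;
                    best = doc;
                }
            }
            selected[best] = true;
            result.add(best);
            lastSelected = best;
        }
        return result;
    }
}
```

With vectors `{1,0}`, `{1,0}`, `{0,1}` and relevance `1.0, 0.9, 0.5`, calling `select(..., 0.5f, 2)` picks docs 0 and 2, because the duplicate doc 1 is penalized by its similarity to doc 0. Because the running max already covers every earlier pick, each round needs one similarity computation per remaining doc rather than one per (remaining doc, selected doc) pair, and the result still reflects all previously selected docs.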
highestScore = similarityScore;
}
}
return highestScore == Float.NEGATIVE_INFINITY ? 0.0f : highestScore;
💭 Minor edge case, just from comparing the implementations (not sure if it's a valid one): in the previous code, if there was no valid similarity score we would get 0.0f, but now we will get Float.NEGATIVE_INFINITY. That happens if context.getFieldVector() returns null for every selected document.
We don't actually get to this path, because as we iterate through candidates, we skip those that don't have a vector value:
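For illustration only, and not the snippet the comment above refers to: a small sketch of the edge case under discussion. The class `NullVectorSketch` and its helpers are hypothetical, and the dot product is an arbitrary stand-in for the real similarity function. It shows the old-style fallback to `0.0f` when no selected document has a vector.

```java
import java.util.List;

// Hypothetical sketch of the null-vector edge case, not the PR's code.
final class NullVectorSketch {

    // Dot product as a stand-in similarity function.
    static float dot(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Highest similarity between the candidate and any selected vector.
    // Null entries (docs without a vector value) are skipped; if nothing
    // was comparable, fall back to 0.0f instead of NEGATIVE_INFINITY.
    static float highestSimilarityToSelected(float[] candidate, List<float[]> selectedVectors) {
        float highestScore = Float.NEGATIVE_INFINITY;
        for (float[] selected : selectedVectors) {
            if (selected == null) {
                continue;
            }
            highestScore = Math.max(highestScore, dot(candidate, selected));
        }
        return highestScore == Float.NEGATIVE_INFINITY ? 0.0f : highestScore;
    }
}
```

If candidates without a vector are filtered out before they can ever be selected, as the reply describes, the selected set never contains a null vector and the fallback branch is only reachable when the selected set is empty.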
Closes #140710. See #140710 for the full explanation.