-
Notifications
You must be signed in to change notification settings - Fork 104
Add hybrid query and score/rank based normalization processor stats #1326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
eb2d2ec
88a69e7
886f1b7
7859422
788edd7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -6,6 +6,7 @@ | |
|
|
||
| import static org.opensearch.neuralsearch.search.util.HybridSearchResultFormatUtil.isHybridQueryStartStopElement; | ||
|
|
||
| import java.util.Map; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| import java.util.List; | ||
|
|
@@ -14,8 +15,11 @@ | |
|
|
||
| import com.google.common.annotations.VisibleForTesting; | ||
| import lombok.Getter; | ||
| import org.opensearch.neuralsearch.processor.combination.RRFScoreCombinationTechnique; | ||
| import org.opensearch.neuralsearch.processor.combination.ScoreCombinationTechnique; | ||
| import org.opensearch.neuralsearch.processor.normalization.ScoreNormalizationTechnique; | ||
| import org.opensearch.neuralsearch.stats.events.EventStatName; | ||
| import org.opensearch.neuralsearch.stats.events.EventStatsManager; | ||
| import org.opensearch.search.fetch.FetchSearchResult; | ||
| import org.opensearch.search.pipeline.PipelineProcessingContext; | ||
| import org.opensearch.search.query.QuerySearchResult; | ||
|
|
@@ -50,6 +54,11 @@ public class RRFProcessor extends AbstractScoreHybridizationProcessor { | |
| private final ScoreCombinationTechnique combinationTechnique; | ||
| private final NormalizationProcessorWorkflow normalizationWorkflow; | ||
|
|
||
| private final Map<String, Runnable> combTechniqueIncrementers = Map.of( | ||
| RRFScoreCombinationTechnique.TECHNIQUE_NAME, | ||
| () -> EventStatsManager.increment(EventStatName.COMB_TECHNIQUE_RRF_EXECUTIONS) | ||
| ); | ||
|
|
||
| /** | ||
| * Method abstracts functional aspect of score normalization and score combination. Exact methods for each processing stage | ||
| * are set as part of class constructor | ||
|
|
@@ -70,6 +79,7 @@ <Result extends SearchPhaseResult> void hybridizeScores( | |
| Optional<FetchSearchResult> fetchSearchResult = getFetchSearchResults(searchPhaseResult); | ||
| boolean explain = Objects.nonNull(searchPhaseContext.getRequest().source().explain()) | ||
| && searchPhaseContext.getRequest().source().explain(); | ||
| recordStats(combinationTechnique); | ||
| // make data transfer object to pass in, execute will get object with 4 or 5 fields, depending | ||
| // on coming from NormalizationProcessor or RRFProcessor | ||
| NormalizationProcessorWorkflowExecuteRequest normalizationExecuteDTO = NormalizationProcessorWorkflowExecuteRequest.builder() | ||
|
|
@@ -143,4 +153,9 @@ <Result extends SearchPhaseResult> Optional<FetchSearchResult> getFetchSearchRes | |
| Optional<Result> optionalFirstSearchPhaseResult = searchPhaseResults.getAtomicArray().asList().stream().findFirst(); | ||
| return optionalFirstSearchPhaseResult.map(SearchPhaseResult::fetchResult); | ||
| } | ||
|
|
||
| private void recordStats(ScoreCombinationTechnique combinationTechnique) { | ||
| EventStatsManager.increment(EventStatName.RRF_PROCESSOR_EXECUTIONS); | ||
| Optional.of(combTechniqueIncrementers.get(combinationTechnique.techniqueName())).ifPresent(Runnable::run); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. RRF processor internally creates RRFNormalizationProcessor also. Check https://github.com/opensearch-project/neural-search/blob/main/src/main/java/org/opensearch/neuralsearch/processor/factory/RRFProcessorFactory.java#L51
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added this since @martin-gaievski requested, but my opinion is if currently RRF processor only works works with RRF combination technique and RRF normalization technique, then functionally the stats will be identical, and I don't think there's a point to including a breakdown, it will just be duplicated. Combination is configurable with a single option, but normalization technique isn't a configurable in processor config: https://docs.opensearch.org/docs/latest/search-plugins/search-pipelines/score-ranker-processor/ If in the future there are more normalizaiton/score techniques added we can add this granularity then.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Having RRF as processor and as technique gives more dimensions for reporting. For instance, with original version you can generate report with breakdown "by processor type". But report with breakdown on something like "by combination technique" will be harder because for RRF you raw data will have combination technique metric with empty value.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see your point, my concern is having duplicated information in the stats API makes it less readable and less performant. Especially as we add more features, we are concerned with possible response bloat, similar the issues core is facing with the size of From a report perspective, I'm thinking in terms of what kind of insight a breakdown would give you: If you are trying to determine which score combination techniques are seeing more usage, perhaps proportion of RRF to zscore or minmax isn't comparable since they're categorically different, e.g. used in different processors in different contexts. And if needed, the information is available implicitly from looking at RRF processor stats directly. |
||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -7,6 +7,7 @@ | |
| import java.io.IOException; | ||
| import java.util.ArrayList; | ||
| import java.util.Collection; | ||
| import java.util.HashMap; | ||
| import java.util.List; | ||
| import java.util.ListIterator; | ||
| import java.util.Locale; | ||
|
|
@@ -38,6 +39,8 @@ | |
| import lombok.Setter; | ||
| import lombok.experimental.Accessors; | ||
| import lombok.extern.log4j.Log4j2; | ||
| import org.opensearch.neuralsearch.stats.events.EventStatName; | ||
| import org.opensearch.neuralsearch.stats.events.EventStatsManager; | ||
|
|
||
| import static org.opensearch.neuralsearch.common.MinClusterVersionUtil.isClusterOnOrAfterMinReqVersionForPaginationInHybridQuery; | ||
|
|
||
|
|
@@ -270,13 +273,26 @@ public static HybridQueryBuilder fromXContent(XContentParser parser) throws IOEx | |
| if (isClusterOnOrAfterMinReqVersionForPaginationInHybridQuery()) { | ||
| compoundQueryBuilder.paginationDepth(paginationDepth); | ||
| } | ||
|
|
||
| boolean hasInnerHits = false; | ||
| for (QueryBuilder query : queries) { | ||
| if (filter == null) { | ||
| compoundQueryBuilder.add(query); | ||
| } else { | ||
| compoundQueryBuilder.add(query.filter(filter)); | ||
| } | ||
|
|
||
| // Check if children have inner hits for stats | ||
| if (hasInnerHits == false) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this the right comparison? if children have inner hits, shouldn't this be true?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Where are we changing this parameter value?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's set inside the loop, see my other comment for explanation. Basically my thought is we want to increment the stat once if at least one child query has inner hits, which means the hybrid query as a whole is an inner hits hybrid query. |
||
| Map<String, InnerHitContextBuilder> innerHits = new HashMap<>(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shouldn't we emit stats for the inner hits? seems like we're checking for inner hits without doing anything to them.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The thought here is that we care about the stats at the level of the hybrid query. If at least one child query has inner hits then that means the hybrid query is an inner hits query. So we check each child, if at least one has inner hits, then we set hasInnerHits = true and we don't have to keep checking the rest. |
||
| InnerHitContextBuilder.extractInnerHits(query, innerHits); | ||
| hasInnerHits = innerHits.isEmpty() == false; | ||
| } | ||
| } | ||
|
|
||
| boolean hasFilter = filter != null; | ||
| boolean hasPagination = paginationDepth != null; | ||
| updateQueryStats(hasFilter, hasPagination, hasInnerHits); | ||
| return compoundQueryBuilder; | ||
| } | ||
|
|
||
|
|
@@ -409,4 +425,17 @@ protected void extractInnerHitBuilders(Map<String, InnerHitContextBuilder> inner | |
| InnerHitContextBuilder.extractInnerHits(queryBuilder, innerHits); | ||
| } | ||
| } | ||
|
|
||
| private static void updateQueryStats(boolean hasFilter, boolean hasPagination, boolean hasInnerHits) { | ||
| EventStatsManager.increment(EventStatName.HYBRID_QUERY_REQUESTS); | ||
| if (hasFilter) { | ||
| EventStatsManager.increment(EventStatName.HYBRID_QUERY_FILTER_REQUESTS); | ||
| } | ||
| if (hasPagination) { | ||
| EventStatsManager.increment(EventStatName.HYBRID_QUERY_PAGINATION_REQUESTS); | ||
| } | ||
| if (hasInnerHits) { | ||
| EventStatsManager.increment(EventStatName.HYBRID_QUERY_INNER_HITS_REQUESTS); | ||
| } | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like we have
switch/caseto increment for some processors, and runnable as map values for others. We should have one standard approachThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I am more inclined towards using map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can refactor the others as maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this map can be static, same for the second map
combTechniqueIncrementers