[Azure Cognitive Search] Add new relevance features (feature API, session ID and scoring statistics) to the Index Search APIs #8793
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -248,6 +248,51 @@ | |
| "name": "SearchParameters" | ||
| } | ||
| }, | ||
| { | ||
| "name": "featuresMode", | ||
| "in": "query", | ||
| "type": "string", | ||
| "enum": [ | ||
| "disabled", | ||
| "enabled" | ||
| ], | ||
| "x-ms-enum": { | ||
| "name": "FeaturesMode", | ||
| "modelAsString": false | ||
| }, | ||
| "x-nullable": false, | ||
| "description": "A value that specifies whether the results should include scoring features such as per field similarity.", | ||
| "x-ms-parameter-grouping": { | ||
| "name": "SearchParameters" | ||
| } | ||
| }, | ||
| { | ||
| "name": "scoringStatistics", | ||
| "in": "query", | ||
| "type": "string", | ||
| "enum": [ | ||
| "local", | ||
| "global" | ||
| ], | ||
| "x-ms-enum": { | ||
| "name": "ScoringStatistics", | ||
| "modelAsString": false | ||
| }, | ||
| "x-nullable": false, | ||
| "description": "A value that specifies whether we want to calculate scoring statistics (such as document frequency) globally for more consistent scoring, or locally, for lower latency.", | ||
| "x-ms-parameter-grouping": { | ||
| "name": "SearchParameters" | ||
| } | ||
| }, | ||
| { | ||
| "name": "sessionId", | ||
| "in": "query", | ||
| "type": "string", | ||
| "description": "A value to be used to create a sticky session, which can help to get more consistent results. As long as the same sessionId is used, a best-effort attempt will be made to target the same replica set. Be wary that reusing the same sessionId values repeatedly can interfere with the load balancing of the requests across replicas and adversely affect the performance of the search service. The value used as sessionId cannot start with a '_' character.", | ||
| "x-ms-parameter-grouping": { | ||
| "name": "SearchParameters" | ||
| } | ||
| }, | ||
| { | ||
| "name": "$select", | ||
| "in": "query", | ||
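For illustration, the three query parameters added in this hunk could be assembled on the client as sketched below. This is a minimal sketch, not the service's or any SDK's implementation: the helper name `build_search_query` and the `search` parameter name are assumptions, and only the enum values and the leading-underscore rule come from the descriptions above.

```python
from urllib.parse import urlencode

# Allowed values copied from the enums in the diff above.
FEATURES_MODES = {"disabled", "enabled"}
SCORING_STATISTICS = {"local", "global"}


def build_search_query(search_text, features_mode="disabled",
                       scoring_statistics="local", session_id=None):
    """Return a URL query string carrying the new relevance parameters.

    Hypothetical helper; the "search" key is an assumed parameter name.
    """
    if features_mode not in FEATURES_MODES:
        raise ValueError(f"featuresMode must be one of {sorted(FEATURES_MODES)}")
    if scoring_statistics not in SCORING_STATISTICS:
        raise ValueError(
            f"scoringStatistics must be one of {sorted(SCORING_STATISTICS)}")

    params = {
        "search": search_text,
        "featuresMode": features_mode,
        "scoringStatistics": scoring_statistics,
    }
    if session_id is not None:
        # The description states that sessionId cannot start with '_'.
        if session_id.startswith("_"):
            raise ValueError("sessionId cannot start with '_'")
        params["sessionId"] = session_id
    return urlencode(params)
```

Validating on the client is a design choice made for the sketch; the service presumably performs its own validation as well.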
|
|
@@ -857,6 +902,27 @@ | |
| "additionalProperties": true, | ||
| "description": "A single bucket of a facet query result. Reports the number of documents with a field value falling within a particular range or having a particular value or interval." | ||
| }, | ||
| "SearchFeatures": { | ||
| "properties": { | ||
| "uniqueTokenMatches": { | ||
| "type": "number", | ||
| "readOnly": true, | ||
| "format": "double", | ||
| "description": "The number of unique tokens from the search query that matched this field." | ||
| }, | ||
| "similarityScore": { | ||
| "type": "number", | ||
| "readOnly": true, | ||
| "format": "double", | ||
| "description": "The similarity score computed between the search query and this field." | ||
| } | ||
| }, | ||
| "required": [ | ||
| "uniqueTokenMatches", | ||
| "similarityScore" | ||
| ], | ||
| "description": "A list of features describing the scoring of a specific field against the search query." | ||
| }, | ||
| "DocumentSearchResult": { | ||
| "properties": { | ||
| "@odata.count": { | ||
|
|
@@ -930,6 +996,15 @@ | |
| "readOnly": true, | ||
| "x-ms-client-name": "Highlights", | ||
| "description": "Text fragments from the document that indicate the matching search terms, organized by each applicable field; null if hit highlighting was not enabled for the query." | ||
| }, | ||
| "@search.features": { | ||
| "type": "object", | ||
| "additionalProperties": { | ||
| "$ref": "#/definitions/SearchFeatures" | ||
| }, | ||
| "readOnly": true, | ||
| "x-ms-client-name": "Features", | ||
| "description": "description for the feature" | ||
| } | ||
| }, | ||
| "additionalProperties": true, | ||
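As a rough sketch of how a client might read the new `@search.features` property, the snippet below turns one search result into a mapping of `SearchFeatures` objects. It assumes the result has already been decoded from JSON into a Python dict and that the keys of `@search.features` are field names, which the schema suggests but does not state outright. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchFeatures:
    """Client-side mirror of the SearchFeatures definition in the diff."""
    unique_token_matches: float
    similarity_score: float


def parse_features(result):
    """Map each key of '@search.features' to a SearchFeatures instance.

    Returns an empty dict when the property is missing or null, for
    example when featuresMode was left at 'disabled'.
    """
    raw = result.get("@search.features") or {}
    return {
        field: SearchFeatures(
            unique_token_matches=float(values["uniqueTokenMatches"]),
            similarity_score=float(values["similarityScore"]),
        )
        for field, values in raw.items()
    }
```

Both properties are listed as required in the definition, so the sketch indexes them directly instead of supplying defaults.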
|
|
@@ -1039,6 +1114,30 @@ | |
| }, | ||
| "description": "Specifies the syntax of the search query. The default is 'simple'. Use 'full' if your query uses the Lucene query syntax." | ||
| }, | ||
| "FeaturesMode": { | ||
| "type": "string", | ||
| "enum": [ | ||
| "disabled", | ||
|
Member
is there a reason this is an enum of
Member
Author
Yes, we expect to extend this enum as we start adding new "features" through versioning. We will likely add new values such as maybe "enableV2" (not sure on the names yet).
Member
Not sure if it's already shipped on the server side, but maybe having it be "None", "V1", "V2" would make more semantic sense? But yeah, I get the difficulty here.
Member
Author
Good suggestion. Unfortunately, the preview API is already shipped and being used. However, I'll bring up that suggestion as we revisit the API for GA.
Member
AutoRest has extensions that allow you to change the name of a swagger enum but not the value you send out over the wire: https://github.com/Azure/autorest/tree/master/docs/extensions#x-ms-enum If we think we can come up with better names, it would be great to change that sooner rather than later so we can expose them properly in our new client libraries. I don't think we need to block this PR on getting that done now, though.
Member
Author
The thing is, we aren't yet sure how the feature API will evolve. We published this as preview to get early feedback on it. We wanted to make the API flexible enough to allow us to use versioning, but aren't really ready to commit to it yet. That's why we started with enabled/disabled, with the option to add more values to the enum in the future if needed.
||
| "enabled" | ||
| ], | ||
| "x-ms-enum": { | ||
| "name": "FeaturesMode", | ||
| "modelAsString": false | ||
| }, | ||
| "description": "A value that specifies whether the results should include scoring features, such as per field similarity. The default is 'disabled'. Use 'enabled' to expose additional scoring features." | ||
| }, | ||
| "ScoringStatistics": { | ||
| "type": "string", | ||
| "enum": [ | ||
| "local", | ||
| "global" | ||
| ], | ||
| "x-ms-enum": { | ||
| "name": "ScoringStatistics", | ||
| "modelAsString": false | ||
| }, | ||
| "description": "A value that specifies whether we want to calculate scoring statistics (such as document frequency) globally for more consistent scoring, or locally, for lower latency. The default is 'local'. Use 'global' to aggregate scoring statistics globally before scoring. Using global scoring statistics can increase latency of search queries." | ||
| }, | ||
| "AutocompleteMode": { | ||
| "type": "string", | ||
| "enum": [ | ||
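The naming discussion in the FeaturesMode review thread comes down to separating the name a client library exposes from the string sent over the wire. The sketch below shows that idea in plain Python; it is not AutoRest output. The member names `NONE` and `V1` are hypothetical, borrowed from the reviewer's suggestion, while the wire values are the ones in the enum above.

```python
from enum import Enum


class FeaturesMode(Enum):
    # Left side: hypothetical client-facing names.
    # Right side: wire values from the swagger enum, which stay unchanged.
    NONE = "disabled"
    V1 = "enabled"


def to_wire(mode):
    """Return the string that would be sent as the featuresMode value."""
    return mode.value


def from_wire(value):
    """Look up the client-facing member for a wire value."""
    return FeaturesMode(value)
```

With this split, renaming a member affects only client code; requests still carry `disabled` or `enabled`.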
|
|
@@ -1103,6 +1202,18 @@ | |
| "$ref": "#/definitions/QueryType", | ||
| "description": "A value that specifies the syntax of the search query. The default is 'simple'. Use 'full' if your query uses the Lucene query syntax." | ||
| }, | ||
| "featuresMode": { | ||
| "$ref": "#/definitions/FeaturesMode", | ||
| "description": "A value that specifies whether the results should include scoring features, such as per field similarity. The default is 'disabled'. Use 'enabled' to expose additional scoring features." | ||
| }, | ||
| "scoringStatistics": { | ||
| "$ref": "#/definitions/ScoringStatistics", | ||
| "description": "A value that specifies whether we want to calculate scoring statistics (such as document frequency) globally for more consistent scoring, or locally, for lower latency. The default is 'local'. Use 'global' to aggregate scoring statistics globally before scoring. Using global scoring statistics can increase latency of search queries." | ||
| }, | ||
| "sessionId": { | ||
| "type": "string", | ||
| "description": "A value to be used to create a sticky session, which can help to get more consistent results. As long as the same sessionId is used, a best-effort attempt will be made to target the same replica set. Be wary that reusing the same sessionId values repeatedly can interfere with the load balancing of the requests across replicas and adversely affect the performance of the search service. The value used as sessionId cannot start with a '_' character." | ||
| }, | ||
| "scoringParameters": { | ||
| "type": "array", | ||
| "items": { | ||
|
|
is this always a whole number? what does it being a decimal type mean?
As of now, we only have those 2 features (unique tokens and similarity score). While we expect "unique tokens" to be whole numbers, almost all features we intend on adding won't be whole numbers. We also expect most customers to simply flatten out those structures into vectors of floats to be used for machine learning use cases. While features are defined as a strongly typed object, we really see them as a dictionary of floats (where the label/key describes what the float is).
Gotcha. It's all `number` to JS, fortunately for me. 😄
If customers want to use this value in a vector of floats for ML training or prediction, that's super easy for them to do regardless of how we return the property. My bigger concern is that every person using this API is going to wonder "Why is that a double? Am I missing something here?" That was my first thought too, anyway.
Sorry for the late reply. That's good feedback; I can see people asking themselves the question. Our vision of the feature response was really just a bag of floats describing the query-to-document relationship. The names are labels to describe how the values were calculated, but we never intended for those values to be used as discrete values. As we evolve it, there's even a chance that we start to apply normalization factors on those values to make them more useful. I don't necessarily mind converting some of those values to integer in the swagger if you think it would otherwise make their usage too confusing, but I also see value in keeping them the way they are on the server side.
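The "vector of floats" usage described in this thread could look roughly like the sketch below. It assumes the per-field features have already been decoded into a dict of dicts and that the caller supplies a fixed field order. Filling fields that are absent from a result with zeros is an assumption made for the example, not documented service behavior, and `flatten_features` is a hypothetical name.

```python
# Feature names taken from the SearchFeatures definition in the diff.
FEATURE_NAMES = ("uniqueTokenMatches", "similarityScore")


def flatten_features(features, field_order):
    """Flatten {field: {feature: value}} into one flat list of floats.

    The output has len(field_order) * len(FEATURE_NAMES) entries, laid
    out field by field. Missing fields or features become 0.0 (assumed).
    """
    vector = []
    for field in field_order:
        values = features.get(field, {})
        for name in FEATURE_NAMES:
            vector.append(float(values.get(name, 0.0)))
    return vector
```

Because every value is converted with `float`, it makes no difference to this consumer whether `uniqueTokenMatches` arrives as an integer or a double.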