opensearch-project · natebower · Aug 18, 2025 · Jul 28, 2025 · Aug 4, 2025 · Aug 5, 2025
@@ -2,20 +2,15 @@
 layout: default
 title: Comparing query sets
 nav_order: 12
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Comparing query sets
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).    
-{: .warning}
-
 To compare the results of two different search configurations, you can run a pairwise experiment. To achieve this, you need two search configurations and a query set to use for the search configuration.
 
-
 For more information about creating a query set, see [Query Sets]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/query-sets/).
 
 For more information about creating search configurations, see [Search Configurations]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/search-configurations/).
@@ -48,7 +43,7 @@ Field | Data type |  Description
 `querySetId` | String |	The query set ID.
 `searchConfigurationList` | List | A list of search configuration IDs to use for comparison.
 `size` | Integer | The number of documents to return in the results.
-`type` | String | Defines the type of experiment to run. Valid values are `PAIRWISE_COMPARISON`, `HYBRID_OPTIMIZER`, or `POINTWISE_EVALUATION`. Depending on the experiment type, you must provide different body fields in the request. `PAIRWISE_COMPARISON` is for comparing two search configurations against a query set and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/compare-query-sets/). `HYBRID_OPTIMIZER` is for combining results and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/optimize-hybrid-search/). `POINTWISE_EVALUATION` is for evaluating a search configuration against judgments and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/). 
+`type` | String | Defines the type of experiment to run. Valid values are `PAIRWISE_COMPARISON`, `HYBRID_OPTIMIZER`, or `POINTWISE_EVALUATION`. Depending on the experiment type, you must provide different body fields in the request. `PAIRWISE_COMPARISON` is for comparing two search configurations against a query set and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/compare-query-sets/). `HYBRID_OPTIMIZER` is for combining results and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/optimize-hybrid-search/). `POINTWISE_EVALUATION` is for evaluating a search configuration against judgments and is used [here]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/).
 
 The response contains the experiment ID of the created experiment:
 

@@ -1,11 +1,10 @@
 ---
 layout: default
 title: Comparing single queries
-nav_order: 11
-parent: Using Search Relevance Workbench
+nav_order: 10
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Comparing single queries

@@ -1,18 +1,14 @@
 ---
 layout: default
 title: Comparing search results
-nav_order: 10
-parent: Using Search Relevance Workbench
+nav_order: 11
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: true
-has_toc: false
 ---
 
 # Comparing search results
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).    
-{: .warning}
-
 Comparing search results, also called a _pairwise experiment_, in OpenSearch Dashboards allows you to compare results of multiple search configurations. Using this tool helps assess how results change when applying different search configurations to queries.
 
 For example, you can see how results change when you apply one of the following query changes:

@@ -2,17 +2,13 @@
 layout: default
 title: Evaluating search quality
 nav_order: 50
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Evaluating search quality
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).    
-{: .warning}
-
 Search Relevance Workbench can run pointwise experiments to evaluate search configuration quality using provided queries and relevance judgments.
 
 For more information about creating a query set, see [Query sets]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/query-sets/).
@@ -210,3 +206,5 @@ The results include the original request parameters along with the following met
 - `MAP@k`: The Mean Average Precision, which calculates the average precision across all documents. For more information, see [Average precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision).
 
 - `NDCG@k`: The Normalized Discounted Cumulative Gain, which compares the actual ranking of results against a perfect ranking, with higher weights given to top results. This measures the quality of result ordering.
+
+To review these results visually, see [Exploring search evaluation results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/explore-experiment-results/).
@@ -0,0 +1,75 @@
+---
+layout: default
+title: Exploring search evaluation results
+nav_order: 65
+parent: Search Relevance Workbench
+grand_parent: Search relevance
+has_children: false
+---
+
+# Exploring search evaluation results
+Introduced 3.2
+{: .label .label-purple }
+
+In addition to retrieving the experiment results using the API, you can explore the results visually. Search Relevance Workbench comes with dashboards that you can install to review search evaluation and hybrid search optimization experiment results.
+
+## Installing the dashboards
+
+You can install the dashboards in one of the following ways:
+
+* In the **Actions** column, select a visualization icon in the experiment overview.
+
+* Select the **Install Dashboards** button in the upper-right corner of the experiment overview.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/experiment_overview_dashboard_installation_options.png" alt="Experiment overview of the Search Relevance Workbench including dashboard installation options"/>{: .img-fluid }
+
+Either option opens a modal that offers to install the dashboards for you.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/install_dashboards_modal.png" alt="Modal to install dashboards"/>{: .img-fluid }
+
+## Using the dashboards
+
+Once you install the dashboards, in the **Actions** column, select the visualization icon in the experiment overview. This opens the experiment result dashboard. The view presented depends on the type of experiment you chose:
+
+* The search evaluation dashboard focuses on the individual query level and provides insights into well-performing queries and queries whose relevance can still be improved.
+
+* The hybrid search dashboard provides an overview of how the different hybrid search parameter configurations performed and lets you identify candidate queries for further exploration and experimentation.
+
+### Search evaluation dashboard
+
+The search evaluation dashboard, shown in the following image, aggregates performance metrics across all queries in your selected experiment. Use the search evaluation dashboard to get a high-level view of overall experiment performance and identify the queries that need attention.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/search_evaulation_dashboard.png" alt="Search evaluation dashboard with visualizations"/>{: .img-fluid }
+
+The **Deep Dive Summary** pane shows the aggregate metrics for NDCG, MAP, precision, and coverage (see [Evaluating search quality]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/evaluate-search-quality/)).
+
+The **Deep Dive Query Scores** pane shows individual query performance ranked by NDCG score (highest to lowest). Use this pane to identify your best- and worst-performing queries.
+
+The **Deep Dive Score Densities** pane shows how metric values are distributed across your query set. Use this pane to understand whether poor performance is widespread or concentrated in specific queries. The x-axis shows metric values, while the y-axis shows how frequently those values occur.
+
+The **Deep Dive Score Scatter Plot** pane shows an interactive view of the preceding distribution data, with each query shown as a separate point. Use this pane to investigate specific queries at performance extremes. Points are scattered vertically to prevent overlap while maintaining the same x-axis metric values as the preceding distribution view.
+
+### Hybrid search evaluation dashboard
+
+Use the hybrid search evaluation dashboard, shown in the following image, to compare experiment variants and identify the optimal parameter configurations for your hybrid experiment.
+
+<img src="{{site.url}}{{site.baseurl}}/images/search-relevance-workbench/hybrid_search_optimizer_dashboard.png" alt="Hybrid search optimization evaluation dashboard with visualizations"/>{: .img-fluid }
+
+The **Variant Performance Chart** shows your experiment variants arranged visually from best to worst performing (left to right, by decreasing NDCG). Use this chart to quickly identify your top-performing variants and view performance patterns across different parameter combinations at a glance.
+
+The **Variant Performance** pane shows the same variant data in a sortable table format with all metrics visible. Use this pane to compare specific metric values across variants and customize your analysis by sorting on different performance measures. To sort by a column, select the column header.
+
+
+### Customizing the dashboards
+
+The dashboards are installed as saved objects. After installing them, you can edit the dashboards or clone and customize them to your specific requirements.
+
+To learn how to customize the source files, see [Updating the default dashboards](https://github.com/opensearch-project/dashboards-search-relevance/blob/main/DEVELOPER_GUIDE.md#updating-default-dashboards).
+
+### Resetting dashboards
+
+To reset the dashboards, select the **Install Dashboards** button in the upper-right corner of the experiment overview. This will reinstall the dashboards.
+
+
+
+
@@ -2,17 +2,13 @@
 layout: default
 title: Judgments
 nav_order: 8
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Judgments
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).    
-{: .warning}
-
 A judgment is a relevance rating assigned to a specific document in the context of a particular query. Multiple judgments are grouped together into judgment lists.
 Typically, judgments are categorized into two types---implicit and explicit:
 
@@ -120,10 +116,10 @@ To use AI-assisted judgment generation, ensure that you have configured the foll
 * A query set: Together with the `size` parameter, the query set defines the scope for generating judgments. For each query, the top k documents are retrieved from the specified index, where k is defined in the `size` parameter.
 * A search configuration: A search configuration defines how documents are retrieved for use in query/document pairs.
 
-The AI-assisted judgment process works as follows: 
-- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. The query and each document from the result list create a query/document pair. 
-- Each query and document pair forms a query/document pair. 
-- The LLM is then called with a predefined prompt (stored as a static variable in the backend) to generate a judgment for each query/document pair. 
+The AI-assisted judgment process works as follows:
+- For each query, the top k documents are retrieved using the defined search configuration, which includes the index information. The query and each document from the result list form a query/document pair.
+- The LLM is then called with a predefined prompt (stored as a static variable in the backend) to generate a judgment for each query/document pair.
 - All generated judgments are stored in the judgments index for reuse in future experiments.
 
 To create a judgment list, provide the model ID of the LLM, an available query set, and a created search configuration:
@@ -132,7 +128,7 @@ To create a judgment list, provide the model ID of the LLM, an available query s
 ```json
 PUT _plugins/_search_relevance/judgments
 {
-    "name":"COEC",
+    "name":"AI-assisted judgment list",
     "type":"LLM_JUDGMENT",
     "querySetId":"5f0115ad-94b9-403a-912f-3e762870ccf6",
     "searchConfigurationList":["2f90d4fd-bd5e-450f-95bb-eabe4a740bd1"],
@@ -177,6 +173,8 @@ Parameter | Data type | Description
 `clickModel` | String | The model used to calculate implicit judgments. Only `coec` (Clicks Over Expected Clicks) is supported.
 `type` | String | Set to `UBI_JUDGMENT`.
 `maxRank` | Integer | The maximum rank to consider when including events in the judgment calculation.
+`startDate` | Date | The optional start date from which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
+`endDate` | Date | The optional end date until which behavioral data events are considered for implicit judgment generation. The format is `yyyy-MM-dd`.
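+
+As an illustration, the following request sketches how the optional date range might be combined with the other parameters in the preceding table. The name and all parameter values are placeholders, not recommendations:
+
+```json
+PUT _plugins/_search_relevance/judgments
+{
+    "name": "Implicit judgments for June 2025",
+    "clickModel": "coec",
+    "type": "UBI_JUDGMENT",
+    "maxRank": 20,
+    "startDate": "2025-06-01",
+    "endDate": "2025-06-30"
+}
+```
+{% include copy-curl.html %}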
 
 ## Managing judgment lists
 

@@ -2,17 +2,13 @@
 layout: default
 title: Optimizing hybrid search
 nav_order: 60
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Optimizing hybrid search
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).    
-{: .warning}
-
 A key challenge of using hybrid search in OpenSearch is combining results from lexical and vector-based search effectively. OpenSearch provides different techniques and various parameters you can experiment with to find the best setup for your application. What works best, however, depends heavily on your data, user behavior, and application domain—there is no one-size-fits-all solution.
 
 Search Relevance Workbench helps you systematically find the ideal set of parameters for your needs.
@@ -114,3 +110,5 @@ POST _plugins/_sql
 }
 ```
 {% include copy-curl.html %}
+
+To review these results visually, see [Exploring search evaluation results]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/explore-experiment-results/).
@@ -2,17 +2,13 @@
 layout: default
 title: Query sets
 nav_order: 3
-parent: Using Search Relevance Workbench
+parent: Search Relevance Workbench
 grand_parent: Search relevance
 has_children: false
-has_toc: false
 ---
 
 # Query sets
 
-This is an experimental feature and is not recommended for use in a production environment. For updates on the progress of the feature or if you want to leave feedback, see the associated [GitHub issue](https://github.com/opensearch-project/OpenSearch/issues/17735).    
-{: .warning}
-
 A query set is a collection of queries. These queries are used in experiments for search relevance evaluation. Search Relevance Workbench offers different sampling techniques for creating query sets from real user data that adheres to the [User Behavior Insights (UBI)]({{site.url}}{{site.baseurl}}/search-plugins/ubi/schemas/) specification.
 Additionally, Search Relevance Workbench allows you to import a query set.
 
@@ -37,10 +33,10 @@ The following table lists the available input parameters.
 
 Field | Data type |  Description
 :---  | :--- | :---
-`name` | String |	The name of the query set.
-`description` | String | A short description of the query set.
+`name` | String | The name of the query set. The maximum length is 50 characters.
+`description` | String | A short description of the query set. The maximum length is 250 characters.
 `sampling` | String | Defines which sampler to use. Valid values are `pptss` (Probability-Proportional-to-Size-Sampling), `random`, `topn` (most frequent queries), and `manual`.
-`querySetSize` | Integer | The target number of queries in the query set. Depending on the number of unique queries in `ubi_queries`, the resulting query set may contain fewer queries.
+`querySetSize` | Integer | The target number of queries in the query set. Depending on the number of unique queries in `ubi_queries`, the resulting query set may contain fewer queries. Must be a positive integer.
 
 ### Example request: Sampling 20 queries with the Top N sampler
 
@@ -73,6 +69,54 @@ PUT _plugins/_search_relevance/query_sets
 }
 ```
 
+## Query set formats
+
+Search Relevance Workbench supports two formats for query sets, each designed for different use cases. Both formats are a collection of user queries, but they differ in whether they include an expected answer.
+
+* **Basic query set**: A list of user queries without any additional information. This is useful for general relevance testing where no specific answer is expected.
+
+* **Query set with reference answers**: A list of user queries, in which each query is paired with its expected answer. This format is particularly useful for evaluating applications designed to provide a specific answer, such as question-answering systems.
+
+### Fields
+
+All query sets comprise one or more entries. Each entry is a JSON object containing the following fields.
+
+| Field | Data type | Description |
+| :--- | :--- | :--- |
+| `queryText` | String | The user query string. Required. |
+| `referenceAnswer` | String | The expected or correct answer to the user query. This field is used for generating judgments, especially with large language models (LLMs). Optional. |
+
+### Basic query set example
+
+A basic query set contains only the `queryText` field for each entry. It is suitable for general relevance tests where no single "correct" answer exists.
+
+#### Example query set without reference answers
+
+```json
+{"queryText": "t towels kitchen"}
+{"queryText": "table top bandsaw for metal"}
+{"queryText": "tan strappy heels for women"}
+{"queryText": "tank top plus size women"}
+{"queryText": "tape and mudding tools"}
+```
+
+### Query set with reference answers example
+
+This format includes the `referenceAnswer` field alongside the `queryText`. It is ideal for evaluating applications designed to provide specific answers, such as chatbots or question-answering systems.
+
+#### Example query set with reference answers
+
+```json
+{"queryText": "What is the capital of France?", "referenceAnswer": "Paris"}
+{"queryText": "Who wrote 'Romeo and Juliet'?", "referenceAnswer": "William Shakespeare"}
+{"queryText": "What is the chemical symbol for water?", "referenceAnswer": "H2O"}
+{"queryText": "What is the highest mountain in the world?", "referenceAnswer": "Mount Everest"}
+{"queryText": "When was the first iPhone released?", "referenceAnswer": "June 29, 2007"}
+```
+
+
+The `referenceAnswer` field is particularly useful when using [LLMs to generate judgments]({{site.url}}{{site.baseurl}}/search-plugins/search-relevance/judgments/). The LLM can use the reference answer as a ground truth to compare against the retrieved search results, allowing it to accurately score the relevance of the response.
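+
+As a rough sketch, a query set with reference answers might be imported using the `manual` sampler as follows. The `querySetQueries` field name is an assumption that is not shown in this section, and the name and description are placeholders:
+
+```json
+PUT _plugins/_search_relevance/query_sets
+{
+    "name": "Trivia questions",
+    "description": "Questions paired with reference answers",
+    "sampling": "manual",
+    "querySetQueries": [
+        {"queryText": "What is the capital of France?", "referenceAnswer": "Paris"},
+        {"queryText": "What is the chemical symbol for water?", "referenceAnswer": "H2O"}
+    ]
+}
+```
+{% include copy-curl.html %}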
+
 ## Managing query sets
 
 You can retrieve or delete query sets using the following APIs.