NVIDIA · rapids-bot · May 1, 2025 · Mar 18, 2025 · Mar 18, 2025 · Mar 18, 2025
@@ -208,6 +208,62 @@ eval:
 ```
 The swe-bench evaluator uses unstructured dataset entries. The entire row is provided as input to the workflow.
 
+### Tunable RAG Evaluator
+The tunable RAG evaluator is a customizable LLM evaluator that allows for flexible evaluation of RAG workflows.
+It includes a default scoring mechanism based on an expected answer description rather than a ground truth answer.
+
+The judge LLM prompt is tunable and can be provided in the `config.yml` file.
+
+The default scoring method rates each generated answer on three criteria:
+- Coverage: Evaluates if the answer covers all mandatory elements of the expected answer.
+- Correctness: Evaluates if the answer is correct compared to the expected answer.
+- Relevance: Evaluates if the answer is relevant to the question.
+
+The weight of each score can optionally be tuned by setting the `default_score_weights` parameter in the `config.yml` file. If it is not set, the three scores are weighted equally.
+
+To override the default scoring, set the config boolean `default_scoring` to `false` and describe your own scoring mechanism in a custom judge LLM prompt.
+Note: The judge LLM prompt can still be tuned when the default scoring method is used.
+
+**Example:**
+`examples/simple_calculator/configs/config-tunable-rag-eval.yml`:
+```yaml
+eval:
+  evaluators:
+    tuneable_eval:
+      _type: tunable_rag_evaluator
+      llm_name: nim_rag_eval_llm
+      default_scoring: true
+      default_score_weights:
+        coverage: 0.5
+        correctness: 0.3
+        relevance: 0.2
+      judge_llm_prompt: >
+        You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
+        The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
+        Take into account the question, the relevance of the answer to the question and the quality compared to the description of the expected answer.
+
+        Rules:
+        - The score must be a float of any value between 0.0 and 1.0 on a sliding scale.
+        - The reasoning string must be concise and to the point. It should be 1 sentence and 2 only if extra description is needed. It must explain why the score was given and what is different between the generated answer and the expected answer.
+```
+
+Note: In your evaluation dataset, the `answer` field must be a description of the expected answer that details what the generated answer should contain, not a ground truth answer.
+
+**Example:**
+`examples/simple_calculator/data/simple_calculator.json`:
+```json
+{
+  "id": 1,
+  "question": "What is the product of 3 and 7, and is it greater than the current hour?",
+  "answer": "Answer must have the answer of product of 3 and 7 and whether it is greater than the current hour"
+}
+```
+
+**Sample Usage:**
+```bash
+aiq eval --config_file=examples/simple_calculator/configs/config-tunable-rag-eval.yml
+```
+
 ## Adding Custom Evaluators
 You can add custom evaluators to evaluate the workflow output. To add a custom evaluator, you need to implement the evaluator and register it with the AIQ Toolkit evaluator system. See the [Custom Evaluator](../guides/custom-evaluator.md) documentation for more information.
 
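As a rough illustration of the default scoring described in the Tunable RAG Evaluator section above, the three criterion scores could be combined into one value as a weighted average. This is a sketch under stated assumptions, not the evaluator's actual implementation: the `weighted_score` function, the `SCORE_KEYS` tuple, and the normalization by the weight sum are all hypothetical and do not come from the toolkit.

```python
from typing import Dict, Optional

# Criterion names taken from the default scoring description above.
SCORE_KEYS = ("coverage", "correctness", "relevance")


def weighted_score(scores: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Combine per-criterion scores (each in [0.0, 1.0]) into one score.

    Hypothetical helper: the toolkit may combine scores differently.
    """
    if weights is None:
        # No weights configured: weight every criterion equally.
        weights = {key: 1.0 for key in SCORE_KEYS}
    total = sum(weights[key] for key in SCORE_KEYS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Assumption: normalize by the weight sum so the result stays in [0, 1]
    # even when the configured weights do not add up to exactly 1.
    return sum(scores[key] * weights[key] for key in SCORE_KEYS) / total


# Made-up judge scores for a single answer, for illustration only.
example = {"coverage": 1.0, "correctness": 0.5, "relevance": 1.0}

# Weights from the example config (0.5 / 0.3 / 0.2).
print(weighted_score(example, {"coverage": 0.5, "correctness": 0.3, "relevance": 0.2}))
# Equal weighting, as when `default_score_weights` is not set.
print(weighted_score(example))
```

With the example weights, coverage dominates the result, so an answer that covers every mandatory element but is only partly correct still scores higher than it would under equal weighting.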

@@ -178,6 +178,17 @@ Workflow Result:
 ### Examine the Traces in Phoenix
 Open your browser and navigate to `http://localhost:6006` to view the traces.
 
+## Accuracy Evaluation
+The answers generated by the workflow can be evaluated using the [Tunable RAG Evaluator](../../../docs/source/concepts/evaluate.md#tunable-rag-evaluator). A sample dataset is provided in `examples/simple_calculator/data/simple_calculator.json`.
+
+To run the evaluation, use the `aiq eval` command:
+
+```bash
+aiq eval --config_file examples/simple_calculator/configs/config-tunable-rag-eval.yml
+```
+
+The evaluation results will be saved in `examples/simple_calculator/.tmp/eval/simple_calculator/tuneable_eval_output.json`.
+
 ## Deployment-Oriented Setup
 
 For a production deployment, use Docker:

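Before running `aiq eval`, it may help to check that every dataset entry carries the fields shown in the documented example (`id`, `question`, and an `answer` that describes the expected answer). The following is a minimal sketch, assuming the dataset is a list of JSON objects with those three fields; the `find_invalid_entries` helper and the second sample entry are hypothetical and not part of the toolkit or the sample dataset.

```python
from typing import Any, Dict, List

# Fields shown in the documented dataset example.
REQUIRED_FIELDS = ("id", "question", "answer")


def find_invalid_entries(entries: List[Dict[str, Any]]) -> List[str]:
    """Return a human-readable problem for each missing or empty field."""
    problems = []
    for index, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            # Treat absent values and blank strings as invalid.
            if value is None or (isinstance(value, str) and not value.strip()):
                problems.append(f"entry {index}: missing or empty '{field}'")
    return problems


sample = [
    {
        "id": 1,
        "question": "What is the product of 3 and 7, and is it greater than the current hour?",
        "answer": "Answer must have the answer of product of 3 and 7 and whether it is greater than the current hour",
    },
    # Hypothetical entry with an empty answer description, for illustration.
    {"id": 2, "question": "What is 8 divided by 2?", "answer": ""},
]

print(find_invalid_entries(sample))
```

To check a file on disk, the same function could be applied to the result of `json.load`, provided the file really is a list of such objects.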
@@ -0,0 +1,99 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+general:
+  use_uvloop: true
+
+functions:
+  calculator_multiply:
+    _type: calculator_multiply
+  calculator_inequality:
+    _type: calculator_inequality
+  calculator_divide:
+    _type: aiq_simple_calculator/calculator_divide
+  current_datetime:
+    _type: current_datetime
+  calculator_subtract:
+    _type: calculator_subtract
+
+llms:
+  nim_llm:
+    _type: nim
+    model_name: meta/llama-3.1-70b-instruct
+    temperature: 0.0
+    max_tokens: 1024
+  eval_llm:
+    _type: nim
+    model_name: mistralai/mixtral-8x22b-instruct-v0.1
+    temperature: 0.0
+    max_tokens: 1024
+  openai_llm:
+    _type: openai
+    model_name: gpt-3.5-turbo
+    max_tokens: 2000
+
+workflow:
+  _type: react_agent
+  tool_names:
+    - calculator_multiply
+    - calculator_inequality
+    - current_datetime
+    - calculator_divide
+    - calculator_subtract
+  llm_name: nim_llm
+  verbose: true
+  retry_parsing_errors: true
+  max_retries: 3
+
+
+eval:
+  general:
+    output_dir: examples/simple_calculator/.tmp/eval/simple_calculator
+    dataset:
+      _type: json
+      file_path: examples/simple_calculator/data/simple_calculator.json
+  evaluators:
+    tuneable_eval:
+      _type: tunable_rag_evaluator
+      llm_name: eval_llm
+      default_scoring: true
+      default_score_weights:
+        coverage: 0.5
+        correctness: 0.3
+        relevance: 0.2
+      judge_llm_prompt: >
+        You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
+        The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
+        Take into account the question, the relevance of the answer to the question and the quality compared to the description of the expected answer.
+
+        Rules:
+        - The score must be a float of any value between 0.0 and 1.0 on a sliding scale.
+        - The reasoning string must be concise and to the point. It should be 1 sentence and 2 only if extra description is needed. It must explain why the score was given and what is different between the generated answer and the expected answer.
+        - The tags <image> and <chart> are real images and charts.
@@ -20,3 +20,4 @@
 from .rag_evaluator.register import register_ragas_evaluator
 from .swe_bench_evaluator.register import register_swe_bench_evaluator
 from .trajectory_evaluator.register import register_trajectory_evaluator
+from .tunable_rag_evaluator.register import register_tunable_rag_evaluator