diff --git a/docs/source/concepts/evaluate.md b/docs/source/concepts/evaluate.md
index fe0b28825..5fb7a91b4 100644
--- a/docs/source/concepts/evaluate.md
+++ b/docs/source/concepts/evaluate.md
@@ -208,6 +208,62 @@ eval:
 ```
 The swe-bench evaluator uses unstructured dataset entries. The entire row is provided as input to the workflow.
 
+### Tunable RAG Evaluator
+The tunable RAG evaluator is a customizable LLM-judge evaluator that provides flexible evaluation of RAG workflows.
+It includes a default scoring mechanism that is based on a description of the expected answer rather than a ground-truth answer.
+
+The judge LLM prompt is tunable and can be provided in the `config.yml` file.
+
+The default scoring method computes three scores:
+- Coverage: Evaluates whether the answer covers all mandatory elements of the expected answer.
+- Correctness: Evaluates whether the answer is correct compared to the expected answer.
+- Relevance: Evaluates whether the answer is relevant to the question.
+
+The weights of these scores can be tuned by setting the `default_score_weights` parameter in the `config.yml` file. If not set, each score is weighted equally. A sketch of the weighted-score computation is provided at the end of this page.
+
+The default scoring can be overridden by setting the config boolean `default_scoring` to `false` and describing your own scoring mechanism in a custom judge LLM prompt.
+Note: If you choose to use the default scoring method, you can still tune the judge LLM prompt.
+
+**Example:**
+`examples/simple_calculator/configs/config-tunable-rag-eval.yml`:
+```yaml
+eval:
+  evaluators:
+    tuneable_eval:
+      _type: tunable_rag_evaluator
+      llm_name: nim_rag_eval_llm
+      default_scoring: false
+      default_score_weights:
+        coverage: 0.5
+        correctness: 0.3
+        relevance: 0.2
+      judge_llm_prompt: >
+        You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
+        The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
+        Take into account the question, the relevance of the answer to the question, and the quality of the answer compared to the description of the expected answer.
+
+        Rules:
+        - The score must be a float between 0.0 and 1.0 on a sliding scale.
+        - The reasoning string must be concise and to the point. It should be 1 sentence, or 2 only if extra description is needed. It must explain why the score was given and what differs between the generated answer and the expected answer.
+```
+
+Note: In your evaluation dataset, make sure that the `answer` field is a description of the expected answer with details on what is expected from the generated answer.
+
+**Example:**
+`examples/simple_calculator/data/simple_calculator.json`:
+```json
+{
+  "id": 1,
+  "question": "What is the product of 3 and 7, and is it greater than the current hour?",
+  "answer": "Answer must have the answer of product of 3 and 7 and whether it is greater than the current hour"
+}
+```
+
+**Sample Usage:**
+```bash
+aiq eval --config_file=examples/simple_calculator/configs/config-tunable-rag-eval.yml
+```
+
 ## Adding Custom Evaluators
 You can add custom evaluators to evaluate the workflow output. To add a custom evaluator, you need to implement the evaluator and register it with the AIQ Toolkit evaluator system. See the [Custom Evaluator](../guides/custom-evaluator.md) documentation for more information.
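+
+**Default score computation (sketch):**
+The default score is a weighted combination of the three judge scores, with the weights normalized so they always sum to 1. The snippet below is a minimal editorial sketch of that computation, not the toolkit's implementation; the helper name `combine_default_scores` is hypothetical, while the score and weight keys mirror the fields described in the Tunable RAG Evaluator section above.
+
+```python
+def combine_default_scores(scores: dict, weights: dict | None = None) -> float:
+    """Weighted combination of the coverage/correctness/relevance judge scores."""
+    # Equal weights when default_score_weights is not set in the config
+    weights = weights or {"coverage": 1 / 3, "correctness": 1 / 3, "relevance": 1 / 3}
+    total = sum(weights.values())
+    # Normalize so the weights always sum to 1
+    normalized = {name: weight / total for name, weight in weights.items()}
+    return (normalized["coverage"] * scores["coverage_score"] +
+            normalized["correctness"] * scores["correctness_score"] +
+            normalized["relevance"] * scores["relevance_score"])
+
+
+# Weights of 0.5/0.3/0.2 with judge scores of 1.0/0.5/0.0 yield 0.5*1.0 + 0.3*0.5 + 0.2*0.0 = 0.65
+print(combine_default_scores(
+    {"coverage_score": 1.0, "correctness_score": 0.5, "relevance_score": 0.0},
+    {"coverage": 0.5, "correctness": 0.3, "relevance": 0.2}))
+```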
diff --git a/examples/simple_calculator/README.md b/examples/simple_calculator/README.md
index 57230af75..bda42e249 100644
--- a/examples/simple_calculator/README.md
+++ b/examples/simple_calculator/README.md
@@ -178,6 +178,17 @@ Workflow Result:
 ```
 
 ### Examine the Traces in Phoenix
 Open your browser and navigate to `http://localhost:6006` to view the traces.
 
+## Accuracy Evaluation
+The answers generated by the workflow can be evaluated using the [`Tunable RAG Evaluator`](../../docs/source/concepts/evaluate.md#tunable-rag-evaluator). A sample dataset is provided in `examples/simple_calculator/data/simple_calculator.json`.
+
+To run the evaluation, use the `aiq eval` command:
+
+```bash
+aiq eval --config_file examples/simple_calculator/configs/config-tunable-rag-eval.yml
+```
+
+The evaluation results will be saved in `examples/simple_calculator/.tmp/eval/simple_calculator/tuneable_eval_output.json`.
+
 ## Deployment-Oriented Setup
 For a production deployment, use Docker:
diff --git a/examples/simple_calculator/src/aiq_simple_calculator/configs/config-tunable-rag-eval.yml b/examples/simple_calculator/src/aiq_simple_calculator/configs/config-tunable-rag-eval.yml
new file mode 100644
index 000000000..12715f393
--- /dev/null
+++ b/examples/simple_calculator/src/aiq_simple_calculator/configs/config-tunable-rag-eval.yml
@@ -0,0 +1,99 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
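+
+# Configuration for running the simple calculator workflow and evaluating its answers
+# with the tunable RAG evaluator. The `eval_llm` entry defined under `llms` is used as
+# the judge LLM by the `tuneable_eval` evaluator configured in the `eval` section below.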
+
+general:
+  use_uvloop: true
+
+functions:
+  calculator_multiply:
+    _type: calculator_multiply
+  calculator_inequality:
+    _type: calculator_inequality
+  calculator_divide:
+    _type: aiq_simple_calculator/calculator_divide
+  current_datetime:
+    _type: current_datetime
+  calculator_subtract:
+    _type: calculator_subtract
+
+llms:
+  nim_llm:
+    _type: nim
+    model_name: meta/llama-3.1-70b-instruct
+    temperature: 0.0
+    max_tokens: 1024
+  eval_llm:
+    _type: nim
+    model_name: mistralai/mixtral-8x22b-instruct-v0.1
+    temperature: 0.0
+    max_tokens: 1024
+  openai_llm:
+    _type: openai
+    model_name: gpt-3.5-turbo
+    max_tokens: 2000
+
+workflow:
+  _type: react_agent
+  tool_names:
+    - calculator_multiply
+    - calculator_inequality
+    - current_datetime
+    - calculator_divide
+    - calculator_subtract
+  llm_name: nim_llm
+  verbose: true
+  retry_parsing_errors: true
+  max_retries: 3
+
+
+eval:
+  general:
+    output_dir: examples/simple_calculator/.tmp/eval/simple_calculator
+    dataset:
+      _type: json
+      file_path: examples/simple_calculator/data/simple_calculator.json
+  evaluators:
+    tuneable_eval:
+      _type: tunable_rag_evaluator
+      llm_name: eval_llm
+      default_scoring: true
+      default_score_weights:
+        coverage: 0.5
+        correctness: 0.3
+        relevance: 0.2
+      judge_llm_prompt: >
+        You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
+        The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
+        Take into account the question, the relevance of the answer to the question, and the quality of the answer compared to the description of the expected answer.
+
+        Rules:
+        - The score must be a float between 0.0 and 1.0 on a sliding scale.
+        - The reasoning string must be concise and to the point. It should be 1 sentence, or 2 only if extra description is needed. It must explain why the score was given and what differs between the generated answer and the expected answer.
diff --git a/examples/simple_calculator/src/aiq_simple_calculator/data/simple_calculator.json b/examples/simple_calculator/src/aiq_simple_calculator/data/simple_calculator.json
new file mode 100644
index 000000000..4400e07ac
--- /dev/null
+++ b/examples/simple_calculator/src/aiq_simple_calculator/data/simple_calculator.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b742cea2aac5b2e33878e6aa7cbce842bd8bd1527eede9711fa3b4677e9dd294
+size 13173
diff --git a/src/aiq/eval/register.py b/src/aiq/eval/register.py
index 6dd166cc2..bb5ed436b 100644
--- a/src/aiq/eval/register.py
+++ b/src/aiq/eval/register.py
@@ -20,3 +20,4 @@
 from .rag_evaluator.register import register_ragas_evaluator
 from .swe_bench_evaluator.register import register_swe_bench_evaluator
 from .trajectory_evaluator.register import register_trajectory_evaluator
+from .tunable_rag_evaluator.register import register_tunable_rag_evaluator
diff --git a/src/aiq/eval/tunable_rag_evaluator/__init__.py b/src/aiq/eval/tunable_rag_evaluator/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/src/aiq/eval/tunable_rag_evaluator/evaluate.py b/src/aiq/eval/tunable_rag_evaluator/evaluate.py
new file mode 100644
index 000000000..4eb4d47e6
--- /dev/null
+++ b/src/aiq/eval/tunable_rag_evaluator/evaluate.py
@@ -0,0 +1,263 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import asyncio +import logging + +from langchain.output_parsers import ResponseSchema +from langchain.output_parsers import StructuredOutputParser +from langchain.schema import HumanMessage +from langchain.schema import SystemMessage +from langchain_core.language_models import BaseChatModel +from tqdm import tqdm + +from aiq.eval.evaluator.evaluator_model import EvalInput +from aiq.eval.evaluator.evaluator_model import EvalInputItem +from aiq.eval.evaluator.evaluator_model import EvalOutput +from aiq.eval.evaluator.evaluator_model import EvalOutputItem +from aiq.eval.utils.tqdm_position_registry import TqdmPositionRegistry + +logger = logging.getLogger(__name__) + +# pylint: disable=line-too-long +# flake8: noqa: E501 + + +def evaluation_prompt(judge_llm_prompt: str, + question: str, + answer_description: str, + generated_answer: str, + format_instructions: str, + default_scoring: bool): + """ + This function generates a prompt for the judge LLM to evaluate the generated answer. + """ + + DEFAULT_SCORING_INSTRUCTIONS = """ + The coverage score is a measure of how well the generated answer covers the critical aspects mentioned in the expected answer. A low coverage score indicates that the generated answer misses critical aspects of the expected answer. A middle coverage score indicates that the generated answer covers some of the must-haves of the expected answer but lacks other details. A high coverage score indicates that all of the expected aspects are present in the generated answer. + The correctness score is a measure of how well the generated answer matches the expected answer. A low correctness score indicates that the generated answer is incorrect or does not match the expected answer. A middle correctness score indicates that the generated answer is correct but lacks some details. A high correctness score indicates that the generated answer is exactly the same as the expected answer. + The relevance score is a measure of how well the generated answer is relevant to the question. A low relevance score indicates that the generated answer is not relevant to the question. A middle relevance score indicates that the generated answer is somewhat relevant to the question. A high relevance score indicates that the generated answer is exactly relevant to the question. + The reasoning is a 1-2 sentence explanation for the scoring. + """ + + DEFAULT_EVAL_PROMPT = (f"You are an intelligent assistant that responds strictly in JSON format." + f"Judge based on the following scoring rubric: {DEFAULT_SCORING_INSTRUCTIONS}" + f"{judge_llm_prompt}\n" + f"{format_instructions}\n" + f"Here is the user's query: {question}" + f"Here is the description of the expected answer: {answer_description}" + f"Here is the generated answer: {generated_answer}") + + EVAL_PROMPT = (f"You are an intelligent assistant that responds strictly in JSON format. 
{judge_llm_prompt}\n" + f"{format_instructions}\n" + f"Here is the user's query: {question}" + f"Here is the description of the expected answer: {answer_description}" + f"Here is the generated answer: {generated_answer}") + + return EVAL_PROMPT if not default_scoring else DEFAULT_EVAL_PROMPT + + +class TunableRagEvaluator: + '''Tunable RAG evaluator class with customizable LLM prompt for scoring.''' + + def __init__(self, + llm: BaseChatModel, + judge_llm_prompt: str, + max_concurrency: int, + default_scoring: bool, + default_score_weights: dict): + self.llm = llm + self.max_concurrency = max_concurrency + self.judge_llm_prompt = judge_llm_prompt + self.semaphore = asyncio.Semaphore(self.max_concurrency) + self.default_scoring = default_scoring + # Use user-provided weights if available; otherwise, set equal weights for each score + self.default_score_weights = default_score_weights if default_score_weights else { + "coverage": 1 / 3, "correctness": 1 / 3, "relevance": 1 / 3 + } + + async def evaluate(self, eval_input: EvalInput) -> EvalOutput: + '''Evaluate function''' + + async def process_item(item): + """Compute RAG evaluation for an individual item""" + question = item.input_obj + answer_description = item.expected_output_obj + generated_answer = item.output_obj + + # Call judge LLM to generate score + score = 0.0 + + default_evaluation_schema = [ + ResponseSchema( + name="coverage_score", + description= + "Score for the coverage of all critical aspects mentioned in the expected answer. Ex. 0.5", + type="float"), + ResponseSchema( + name="correctness_score", + description= + "Score for the accuracy of the generated answer compared to the expected answer. Ex. 0.5", + type="float"), + ResponseSchema(name="relevance_score", + description="Score for the relevance of the generated answer to the question. Ex. 0.5", + type="float"), + ResponseSchema( + name="reasoning", + description= + "1-2 summarized sentences of reasoning for the scores. Ex. 'The generated answer covers all critical aspects mentioned in the expected answer, is correct, and is relevant to the question.'", + type="string"), + ] + + custom_evaluation_schema = [ + ResponseSchema(name="score", description="Score for the generated answer. Ex. 0.5", type="float"), + ResponseSchema( + name="reasoning", + description= + "1-2 sentence reasoning for the score. Ex. 'The generated answer is exactly the same as the description of the expected answer.'", + type="string"), + ] + + if self.default_scoring: + evaluation_schema = default_evaluation_schema + else: + evaluation_schema = custom_evaluation_schema + + llm_input_response_parser = StructuredOutputParser.from_response_schemas(evaluation_schema) + format_instructions = llm_input_response_parser.get_format_instructions() + + eval_prompt = evaluation_prompt(judge_llm_prompt=self.judge_llm_prompt, + question=question, + answer_description=answer_description, + generated_answer=generated_answer, + format_instructions=format_instructions, + default_scoring=self.default_scoring) + + messages = [ + SystemMessage(content="You must respond only in JSON format."), HumanMessage(content=eval_prompt) + ] + + response = await self.llm.ainvoke(messages) + + # Initialize default values to handle service errors + coverage_score = 0.0 + correctness_score = 0.0 + relevance_score = 0.0 + reasoning = "Error in evaluator from parsing judge LLM response." 
+
+            try:
+                parsed_response = llm_input_response_parser.parse(response.content)
+                if self.default_scoring:
+                    try:
+                        coverage_score = parsed_response["coverage_score"]
+                        correctness_score = parsed_response["correctness_score"]
+                        relevance_score = parsed_response["relevance_score"]
+                        reasoning = parsed_response["reasoning"]
+                    except KeyError as e:
+                        logger.error("Missing required keys in default scoring response: %s",
+                                     ", ".join(str(arg) for arg in e.args))
+                        reasoning = f"Error in evaluator from parsing judge LLM response. Missing required key(s): {', '.join(str(arg) for arg in e.args)}"
+
+                    coverage_weight = self.default_score_weights.get("coverage", 1 / 3)
+                    correctness_weight = self.default_score_weights.get("correctness", 1 / 3)
+                    relevance_weight = self.default_score_weights.get("relevance", 1 / 3)
+
+                    # Normalize the weights so they sum to 1 before computing the weighted score
+                    total_weight = coverage_weight + correctness_weight + relevance_weight
+                    if round(total_weight, 2) != 1:
+                        logger.warning("The sum of the default score weights is not 1. The weights will be normalized.")
+                    coverage_weight = coverage_weight / total_weight
+                    correctness_weight = correctness_weight / total_weight
+                    relevance_weight = relevance_weight / total_weight
+
+                    score = (coverage_weight * coverage_score + correctness_weight * correctness_score +
+                             relevance_weight * relevance_score)
+
+                else:
+                    try:
+                        score = parsed_response["score"]
+                        reasoning = parsed_response["reasoning"]
+                    except KeyError as e:
+                        logger.error("Missing required keys in custom scoring response: %s",
+                                     ", ".join(str(arg) for arg in e.args))
+                        reasoning = f"Error in evaluator from parsing judge LLM response. Missing required key(s): {', '.join(str(arg) for arg in e.args)}"
+                        raise
+            except (KeyError, ValueError) as e:
+                logger.error("Error parsing judge LLM response: %s", e)
+                score = 0.0
+                reasoning = "Error in evaluator from parsing judge LLM response."
+
+            if self.default_scoring:
+                reasoning = {
+                    "question": question,
+                    "answer_description": answer_description,
+                    "generated_answer": generated_answer,
+                    "score_breakdown": {
+                        "coverage_score": coverage_score,
+                        "correctness_score": correctness_score,
+                        "relevance_score": relevance_score,
+                    },
+                    "reasoning": reasoning,
+                }
+            else:
+                reasoning = {
+                    "question": question,
+                    "answer_description": answer_description,
+                    "generated_answer": generated_answer,
+                    "reasoning": reasoning
+                }
+
+            return score, reasoning
+
+        async def wrapped_process(item: EvalInputItem) -> tuple[float, dict]:
+            """
+            Process an item asynchronously and update the progress bar.
+            Use the semaphore to limit the number of concurrent items.
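+            Returns the (score, reasoning) tuple produced for the item.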
+ """ + async with self.semaphore: + result = await process_item(item) + # Update the progress bar + pbar.update(1) + return result + + try: + # Claim a tqdm position to display the progress bar + tqdm_position = TqdmPositionRegistry.claim() + # Create a progress bar + pbar = tqdm(total=len(eval_input.eval_input_items), desc="Evaluating RAG", position=tqdm_position) + # Process items concurrently with a limit on concurrency + results = await asyncio.gather(*[wrapped_process(item) for item in eval_input.eval_input_items]) + finally: + pbar.close() + TqdmPositionRegistry.release(tqdm_position) + + # Extract scores and reasonings + sample_scores, sample_reasonings = zip(*results) if results else ([], []) + + # Compute average score + avg_score = round(sum(sample_scores) / len(sample_scores), 2) if sample_scores else 0.0 + + # Construct EvalOutputItems + eval_output_items = [ + EvalOutputItem(id=item.id, score=score, reasoning=reasoning) + for item, score, reasoning in zip(eval_input.eval_input_items, sample_scores, sample_reasonings) + ] + + return EvalOutput(average_score=avg_score, eval_output_items=eval_output_items) diff --git a/src/aiq/eval/tunable_rag_evaluator/register.py b/src/aiq/eval/tunable_rag_evaluator/register.py new file mode 100644 index 000000000..43d0c93be --- /dev/null +++ b/src/aiq/eval/tunable_rag_evaluator/register.py @@ -0,0 +1,50 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
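+
+# Registration glue for the tunable RAG evaluator: defines the `tunable_rag_evaluator`
+# config schema and registers an evaluate function built around the configured judge LLM.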
+ +from pydantic import Field + +from aiq.builder.builder import EvalBuilder +from aiq.builder.evaluator import EvaluatorInfo +from aiq.builder.framework_enum import LLMFrameworkEnum +from aiq.cli.register_workflow import register_evaluator +from aiq.data_models.component_ref import LLMRef +from aiq.data_models.evaluator import EvaluatorBaseConfig + + +class TunableRagEvaluatorConfig(EvaluatorBaseConfig, name="tunable_rag_evaluator"): + '''Configuration for tunable RAG evaluator''' + llm_name: LLMRef = Field(description="Name of the judge LLM") + judge_llm_prompt: str = Field(description="LLM prompt for the judge LLM") + default_scoring: bool = Field(description="Whether to use default scoring", default=False) + default_score_weights: dict = Field( + default={ + "coverage": 0.5, "correctness": 0.3, "relevance": 0.2 + }, + description="Weights for the different scoring components when using default scoring") + + +@register_evaluator(config_type=TunableRagEvaluatorConfig) +async def register_tunable_rag_evaluator(config: TunableRagEvaluatorConfig, builder: EvalBuilder): + '''Register tunable RAG evaluator''' + from .evaluate import TunableRagEvaluator + + llm = await builder.get_llm(config.llm_name, wrapper_type=LLMFrameworkEnum.LANGCHAIN) + evaluator = TunableRagEvaluator(llm, + config.judge_llm_prompt, + builder.get_max_concurrency(), + config.default_scoring, + config.default_score_weights) + + yield EvaluatorInfo(config=config, evaluate_fn=evaluator.evaluate, description="Tunable RAG Evaluator") diff --git a/tests/aiq/eval/tunable_rag_evaluator/test_tunable_rag_evaluate.py b/tests/aiq/eval/tunable_rag_evaluator/test_tunable_rag_evaluate.py new file mode 100644 index 000000000..0f37c5923 --- /dev/null +++ b/tests/aiq/eval/tunable_rag_evaluator/test_tunable_rag_evaluate.py @@ -0,0 +1,139 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
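+
+# Unit tests for TunableRagEvaluator: the judge LLM is mocked, so the tests exercise
+# response parsing, default and custom scoring, and error handling without network calls.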
+ +from unittest.mock import AsyncMock +from unittest.mock import MagicMock + +import pytest +from langchain_core.language_models import BaseChatModel + +from aiq.eval.evaluator.evaluator_model import EvalInput +from aiq.eval.evaluator.evaluator_model import EvalInputItem +from aiq.eval.evaluator.evaluator_model import EvalOutput +from aiq.eval.tunable_rag_evaluator.evaluate import TunableRagEvaluator + + +@pytest.fixture +def mock_llm(): + return MagicMock(spec=BaseChatModel) + + +@pytest.fixture +def default_score_weights(): + return {"coverage": 1, "correctness": 1, "relevance": 1} + + +@pytest.fixture +def rag_eval_input(): + items = [ + EvalInputItem(id="1", + input_obj="What is AI?", + expected_output_obj="AI is artificial intelligence.", + output_obj="AI is the simulation of human intelligence.", + expected_trajectory=[], + trajectory=[]), + EvalInputItem(id="2", + input_obj="Define ML", + expected_output_obj="Machine Learning is a subset of AI.", + output_obj="ML helps machines learn.", + expected_trajectory=[], + trajectory=[]) + ] + return EvalInput(eval_input_items=items) + + +@pytest.fixture +def evaluator(mock_llm, default_score_weights): + return TunableRagEvaluator(llm=mock_llm, + judge_llm_prompt="Please evaluate the answer.", + max_concurrency=2, + default_scoring=True, + default_score_weights=default_score_weights) + + +async def test_evaluate_success(evaluator, rag_eval_input): + """Test successful evaluation using TunableRagEvaluator with mocked LLM.""" + + # Mock LLM response content + evaluator.llm.ainvoke = AsyncMock(side_effect=[ + MagicMock(content='{"coverage_score": 0.9, "correctness_score": 0.8,\ + "relevance_score": 0.7, "reasoning": "Solid answer."}'), + MagicMock(content='{"coverage_score": 0.6, "correctness_score": 0.7,\ + "relevance_score": 0.8, "reasoning": "Good effort."}') + ]) + + eval_output: EvalOutput = await evaluator.evaluate(rag_eval_input) + + assert isinstance(eval_output, EvalOutput) + assert len(eval_output.eval_output_items) == 2 + + for item in eval_output.eval_output_items: + assert item.score > 0 + assert isinstance(item.reasoning, dict) + assert "reasoning" in item.reasoning + + assert round(eval_output.average_score, 2) > 0.0 + + +async def test_evaluate_partial_failure(evaluator, rag_eval_input): + """Test partial failure where one LLM response is invalid.""" + + # One successful, one broken response + evaluator.llm.ainvoke = AsyncMock(side_effect=[ + MagicMock( + content='{"coverage_score": 0.9, "correctness_score": 0.9, "relevance_score": 0.9, "reasoning": "Perfect."}' + ), + MagicMock(content='INVALID JSON RESPONSE') + ]) + + eval_output: EvalOutput = await evaluator.evaluate(rag_eval_input) + + assert len(eval_output.eval_output_items) == 2 + + successful_item = next(item for item in eval_output.eval_output_items if item.score > 0) + failed_item = next(item for item in eval_output.eval_output_items if item.score == 0) + + assert "Perfect" in successful_item.reasoning["reasoning"] + assert "parsing judge LLM response" in failed_item.reasoning["reasoning"] + + assert eval_output.average_score > 0 + assert eval_output.average_score < 1 + + +async def test_evaluate_custom_scoring(): + """Test custom scoring mode (not default)""" + + llm = MagicMock(spec=BaseChatModel) + evaluator = TunableRagEvaluator(llm=llm, + judge_llm_prompt="Score this answer.", + max_concurrency=1, + default_scoring=False, + default_score_weights={}) + + input_data = EvalInput(eval_input_items=[ + EvalInputItem(id="1", + input_obj="What is NLP?", + 
expected_output_obj="Study of language processing", + output_obj="It's about language.", + expected_trajectory=[], + trajectory=[]) + ]) + + llm.ainvoke = AsyncMock(return_value=MagicMock(content='{"score": 0.75, "reasoning": "Fair explanation."}')) + + output = await evaluator.evaluate(input_data) + assert len(output.eval_output_items) == 1 + assert output.eval_output_items[0].score == 0.75 + assert output.eval_output_items[0].reasoning["reasoning"] == "Fair explanation."
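+
+
+async def test_evaluate_weighted_default_scoring():
+    """Illustrative sketch: custom weights drive the default weighted score.
+
+    This is an editorial addition that mirrors the fixtures above and assumes the same
+    async pytest setup; with weights of 0.5/0.3/0.2 and judge scores of 1.0/0.5/0.0 the
+    weighted score should be 0.5*1.0 + 0.3*0.5 + 0.2*0.0 = 0.65.
+    """
+    llm = MagicMock(spec=BaseChatModel)
+    evaluator = TunableRagEvaluator(llm=llm,
+                                    judge_llm_prompt="Please evaluate the answer.",
+                                    max_concurrency=1,
+                                    default_scoring=True,
+                                    default_score_weights={
+                                        "coverage": 0.5, "correctness": 0.3, "relevance": 0.2
+                                    })
+
+    input_data = EvalInput(eval_input_items=[
+        EvalInputItem(id="1",
+                      input_obj="What is AI?",
+                      expected_output_obj="AI is artificial intelligence.",
+                      output_obj="AI is the simulation of human intelligence.",
+                      expected_trajectory=[],
+                      trajectory=[])
+    ])
+
+    # Mocked judge LLM response with per-component scores
+    llm.ainvoke = AsyncMock(return_value=MagicMock(
+        content='{"coverage_score": 1.0, "correctness_score": 0.5, "relevance_score": 0.0, "reasoning": "Mixed."}'))
+
+    output = await evaluator.evaluate(input_data)
+
+    assert len(output.eval_output_items) == 1
+    # Weighted combination of the three judge scores
+    assert output.eval_output_items[0].score == pytest.approx(0.65)
+    assert output.average_score == pytest.approx(0.65)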