Conversation

@liamy-nv (Contributor) commented Apr 9, 2025

Description

Feature: a tunable RAG evaluator

This PR adds a tunable RAG evaluator that allows for flexible evaluation of RAG workflows.
It includes a default scoring mechanism based on an expected answer description rather than a ground truth answer.

Key Points:

  • Full control over judge LLM prompt
  • Full control over scoring guidelines
  • Everything is configurable from the config file
  • No changes to the dataset format, aside from the answer field holding a description of the expected answer rather than a ground-truth answer (see the dataset sketch after the usage example)

Usage Example:

```
eval:
  evaluators:
    custom_rag_evaluation:
      _type: tunable_rag_evaluator
      llm_name: nim_rag_eval_llm
      default_scoring: false
      default_score_weights:
        coverage: 0.5
        correctness: 0.3
        relevance: 0.2
      judge_llm_prompt: >
        You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
        The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
        Take into account the question, the relevance of the answer to the question and the quality compared to the description of the expected answer.

        Rules:
        - The score must be a float between 0.0 and 1.0 on a sliding scale.
        - The reasoning string must be concise and to the point. It should be one sentence, or two only if extra description is needed. It must explain why the score was given and what differs between the generated answer and the expected answer.
```
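
For context, a dataset entry for this evaluator might look like the following. This is a hypothetical sketch: the field names are assumed to follow the existing question/answer dataset format, with the answer field holding a description of the expected answer rather than a ground-truth string.

```
{
  "id": "1",
  "question": "What is LangSmith?",
  "answer": "A description of LangSmith as a platform for testing, debugging and monitoring LLM applications, ideally noting its relationship to LangChain."
}
```

When default_scoring is enabled, the per-criterion judge scores are presumably combined as a weighted sum using default_score_weights, e.g. 0.5 * coverage + 0.3 * correctness + 0.2 * relevance with the weights above.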

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

dagardner-nv and others added 11 commits March 17, 2025 17:05
Merge 1.0.0: upstream/develop to upstream/main
Merge pull request NVIDIA#92 from NVIDIA/develop
Updated changelog with another bug fix (NVIDIA#93)
* Fixes CI for non-PR workflows

Closes NVIDIA#80

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

URL: NVIDIA#81
Changes:
1. Added a new endpoint `generate/stream/full` that streams the complete IntermediateStep. Sample usage:
```
curl --request POST \
  --url http://localhost:8000/generate/stream/full \
  --header 'Content-Type: application/json' \
  --data '{
    "input_message": "What is LangSmith?"
}'
```
2. Use the `generate/stream/full` endpoint when evaluating remote workflows.
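
A client-side sketch for consuming this stream is shown below. This is hypothetical: it assumes the endpoint emits one IntermediateStep per line as newline-delimited JSON, which may differ from the actual wire format.

```python
# Hypothetical consumer for generate/stream/full; assumes newline-delimited JSON chunks.
import json

import requests

with requests.post(
        "http://localhost:8000/generate/stream/full",
        json={"input_message": "What is LangSmith?"},
        stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        step = json.loads(line)  # one complete IntermediateStep (assumed format)
        print(step)
```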


Sample Usage:
1.  Start server on the remote cluster with the base config.yml:
```
aiq serve --config_file=examples/simple/configs/config.yml 
```
2. Run the evaluation against the remote endpoint using a different config.yml that provides the dataset:
```
aiq eval --config_file=examples/simple/configs/eval_config.yml --endpoint http://localhost:8000
```

Closes NVIDIA#51

Authors:
  - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

Approvers:
  - Eric Evans II (https://github.com/ericevans-nv)

URL: NVIDIA#57
@liamy-nv requested a review from a team as a code owner April 9, 2025 23:03

copy-pr-bot bot commented Apr 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AnuradhaKaruppiah (Contributor) commented:

@liamy-nv Thanks for the PR. I am seeing some unrelated changes in this PR. Can you please update your fork with the latest develop and resolve any conflicts?

Copilot AI (Contributor) commented:

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • examples/simple_calculator/src/aiq_simple_calculator/data/simple_calculator_questions_custom.json: Language not supported

@mdemoret-nv added the "feature request" and "non-breaking" labels Apr 15, 2025
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 96c2d20

@AnuradhaKaruppiah (Contributor) commented:

/ok to test 610e729

Signed-off-by: Anuradha Karuppiah <[email protected]>
Signed-off-by: Anuradha Karuppiah <[email protected]>
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 7976a6c

Signed-off-by: Anuradha Karuppiah <[email protected]>
@AnuradhaKaruppiah (Contributor) commented:

/ok to test a74f83e

Also provide instructions for running the evaluator

Signed-off-by: Anuradha Karuppiah <[email protected]>
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 28008f0

@AnuradhaKaruppiah (Contributor) commented:

/ok to test 8a49f6d

It is common for LLM endpoints to time out. In that case we want to continue
evaluating other items and running other evaluators without raising an
exception, as sketched below.

Signed-off-by: Anuradha Karuppiah <[email protected]>
Signed-off-by: Anuradha Karuppiah <[email protected]>
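
A minimal sketch of this continue-on-timeout behavior (hypothetical names; the actual evaluator code may differ):

```python
# Hypothetical per-item evaluation that tolerates judge-LLM timeouts.
import asyncio
import logging

logger = logging.getLogger(__name__)

async def evaluate_item(judge_llm, item: dict, timeout_s: float = 60.0) -> dict:
    """Score one dataset item; on timeout, return a zero score instead of raising."""
    try:
        return await asyncio.wait_for(judge_llm.score(item), timeout=timeout_s)
    except asyncio.TimeoutError:
        # LLM endpoints commonly time out; log it and keep evaluating the rest.
        logger.warning("Judge LLM timed out on item %s; scoring it 0.0", item.get("id"))
        return {"score": 0.0, "reasoning": "Judge LLM request timed out."}
```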
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 7d0d89e

@AnuradhaKaruppiah (Contributor) commented:

@liamy-nv Thanks for your contribution. The changes LGTM and I have approved the PR. However, it is failing DCO because some commits are not signed off. Please use these instructions to fix that:
https://github.com/NVIDIA/AIQToolkit/pull/110/checks

Please remember to pull down all the commits from this PR first, to avoid accidentally overwriting them when you force-push.

@AnuradhaKaruppiah (Contributor) commented:

/merge

@rapids-bot merged commit d68cd9d into NVIDIA:develop May 1, 2025
10 checks passed
yczhang-nv pushed a commit to yczhang-nv/NeMo-Agent-Toolkit that referenced this pull request May 8, 2025
yczhang-nv pushed a commit to yczhang-nv/NeMo-Agent-Toolkit that referenced this pull request May 9, 2025
ericevans-nv pushed a commit to ericevans-nv/agent-iq that referenced this pull request Jun 3, 2025
ericevans-nv pushed a commit to ericevans-nv/agent-iq that referenced this pull request Jun 3, 2025
AnuradhaKaruppiah pushed a commit to AnuradhaKaruppiah/oss-agentiq that referenced this pull request Aug 4, 2025
scheckerNV pushed a commit to scheckerNV/aiq-factory-reset that referenced this pull request Aug 22, 2025