Conversation

@liamy-nv (Contributor) commented Apr 9, 2025

Description

Feature: a tunable RAG evaluator

This PR adds a tunable RAG evaluator that allows for flexible evaluation of RAG workflows.
It includes a default scoring mechanism based on an expected answer description rather than a ground truth answer.

Key Points:

  • Full control over judge LLM prompt
  • Full control over scoring guidelines
  • Everything is configurable from the config file
  • No changes to the dataset format, aside from the answer field holding a description of the expected answer rather than a ground-truth answer (see the dataset sketch after the usage example)

Usage Example:

```
eval:
  evaluators:
    custom_rag_evaluation:
      _type: tunable_rag_evaluator
      llm_name: nim_rag_eval_llm
      default_scoring: false
      default_score_weights:
        coverage: 0.5
        correctness: 0.3
        relevance: 0.2
      judge_llm_prompt: >
        You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
        The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
        Take into account the question, the relevance of the answer to the question and the quality compared to the description of the expected answer.

        Rules:
        - The score must be a float between 0.0 and 1.0 on a sliding scale.
        - The reasoning string must be concise and to the point. It should be one sentence, or two only if extra description is needed. It must explain why the score was given and what differs between the generated answer and the expected answer.
```
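
For context, a dataset entry for this evaluator might look like the following. This is a hypothetical sketch: the field names are assumed to follow the existing question/answer dataset format, with the answer field holding a description of the expected answer rather than a ground-truth string.

```
{
  "id": "1",
  "question": "What is LangSmith?",
  "answer": "A description of LangSmith as a platform for testing, debugging and monitoring LLM applications, ideally noting its relationship to LangChain."
}
```

When default_scoring is enabled, the per-criterion judge scores are presumably combined as a weighted sum using default_score_weights, e.g. 0.5 * coverage + 0.3 * correctness + 0.2 * relevance with the weights above.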

By Submitting this PR I confirm:

  • I am familiar with the Contributing Guidelines.
  • We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
    • Any contribution which contains commits that are not Signed-Off will not be accepted.
  • When the PR is ready for review, new or existing tests cover these changes.
  • When the PR is ready for review, the documentation is up to date with these changes.

dagardner-nv and others added 11 commits March 17, 2025 17:05
Merge 1.0.0: upstream/develop to upstream/main
Merge pull request NVIDIA#92 from NVIDIA/develop
Updated changelog with another bug fix (NVIDIA#93)
* Fixes CI for non-PR workflows

Closes NVIDIA#80

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

URL: NVIDIA#81
Changes:
1. Added a new endpoint `generate/stream/full` that streams the complete IntermediateStep. Sample usage:
```
curl --request POST \
  --url http://localhost:8000/generate/stream/full \
  --header 'Content-Type: application/json' \
  --data '{
    "input_message": "What is LangSmith?"
}'
```
2. Use the `generate/stream/full` endpoint when evaluating remote workflows.
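
A client-side sketch for consuming this stream is shown below. This is hypothetical: it assumes the endpoint emits one IntermediateStep per line as newline-delimited JSON, which may differ from the actual wire format.

```python
# Hypothetical consumer for generate/stream/full; assumes newline-delimited JSON chunks.
import json

import requests

with requests.post(
        "http://localhost:8000/generate/stream/full",
        json={"input_message": "What is LangSmith?"},
        stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        step = json.loads(line)  # one complete IntermediateStep (assumed format)
        print(step)
```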


Sample Usage:
1.  Start server on the remote cluster with the base config.yml:
```
aiq serve --config_file=examples/simple/configs/config.yml 
```
2. Run the evaluation against the remote endpoint using a different config.yml that provides the dataset:
```
aiq eval --config_file=examples/simple/configs/eval_config.yml --endpoint http://localhost:8000
```

Closes NVIDIA#51

Authors:
  - Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)

Approvers:
  - Eric Evans II (https://github.com/ericevans-nv)

URL: NVIDIA#57
@liamy-nv requested a review from a team as a code owner April 9, 2025 23:03

copy-pr-bot bot commented Apr 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AnuradhaKaruppiah (Contributor) commented:

@liamy-nv Thanks for the PR. I am seeing some unrelated changes in this PR. Can you please update your fork with the latest develop and resolve any conflicts?

Copilot AI (Contributor) commented:

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Files not reviewed (1)
  • examples/simple_calculator/src/aiq_simple_calculator/data/simple_calculator_questions_custom.json: Language not supported

@mdemoret-nv added the "feature request" and "non-breaking" labels Apr 15, 2025
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 96c2d20

@AnuradhaKaruppiah (Contributor) commented:

/ok to test 610e729

Signed-off-by: Anuradha Karuppiah <[email protected]>
Signed-off-by: Anuradha Karuppiah <[email protected]>
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 7976a6c

Signed-off-by: Anuradha Karuppiah <[email protected]>
@AnuradhaKaruppiah (Contributor) commented:

/ok to test a74f83e

Also provide instructions for running the evaluator

Signed-off-by: Anuradha Karuppiah <[email protected]>
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 28008f0

@AnuradhaKaruppiah (Contributor) commented:

/ok to test 8a49f6d

It is common for LLM endpoints to time out. In that case we want to continue
evaluating other items and running other evaluators without raising an
exception, as sketched below.

Signed-off-by: Anuradha Karuppiah <[email protected]>
Signed-off-by: Anuradha Karuppiah <[email protected]>
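
A minimal sketch of this continue-on-timeout behavior (hypothetical names; the actual evaluator code may differ):

```python
# Hypothetical per-item evaluation that tolerates judge-LLM timeouts.
import asyncio
import logging

logger = logging.getLogger(__name__)

async def evaluate_item(judge_llm, item: dict, timeout_s: float = 60.0) -> dict:
    """Score one dataset item; on timeout, return a zero score instead of raising."""
    try:
        return await asyncio.wait_for(judge_llm.score(item), timeout=timeout_s)
    except asyncio.TimeoutError:
        # LLM endpoints commonly time out; log it and keep evaluating the rest.
        logger.warning("Judge LLM timed out on item %s; scoring it 0.0", item.get("id"))
        return {"score": 0.0, "reasoning": "Judge LLM request timed out."}
```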
@AnuradhaKaruppiah (Contributor) commented:

/ok to test 7d0d89e

@AnuradhaKaruppiah (Contributor) commented:

@liamy-nv Thanks for your contribution. The changes LGTM and I have approved the PR. However, it is failing DCO because some commits are not signed off. Please use these instructions to fix that:
https://github.com/NVIDIA/AIQToolkit/pull/110/checks

Please remember to pull down all the commits from this PR first, to avoid accidentally overwriting them when you force-push.

@AnuradhaKaruppiah (Contributor) commented:

/merge

@rapids-bot merged commit d68cd9d into NVIDIA:develop May 1, 2025
10 checks passed
yczhang-nv pushed a commit to yczhang-nv/NeMo-Agent-Toolkit that referenced this pull request May 8, 2025
yczhang-nv pushed a commit to yczhang-nv/NeMo-Agent-Toolkit that referenced this pull request May 9, 2025
ericevans-nv pushed a commit to ericevans-nv/agent-iq that referenced this pull request Jun 3, 2025
ericevans-nv pushed a commit to ericevans-nv/agent-iq that referenced this pull request Jun 3, 2025
AnuradhaKaruppiah pushed a commit to AnuradhaKaruppiah/oss-agentiq that referenced this pull request Aug 4, 2025
scheckerNV pushed a commit to scheckerNV/aiq-factory-reset that referenced this pull request Aug 22, 2025