Add a tunable RAG evaluator #110
Conversation
Merge develop to main for rc-10
Merge develop to main
Merge 1.0.0: upstream/develop to upstream/main
Merge pull request NVIDIA#92 from NVIDIA/develop
Updated changelog with another bug fix (NVIDIA#93)
Fixes CI for non-PR workflows
Closes NVIDIA#80
## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.
Authors:
- David Gardner (https://github.com/dagardner-nv)
Approvers:
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)
URL: NVIDIA#81
Changes made:
1. Added a new endpoint `generate/stream/full` that streams the complete IntermediateStep (a sketch of consuming this stream follows the steps below). Sample usage:
```
curl --request POST --url http://localhost:8000/generate/stream/full --header 'Content-Type: application/json' --data '{
"input_message": "What is LangSmith?"
}'
```
2. Use the `generate/stream/full` endpoint for evaluating remote workflows.
Sample Usage:
1. Start server on the remote cluster with the base config.yml:
```
aiq serve --config_file=examples/simple/configs/config.yml
```
2. Run evaluation against the remote endpoint using a different config.yml that provides the dataset:
```
aiq eval --config_file=examples/simple/configs/eval_config.yml --endpoint http://localhost:8000
```
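For illustration, a minimal Python sketch of consuming the new streaming endpoint. This assumes the server emits one JSON object per line; the actual wire format (e.g. SSE framing with `data:` prefixes) may differ:
```
# Minimal sketch of consuming generate/stream/full.
# Assumption: the server streams one JSON-encoded IntermediateStep per line;
# adjust the parsing if the endpoint uses SSE framing instead.
import json

import requests

payload = {"input_message": "What is LangSmith?"}
with requests.post("http://localhost:8000/generate/stream/full",
                   json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip keep-alive blank lines
        step = json.loads(line)  # one serialized IntermediateStep
        print(step)
```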
Closes NVIDIA#51
Authors:
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)
Approvers:
- Eric Evans II (https://github.com/ericevans-nv)
URL: NVIDIA#57
@liamy-nv Thanks for the PR. I am seeing some unrelated changes in this PR. Can you please update your fork with the latest develop and resolve any conflicts?
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
Files not reviewed (1)
- examples/simple_calculator/src/aiq_simple_calculator/data/simple_calculator_questions_custom.json: Language not supported
Signed-off-by: Anuradha Karuppiah <[email protected]>
/ok to test 96c2d20
/ok to test 610e729
Signed-off-by: Anuradha Karuppiah <[email protected]>
/ok to test 7976a6c
Signed-off-by: Anuradha Karuppiah <[email protected]>
/ok to test a74f83e
Also provide instructions for running the evaluator. Signed-off-by: Anuradha Karuppiah <[email protected]>
/ok to test 28008f0
Signed-off-by: Anuradha Karuppiah <[email protected]>
/ok to test 8a49f6d
It is common for LLM endpoints to time out. In that case we want to continue evaluating the remaining items and running the other evaluators instead of raising an exception (a sketch of this pattern follows below). Signed-off-by: Anuradha Karuppiah <[email protected]>
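For illustration, a minimal sketch of this continue-on-timeout behavior. The `judge_llm.score` call and item fields are hypothetical stand-ins, not the PR's actual API:
```
# Hypothetical sketch: skip items whose judge-LLM call times out,
# so the remaining items and evaluators still run.
import asyncio

async def evaluate_item(judge_llm, item):
    """Score one item; return None on timeout instead of raising."""
    try:
        return await asyncio.wait_for(judge_llm.score(item), timeout=30.0)
    except asyncio.TimeoutError:
        print(f"judge LLM timed out on item {item['id']}; skipping")
        return None

async def evaluate_all(judge_llm, items):
    results = await asyncio.gather(*(evaluate_item(judge_llm, it) for it in items))
    return [r for r in results if r is not None]
```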
/ok to test 7d0d89e
@liamy-nv Thanks for your contribution. The changes LGTM and I have approved the PR. It is, however, failing the DCO check because some commits are not signed off. Can you please use these instructions to fix that? Please remember to pull down all the commits from this PR to avoid accidental overwrites when you do the force push.
/merge
Feature: a tunable RAG evaluator
This PR adds a tunable RAG evaluator that allows for flexible evaluation of RAG workflows.
It includes a default scoring mechanism based on an expected answer description rather than a ground truth answer.
Key Points:
- Full control over judge LLM prompt
- Full control over scoring guidelines
- Everything is configurable from the config file
- No changes to the dataset format, apart from the answer field holding a description of the expected answer
Usage Example:
```
eval:
evaluators:
custom_rag_evaluation:
_type: tunable_rag_evaluator
llm_name: nim_rag_eval_llm
default_scoring: false
default_score_weights:
coverage: 0.5
correctness: 0.3
relevance: 0.2
judge_llm_prompt: >
You are an intelligent evaluator that scores the generated answer based on the description of the expected answer.
The score is a measure of how well the generated answer matches the description of the expected answer based on the question.
Take into account the question, the relevance of the answer to the question and the quality compared to the description of the expected answer.
Rules:
      - The score must be a float between 0.0 and 1.0 on a sliding scale.
      - The reasoning string must be concise and to the point: one sentence, or two only if extra description is needed. It must explain why the score was given and what differs between the generated answer and the expected answer.
```
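To illustrate how `default_score_weights` might be applied when `default_scoring` is enabled, here is a minimal sketch that combines per-criterion judge scores into one score. The function and field names are hypothetical, not the evaluator's actual API:
```
# Hypothetical sketch: combine per-criterion judge scores using the
# default_score_weights from the config (coverage/correctness/relevance).
def weighted_score(criterion_scores: dict[str, float],
                   weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("score weights must not all be zero")
    return sum(criterion_scores[name] * w
               for name, w in weights.items()) / total_weight

# Example with the weights from the config above:
weights = {"coverage": 0.5, "correctness": 0.3, "relevance": 0.2}
scores = {"coverage": 0.8, "correctness": 1.0, "relevance": 0.6}
print(weighted_score(scores, weights))  # 0.82
```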
## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/AgentIQ/blob/develop/docs/source/advanced/contributing.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.
Authors:
- https://github.com/liamy-nv
- David Gardner (https://github.com/dagardner-nv)
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)
- Michael Demoret (https://github.com/mdemoret-nv)
Approvers:
- Anuradha Karuppiah (https://github.com/AnuradhaKaruppiah)
URL: NVIDIA#110
Signed-off-by: Yuchen Zhang <[email protected]>