Add validator module #61
Conversation
Current test coverage does not include

Instead of your gist here: https://gist.github.com/elisno/65dca2bebb20e1749afa753784bab920, please go ahead and PR the same gist as a tutorial notebook to: https://github.com/cleanlab/cleanlab-studio-docs/

For now, have the tutorial show:
Adds support for custom evaluation thresholds, introduces ThresholdedTrustworthyRAGScore type, and improves validation error handling with better documentation.
Overall LGTM. High-level comment mirrors my comment on the tutorial—how is a user supposed to understand whether they should use Validator, TrustworthyRAG, or TLM directly?
Left a bunch of smaller comments inline.
```python
**scores,
}

def detect(
```
If a user just wants to detect bad responses, should they use TrustworthyRAG or Validate.detect? How is a user supposed to understand how these two relate to each other?
The idea (which the docstring should be updated to reflect) was that `Validator` is just a version of `TrustworthyRAG` with different default evals & predetermined thresholds.

The practical impact of those thresholds is that they determine when we look things up in Codex (what is logged in the Project for an SME to answer, and what gets answered by Codex instead of the RAG app). But that impact is primarily realized in `Validator.validate()`, not in `Validator.detect()`.

So we could make `detect()` a private method? It's essentially just another version of the `.validate()` method that is not hooked up to any Codex project (e.g. for testing detection configurations without impacting the Codex project via logging).
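The relationship being described can be sketched roughly like this. This is purely illustrative (it follows the "make `detect()` private" idea from this comment): the class internals, method signatures, and dummy score are assumptions, not the actual cleanlab implementation.

```python
# Illustrative sketch only: Validator as a thin wrapper around TrustworthyRAG,
# with detection kept private and validate() driving the Codex interaction.
from typing import Any, Optional


class TrustworthyRAG:  # stand-in for the real class
    def score(self, *, query: str, context: str, response: str) -> dict[str, Any]:
        # The real implementation calls the TLM backend; return a dummy score here.
        return {"trustworthiness": {"score": 0.9}}


class Validator:
    def __init__(self, bad_response_thresholds: Optional[dict[str, float]] = None):
        self._rag = TrustworthyRAG()
        self._thresholds = bad_response_thresholds or {"trustworthiness": 0.5}

    def _detect(self, query: str, context: str, response: str) -> tuple[dict, bool]:
        """Score the response and flag it, without touching any Codex project."""
        scores = self._rag.score(query=query, context=context, response=response)
        is_bad = any(
            scores[name]["score"] < threshold
            for name, threshold in self._thresholds.items()
            if name in scores
        )
        return scores, is_bad

    def validate(self, query: str, context: str, response: str) -> dict:
        scores, is_bad = self._detect(query, context, response)
        if is_bad:
            # Here the real class would query Codex for a remediation and/or
            # log the query in the Codex Project for an SME to answer.
            pass
        return {"is_bad_response": is_bad, **scores}
```

Under this framing, the thresholds only change behavior in `validate()` (whether Codex is consulted), while the scoring itself is plain `TrustworthyRAG`.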
That solution sounds fine to me—making it private, and updating the instructions to indicate that detect -> TrustworthyRAG, detect + remediate -> Validator.
sgtm. @elisno can you also add an optional flag to `Validator.validate()` which allows users to run the detection for testing purposes, but without interacting with Codex in any way? (No querying Codex at all, to ensure testing runs aren't polluting the Codex Project.)

This flag could be something like `testing_mode=False` by default (try to think of a better name).
On second thought, we should keep the `detect()` method public for threshold-tuning and testing purposes (without affecting Codex). I've updated the docstring to reflect this.

No need for another optional flag in `Validator.validate()`.
also include screenshot of tutorial where you show that it's clearly explained when to use validate() vs. detect()
I pushed more docstring changes to clearly distinguish these, so please review those.
> also include screenshot of tutorial where you show that it's clearly explained when to use validate() vs. detect()
That screenshot does not explain the main reason to use `detect()`, which is to test/tune detection configurations like the evaluation score thresholds and TrustworthyRAG settings.

This doesn't necessarily have to block merging of this PR, but it would be great for us to dogfood Validator in migrating rag.app to use Codex as a backup.
Co-authored-by: Anish Athalye <[email protected]>
…n in favor of the new Validator API.
…ecation in favor of the new Validator API.
Tutorial for this:
https://github.com/cleanlab/cleanlab-studio-docs/pull/868

Related tutorial for the TrustworthyRAG module:
https://github.com/cleanlab/cleanlab-studio-docs/pull/867
The idea is that users who are doing RAG and just want real-time detection of issues should use TrustworthyRAG (we will clarify that later on). Users who are doing RAG and want detection + flagging/logging + remediation of issues should use this tutorial (Codex). We can unify everything better later on.
What changed?
Added a module with a new `Validator` class that scores responses from RAG systems and detects & remediates bad responses.

A few notes:

- `BadResponseThresholds` for this module is a Pydantic BaseModel, mainly to validate that the thresholds are 0-1. I'm fine with just having this be a dictionary.
- The `get_default_evaluations()` function will control what `Eval`s to use with TrustworthyRAG by default. This `get_default_evaluations()` function is different from the one defined in cleanlab_tlm (IIRC it's called `get_default_evals()`).
- `tlm_api_key` can be None, as long as a `CLEANLAB_TLM_API_KEY` is set as an environment variable. So a minimal constructor would look like `validator = Validator(codex_access_key="<your-access-key>")`, with the rest having pre-defined configurations as defaults.

Usage example:
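The original usage example wasn't captured in this page. As an illustrative stand-in for the threshold-validation behavior described in the notes, here is a plain-Python sketch: the class name `BadResponseThresholds` comes from the notes above, but this implementation and its method names (e.g. `is_bad`) are assumptions, not the real Pydantic-based code.

```python
# Illustrative sketch of threshold validation (not the actual cleanlab code):
# mirrors the note above that BadResponseThresholds mainly checks values are 0-1.


class BadResponseThresholds:
    """Per-eval score thresholds; a response scoring below any threshold is 'bad'."""

    def __init__(self, **thresholds: float):
        for name, value in thresholds.items():
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"threshold {name!r} must be in [0, 1], got {value}")
        self.thresholds = dict(thresholds)

    def is_bad(self, scores: dict[str, float]) -> bool:
        # Missing scores default to 1.0, i.e. they never trigger the threshold.
        return any(
            scores.get(name, 1.0) < threshold
            for name, threshold in self.thresholds.items()
        )
```

For example, `BadResponseThresholds(trustworthiness=1.5)` would raise a `ValueError`, which is the main validation the Pydantic model is described as providing.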
Checklist