
Add validator module #61

Merged: 27 commits from the validator branch were merged into main on Mar 27, 2025

Conversation

@elisno elisno commented Mar 21, 2025

Tutorial for this PR:

https://github.com/cleanlab/cleanlab-studio-docs/pull/868

Related tutorial for TrustworthyRAG module:
https://github.com/cleanlab/cleanlab-studio-docs/pull/867

The idea is that users doing RAG who just want real-time detection of issues should use TrustworthyRAG (we will clarify this later on). Users doing RAG who want detection + flagging/logging + remediation of issues should use this tutorial (Codex). We can unify everything better later on.

What changed?

Added a module with a new Validator class that scores responses from RAG systems and detects & remediates bad responses.

A few notes:

  • BadResponseThresholds for this module is a Pydantic BaseModel, mainly to validate that the thresholds are in the 0-1 range (see the sketch after this list). I'm fine with just having this be a dictionary.
  • By default, we'll add explanations to the trustworthiness score from TrustworthyRAG, but only return the trustworthiness score and the response helpfulness score (the other default scores are not used for validation at the moment). The get_default_evaluations() function controls which Evals TrustworthyRAG uses by default. This get_default_evaluations() function is different from the one defined in cleanlab_tlm (IIRC it's called get_default_evals() there).
  • TrustworthyRAG will work fine when tlm_api_key is None, as long as CLEANLAB_TLM_API_KEY is set as an environment variable. So a minimal constructor would look like validator = Validator(codex_access_key="<your-access-key>"), with the rest of the configuration falling back to pre-defined defaults.
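
For reference, here is a minimal sketch of the Pydantic approach described in the first note above. The field names and defaults are assumptions for illustration, not taken from this PR:

from pydantic import BaseModel, Field

class BadResponseThresholds(BaseModel):
    # Scores below a threshold mark the corresponding metric as "bad".
    # The ge/le constraints enforce that each threshold stays in the 0-1 range.
    trustworthiness: float = Field(default=0.5, ge=0.0, le=1.0)
    response_helpfulness: float = Field(default=0.5, ge=0.0, le=1.0)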

Usage example:

from cleanlab_codex import Validator

validator = Validator(codex_access_key="<your-access-key>")

CONTEXT = """Simple Water Bottle - Amber (limited edition launched Jan 1st 2025)
A water bottle designed with a perfect blend of functionality and aesthetics in mind. Crafted from high-quality, durable plastic with a sleek honey-colored finish.
Price: $24.99 \nDimensions: 10 inches height x 4 inches width"""

results = validator.validate(
    query="How much water can the Simple Water Bottle hold?",
    context=CONTEXT,
    response="The Simple Water Bottle can hold 34 oz of Water",
)
results

prints out:

{
    "is_bad_response": True,
    "expert_answer": "32oz",
    "trustworthiness": {
        "log": {
            "explanation": "The proposed response states that the Simple Water Bottle can hold 34 oz of water. However, the context information provided does not specify the capacity of the water bottle. Without explicit details about the volume it can hold, the response cannot be verified as correct. The dimensions of the bottle (10 inches height x 4 inches width) do not directly indicate its volume capacity, and the assumption made in the response could be inaccurate. A more appropriate response would acknowledge the lack of specific information regarding the bottle's capacity and suggest that the user check the product specifications for accurate details. Therefore, the proposed response is not substantiated by the provided context. \nThis response is untrustworthy due to lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): \nThe Simple Water Bottle can hold approximately 70 fluid ounces of water."
        },
        "score": 0.18455227478066594,
        "is_bad": True,
    },
    "response_helpfulness": {
        "score": 0.9975124364465637,
        "is_bad": False,
    },
}
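
To illustrate the detect + remediate flow this module targets, here is a minimal sketch of how a RAG app might act on these results. The variable names and the fallback message are assumptions for illustration, not part of the API:

# query/context as in the example above; draft_response is the RAG system's initial answer.
results = validator.validate(query=query, context=context, response=draft_response)

if results["expert_answer"] is not None:
    # Remediation: serve the expert answer retrieved from the Codex Project.
    final_response = results["expert_answer"]
elif results["is_bad_response"]:
    # Flagged as bad but no expert answer yet; fall back to an abstention message (assumed wording).
    final_response = "I'm not sure based on the available information."
else:
    # Response passed validation; serve it as-is.
    final_response = draft_response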


Checklist

  • Did you link the GitHub issue?
  • Did you follow deployment steps or bump the version if needed?
  • Did you add/update tests? At least for some internal functions.
  • What QA did you do?
    • Tested...

@elisno elisno requested a review from jwmueller March 21, 2025 00:07

@elisno commented Mar 21, 2025

Current test coverage does not include src/cleanlab_codex/validator.py (which CI is complaining about). I plan to add tests after the initial round of reviews to avoid unnecessary rework if further changes are needed. Let me know if you prefer adding them earlier.

@jwmueller commented Mar 21, 2025

Instead of leaving this as your gist (https://gist.github.com/elisno/65dca2bebb20e1749afa753784bab920), please go ahead and PR the same gist as a tutorial notebook to https://github.com/cleanlab/cleanlab-studio-docs/.

For now, have the tutorial show:

  • running with default settings
  • running with advanced settings, where you set basically everything the user could possibly specify to non-default values (see the sketch below).
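
For instance, the advanced-settings cell could look roughly like this. The keyword names below are assumptions for illustration; check the released cleanlab_codex API for the exact parameters:

from cleanlab_codex import Validator

validator = Validator(
    codex_access_key="<your-access-key>",
    tlm_api_key="<your-tlm-api-key>",  # optional if CLEANLAB_TLM_API_KEY is set in the environment
    bad_response_thresholds={  # assumed keyword; scores below these thresholds mark a response as bad
        "trustworthiness": 0.7,
        "response_helpfulness": 0.3,
    },
)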

elisno added 4 commits March 21, 2025 14:10
Adds support for custom evaluation thresholds, introduces ThresholdedTrustworthyRAGScore type,
and improves validation error handling with better documentation.
@elisno elisno requested a review from axl1313 March 22, 2025 03:23
@anishathalye anishathalye self-requested a review March 23, 2025 17:03
@jwmueller jwmueller requested a review from aditya1503 March 25, 2025 00:23
@anishathalye left a comment

Overall LGTM. High-level comment mirrors my comment on the tutorial—how is a user supposed to understand whether they should use Validator, TrustworthyRAG, or TLM directly?

Left a bunch of smaller comments inline.

Inline comment thread on src/cleanlab_codex/validator.py, at this code:

    **scores,
    }

    def detect(

Member:

If a user just wants to detect bad responses, should they use TrustworthyRAG or Validator.detect()? How is a user supposed to understand how these two relate to each other?

Member:

The idea (which the docstring should be updated to reflect) was that Validator is just a version of TrustworthyRAG with different default evals & predetermined thresholds.

The practical impact of those thresholds is that they determine when we look things up in Codex (what gets logged in the Project for an SME to answer, and what gets answered by Codex instead of the RAG app). But that impact is primarily realized in Validator.validate(), not in Validator.detect().

So we could make detect() a private method? It's essentially just another version of the .validate() method that is not hooked up to any Codex Project (e.g., for testing detection configurations without impacting the Codex Project via logging).
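
A rough sketch of that relationship (hypothetical helper names, not the actual implementation): validate() is essentially detect() plus a Codex lookup that only fires when the thresholded scores flag the response as bad.

from typing import Callable, Optional, Tuple

def validate_sketch(
    detect: Callable[..., Tuple[dict, bool]],      # stand-in for Validator.detect()
    query_codex: Callable[[str], Optional[str]],   # stand-in for the Codex Project lookup/logging
    query: str,
    context: str,
    response: str,
) -> dict:
    # detect() only runs the TrustworthyRAG evals and applies the thresholds.
    scores, is_bad_response = detect(query=query, context=context, response=response)
    expert_answer: Optional[str] = None
    if is_bad_response:
        # Only here is the Codex Project touched: the query gets logged for SMEs,
        # and an expert answer is returned if one already exists.
        expert_answer = query_codex(query)
    return {"expert_answer": expert_answer, "is_bad_response": is_bad_response, **scores}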

Member:

That solution sounds fine to me—making it private, and updating the instructions to indicate that detect -> TrustworthyRAG, detect + remediate -> Validator.

@jwmueller commented Mar 26, 2025:

sgtm. @elisno, can you also add an optional flag to Validator.validate() that allows users to run the detection for testing purposes without interacting with Codex in any way? (No querying Codex at all, so testing runs aren't polluting the Codex Project.)

This flag could be something like testing_mode=False by default (try to think of a better name).

Member Author (@elisno):

On second thought, we should keep the detect() method public for threshold-tuning and testing purposes (without affecting Codex). I've updated the docstring to reflect this.

No need for another optional flag in Validator.validate().

Member:

Also include a screenshot of the tutorial showing that it's clearly explained when to use validate() vs. detect().

Member:

I pushed more docstring changes to clearly distinguish these, so please review those.

Member Author (@elisno):

> Also include a screenshot of the tutorial showing that it's clearly explained when to use validate() vs. detect().

https://github.com/cleanlab/cleanlab-studio-docs/pull/868#issuecomment-2756947611

@jwmueller commented Mar 27, 2025:

That screenshot does not explain the main reason to use detect(), which is to test/tune detection configurations like the evaluation score thresholds and TrustworthyRAG settings.

@anishathalye commented:

This doesn't necessarily have to block merging of this PR, but it would be great for us to dogfood Validator in migrating rag.app to use Codex as a backup.

@elisno elisno requested a review from jwmueller March 27, 2025 07:11
@elisno elisno merged commit da1515a into main Mar 27, 2025
11 checks passed
@jwmueller jwmueller deleted the validator branch March 27, 2025 22:01