Merge pull request #77 from cvs-health/main
v0.3.0
dylanbouchard authored Dec 20, 2024
2 parents d566593 + c5d1429 commit f7298b1
Showing 37 changed files with 2,391 additions and 1,287 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -137,3 +137,6 @@ dmypy.json

#Example generated files
examples/evaluations/text_generation/final_metrics.txt

#Data generated by langfair data_loader module
langfair/data/*
124 changes: 96 additions & 28 deletions README.md
@@ -30,51 +30,119 @@ The latest version can be installed from PyPI:
```
pip install langfair
```

### Usage Examples
Below are code samples illustrating how to use LangFair to assess bias and fairness risks in text generation and summarization use cases. These examples assume the user has already defined a list of prompts from their use case, `prompts`.
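For illustration, a minimal sketch of defining `prompts` (the file name and column below are hypothetical; any list of strings from your use case will do):
```python
import pandas as pd

# Hypothetical prompts file; replace with your own use case data.
df = pd.read_csv("use_case_prompts.csv")
prompts = df["prompt"].tolist()
```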

##### Generate LLM responses
To generate responses, we can use LangFair's `ResponseGenerator` class. First, we must create a `langchain` LLM object. Below we use `ChatVertexAI`, but **any of [LangChain’s LLM classes](https://js.langchain.com/docs/integrations/chat/) may be used instead**. Note that `InMemoryRateLimiter` is used to avoid rate limit errors.
```python
from langchain_google_vertexai import ChatVertexAI
from langchain_core.rate_limiters import InMemoryRateLimiter

rate_limiter = InMemoryRateLimiter(
    requests_per_second=4.5, check_every_n_seconds=0.5, max_bucket_size=280,
)
llm = ChatVertexAI(
    model_name="gemini-pro", temperature=0.3, rate_limiter=rate_limiter
)
```
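Any other LangChain chat model can be swapped in. For instance, a sketch using an Azure OpenAI deployment, assuming the user has defined `DEPLOYMENT_NAME`, `API_KEY`, `API_BASE`, `API_TYPE`, and `API_VERSION` for their environment:
```python
from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    deployment_name=DEPLOYMENT_NAME,
    openai_api_key=API_KEY,
    azure_endpoint=API_BASE,
    openai_api_type=API_TYPE,
    openai_api_version=API_VERSION,
    temperature=0.4,  # user to set temperature
    rate_limiter=rate_limiter,  # reuses the rate limiter defined above
)
```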
We can use `ResponseGenerator.generate_responses` to generate 25 responses for each prompt, as is the convention for toxicity evaluation.
```python
from langfair.generator import ResponseGenerator
rg = ResponseGenerator(langchain_llm=llm)
generations = await rg.generate_responses(prompts=prompts, count=25)
responses = generations["data"]["response"]
duplicated_prompts = generations["data"]["prompt"] # so prompts correspond to responses
```
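Note that `generate_responses` is a coroutine, so it must be awaited. In a notebook the `await` above works directly; in a plain Python script, one way to drive it is with standard `asyncio` (a minimal sketch using the objects defined above):
```python
import asyncio

async def main():
    generations = await rg.generate_responses(prompts=prompts, count=25)
    return generations["data"]["response"]

responses = asyncio.run(main())
```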

##### Compute toxicity metrics
Toxicity metrics can be computed with `ToxicityMetrics`. Note that using `torch.device` is optional and recommended if a GPU is available, to speed up toxicity computation.
```python
# import torch  # uncomment if GPU is available
# device = torch.device("cuda")  # uncomment if GPU is available
from langfair.metrics.toxicity import ToxicityMetrics

tm = ToxicityMetrics(
    # device=device,  # uncomment if GPU is available
)
tox_result = tm.evaluate(
    prompts=duplicated_prompts,
    responses=responses,
    return_data=True
)
tox_result['metrics']
# Output is below
# {'Toxic Fraction': 0.0004,
#  'Expected Maximum Toxicity': 0.013845130120171235,
#  'Toxicity Probability': 0.01}
```

##### Compute stereotype metrics
Stereotype metrics can be computed with `StereotypeMetrics`.
```python
from langfair.metrics.stereotype import StereotypeMetrics
sm = StereotypeMetrics()
stereo_result = sm.evaluate(responses=responses, categories=["gender"])
stereo_result['metrics']
# Output is below
# {'Stereotype Association': 0.3172750176745329,
#  'Cooccurrence Bias': 0.44766333654278373,
#  'Stereotype Fraction - gender': 0.08}
```

##### Generate counterfactual responses and compute metrics
We can generate counterfactual responses with `CounterfactualGenerator`.
```python
from langfair.generator.counterfactual import CounterfactualGenerator
cg = CounterfactualGenerator(langchain_llm=llm)
cf_generations = await cg.generate_responses(
    prompts=prompts, attribute='gender', count=25
)
male_responses = cf_generations['data']['male_response']
female_responses = cf_generations['data']['female_response']
```

Counterfactual metrics can be easily computed with `CounterfactualMetrics`.
```python
from langfair.metrics.counterfactual import CounterfactualMetrics
cm = CounterfactualMetrics()
cf_result = cm.evaluate(
    texts1=male_responses,
    texts2=female_responses,
    attribute='gender'
)
cf_result['metrics']
# Output is below
# {'Cosine Similarity': 0.8318708,
#  'RougeL Similarity': 0.5195852482361165,
#  'Bleu Similarity': 0.3278433712872481,
#  'Sentiment Bias': 0.0009947145187601957}
```

##### Alternative approach: Semi-automated evaluation with `AutoEval`
To streamline assessments for text generation and summarization use cases, the `AutoEval` class completes all of the aforementioned steps with two lines of code.
```python
from langfair.auto import AutoEval
auto_object = AutoEval(
    prompts=prompts,
    langchain_llm=llm,
    # toxicity_device=device  # uncomment if GPU is available
)
results = await auto_object.evaluate()
results['metrics']
# Output is below
# {'Toxicity': {'Toxic Fraction': 0.0004,
# 'Expected Maximum Toxicity': 0.013845130120171235,
# 'Toxicity Probability': 0.01},
# 'Stereotype': {'Stereotype Association': 0.3172750176745329,
# 'Cooccurrence Bias': 0.44766333654278373,
# 'Stereotype Fraction - gender': 0.08,
# 'Expected Maximum Stereotype - gender': 0.60355167388916,
# 'Stereotype Probability - gender': 0.27036},
# 'Counterfactual': {'male-female': {'Cosine Similarity': 0.8318708,
# 'RougeL Similarity': 0.5195852482361165,
# 'Bleu Similarity': 0.3278433712872481,
# 'Sentiment Bias': 0.0009947145187601957}}}
```

<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/langfair/main/assets/images/autoeval_process.png" />
</p>


Print the results and export them to a .txt file.
```python
auto_object.export_results(file_name="metric_values.txt")
auto_object.print_results()
```

<p align="center">
<img src="https://raw.githubusercontent.com/cvs-health/langfair/main/assets/images/autoeval_output.png" />
</p>

## 📚 Example Notebooks
Explore the following demo notebooks to see how to use LangFair for various bias and fairness evaluation metrics:

@@ -1,5 +1,13 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ead0356d",
"metadata": {},
"source": [
"# Classification Metrics Demo"
]
},
{
"cell_type": "code",
"execution_count": 1,
@@ -10,7 +18,8 @@
"outputs": [],
"source": [
"import numpy as np\n",
"from langfair.metrics.classification import ClassificationMetrics"
"\n",
"from langfair.metrics.classification import ClassificationMetrics\n"
]
},
{
@@ -171,9 +180,9 @@
"uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m125"
},
"kernelspec": {
"display_name": "langfair",
"display_name": "langfair-ZgpfWZGz-py3.9",
"language": "python",
"name": "langfair"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -185,7 +194,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.20"
"version": "3.9.4"
}
},
"nbformat": 4,
@@ -1,5 +1,13 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "3c377f48",
"metadata": {},
"source": [
"# Recommendation Metrics Demo"
]
},
{
"cell_type": "markdown",
"id": "eab56f24-c606-4b0c-a259-f125d5a7e227",