
Commit 25143b9

Merge pull request #2 from cvs-health/develop
Develop -> gh-pages
2 parents f6da024 + cb81c46, commit 25143b9

File tree

86 files changed: +406 -440 lines changed

Some content is hidden: large commits have some content hidden by default.


README.md

+16-16
@@ -1,12 +1,12 @@
 <p align="center">
-<img src="./assets/images/llambda-logo.png" />
+<img src="./assets/images/langfair-logo.png" />
 </p>

 # Library for Assessing Bias and Fairness in LLMs

-LLaMBDA (Large Language Model Bias Detection and Auditing) is a Python library for conducting bias and fairness assessments of LLM use cases. This repository includes a framework for [choosing bias and fairness metrics](#choosing-bias-and-fairness-metrics-for-an-llm-use-case), [demo notebooks](./examples), and a LLM bias and fairness [technical playbook](https://arxiv.org/pdf/2407.10853) containing a thorough discussion of LLM bias and fairness risks, evaluation metrics, and best practices. Please refer to our [documentation site](https://cvs-health.github.io/llambda/) for more details on how to use LLaMBDA.
+LangFair is a Python library for conducting bias and fairness assessments of LLM use cases. This repository includes a framework for [choosing bias and fairness metrics](#choosing-bias-and-fairness-metrics-for-an-llm-use-case), [demo notebooks](./examples), and a LLM bias and fairness [technical playbook](https://arxiv.org/pdf/2407.10853) containing a thorough discussion of LLM bias and fairness risks, evaluation metrics, and best practices. Please refer to our [documentation site](https://cvs-health.github.io/langfair/) for more details on how to use LangFair.

-Bias and fairness metrics offered by LLaMBDA fall into one of several categories. The full suite of metrics is displayed below.
+Bias and fairness metrics offered by LangFair fall into one of several categories. The full suite of metrics is displayed below.

 ##### Counterfactual Fairness Metrics
 * Strict Counterfactual Sentiment Parity ([Huang et al., 2020](https://arxiv.org/pdf/1911.03064))
@@ -38,18 +38,18 @@ Bias and fairness metrics offered by LLaMBDA fall into one of several categories
 * False Discovery Rate Disparity ([Bellamy et al., 2018](https://arxiv.org/abs/1810.01943); [Saleiro et al., 2019](https://arxiv.org/abs/1811.05577))

 ## Quickstart
-### (Optional) Create a virtual environment for using LLaMBDA
-We recommend creating a new virtual environment using venv before installing LLaMBDA. To do so, please follow instructions [here](https://docs.python.org/3/library/venv.html).
+### (Optional) Create a virtual environment for using LangFair
+We recommend creating a new virtual environment using venv before installing LangFair. To do so, please follow instructions [here](https://docs.python.org/3/library/venv.html).

-### Installing LLaMBDA
-The latest version can be installed from PyPI:
+### Installing LangFair
+The latest version can be installed from the github URL:

 ```bash
-pip install llambda
+pip install git+https://github.com/cvs-health/langfair.git
 ```

 ### Usage
-Below is a sample of code illustrating how to use LLaMBDA's `AutoEval` class for text generation and summarization use cases. The below example assumes the user has already defined parameters `DEPLOYMENT_NAME`, `API_KEY`, `API_BASE`, `API_TYPE`, `API_VERSION`, and a list of prompts from their use case `prompts`.
+Below is a sample of code illustrating how to use LangFair's `AutoEval` class for text generation and summarization use cases. The below example assumes the user has already defined parameters `DEPLOYMENT_NAME`, `API_KEY`, `API_BASE`, `API_TYPE`, `API_VERSION`, and a list of prompts from their use case `prompts`.

 Create `langchain` LLM object.
 ```python
@@ -60,13 +60,13 @@ llm = AzureChatOpenAI(
 azure_endpoint=API_BASE,
 openai_api_type=API_TYPE,
 openai_api_version=API_VERSION,
-temperature=1 # User to set temperature
+temperature=0.4 # User to set temperature
 )
 ```

 Run the `AutoEval` method for automated bias / fairness evaluation
 ```python
-from llambda.auto import AutoEval
+from langfair.auto import AutoEval
 auto_object = AutoEval(
 prompts=prompts,
 langchain_llm=llm
@@ -92,7 +92,7 @@ auto_object.print_results()
 </p>

 ## Example Notebooks
-See **[Demo Notebooks](./examples)** for notebooks illustrating how to use LLaMBDA for various bias and fairness evaluation metrics.
+See **[Demo Notebooks](./examples)** for notebooks illustrating how to use LangFair for various bias and fairness evaluation metrics.

 ## Choosing Bias and Fairness Metrics for an LLM Use Case
 In general, bias and fairness assessments of LLM use cases do not require satisfying all possible evaluation metrics. Instead, practitioners should prioritize and concentrate on a relevant subset of metrics. To demystify metric choice for bias and fairness assessments of LLM use cases, we introduce a decision framework for selecting the appropriate evaluation metrics, as depicted in the diagram below. Leveraging the use case taxonomy outlined in the [technical playbook](https://arxiv.org/abs/2407.10853), we determine suitable choices of bias and fairness metrics for a given use case based on its relevant characteristics.
@@ -114,7 +114,7 @@ Lastly, we classify the remaining subset of focused use cases as having minimal


 ## Associated Research
-A technical description of LLaMBDA's evaluation metrics and a practitioner's guide for selecting evaluation metrics is contained in **[this paper](https://arxiv.org/pdf/2407.10853)**. Below is the bibtex entry for this paper:
+A technical description of LangFair's evaluation metrics and a practitioner's guide for selecting evaluation metrics is contained in **[this paper](https://arxiv.org/pdf/2407.10853)**. Below is the bibtex entry for this paper:

 @misc{bouchard2024actionableframeworkassessingbias,
 title={An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases},
@@ -127,10 +127,10 @@ A technical description of LLaMBDA's evaluation metrics and a practitioner's gui
 }

 ## Code Documentation
-Please refer to our [documentation site](https://cvs-health.github.io/llambda/) for more details on how to use LLaMBDA.
+Please refer to our [documentation site](https://cvs-health.github.io/langfair/) for more details on how to use LangFair.

 ## Development Team
-The open-source version of LLaMBDA is the culmination of extensive work carried out by a dedicated team of developers. While the internal commit history will not be made public, we believe it's essential to acknowledge the significant contributions of our development team who were instrumental in bringing this project to fruition:
+The open-source version of LangFair is the culmination of extensive work carried out by a dedicated team of developers. While the internal commit history will not be made public, we believe it's essential to acknowledge the significant contributions of our development team who were instrumental in bringing this project to fruition:

 - [Dylan Bouchard](https://github.com/dylanbouchard)
 - [Mohit Singh Chauhan](https://github.com/mohitcek)
@@ -139,4 +139,4 @@ The open-source version of LLaMBDA is the culmination of extensive work carried
 - [Zeya Ahmad](https://github.com/zeya30)

 ## Contributing
-Contributions are welcome. Please refer [here](./CONTRIBUTING.md) for instructions on how to contribute to LLaMBDA.
+Contributions are welcome. Please refer [here](./CONTRIBUTING.md) for instructions on how to contribute to LangFair.
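
The quickstart fragments in the hunks above only show parts of the README example. A minimal sketch assembling them end to end is given below; it is not part of the commit. The `AzureChatOpenAI` keyword arguments for the deployment name and API key, and the `evaluate()` call, are assumptions here; the hunks themselves only show the endpoint/type/version/temperature arguments, the `AutoEval(prompts=..., langchain_llm=...)` constructor, and `auto_object.print_results()`.

```python
# Sketch only: assembles the README quickstart fragments shown in the diff above.
# Assumes DEPLOYMENT_NAME, API_KEY, API_BASE, API_TYPE, API_VERSION, and `prompts`
# (a list of strings) are already defined, as the README states.
import asyncio

from langchain_openai import AzureChatOpenAI
from langfair.auto import AutoEval

llm = AzureChatOpenAI(
    deployment_name=DEPLOYMENT_NAME,  # assumed kwarg; not shown in the hunks
    openai_api_key=API_KEY,           # assumed kwarg; not shown in the hunks
    azure_endpoint=API_BASE,
    openai_api_type=API_TYPE,
    openai_api_version=API_VERSION,
    temperature=0.4,                  # user to set temperature
)

auto_object = AutoEval(
    prompts=prompts,
    langchain_llm=llm,
)

# Assumed entry point: an async evaluate() method (the hunk header above only
# shows auto_object.print_results()). In a notebook, use `await auto_object.evaluate()`.
asyncio.run(auto_object.evaluate())
auto_object.print_results()
```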
Binary file not shown (-33.8 KB).

assets/images/archive/LLaMBDA.png
Binary file not shown (-17.6 KB).

Binary file not shown (-14.4 KB).

Binary file not shown (-7.98 KB).

Binary file not shown (-8.7 KB).

assets/images/langfair-logo.png
18.7 KB

assets/images/langfair-logo2.png
18 KB

assets/images/langfair-logo3.png
17.2 KB

Binary file not shown (-21.4 KB).

Binary file not shown (-6.51 KB).

assets/images/llambda-logo-only.png
Binary file not shown (-6.96 KB).

assets/images/llambda-logo.png
Binary file not shown (-23.2 KB).

assets/images/llambda-logo2.png
Binary file not shown (-22.1 KB).

data/DATA_COPYRIGHT.md

+1-1
@@ -1,4 +1,4 @@
-Please refer to below for copyright information for the two files contained in `llambda/data`
+Please refer to below for copyright information for the two files contained in `langfair/data`

 #### Copyright information for [RealToxicityPrompts.jsonl](https://huggingface.co/datasets/allenai/real-toxicity-prompts)
 ***

examples/evaluations/classification/classification_metrics_demo.ipynb

+6-6
@@ -13,15 +13,15 @@
 "\n",
 "import numpy as np\n",
 "from IPython.display import Image\n",
-"from llambda.metrics.classification import ClassificationMetrics"
+"from langfair.metrics.classification import ClassificationMetrics"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "b9290443-ce88-4d54-beea-1e1888500b36",
 "metadata": {},
 "source": [
-"Bias and fairness metrics offered by `llambda` fall into various categories: counterfactual discrimination metrics, stereotype metrics, toxicity mtrics, recommendation fairness metrics, and classification fairness metrics. The full suite of metrics is displayed below.\n",
+"Bias and fairness metrics offered by `langfair` fall into various categories: counterfactual discrimination metrics, stereotype metrics, toxicity mtrics, recommendation fairness metrics, and classification fairness metrics. The full suite of metrics is displayed below.\n",
 "\n",
 "##### Counterfactual Discrimination Metrics\n",
 "* Strict Counterfactual Sentiment Parity ([Huang et al., 2020](https://arxiv.org/pdf/1911.03064))\n",
@@ -210,15 +210,15 @@
 ],
 "metadata": {
 "environment": {
-"kernel": "llambda-env",
+"kernel": "langfair",
 "name": "workbench-notebooks.m121",
 "type": "gcloud",
 "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m121"
 },
 "kernelspec": {
-"display_name": "llambda-env",
+"display_name": "langfair",
 "language": "python",
-"name": "llambda-env"
+"name": "langfair"
 },
 "language_info": {
 "codemirror_mode": {
@@ -230,7 +230,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.19"
+"version": "3.9.20"
 }
 },
 "nbformat": 4,

examples/evaluations/recommendation/recommendation_metrics_demo.ipynb

+7-7
@@ -2,7 +2,7 @@
 "cells": [
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 1,
 "id": "f694ef3c-96cb-472c-80c4-0409222fc4ac",
 "metadata": {
 "tags": []
@@ -13,15 +13,15 @@
 "\n",
 "from IPython.display import Image\n",
 "\n",
-"from llambda.metrics.recommendation import RecommendationMetrics\n"
+"from langfair.metrics.recommendation import RecommendationMetrics"
 ]
 },
 {
 "cell_type": "markdown",
 "id": "b9290443-ce88-4d54-beea-1e1888500b36",
 "metadata": {},
 "source": [
-"Bias and fairness metrics offered by `llambda` fall into various categories: counterfactual discrimination metrics, stereotype metrics, toxicity mtrics, recommendation fairness metrics, and classification fairness metrics. The full suite of metrics is displayed below.\n",
+"Bias and fairness metrics offered by `langfair` fall into various categories: counterfactual discrimination metrics, stereotype metrics, toxicity mtrics, recommendation fairness metrics, and classification fairness metrics. The full suite of metrics is displayed below.\n",
 "\n",
 "##### Counterfactual Discrimination Metrics\n",
 "* Strict Counterfactual Sentiment Parity ([Huang et al., 2020](https://arxiv.org/pdf/1911.03064))\n",
@@ -394,15 +394,15 @@
 ],
 "metadata": {
 "environment": {
-"kernel": "llambda-env",
+"kernel": "langfair",
 "name": "workbench-notebooks.m121",
 "type": "gcloud",
 "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m121"
 },
 "kernelspec": {
-"display_name": "llambda-env",
+"display_name": "langfair",
 "language": "python",
-"name": "llambda-env"
+"name": "langfair"
 },
 "language_info": {
 "codemirror_mode": {
@@ -414,7 +414,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.19"
+"version": "3.9.20"
 }
 },
 "nbformat": 4,

examples/evaluations/text_generation/auto_eval_demo.ipynb

+27-27
@@ -30,7 +30,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 8,
+"execution_count": 2,
 "metadata": {
 "tags": []
 },
@@ -43,14 +43,14 @@
 "from dotenv import find_dotenv, load_dotenv\n",
 "from langchain_openai import AzureChatOpenAI\n",
 "\n",
-"from llambda.auto import AutoEval\n",
+"from langfair.auto import AutoEval\n",
 "\n",
 "warnings.filterwarnings(\"ignore\")"
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 3,
 "metadata": {
 "tags": []
 },
@@ -77,7 +77,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": 4,
 "metadata": {
 "tags": []
 },
@@ -99,7 +99,7 @@
 " \"#Person1#: Watsup, ladies! Y'll looking'fine tonight. May I have this dance?\\\\n#Person2#: He's cute! He looks like Tiger Woods! But, I can't dance. . .\\\\n#Person1#: It's all good. I'll show you all the right moves. My name's Malik.\\\\n#Person2#: Nice to meet you. I'm Wen, and this is Nikki.\\\\n#Person1#: How you feeling', vista? Mind if I take your friend'round the dance floor?\\\\n#Person2#: She doesn't mind if you don't mind getting your feet stepped on.\\\\n#Person1#: Right. Cool! Let's go!\\n\"]"
 ]
 },
-"execution_count": 3,
+"execution_count": 4,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -118,7 +118,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 4,
+"execution_count": 5,
 "metadata": {
 "tags": []
 },
@@ -132,7 +132,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"#### `AutoEval()` - For calculating all toxicity, stereotype, and counterfactual metrics supported by LLaMBDA\n",
+"#### `AutoEval()` - For calculating all toxicity, stereotype, and counterfactual metrics supported by LangFair\n",
 "\n",
 "**Class Attributes:**\n",
 "- `prompts` - (**list of strings**)\n",
@@ -173,7 +173,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 5,
+"execution_count": 6,
 "metadata": {
 "tags": []
 },
@@ -191,7 +191,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 6,
+"execution_count": 7,
 "metadata": {
 "tags": []
 },
@@ -216,7 +216,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 7,
+"execution_count": 8,
 "metadata": {
 "tags": []
 },
@@ -227,35 +227,35 @@
 "text": [
 "\u001b[1mStep 1: Fairness Through Unawareness\u001b[0m\n",
 "------------------------------------\n",
-"LLaMBDA: Number of prompts containing race words: 0\n",
-"- LLaMBDA: The prompts satisfy fairness through unawareness for race words, the recommended risk assessment only include Toxicity\n",
-"LLaMBDA: Number of prompts containing gender words: 31\n",
-"- LLaMBDA: The prompts do not satisfy fairness through unawareness for gender words, the recommended risk assessments include Toxicity, Stereotype, and Counterfactual Discrimination.\n",
+"langfair: Number of prompts containing race words: 0\n",
+"- langfair: The prompts satisfy fairness through unawareness for race words, the recommended risk assessment only include Toxicity\n",
+"langfair: Number of prompts containing gender words: 31\n",
+"- langfair: The prompts do not satisfy fairness through unawareness for gender words, the recommended risk assessments include Toxicity, Stereotype, and Counterfactual Discrimination.\n",
 "\n",
 "\u001b[1mStep 2: Generate Counterfactual Dataset\u001b[0m\n",
 "---------------------------------------\n",
-"LLaMBDA: gender words found in 31 prompts.\n",
+"langfair: gender words found in 31 prompts.\n",
 "Generating 25 responses for each gender prompt...\n",
-"LLaMBDA: Responses successfully generated!\n",
+"langfair: Responses successfully generated!\n",
 "\n",
 "\u001b[1mStep 3: Generating Model Responses\u001b[0m\n",
 "----------------------------------\n",
-"LLaMBDA: Generating 25 responses per prompt...\n",
-"LLaMBDA: Responses successfully generated!\n",
+"langfair: Generating 25 responses per prompt...\n",
+"langfair: Responses successfully generated!\n",
 "\n",
 "\u001b[1mStep 4: Evaluate Toxicity Metrics\u001b[0m\n",
 "---------------------------------\n",
-"LLaMBDA: Computing toxicity scores...\n",
-"LLaMBDA: Evaluating metrics...\n",
+"langfair: Computing toxicity scores...\n",
+"langfair: Evaluating metrics...\n",
 "\n",
 "\u001b[1mStep 5: Evaluate Stereotype Metrics\u001b[0m\n",
 "-----------------------------------\n",
-"LLaMBDA: Computing stereotype scores...\n",
-"LLaMBDA: Evaluating metrics...\n",
+"langfair: Computing stereotype scores...\n",
+"langfair: Evaluating metrics...\n",
 "\n",
 "\u001b[1mStep 6: Evaluate Counterfactual Metrics\u001b[0m\n",
 "---------------------------------------\n",
-"LLaMBDA: Evaluating metrics...\n"
+"langfair: Evaluating metrics...\n"
 ]
 }
 ],
@@ -620,15 +620,15 @@
 ],
 "metadata": {
 "environment": {
-"kernel": "llambda-env",
+"kernel": "langfair",
 "name": "workbench-notebooks.m121",
 "type": "gcloud",
 "uri": "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/workbench-notebooks:m121"
 },
 "kernelspec": {
-"display_name": "llambda-env",
+"display_name": "langfair",
 "language": "python",
-"name": "llambda-env"
+"name": "langfair"
 },
 "language_info": {
 "codemirror_mode": {
@@ -640,7 +640,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.19"
+"version": "3.9.20"
 }
 },
 "nbformat": 4,
