Commit 6314efc

Author: Zeya Ahmad
Reorganize documentation
1 parent b44c597 commit 6314efc

3 files changed: +214 -55 lines changed

New file (59 lines added):

Choosing Metrics
================

Choosing Bias and Fairness Metrics for an LLM Use Case
------------------------------------------------------

Selecting the appropriate bias and fairness metrics is essential for accurately assessing the performance of large language models (LLMs) in specific use cases. Instead of attempting to compute all possible metrics, practitioners should focus on a relevant subset that aligns with their specific goals and the context of their application.

Our decision framework for selecting appropriate evaluation metrics is illustrated in the diagram below. For more details, refer to our `technical playbook <https://arxiv.org/abs/2407.10853>`_.

.. image:: ./_static/images/use_case_framework.PNG
    :width: 800
    :align: center
    :alt: Use Case Framework

.. note::

    Fairness through unawareness means none of the prompts for an LLM
    use case include any mention of protected attribute words.
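
For a quick, illustrative check of fairness through unawareness, a simple keyword scan over the use-case prompts can flag mentions of protected attribute words. The sketch below is plain Python rather than LangFair's API, and the word list and ``prompts`` entries are hypothetical placeholders.

.. code-block:: python

    # Illustrative sketch only: flag prompts that mention protected attribute words.
    # The word list below is a hypothetical, non-exhaustive example for gender.
    GENDER_WORDS = {"she", "he", "her", "him", "hers", "his", "woman", "man", "female", "male"}

    def mentions_gender(prompt: str) -> bool:
        # Simple whitespace tokenization; real prompts may need better tokenization.
        return any(token.strip(".,!?").lower() in GENDER_WORDS for token in prompt.split())

    prompts = ["How do I open a savings account?", "Summarize this claim note."]  # placeholder prompts
    ftu_holds = not any(mentions_gender(p) for p in prompts)
    print(f"Fairness through unawareness holds: {ftu_holds}")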

Supported Bias and Fairness Metrics
-----------------------------------

Bias and fairness metrics offered by LangFair are grouped into several categories. The full suite of metrics is displayed below.

**Toxicity Metrics**

* Expected Maximum Toxicity `[Gehman et al., 2020] <https://arxiv.org/abs/2009.11462>`_
* Toxicity Probability `[Gehman et al., 2020] <https://arxiv.org/abs/2009.11462>`_
* Toxic Fraction `[Liang et al., 2023] <https://arxiv.org/abs/2211.09110>`_

**Counterfactual Fairness Metrics**

* Strict Counterfactual Sentiment Parity `[Huang et al., 2020] <https://arxiv.org/abs/1911.03064>`_
* Weak Counterfactual Sentiment Parity `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_
* Counterfactual Cosine Similarity Score `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_
* Counterfactual BLEU `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_
* Counterfactual ROUGE-L `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_

**Stereotype Metrics**

* Stereotypical Associations `[Liang et al., 2023] <https://arxiv.org/abs/2211.09110>`_
* Co-occurrence Bias Score `[Bordia & Bowman, 2019] <https://arxiv.org/abs/1904.03035>`_
* Stereotype classifier metrics `[Zekun et al., 2023] <https://arxiv.org/abs/2311.14126>`_, `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_

**Recommendation (Counterfactual) Fairness Metrics**

* Jaccard Similarity `[Zhang et al., 2023] <https://dl.acm.org/doi/10.1145/3604915.3608860>`_
* Search Result Page Misinformation Score `[Zhang et al., 2023] <https://dl.acm.org/doi/10.1145/3604915.3608860>`_
* Pairwise Ranking Accuracy Gap `[Zhang et al., 2023] <https://dl.acm.org/doi/10.1145/3604915.3608860>`_

**Classification Fairness Metrics**

* Predicted Prevalence Rate Disparity `[Feldman et al., 2015] <https://arxiv.org/abs/1412.3756>`_, `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Negative Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Omission Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Positive Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Discovery Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
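
For intuition on how such metrics are constructed, the sketch below computes a predicted prevalence rate per group and reports their difference. This is a conceptual illustration in plain Python, not LangFair's implementation; ``y_pred`` and ``groups`` are hypothetical placeholders, and disparities may also be reported as ratios.

.. code-block:: python

    # Conceptual illustration only (not LangFair's implementation):
    # predicted prevalence rate = share of positive predictions within each group.
    y_pred = [1, 0, 1, 1, 0, 0, 1, 0]                   # hypothetical binary predictions
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]   # hypothetical group labels

    def prevalence_rate(group: str) -> float:
        preds = [p for p, g in zip(y_pred, groups) if g == group]
        return sum(preds) / len(preds)

    disparity = prevalence_rate("a") - prevalence_rate("b")
    print(f"Predicted prevalence rate disparity (a vs. b): {disparity:.2f}")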

docs_src/latest/source/index.rst (+35 -7):

   sphinx-quickstart on Wed Jun 12 09:11:05 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to LangFair's documentation!
====================================

LLM bias and fairness made simple
---------------------------------

LangFair is a comprehensive Python library designed for conducting use-case-specific bias and fairness assessments of large language models (LLMs). Using a unique Bring Your Own Prompts (BYOP) approach, LangFair helps you:

✨ **Evaluate Real-World Scenarios**: Assess bias and fairness for actual LLM use cases

🎯 **Get Actionable Metrics**: Measure toxicity, stereotypes, and fairness with applicable metrics

🔍 **Make Informed Decisions**: Use our framework to choose the right evaluation metrics

🛠️ **Simple Integration**: Easy-to-use Python interface for seamless implementation

:doc:`Get Started → <usage>` | :doc:`View Examples → <auto_examples/index>`

Why LangFair?
-------------

Static benchmark assessments, which are typically assumed to be sufficiently representative, often fall short in capturing the risks associated with all possible use cases of LLMs. These models are increasingly used in various applications, including recommendation systems, classification, text generation, and summarization. However, evaluating these models without considering use-case-specific prompts can lead to misleading assessments of their performance, especially regarding bias and fairness risks.

LangFair addresses this gap by adopting a Bring Your Own Prompts (BYOP) approach, allowing users to tailor bias and fairness evaluations to their specific use cases. This ensures that the computed metrics reflect the true performance of the LLMs in real-world scenarios, where prompt-specific risks are critical. Additionally, LangFair focuses on output-based metrics that are practical for governance audits and real-world testing, without needing access to internal model states.

Quick Links
-----------

.. toctree::
   :maxdepth: 1
   :caption: Contents:

   Get Started <usage>
   Choosing Metrics <choosing_metrics>
   API <api>
   auto_examples/index
   Contributor Guide <guide>

Featured Resources
------------------

- 🚀 :doc:`Get started <usage>` in minutes
- 🔬 Explore our :doc:`framework for choosing metrics <choosing_metrics>`
- 💡 Try our :doc:`guided examples <auto_examples/index>`
- 📖 Read the `research paper <https://arxiv.org/abs/2407.10853>`_

docs_src/latest/source/usage.rst (+120 -48):

Quickstart Guide
================

(Optional) Create a virtual environment for using LangFair
-----------------------------------------------------------
We recommend creating a new virtual environment using venv before installing LangFair. To do so, please follow the instructions `here <https://docs.python.org/3/library/venv.html>`_.
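
For example, a virtual environment can be created and activated as follows (the environment name ``.venv`` is arbitrary; the activation command shown is for macOS/Linux):

.. code-block:: console

    python -m venv .venv
    source .venv/bin/activate   # on Windows: .venv\Scripts\activate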

Installing LangFair
-------------------
The latest version can be installed from PyPI:

.. code-block:: console

    pip install langfair

Usage Examples
--------------
Below are code samples illustrating how to use LangFair to assess bias and fairness risks in text generation and summarization use cases. The examples below assume the user has already defined a list of prompts from their use case, ``prompts``.
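
For illustration, ``prompts`` might simply be a list of strings drawn from the use case; the entries below are hypothetical placeholders.

.. code-block:: python

    # Hypothetical placeholders: in practice, sample prompts from your own use case data.
    prompts = [
        "Summarize the customer's complaint in two sentences: ...",
        "Draft a polite reply to the following email: ...",
    ]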

Generate LLM responses
^^^^^^^^^^^^^^^^^^^^^^
To generate responses, we can use LangFair's ``ResponseGenerator`` class. First, we must create a ``langchain`` LLM object. Below we use ``ChatVertexAI``, but any of `LangChain's LLM classes <https://js.langchain.com/docs/integrations/chat/>`_ may be used instead. Note that ``InMemoryRateLimiter`` is used to avoid rate limit errors.

.. code-block:: python

    from langchain_google_vertexai import ChatVertexAI
    from langchain_core.rate_limiters import InMemoryRateLimiter

    rate_limiter = InMemoryRateLimiter(
        requests_per_second=4.5,
        check_every_n_seconds=0.5,
        max_bucket_size=280,
    )
    llm = ChatVertexAI(
        model_name="gemini-pro",
        temperature=0.3,
        rate_limiter=rate_limiter
    )
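
Alternatively, an Azure OpenAI deployment can be used in place of ``ChatVertexAI``. The sketch below assumes the parameters ``DEPLOYMENT_NAME``, ``API_KEY``, ``API_BASE``, ``API_TYPE``, and ``API_VERSION`` have already been defined.

.. code-block:: python

    from langchain_openai import AzureChatOpenAI

    # Assumes DEPLOYMENT_NAME, API_KEY, API_BASE, API_TYPE, and API_VERSION are defined elsewhere
    llm = AzureChatOpenAI(
        deployment_name=DEPLOYMENT_NAME,
        openai_api_key=API_KEY,
        azure_endpoint=API_BASE,
        openai_api_type=API_TYPE,
        openai_api_version=API_VERSION,
        temperature=0.4  # user to set temperature
    )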

We can use ``ResponseGenerator.generate_responses`` to generate 25 responses for each prompt, as is convention for toxicity evaluation.

.. code-block:: python

    from langfair.generator import ResponseGenerator

    rg = ResponseGenerator(langchain_llm=llm)
    generations = await rg.generate_responses(prompts=prompts, count=25)
    responses = [str(r) for r in generations["data"]["response"]]
    duplicated_prompts = [str(r) for r in generations["data"]["prompt"]]  # so prompts correspond to responses

Compute toxicity metrics
^^^^^^^^^^^^^^^^^^^^^^^^
Toxicity metrics can be computed with ``ToxicityMetrics``. Note that use of ``torch.device`` is optional and should be used if a GPU is available to speed up toxicity computation.

.. code-block:: python

    # import torch  # uncomment if GPU is available
    # device = torch.device("cuda")  # uncomment if GPU is available
    from langfair.metrics.toxicity import ToxicityMetrics

    tm = ToxicityMetrics(
        # device=device,  # uncomment if GPU is available
    )
    tox_result = tm.evaluate(
        prompts=duplicated_prompts,
        responses=responses,
        return_data=True
    )
    tox_result['metrics']
    # Output is below
    # {'Toxic Fraction': 0.0004,
    # 'Expected Maximum Toxicity': 0.013845130120171235,
    # 'Toxicity Probability': 0.01}

Compute stereotype metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^
Stereotype metrics can be computed with ``StereotypeMetrics``.

.. code-block:: python

    from langfair.metrics.stereotype import StereotypeMetrics

    sm = StereotypeMetrics()
    stereo_result = sm.evaluate(responses=responses, categories=["gender"])
    stereo_result['metrics']
    # Output is below
    # {'Stereotype Association': 0.3172750176745329,
    # 'Cooccurrence Bias': 0.4476633365427837,
    # 'Stereotype Fraction - gender': 0.08}

Generate counterfactual responses and compute metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We can generate counterfactual responses with ``CounterfactualGenerator``.

.. code-block:: python

    from langfair.generator.counterfactual import CounterfactualGenerator

    cg = CounterfactualGenerator(langchain_llm=llm)
    cf_generations = await cg.generate_responses(
        prompts=prompts, attribute='gender', count=25
    )
    male_responses = [str(r) for r in cf_generations['data']['male_response']]
    female_responses = [str(r) for r in cf_generations['data']['female_response']]

Counterfactual metrics can be easily computed with ``CounterfactualMetrics``.

.. code-block:: python

    from langfair.metrics.counterfactual import CounterfactualMetrics

    cm = CounterfactualMetrics()
    cf_result = cm.evaluate(
        texts1=male_responses,
        texts2=female_responses,
        attribute='gender'
    )
    cf_result
    # Output is below
    # {'Cosine Similarity': 0.8318708,
    # 'RougeL Similarity': 0.5195852482361165,
    # 'Bleu Similarity': 0.3278433712872481,
    # 'Sentiment Bias': 0.00099471451876019657}

Alternative approach: Semi-automated evaluation with ``AutoEval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To streamline assessments for text generation and summarization use cases, the ``AutoEval`` class conducts a multi-step process that completes all of the aforementioned steps with two lines of code.

.. code-block:: python

    from langfair.auto import AutoEval

    auto_object = AutoEval(
        prompts=prompts,
        langchain_llm=llm,
        # toxicity_device=device  # uncomment if GPU is available
    )
    results = await auto_object.evaluate()
    results
    # Output is below
    # {'Toxicity': {'Toxic Fraction': 0.0004,
    # 'Expected Maximum Toxicity': 0.01384513012017123,
    # 'Toxicity Probability': 0.01},
    # 'Stereotype': {'Stereotype Association': 0.3172750176745329,
    # 'Cooccurrence Bias': 0.4476633365427837,
    # 'Stereotype Fraction - gender': 0.08,
    # 'Expected Maximum Stereotype - gender': 0.6035516738891,
    # 'Stereotype Probability - gender': 0.27036},
    # 'Counterfactual': {'male-female': {'Cosine Similarity': 0.8318708,
    # 'RougeL Similarity': 0.5195852482361165,
    # 'Bleu Similarity': 0.3278433712872481,
    # 'Sentiment Bias': 0.00099471451876019577}}}
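
Results can also be printed and exported to a .txt file:

.. code-block:: python

    auto_object.export_results(file_name="metric_values.txt")
    auto_object.print_results()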
