paper/paper.md (+3 -3)
@@ -29,7 +29,7 @@ bibliography: paper.bib
---
# Summary
-Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce `langfair`, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases.^[The repository for `langfair` can be found at https://github.com/cvs-health/langfair.] The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework, discussed in detail in the project's companion paper [@bouchard2024actionableframeworkassessingbias].
+Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce `langfair`, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases.^[The repository for `langfair` can be found at https://github.com/cvs-health/langfair.] The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework, discussed in detail in the project's companion paper Bouchard [-@bouchard2024actionableframeworkassessingbias].
# Statement of Need
@@ -53,7 +53,7 @@ The `langfair.generator` module offers two classes, `ResponseGenerator` and `Cou
To streamline generation of evaluation datasets, the `ResponseGenerator` class wraps an instance of a `langchain` LLM and leverages asynchronous generation with `asyncio`. To implement, users simply pass a list of prompts (strings) to the `ResponseGenerator.generate_responses` method, which returns a dictionary containing prompts, responses, and applicable metadata.
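For illustration, the call pattern might look like the minimal sketch below. The `langchain_llm` keyword, the chosen chat model, and the exact shape of the returned dictionary are assumptions based on the description above, not a verified rendering of the current API.

```python
import asyncio

from langchain_openai import ChatOpenAI  # any langchain chat model could be substituted here
from langfair.generator import ResponseGenerator

# Hypothetical model choice; the paper's own example uses a VertexAI gemini-pro instance.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)
generator = ResponseGenerator(langchain_llm=llm)  # wraps the langchain LLM; keyword name assumed

async def build_eval_dataset(prompts: list[str]) -> dict:
    # Responses are generated asynchronously for the supplied prompts; per the
    # paper, the result contains prompts, responses, and applicable metadata.
    return await generator.generate_responses(prompts=prompts)

prompts = ["Summarize the following customer note: <note text>"]
result = asyncio.run(build_eval_dataset(prompts))
print(result)
```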
### `CounterfactualGenerator` class
-In the context of LLMs, counterfactual fairness can be assessed by constructing counterfactual input pairs [@gallegos2024biasfairnesslargelanguage; @bouchard2024actionableframeworkassessingbias], comprised of prompt pairs that mention different protected attribute groups but are otherwise identical, and measuring the differences in the corresponding generated output pairs. To address this, the `CounterfactualGenerator` class offers functionality to check for fairness through unawareness (FTU), construct counterfactual input pairs, and generate corresponding pairs of responses asynchronously using a `langchain` LLM instance.^[FTU means prompts do not contain mentions of protected attribute information.] Off the shelf, the FTU check and creation of counterfactual input pairs can be done for gender and race/ethnicity, but users may also provide a custom mapping of protected attribute words to enable this functionality for other attributes as well.
+In the context of LLMs, counterfactual fairness can be assessed by constructing counterfactual input pairs [@gallegos2024biasfairnesslargelanguage; @bouchard2024actionableframeworkassessingbias], comprised of prompt pairs that mention different protected attribute groups but are otherwise identical, and measuring the differences in the corresponding generated output pairs. To address this, the `CounterfactualGenerator` class offers functionality to check for fairness through unawareness (FTU^[FTU means prompts do not contain mentions of protected attribute information.]), construct counterfactual input pairs, and generate corresponding pairs of responses asynchronously using a `langchain` LLM instance. Off the shelf, the FTU check and creation of counterfactual input pairs can be done for gender and race/ethnicity, but users may also provide a custom mapping of protected attribute words to enable this functionality for other attributes as well.
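A hedged sketch of that workflow follows. The method names `check_ftu` and `generate_responses`, the `attribute` argument, the `langchain_llm` keyword, and the returned structures are assumptions inferred from the description above and should be checked against the package documentation.

```python
import asyncio

from langchain_openai import ChatOpenAI
from langfair.generator import CounterfactualGenerator

llm = ChatOpenAI(model="gpt-4o-mini", temperature=1.0)
cf_generator = CounterfactualGenerator(langchain_llm=llm)  # keyword name assumed

prompts = ["Thank him for the feedback he shared about the claims process."]

async def counterfactual_responses(prompts: list[str]) -> dict:
    # Flag prompts that mention gender words (i.e., fail FTU), then build
    # counterfactual input pairs and generate the corresponding response pairs.
    ftu_check = cf_generator.check_ftu(prompts=prompts, attribute="gender")
    print(ftu_check)
    return await cf_generator.generate_responses(prompts=prompts, attribute="gender")

response_pairs = asyncio.run(counterfactual_responses(prompts))
```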
# Bias and Fairness Evaluations for Focused Use Cases
Following [@bouchard2024actionableframeworkassessingbias], evaluation metrics are categorized according to the risks they assess (toxicity, stereotypes, counterfactual unfairness, and allocational harms), as well as the use case task (text generation, classification, and recommendation).^[Note that text generation encompasses all use cases for which output is text, but does not belong to a predefined set of elements (as with classification and recommendation).] Table 1 maps the classes contained in the `langfair.metrics` module to these risks. These classes are discussed in detail below.
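As one illustrative example of the metrics classes referenced here, toxicity metrics could be computed over generated responses roughly as follows; the class name `ToxicityMetrics`, its import path, and the `evaluate` signature are assumptions for illustration rather than a confirmed API.

```python
from langfair.metrics.toxicity import ToxicityMetrics

# Toy inputs; in practice `prompts` and `responses` come from ResponseGenerator
# applied to a sample of use-case prompts.
prompts = ["Describe the claims submission process for a new member."]
responses = ["To submit a claim, the member completes the claim form online."]

toxicity = ToxicityMetrics()  # class name, import path, and signature are assumed
print(toxicity.evaluate(prompts=prompts, responses=responses))
```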
@@ -90,7 +90,7 @@ When LLMs are used to solve classification problems, traditional machine learnin
# Semi-Automated Evaluation
### `AutoEval` class
-To streamline assessments for text generation use cases, the `AutoEval` class conducts a multi-step process that includes metric selection, evaluation dataset generation, and metric computation. The user is required to supply a list of prompts and an instance of `langchain` LLM. Below we provide a basic example demonstrating the execution of `AutoEval.evaluate` with a `gemini-pro` instance.^[Note that this example assumes the user has already set up their VertexAI credentials and sampled a list of prompts from their use case prompts.]
+To streamline assessments for text generation use cases, the `AutoEval` class conducts a multi-step process (each step is described above) that includes metric selection, evaluation dataset generation, and metric computation. The user is required to supply a list of prompts and an instance of `langchain` LLM. Below we provide a basic example demonstrating the execution of `AutoEval.evaluate` with a `gemini-pro` instance.^[Note that this example assumes the user has already set up their VertexAI credentials and sampled a list of prompts from their use case prompts.]
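The paper's own example follows this paragraph in the full file; as a rough sketch of the pattern it describes, the call might look like the snippet below. The `langfair.auto` import path, the constructor keywords, and the structure of the returned results are assumptions for illustration, and the paper's example should be treated as authoritative.

```python
import asyncio

from langchain_google_vertexai import ChatVertexAI  # assumes VertexAI credentials are configured
from langfair.auto import AutoEval

llm = ChatVertexAI(model_name="gemini-pro", temperature=1.0)

async def run_autoeval(prompts: list[str]) -> dict:
    # AutoEval generates the evaluation dataset from the supplied prompts,
    # selects applicable metrics, and computes them in one call.
    auto = AutoEval(prompts=prompts, langchain_llm=llm)
    return await auto.evaluate()

prompts = ["<use-case prompt 1>", "<use-case prompt 2>"]
results = asyncio.run(run_autoeval(prompts))
print(results)
```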