diff --git a/paper/paper.md b/paper/paper.md index 2220ba1..fc26e82 100644 --- a/paper/paper.md +++ b/paper/paper.md @@ -29,11 +29,11 @@ bibliography: paper.bib --- # Summary -Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce `langfair`, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases.^[The repository for `langfair` can be found at https://github.com/cvs-health/langfair.] The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide in metric selection, LangFair offers an actionable decision framework, discussed in detail in the project's companion paper by Bouchard [-@bouchard2024actionableframeworkassessingbias]. +Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address this gap, we introduce `langfair`, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases.^[The repository for `langfair` can be found at https://github.com/cvs-health/langfair.] The package offers functionality to easily generate evaluation datasets, comprised of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. 
To guide metric selection, LangFair offers an actionable decision framework, discussed in detail in the project's companion paper, Bouchard [-@bouchard2024actionableframeworkassessingbias]. # Statement of Need -Traditional machine learning (ML) fairness toolkits like AIF360 [@aif360-oct-2018], Fairlearn [@Weerts_Fairlearn_Assessing_and_2023], Aequitas [@2018aequitas] and others [@vasudevan20lift; @DBLP:journals/corr/abs-1907-04135; @tensorflow-no-date] are designed to assess and mitigate bias to ML models. AIF360 is an open-source toolkit developed by IBM, includes various extensive pre-built algorithms that focus on bias assessment through different stages of the AI lifecycle. Fairness toolkit, developed by Google, focus on providing an easy intergration with TensorFlow and existing workflow for bias detection. We acknowledge that these toolkits have laid crucial groundwork but are not tailored to the generative and context-dependent nature of LLMs. +Traditional machine learning (ML) fairness toolkits like AIF360 [@aif360-oct-2018], Fairlearn [@Weerts_Fairlearn_Assessing_and_2023], Aequitas [@2018aequitas] and others [@vasudevan20lift; @DBLP:journals/corr/abs-1907-04135; @tensorflow-no-date] have laid crucial groundwork. These toolkits offer various metrics and algorithms that focus on assessing and mitigating bias and fairness risks through different stages of the ML lifecycle. While the fairness assessments offered by these toolkits include a wide variety of generic fairness metrics, which can also apply to certain LLM use cases, they are not tailored to the generative and context-dependent nature of LLMs.^[The toolkits mentioned here offer fairness metrics for classification. In a similar vein, the recommendation fairness metrics offered in FaiRLLM [@Zhang_2023] can be applied to ML recommendation systems as well as LLM recommendation use cases.] LLMs are used in systems that solve tasks such as recommendation, classification, text generation, and summarization. 
In practice, these systems try to restrict the responses of the LLM to the task at hand, often by including task-specific instructions in system or user prompts. When the LLM is evaluated without taking the set of task-specific prompts into account, the evaluation metrics are not representative of the system's true performance. Representing the system's actual performance is especially important when evaluating its outputs for bias and fairness risks because they pose real harm to the user and, by way of repercussions, the system developer. @@ -53,10 +53,10 @@ The `langfair.generator` module offers two classes, `ResponseGenerator` and `Cou To streamline generation of evaluation datasets, the `ResponseGenerator` class wraps an instance of a `langchain` LLM and leverages asynchronous generation with `asyncio`. To implement, users simply pass a list of prompts (strings) to the `ResponseGenerator.generate_responses` method, which returns a dictionary containing prompts, responses, and applicable metadata. ### `CounterfactualGenerator` class -In the context of LLMs, counterfactual fairness can be assessed by constructing counterfactual input pairs [@gallegos2024biasfairnesslargelanguage; @bouchard2024actionableframeworkassessingbias], comprised of prompt pairs that mention different protected attribute groups but are otherwise identical, and measuring the differences in the corresponding generated output pairs. To address this, the `CounterfactualGenerator` class offers functionality to check for fairness through unawareness (FTU^[FTU means prompts do not contain mentions of protected attribute information.]), construct counterfactual input pairs, and generate corresponding pairs of responses asynchronously using a `langchain` LLM instance. 
Off the shelf, the FTU check and creation of counterfactual input pairs can be done for gender and race/ethnicity, but users may also provide a custom mapping of protected attribute words to enable this functionality for other attributes as well. +In the context of LLMs, counterfactual fairness can be assessed by constructing counterfactual input pairs [@gallegos2024biasfairnesslargelanguage; @bouchard2024actionableframeworkassessingbias], comprised of prompt pairs that mention different protected attribute groups but are otherwise identical, and measuring the differences in the corresponding generated output pairs. These assessments are applicable to use cases that do not satisfy fairness through unawareness (FTU), meaning prompts contain mentions of protected attribute groups. To address this, the `CounterfactualGenerator` class offers functionality to check for FTU, construct counterfactual input pairs, and generate corresponding pairs of responses asynchronously using a `langchain` LLM instance.^[In practice, an FTU check consists of parsing use case prompts for mentions of protected attribute groups.] Off the shelf, the FTU check and creation of counterfactual input pairs can be done for gender and race/ethnicity, but users may also provide a custom mapping of protected attribute words to enable this functionality for other attributes as well. # Bias and Fairness Evaluations for Focused Use Cases -Following paper by Bouchard [-@bouchard2024actionableframeworkassessingbias], evaluation metrics are categorized according to the risks they assess (toxicity, stereotypes, counterfactual unfairness, and allocational harms), as well as the use case task (text generation, classification, and recommendation).^[Note that text generation encompasses all use cases for which output is text, but does not belong to a predefined set of elements (as with classification and recommendation).] Table 1 maps the classes contained in the `langfair.metrics` module to these risks. 
These classes are discussed in detail below. +Following Bouchard [-@bouchard2024actionableframeworkassessingbias], evaluation metrics are categorized according to the risks they assess (toxicity, stereotypes, counterfactual unfairness, and allocational harms), as well as the use case task (text generation, classification, and recommendation).^[Note that text generation encompasses all use cases for which output is text, but does not belong to a predefined set of elements (as with classification and recommendation).] Table 1 maps the classes contained in the `langfair.metrics` module to these risks. These classes are discussed in detail below. Class | Risk Assessed | Applicable Tasks | @@ -78,7 +78,7 @@ The `ToxicityMetrics` class facilitates simple computation of toxicity metrics f To measure stereotypes in LLM responses, the `StereotypeMetrics` class offers two categories of metrics: metrics based on word cooccurrences and metrics that leverage a pre-trained stereotype classifier. Metrics based on word cooccurrences aim to assess relative cooccurrence of stereotypical words with certain protected attribute words. On the other hand, stereotype-classifier-based metrics leverage the `wu981526092/Sentence-Level-Stereotype-Detector` classifier available on HuggingFace [@zekun2023auditinglargelanguagemodels] and compute analogs of the aforementioned toxicity-classifier-based metrics [@bouchard2024actionableframeworkassessingbias].^[https://huggingface.co/wu981526092/Sentence-Level-Stereotype-Detector] ### Counterfactual Fairness Metrics for Text Generation - The `CounterfactualMetrics` class offers two groups of metrics to assess counterfactual fairness in text generation use cases. The first group of metrics leverage a pre-trained sentiment classifier to measure sentiment disparities in counterfactually generated outputs, more details in Huang et al. [-@huang2020reducingsentimentbiaslanguage]. 
This class uses the `vaderSentiment` classifier by default but also gives users the option to provide a custom sentiment classifier object.^[https://github.com/cjhutto/vaderSentiment] The second group of metrics addresses a stricter desiderata and measures overall similarity in counterfactually generated outputs using well-established text similarity metrics [@bouchard2024actionableframeworkassessingbias]. + The `CounterfactualMetrics` class offers two groups of metrics to assess counterfactual fairness in text generation use cases. The first group of metrics leverages a pre-trained sentiment classifier to measure sentiment disparities in counterfactually generated outputs (see Huang et al. [-@huang2020reducingsentimentbiaslanguage] for further details). This class uses the `vaderSentiment` classifier by default but also gives users the option to provide a custom sentiment classifier object.^[https://github.com/cjhutto/vaderSentiment] The second group of metrics addresses a stricter desideratum and measures overall similarity in counterfactually generated outputs using well-established text similarity metrics [@bouchard2024actionableframeworkassessingbias]. ### Counterfactual Fairness Metrics for Recommendation @@ -90,7 +90,7 @@ When LLMs are used to solve classification problems, traditional machine learnin # Semi-Automated Evaluation ### `AutoEval` class -To streamline assessments for text generation use cases, the `AutoEval` class conducts a multi-step process (each step is described above) that includes metric selection, evaluation dataset generation, and metric computation. The user is required to supply a list of prompts and an instance of `langchain` LLM. Below we provide a basic example demonstrating the execution of `AutoEval.evaluate` with a `gemini-pro` instance.^[Note that this example assumes the user has already set up their VertexAI credentials and sampled a list of prompts from their use case prompts.] 
+To streamline assessments for text generation use cases, the `AutoEval` class conducts a multi-step process (each step is described in detail above) for a comprehensive fairness assessment. Specifically, these steps include metric selection (based on whether FTU is satisfied), evaluation dataset generation from user-provided prompts with a user-provided LLM, and computation of applicable fairness metrics. To implement, the user is required to supply a list of prompts and an instance of a `langchain` LLM. Below we provide a basic example demonstrating the execution of `AutoEval.evaluate` with a `gemini-pro` instance.^[Note that this example assumes the user has already set up their VertexAI credentials and sampled a list of prompts from their use case prompts.] ```python