Commit 6314efc

Author: Zeya Ahmad
Reorganize documentation
1 parent b44c597 commit 6314efc

3 files changed: +214 -55 lines changed

New file (59 lines added):

Choosing Metrics
================

Choosing Bias and Fairness Metrics for an LLM Use Case
------------------------------------------------------

Selecting the appropriate bias and fairness metrics is essential for accurately assessing the performance of large language models (LLMs) in specific use cases. Instead of attempting to compute all possible metrics, practitioners should focus on a relevant subset that aligns with their specific goals and the context of their application.

Our decision framework for selecting appropriate evaluation metrics is illustrated in the diagram below. For more details, refer to our `technical playbook <https://arxiv.org/abs/2407.10853>`_.

.. image:: ./_static/images/use_case_framework.PNG
    :width: 800
    :align: center
    :alt: Use Case Framework

.. note::

    Fairness through unawareness means none of the prompts for an LLM
    use case include any mention of protected attribute words.
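
For a quick, illustrative check of fairness through unawareness, a simple keyword scan over the use-case prompts can flag mentions of protected attribute words. The sketch below is plain Python rather than LangFair's API, and the word list and ``prompts`` entries are hypothetical placeholders.

.. code-block:: python

    # Illustrative sketch only: flag prompts that mention protected attribute words.
    # The word list below is a hypothetical, non-exhaustive example for gender.
    GENDER_WORDS = {"she", "he", "her", "him", "hers", "his", "woman", "man", "female", "male"}

    def mentions_gender(prompt: str) -> bool:
        # Simple whitespace tokenization; real prompts may need better tokenization.
        return any(token.strip(".,!?").lower() in GENDER_WORDS for token in prompt.split())

    prompts = ["How do I open a savings account?", "Summarize this claim note."]  # placeholder prompts
    ftu_holds = not any(mentions_gender(p) for p in prompts)
    print(f"Fairness through unawareness holds: {ftu_holds}")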

Supported Bias and Fairness Metrics
-----------------------------------

Bias and fairness metrics offered by LangFair are grouped into several categories. The full suite of metrics is displayed below.

**Toxicity Metrics**

* Expected Maximum Toxicity `[Gehman et al., 2020] <https://arxiv.org/abs/2009.11462>`_
* Toxicity Probability `[Gehman et al., 2020] <https://arxiv.org/abs/2009.11462>`_
* Toxic Fraction `[Liang et al., 2023] <https://arxiv.org/abs/2211.09110>`_

**Counterfactual Fairness Metrics**

* Strict Counterfactual Sentiment Parity `[Huang et al., 2020] <https://arxiv.org/abs/1911.03064>`_
* Weak Counterfactual Sentiment Parity `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_
* Counterfactual Cosine Similarity Score `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_
* Counterfactual BLEU `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_
* Counterfactual ROUGE-L `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_

**Stereotype Metrics**

* Stereotypical Associations `[Liang et al., 2023] <https://arxiv.org/abs/2211.09110>`_
* Co-occurrence Bias Score `[Bordia & Bowman, 2019] <https://arxiv.org/abs/1904.03035>`_
* Stereotype classifier metrics `[Zekun et al., 2023] <https://arxiv.org/abs/2311.14126>`_, `[Bouchard, 2024] <https://arxiv.org/abs/2407.10853>`_

**Recommendation (Counterfactual) Fairness Metrics**

* Jaccard Similarity `[Zhang et al., 2023] <https://dl.acm.org/doi/10.1145/3604915.3608860>`_
* Search Result Page Misinformation Score `[Zhang et al., 2023] <https://dl.acm.org/doi/10.1145/3604915.3608860>`_
* Pairwise Ranking Accuracy Gap `[Zhang et al., 2023] <https://dl.acm.org/doi/10.1145/3604915.3608860>`_

**Classification Fairness Metrics**

* Predicted Prevalence Rate Disparity `[Feldman et al., 2015] <https://arxiv.org/abs/1412.3756>`_, `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Negative Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Omission Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Positive Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
* False Discovery Rate Disparity `[Bellamy et al., 2018] <https://arxiv.org/abs/1810.01943>`_, `[Saleiro et al., 2019] <https://arxiv.org/abs/1811.05577>`_
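
For intuition on how such metrics are constructed, the sketch below computes a predicted prevalence rate per group and reports their difference. This is a conceptual illustration in plain Python, not LangFair's implementation; ``y_pred`` and ``groups`` are hypothetical placeholders, and disparities may also be reported as ratios.

.. code-block:: python

    # Conceptual illustration only (not LangFair's implementation):
    # predicted prevalence rate = share of positive predictions within each group.
    y_pred = [1, 0, 1, 1, 0, 0, 1, 0]                   # hypothetical binary predictions
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]   # hypothetical group labels

    def prevalence_rate(group: str) -> float:
        preds = [p for p, g in zip(y_pred, groups) if g == group]
        return sum(preds) / len(preds)

    disparity = prevalence_rate("a") - prevalence_rate("b")
    print(f"Predicted prevalence rate disparity (a vs. b): {disparity:.2f}")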

docs_src/latest/source/index.rst (+35 -7):

   sphinx-quickstart on Wed Jun 12 09:11:05 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to LangFair's documentation!
====================================

LLM bias and fairness made simple
---------------------------------

LangFair is a comprehensive Python library designed for conducting use-case-specific bias and fairness assessments of large language models (LLMs). Using a unique Bring Your Own Prompts (BYOP) approach, LangFair helps you:

✨ **Evaluate Real-World Scenarios**: Assess bias and fairness for actual LLM use cases

🎯 **Get Actionable Metrics**: Measure toxicity, stereotypes, and fairness with applicable metrics

🔍 **Make Informed Decisions**: Use our framework to choose the right evaluation metrics

🛠️ **Simple Integration**: Easy-to-use Python interface for seamless implementation

:doc:`Get Started → <usage>` | :doc:`View Examples → <auto_examples/index>`

Why LangFair?
-------------

Static benchmark assessments, which are typically assumed to be sufficiently representative, often fall short in capturing the risks associated with all possible use cases of LLMs. These models are increasingly used in various applications, including recommendation systems, classification, text generation, and summarization. However, evaluating these models without considering use-case-specific prompts can lead to misleading assessments of their performance, especially regarding bias and fairness risks.

LangFair addresses this gap by adopting a Bring Your Own Prompts (BYOP) approach, allowing users to tailor bias and fairness evaluations to their specific use cases. This ensures that the computed metrics reflect the true performance of the LLMs in real-world scenarios, where prompt-specific risks are critical. Additionally, LangFair focuses on output-based metrics that are practical for governance audits and real-world testing, without needing access to internal model states.

Quick Links
-----------

.. toctree::
   :maxdepth: 1
   :caption: Contents:

   Get Started <usage>
   Choosing Metrics <choosing_metrics>
   API <api>
   auto_examples/index
   Contributor Guide <guide>

Featured Resources
------------------

- 🚀 :doc:`Get started <usage>` in minutes
- 🔬 Explore our :doc:`framework for choosing metrics <choosing_metrics>`
- 💡 Try our :doc:`guided examples <auto_examples/index>`
- 📖 Read the `research paper <https://arxiv.org/abs/2407.10853>`_

docs_src/latest/source/usage.rst (+120 -48):

Quickstart Guide
================

(Optional) Create a virtual environment for using LangFair
-----------------------------------------------------------
We recommend creating a new virtual environment using venv before installing LangFair. To do so, please follow the instructions `here <https://docs.python.org/3/library/venv.html>`_.
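
For example, a virtual environment can be created and activated as follows (the environment name ``.venv`` is arbitrary; the activation command shown is for macOS/Linux):

.. code-block:: console

    python -m venv .venv
    source .venv/bin/activate   # on Windows: .venv\Scripts\activate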

Installing LangFair
-------------------
The latest version can be installed from PyPI:

.. code-block:: console

    pip install langfair

Usage Examples
--------------
Below are code samples illustrating how to use LangFair to assess bias and fairness risks in text generation and summarization use cases. The examples below assume the user has already defined a list of prompts from their use case, ``prompts``.
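
For illustration, ``prompts`` might simply be a list of strings drawn from the use case; the entries below are hypothetical placeholders.

.. code-block:: python

    # Hypothetical placeholders: in practice, sample prompts from your own use case data.
    prompts = [
        "Summarize the customer's complaint in two sentences: ...",
        "Draft a polite reply to the following email: ...",
    ]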

Generate LLM responses
^^^^^^^^^^^^^^^^^^^^^^
To generate responses, we can use LangFair's ``ResponseGenerator`` class. First, we must create a ``langchain`` LLM object. Below we use ``ChatVertexAI``, but any of `LangChain's LLM classes <https://js.langchain.com/docs/integrations/chat/>`_ may be used instead. Note that ``InMemoryRateLimiter`` is used to avoid rate limit errors.

.. code-block:: python

    from langchain_google_vertexai import ChatVertexAI
    from langchain_core.rate_limiters import InMemoryRateLimiter

    rate_limiter = InMemoryRateLimiter(
        requests_per_second=4.5,
        check_every_n_seconds=0.5,
        max_bucket_size=280,
    )
    llm = ChatVertexAI(
        model_name="gemini-pro",
        temperature=0.3,
        rate_limiter=rate_limiter
    )
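
Alternatively, an Azure OpenAI deployment can be used in place of ``ChatVertexAI``. The sketch below assumes the parameters ``DEPLOYMENT_NAME``, ``API_KEY``, ``API_BASE``, ``API_TYPE``, and ``API_VERSION`` have already been defined.

.. code-block:: python

    from langchain_openai import AzureChatOpenAI

    # Assumes DEPLOYMENT_NAME, API_KEY, API_BASE, API_TYPE, and API_VERSION are defined elsewhere
    llm = AzureChatOpenAI(
        deployment_name=DEPLOYMENT_NAME,
        openai_api_key=API_KEY,
        azure_endpoint=API_BASE,
        openai_api_type=API_TYPE,
        openai_api_version=API_VERSION,
        temperature=0.4  # user to set temperature
    )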

We can use ``ResponseGenerator.generate_responses`` to generate 25 responses for each prompt, as is convention for toxicity evaluation.

.. code-block:: python

    from langfair.generator import ResponseGenerator

    rg = ResponseGenerator(langchain_llm=llm)
    generations = await rg.generate_responses(prompts=prompts, count=25)
    responses = [str(r) for r in generations["data"]["response"]]
    duplicated_prompts = [str(r) for r in generations["data"]["prompt"]]  # so prompts correspond to responses

Compute toxicity metrics
^^^^^^^^^^^^^^^^^^^^^^^^
Toxicity metrics can be computed with ``ToxicityMetrics``. Note that use of ``torch.device`` is optional and should be used if a GPU is available to speed up toxicity computation.

.. code-block:: python

    # import torch  # uncomment if GPU is available
    # device = torch.device("cuda")  # uncomment if GPU is available
    from langfair.metrics.toxicity import ToxicityMetrics

    tm = ToxicityMetrics(
        # device=device,  # uncomment if GPU is available
    )
    tox_result = tm.evaluate(
        prompts=duplicated_prompts,
        responses=responses,
        return_data=True
    )
    tox_result['metrics']
    # Output is below
    # {'Toxic Fraction': 0.0004,
    # 'Expected Maximum Toxicity': 0.013845130120171235,
    # 'Toxicity Probability': 0.01}

Compute stereotype metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^
Stereotype metrics can be computed with ``StereotypeMetrics``.

.. code-block:: python

    from langfair.metrics.stereotype import StereotypeMetrics

    sm = StereotypeMetrics()
    stereo_result = sm.evaluate(responses=responses, categories=["gender"])
    stereo_result['metrics']
    # Output is below
    # {'Stereotype Association': 0.3172750176745329,
    # 'Cooccurrence Bias': 0.4476633365427837,
    # 'Stereotype Fraction - gender': 0.08}

Generate counterfactual responses and compute metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We can generate counterfactual responses with ``CounterfactualGenerator``.

.. code-block:: python

    from langfair.generator.counterfactual import CounterfactualGenerator

    cg = CounterfactualGenerator(langchain_llm=llm)
    cf_generations = await cg.generate_responses(
        prompts=prompts, attribute='gender', count=25
    )
    male_responses = [str(r) for r in cf_generations['data']['male_response']]
    female_responses = [str(r) for r in cf_generations['data']['female_response']]

Counterfactual metrics can be easily computed with ``CounterfactualMetrics``.

.. code-block:: python

    from langfair.metrics.counterfactual import CounterfactualMetrics

    cm = CounterfactualMetrics()
    cf_result = cm.evaluate(
        texts1=male_responses,
        texts2=female_responses,
        attribute='gender'
    )
    cf_result
    # Output is below
    # {'Cosine Similarity': 0.8318708,
    # 'RougeL Similarity': 0.5195852482361165,
    # 'Bleu Similarity': 0.3278433712872481,
    # 'Sentiment Bias': 0.00099471451876019657}

Alternative approach: Semi-automated evaluation with ``AutoEval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To streamline assessments for text generation and summarization use cases, the ``AutoEval`` class conducts a multi-step process that completes all of the aforementioned steps with two lines of code.

.. code-block:: python

    from langfair.auto import AutoEval

    auto_object = AutoEval(
        prompts=prompts,
        langchain_llm=llm,
        # toxicity_device=device  # uncomment if GPU is available
    )
    results = await auto_object.evaluate()
    results
    # Output is below
    # {'Toxicity': {'Toxic Fraction': 0.0004,
    # 'Expected Maximum Toxicity': 0.01384513012017123,
    # 'Toxicity Probability': 0.01},
    # 'Stereotype': {'Stereotype Association': 0.3172750176745329,
    # 'Cooccurrence Bias': 0.4476633365427837,
    # 'Stereotype Fraction - gender': 0.08,
    # 'Expected Maximum Stereotype - gender': 0.6035516738891,
    # 'Stereotype Probability - gender': 0.27036},
    # 'Counterfactual': {'male-female': {'Cosine Similarity': 0.8318708,
    # 'RougeL Similarity': 0.5195852482361165,
    # 'Bleu Similarity': 0.3278433712872481,
    # 'Sentiment Bias': 0.00099471451876019577}}}
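
Results can also be printed and exported to a .txt file:

.. code-block:: python

    auto_object.export_results(file_name="metric_values.txt")
    auto_object.print_results()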
