Quickstart Guide
================

(Optional) Create a virtual environment for using LangFair
------------------------------------------------------------
We recommend creating a new virtual environment using venv before installing LangFair. To do so, please follow the instructions `here <https://docs.python.org/3/library/venv.html>`_.
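
For example, on macOS or Linux an environment can be created and activated as follows (the environment name ``.venv`` is arbitrary; on Windows, activate with ``.venv\Scripts\activate`` instead):

.. code-block:: console

    python -m venv .venv
    source .venv/bin/activate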

Installing LangFair
-------------------
The latest version can be installed from PyPI:

.. code-block:: console

    pip install langfair

Usage Examples
--------------
Below are code samples illustrating how to use LangFair to assess bias and fairness risks in text generation and summarization use cases. The examples below assume the user has already defined a list of prompts from their use case, ``prompts``.
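
For illustration, ``prompts`` is simply a list of strings. The prompts below are hypothetical placeholders and should be replaced with prompts from your own use case.

.. code-block:: python

    # Hypothetical prompts for illustration only -- replace with your own
    prompts = [
        "Summarize the following customer complaint: ...",
        "Draft a polite response to the customer message: ...",
    ]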

Generate LLM responses
^^^^^^^^^^^^^^^^^^^^^^
To generate responses, we can use LangFair's ``ResponseGenerator`` class. First, we must create a ``langchain`` LLM object. Below we use ``ChatVertexAI``, but any of `LangChain's LLM classes <https://js.langchain.com/docs/integrations/chat/>`_ may be used instead. Note that ``InMemoryRateLimiter`` is used to avoid rate limit errors.

.. code-block:: python

    from langchain_google_vertexai import ChatVertexAI
    from langchain_core.rate_limiters import InMemoryRateLimiter

    rate_limiter = InMemoryRateLimiter(
        requests_per_second=4.5,
        check_every_n_seconds=0.5,
        max_bucket_size=280,
    )
    llm = ChatVertexAI(
        model_name="gemini-pro",
        temperature=0.3,
        rate_limiter=rate_limiter
    )

We can use ``ResponseGenerator.generate_responses`` to generate 25 responses for each prompt, as is convention for toxicity evaluation.

.. code-block:: python

    from langfair.generator import ResponseGenerator

    rg = ResponseGenerator(langchain_llm=llm)
    generations = await rg.generate_responses(prompts=prompts, count=25)
    responses = [str(r) for r in generations["data"]["response"]]
    duplicated_prompts = [str(r) for r in generations["data"]["prompt"]]  # so prompts correspond to responses

Compute toxicity metrics
^^^^^^^^^^^^^^^^^^^^^^^^
Toxicity metrics can be computed with ``ToxicityMetrics``. Note that use of ``torch.device`` is optional and should be used if a GPU is available to speed up toxicity computation.

.. code-block:: python

    # import torch  # uncomment if GPU is available
    # device = torch.device("cuda")  # uncomment if GPU is available
    from langfair.metrics.toxicity import ToxicityMetrics

    tm = ToxicityMetrics(
        # device=device,  # uncomment if GPU is available
    )
    tox_result = tm.evaluate(
        prompts=duplicated_prompts,
        responses=responses,
        return_data=True
    )
    tox_result['metrics']
    # Output is below
    # {'Toxic Fraction': 0.0004,
    # 'Expected Maximum Toxicity': 0.013845130120171235,
    # 'Toxicity Probability': 0.01}

Compute stereotype metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^
Stereotype metrics can be computed with ``StereotypeMetrics``.

.. code-block:: python

    from langfair.metrics.stereotype import StereotypeMetrics

    sm = StereotypeMetrics()
    stereo_result = sm.evaluate(responses=responses, categories=["gender"])
    stereo_result['metrics']
    # Output is below
    # {'Stereotype Association': 0.3172750176745329,
    # 'Cooccurrence Bias': 0.4476633365427837,
    # 'Stereotype Fraction - gender': 0.08}

Generate counterfactual responses and compute metrics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We can generate counterfactual responses with ``CounterfactualGenerator``.

.. code-block:: python

    from langfair.generator.counterfactual import CounterfactualGenerator

    cg = CounterfactualGenerator(langchain_llm=llm)
    cf_generations = await cg.generate_responses(
        prompts=prompts, attribute='gender', count=25
    )
    male_responses = [str(r) for r in cf_generations['data']['male_response']]
    female_responses = [str(r) for r in cf_generations['data']['female_response']]

Counterfactual metrics can be easily computed with ``CounterfactualMetrics``.

.. code-block:: python

    from langfair.metrics.counterfactual import CounterfactualMetrics

    cm = CounterfactualMetrics()
    cf_result = cm.evaluate(
        texts1=male_responses,
        texts2=female_responses,
        attribute='gender'
    )
    cf_result
    # Output is below
    # {'Cosine Similarity': 0.8318708,
    # 'RougeL Similarity': 0.5195852482361165,
    # 'Bleu Similarity': 0.3278433712872481,
    # 'Sentiment Bias': 0.00099471451876019657}

Alternative approach: Semi-automated evaluation with ``AutoEval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To streamline assessments for text generation and summarization use cases, the ``AutoEval`` class conducts a multi-step process that completes all of the aforementioned steps with two lines of code.

.. code-block:: python

    from langfair.auto import AutoEval

    auto_object = AutoEval(
        prompts=prompts,
        langchain_llm=llm,
        # toxicity_device=device  # uncomment if GPU is available
    )
    results = await auto_object.evaluate()
    results
    # Output is below
    # {'Toxicity': {'Toxic Fraction': 0.0004,
    #   'Expected Maximum Toxicity': 0.01384513012017123,
    #   'Toxicity Probability': 0.01},
    #  'Stereotype': {'Stereotype Association': 0.3172750176745329,
    #   'Cooccurrence Bias': 0.4476633365427837,
    #   'Stereotype Fraction - gender': 0.08,
    #   'Expected Maximum Stereotype - gender': 0.6035516738891,
    #   'Stereotype Probability - gender': 0.27036},
    #  'Counterfactual': {'male-female': {'Cosine Similarity': 0.8318708,
    #   'RougeL Similarity': 0.5195852482361165,
    #   'Bleu Similarity': 0.3278433712872481,
    #   'Sentiment Bias': 0.00099471451876019577}}}
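
The ``AutoEval`` object can also print the results in a formatted view and export them to a .txt file using its ``print_results`` and ``export_results`` methods:

.. code-block:: python

    auto_object.print_results()
    auto_object.export_results(file_name="metric_values.txt")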