
Commit 8e29433

Rename metrics (explodinggradients#48)
Rename `factuality` to `faithfulness` to convey the idea correctly, and in anticipation of an incoming feature that measures factual consistency.
1 parent 9ad17b1 commit 8e29433

File tree

11 files changed: +43 −43 lines

README.md

+4 −4

@@ -74,14 +74,14 @@ dataset: Dataset
 
 results = evaluate(dataset)
 # {'ragas_score': 0.860, 'context_relavency': 0.817,
-# 'factuality': 0.892, 'answer_relevancy': 0.874}
+# 'faithfulness': 0.892, 'answer_relevancy': 0.874}
 ```
 If you want a more in-depth explanation of core components, check out our [quick-start notebook](./examples/quickstart.ipynb)
 ## :luggage: Metrics
 
 Ragas measures your pipeline's performance against two dimensions
-1. **Factuality**: measures the factual consistency of the generated answer against the given context.
-2. **Relevancy**: measures how relevant retrieved contexts and the generated answer are to the question.
+1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claim made in the answer that cannot be deduced from the context is penalized.
+2. **Relevancy**: measures how relevant the retrieved contexts and the generated answer are to the question. The presence of extra or redundant information is penalized.
 
 Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
@@ -103,7 +103,7 @@ If you want to get more involved with Ragas, check out our [discord server](http
 ## :raising_hand_man: FAQ
 1. Why harmonic mean?
 
-Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (factuality = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5 but a harmonic mean will give you 0.0
+Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (faithfulness = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5 but a harmonic mean will give you 0.0
 
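The FAQ answer above is easy to verify with a few lines of plain Python; this is only an illustration of the harmonic-mean arithmetic, not code from the ragas package:

```python
def harmonic_mean(scores):
    """Harmonic mean of a list of scores; collapses to 0.0 if any score is 0."""
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1 / s for s in scores)

faithfulness, relevancy = 1.0, 0.0
print((faithfulness + relevancy) / 2)            # simple average -> 0.5
print(harmonic_mean([faithfulness, relevancy]))  # harmonic mean  -> 0.0
```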

docs/assets/bar-graph.svg

+1 −1 (SVG image; change not rendered here)

docs/metrics.md

+3 −3

@@ -1,15 +1,15 @@
 # Metrics
 
-1. `factuality`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are created from the generated answer, and each of these statements is then verified against the context. The score is scaled to the (0, 1) range; higher is better.
+1. `faithfulness`: measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are created from the generated answer, and each of these statements is then verified against the context. The score is scaled to the (0, 1) range; higher is better.
 ```python
-from ragas.metrics import factuality
+from ragas.metrics import faithfulness
 # Dataset({
 # features: ['question','contexts','answer'],
 # num_rows: 25
 # })
 dataset: Dataset
 
-results = evaluate(dataset, metrics=[factuality])
+results = evaluate(dataset, metrics=[faithfulness])
 ```
 2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer. This is implemented using a custom model. Values range in (0, 1); higher is better.
 ```python
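A rough sketch of the multi-step paradigm described for `faithfulness` above (break the answer into statements, then verify each one against the context). The two helper functions are naive stand-ins invented for illustration; in ragas both steps are performed by LLM prompts, and only the final ratio of supported statements to total statements follows the description above:

```python
def extract_statements(answer: str) -> list[str]:
    # Naive stand-in: treat each sentence of the answer as one statement.
    # In ragas this step is done by prompting an LLM.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(statement: str, context: str) -> bool:
    # Naive stand-in: count a statement as supported if most of its words
    # occur in the context. In ragas an LLM verdict decides this.
    words = statement.lower().split()
    return sum(w in context.lower() for w in words) >= 0.8 * len(words)

def faithfulness_score(answer: str, context: str) -> float:
    """Fraction of answer statements supported by the context, in the (0, 1) range."""
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    return sum(is_supported(s, context) for s in statements) / len(statements)
```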

examples/quickstart.ipynb

+8 −8

@@ -122,7 +122,7 @@
 "\n",
 "Ragas measures your pipeline's performance against two dimensions\n",
 "\n",
-"1. Factuality: measures the factual consistency of the generated answer against the given context.\n",
+"1. Faithfulness: measures the factual consistency of the generated answer against the given context.\n",
 "2. Relevancy: measures how relevant retrieved contexts and the generated answer are to the question.\n",
 "\n",
 "Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.\n",

@@ -137,7 +137,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from ragas.metrics import context_relevancy, answer_relevancy, factuality"
+"from ragas.metrics import context_relevancy, answer_relevancy, faithfulness"
 ]
 },
 {
@@ -149,9 +149,9 @@
 "\n",
 "1. context_relevancy - a measure of how relevant the retrieved context is to the question. Conveys the quality of the retrieval pipeline.\n",
 "2. answer_relevancy - a measure of how relevant the answer is to the question\n",
-"3. factuality - the factual consistency of the answer to the context based on the question.\n",
+"3. faithfulness - the factual consistency of the answer to the context based on the question.\n",
 "\n",
-"**Note:** *`factuality` uses OpenAI's API to compute the score. If you are using this metric, make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
+"**Note:** *`faithfulness` uses OpenAI's API to compute the score. If you are using this metric, make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
 "\n",
 "**Note:** *`context_relevancy` and `answer_relevancy` use very small LLMs to compute the score. They will run on CPU, but having a GPU is recommended.*\n",
 "\n",
@@ -188,7 +188,7 @@
 {
 "data": {
 "text/plain": [
-"{'ragas_score': 0.860, 'context_relavency': 0.817, 'factuality': 0.892, 'answer_relevancy': 0.874}"
+"{'ragas_score': 0.860, 'context_relavency': 0.817, 'faithfulness': 0.892, 'answer_relevancy': 0.874}"
 ]
 },
 "execution_count": 8,

@@ -200,7 +200,7 @@
 "from ragas import evaluate\n",
 "\n",
 "result = evaluate(\n",
-"    fiqa_eval[\"baseline\"], metrics=[context_relevancy, factuality, answer_relevancy]\n",
+"    fiqa_eval[\"baseline\"], metrics=[context_relevancy, faithfulness, answer_relevancy]\n",
 ")\n",
 "\n",
 "result"

@@ -248,7 +248,7 @@
 "      <th>answer</th>\n",
 "      <th>contexts</th>\n",
 "      <th>context_relavency</th>\n",
-"      <th>factuality</th>\n",
+"      <th>faithfulness</th>\n",
 "      <th>answer_relevancy</th>\n",
 "    </tr>\n",
 "  </thead>\n",

@@ -336,7 +336,7 @@
 "3  [Set up a meeting with the bank that handles y...  0.781 \n",
 "4  [The time horizon for your 401K/IRA is essenti...  0.737 \n",
 "\n",
-"   factuality  answer_relevancy \n",
+"   faithfulness  answer_relevancy \n",
 "0  1.0  0.922 \n",
 "1  1.0  0.923 \n",
 "2  1.0  0.824 \n",

0 commit comments
