Is it possible to define custom metrics with Ragas? #1032
Comments
Hello @saadbouhya! I am here to help you with your questions about Ragas. I am a bot that can help you resolve bugs, answer questions, and become a contributor. Yes, it is possible to define custom metrics with Ragas for use with French data, even though the automatic prompt adaptation is not integrated with the evaluation. You can create custom metrics similarly to how the built-in metrics are implemented.
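For concreteness, here is a minimal sketch of what such a custom metric could look like. It assumes the ragas 0.1.x metric interface (`MetricWithLLM`, `EvaluationMode`, and an async `_ascore` hook, whose exact signature varies across releases); the `french_conciseness` name and its scoring rule are invented purely for illustration.

```python
import typing as t
from dataclasses import dataclass

from langchain_core.callbacks import Callbacks

from ragas.metrics.base import EvaluationMode, MetricWithLLM


@dataclass
class FrenchConciseness(MetricWithLLM):
    """Hypothetical custom metric scoring the brevity of a French answer."""

    name: str = "french_conciseness"  # illustrative, not a built-in metric
    evaluation_mode: EvaluationMode = EvaluationMode.qa  # uses question + answer

    async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
        # A real metric would prompt self.llm in French here; this stand-in
        # simply rewards answers of fifty words or fewer.
        answer = row["answer"]
        return 1.0 if len(answer.split()) <= 50 else 0.5
```

An instance of this class can then be passed in the `metrics` list of `evaluate`, alongside the built-in metrics.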
@dosu even though the prompts can adapt, the language is hardcoded as English at the evaluation level.
To modify the evaluation process in Ragas to support French data, you need to adjust the `evaluate` function so that it accepts a `language` parameter. Here is the relevant part of the code with the necessary modifications:

````python
# Excerpt of ragas' `evaluate` function; the module-level imports are omitted here.
def evaluate(
    dataset: Dataset,
    metrics: list[Metric] | None = None,
    llm: t.Optional[BaseRagasLLM | LangchainLLM] = None,
    embeddings: t.Optional[BaseRagasEmbeddings | LangchainEmbeddings] = None,
    callbacks: Callbacks = None,
    in_ci: bool = False,
    is_async: bool = True,
    run_config: t.Optional[RunConfig] = None,
    raise_exceptions: bool = True,
    column_map: t.Optional[t.Dict[str, str]] = None,
    language: str = "en",  # Add a language parameter with English as the default
) -> Result:
"""
Run the evaluation on the dataset with different metrics
Parameters
----------
dataset : Dataset[question: list[str], contexts: list[list[str]], answer: list[str], ground_truth: list[list[str]]]
The dataset in the format of ragas which the metrics will use to score the RAG
pipeline with
metrics : list[Metric] , optional
List of metrics to use for evaluation. If not provided then ragas will run the
evaluation on the best set of metrics to give a complete view.
llm: BaseRagasLLM, optional
The language model to use for the metrics. If not provided then ragas will use
the default language model for metrics which require an LLM. This can we overridden by the llm specified in
the metric level with `metric.llm`.
embeddings: BaseRagasEmbeddings, optional
The embeddings to use for the metrics. If not provided then ragas will use
the default embeddings for metrics which require embeddings. This can we overridden by the embeddings specified in
the metric level with `metric.embeddings`.
callbacks: Callbacks, optional
Lifecycle Langchain Callbacks to run during evaluation. Check the
[langchain documentation](https://python.langchain.com/docs/modules/callbacks/)
for more information.
in_ci: bool
Whether the evaluation is running in CI or not. If set to True then some
metrics will be run to increase the reproducability of the evaluations. This
will increase the runtime and cost of evaluations. Default is False.
is_async: bool
Whether to run the evaluation in async mode or not. If set to True then the
evaluation is run by calling the `metric.ascore` method. In case the llm or
embeddings does not support async then the evaluation can be run in sync mode
with `is_async=False`. Default is False.
run_config: RunConfig, optional
Configuration for runtime settings like timeout and retries. If not provided,
default values are used.
raise_exceptions: True
Whether to raise exceptions or not. If set to True then the evaluation will
raise an exception if any of the metrics fail. If set to False then the
evaluation will return `np.nan` for the row that failed. Default is True.
column_map : dict[str, str], optional
The column names of the dataset to use for evaluation. If the column names of
the dataset are different from the default ones then you can provide the
mapping as a dictionary here. Example: If the dataset column name is contexts_v1,
column_map can be given as {"contexts":"contexts_v1"}
language: str, optional
The language of the dataset. Default is "en" (English). Set to "fr" for French.
Returns
-------
Result
Result object containing the scores of each metric. You can use this do analysis
later.
Raises
------
ValueError
if validation fails because the columns required for the metrics are missing or
if the columns are of the wrong format.
Examples
--------
the basic usage is as follows:
```
from ragas import evaluate
>>> dataset
Dataset({
features: ['question', 'ground_truth', 'answer', 'contexts'],
num_rows: 30
})
>>> result = evaluate(dataset, language="fr")
>>> print(result)
{'context_precision': 0.817,
'faithfulness': 0.892,
'answer_relevancy': 0.874}
```
"""
    column_map = column_map or {}
    callbacks = callbacks or []

    if dataset is None:
        raise ValueError("Provide dataset!")

    # default run_config
    run_config = run_config or RunConfig()

    # default metrics
    if metrics is None:
        from ragas.metrics import (
            answer_relevancy,
            context_precision,
            context_recall,
            faithfulness,
        )

        metrics = [answer_relevancy, context_precision, faithfulness, context_recall]

    # remap column names from the dataset
    dataset = remap_column_names(dataset, column_map)

    # validation
    dataset = handle_deprecated_ground_truths(dataset)
    validate_evaluation_modes(dataset, metrics)
    validate_column_dtypes(dataset)

    # set the llm and embeddings
    if isinstance(llm, LangchainLLM):
        llm = LangchainLLMWrapper(llm, run_config=run_config)
    if isinstance(embeddings, LangchainEmbeddings):
        embeddings = LangchainEmbeddingsWrapper(embeddings)

    # init llms and embeddings
    binary_metrics = []
    llm_changed: t.List[int] = []
    embeddings_changed: t.List[int] = []
    reproducable_metrics: t.List[int] = []
    answer_correctness_is_set = -1

    # loop through the metrics and perform initializations
    for i, metric in enumerate(metrics):
        # set llm and embeddings if not set
        if isinstance(metric, AspectCritique):
            binary_metrics.append(metric.name)
        if isinstance(metric, MetricWithLLM) and metric.llm is None:
            if llm is None:
                llm = llm_factory()
            metric.llm = llm
            llm_changed.append(i)
        if isinstance(metric, MetricWithEmbeddings) and metric.embeddings is None:
            if embeddings is None:
                embeddings = embedding_factory()
            metric.embeddings = embeddings
            embeddings_changed.append(i)
        if isinstance(metric, AnswerCorrectness):
            if metric.answer_similarity is None:
                answer_correctness_is_set = i
        # set reproducibility for metrics if in CI
        if in_ci and is_reproducable(metric):
            if metric.reproducibility == 1:  # type: ignore
                # only set a value if not already set
                metric.reproducibility = 3  # type: ignore
                reproducable_metrics.append(i)

        # init all the models
        metric.init(run_config)

    executor = Executor(
        desc="Evaluating",
        keep_progress_bar=True,
        raise_exceptions=raise_exceptions,
        run_config=run_config,
    )

    # new evaluation chain
    row_run_managers = []
    evaluation_rm, evaluation_group_cm = new_group(
        name="ragas evaluation", inputs={}, callbacks=callbacks, is_async=is_async
    )
    for i, row in enumerate(dataset):
        row = t.cast(t.Dict[str, t.Any], row)
        row_rm, row_group_cm = new_group(
            name=f"row {i}",
            inputs=row,
            callbacks=evaluation_group_cm,
            is_async=is_async,
        )
        row_run_managers.append((row_rm, row_group_cm))
        [
            executor.submit(
                metric.ascore,
                row,
                row_group_cm,
                is_async,
                name=f"{metric.name}-{i}",
                thread_timeout=run_config.thread_timeout,
            )
            for metric in metrics
        ]
    scores = []
    try:
        # get the results
        results = executor.results()
        if results == []:
            raise ExceptionInRunner()

        # convert results to dataset_like
        for i, _ in enumerate(dataset):
            s = {}
            for j, m in enumerate(metrics):
                s[m.name] = results[len(metrics) * i + j]
            scores.append(s)

            # close the row chain
            row_rm, row_group_cm = row_run_managers[i]
            if not row_group_cm.ended:
                row_rm.on_chain_end(s)
    # run evaluation task
    except Exception as e:
        if not evaluation_group_cm.ended:
            evaluation_rm.on_chain_error(e)
        raise e
    else:
        result = Result(
            scores=Dataset.from_list(scores),
            dataset=dataset,
            binary_columns=binary_metrics,
        )
        if not evaluation_group_cm.ended:
            evaluation_rm.on_chain_end(result)
    finally:
        # reset llms and embeddings if changed
        for i in llm_changed:
            t.cast(MetricWithLLM, metrics[i]).llm = None
        for i in embeddings_changed:
            t.cast(MetricWithEmbeddings, metrics[i]).embeddings = None
        if answer_correctness_is_set != -1:
            t.cast(
                AnswerCorrectness, metrics[answer_correctness_is_set]
            ).answer_similarity = None
        for i in reproducable_metrics:
            metrics[i].reproducibility = 1  # type: ignore

    # log the evaluation event
    metrics_names = [m.name for m in metrics]
    metric_lang = [get_feature_language(m) for m in metrics]
    metric_lang = np.unique([m for m in metric_lang if m is not None])
    track(
        EvaluationEvent(
            event_type="evaluation",
            metrics=metrics_names,
            evaluation_mode="",
            num_rows=dataset.shape[0],
            language=language,  # Use the language parameter here
            in_ci=in_ci,
        )
    )

    return result
````

With these modifications, you can now pass the `language` parameter when calling `evaluate`, and it is recorded in the tracked evaluation event instead of a hardcoded English value.
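Assuming the patched `evaluate` above is in place, usage on a French dataset could look like the following (the dataset contents here are illustrative):

```python
from datasets import Dataset

from ragas import evaluate  # the patched version shown above
from ragas.metrics import answer_relevancy, faithfulness

data = {
    "question": ["Quelle est la capitale de la France ?"],
    "answer": ["La capitale de la France est Paris."],
    "contexts": [["Paris est la capitale et la plus grande ville de France."]],
    "ground_truth": ["Paris"],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy],
    language="fr",  # recorded in the evaluation event instead of hardcoded English
)
print(result)
```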
I need to use it with French data, but the automatic prompt adaptation is not integrated with the evaluation.
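Until the evaluation level accepts a language, the prompt side can still be handled separately: ragas 0.1.x ships an `adapt` helper that rewrites a metric's internal few-shot prompts into a target language before evaluation. A sketch, assuming that API and an OpenAI chat model:

```python
from langchain_openai import ChatOpenAI

from ragas import adapt
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Rewrite each metric's few-shot prompts into French once; adapted prompts
# are cached so subsequent evaluation runs can reuse them.
adapt(
    metrics=[faithfulness, answer_relevancy, context_precision],
    language="french",
    llm=ChatOpenAI(model="gpt-4"),
)
```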