Describe the bug
Several errors were encountered while trying to use `bert_score` for the evaluation of a summarization task. I would like to share my current approaches to fixing them (with linked PRs), some of which require manually setting variables. I hope to discuss whether we would like to make some of these changes accessible/configurable via `LightevalTaskConfig`. I am also curious to learn whether I have missed some steps that led to the issues below.
- `Metrics.bert_score` throws a `TypeError`:
File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 236, in evaluate
self._compute_metrics(sample_id_to_responses)
File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 288, in _compute_metrics
metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
metric.compute(
File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 74, in compute
return self.sample_level_fn(**kwargs) # result, formatted_doc,
TypeError: BertScore.compute() got an unexpected keyword argument 'formatted_doc'
This can be fixed by adding `**kwargs` to the `compute` function of `BertScore` (see the sketch below), but I also noticed a few other metrics (e.g. BLEURT) that are missing `**kwargs`, so maybe they can all be fixed at once?
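A minimal sketch of the idea, assuming the sample-level function takes golds and predictions (parameter names are illustrative, not necessarily the exact lighteval signatures):

```python
class BertScore:
    def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict:
        # `**kwargs` swallows the extra arguments (e.g. `formatted_doc`) that the
        # pipeline forwards to every sample-level metric, so calls that pass
        # `formatted_doc` no longer raise a TypeError. The scoring logic itself
        # stays unchanged.
        ...
```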
- After fixing the above error, `OverflowError: int too big to convert` is encountered. To avoid this, it would be helpful to be able to set the `max_length` of the tokenizer. As seen below, when I manually override it to 512 in `sent_encode`, the issue is resolved. I am wondering whether there is interest in modifying the `LightevalTaskConfig` class to pass this in?
```python
def sent_encode(tokenizer, sent):
    "Encoding as sentence based on the tokenizer"
    sent = sent.strip()
    if sent == "":
        return tokenizer.build_inputs_with_special_tokens([])
    return tokenizer.encode(
        sent,
        add_special_tokens=True,
        # max_length=tokenizer.model_max_length,
        max_length=512,
        truncation=True,
    )
```
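The overflow likely comes from tokenizers whose `model_max_length` is an unbounded sentinel value (on the order of 1e30), which cannot be converted further down the stack. If we expose this, a defensive cap along these lines would avoid hard-coding 512; the helper name, the 512 default, and the sentinel threshold are assumptions for illustration:

```python
from transformers import AutoTokenizer

def safe_max_length(tokenizer, cap: int = 512) -> int:
    # Some tokenizers report model_max_length as a huge sentinel value;
    # fall back to the cap in that case instead of overflowing downstream.
    max_len = getattr(tokenizer, "model_max_length", cap)
    return max_len if 0 < max_len <= 1_000_000 else cap

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
print(safe_max_length(tokenizer))  # 512 for roberta-large
```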
- BERTScore enforces a baseline file and raises `ValueError(f"Baseline not Found for {self.model_type} on {self.lang} at {self.baseline}")` when it is missing.
- The current code defaults to using the baseline file, but I wonder whether it would be preferable for the default to be `False` for easier use, or to provide an interface to control this value (see the sketch after this list)?
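For reference, this is roughly what the non-baseline default looks like when using the `bert_score` package's `BERTScorer` directly; how the flag is wired through lighteval's metric/task config is the open question, and the example sentences are only illustrative:

```python
from bert_score import BERTScorer

# With rescale_with_baseline=False (the proposed default), no baseline file
# is required; setting it to True restores the current rescaling behaviour.
scorer = BERTScorer(lang="en", rescale_with_baseline=False)
precision, recall, f1 = scorer.score(
    cands=["the cat sat on the mat"],
    refs=["a cat was sitting on the mat"],
)
print(f1)
```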
To Reproduce
- Define a custom task very similar to this example, with `metric=[Metrics.bert_score]`, and run it using the custom tasks path:
```bash
lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir
```
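For completeness, a hypothetical custom task definition along the lines of the linked example; the dataset, column names, and prompt function are assumptions, and the exact `LightevalTaskConfig` fields may differ between lighteval versions:

```python
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def summarization_prompt(line, task_name: str = None) -> Doc:
    # Maps a dataset row to a Doc; the "article"/"summary" columns are placeholders.
    return Doc(
        task_name=task_name,
        query=f"Summarize the following article:\n{line['article']}\n\nSummary:",
        choices=[line["summary"]],
        gold_index=0,
    )


TASKS_TABLE = [
    LightevalTaskConfig(
        name="my_summarization_task",
        prompt_function=summarization_prompt,
        suite=["community"],
        hf_repo="<dataset on the hub>",
        hf_subset="default",
        evaluation_splits=["test"],
        metric=[Metrics.bert_score],
        generation_size=256,
        stop_sequence=["\n"],
    )
]
```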
Expected behavior
Compute BERTScore for the model's summarization outputs against the gold references.
Version info
operating system: macOS
lighteval version: 0.5.0.dev0