[BUG] Errors when using BERTScore for evaluation #310

@chuandudx

Description

Describe the bug

Several errors were encountered while trying to use bert_score for evaluation of a summarization task. I would like to share my current approaches to fixing them (with linked PRs), some of which require manually setting variables.

I hope to discuss whether we would like to make some of these changes accessible/configurable via LightevalTaskConfig. I am also curious to learn whether I have missed any steps that led to the issues below.

  1. Metrics.bert_score throws TypeError:
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 236, in evaluate
    self._compute_metrics(sample_id_to_responses)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 288, in _compute_metrics
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 74, in compute
    return self.sample_level_fn(**kwargs)  # result, formatted_doc,
TypeError: BertScore.compute() got an unexpected keyword argument 'formatted_doc'

This can be fixed by adding **kwargs to BertScore.compute() (a sketch follows below), but I also noticed a few other metrics (e.g. BLEURT) that are missing **kwargs, so perhaps they can all be fixed at once?
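
A minimal sketch of that fix, not lighteval's actual implementation: the result key names, the lang="en" scorer, and rescale_with_baseline=False are assumptions here, and the essential change is only that compute() accepts **kwargs.

from bert_score import BERTScorer

class BertScore:
    def __init__(self):
        # Placeholder configuration; lighteval's real metric chooses its own
        # model_type/layers. Baseline rescaling is disabled here (see item 3).
        self.scorer = BERTScorer(lang="en", rescale_with_baseline=False)

    def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict:
        # **kwargs absorbs the extra arguments (e.g. formatted_doc) that the
        # metric dispatcher passes to every sample-level compute() call.
        p, r, f = self.scorer.score(predictions, golds)
        return {
            "BERTScore-P": p.mean().item(),
            "BERTScore-R": r.mean().item(),
            "BERTScore-F": f.mean().item(),
        }

The same one-line change (accepting **kwargs) would apply to BLEURT and the other affected metrics.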

  2. After fixing the above error, an OverflowError: int too big to convert is encountered. To avoid this, it would be helpful to be able to set the tokenizer's max_length. As seen below, manually overriding it to 512 inside bert_score's sent_encode resolves the issue (a workaround sketch follows the snippet). I am wondering if there is interest in modifying the LightevalTaskConfig class to pass this in?
def sent_encode(tokenizer, sent):
    "Encoding as sentence based on the tokenizer"
    sent = sent.strip()
    if sent == "":
        return tokenizer.build_inputs_with_special_tokens([])
    return tokenizer.encode(
        sent,
        add_special_tokens=True,
        # max_length=tokenizer.model_max_length,
        max_length=512,
        truncation=True,
    )
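
One way to get the same effect without editing bert_score itself is to cap the tokenizer's model_max_length before encoding. This is only a sketch, built on the assumption that the overflow comes from tokenizers that report a huge sentinel model_max_length (on the order of 1e30); the model name and the 512 cap are placeholders.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

# Assumption: some tokenizers report a sentinel model_max_length (~1e30)
# instead of the real context size, and passing it to truncation overflows.
if tokenizer.model_max_length > 10**6:
    tokenizer.model_max_length = 512

ids = tokenizer.encode(
    "A long model-generated summary ...",
    add_special_tokens=True,
    max_length=tokenizer.model_max_length,
    truncation=True,
)

Exposing such a cap through LightevalTaskConfig, as proposed above, would let tasks set it without patching the library.
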
  3. BERTScore enforces a baseline file and raises ValueError(f"Baseline not Found for {self.model_type} on {self.lang} at {self.baseline}").
  • The current code defaults to using the baseline file, but I wonder whether the default should be False for easier use, or whether an interface to control this value should be provided (see the sketch after this item).
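
For reference, the upstream scorer exposes this switch as rescale_with_baseline; a minimal sketch of running BERTScore with the baseline disabled (lang="en" and the example strings are placeholders):

from bert_score import BERTScorer

# rescale_with_baseline=False skips loading the baseline file entirely,
# so no ValueError is raised for model/language pairs without one.
scorer = BERTScorer(lang="en", rescale_with_baseline=False)
P, R, F1 = scorer.score(["the model summary"], ["the gold reference summary"])
print(F1.mean().item())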

To Reproduce

  1. Define a custom task very similar to this example with metric=[Metrics.bert_score] and run it using the custom tasks path (a sketch of such a task config follows the command):
lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir
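
For completeness, a sketch of such a custom task module. The dataset, prompt function, and most field values are placeholders, and the exact LightevalTaskConfig fields and import paths are assumptions based on the community task template that may differ slightly across lighteval versions.

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def summarization_prompt(line, task_name: str = None):
    # Hypothetical prompt function: maps one dataset row to a Doc whose single
    # "choice" is the gold summary that the generation is scored against.
    return Doc(
        task_name=task_name,
        query=f"Summarize the following article:\n{line['article']}\n\nSummary:",
        choices=[line["summary"]],
        gold_index=0,
    )


TASK = LightevalTaskConfig(
    name="my_summarization_task",       # placeholder task name
    prompt_function=summarization_prompt,
    suite=["community"],
    hf_repo="<dataset on the hub>",     # placeholder dataset
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.bert_score],
    generation_size=256,
    stop_sequence=["\n"],
)

TASKS_TABLE = [TASK]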

Expected behavior

Compute BERTScore between the model's summarization output and the gold reference.

Version info

operating system: macOS
lighteval version: 0.5.0.dev0
