[BUG] Errors when using BERTScore for evaluation #310

@chuandudx

Description

Describe the bug

Several errors were encountered while trying to use bert_score for evaluation of a summarization task. I would like to share my current approaches to fixing them (with linked PRs), some of which require manually setting variables.

I hope to discuss whether we would like to make some of these changes accessible/configurable via LightevalTaskConfig. I am also curious to learn whether I have missed any steps that led to the issues below.

  1. Metrics.bert_score throws TypeError:
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 236, in evaluate
    self._compute_metrics(sample_id_to_responses)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 288, in _compute_metrics
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 74, in compute
    return self.sample_level_fn(**kwargs)  # result, formatted_doc,
TypeError: BertScore.compute() got an unexpected keyword argument 'formatted_doc'

This can be fixed by adding **kwargs to BertScore.compute() (a sketch follows below), but I also noticed a few other metrics (e.g. BLEURT) that are missing **kwargs, so perhaps they can all be fixed at once?
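
A minimal sketch of that fix, not lighteval's actual implementation: the result key names, the lang="en" scorer, and rescale_with_baseline=False are assumptions here, and the essential change is only that compute() accepts **kwargs.

from bert_score import BERTScorer

class BertScore:
    def __init__(self):
        # Placeholder configuration; lighteval's real metric chooses its own
        # model_type/layers. Baseline rescaling is disabled here (see item 3).
        self.scorer = BERTScorer(lang="en", rescale_with_baseline=False)

    def compute(self, golds: list[str], predictions: list[str], **kwargs) -> dict:
        # **kwargs absorbs the extra arguments (e.g. formatted_doc) that the
        # metric dispatcher passes to every sample-level compute() call.
        p, r, f = self.scorer.score(predictions, golds)
        return {
            "BERTScore-P": p.mean().item(),
            "BERTScore-R": r.mean().item(),
            "BERTScore-F": f.mean().item(),
        }

The same one-line change (accepting **kwargs) would apply to BLEURT and the other affected metrics.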

  2. After fixing the above error, an OverflowError: int too big to convert is encountered. To avoid this, it would be helpful to be able to set the tokenizer's max_length. As seen below, manually overriding it to 512 inside bert_score's sent_encode resolves the issue (a workaround sketch follows the snippet). I am wondering if there is interest in modifying the LightevalTaskConfig class to pass this in?
def sent_encode(tokenizer, sent):
    "Encoding as sentence based on the tokenizer"
    sent = sent.strip()
    if sent == "":
        return tokenizer.build_inputs_with_special_tokens([])
    return tokenizer.encode(
        sent,
        add_special_tokens=True,
        # max_length=tokenizer.model_max_length,
        max_length=512,
        truncation=True,
    )
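
One way to get the same effect without editing bert_score itself is to cap the tokenizer's model_max_length before encoding. This is only a sketch, built on the assumption that the overflow comes from tokenizers that report a huge sentinel model_max_length (on the order of 1e30); the model name and the 512 cap are placeholders.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model

# Assumption: some tokenizers report a sentinel model_max_length (~1e30)
# instead of the real context size, and passing it to truncation overflows.
if tokenizer.model_max_length > 10**6:
    tokenizer.model_max_length = 512

ids = tokenizer.encode(
    "A long model-generated summary ...",
    add_special_tokens=True,
    max_length=tokenizer.model_max_length,
    truncation=True,
)

Exposing such a cap through LightevalTaskConfig, as proposed above, would let tasks set it without patching the library.
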
  3. BERTScore enforces a baseline file and raises ValueError(f"Baseline not Found for {self.model_type} on {self.lang} at {self.baseline}").
  • The current code defaults to using the baseline file, but I wonder whether the default should be False for easier use, or whether an interface to control this value should be provided (see the sketch after this item).
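
For reference, the upstream scorer exposes this switch as rescale_with_baseline; a minimal sketch of running BERTScore with the baseline disabled (lang="en" and the example strings are placeholders):

from bert_score import BERTScorer

# rescale_with_baseline=False skips loading the baseline file entirely,
# so no ValueError is raised for model/language pairs without one.
scorer = BERTScorer(lang="en", rescale_with_baseline=False)
P, R, F1 = scorer.score(["the model summary"], ["the gold reference summary"])
print(F1.mean().item())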

To Reproduce

  1. Define a custom task very similar to this example with metric=[Metrics.bert_score] and run it using the custom tasks path (a sketch of such a task config follows the command):
lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir
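
For completeness, a sketch of such a custom task module. The dataset, prompt function, and most field values are placeholders, and the exact LightevalTaskConfig fields and import paths are assumptions based on the community task template that may differ slightly across lighteval versions.

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def summarization_prompt(line, task_name: str = None):
    # Hypothetical prompt function: maps one dataset row to a Doc whose single
    # "choice" is the gold summary that the generation is scored against.
    return Doc(
        task_name=task_name,
        query=f"Summarize the following article:\n{line['article']}\n\nSummary:",
        choices=[line["summary"]],
        gold_index=0,
    )


TASK = LightevalTaskConfig(
    name="my_summarization_task",       # placeholder task name
    prompt_function=summarization_prompt,
    suite=["community"],
    hf_repo="<dataset on the hub>",     # placeholder dataset
    hf_subset="default",
    hf_avail_splits=["test"],
    evaluation_splits=["test"],
    metric=[Metrics.bert_score],
    generation_size=256,
    stop_sequence=["\n"],
)

TASKS_TABLE = [TASK]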

Expected behavior

Compute BERTScore between the model's summarization output and the gold reference.

Version info

operating system: macOS
lighteval version: 0.5.0.dev0
