
[BUG] Errors when using BLEURT metric #315

@chuandudx

Description

Describe the bug

  1. Added **kwargs to allow formatted_doc to be passed into the metric computation, to address the following (see the first sketch after this list):
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 75, in compute
    return {self.metric_name: self.sample_level_fn(**kwargs)}  # result, formatted_doc,
TypeError: BLEURT.compute() got an unexpected keyword argument 'formatted_doc'
  2. Created a BLEURT() instance (rather than registering the unbound class method) to address the following:
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 75, in compute
    return {self.metric_name: self.sample_level_fn(**kwargs)}  # result, formatted_doc,
TypeError: BLEURT.compute() missing 1 required positional argument: 'self'
  3. Unable to take the mean of the compute output:
  • the BLEURT compute function collects per-sample outputs into a list: scores = self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze() (code reference)
  • example input to corpus_level_fn: [tensor(-1.3048, grad_fn=<SqueezeBackward0>), tensor(-1.2869, grad_fn=<SqueezeBackward0>), tensor(-1.3146, grad_fn=<SqueezeBackward0>)]
  • the tensor values therefore need to be extracted before taking the mean (see the compute_mean() sketch after this list).
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/main_accelerate.py", line 85, in main
    pipeline.evaluate()
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 241, in evaluate
    self.evaluation_tracker.metrics_logger.aggregate(task_dict=self.task_dict, bootstrap_iters=1000)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/logging/info_loggers.py", line 508, in aggregate
    metric_result = task.aggregation()[metric_name](metric_values)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/metrics.py", line 131, in <lambda>
    corpus_level_fn=lambda x: np.mean(x.flatten()),  # flatten, then average
AttributeError: 'list' object has no attribute 'flatten'
  4. Tried lambda x: torch.stack(x).mean(), but encountered a pickle error:
_pickle.PicklingError: Can't pickle <function Metrics.<lambda> at 0x17d25a290>: attribute lookup Metrics.<lambda> on lighteval.metrics.metrics failed
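
A minimal sketch of fixes 1 and 2, assuming the BLEURT class shape implied by the tracebacks above (the real model/tokenizer loading is elided and replaced with a dummy scorer):

```python
import torch


class BLEURT:
    """Stand-in for lighteval's BLEURT metric class; the real model and
    tokenizer are replaced with a dummy scorer for illustration."""

    def compute(self, golds: list[str], predictions: list[str], **kwargs) -> torch.Tensor:
        # Fix 1: **kwargs absorbs extra keyword arguments such as
        # `formatted_doc`, which Metric.compute forwards blindly via
        # self.sample_level_fn(**kwargs).
        # Dummy score standing in for:
        # self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze()
        return torch.zeros(len(golds)).squeeze()


# Fix 2: register the bound method of an *instance*; registering the unbound
# BLEURT.compute is what caused "missing 1 required positional argument: 'self'".
sample_level_fn = BLEURT().compute
print(sample_level_fn(golds=["ref"], predictions=["hyp"], formatted_doc=None))
```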
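And a sketch of fixes 3 and 4: a named, module-level aggregation function is picklable (unlike the lambda) and reduces the list of 0-d tensors to a plain float:

```python
import torch


def compute_mean(scores: list[torch.Tensor]) -> float:
    # Stack the per-sample 0-d tensors, detach from the autograd graph, and
    # reduce to a plain Python float that downstream numpy code can handle.
    return torch.stack(scores).detach().mean().item()


# e.g. Metrics.bleurt would use corpus_level_fn=compute_mean instead of
# corpus_level_fn=lambda x: np.mean(x.flatten()).
scores = [torch.tensor(-1.3048, requires_grad=True), torch.tensor(-1.2869, requires_grad=True)]
print(compute_mean(scores))  # -1.29585
```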

After adding a separate, module-level function to avoid the pickling error, there is another torch-related issue, this time in logging. The output from compute probably needs to be converted before it reaches the loggers, so that was the final change (sketched below, after the traceback). However, the compute_mean() function is pretty out of place, and I'm open to suggestions on how best to approach this!

  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/stderr.py", line 39, in _stddev
    mu = np.mean(arr)
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 3504, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/numpy/core/_methods.py", line 102, in _mean
    arr = asanyarray(a)
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/torch/_tensor.py", line 1083, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
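
A minimal sketch of that conversion, using a hypothetical helper that detaches the scores inside BLEURT.compute before they reach the loggers:

```python
import torch


def _to_float_scores(scores: torch.Tensor) -> list[float]:
    # Detach before any numpy conversion; otherwise np.mean(...) in stderr.py
    # raises "Can't call numpy() on Tensor that requires grad".
    # squeeze() produces a 0-d tensor for a single sample, so normalize to 1-d.
    return torch.atleast_1d(scores.detach()).tolist()


single = torch.tensor(-1.3048, requires_grad=True)
print(_to_float_scores(single))  # [-1.3048...]
```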

To Reproduce

  1. Define a custom task very similar to this example, with metric=[Metrics.bleurt], and run it using the custom tasks path (a hypothetical task definition is sketched below):
lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir
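
For reference, a hypothetical custom-task module along these lines (variable and field names follow lighteval's community task examples and may differ across versions; the dataset id and column names are placeholders):

```python
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_fn(line, task_name: str = None):
    # Placeholder mapping from a dataset row to a Doc; "text" and "summary"
    # are assumed column names.
    return Doc(task_name=task_name, query=line["text"], choices=[line["summary"]], gold_index=0)


TASKS_TABLE = [
    LightevalTaskConfig(
        name="my_summarization_task",    # placeholder task name
        prompt_function=prompt_fn,
        suite=["community"],
        hf_repo="<dataset on the hub>",  # placeholder dataset id
        hf_subset="default",
        evaluation_splits=["test"],
        metric=[Metrics.bleurt],         # the metric that triggers the errors above
    )
]
```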

Expected behavior

BLEURT should be computed between the model's summarization output and the gold reference.

Version info

operating system: macOS
lighteval version: 0.5.0.dev0
