
[BUG] Errors when using BLEURT metric #315

@chuandudx

Description

Describe the bug

  1. Added **kwargs to allow formatted_doc to be passed into the metric computation, to address the following (see the first sketch after this list):
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 75, in compute
    return {self.metric_name: self.sample_level_fn(**kwargs)}  # result, formatted_doc,
TypeError: BLEURT.compute() got an unexpected keyword argument 'formatted_doc'
  2. Created a BLEURT() instance (rather than registering the unbound class method) to address the following:
    metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 75, in compute
    return {self.metric_name: self.sample_level_fn(**kwargs)}  # result, formatted_doc,
TypeError: BLEURT.compute() missing 1 required positional argument: 'self'
  3. Unable to take the mean of the compute output:
  • the BLEURT compute function collects per-sample outputs into a list: scores = self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze() (code reference)
  • example input to corpus_level_fn: [tensor(-1.3048, grad_fn=<SqueezeBackward0>), tensor(-1.2869, grad_fn=<SqueezeBackward0>), tensor(-1.3146, grad_fn=<SqueezeBackward0>)]
  • the tensor values therefore need to be extracted before taking the mean (see the compute_mean() sketch after this list).
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/main_accelerate.py", line 85, in main
    pipeline.evaluate()
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 241, in evaluate
    self.evaluation_tracker.metrics_logger.aggregate(task_dict=self.task_dict, bootstrap_iters=1000)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/logging/info_loggers.py", line 508, in aggregate
    metric_result = task.aggregation()[metric_name](metric_values)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/metrics.py", line 131, in <lambda>
    corpus_level_fn=lambda x: np.mean(x.flatten()),  # flatten, then average
AttributeError: 'list' object has no attribute 'flatten'
  4. Tried lambda x: torch.stack(x).mean(), but encountered a pickle error:
_pickle.PicklingError: Can't pickle <function Metrics.<lambda> at 0x17d25a290>: attribute lookup Metrics.<lambda> on lighteval.metrics.metrics failed
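
A minimal sketch of fixes 1 and 2, assuming the BLEURT class shape implied by the tracebacks above (the real model/tokenizer loading is elided and replaced with a dummy scorer):

```python
import torch


class BLEURT:
    """Stand-in for lighteval's BLEURT metric class; the real model and
    tokenizer are replaced with a dummy scorer for illustration."""

    def compute(self, golds: list[str], predictions: list[str], **kwargs) -> torch.Tensor:
        # Fix 1: **kwargs absorbs extra keyword arguments such as
        # `formatted_doc`, which Metric.compute forwards blindly via
        # self.sample_level_fn(**kwargs).
        # Dummy score standing in for:
        # self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze()
        return torch.zeros(len(golds)).squeeze()


# Fix 2: register the bound method of an *instance*; registering the unbound
# BLEURT.compute is what caused "missing 1 required positional argument: 'self'".
sample_level_fn = BLEURT().compute
print(sample_level_fn(golds=["ref"], predictions=["hyp"], formatted_doc=None))
```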
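And a sketch of fixes 3 and 4: a named, module-level aggregation function is picklable (unlike the lambda) and reduces the list of 0-d tensors to a plain float:

```python
import torch


def compute_mean(scores: list[torch.Tensor]) -> float:
    # Stack the per-sample 0-d tensors, detach from the autograd graph, and
    # reduce to a plain Python float that downstream numpy code can handle.
    return torch.stack(scores).detach().mean().item()


# e.g. Metrics.bleurt would use corpus_level_fn=compute_mean instead of
# corpus_level_fn=lambda x: np.mean(x.flatten()).
scores = [torch.tensor(-1.3048, requires_grad=True), torch.tensor(-1.2869, requires_grad=True)]
print(compute_mean(scores))  # -1.29585
```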

After adding a separate, module-level function to avoid the pickling error, there is another torch-related issue, this time in logging. The output from compute probably needs to be converted before it reaches the loggers, so that was the final change (sketched below, after the traceback). However, the compute_mean() function is pretty out of place, and I'm open to suggestions on how best to approach this!

  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/stderr.py", line 39, in _stddev
    mu = np.mean(arr)
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 3504, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/numpy/core/_methods.py", line 102, in _mean
    arr = asanyarray(a)
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/torch/_tensor.py", line 1083, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
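
A minimal sketch of that conversion, using a hypothetical helper that detaches the scores inside BLEURT.compute before they reach the loggers:

```python
import torch


def _to_float_scores(scores: torch.Tensor) -> list[float]:
    # Detach before any numpy conversion; otherwise np.mean(...) in stderr.py
    # raises "Can't call numpy() on Tensor that requires grad".
    # squeeze() produces a 0-d tensor for a single sample, so normalize to 1-d.
    return torch.atleast_1d(scores.detach()).tolist()


single = torch.tensor(-1.3048, requires_grad=True)
print(_to_float_scores(single))  # [-1.3048...]
```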

To Reproduce

  1. Define a custom task very similar to this example, with metric=[Metrics.bleurt], and run it using the custom tasks path (a hypothetical task definition is sketched below):
lighteval accelerate \
    --model_args="pretrained=<path to model on the hub>" \
    --tasks <task parameters> \
    --output_dir output_dir
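
For reference, a hypothetical custom-task module along these lines (variable and field names follow lighteval's community task examples and may differ across versions; the dataset id and column names are placeholders):

```python
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_fn(line, task_name: str = None):
    # Placeholder mapping from a dataset row to a Doc; "text" and "summary"
    # are assumed column names.
    return Doc(task_name=task_name, query=line["text"], choices=[line["summary"]], gold_index=0)


TASKS_TABLE = [
    LightevalTaskConfig(
        name="my_summarization_task",    # placeholder task name
        prompt_function=prompt_fn,
        suite=["community"],
        hf_repo="<dataset on the hub>",  # placeholder dataset id
        hf_subset="default",
        evaluation_splits=["test"],
        metric=[Metrics.bleurt],         # the metric that triggers the errors above
    )
]
```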

Expected behavior

BLEURT should be computed between the model's summarization output and the gold reference.

Version info

operating system: macOS
lighteval version: 0.5.0.dev0
