Describe the bug
- Add `**kwargs` to allow `formatted_doc` to be passed into metric computation, addressing the following:

  ```
  metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 75, in compute
    return {self.metric_name: self.sample_level_fn(**kwargs)}  # result, formatted_doc,
  TypeError: BLEURT.compute() got an unexpected keyword argument 'formatted_doc'
  ```
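As a rough illustration of this first change (a minimal sketch, not lighteval's actual code), accepting `**kwargs` lets the metric tolerate extra keyword arguments such as `formatted_doc`:

```python
# Hypothetical minimal sketch: with **kwargs in the signature, extra
# keyword arguments such as formatted_doc no longer raise a TypeError.
class BLEURT:
    def compute(self, golds, predictions, **kwargs):
        # kwargs may carry formatted_doc and other metadata that this
        # metric simply ignores.
        return 0.0  # placeholder score

metric = BLEURT()
score = metric.compute(golds=["ref"], predictions=["hyp"], formatted_doc=object())
```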
- Create a `BLEURT()` instance, addressing the following:

  ```
  metrics = compute_metric(results=sample_responses, formatted_doc=doc, metrics=metric_category_metrics)
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/__init__.py", line 111, in apply_generative_metric
    metric.compute(
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/utils.py", line 75, in compute
    return {self.metric_name: self.sample_level_fn(**kwargs)}  # result, formatted_doc,
  TypeError: BLEURT.compute() missing 1 required positional argument: 'self'
  ```
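The `'self'` error comes from registering the method on the class rather than on an instance; a toy demonstration (hypothetical names, not lighteval code):

```python
class BLEURT:
    def compute(self, golds, predictions, **kwargs):
        return 0.0

# Calling compute via the class leaves self unfilled, reproducing
# "missing 1 required positional argument: 'self'":
try:
    BLEURT.compute(golds=["ref"], predictions=["hyp"])
    raised = False
except TypeError:
    raised = True

# Calling it on an instance binds self, so the call succeeds:
score = BLEURT().compute(golds=["ref"], predictions=["hyp"])
```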
- Unable to take the mean of the `compute` output:
  - The BLEURT `compute` function collects its outputs into a list:

    ```python
    scores = self.model(**self.tokenizer(golds, predictions, return_tensors="pt"))[0].squeeze()
    ```

    (code reference). Example input to `corpus_level_fn`:

    ```
    Content of x: [tensor(-1.3048, grad_fn=<SqueezeBackward0>), tensor(-1.2869, grad_fn=<SqueezeBackward0>), tensor(-1.3146, grad_fn=<SqueezeBackward0>)]
    ```

  - Extract values before taking the mean; otherwise:

    ```
    File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/main_accelerate.py", line 85, in main
      pipeline.evaluate()
    File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/pipeline.py", line 241, in evaluate
      self.evaluation_tracker.metrics_logger.aggregate(task_dict=self.task_dict, bootstrap_iters=1000)
    File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/logging/info_loggers.py", line 508, in aggregate
      metric_result = task.aggregation()[metric_name](metric_values)
    File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/metrics.py", line 131, in <lambda>
      corpus_level_fn=lambda x: np.mean(x.flatten()),  # flatten, then average
    AttributeError: 'list' object has no attribute 'flatten'
    ```
- Tried `lambda x: torch.stack(x).mean()`, but encountered a pickle error:

  ```
  _pickle.PicklingError: Can't pickle <function Metrics.<lambda> at 0x17d25a290>: attribute lookup Metrics.<lambda> on lighteval.metrics.metrics failed
  ```

  After adding a separate, named function to avoid pickling the lambda, another torch-related issue appeared, this time in logging. So the output of `compute` probably needs to be converted, and that was the final change. However, the `compute_mean()` function is pretty out of place, and I am open to suggestions on how best to approach this!
  ```
  File "/Users/chuandu/Documents/workspace/lighteval/src/lighteval/metrics/stderr.py", line 39, in _stddev
    mu = np.mean(arr)
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/numpy/core/fromnumeric.py", line 3504, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/numpy/core/_methods.py", line 102, in _mean
    arr = asanyarray(a)
  File "/Users/chuandu/Documents/workspace/legal_llm_evaluation/llm_eval_env/lib/python3.10/site-packages/torch/_tensor.py", line 1083, in __array__
    return self.numpy()
  RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
  ```
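A sketch of the kind of conversion the final change makes (`compute_mean` here is a hypothetical stand-in, not lighteval's actual helper; requires torch): detaching each score and converting it to a plain float sidesteps both the grad-tensor error and the pickling problem, since a module-level function (unlike an inline lambda) is picklable.

```python
import numpy as np
import torch

def compute_mean(scores):
    # Module-level function (picklable, unlike an inline lambda) that
    # detaches each grad-tracking tensor and converts it to a plain
    # Python float before averaging, so np.mean never has to call
    # numpy() on a tensor that requires grad.
    return float(np.mean([s.detach().item() for s in scores]))

# Shaped like the example input to corpus_level_fn shown above:
scores = [torch.tensor(-1.3048, requires_grad=True),
          torch.tensor(-1.2869, requires_grad=True),
          torch.tensor(-1.3146, requires_grad=True)]
mean = compute_mean(scores)
```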
To Reproduce
- Define a custom task very similar to this example with `metric=[Metrics.bleurt]`, and run it using the custom tasks path:

  ```
  lighteval accelerate \
      --model_args="pretrained=<path to model on the hub>" \
      --tasks <task parameters> \
      --output_dir output_dir
  ```
Expected behavior
Compute BLEURT for the model's summarization output against the gold reference.
Version info
operating system: macOS
lighteval version: 0.5.0.dev0