enabling accuracy tests for mbxp gsm8k datasets (#178) #1840
regisss left a comment:
Can you move all these files to a new folder called examples/text-generation/mbxp_evaluation and add a README there explaining quickly how to install and run everything please?
We are very close to the release date.
4cc5eab to 4f5c2fe
fixed
Can't we move all files to the
@regisss I've moved those files. I'm testing the change now and will let you know when I've finished.
@kwisniewski98 I suggest also adding a small
We've identified new issues that are blocking this PR.
I am closing this PR because the run_generation.py script lacks the required functionality. At the same time, I would like to introduce a fix for this issue in another PR, where I implemented this functionality and added a results evaluation guide.
This change was requested to enable accuracy evaluation on more datasets, such as GSM8k, MBXP and OpenOrca. Without my change, running this test required the evaluation script from the vllm-benchmark repository, which is confusing.
The dataset I used is a combination of three datasets (OpenOrca, MBXP, GSM8k).
This accuracy check was done using mixtral-8x7b. Results below:
{'rouge1': 45.5749, 'rouge2': 23.3549, 'rougeL': 30.4524, 'rougeLsum': 42.5538, 'gsm8k': 73.88, 'mbxp': 60.04, 'gen_len': 4298469, 'gen_num': 15000, 'gen_tok_len': 2805167, 'tokens_per_sample': 187.0, 'performance': 0, 'accuracy': 99.8}
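As a quick sanity check on the numbers above, here is a minimal sketch that recomputes `tokens_per_sample` from the other fields. It assumes (this is not confirmed anywhere in the thread) that `tokens_per_sample` is simply `gen_tok_len / gen_num` rounded to one decimal place. The `results` dict is copied from the output above, and the name `derived_tokens_per_sample` is only illustrative.

```python
# Results dict copied verbatim from the evaluation output above.
results = {
    'rouge1': 45.5749, 'rouge2': 23.3549, 'rougeL': 30.4524,
    'rougeLsum': 42.5538, 'gsm8k': 73.88, 'mbxp': 60.04,
    'gen_len': 4298469, 'gen_num': 15000, 'gen_tok_len': 2805167,
    'tokens_per_sample': 187.0, 'performance': 0, 'accuracy': 99.8,
}

# Assumption: tokens_per_sample = generated tokens / number of samples,
# rounded to one decimal place. The thread does not state this formula.
derived_tokens_per_sample = round(
    results['gen_tok_len'] / results['gen_num'], 1
)

print(derived_tokens_per_sample, results['tokens_per_sample'])
```

If the two printed values agree, the reported `tokens_per_sample` is consistent with `gen_tok_len` and `gen_num` under that assumption.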