enabling accuracy tests for mbxp gsm8k datasets (#178) #1840

Closed
rbogdano wants to merge 2 commits into huggingface:main from HabanaAI:auto-pr-e9d1b70

@rbogdano
Contributor

This change was requested to enable accuracy evaluation on more datasets, such as GSM8k, MBXP, and OpenOrca. Without this change, running this test required the evaluation script from the vllm-benchmark repository, which is confusing.
The dataset I used is a combination of three datasets (OpenOrca, MBXP, GSM8k).

This accuracy check was done using mixtral-8x7b. Results below:

{'rouge1': 45.5749, 'rouge2': 23.3549, 'rougeL': 30.4524, 'rougeLsum': 42.5538, 'gsm8k': 73.88, 'mbxp': 60.04, 'gen_len': 4298469, 'gen_num': 15000, 'gen_tok_len': 2805167, 'tokens_per_sample': 187.0, 'performance': 0, 'accuracy': 99.8}
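
As a quick sanity check on the numbers above, here is a minimal sketch that recomputes `tokens_per_sample` from `gen_tok_len` and `gen_num`. It assumes the metric is simply the total generated token count divided by the number of samples, rounded to one decimal; that is consistent with the posted values but is not confirmed by the evaluation script itself. The helper name `mean_tokens_per_sample` is hypothetical.

```python
# Subset of the metrics posted above, copied verbatim.
results = {"gen_num": 15000, "gen_tok_len": 2805167, "tokens_per_sample": 187.0}


def mean_tokens_per_sample(gen_tok_len: int, gen_num: int) -> float:
    """Average generated tokens per sample, rounded to one decimal.

    Assumption: this mirrors how 'tokens_per_sample' is derived; the
    actual evaluation script may compute it differently.
    """
    if gen_num <= 0:
        raise ValueError("gen_num must be positive")
    return round(gen_tok_len / gen_num, 1)


derived = mean_tokens_per_sample(results["gen_tok_len"], results["gen_num"])
# Compare the recomputed value with the reported one.
print(derived, derived == results["tokens_per_sample"])
```

If the two values disagree on a future run, that may point to a mismatch between the sample count and the generated outputs rather than to an accuracy problem.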

@libinta added the run-test (Run CI for PRs from external contributors) label on Mar 26, 2025
Collaborator

@regisss left a comment

Can you move all these files to a new folder called examples/text-generation/mbxp_evaluation and add a README there explaining quickly how to install and run everything please?

@karol-brejna-i
Collaborator

We are very close to the release date.
@rbogdano Is there a chance to introduce required changes in the next release (1.18 with Synapse 1.21) or do we need to postpone it?
It looks like it would be good to have this PR merged...

@rbogdano
Contributor Author

> Can you move all these files to a new folder called examples/text-generation/mbxp_evaluation and add a README there explaining quickly how to install and run everything please?

fixed

@regisss
Collaborator

regisss commented May 1, 2025

Can't we move all files to the mbxp_evaluation folder?

@kwisniewski98
Contributor

@regisss I've moved those files and am now testing the change. I'll let you know when the testing is finished.

@regisss
Collaborator

regisss commented May 14, 2025

@kwisniewski98 I suggest also adding a small README.md file in examples/text-generation/mbxp_evaluation that explains how to use this code

@rbogdano
Contributor Author

We've identified new issues that are blocking this PR.
Currently, we're not generating the JSON with the responses required by evaluation.py. As a result, this PR is incomplete.
The next step would be to implement response generation in run_generation.py, but that would take additional time, as would testing it properly.
Therefore, I suggest moving this change to the next release.

@rbogdano
Contributor Author

I am closing this PR because the run_generation.py script lacks the required functionality.

At the same time, I would like to introduce a fix for this issue in another PR, where I implemented this functionality and added a results evaluation guide.
#1986

@rbogdano closed this on May 21, 2025

Labels

run-test (Run CI for PRs from external contributors), synapse 1.21

6 participants