enabling accuracy tests for mbxp gsm8k datasets (#178) by rbogdano · Pull Request #1840 · huggingface/optimum-habana

rbogdano · 2025-03-11T11:27:21Z

This change was requested to enable accuracy evaluation on more datasets such as GSM8K, MBXP and OpenOrca. Without this change, I had to run this test with the evaluation script from the vllm-benchmark repository, which is confusing.
The dataset I used is a combination of three datasets (OpenOrca, MBXP, GSM8K).

This accuracy check was done using mixtral-8x7b. Results below:

{'rouge1': 45.5749, 'rouge2': 23.3549, 'rougeL': 30.4524, 'rougeLsum': 42.5538, 'gsm8k': 73.88, 'mbxp': 60.04, 'gen_len': 4298469, 'gen_num': 15000, 'gen_tok_len': 2805167, 'tokens_per_sample': 187.0, 'performance': 0, 'accuracy': 99.8}
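As a quick sanity check on the numbers above, the sketch below parses the results dictionary and recomputes the per-sample token count. It assumes (this is not stated in the PR) that `tokens_per_sample` is simply `gen_tok_len / gen_num` rounded to one decimal place. The helper name `parse_results` is made up for illustration.

```python
import ast

# Results string copied verbatim from the comment above.
RESULTS_TEXT = (
    "{'rouge1': 45.5749, 'rouge2': 23.3549, 'rougeL': 30.4524, "
    "'rougeLsum': 42.5538, 'gsm8k': 73.88, 'mbxp': 60.04, "
    "'gen_len': 4298469, 'gen_num': 15000, 'gen_tok_len': 2805167, "
    "'tokens_per_sample': 187.0, 'performance': 0, 'accuracy': 99.8}"
)


def parse_results(text):
    """Parse a Python-literal dict string into a real dict (hypothetical helper)."""
    return ast.literal_eval(text)


results = parse_results(RESULTS_TEXT)

# Assumption: tokens_per_sample = generated tokens / number of samples.
recomputed = round(results["gen_tok_len"] / results["gen_num"], 1)

print("reported  :", results["tokens_per_sample"])
print("recomputed:", recomputed)
```

Under that assumption the recomputed value agrees with the reported one, which suggests the token statistics are internally consistent.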

regisss

Can you move all these files to a new folder called examples/text-generation/mbxp_evaluation and add a README there explaining quickly how to install and run everything please?

karol-brejna-i · 2025-04-29T11:59:36Z

We are very close to the release date.
@rbogdano Is there a chance to introduce required changes in the next release (1.18 with Synapse 1.21) or do we need to postpone it?
It looks like it would be good to have this PR merged...

rbogdano · 2025-04-30T17:20:08Z

> Can you move all these files to a new folder called examples/text-generation/mbxp_evaluation and add a README there explaining quickly how to install and run everything please?

Fixed.

regisss · 2025-05-01T12:51:46Z

Can't we move all files to the mbxp_evaluation folder?

kwisniewski98 · 2025-05-14T10:59:26Z

@regisss I've moved those files and am now testing the change. I'll let you know when I finish.

regisss · 2025-05-14T13:00:54Z

@kwisniewski98 I suggest also adding a small README.md file in examples/text-generation/mbxp_evaluation that explains how to use this code.

rbogdano · 2025-05-15T13:31:31Z

We've identified new issues that are blocking this PR.
Currently, we're not generating the JSON file with the responses that evaluation.py requires. As a result, this PR is incomplete.
The next step would be to implement response generation in run_generation.py, but implementing and properly testing it would take additional time and effort.
Therefore, I suggest moving this change to the next release.
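For illustration only, the missing step could look roughly like the sketch below: collect each prompt with its generated response and dump the records to a JSON file for a later evaluation pass. The schema that evaluation.py actually expects is not shown in this thread, so the field names (`id`, `prompt`, `response`) and the helpers `build_records` and `save_responses` are hypothetical, not the API of run_generation.py.

```python
import json
from pathlib import Path


def build_records(prompts, responses):
    """Pair each prompt with its response (hypothetical schema)."""
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    return [
        {"id": i, "prompt": p, "response": r}
        for i, (p, r) in enumerate(zip(prompts, responses))
    ]


def save_responses(prompts, responses, path):
    """Write the records to `path` as UTF-8 JSON and return them."""
    records = build_records(prompts, responses)
    Path(path).write_text(
        json.dumps(records, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return records


# Toy data standing in for real model output.
example = build_records(
    ["What is 2 + 3?", "Write a function that adds two numbers."],
    ["5", "def add(a, b):\n    return a + b"],
)
print(json.dumps(example, indent=2))
```

Keeping the record building separate from the file write makes the format easy to adjust once the real schema is known.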

rbogdano · 2025-05-21T07:44:31Z

I am closing this PR because the run_generation.py script lacks the required functionality.

At the same time, I would like to introduce a fix for this issue in another PR, where I implemented this functionality and added a results evaluation guide:
#1986

rbogdano requested a review from regisss as a code owner March 11, 2025 11:27

libinta added the synapse 1.21 label Mar 19, 2025

mounikamandava approved these changes Mar 19, 2025

libinta added the run-test label (Run CI for PRs from external contributors) Mar 26, 2025

regisss reviewed Apr 17, 2025

enabling accuracy tests for mbxp gsm8k datasets (#178)

4f5c2fe

rbogdano force-pushed the auto-pr-e9d1b70 branch from 4cc5eab to 4f5c2fe April 30, 2025 17:16

Move eval files to mbxp_evaluation directory

a2d9169

rbogdano closed this May 21, 2025

rbogdano wants to merge 2 commits into huggingface:main from HabanaAI:auto-pr-e9d1b70