Audiobench and LibriSpeech-PC Benchmarks Evaluation #1060
Changes from all commits: 73a757e, 9225fab, de4914a, 0997361, e2a876b, 0a543be, ca3ffec, 3d963f6, 189a47f, 3dd21ff, 35f3666, fccb644, 0c10924, 7694a6b, 59b4f1d, fd9838b, 907b9fb, e059482, c885b73, 2807505, 7abea9e, 74e21f9, a908156, 245743b, 5c289f0, 9cd886a
```diff
@@ -82,13 +82,11 @@ There are a few parameters specific to SWE-bench. They have to be specified with
 - **++eval_harness_repo:** URL of the repository to use for the evaluation harness. This is passed directly as an argument to `git clone`. Defaults to [`https://github.com/Kipok/SWE-bench.git`](https://github.com/Kipok/SWE-bench), our fork of SWE-bench that supports local evaluation.
-- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning eval_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
 - **++setup_timeout:** The timeout for downloading & installing the agent framework and the evaluation harness, in seconds. Defaults to 1200, i.e. 20 minutes.
+- **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning agent_harness_repo. Defaults to `HEAD`, i.e. the latest commit.
 - **++swebench_tests_timeout:** The timeout for tests after applying the generated patch during evaluation, in seconds. Defaults to 1800, i.e. 30 minutes.
-- **++max_retries:** How many times to try running setup, inference and evaluation until a valid output file is produced. Defaults to 3.
+- **++max_retries:** How many times to try running inference and evaluation until a valid output file is produced. Defaults to 3.
 - **++min_retry_interval, ++max_retry_interval:** The interval between retries, in seconds. Selected randomly between min and max on each retry. Defaults to 60 and 180 respectively.
```

**Contributor** commented: Fix the parameter name in the `++eval_harness_commit` description — the new line references `agent_harness_repo`, but the commit is checked out after cloning `eval_harness_repo`. Suggested wording:

> - **++eval_harness_commit:** The commit hash, branch or tag to checkout after cloning `eval_harness_repo`. Defaults to `HEAD`, i.e. the latest commit.

Keeping the docs tightly aligned with the actual flag names helps avoid user confusion. Also applies to line 89.
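The retry parameters above (`++max_retries`, `++min_retry_interval`, `++max_retry_interval`) describe a simple randomized-backoff loop. The following is a minimal sketch of that behavior — the function name, signature, and injectable `sleep` hook are illustrative, not the actual NeMo-Skills implementation:

```python
import random
import time


def run_with_retries(step, max_retries=3, min_retry_interval=60, max_retry_interval=180, sleep=time.sleep):
    """Run `step` until it reports success, pausing a random interval between attempts.

    Mirrors the documented semantics: up to `max_retries` attempts, with the
    pause between attempts drawn uniformly from [min, max] seconds.
    """
    for attempt in range(1, max_retries + 1):
        # `step` returns True once a valid output file is produced
        if step(attempt):
            return attempt
        if attempt < max_retries:
            sleep(random.uniform(min_retry_interval, max_retry_interval))
    raise RuntimeError(f"no valid output after {max_retries} attempts")
```

Injecting `sleep` keeps the loop testable; in production the default `time.sleep` applies the randomized interval between attempts.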
New file, `@@ -0,0 +1,36 @@`:

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench: A comprehensive benchmark for speech and audio language models.

AudioBench evaluates models across multiple tasks:
- ASR (Automatic Speech Recognition)
- Translation (speech-to-text translation)
- Speech QA (question answering based on audio)
- Audio understanding (emotion, gender, accent recognition, etc.)

The benchmark is organized into two main categories:
- nonjudge: Tasks evaluated with automatic metrics (WER, BLEU)
- judge: Tasks requiring LLM-as-a-judge evaluation
"""

DATASET_GROUP = "speechlm"
IS_BENCHMARK_GROUP = True
SCORE_MODULE = "nemo_skills.evaluation.metrics.speechlm_metrics"

# Top-level benchmarks: evaluate all judge or all nonjudge datasets
BENCHMARKS = {
    "audiobench.nonjudge": {},
    "audiobench.judge": {},
}
```
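Since `IS_BENCHMARK_GROUP = True` marks this module as a group rather than a single benchmark, a loader can fan it out into its sub-benchmarks. This is a hypothetical sketch of that expansion — `expand_benchmark_group` and the `SimpleNamespace` stand-in are illustrative, not the actual NeMo-Skills loader:

```python
from types import SimpleNamespace

# Stand-in for importing the real dataset module; the attribute names match
# the constants defined in the __init__.py above.
audiobench = SimpleNamespace(
    DATASET_GROUP="speechlm",
    IS_BENCHMARK_GROUP=True,
    BENCHMARKS={"audiobench.nonjudge": {}, "audiobench.judge": {}},
)


def expand_benchmark_group(module):
    """Return the sub-benchmarks if the module marks itself as a group."""
    if getattr(module, "IS_BENCHMARK_GROUP", False):
        return sorted(module.BENCHMARKS)
    return [module]
```

Running `expand_benchmark_group(audiobench)` yields `['audiobench.judge', 'audiobench.nonjudge']`, so requesting the group evaluates both categories.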
New file, `@@ -0,0 +1,39 @@`:

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench judge tasks dataset configuration.

This dataset includes tasks that require LLM-based evaluation such as:
- Audio captioning
- Spoken question answering
- Audio understanding and reasoning

These tasks require an LLM judge for evaluation, matching the MMAU-Pro evaluation setup.
"""

# Dataset configuration - CRITICAL: needed for audio to work
DATASET_GROUP = "speechlm"
METRICS_TYPE = "speechlm"
DEFAULT_SPLIT = "test"
GENERATION_ARGS = "++prompt_format=openai "

# Judge configuration matching AudioBench official implementation
# Using Llama-3.1-70B with vllm (can be overridden in run scripts)
JUDGE_PIPELINE_ARGS = {
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "server_type": "vllm",
    "server_gpus": 8,
    "server_args": "--max-model-len 8192 --gpu-memory-utilization 0.95",
}
JUDGE_ARGS = "++prompt_config=judge/audiobench ++generation_key=judgement"
```
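To see how a dict like `JUDGE_PIPELINE_ARGS` maps onto the `++key=value` override style used elsewhere in these configs, here is a small illustrative helper. The function `pipeline_args_to_cli` is hypothetical — NeMo-Skills consumes the dict directly; this only shows the equivalent flag form:

```python
def pipeline_args_to_cli(args: dict) -> str:
    """Hypothetical helper: render pipeline args as ++key=value overrides."""
    parts = []
    for key, value in args.items():
        # Quote values containing spaces so they survive shell splitting
        rendered = f'"{value}"' if " " in str(value) else str(value)
        parts.append(f"++{key}={rendered}")
    return " ".join(parts)
```

For the judge config above this produces, among others, `++server_gpus=8` and `++server_args="--max-model-len 8192 --gpu-memory-utilization 0.95"`.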
New file, `@@ -0,0 +1,31 @@`:

```python
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""AudioBench non-judge tasks dataset configuration.

This dataset includes ASR, translation, and other tasks that use
automatic metrics (WER, BLEU, WER-PC) instead of judge evaluation.

NO JUDGE REQUIRED - Metrics computed automatically from model outputs.
"""

# Dataset configuration - CRITICAL: needed for audio to work
DATASET_GROUP = "speechlm"
METRICS_TYPE = "speechlm"

# Evaluation settings
EVAL_ARGS = "++eval_type=audiobench "

# Generation settings - OpenAI format for audio-language models
GENERATION_ARGS = "++prompt_format=openai "
```
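The non-judge tasks score with automatic metrics such as WER. As a reference for what that metric computes, here is a minimal word error rate via word-level edit distance — an illustration only, not the actual scoring code; real AudioBench scoring also applies text normalization (and WER-PC keeps punctuation and casing, per LibriSpeech-PC):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Minimal WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, one substitution plus one deletion against a four-word reference gives a WER of 0.5; an exact match gives 0.0.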