Merged
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ Here are some of the features we support:
- [**Instruction following**](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following): e.g. [ifbench](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifbench), [ifeval](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifeval)
- [**Long-context**](https://nvidia.github.io/NeMo-Skills/evaluation/long-context): e.g. [ruler](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#aalcr)
- [**Tool-calling**](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling/#bfcl_v3)
- [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox)
- [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox), [flores-200](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#flores-200), [wmt24pp](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#wmt24pp)
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts, or change the benchmark configuration in any other way.
- [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).

4 changes: 2 additions & 2 deletions docs/evaluation/index.md
@@ -9,7 +9,7 @@ We support many popular benchmarks and it's easy to add new ones in the future. The f
- [**Instruction following**](./instruction-following.md): e.g. [ifbench](./instruction-following.md#ifbench), [ifeval](./instruction-following.md#ifeval)
- [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
- [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp)

See [nemo_skills/dataset](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

@@ -246,4 +246,4 @@ To create a new benchmark, follow this process:
prompt config in `GENERATION_ARGS` and evaluation / metric parameters. But if extra customization is needed for the generation, you can provide
a fully custom generation module. See [scicode](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
4. Create a new [evaluation class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if you cannot re-use an existing one).
5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) (if you cannot re-use an existing one).
2 changes: 1 addition & 1 deletion docs/evaluation/long-context.md
@@ -49,4 +49,4 @@ ns eval \
The results, including per-category scores, are stored in `metrics.json`. Detailed breakdowns by category and sequence length are also available via
```
ns summarize_results --cluster=<cluster_config> <folder_of_output_json>
```
152 changes: 149 additions & 3 deletions docs/evaluation/multilingual.md
@@ -1,6 +1,6 @@
# Multilingual

Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation (to be added).
Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation.

All benchmarks in this category accept an extra `--language` argument in their associated `ns prepare` command, which lets you choose which language(s) of the benchmark to run.
Once prepared, the `ns eval` command will run on all languages prepared, and the summarized results generated with `ns eval` will include per-language breakdowns.
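Putting that together, a prepare-then-evaluate flow might look like the following sketch (the exact flag spelling is an assumption based on the description above; check the benchmark's `ns prepare_data` help for the precise syntax):

```bash
# Hypothetical: prepare flores200 for two languages only, then evaluate.
# ns eval will report per-language breakdowns for whatever was prepared.
ns prepare_data flores200 --language de ja --split devtest
ns eval --cluster=[cluster] --benchmarks flores200 --split=devtest ...
```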
@@ -9,7 +9,7 @@ Once prepared, the `ns eval` command will run on all languages prepared, and the

### mmlu-prox

- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
- Benchmark is defined in [`nemo_skills/dataset/mmlu-prox/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/li-lab/MMLU-ProX).

Our evaluation template and answer-extraction mechanism try to match the configuration in [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu_prox).
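As a rough illustration, answer extraction for a multiple-choice benchmark like this usually boils down to matching the final answer letter in the model's response; here is a minimal sketch (the `answer is (X)` pattern is an assumption for illustration, not the exact template used by the harness):

```python
import re

# Hypothetical extractor: pull the last multiple-choice letter (A-J,
# matching MMLU-ProX's ten options) from a model response.
def extract_choice(response: str):
    matches = re.findall(r"answer is \(?([A-J])\)?", response)
    return matches[-1] if matches else None  # keep the last occurrence

print(extract_choice("Let me think... the answer is (C)."))  # -> C
```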
Expand Down Expand Up @@ -68,4 +68,150 @@ Some reference numbers for reference and commands for reproduction:
++inference.temperature=0.6 \
++inference.top_k=20 \
++inference.tokens_to_generate=38912
```

### FLORES-200

- Benchmark is defined in [`nemo_skills/dataset/flores200/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/flores200/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openlanguagedata/flores_plus).

Some reference numbers for the devtest split (xx corresponds to the average over 5 languages: de, es, fr, it, ja):

| Model | en->xx | xx->en | xx->xx |
|:-----------------------|------:|------:|------:|
| Nemotron-NanoV2-9B-v2 | 32.5 | 34.0 | 25.9 |
| Qwen3-8B | 31.5 | 34.6 | 25.7 |
| Qwen3-30B-A3B | 33.3 | 35.5 | 27.1 |
| gpt-oss-20B | 32.4 | 34.1 | 25.0 |

=== "Nemotron-NanoV2-9B-v2"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-Nano-9B-v2 \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

=== "Qwen3-8B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-8B \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "Qwen3-30B-A3B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-30B-A3B \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "gpt-oss-20B"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-20b \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=2048
```

### wmt24pp

- Benchmark is defined in [`nemo_skills/dataset/wmt24pp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/wmt24pp/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/google/wmt24pp).

Some reference numbers for the test split (xx corresponds to the average over 5 languages: de, es, fr, it, ja):

| Model | en->de | en->es | en->fr | en->it | en->ja | en->xx |
|:-----------------------|------:|------:|------:|------:|------:|------:|
| Nemotron-NanoV2-9B-v2 | 25.3 | 37.7 | 33.4 | 33.8 | 20.9 | 30.2 |
| Qwen3-8B | 26.2 | 38.5 | 33.1 | 33.1 | 21.7 | 30.5 |
| Qwen3-30B-A3B | 28.5 | 40.0 | 35.1 | 36.0 | 23.2 | 32.5 |
| gpt-oss-20B | 27.3 | 42.3 | 32.8 | 34.9 | 25.2 | 32.5 |

=== "Nemotron-NanoV2-9B-v2"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-Nano-9B-v2 \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

=== "Qwen3-8B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-8B \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "Qwen3-30B-A3B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-30B-A3B \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "gpt-oss-20B"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-20b \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=2048
```
3 changes: 2 additions & 1 deletion docs/index.md
Expand Up @@ -21,7 +21,8 @@ Here are some of the features we support:
- [**Instruction following**](./evaluation/instruction-following.md): e.g. [ifbench](./evaluation/instruction-following.md#ifbench), [ifeval](./evaluation/instruction-following.md#ifeval)
- [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
- [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
- [**Robustness Evaluation**](./evaluation/robustness.md): Evaluate model sensitvity against changes in prompt.
- [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#flores-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
- [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity to changes in the prompt.
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts, or change the benchmark configuration in any other way.
- [Model training](pipelines/training.md): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).

22 changes: 22 additions & 0 deletions nemo_skills/dataset/flores200/__init__.py
@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "multilingual/segment-translation"
DATASET_GROUP = "chat"
METRICS_TYPE = "translation"
EVAL_ARGS = "++eval_type=no-op"
GENERATION_ARGS = ""
73 changes: 73 additions & 0 deletions nemo_skills/dataset/flores200/prepare.py
@@ -0,0 +1,73 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from datasets import load_dataset
from langcodes import Language


def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
with open(output_file, "wt", encoding="utf-8") as fout:
for src_lang in src_languages:
for tgt_lang in tgt_languages:
if src_lang != tgt_lang:
for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
json_dict = {
"text": src,
"translation": tgt,
"source_language": src_lang,
"target_language": tgt_lang,
"source_lang_name": Language(src_lang).display_name(),
"target_lang_name": Language(tgt_lang).display_name(),
}
json.dump(json_dict, fout)
fout.write("\n")


def main(args):
all_languages = list(set(args.source_languages).union(set(args.target_languages)))

datasets = {}
for lang in all_languages:
iso_639_3 = Language(lang).to_alpha3()
iso_15924 = Language(lang).maximize().script
lang_code = f"{iso_639_3}_{iso_15924}"
datasets[lang] = load_dataset("openlanguagedata/flores_plus", lang_code, split=args.split)["text"]

data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, src_languages=args.source_languages, tgt_languages=args.target_languages)


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--split", default="dev", choices=("dev", "devtest"), help="Dataset split to process.")
parser.add_argument(
"--source_languages",
default=["en", "de", "es", "fr", "it", "ja"],
nargs="+",
help="Languages to translate from.",
)
parser.add_argument(
"--target_languages",
default=["en", "de", "es", "fr", "it", "ja"],
nargs="+",
help="Languages to translate to.",
)
args = parser.parse_args()
main(args)
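For reference, each line this script writes is a self-contained translation pair. A sketch of one such record follows (the sentences are made-up placeholders; the `*_lang_name` fields come from `langcodes` in the real script):

```python
import json

# Illustrative record in the format flores200/prepare.py emits,
# one JSON object per line of the output .jsonl file.
record = {
    "text": "Hello, world!",        # source-language sentence
    "translation": "Hallo, Welt!",  # reference in the target language
    "source_language": "en",
    "target_language": "de",
    "source_lang_name": "English",
    "target_lang_name": "German",
}
print(json.dumps(record))
```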
22 changes: 22 additions & 0 deletions nemo_skills/dataset/wmt24pp/__init__.py
@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "multilingual/segment-translation"
DATASET_GROUP = "chat"
METRICS_TYPE = "translation"
EVAL_ARGS = "++eval_type=no-op"
GENERATION_ARGS = ""
60 changes: 60 additions & 0 deletions nemo_skills/dataset/wmt24pp/prepare.py
@@ -0,0 +1,60 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from datasets import load_dataset
from langcodes import Language


def write_data_to_file(output_file, datasets, tgt_languages):
with open(output_file, "wt", encoding="utf-8") as fout:
for tgt_lang in tgt_languages:
for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
json_dict = {
"text": src,
"translation": tgt,
"source_language": "en",
"target_language": tgt_lang,
"source_lang_name": "English",
"target_lang_name": Language(tgt_lang[:2]).display_name(),
}
json.dump(json_dict, fout)
fout.write("\n")


def main(args):
datasets = {}
for lang in args.target_languages:
datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]

data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--split", default="test", choices=("test",), help="Dataset split to process.")
parser.add_argument(
"--target_languages",
default=["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"],
nargs="+",
help="Languages to translate to.",
)
args = parser.parse_args()
main(args)
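Note that wmt24pp configs are keyed by locale (e.g. `en-de_DE`), while the output records keep only the base target-language name. A small sketch of that mapping (display names are hard-coded here for illustration instead of calling `langcodes`):

```python
# Sketch: derive the base language code from a wmt24pp locale, as
# prepare.py does with tgt_lang[:2] before resolving the display name.
NAMES = {"de": "German", "es": "Spanish", "fr": "French",
         "it": "Italian", "ja": "Japanese"}  # assumed subset

for locale in ["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"]:
    base = locale[:2]  # strip the region part of the locale
    print(f"en-{locale} -> target_lang_name={NAMES[base]}")
```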