Merged
2 changes: 1 addition & 1 deletion README.md
@@ -17,7 +17,7 @@ Here are some of the features we support:
- [**Instruction following**](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following): e.g. [ifbench](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifbench), [ifeval](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifeval)
- [**Long-context**](https://nvidia.github.io/NeMo-Skills/evaluation/long-context): e.g. [ruler](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#aalcr)
- [**Tool-calling**](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling/#bfcl_v3)
- [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox)
- [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox), [flores-200](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#flores-200), [wmt24pp](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#wmt24pp)
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts, or change the benchmark configuration in any other way.
- [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).

4 changes: 2 additions & 2 deletions docs/evaluation/index.md
@@ -9,7 +9,7 @@ We support many popular benchmarks and it's easy to add new ones in the future. The f
- [**Instruction following**](./instruction-following.md): e.g. [ifbench](./instruction-following.md#ifbench), [ifeval](./instruction-following.md#ifeval)
- [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
- [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox)
- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#flores-200), [wmt24pp](./multilingual.md#wmt24pp)

See [nemo_skills/dataset](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.

@@ -246,4 +246,4 @@ To create a new benchmark, follow this process:
prompt config in `GENERATION_ARGS` and evaluation / metric parameters. But if extra customization is needed for the generation, you can provide
a fully custom generation module. See [scicode](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
4. Create a new [evaluation class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if you cannot re-use an existing one).
5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) (if you cannot re-use an existing one).
2 changes: 1 addition & 1 deletion docs/evaluation/long-context.md
@@ -49,4 +49,4 @@ ns eval \
The results, including per-category scores, are stored in `metrics.json`. Detailed breakdowns by category and sequence length are also available via
```
ns summarize_results --cluster=<cluster_config> <folder_of_output_json>
```
152 changes: 149 additions & 3 deletions docs/evaluation/multilingual.md
@@ -1,6 +1,6 @@
# Multilingual

Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation (to be added).
Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation.

All benchmarks in this category accept an extra `--language` argument in their associated `ns prepare` command, which lets you choose which language(s) of the benchmark to run.
Once prepared, the `ns eval` command will run on all languages prepared, and the summarized results generated with `ns eval` will include per-language breakdowns.
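Putting that together, a prepare-then-evaluate flow might look like the following sketch (the exact flag spelling is an assumption based on the description above; check the benchmark's `ns prepare_data` help for the precise syntax):

```bash
# Hypothetical: prepare flores200 for two languages only, then evaluate.
# ns eval will report per-language breakdowns for whatever was prepared.
ns prepare_data flores200 --language de ja --split devtest
ns eval --cluster=[cluster] --benchmarks flores200 --split=devtest ...
```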
@@ -9,7 +9,7 @@ Once prepared, the `ns eval` command will run on all languages prepared, and the

### mmlu-prox

- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
- Benchmark is defined in [`nemo_skills/dataset/mmlu-prox/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/li-lab/MMLU-ProX).

Our evaluation template and answer-extraction mechanism try to match the configuration in [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu_prox).
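As a rough illustration, answer extraction for a multiple-choice benchmark like this usually boils down to matching the final answer letter in the model's response; here is a minimal sketch (the `answer is (X)` pattern is an assumption for illustration, not the exact template used by the harness):

```python
import re

# Hypothetical extractor: pull the last multiple-choice letter (A-J,
# matching MMLU-ProX's ten options) from a model response.
def extract_choice(response: str):
    matches = re.findall(r"answer is \(?([A-J])\)?", response)
    return matches[-1] if matches else None  # keep the last occurrence

print(extract_choice("Let me think... the answer is (C)."))  # -> C
```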
Expand Down Expand Up @@ -68,4 +68,150 @@ Some reference numbers for reference and commands for reproduction:
++inference.temperature=0.6 \
++inference.top_k=20 \
++inference.tokens_to_generate=38912
```

### FLORES-200

- Benchmark is defined in [`nemo_skills/dataset/flores200/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/flores200/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/openlanguagedata/flores_plus).

Some reference numbers for the devtest split (xx corresponds to the average over 5 languages: de, es, fr, it, ja):

| Model | en->xx | xx->en | xx->xx |
|:-----------------------|------:|------:|------:|
| Nemotron-NanoV2-9B-v2 | 32.5 | 34.0 | 25.9 |
| Qwen3-8B | 31.5 | 34.6 | 25.7 |
| Qwen3-30B-A3B | 33.3 | 35.5 | 27.1 |
| gpt-oss-20B | 32.4 | 34.1 | 25.0 |

=== "Nemotron-NanoV2-9B-v2"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-Nano-9B-v2 \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

=== "Qwen3-8B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-8B \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "Qwen3-30B-A3B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-30B-A3B \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "gpt-oss-20B"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-20b \
--benchmarks flores200 \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=devtest \
++inference.tokens_to_generate=2048
```

### wmt24pp

- Benchmark is defined in [`nemo_skills/dataset/wmt24pp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/wmt24pp/__init__.py)
- Original benchmark source is [here](https://huggingface.co/datasets/google/wmt24pp).

Some reference numbers for the test split (xx corresponds to the average over 5 languages: de, es, fr, it, ja):

| Model | en->de | en->es | en->fr | en->it | en->ja | en->xx |
|:-----------------------|------:|------:|------:|------:|------:|------:|
| Nemotron-NanoV2-9B-v2 | 25.3 | 37.7 | 33.4 | 33.8 | 20.9 | 30.2 |
| Qwen3-8B | 26.2 | 38.5 | 33.1 | 33.1 | 21.7 | 30.5 |
| Qwen3-30B-A3B | 28.5 | 40.0 | 35.1 | 36.0 | 23.2 | 32.5 |
| gpt-oss-20B | 27.3 | 42.3 | 32.8 | 34.9 | 25.2 | 32.5 |

=== "Nemotron-NanoV2-9B-v2"

```bash
ns eval \
--cluster=[cluster] \
--model=NVIDIA/Nemotron-Nano-9B-v2 \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++system_message='/no_think'
```

=== "Qwen3-8B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-8B \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "Qwen3-30B-A3B"

```bash
ns eval \
--cluster=[cluster] \
--model=Qwen/Qwen3-30B-A3B \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=512 \
++prompt_suffix='/no_think'
```

=== "gpt-oss-20B"

```bash
ns eval \
--cluster=[cluster] \
--model=openai/gpt-oss-20b \
--benchmarks wmt24pp \
--output_dir=[output dir] \
--server_type=vllm \
--server_gpus=8 \
--split=test \
++inference.tokens_to_generate=2048
```
3 changes: 2 additions & 1 deletion docs/index.md
Expand Up @@ -21,7 +21,8 @@ Here are some of the features we support:
- [**Instruction following**](./evaluation/instruction-following.md): e.g. [ifbench](./evaluation/instruction-following.md#ifbench), [ifeval](./evaluation/instruction-following.md#ifeval)
- [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
- [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
- [**Robustness Evaluation**](./evaluation/robustness.md): Evaluate model sensitvity against changes in prompt.
- [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#flores-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
- [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity to changes in the prompt.
- Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts, or change the benchmark configuration in any other way.
- [Model training](pipelines/training.md): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).

22 changes: 22 additions & 0 deletions nemo_skills/dataset/flores200/__init__.py
@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "multilingual/segment-translation"
DATASET_GROUP = "chat"
METRICS_TYPE = "translation"
EVAL_ARGS = "++eval_type=no-op"
GENERATION_ARGS = ""
73 changes: 73 additions & 0 deletions nemo_skills/dataset/flores200/prepare.py
@@ -0,0 +1,73 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from datasets import load_dataset
from langcodes import Language


def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
with open(output_file, "wt", encoding="utf-8") as fout:
for src_lang in src_languages:
for tgt_lang in tgt_languages:
if src_lang != tgt_lang:
for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
json_dict = {
"text": src,
"translation": tgt,
"source_language": src_lang,
"target_language": tgt_lang,
"source_lang_name": Language(src_lang).display_name(),
"target_lang_name": Language(tgt_lang).display_name(),
}
json.dump(json_dict, fout)
fout.write("\n")


def main(args):
all_languages = list(set(args.source_languages).union(set(args.target_languages)))

datasets = {}
for lang in all_languages:
iso_639_3 = Language(lang).to_alpha3()
iso_15924 = Language(lang).maximize().script
lang_code = f"{iso_639_3}_{iso_15924}"
datasets[lang] = load_dataset("openlanguagedata/flores_plus", lang_code, split=args.split)["text"]

data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, src_languages=args.source_languages, tgt_languages=args.target_languages)


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--split", default="dev", choices=("dev", "devtest"), help="Dataset split to process.")
parser.add_argument(
"--source_languages",
default=["en", "de", "es", "fr", "it", "ja"],
nargs="+",
help="Languages to translate from.",
)
parser.add_argument(
"--target_languages",
default=["en", "de", "es", "fr", "it", "ja"],
nargs="+",
help="Languages to translate to.",
)
args = parser.parse_args()
main(args)
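For reference, each line this script writes is a self-contained translation pair. A sketch of one such record follows (the sentences are made-up placeholders; the `*_lang_name` fields come from `langcodes` in the real script):

```python
import json

# Illustrative record in the format flores200/prepare.py emits,
# one JSON object per line of the output .jsonl file.
record = {
    "text": "Hello, world!",        # source-language sentence
    "translation": "Hallo, Welt!",  # reference in the target language
    "source_language": "en",
    "target_language": "de",
    "source_lang_name": "English",
    "target_lang_name": "German",
}
print(json.dumps(record))
```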
22 changes: 22 additions & 0 deletions nemo_skills/dataset/wmt24pp/__init__.py
@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "multilingual/segment-translation"
DATASET_GROUP = "chat"
METRICS_TYPE = "translation"
EVAL_ARGS = "++eval_type=no-op"
GENERATION_ARGS = ""
60 changes: 60 additions & 0 deletions nemo_skills/dataset/wmt24pp/prepare.py
@@ -0,0 +1,60 @@
# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import json
from pathlib import Path

from datasets import load_dataset
from langcodes import Language


def write_data_to_file(output_file, datasets, tgt_languages):
with open(output_file, "wt", encoding="utf-8") as fout:
for tgt_lang in tgt_languages:
for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
json_dict = {
"text": src,
"translation": tgt,
"source_language": "en",
"target_language": tgt_lang,
"source_lang_name": "English",
"target_lang_name": Language(tgt_lang[:2]).display_name(),
}
json.dump(json_dict, fout)
fout.write("\n")


def main(args):
datasets = {}
for lang in args.target_languages:
datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]

data_dir = Path(__file__).absolute().parent
data_dir.mkdir(exist_ok=True)
output_file = data_dir / f"{args.split}.jsonl"
write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--split", default="test", choices=("test",), help="Dataset split to process.")
parser.add_argument(
"--target_languages",
default=["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"],
nargs="+",
help="Languages to translate to.",
)
args = parser.parse_args()
main(args)
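Note that wmt24pp configs are keyed by locale (e.g. `en-de_DE`), while the output records keep only the base target-language name. A small sketch of that mapping (display names are hard-coded here for illustration instead of calling `langcodes`):

```python
# Sketch: derive the base language code from a wmt24pp locale, as
# prepare.py does with tgt_lang[:2] before resolving the display name.
NAMES = {"de": "German", "es": "Spanish", "fr": "French",
         "it": "Italian", "ja": "Japanese"}  # assumed subset

for locale in ["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"]:
    base = locale[:2]  # strip the region part of the locale
    print(f"en-{locale} -> target_lang_name={NAMES[base]}")
```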