MT datasets FLORES200 and WMT24pp #870
Conversation
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Walkthrough

Adds module-level dataset defaults and prepare scripts for Flores200 and WMT24pp, a new TranslationMetrics class computing BLEU and registering it, and a generic translation prompt template; the new scripts serialize HuggingFace splits into per-split JSONL with language metadata.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as prepare.py (CLI)
    participant HF as HuggingFace Datasets
    participant Writer as JSONL Writer
    User->>CLI: run with --split / --source_languages / --target_languages
    CLI->>HF: load_dataset(...) per language or pair
    HF-->>CLI: dataset split (texts)
    loop for each language pair and example
        CLI->>CLI: normalize codes & names, build record
        CLI->>Writer: append JSONL line {text, translation, src/tgt codes, names}
    end
    Note over Writer: Output file: <split>.jsonl (one JSON object per line)
```
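For concreteness, one record appended by the JSONL writer could look like the sketch below (field names taken from the walkthrough; the sentence pair and values are made up for illustration):

```python
import json

# Hypothetical record; real values come from the HuggingFace datasets.
record = {
    "text": "Hello, world.",
    "translation": "Hallo, Welt.",
    "source_language": "en",
    "target_language": "de",
    "source_lang_name": "English",
    "target_lang_name": "German",
}
line = json.dumps(record)  # one JSON object per line in <split>.jsonl
print(line)
```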
```mermaid
sequenceDiagram
    autonumber
    participant Eval as Evaluator
    participant TM as TranslationMetrics
    participant SB as sacrebleu
    Eval->>TM: reset()
    loop for each batch
        Eval->>TM: update(predictions) %% store per-pair hypotheses/references
    end
    Eval->>TM: get_metrics()
    TM->>SB: corpus_bleu(hypotheses, references, tokenizer=auto or ja-mecab)
    SB-->>TM: BLEU scores per pair
    TM-->>Eval: {per-pair BLEU, per-src/tgt aggregates, overall BLEU}
    Note over TM: Aggregation buckets by source, target, and overall
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Actionable comments posted: 6
🧹 Nitpick comments (5)
nemo_skills/prompt/config/generic/translation.yaml (1)
3-3: Tighten the prompt to reduce chatty outputs.
Consider explicitly asking for only the translation text (no quotes, no prefix/suffix) to avoid LLM boilerplate.
Apply this tweak if desired:
```diff
-user: "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}"
+user: "Translate the following segment into {target_lang_name}. Return only the translated text (no quotes, no extra words).\n\n{text}"
```

nemo_skills/evaluation/metrics/translation_metrics.py (2)
31-34: Avoid re-deriving `tgt_lang`; use the parsed variable.
You already have `tgt_lang` from the split; reassigning risks drift if key parsing changes.

```diff
-        tokenize = "13a"
-        tgt_lang = key.split("_")[-1]
+        tokenize = "13a"
         if tgt_lang == "ja":
             tokenize = "ja-mecab"
```
26-27: Key parsing via `split("_")` is brittle for codes containing underscores.
Prefer tuple keys `(src_lang, tgt_lang)` to avoid delimiter parsing bugs.

Minimal change:

```diff
-        for key in self.translation_dict:
-            src_lang, tgt_lang = key.split("_")
-            preds = self.translation_dict[key]["preds"]
-            gts = self.translation_dict[key]["gts"]
+        for (src_lang, tgt_lang), pair_dict in self.translation_dict.items():
+            preds = pair_dict["preds"]
+            gts = pair_dict["gts"]
```

And in `update`:

```diff
-        self.translation_dict[f"{src_lang}_{tgt_lang}"]["preds"].append(generation)
-        self.translation_dict[f"{src_lang}_{tgt_lang}"]["gts"].append(ground_truth)
+        self.translation_dict[(src_lang, tgt_lang)]["preds"].append(generation)
+        self.translation_dict[(src_lang, tgt_lang)]["gts"].append(ground_truth)
```

This avoids future failures if a language like `zh_Hant` ever appears.

nemo_skills/dataset/flores200/prepare.py (2)
32-51: Consider adding error handling for language code conversion and dataset loading.
The language code conversion (lines 38-40) and dataset loading (line 41) assume valid inputs and successful operations. If an invalid language code is provided or if dataset loading fails, the script will crash without helpful error messages.

Consider adding try-except blocks to provide clearer error messages for:

- Invalid language codes that `langcodes.Language` cannot parse
- Missing language variants in the FLORES+ dataset
- Empty or malformed datasets
Example error handling:
```python
for lang in all_languages:
    try:
        iso_639_3 = Language(lang).to_alpha3()
        iso_15924 = Language(lang).maximize().script
        lang_code = f"{iso_639_3}_{iso_15924}"
        datasets[lang] = load_dataset("openlanguagedata/flores_plus", lang_code, split=args.split)["text"]
    except Exception as e:
        raise ValueError(f"Failed to load dataset for language '{lang}': {e}")
```
54-66: LGTM with optional help text improvement.
The CLI argument parsing is well-structured. The default language lists will generate all permutations excluding self-translation pairs (e.g., en→de, de→en, etc.), which is appropriate for evaluation.
Optionally, consider clarifying in the help text that identical source and target language lists will generate all cross-language pairs:

```diff
 parser.add_argument(
     "--source_languages",
     default=["en", "de", "es", "fr", "it", "ja"],
-    nargs="+", help="Languages to translate from."
+    nargs="+", help="Languages to translate from. Will generate pairs with all target languages (excluding self-pairs)."
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- nemo_skills/dataset/flores200/__init__.py (1 hunks)
- nemo_skills/dataset/flores200/prepare.py (1 hunks)
- nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
- nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
- nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
- nemo_skills/prompt/config/generic/translation.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
TranslationMetrics (22-80)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
BaseMetrics (21-332), as_float (343-344)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
write_data_to_file (16-29), main (32-46)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
write_data_to_file (14-29), main (32-51)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/metrics/translation_metrics.py
38-38: f-string without any placeholders
Remove extraneous f prefix
(F541)
nemo_skills/dataset/flores200/prepare.py
19-19: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/wmt24pp/prepare.py
19-19: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (5)
nemo_skills/dataset/wmt24pp/__init__.py (1)
18-22: Confirm defaults for group/eval args align with intended pipeline.
Current defaults set DATASET_GROUP="chat" and EVAL_ARGS="++eval_type=no-op". For a translation dataset, did you intend DATASET_GROUP="translation" and empty EVAL_ARGS? If "chat"/no-op is intentional (e.g., to reuse chat scaffolding), no change needed; just confirming.
nemo_skills/dataset/wmt24pp/prepare.py (1)
35-38: Verify WMT24pp schema and language codes.Please confirm that:
- The dataset columns are named
"source"and"target"for configen-{LANG2CODE[lang]}.- The configs
en-de_DE,en-es_MX, etc., exist as expected.If column names differ (e.g., a dict
translationwith language keys), we should adapt the extraction.nemo_skills/evaluation/metrics/translation_metrics.py (1)
33-36: ja tokenization dependency check.
sacrebleu's `ja-mecab` tokenizer requires extra deps (e.g., fugashi/unidic-lite). If missing, this may raise at runtime. Consider a graceful fallback to `"ja-mecab" if available else "13a"` with a warning.
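A minimal sketch of such a fallback (the probe module name `MeCab` is an assumption about what the ja-mecab tokenizer imports; verify against the installed sacrebleu before adopting):

```python
import importlib.util
import warnings

def pick_tokenizer(tgt_lang: str) -> str:
    # Use sacrebleu's "ja-mecab" tokenizer only when its extra dependency
    # appears importable; otherwise warn and fall back to the default "13a".
    if tgt_lang == "ja":
        if importlib.util.find_spec("MeCab") is not None:
            return "ja-mecab"
        warnings.warn("MeCab not available; falling back to '13a' for Japanese")
    return "13a"

print(pick_tokenizer("de"))  # → 13a
```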
31-51: LGTM — metric wired into the map cleanly.nemo_skills/dataset/flores200/__init__.py (1)
18-22: LGTM!
The module-level constants are well-structured and appropriate for a translation dataset. The defaults align with the translation-focused evaluation flow added in this PR.
```python
import argparse
import importlib.util
import json
import tempfile
import urllib.request
from pathlib import Path

from langcodes import Language

from datasets import load_dataset
from tqdm import tqdm
```
Remove unused imports.
Several imports are unused in this file: importlib.util, tempfile, urllib.request, and tqdm.
Apply this diff to remove the unused imports:
```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path

 from langcodes import Language

 from datasets import load_dataset
-from tqdm import tqdm
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path
 from langcodes import Language
 from datasets import load_dataset
-from tqdm import tqdm
```
🤖 Prompt for AI Agents
In nemo_skills/dataset/flores200/prepare.py around lines 1 to 11, remove the
unused imports importlib.util, tempfile, urllib.request, and tqdm; keep the
required imports (argparse, json, Path, Language, load_dataset) and ensure
import order remains consistent (standard libs first, third-party after) and no
other references to the removed modules remain in the file.
```python
def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for src_lang in src_languages:
            for tgt_lang in tgt_languages:
                if src_lang != tgt_lang:
                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
                        json_dict = {
                            "text": src,
                            "translation": tgt,
                            "source_language": src_lang,
                            "target_language": tgt_lang,
                            "source_lang_name": Language(src_lang).display_name(),
                            "target_lang_name": Language(tgt_lang).display_name(),
                        }
                        json.dump(json_dict, fout)
                        fout.write("\n")
```
Add strict=True to zip() to prevent silent data loss.
The zip() call on line 19 lacks an explicit strict= parameter. If the source and target language datasets have mismatched lengths, zip() will silently truncate to the shorter sequence, potentially losing translation pairs without warning.
For parallel corpora like FLORES200, all language datasets should have identical lengths. Adding strict=True will raise a ValueError if lengths mismatch, making data integrity issues immediately visible.
Apply this diff:
```diff
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
```

Note: This same issue exists in the WMT24pp prepare.py script and should be addressed there as well.
📝 Committable suggestion
```diff
 def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
     with open(output_file, "wt", encoding="utf-8") as fout:
         for src_lang in src_languages:
             for tgt_lang in tgt_languages:
                 if src_lang != tgt_lang:
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
                         json_dict = {
                             "text": src,
                             "translation": tgt,
                             "source_language": src_lang,
                             "target_language": tgt_lang,
                             "source_lang_name": Language(src_lang).display_name(),
                             "target_lang_name": Language(tgt_lang).display_name(),
                         }
                         json.dump(json_dict, fout)
                         fout.write("\n")
```
🧰 Tools
🪛 Ruff (0.13.1)
19-19: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/dataset/flores200/prepare.py around lines 14 to 29, the use of
zip(datasets[src_lang], datasets[tgt_lang]) can silently drop pairs when the two
lists differ in length; update the zip call to zip(datasets[src_lang],
datasets[tgt_lang], strict=True) so a ValueError is raised on length mismatch,
making data integrity failures explicit and preventing silent data loss (also
apply the same change to the WMT24pp prepare.py where the same pattern appears).
```python
    datasets = {}
    for lang in args.target_languages:
        lang_code = f"en-{LANG2CODE[lang]}"
        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
```
Hardcoded split causes incorrect data selection.
You load "train" regardless of --split. Use the CLI split for loading.
```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```

📝 Committable suggestion
```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```
🤖 Prompt for AI Agents
In nemo_skills/dataset/wmt24pp/prepare.py around line 37, the code always loads
the "train" split; update the load_dataset call to use the CLI split variable
instead (e.g. load_dataset("google/wmt24pp", lang_code)[split] or pass split
directly if supported) so the dataset loaded matches the --split provided;
ensure the CLI variable name (split) is in scope and validate or map common
values (train/validation/test) if necessary before calling load_dataset.
Actionable comments posted: 1
♻️ Duplicate comments (5)
nemo_skills/dataset/flores200/prepare.py (2)
15-25: Remove unused imports.
Several imports are unused: `importlib.util`, `tempfile`, `urllib.request`, and `tqdm`.

Apply this diff:

```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path

 from langcodes import Language

 from datasets import load_dataset
-from tqdm import tqdm
```
33-33: Add `strict=True` to `zip()` to prevent silent data loss.
For parallel corpora like FLORES200, all language datasets should have identical lengths. Without `strict=True`, `zip()` will silently truncate to the shorter sequence if lengths mismatch.

Apply this diff:

```diff
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
```

nemo_skills/dataset/wmt24pp/prepare.py (3)
33-33: Add `strict=True` to `zip()` to prevent silent data loss.
Without `strict=True`, `zip()` will silently truncate if source and target lists have different lengths.

Apply this diff:

```diff
-        for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"]):
+        for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
```
51-51: Hardcoded split causes incorrect data selection.
The code always loads the `"train"` split regardless of the `--split` argument value. This means users cannot select other splits.

Apply this diff:

```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```
65-65: Fix argparse choices bug.
`choices=("test")` is a string, not a tuple. This means argparse will accept any single character from "test" ('t', 'e', 's', 't') as valid input.

Apply this diff:

```diff
-    parser.add_argument("--split", default="test", choices=("test"), help="Dataset split to process.")
+    parser.add_argument("--split", default="test", choices=("test",), help="Dataset split to process.")
```
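The bug is easy to confirm with a toy parser: `choices=("test")` makes argparse do substring membership against the string, while a real one-element tuple rejects bad values.

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument("--split", default="test", choices=("test"))  # string, not tuple!
# Any single character of "test" is accepted, since "t" in "test" is True:
args = p.parse_args(["--split", "t"])
print(args.split)  # → t

p_fixed = argparse.ArgumentParser()
p_fixed.add_argument("--split", default="test", choices=("test",))  # proper 1-tuple
try:
    p_fixed.parse_args(["--split", "t"])
except SystemExit:
    print("rejected as expected")
```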
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/flores200/prepare.py (1 hunks)
- nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
write_data_to_file (30-43), main (46-60)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
write_data_to_file (28-43), main (46-65)
🪛 Ruff (0.13.1)
nemo_skills/dataset/flores200/prepare.py
33-33: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/wmt24pp/prepare.py
33-33: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (1)
nemo_skills/dataset/flores200/prepare.py (1)
46-65: LGTM!
The `main` function correctly uses `args.split` when loading datasets and properly handles the union of source and target languages.
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Kipok left a comment
Thanks @AlexGrinch ! A few things to change besides the comments
- Please recreate PR from a branch (I sent you an invite), so all tests can run. See https://github.com/NVIDIA/NeMo-Skills/blob/main/CONTRIBUTING.md for more information
- Please add these new benchmarks to the documentation https://github.com/NVIDIA/NeMo-Skills/blob/main/docs/evaluation/multilingual.md. Could also list them in https://github.com/NVIDIA/NeMo-Skills/blob/main/docs/index.md and https://github.com/NVIDIA/NeMo-Skills/blob/main/README.md in the list of benchmarks with new Multilingual category (and perhaps add mmlu-prox there)
- It would be great to add some tests, e.g. you could add an example of metrics data to https://github.com/NVIDIA/NeMo-Skills/blob/main/tests/test_metrics.py#L24 or add a new slurm test https://github.com/NVIDIA/NeMo-Skills/tree/main/tests/slurm-tests to ensure this logic is not being broken. We could add tests in another PR if this is too much work for now, let me know
```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "generic/translation"
DATASET_GROUP = "chat"
```
maybe we add a new group for multilingual datasets? Can also update to that type in mmlu-prox
```python
from collections import Counter, defaultdict

from sacrebleu import corpus_bleu
# from comet import download_model, load_from_checkpoint
```

```python
        return metrics_dict

    def update(self, predictions):
```
do you think it's feasible to refactor this to reuse the parent method functions for computing pass@k? That way more code could be shared and pass@k can be defined as the highest BLEU across k attempts (should work out-of-the-box as we have similar logic for ifeval)
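If that refactor happens, the translation-specific part of pass@k reduces to taking the best BLEU over the first k sampled attempts; a toy sketch of just that aggregation (the function name and shape are illustrative, not the BaseMetrics API):

```python
def pass_at_k_bleu(attempt_scores):
    # Best BLEU among the first k sampled translations, for each k up to
    # the number of attempts.
    return {f"pass@{k}": max(attempt_scores[:k]) for k in range(1, len(attempt_scores) + 1)}

scores = pass_at_k_bleu([21.4, 33.0, 28.7])
print(scores)  # → {'pass@1': 21.4, 'pass@2': 33.0, 'pass@3': 33.0}
```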
shuoyangd left a comment
I have a few other minor questions and/or requested changes aside from what Igor commented earlier.
```python
from datasets import load_dataset

LANG2CODE = {"de": "de_DE", "es": "es_MX", "fr": "fr_FR", "it": "it_IT", "ja": "ja_JP"}
```
Can we just directly use the long code instead of adding this conversion dict? Because this will limit the utility of this script to just the 5 languages above, and if we use two-letter code we will have extendability problem once we include zh_CN and zh_TW.
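For illustration, taking the long codes straight from the CLI removes the dict and extends to new locales without code changes (a sketch; the flag name mirrors the script's, the defaults are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--target_languages",
    nargs="+",
    default=["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"],
    help="Long locale codes, passed through to the HF config name.",
)
args = parser.parse_args(["--target_languages", "zh_CN", "zh_TW"])

# Config names are built directly; no LANG2CODE lookup needed:
configs = [f"en-{code}" for code in args.target_languages]
print(configs)  # → ['en-zh_CN', 'en-zh_TW']
```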
```python
# from comet import download_model, load_from_checkpoint


class TranslationMetrics(BaseMetrics):
```
Do you intend to extend this to compute COMET as well? If it's going to be in a different class we should just call this BLEU.
```python
        tokenize = "13a"
        tgt_lang = key.split("_")[-1]
        if tgt_lang == "ja":
```
I think if you pass tokenize=None in L36 then you don't have to do this (it will default to jieba for Chinese, ja-mecab for Japanese, ko-mecab for Korean, 13a otherwise).
One more thing: we need two more dependencies here:
let me close this one - @AlexGrinch please recreate from a branch |