MT datasets FLORES200 and WMT24pp #870
Conversation
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Walkthrough

Adds module-level dataset defaults and prepare scripts for Flores200 and WMT24pp, a new TranslationMetrics class computing BLEU and registering it, and a generic translation prompt template; the new scripts serialize HuggingFace splits into per-split JSONL with language metadata.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as prepare.py (CLI)
    participant HF as HuggingFace Datasets
    participant Writer as JSONL Writer
    User->>CLI: run with --split / --source_languages / --target_languages
    CLI->>HF: load_dataset(...) per language or pair
    HF-->>CLI: dataset split (texts)
    loop for each language pair and example
        CLI->>CLI: normalize codes & names, build record
        CLI->>Writer: append JSONL line {text, translation, src/tgt codes, names}
    end
    Note over Writer: Output file: <split>.jsonl (one JSON object per line)
```
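For concreteness, one record appended by the JSONL writer could look like the sketch below (field names taken from the walkthrough; the sentence pair and values are made up for illustration):

```python
import json

# Hypothetical record; real values come from the HuggingFace datasets.
record = {
    "text": "Hello, world.",
    "translation": "Hallo, Welt.",
    "source_language": "en",
    "target_language": "de",
    "source_lang_name": "English",
    "target_lang_name": "German",
}
line = json.dumps(record)  # one JSON object per line in <split>.jsonl
print(line)
```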
```mermaid
sequenceDiagram
    autonumber
    participant Eval as Evaluator
    participant TM as TranslationMetrics
    participant SB as sacrebleu
    Eval->>TM: reset()
    loop for each batch
        Eval->>TM: update(predictions) %% store per-pair hypotheses/references
    end
    Eval->>TM: get_metrics()
    TM->>SB: corpus_bleu(hypotheses, references, tokenizer=auto or ja-mecab)
    SB-->>TM: BLEU scores per pair
    TM-->>Eval: {per-pair BLEU, per-src/tgt aggregates, overall BLEU}
    Note over TM: Aggregation buckets by source, target, and overall
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Actionable comments posted: 6
🧹 Nitpick comments (5)
nemo_skills/prompt/config/generic/translation.yaml (1)
3-3: Tighten the prompt to reduce chatty outputs.
Consider explicitly asking for only the translation text (no quotes, no prefix/suffix) to avoid LLM boilerplate.
Apply this tweak if desired:
```diff
-user: "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}"
+user: "Translate the following segment into {target_lang_name}. Return only the translated text (no quotes, no extra words).\n\n{text}"
```

nemo_skills/evaluation/metrics/translation_metrics.py (2)
31-34: Avoid re-deriving `tgt_lang`; use the parsed variable.
You already have `tgt_lang` from the split; reassigning risks drift if key parsing changes.

```diff
-        tokenize = "13a"
-        tgt_lang = key.split("_")[-1]
+        tokenize = "13a"
         if tgt_lang == "ja":
             tokenize = "ja-mecab"
```
26-27: Key parsing via `split("_")` is brittle for codes containing underscores.
Prefer tuple keys `(src_lang, tgt_lang)` to avoid delimiter parsing bugs.

Minimal change:

```diff
-        for key in self.translation_dict:
-            src_lang, tgt_lang = key.split("_")
-            preds = self.translation_dict[key]["preds"]
-            gts = self.translation_dict[key]["gts"]
+        for (src_lang, tgt_lang), pair_dict in self.translation_dict.items():
+            preds = pair_dict["preds"]
+            gts = pair_dict["gts"]
```

And in `update`:

```diff
-        self.translation_dict[f"{src_lang}_{tgt_lang}"]["preds"].append(generation)
-        self.translation_dict[f"{src_lang}_{tgt_lang}"]["gts"].append(ground_truth)
+        self.translation_dict[(src_lang, tgt_lang)]["preds"].append(generation)
+        self.translation_dict[(src_lang, tgt_lang)]["gts"].append(ground_truth)
```

This avoids future failures if a language like `zh_Hant` ever appears.

nemo_skills/dataset/flores200/prepare.py (2)
32-51: Consider adding error handling for language code conversion and dataset loading.
The language code conversion (lines 38-40) and dataset loading (line 41) assume valid inputs and successful operations. If an invalid language code is provided or if dataset loading fails, the script will crash without helpful error messages.

Consider adding try-except blocks to provide clearer error messages for:

- Invalid language codes that `langcodes.Language` cannot parse
- Missing language variants in the FLORES+ dataset
- Empty or malformed datasets
Example error handling:
```python
for lang in all_languages:
    try:
        iso_639_3 = Language(lang).to_alpha3()
        iso_15924 = Language(lang).maximize().script
        lang_code = f"{iso_639_3}_{iso_15924}"
        datasets[lang] = load_dataset("openlanguagedata/flores_plus", lang_code, split=args.split)["text"]
    except Exception as e:
        raise ValueError(f"Failed to load dataset for language '{lang}': {e}")
```
54-66: LGTM with optional help text improvement.
The CLI argument parsing is well-structured. The default language lists will generate all permutations excluding self-translation pairs (e.g., en→de, de→en, etc.), which is appropriate for evaluation.
Optionally, consider clarifying in the help text that identical source and target language lists will generate all cross-language pairs:

```diff
 parser.add_argument(
     "--source_languages",
     default=["en", "de", "es", "fr", "it", "ja"],
-    nargs="+", help="Languages to translate from."
+    nargs="+", help="Languages to translate from. Will generate pairs with all target languages (excluding self-pairs)."
 )
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- nemo_skills/dataset/flores200/__init__.py (1 hunks)
- nemo_skills/dataset/flores200/prepare.py (1 hunks)
- nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
- nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
- nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
- nemo_skills/prompt/config/generic/translation.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
TranslationMetrics (22-80)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
BaseMetrics (21-332), as_float (343-344)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
write_data_to_file (16-29), main (32-46)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
write_data_to_file (14-29), main (32-51)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/metrics/translation_metrics.py
38-38: f-string without any placeholders
Remove extraneous f prefix
(F541)
nemo_skills/dataset/flores200/prepare.py
19-19: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/wmt24pp/prepare.py
19-19: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (5)
nemo_skills/dataset/wmt24pp/__init__.py (1)
18-22: Confirm defaults for group/eval args align with intended pipeline.
Current defaults set DATASET_GROUP="chat" and EVAL_ARGS="++eval_type=no-op". For a translation dataset, did you intend DATASET_GROUP="translation" and empty EVAL_ARGS? If "chat"/no-op is intentional (e.g., to reuse chat scaffolding), no change needed; just confirming.
nemo_skills/dataset/wmt24pp/prepare.py (1)
35-38: Verify WMT24pp schema and language codes.Please confirm that:
- The dataset columns are named
"source"and"target"for configen-{LANG2CODE[lang]}.- The configs
en-de_DE,en-es_MX, etc., exist as expected.If column names differ (e.g., a dict
translationwith language keys), we should adapt the extraction.nemo_skills/evaluation/metrics/translation_metrics.py (1)
33-36: ja tokenization dependency check.
sacrebleu's `ja-mecab` tokenizer requires extra deps (e.g., fugashi/unidic-lite). If missing, this may raise at runtime. Consider a graceful fallback to `"ja-mecab" if available else "13a"` with a warning.
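A minimal sketch of such a fallback (the probe module name `MeCab` is an assumption about what the ja-mecab tokenizer imports; verify against the installed sacrebleu before adopting):

```python
import importlib.util
import warnings

def pick_tokenizer(tgt_lang: str) -> str:
    # Use sacrebleu's "ja-mecab" tokenizer only when its extra dependency
    # appears importable; otherwise warn and fall back to the default "13a".
    if tgt_lang == "ja":
        if importlib.util.find_spec("MeCab") is not None:
            return "ja-mecab"
        warnings.warn("MeCab not available; falling back to '13a' for Japanese")
    return "13a"

print(pick_tokenizer("de"))  # → 13a
```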
31-51: LGTM — metric wired into the map cleanly.nemo_skills/dataset/flores200/__init__.py (1)
18-22: LGTM!
The module-level constants are well-structured and appropriate for a translation dataset. The defaults align with the translation-focused evaluation flow added in this PR.
```python
import argparse
import importlib.util
import json
import tempfile
import urllib.request
from pathlib import Path

from langcodes import Language

from datasets import load_dataset
from tqdm import tqdm
```
Remove unused imports.
Several imports are unused in this file: importlib.util, tempfile, urllib.request, and tqdm.
Apply this diff to remove the unused imports:
```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path

 from langcodes import Language

 from datasets import load_dataset
-from tqdm import tqdm
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path
 from langcodes import Language
 from datasets import load_dataset
-from tqdm import tqdm
```
🤖 Prompt for AI Agents
In nemo_skills/dataset/flores200/prepare.py around lines 1 to 11, remove the
unused imports importlib.util, tempfile, urllib.request, and tqdm; keep the
required imports (argparse, json, Path, Language, load_dataset) and ensure
import order remains consistent (standard libs first, third-party after) and no
other references to the removed modules remain in the file.
```python
def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for src_lang in src_languages:
            for tgt_lang in tgt_languages:
                if src_lang != tgt_lang:
                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
                        json_dict = {
                            "text": src,
                            "translation": tgt,
                            "source_language": src_lang,
                            "target_language": tgt_lang,
                            "source_lang_name": Language(src_lang).display_name(),
                            "target_lang_name": Language(tgt_lang).display_name(),
                        }
                        json.dump(json_dict, fout)
                        fout.write("\n")
```
Add strict=True to zip() to prevent silent data loss.
The zip() call on line 19 lacks an explicit strict= parameter. If the source and target language datasets have mismatched lengths, zip() will silently truncate to the shorter sequence, potentially losing translation pairs without warning.
For parallel corpora like FLORES200, all language datasets should have identical lengths. Adding strict=True will raise a ValueError if lengths mismatch, making data integrity issues immediately visible.
Apply this diff:
```diff
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
```

Note: This same issue exists in the WMT24pp prepare.py script and should be addressed there as well.
📝 Committable suggestion
```diff
 def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
     with open(output_file, "wt", encoding="utf-8") as fout:
         for src_lang in src_languages:
             for tgt_lang in tgt_languages:
                 if src_lang != tgt_lang:
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
                         json_dict = {
                             "text": src,
                             "translation": tgt,
                             "source_language": src_lang,
                             "target_language": tgt_lang,
                             "source_lang_name": Language(src_lang).display_name(),
                             "target_lang_name": Language(tgt_lang).display_name(),
                         }
                         json.dump(json_dict, fout)
                         fout.write("\n")
```
🧰 Tools
🪛 Ruff (0.13.1)
19-19: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In nemo_skills/dataset/flores200/prepare.py around lines 14 to 29, the use of
zip(datasets[src_lang], datasets[tgt_lang]) can silently drop pairs when the two
lists differ in length; update the zip call to zip(datasets[src_lang],
datasets[tgt_lang], strict=True) so a ValueError is raised on length mismatch,
making data integrity failures explicit and preventing silent data loss (also
apply the same change to the WMT24pp prepare.py where the same pattern appears).
```python
    datasets = {}
    for lang in args.target_languages:
        lang_code = f"en-{LANG2CODE[lang]}"
        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
```
Hardcoded split causes incorrect data selection.
You load "train" regardless of --split. Use the CLI split for loading.
```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```

📝 Committable suggestion
```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```
🤖 Prompt for AI Agents
In nemo_skills/dataset/wmt24pp/prepare.py around line 37, the code always loads
the "train" split; update the load_dataset call to use the CLI split variable
instead (e.g. load_dataset("google/wmt24pp", lang_code)[split] or pass split
directly if supported) so the dataset loaded matches the --split provided;
ensure the CLI variable name (split) is in scope and validate or map common
values (train/validation/test) if necessary before calling load_dataset.
Actionable comments posted: 1
♻️ Duplicate comments (5)
nemo_skills/dataset/flores200/prepare.py (2)
15-25: Remove unused imports.
Several imports are unused: `importlib.util`, `tempfile`, `urllib.request`, and `tqdm`.

Apply this diff:

```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path

 from langcodes import Language

 from datasets import load_dataset
-from tqdm import tqdm
```
33-33: Add `strict=True` to `zip()` to prevent silent data loss.
For parallel corpora like FLORES200, all language datasets should have identical lengths. Without `strict=True`, `zip()` will silently truncate to the shorter sequence if lengths mismatch.

Apply this diff:

```diff
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
```

nemo_skills/dataset/wmt24pp/prepare.py (3)
33-33: Add `strict=True` to `zip()` to prevent silent data loss.
Without `strict=True`, `zip()` will silently truncate if source and target lists have different lengths.

Apply this diff:

```diff
-        for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"]):
+        for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
```
51-51: Hardcoded split causes incorrect data selection.
The code always loads the `"train"` split regardless of the `--split` argument value. This means users cannot select other splits.

Apply this diff:

```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```
65-65: Fix argparse choices bug.
`choices=("test")` is a string, not a tuple. This means argparse will accept any single character from "test" ('t', 'e', 's', 't') as valid input.

Apply this diff:

```diff
-    parser.add_argument("--split", default="test", choices=("test"), help="Dataset split to process.")
+    parser.add_argument("--split", default="test", choices=("test",), help="Dataset split to process.")
```
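The bug is easy to confirm with a toy parser: `choices=("test")` makes argparse do substring membership against the string, while a real one-element tuple rejects bad values.

```python
import argparse

p = argparse.ArgumentParser()
p.add_argument("--split", default="test", choices=("test"))  # string, not tuple!
# Any single character of "test" is accepted, since "t" in "test" is True:
args = p.parse_args(["--split", "t"])
print(args.split)  # → t

p_fixed = argparse.ArgumentParser()
p_fixed.add_argument("--split", default="test", choices=("test",))  # proper 1-tuple
try:
    p_fixed.parse_args(["--split", "t"])
except SystemExit:
    print("rejected as expected")
```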
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_skills/dataset/flores200/prepare.py (1 hunks)
- nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
write_data_to_file (30-43), main (46-60)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
write_data_to_file (28-43), main (46-65)
🪛 Ruff (0.13.1)
nemo_skills/dataset/flores200/prepare.py
33-33: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/wmt24pp/prepare.py
33-33: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🔇 Additional comments (1)
nemo_skills/dataset/flores200/prepare.py (1)
46-65: LGTM!
The `main` function correctly uses `args.split` when loading datasets and properly handles the union of source and target languages.
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Kipok left a comment
Thanks @AlexGrinch ! A few things to change besides the comments
- Please recreate PR from a branch (I sent you an invite), so all tests can run. See https://github.com/NVIDIA/NeMo-Skills/blob/main/CONTRIBUTING.md for more information
- Please add these new benchmarks to the documentation https://github.com/NVIDIA/NeMo-Skills/blob/main/docs/evaluation/multilingual.md. Could also list them in https://github.com/NVIDIA/NeMo-Skills/blob/main/docs/index.md and https://github.com/NVIDIA/NeMo-Skills/blob/main/README.md in the list of benchmarks with new Multilingual category (and perhaps add mmlu-prox there)
- It would be great to add some tests, e.g. you could add an example of metrics data to https://github.com/NVIDIA/NeMo-Skills/blob/main/tests/test_metrics.py#L24 or add a new slurm test https://github.com/NVIDIA/NeMo-Skills/tree/main/tests/slurm-tests to ensure this logic is not being broken. We could add tests in another PR if this is too much work for now, let me know
```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "generic/translation"
DATASET_GROUP = "chat"
```
maybe we add a new group for multilingual datasets? Can also update to that type in mmlu-prox
```python
from collections import Counter, defaultdict

from sacrebleu import corpus_bleu
# from comet import download_model, load_from_checkpoint
```

```python
        return metrics_dict

    def update(self, predictions):
```
do you think it's feasible to refactor this to reuse the parent method functions for computing pass@k? That way more code could be shared and pass@k can be defined as the highest BLEU across k attempts (should work out-of-the-box as we have similar logic for ifeval)
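If that refactor happens, the translation-specific part of pass@k reduces to taking the best BLEU over the first k sampled attempts; a toy sketch of just that aggregation (the function name and shape are illustrative, not the BaseMetrics API):

```python
def pass_at_k_bleu(attempt_scores):
    # Best BLEU among the first k sampled translations, for each k up to
    # the number of attempts.
    return {f"pass@{k}": max(attempt_scores[:k]) for k in range(1, len(attempt_scores) + 1)}

scores = pass_at_k_bleu([21.4, 33.0, 28.7])
print(scores)  # → {'pass@1': 21.4, 'pass@2': 33.0, 'pass@3': 33.0}
```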
shuoyangd left a comment
I have a few other minor questions and/or requested changes aside from what Igor commented earlier.
```python
from datasets import load_dataset

LANG2CODE = {"de": "de_DE", "es": "es_MX", "fr": "fr_FR", "it": "it_IT", "ja": "ja_JP"}
```
Can we just directly use the long code instead of adding this conversion dict? Because this will limit the utility of this script to just the 5 languages above, and if we use two-letter code we will have extendability problem once we include zh_CN and zh_TW.
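For illustration, taking the long codes straight from the CLI removes the dict and extends to new locales without code changes (a sketch; the flag name mirrors the script's, the defaults are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--target_languages",
    nargs="+",
    default=["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"],
    help="Long locale codes, passed through to the HF config name.",
)
args = parser.parse_args(["--target_languages", "zh_CN", "zh_TW"])

# Config names are built directly; no LANG2CODE lookup needed:
configs = [f"en-{code}" for code in args.target_languages]
print(configs)  # → ['en-zh_CN', 'en-zh_TW']
```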
```python
# from comet import download_model, load_from_checkpoint


class TranslationMetrics(BaseMetrics):
```
Do you intend to extend this to compute COMET as well? If it's going to be in a different class we should just call this BLEU.
```python
        tokenize = "13a"
        tgt_lang = key.split("_")[-1]
        if tgt_lang == "ja":
```
I think if you pass tokenize=None in L36 then you don't have to do this (it will default to jieba for Chinese, ja-mecab for Japanese, ko-mecab for Korean, 13a otherwise).
One more thing: we need two more dependencies here:
let me close this one - @AlexGrinch please recreate from a branch |