
MT datasets FLORES200 and WMT24pp #892

Merged
Kipok merged 11 commits into main from ohrinchuk/mt_datasets on Oct 7, 2025

Conversation

@AlexGrinch (Collaborator) commented Oct 4, 2025

PR (#870) recreated from branch

  1. Addressed most of the comments from the last PR (removed unused imports, added docs, refactored wmt24pp to support arbitrary languages).
  2. Passing tokenize=None to corpus_bleu does not handle Japanese text correctly; left as is for now.
  3. Tested that evaluation with FLORES-200 and wmt24pp runs correctly on the cluster. Translation-metrics refactoring and tests are left for another PR.

Summary by CodeRabbit

  • New Features

    • BLEU-based translation evaluation and a multilingual segment-translation prompt.
    • Support for FLORES-200 and WMT24++ datasets with dataset preparation scripts and runnable example commands.
  • Documentation

    • Expanded multilingual evaluation guide with new dataset sections, benchmark corrections, sample result tables, and added "Multilingual capabilities" note.
    • Formatting fixes to code blocks and end-of-file newlines.
  • Chores

    • Added dependencies for language handling and BLEU scoring.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

coderabbitai bot commented Oct 4, 2025

Walkthrough

Adds multilingual translation evaluation: two new datasets (FLORES-200, wmt24pp) with prepare scripts and default configs, a segment-translation prompt, a TranslationMetrics implementation (BLEU via sacrebleu) registered in METRICS_MAP, docs updates, and new deps (langcodes, sacrebleu).

Changes

Cohort / File(s) Summary
Docs updates
README.md, docs/index.md, docs/evaluation/multilingual.md, docs/evaluation/index.md, docs/evaluation/long-context.md
Updated multilingual evaluation bullets (added FLORES-200, wmt24pp), removed "(to be added)", fixed mmlu-prox path/fencing, added benchmark sections, example commands/results, and minor formatting fixes.
FLORES-200 dataset
nemo_skills/dataset/flores200/__init__.py, nemo_skills/dataset/flores200/prepare.py
Added dataset default constants and a CLI prepare script that loads openlanguagedata/flores_plus, derives language codes, zips source/target texts, and writes JSONL with text, translation, language codes, and display names.
WMT24PP dataset
nemo_skills/dataset/wmt24pp/__init__.py, nemo_skills/dataset/wmt24pp/prepare.py
Added dataset default constants and a CLI prepare script that loads google/wmt24pp en→target splits, zips source/target texts, and writes JSONL with text, translation, language metadata, and display names.
Translation metrics
nemo_skills/evaluation/metrics/map_metrics.py, nemo_skills/evaluation/metrics/translation_metrics.py
New TranslationMetrics(BaseMetrics) using sacrebleu.corpus_bleu with language-specific tokenization; maintains per-pair and aggregated BLEU buckets; registered "translation" in METRICS_MAP.
Prompt config
nemo_skills/prompt/config/multilingual/__init__.py, nemo_skills/prompt/config/multilingual/segment-translation.yaml
New multilingual prompt package and a segment-translation YAML prompt template for segment-level translation.
Requirements
requirements/main.txt
Added external dependencies: langcodes, sacrebleu.
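The per-pair and aggregated BLEU bucketing described in the TranslationMetrics row above can be sketched as follows. This is a minimal stdlib-only sketch, not the actual implementation: the bucket keys (`xx->xx`, `{src}->xx`, `xx->{tgt}`) mirror the aggregation keys mentioned later in the review, and `score_pair` is a hypothetical stand-in for the real `sacrebleu.corpus_bleu(...).score` call.

```python
from collections import defaultdict

# Hypothetical stand-in for sacrebleu.corpus_bleu(...).score;
# the real metrics class delegates scoring to sacrebleu.
def score_pair(preds, refs):
    return 100.0 * sum(p == r for p, r in zip(preds, refs)) / len(preds)

def aggregate_bleu(per_pair):
    """per_pair maps 'src->tgt' keys to (predictions, references) lists."""
    metrics = {}
    buckets = defaultdict(list)
    for key, (preds, refs) in per_pair.items():
        bleu = score_pair(preds, refs)
        metrics[key] = {"bleu": bleu}
        src, tgt = key.split("->")
        # Aggregation buckets: overall, per-source, and per-target averages.
        buckets["xx->xx"].append(bleu)
        buckets[f"{src}->xx"].append(bleu)
        buckets[f"xx->{tgt}"].append(bleu)
    for key, scores in buckets.items():
        metrics[key] = {"bleu": sum(scores) / len(scores)}
    return metrics
```

A bucket average here is a simple mean of the per-pair corpus BLEU scores, which matches the aggregation behavior the review comments discuss below.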

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as Eval CLI
  participant Loader as Dataset Loader
  participant Prompt as Prompt Config
  participant Model as Model
  participant Metrics as TranslationMetrics

  User->>CLI: run eval (dataset=FLORES-200/WMT24PP, metrics=translation)
  CLI->>Loader: load JSONL samples
  CLI->>Prompt: load multilingual/segment-translation
  loop per sample
    CLI->>Model: send prompt(text, target_language)
    Model-->>CLI: hypothesis
    CLI->>Metrics: update(src_lang, tgt_lang, ref, hyp)
  end
  CLI->>Metrics: get_metrics()
  Metrics-->>CLI: BLEU per-pair and aggregated scores
  CLI-->>User: results
sequenceDiagram
  autonumber
  actor Dev
  participant Prep as prepare.py
  participant HF as HF Datasets
  participant FS as Filesystem

  Dev->>Prep: run prepare (--split, --source_languages/--target_languages)
  loop languages
    Prep->>HF: load dataset[split, lang_code]
    HF-->>Prep: parallel texts
  end
  Prep->>FS: write data/{split}.jsonl (one JSON object per src↔tgt pair)
  FS-->>Dev: JSONL ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A nibble of BLEU in a meadow of tongues,
I hop through FLORES where wmt24pp hums.
Prompts like clover, segments neatly spun,
Metrics bloom softly—another run done.
Twitching nose, I translate the sun. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 40.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped since CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title "MT datasets FLORES200 and WMT24pp" directly reflects the central change (support for two new machine translation datasets) and is concise and specific.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/dataset/wmt24pp/prepare.py (1)

36-36: Consider validating language code format.

Line 36 extracts the language code prefix using tgt_lang[:2], which assumes language codes follow the format "xx_XX" (e.g., "de_DE"). While the default values in line 62 confirm this format, there's no validation if a user provides a different format.

Consider adding validation or using a more robust approach:

+                        # Validate language code format
+                        if '_' not in tgt_lang or len(tgt_lang.split('_')[0]) != 2:
+                            raise ValueError(f"Invalid language code format: {tgt_lang}. Expected format: 'xx_XX'")
                         "target_lang_name": Language(tgt_lang[:2]).display_name()

Or, extract the language code more explicitly:

-                        "target_lang_name": Language(tgt_lang[:2]).display_name()
+                        "target_lang_name": Language(tgt_lang.split('_')[0] if '_' in tgt_lang else tgt_lang).display_name()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f94fdbd and 0856d8f.

📒 Files selected for processing (11)
  • README.md (1 hunks)
  • docs/evaluation/multilingual.md (3 hunks)
  • docs/index.md (1 hunks)
  • nemo_skills/dataset/flores200/__init__.py (1 hunks)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/__init__.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
  • TranslationMetrics (21-80)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (24-39)
  • main (42-55)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (28-43)
  • main (46-65)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (14)
docs/index.md (1)

24-25: LGTM!

The capitalization fix and the new multilingual capabilities bullet accurately reflect the expanded features introduced in this PR.

nemo_skills/prompt/config/multilingual/__init__.py (1)

1-13: LGTM!

This is a standard Python package marker file with the correct license header.

README.md (1)

20-20: LGTM!

The addition of FLORES-200 and wmt24pp to the multilingual evaluation examples accurately documents the new benchmarks introduced in this PR.

nemo_skills/prompt/config/multilingual/segment-translation.yaml (1)

1-3: LGTM!

The translation prompt template is clear, concise, and correctly uses placeholders for variable substitution.

nemo_skills/evaluation/metrics/map_metrics.py (2)

35-35: LGTM!

The import of TranslationMetrics follows the existing import pattern in this file.


59-59: LGTM!

The addition of the "translation": TranslationMetrics mapping integrates correctly with the existing metrics map structure.

docs/evaluation/multilingual.md (3)

3-3: LGTM!

Removing the "(to be added)" note correctly reflects that machine translation is now a present feature.


12-12: LGTM!

The wording improvement makes the benchmark reference clearer.


71-99: LGTM!

The new FLORES-200 and wmt24pp benchmark sections are well-structured, include all necessary information (definition, source, reference results), and follow the established documentation format.

nemo_skills/dataset/wmt24pp/__init__.py (1)

18-22: LGTM!

The evaluation configuration constants are well-defined and follow the established pattern from the flores200 dataset module.

nemo_skills/dataset/wmt24pp/prepare.py (2)

24-39: Good structure and use of strict zip.

The overall structure is clean and follows best practices:

  • Clear function separation between data writing and orchestration.
  • Use of strict=True in zip() (line 27-28) prevents silent data loss from length mismatches.
  • File path handling using Path is appropriate.

However, please address the critical split issue flagged separately.

Also applies to: 42-55, 58-66


46-46: Critical: Use CLI split argument when loading the dataset

-        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")[args.split]

Confirm that google/wmt24pp supports the requested split (e.g. "test" by default) and adjust --split choices if only "train" is available.

nemo_skills/dataset/flores200/prepare.py (1)

55-55: Re-run column check for Flores+ dataset
The sandbox can't import 🤗 datasets; please install it locally and run:

pip install datasets
python - <<'PY'
from datasets import load_dataset
print(load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="dev").column_names)
PY

Confirm whether the split uses "text" or "sentence" and update the indexing accordingly.

nemo_skills/evaluation/metrics/translation_metrics.py (1)

31-36: Verify sacrebleu tokenizer availability at runtime

  • Ensure your installed sacrebleu build includes ja-mecab (requires fugashi/unidic extras) and ko-mecab (requires mecab-ko support); otherwise install with the appropriate extras.
  • Guard on AVAILABLE_TOKENIZERS or catch missing-tokenizer errors and fall back to "13a" if ja-mecab/ko-mecab aren’t available.
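One way to implement the suggested fallback, sketched as pure string logic. The tokenizer names match sacrebleu's documented options; the `available` set is assumed to come from the installed sacrebleu build (e.g. its list of registered tokenizers), which is an assumption of this sketch rather than a confirmed API.

```python
def pick_tokenizer(tgt_lang: str, available: set[str]) -> str:
    """Choose a sacrebleu tokenizer for the target language, falling
    back to the default "13a" when the preferred one is unavailable
    (e.g. ja-mecab/ko-mecab extras not installed)."""
    # Normalize "ja", "ja_JP", "zh-CN", etc. down to the base code.
    base = tgt_lang.replace("-", "_").split("_")[0].lower()
    preferred = {"ja": "ja-mecab", "ko": "ko-mecab", "zh": "zh"}.get(base, "13a")
    return preferred if preferred in available else "13a"
```

Guarding at selection time keeps the corpus_bleu call itself simple and avoids catching unrelated exceptions during scoring.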

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0856d8f and 7985a85.

📒 Files selected for processing (1)
  • requirements/main.txt (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
requirements/main.txt

[error] 1-1: pre-commit: requirements-txt-fixer sorted/updated dependencies; hook failed.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
nemo_skills/evaluation/metrics/translation_metrics.py (3)

30-36: Consider using a library for language code normalization.

The current approach using tgt_lang[:2] to determine tokenization is fragile and assumes language codes follow a specific format. Language codes can be:

  • 2-character (ISO 639-1): "ja", "zh", "ko"
  • 3-character (ISO 639-3): "jpn", "cmn", "kor"
  • With region variants: "zh-CN", "zh-TW"

Consider using the langcodes library (already a dependency per the PR) to normalize language codes before selecting tokenization:

from langcodes import Language

# In get_metrics method:
tgt_lang_code = Language.get(tgt_lang).language  # Gets base language code
tokenize = "13a"
if tgt_lang_code == "ja":
    tokenize = "ja-mecab"
elif tgt_lang_code == "zh":
    tokenize = "zh"
elif tgt_lang_code == "ko":
    tokenize = "ko-mecab"

This would handle variants like "zh-CN", "zh-TW", and 3-character codes more reliably.


44-45: Add defensive check for empty aggregation.

While unlikely in practice, if self.aggregation_dict[key] is empty, line 45 would cause a division by zero error.

Apply this diff to add a defensive check:

 for key in self.aggregation_dict:
+    if not self.aggregation_dict[key]:
+        continue
     metrics_dict[key] = {"bleu": sum(self.aggregation_dict[key]) / len(self.aggregation_dict[key])}

58-65: Consider adding input validation.

The method assumes each prediction contains the expected keys (source_language, target_language, generation, translation). Adding validation would make debugging easier if the input format is incorrect.

Apply this diff to add validation:

 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    missing_keys = [key for key in required_keys if key not in pred]
+    if missing_keys:
+        raise ValueError(f"Prediction missing required keys: {missing_keys}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
nemo_skills/dataset/wmt24pp/prepare.py (1)

33-33: Consider more robust language code parsing.

The slicing tgt_lang[:2] assumes a specific format (e.g., "de_DE") and will fail or produce incorrect results if the format varies. The FLORES-200 preparation script uses Language(lang).display_name() which is more robust.

Consider this alternative:

-                    "target_lang_name": Language(tgt_lang[:2]).display_name(),
+                    "target_lang_name": Language(tgt_lang.split('_')[0]).display_name(),

Or validate the format at the start of the function:

def write_data_to_file(output_file, datasets, tgt_languages):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for tgt_lang in tgt_languages:
            lang_code = tgt_lang.split('_')[0] if '_' in tgt_lang else tgt_lang
            for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
                json_dict = {
                    "text": src,
                    "translation": tgt,
                    "source_language": "en",
                    "target_language": tgt_lang,
                    "source_lang_name": "English",
                    "target_lang_name": Language(lang_code).display_name(),
                }
                json.dump(json_dict, fout)
                fout.write("\n")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 868f91a and b9d9ac4.

📒 Files selected for processing (4)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (23-38)
  • main (41-54)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (1)
  • as_float (449-452)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (23-36)
  • main (39-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (7)
nemo_skills/evaluation/metrics/translation_metrics.py (3)

1-22: LGTM!

The imports and class definition are correct.


67-70: LGTM!

The reset method correctly initializes the data structures.


76-78: LGTM!

The metrics_to_print method correctly returns the BLEU metric with the as_float formatter.

nemo_skills/dataset/flores200/prepare.py (3)

23-38: LGTM! Good implementation with proper error handling.

The function correctly:

  • Iterates over all source-target language pairs
  • Skips identical language pairs (line 27)
  • Uses strict=True in zip to ensure data integrity
  • Properly derives language display names using the Language API

57-73: LGTM! Clear CLI interface with sensible defaults.

The argument parser provides:

  • Appropriate split choices for FLORES-200 dataset
  • Flexible source and target language configuration
  • Clear help messages
  • Reasonable default languages

41-54: Add error handling for invalid language codes and dataset loads
Wrap the ISO conversion (Language(...).to_alpha3()) and each load_dataset call in try/except, and verify returned datasets aren’t empty before writing.

nemo_skills/dataset/wmt24pp/prepare.py (1)

39-47: Critical: Replace hardcoded split with args.split

The CLI --split argument is defined but unused—your code always loads "train". Update:

- datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+ datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")[args.split]

Then confirm that google/wmt24pp actually provides the requested split (e.g. "test"), or adjust the CLI choices to match the dataset’s available splits.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)

30-36: Consider more robust language code detection.

Using [:2] slice assumes all language codes follow the 2-letter ISO 639-1 format. This may not handle region-specific codes (e.g., "zh-CN", "pt-BR") or longer codes correctly.

Since langcodes is already a project dependency (added in this PR), consider using it for more robust language detection:

-tokenize = "13a"
-if tgt_lang[:2] == "ja":
-    tokenize = "ja-mecab"
-if tgt_lang[:2] == "zh":
-    tokenize = "zh"
-if tgt_lang[:2] == "ko":
-    tokenize = "ko-mecab"
+import langcodes
+
+tokenize = "13a"
+lang_code = langcodes.Language.get(tgt_lang).language
+if lang_code == "ja":
+    tokenize = "ja-mecab"
+elif lang_code == "zh":
+    tokenize = "zh"
+elif lang_code == "ko":
+    tokenize = "ko-mecab"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9d9ac4 and 54b60a1.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (4)
nemo_skills/evaluation/metrics/translation_metrics.py (4)

1-20: LGTM!

The imports are appropriate and the license header is correct.


67-70: LGTM!

The reset logic correctly initializes the nested data structures.


72-74: Past review comment addressed.

The docstring has been updated from the previous version and no longer references irrelevant metrics like "majority/rm/pass". The current description accurately reflects the method's purpose.


76-78: LGTM!

The metrics formatting is correctly configured using the as_float formatter from the base module.

Comment on lines +58 to +65
for pred in predictions:
src_lang = pred["source_language"]
tgt_lang = pred["target_language"]
generation = pred["generation"]
ground_truth = pred["translation"]

self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation)
self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth)

⚠️ Potential issue | 🟠 Major

Add validation for required prediction keys.

The code assumes all predictions contain source_language, target_language, generation, and translation keys. If these keys are missing, a KeyError will be raised at runtime, potentially crashing the evaluation pipeline.

Consider adding validation:

 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    if not all(key in pred for key in required_keys):
+        raise ValueError(f"Prediction missing required keys. Expected {required_keys}, got {list(pred.keys())}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]

     self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation)
     self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth)

Alternatively, document the expected prediction schema in the docstring if validation is considered too strict for your use case.


@Kipok Kipok left a comment


just a few small comments from me. @shuoyangd could you please also do another review and if all looks good we can merge?

@Kipok Kipok requested a review from shuoyangd October 7, 2025 00:23
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
nemo_skills/evaluation/metrics/translation_metrics.py (2)

22-24: Create GitHub issue to track TODOs.

The TODO comments mention future refactoring and additional metrics. As requested in the past review, please create a GitHub issue to track these tasks.


61-68: Consider adding key validation.

The code assumes all predictions contain source_language, target_language, generation, and translation keys. If any key is missing, a KeyError will be raised.

As suggested in the previous review, consider adding validation:

 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    missing_keys = [k for k in required_keys if k not in pred]
+    if missing_keys:
+        raise ValueError(f"Prediction missing required keys: {missing_keys}. Got: {list(pred.keys())}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)

75-77: Docstring improved but could be more specific.

The docstring has been updated from the previous misleading version, which is good. However, it could be more precise about what the method returns.

Consider this more explicit docstring:

 def evaluations_to_print(self):
-    """Returns all translation pairs and aggregated multilingual dictionaries."""
+    """Returns all translation pair keys (e.g., 'en->de') and aggregation keys (e.g., 'xx->xx', 'en->xx') for logging."""
     return list(self.translation_dict.keys()) + list(self.aggregation_dict.keys())
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54b60a1 and e7b23bd.

📒 Files selected for processing (5)
  • docs/evaluation/index.md (2 hunks)
  • docs/evaluation/long-context.md (1 hunks)
  • docs/evaluation/multilingual.md (3 hunks)
  • docs/index.md (1 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/evaluation/long-context.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/index.md
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/multilingual.md

73-73: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)


79-79: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


107-107: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (10)
nemo_skills/evaluation/metrics/translation_metrics.py (3)

26-50: LGTM with a note on robustness.

The BLEU computation logic is clear and follows a reasonable pattern for language-specific tokenization. The aggregation into multilingual buckets (xx->xx, src->xx, xx->tgt) is well-structured.

Consider adding a length check to ensure preds and gts have the same length before calling corpus_bleu at line 41. While sacrebleu.corpus_bleu likely handles this gracefully, explicit validation would make debugging easier if length mismatches occur.

if len(preds) != len(gts):
    raise ValueError(f"Length mismatch for {key}: {len(preds)} predictions vs {len(gts)} ground truths")

70-73: LGTM!

The reset method correctly initializes the data structures using defaultdict for automatic key creation.


79-81: LGTM!

The metrics_to_print method correctly maps the BLEU metric to the as_float formatter, consistent with the base metrics pattern.

docs/evaluation/index.md (2)

12-12: LGTM!

The addition of FLORES-200 and wmt24pp references to the multilingual benchmarks list is accurate and properly formatted with correct links.


249-249: LGTM!

The documentation for creating new metrics classes remains accurate and helpful for contributors adding new benchmarks.

docs/evaluation/multilingual.md (5)

3-3: LGTM!

The updated description accurately reflects that multilingual benchmarks now include machine translation in addition to multilingual reasoning.


12-12: LGTM!

The path reference has been correctly updated to mmlu-prox/__init__.py.


71-72: LGTM!

The code block closing has been properly formatted with an additional newline for clarity.


73-144: Excellent comprehensive documentation for FLORES-200!

The FLORES-200 section provides:

  • Clear benchmark definition and source reference
  • Reference numbers table with multiple models
  • Runnable example commands for 4 different models with appropriate configurations

This addresses the past review comment requesting example commands and follows the established documentation pattern.


146-217: Excellent comprehensive documentation for wmt24pp!

The wmt24pp section provides:

  • Clear benchmark definition and source reference
  • Reference numbers table showing per-language and average BLEU scores
  • Runnable example commands for 4 different models

The documentation is consistent with the FLORES-200 section and provides all necessary information for users to run evaluations.

@Kipok Kipok merged commit 11df63d into main Oct 7, 2025
6 checks passed
@Kipok Kipok deleted the ohrinchuk/mt_datasets branch October 7, 2025 19:35
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>