
MT datasets FLORES200 and WMT24pp #892

Merged
Kipok merged 11 commits into main from ohrinchuk/mt_datasets on Oct 7, 2025

Conversation

@AlexGrinch (Collaborator) commented Oct 4, 2025

PR (#870) recreated from branch

  1. Addressed most of the comments from the last PR (removed unused imports, added docs, refactored wmt24pp to support arbitrary languages).
  2. Passing tokenize=None to corpus_bleu does not handle Japanese text correctly; left as is for now.
  3. Tested that evaluation with FLORES-200 and wmt24pp runs correctly on the cluster. Translation-metrics refactoring and tests are left for another PR.

Summary by CodeRabbit

  • New Features

    • BLEU-based translation evaluation and a multilingual segment-translation prompt.
    • Support for FLORES-200 and WMT24++ datasets with dataset preparation scripts and runnable example commands.
  • Documentation

    • Expanded multilingual evaluation guide with new dataset sections, benchmark corrections, sample result tables, and added "Multilingual capabilities" note.
    • Formatting fixes to code blocks and end-of-file newlines.
  • Chores

    • Added dependencies for language handling and BLEU scoring.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

coderabbitai bot commented Oct 4, 2025

Walkthrough

Adds multilingual translation evaluation: two new datasets (FLORES-200, wmt24pp) with prepare scripts and default configs, a segment-translation prompt, a TranslationMetrics implementation (BLEU via sacrebleu) registered in METRICS_MAP, docs updates, and new deps (langcodes, sacrebleu).

Changes

Cohort / File(s) Summary
Docs updates
README.md, docs/index.md, docs/evaluation/multilingual.md, docs/evaluation/index.md, docs/evaluation/long-context.md
Updated multilingual evaluation bullets (added FLORES-200, wmt24pp), removed "(to be added)", fixed mmlu-prox path/fencing, added benchmark sections, example commands/results, and minor formatting fixes.
FLORES-200 dataset
nemo_skills/dataset/flores200/__init__.py, nemo_skills/dataset/flores200/prepare.py
Added dataset default constants and a CLI prepare script that loads openlanguagedata/flores_plus, derives language codes, zips source/target texts, and writes JSONL with text, translation, language codes, and display names.
WMT24PP dataset
nemo_skills/dataset/wmt24pp/__init__.py, nemo_skills/dataset/wmt24pp/prepare.py
Added dataset default constants and a CLI prepare script that loads google/wmt24pp en→target splits, zips source/target texts, and writes JSONL with text, translation, language metadata, and display names.
Translation metrics
nemo_skills/evaluation/metrics/map_metrics.py, nemo_skills/evaluation/metrics/translation_metrics.py
New TranslationMetrics(BaseMetrics) using sacrebleu.corpus_bleu with language-specific tokenization; maintains per-pair and aggregated BLEU buckets; registered "translation" in METRICS_MAP.
Prompt config
nemo_skills/prompt/config/multilingual/__init__.py, nemo_skills/prompt/config/multilingual/segment-translation.yaml
New multilingual prompt package and a segment-translation YAML prompt template for segment-level translation.
Requirements
requirements/main.txt
Added external dependencies: langcodes, sacrebleu.
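The per-pair and aggregated BLEU bucketing described in the TranslationMetrics row above can be sketched as follows. This is a minimal stdlib-only sketch, not the actual implementation: the bucket keys (`xx->xx`, `{src}->xx`, `xx->{tgt}`) mirror the aggregation keys mentioned later in the review, and `score_pair` is a hypothetical stand-in for the real `sacrebleu.corpus_bleu(...).score` call.

```python
from collections import defaultdict

# Hypothetical stand-in for sacrebleu.corpus_bleu(...).score;
# the real metrics class delegates scoring to sacrebleu.
def score_pair(preds, refs):
    return 100.0 * sum(p == r for p, r in zip(preds, refs)) / len(preds)

def aggregate_bleu(per_pair):
    """per_pair maps 'src->tgt' keys to (predictions, references) lists."""
    metrics = {}
    buckets = defaultdict(list)
    for key, (preds, refs) in per_pair.items():
        bleu = score_pair(preds, refs)
        metrics[key] = {"bleu": bleu}
        src, tgt = key.split("->")
        # Aggregation buckets: overall, per-source, and per-target averages.
        buckets["xx->xx"].append(bleu)
        buckets[f"{src}->xx"].append(bleu)
        buckets[f"xx->{tgt}"].append(bleu)
    for key, scores in buckets.items():
        metrics[key] = {"bleu": sum(scores) / len(scores)}
    return metrics
```

A bucket average here is a simple mean of the per-pair corpus BLEU scores, which matches the aggregation behavior the review comments discuss below.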

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as Eval CLI
  participant Loader as Dataset Loader
  participant Prompt as Prompt Config
  participant Model as Model
  participant Metrics as TranslationMetrics

  User->>CLI: run eval (dataset=FLORES-200/WMT24PP, metrics=translation)
  CLI->>Loader: load JSONL samples
  CLI->>Prompt: load multilingual/segment-translation
  loop per sample
    CLI->>Model: send prompt(text, target_language)
    Model-->>CLI: hypothesis
    CLI->>Metrics: update(src_lang, tgt_lang, ref, hyp)
  end
  CLI->>Metrics: get_metrics()
  Metrics-->>CLI: BLEU per-pair and aggregated scores
  CLI-->>User: results
sequenceDiagram
  autonumber
  actor Dev
  participant Prep as prepare.py
  participant HF as HF Datasets
  participant FS as Filesystem

  Dev->>Prep: run prepare (--split, --source_languages/--target_languages)
  loop languages
    Prep->>HF: load dataset[split, lang_code]
    HF-->>Prep: parallel texts
  end
  Prep->>FS: write data/{split}.jsonl (one JSON object per src↔tgt pair)
  FS-->>Dev: JSONL ready

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

A nibble of BLEU in a meadow of tongues,
I hop through FLORES where wmt24pp hums.
Prompts like clover, segments neatly spun,
Metrics bloom softly—another run done.
Twitching nose, I translate the sun. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 40.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped since CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title "MT datasets FLORES200 and WMT24pp" directly reflects the central change (support for two new machine translation datasets) and is concise and specific.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/dataset/wmt24pp/prepare.py (1)

36-36: Consider validating language code format.

Line 36 extracts the language code prefix using tgt_lang[:2], which assumes language codes follow the format "xx_XX" (e.g., "de_DE"). While the default values in line 62 confirm this format, there's no validation if a user provides a different format.

Consider adding validation or using a more robust approach:

+                        # Validate language code format
+                        if '_' not in tgt_lang or len(tgt_lang.split('_')[0]) != 2:
+                            raise ValueError(f"Invalid language code format: {tgt_lang}. Expected format: 'xx_XX'")
                         "target_lang_name": Language(tgt_lang[:2]).display_name()

Or, extract the language code more explicitly:

-                        "target_lang_name": Language(tgt_lang[:2]).display_name()
+                        "target_lang_name": Language(tgt_lang.split('_')[0] if '_' in tgt_lang else tgt_lang).display_name()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f94fdbd and 0856d8f.

📒 Files selected for processing (11)
  • README.md (1 hunks)
  • docs/evaluation/multilingual.md (3 hunks)
  • docs/index.md (1 hunks)
  • nemo_skills/dataset/flores200/__init__.py (1 hunks)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/__init__.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
  • TranslationMetrics (21-80)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (24-39)
  • main (42-55)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (28-43)
  • main (46-65)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (14)
docs/index.md (1)

24-25: LGTM!

The capitalization fix and the new multilingual capabilities bullet accurately reflect the expanded features introduced in this PR.

nemo_skills/prompt/config/multilingual/__init__.py (1)

1-13: LGTM!

This is a standard Python package marker file with the correct license header.

README.md (1)

20-20: LGTM!

The addition of FLORES-200 and wmt24pp to the multilingual evaluation examples accurately documents the new benchmarks introduced in this PR.

nemo_skills/prompt/config/multilingual/segment-translation.yaml (1)

1-3: LGTM!

The translation prompt template is clear, concise, and correctly uses placeholders for variable substitution.

nemo_skills/evaluation/metrics/map_metrics.py (2)

35-35: LGTM!

The import of TranslationMetrics follows the existing import pattern in this file.


59-59: LGTM!

The addition of the "translation": TranslationMetrics mapping integrates correctly with the existing metrics map structure.

docs/evaluation/multilingual.md (3)

3-3: LGTM!

Removing the "(to be added)" note correctly reflects that machine translation is now a present feature.


12-12: LGTM!

The wording improvement makes the benchmark reference clearer.


71-99: LGTM!

The new FLORES-200 and wmt24pp benchmark sections are well-structured, include all necessary information (definition, source, reference results), and follow the established documentation format.

nemo_skills/dataset/wmt24pp/__init__.py (1)

18-22: LGTM!

The evaluation configuration constants are well-defined and follow the established pattern from the flores200 dataset module.

nemo_skills/dataset/wmt24pp/prepare.py (2)

24-39: Good structure and use of strict zip.

The overall structure is clean and follows best practices:

  • Clear function separation between data writing and orchestration.
  • Use of strict=True in zip() (line 27-28) prevents silent data loss from length mismatches.
  • File path handling using Path is appropriate.

However, please address the critical split issue flagged separately.

Also applies to: 42-55, 58-66


46-46: Critical: Use CLI split argument when loading the dataset

-        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")[args.split]

Confirm that google/wmt24pp supports the requested split (e.g. "test" by default) and adjust --split choices if only "train" is available.

nemo_skills/dataset/flores200/prepare.py (1)

55-55: Re-run column check for Flores+ dataset
The sandbox can't import 🤗 datasets; please install it locally and run:

pip install datasets
python - <<'PY'
from datasets import load_dataset
print(load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="dev").column_names)
PY

Confirm whether the split uses "text" or "sentence" and update the indexing accordingly.

nemo_skills/evaluation/metrics/translation_metrics.py (1)

31-36: Verify sacrebleu tokenizer availability at runtime

  • Ensure your installed sacrebleu build includes ja-mecab (requires fugashi/unidic extras) and ko-mecab (requires mecab-ko support); otherwise install with the appropriate extras.
  • Guard on AVAILABLE_TOKENIZERS or catch missing-tokenizer errors and fall back to "13a" if ja-mecab/ko-mecab aren’t available.
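One way to implement the suggested fallback, sketched as pure string logic. The tokenizer names match sacrebleu's documented options; the `available` set is assumed to come from the installed sacrebleu build (e.g. its list of registered tokenizers), which is an assumption of this sketch rather than a confirmed API.

```python
def pick_tokenizer(tgt_lang: str, available: set[str]) -> str:
    """Choose a sacrebleu tokenizer for the target language, falling
    back to the default "13a" when the preferred one is unavailable
    (e.g. ja-mecab/ko-mecab extras not installed)."""
    # Normalize "ja", "ja_JP", "zh-CN", etc. down to the base code.
    base = tgt_lang.replace("-", "_").split("_")[0].lower()
    preferred = {"ja": "ja-mecab", "ko": "ko-mecab", "zh": "zh"}.get(base, "13a")
    return preferred if preferred in available else "13a"
```

Guarding at selection time keeps the corpus_bleu call itself simple and avoids catching unrelated exceptions during scoring.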

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0856d8f and 7985a85.

📒 Files selected for processing (1)
  • requirements/main.txt (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
requirements/main.txt

[error] 1-1: pre-commit: requirements-txt-fixer sorted/updated dependencies; hook failed.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
nemo_skills/evaluation/metrics/translation_metrics.py (3)

30-36: Consider using a library for language code normalization.

The current approach using tgt_lang[:2] to determine tokenization is fragile and assumes language codes follow a specific format. Language codes can be:

  • 2-character (ISO 639-1): "ja", "zh", "ko"
  • 3-character (ISO 639-3): "jpn", "cmn", "kor"
  • With region variants: "zh-CN", "zh-TW"

Consider using the langcodes library (already a dependency per the PR) to normalize language codes before selecting tokenization:

from langcodes import Language

# In get_metrics method:
tgt_lang_code = Language.get(tgt_lang).language  # Gets base language code
tokenize = "13a"
if tgt_lang_code == "ja":
    tokenize = "ja-mecab"
elif tgt_lang_code == "zh":
    tokenize = "zh"
elif tgt_lang_code == "ko":
    tokenize = "ko-mecab"

This would handle variants like "zh-CN", "zh-TW", and 3-character codes more reliably.


44-45: Add defensive check for empty aggregation.

While unlikely in practice, if self.aggregation_dict[key] is empty, line 45 would cause a division by zero error.

Apply this diff to add a defensive check:

 for key in self.aggregation_dict:
+    if not self.aggregation_dict[key]:
+        continue
     metrics_dict[key] = {"bleu": sum(self.aggregation_dict[key]) / len(self.aggregation_dict[key])}

58-65: Consider adding input validation.

The method assumes each prediction contains the expected keys (source_language, target_language, generation, translation). Adding validation would make debugging easier if the input format is incorrect.

Apply this diff to add validation:

 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    missing_keys = [key for key in required_keys if key not in pred]
+    if missing_keys:
+        raise ValueError(f"Prediction missing required keys: {missing_keys}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
nemo_skills/dataset/wmt24pp/prepare.py (1)

33-33: Consider more robust language code parsing.

The slicing tgt_lang[:2] assumes a specific format (e.g., "de_DE") and will fail or produce incorrect results if the format varies. The FLORES-200 preparation script uses Language(lang).display_name() which is more robust.

Consider this alternative:

-                    "target_lang_name": Language(tgt_lang[:2]).display_name(),
+                    "target_lang_name": Language(tgt_lang.split('_')[0]).display_name(),

Or validate the format at the start of the function:

def write_data_to_file(output_file, datasets, tgt_languages):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for tgt_lang in tgt_languages:
            lang_code = tgt_lang.split('_')[0] if '_' in tgt_lang else tgt_lang
            for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
                json_dict = {
                    "text": src,
                    "translation": tgt,
                    "source_language": "en",
                    "target_language": tgt_lang,
                    "source_lang_name": "English",
                    "target_lang_name": Language(lang_code).display_name(),
                }
                json.dump(json_dict, fout)
                fout.write("\n")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 868f91a and b9d9ac4.

📒 Files selected for processing (4)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/prompt/config/multilingual/segment-translation.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (23-38)
  • main (41-54)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (1)
  • as_float (449-452)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (23-36)
  • main (39-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (7)
nemo_skills/evaluation/metrics/translation_metrics.py (3)

1-22: LGTM!

The imports and class definition are correct.


67-70: LGTM!

The reset method correctly initializes the data structures.


76-78: LGTM!

The metrics_to_print method correctly returns the BLEU metric with the as_float formatter.

nemo_skills/dataset/flores200/prepare.py (3)

23-38: LGTM! Good implementation with proper error handling.

The function correctly:

  • Iterates over all source-target language pairs
  • Skips identical language pairs (line 27)
  • Uses strict=True in zip to ensure data integrity
  • Properly derives language display names using the Language API

57-73: LGTM! Clear CLI interface with sensible defaults.

The argument parser provides:

  • Appropriate split choices for FLORES-200 dataset
  • Flexible source and target language configuration
  • Clear help messages
  • Reasonable default languages

41-54: Add error handling for invalid language codes and dataset loads
Wrap the ISO conversion (Language(...).to_alpha3()) and each load_dataset call in try/except, and verify returned datasets aren’t empty before writing.

nemo_skills/dataset/wmt24pp/prepare.py (1)

39-47: Critical: Replace hardcoded split with args.split

The CLI --split argument is defined but unused—your code always loads "train". Update:

- datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+ datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")[args.split]

Then confirm that google/wmt24pp actually provides the requested split (e.g. "test"), or adjust the CLI choices to match the dataset’s available splits.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)

30-36: Consider more robust language code detection.

Using [:2] slice assumes all language codes follow the 2-letter ISO 639-1 format. This may not handle region-specific codes (e.g., "zh-CN", "pt-BR") or longer codes correctly.

Since langcodes is already a project dependency (added in this PR), consider using it for more robust language detection:

-tokenize = "13a"
-if tgt_lang[:2] == "ja":
-    tokenize = "ja-mecab"
-if tgt_lang[:2] == "zh":
-    tokenize = "zh"
-if tgt_lang[:2] == "ko":
-    tokenize = "ko-mecab"
+import langcodes
+
+tokenize = "13a"
+lang_code = langcodes.Language.get(tgt_lang).language
+if lang_code == "ja":
+    tokenize = "ja-mecab"
+elif lang_code == "zh":
+    tokenize = "zh"
+elif lang_code == "ko":
+    tokenize = "ko-mecab"
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9d9ac4 and 54b60a1.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (4)
nemo_skills/evaluation/metrics/translation_metrics.py (4)

1-20: LGTM!

The imports are appropriate and the license header is correct.


67-70: LGTM!

The reset logic correctly initializes the nested data structures.


72-74: Past review comment addressed.

The docstring has been updated from the previous version and no longer references irrelevant metrics like "majority/rm/pass". The current description accurately reflects the method's purpose.


76-78: LGTM!

The metrics formatting is correctly configured using the as_float formatter from the base module.

Comment on lines +58 to +65
for pred in predictions:
src_lang = pred["source_language"]
tgt_lang = pred["target_language"]
generation = pred["generation"]
ground_truth = pred["translation"]

self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation)
self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth)

⚠️ Potential issue | 🟠 Major

Add validation for required prediction keys.

The code assumes all predictions contain source_language, target_language, generation, and translation keys. If these keys are missing, a KeyError will be raised at runtime, potentially crashing the evaluation pipeline.

Consider adding validation:

 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    if not all(key in pred for key in required_keys):
+        raise ValueError(f"Prediction missing required keys. Expected {required_keys}, got {list(pred.keys())}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]

     self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation)
     self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth)

Alternatively, document the expected prediction schema in the docstring if validation is considered too strict for your use case.


@Kipok Kipok left a comment


just a few small comments from me. @shuoyangd could you please also do another review and if all looks good we can merge?

@Kipok Kipok requested a review from shuoyangd October 7, 2025 00:23
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
nemo_skills/evaluation/metrics/translation_metrics.py (2)

22-24: Create GitHub issue to track TODOs.

The TODO comments mention future refactoring and additional metrics. As requested in the past review, please create a GitHub issue to track these tasks.


61-68: Consider adding key validation.

The code assumes all predictions contain source_language, target_language, generation, and translation keys. If any key is missing, a KeyError will be raised.

As suggested in the previous review, consider adding validation:

 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    missing_keys = [k for k in required_keys if k not in pred]
+    if missing_keys:
+        raise ValueError(f"Prediction missing required keys: {missing_keys}. Got: {list(pred.keys())}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)

75-77: Docstring improved but could be more specific.

The docstring has been updated from the previous misleading version, which is good. However, it could be more precise about what the method returns.

Consider this more explicit docstring:

 def evaluations_to_print(self):
-    """Returns all translation pairs and aggregated multilingual dictionaries."""
+    """Returns all translation pair keys (e.g., 'en->de') and aggregation keys (e.g., 'xx->xx', 'en->xx') for logging."""
     return list(self.translation_dict.keys()) + list(self.aggregation_dict.keys())
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54b60a1 and e7b23bd.

📒 Files selected for processing (5)
  • docs/evaluation/index.md (2 hunks)
  • docs/evaluation/long-context.md (1 hunks)
  • docs/evaluation/multilingual.md (3 hunks)
  • docs/index.md (1 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • docs/evaluation/long-context.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/index.md
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (23-434)
  • as_float (449-452)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/multilingual.md

73-73: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)


79-79: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


107-107: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (10)
nemo_skills/evaluation/metrics/translation_metrics.py (3)

26-50: LGTM with a note on robustness.

The BLEU computation logic is clear and follows a reasonable pattern for language-specific tokenization. The aggregation into multilingual buckets (xx->xx, src->xx, xx->tgt) is well-structured.

Consider adding a length check to ensure preds and gts have the same length before calling corpus_bleu at line 41. While sacrebleu.corpus_bleu likely handles this gracefully, explicit validation would make debugging easier if length mismatches occur.

if len(preds) != len(gts):
    raise ValueError(f"Length mismatch for {key}: {len(preds)} predictions vs {len(gts)} ground truths")

70-73: LGTM!

The reset method correctly initializes the data structures using defaultdict for automatic key creation.


79-81: LGTM!

The metrics_to_print method correctly maps the BLEU metric to the as_float formatter, consistent with the base metrics pattern.

docs/evaluation/index.md (2)

12-12: LGTM!

The addition of FLORES-200 and wmt24pp references to the multilingual benchmarks list is accurate and properly formatted with correct links.


249-249: LGTM!

The documentation for creating new metrics classes remains accurate and helpful for contributors adding new benchmarks.

docs/evaluation/multilingual.md (5)

3-3: LGTM!

The updated description accurately reflects that multilingual benchmarks now include machine translation in addition to multilingual reasoning.


12-12: LGTM!

The path reference has been correctly updated to mmlu-prox/__init__.py.


71-72: LGTM!

The code block closing has been properly formatted with an additional newline for clarity.


73-144: Excellent comprehensive documentation for FLORES-200!

The FLORES-200 section provides:

  • Clear benchmark definition and source reference
  • Reference numbers table with multiple models
  • Runnable example commands for 4 different models with appropriate configurations

This addresses the past review comment requesting example commands and follows the established documentation pattern.


146-217: Excellent comprehensive documentation for wmt24pp!

The wmt24pp section provides:

  • Clear benchmark definition and source reference
  • Reference numbers table showing per-language and average BLEU scores
  • Runnable example commands for 4 different models

The documentation is consistent with the FLORES-200 section and provides all necessary information for users to run evaluations.

@Kipok Kipok merged commit 11df63d into main Oct 7, 2025
6 checks passed
@Kipok Kipok deleted the ohrinchuk/mt_datasets branch October 7, 2025 19:35
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>