Conversation
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Walkthrough

Adds multilingual translation evaluation: two new datasets (FLORES-200, wmt24pp) with prepare scripts and default configs, a segment-translation prompt, a `TranslationMetrics` implementation (BLEU via sacrebleu) registered in `METRICS_MAP`, docs updates, and new deps.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as Eval CLI
    participant Loader as Dataset Loader
    participant Prompt as Prompt Config
    participant Model as Model
    participant Metrics as TranslationMetrics
    User->>CLI: run eval (dataset=FLORES-200/WMT24PP, metrics=translation)
    CLI->>Loader: load JSONL samples
    CLI->>Prompt: load multilingual/segment-translation
    loop per sample
        CLI->>Model: send prompt(text, target_language)
        Model-->>CLI: hypothesis
        CLI->>Metrics: update(src_lang, tgt_lang, ref, hyp)
    end
    CLI->>Metrics: get_metrics()
    Metrics-->>CLI: BLEU per-pair and aggregated scores
    CLI-->>User: results
```
```mermaid
sequenceDiagram
    autonumber
    actor Dev
    participant Prep as prepare.py
    participant HF as HF Datasets
    participant FS as Filesystem
    Dev->>Prep: run prepare (--split, --source_languages/--target_languages)
    loop languages
        Prep->>HF: load dataset[split, lang_code]
        HF-->>Prep: parallel texts
    end
    Prep->>FS: write data/{split}.jsonl (one JSON object per src-tgt pair)
    FS-->>Dev: JSONL ready
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
- ❌ Failed checks (1 warning)
- ✅ Passed checks (2 passed)
- ✨ Finishing touches
- 🧪 Generate unit tests (beta)
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/dataset/wmt24pp/prepare.py (1)
36-36: Consider validating language code format. Line 36 extracts the language code prefix using `tgt_lang[:2]`, which assumes language codes follow the format `"xx_XX"` (e.g., `"de_DE"`). While the default values in line 62 confirm this format, there's no validation if a user provides a different format. Consider adding validation or using a more robust approach:

```diff
+ # Validate language code format
+ if '_' not in tgt_lang or len(tgt_lang.split('_')[0]) != 2:
+     raise ValueError(f"Invalid language code format: {tgt_lang}. Expected format: 'xx_XX'")
  "target_lang_name": Language(tgt_lang[:2]).display_name()
```

Or, extract the language code more explicitly:

```diff
- "target_lang_name": Language(tgt_lang[:2]).display_name()
+ lang_code = tgt_lang.split('_')[0] if '_' in tgt_lang else tgt_lang[:2]
+ "target_lang_name": Language(lang_code).display_name()
```
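A minimal standalone sketch of such a validator (a hypothetical helper; the regex encodes the "xx_XX" assumption discussed above, and nothing here is taken from the repository's actual code):

```python
import re

# Expected shape: two lowercase letters, underscore, two uppercase letters.
LANG_CODE_RE = re.compile(r"[a-z]{2}_[A-Z]{2}")

def base_lang_code(tgt_lang: str) -> str:
    """Return the two-letter base code, validating the expected format first."""
    if not LANG_CODE_RE.fullmatch(tgt_lang):
        raise ValueError(f"Invalid language code format: {tgt_lang!r}. Expected 'xx_XX'.")
    return tgt_lang.split("_")[0]

print(base_lang_code("de_DE"))  # de
```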
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- README.md (1 hunks)
- docs/evaluation/multilingual.md (3 hunks)
- docs/index.md (1 hunks)
- nemo_skills/dataset/flores200/__init__.py (1 hunks)
- nemo_skills/dataset/flores200/prepare.py (1 hunks)
- nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
- nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
- nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
- nemo_skills/prompt/config/multilingual/__init__.py (1 hunks)
- nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/metrics/map_metrics.py (1)
- nemo_skills/evaluation/metrics/translation_metrics.py (1): TranslationMetrics (21-80)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
- nemo_skills/evaluation/metrics/base.py (2): BaseMetrics (23-434), as_float (449-452)
nemo_skills/dataset/flores200/prepare.py (1)
- nemo_skills/dataset/wmt24pp/prepare.py (2): write_data_to_file (24-39), main (42-55)
nemo_skills/dataset/wmt24pp/prepare.py (1)
- nemo_skills/dataset/flores200/prepare.py (2): write_data_to_file (28-43), main (46-65)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (14)
docs/index.md (1)
24-25: LGTM! The capitalization fix and the new multilingual capabilities bullet accurately reflect the expanded features introduced in this PR.
nemo_skills/prompt/config/multilingual/__init__.py (1)
1-13: LGTM! This is a standard Python package marker file with the correct license header.
README.md (1)
20-20: LGTM! The addition of FLORES-200 and wmt24pp to the multilingual evaluation examples accurately documents the new benchmarks introduced in this PR.
nemo_skills/prompt/config/multilingual/segment-translation.yaml (1)
1-3: LGTM! The translation prompt template is clear, concise, and correctly uses placeholders for variable substitution.
nemo_skills/evaluation/metrics/map_metrics.py (2)
35-35: LGTM! The import of `TranslationMetrics` follows the existing import pattern in this file.
59-59: LGTM! The addition of the `"translation": TranslationMetrics` mapping integrates correctly with the existing metrics map structure.

docs/evaluation/multilingual.md (3)
3-3: LGTM! Removing the "(to be added)" note correctly reflects that machine translation is now a present feature.
12-12: LGTM! The wording improvement makes the benchmark reference clearer.
71-99: LGTM! The new FLORES-200 and wmt24pp benchmark sections are well-structured, include all necessary information (definition, source, reference results), and follow the established documentation format.
nemo_skills/dataset/wmt24pp/__init__.py (1)
18-22: LGTM! The evaluation configuration constants are well-defined and follow the established pattern from the flores200 dataset module.
nemo_skills/dataset/wmt24pp/prepare.py (2)
24-39: Good structure and use of strict zip. The overall structure is clean and follows best practices:
- Clear function separation between data writing and orchestration.
- Use of `strict=True` in `zip()` (lines 27-28) prevents silent data loss from length mismatches.
- File path handling using `Path` is appropriate.

However, please address the critical split issue flagged separately.
Also applies to: 42-55, 58-66
46-46: Critical: Use CLI split argument when loading the dataset

```diff
- datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+ datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")[args.split]
```

Confirm that `google/wmt24pp` supports the requested split (e.g. `"test"` by default) and adjust `--split` choices if only `"train"` is available.

nemo_skills/dataset/flores200/prepare.py (1)
55-55: Re-run column check for Flores+ dataset

The sandbox can't import 🤗 datasets; please install it locally and run:

```shell
pip install datasets
python - <<'PY'
from datasets import load_dataset
print(load_dataset("openlanguagedata/flores_plus", "eng_Latn", split="dev").column_names)
PY
```

Confirm whether the split uses `"text"` or `"sentence"` and update the indexing accordingly.

nemo_skills/evaluation/metrics/translation_metrics.py (1)
31-36: Verify sacrebleu tokenizer availability at runtime
- Ensure your installed sacrebleu build includes `ja-mecab` (requires fugashi/unidic extras) and `ko-mecab` (requires mecab-ko support); otherwise install with the appropriate extras.
- Guard on `AVAILABLE_TOKENIZERS` or catch missing-tokenizer errors and fall back to `"13a"` if `ja-mecab`/`ko-mecab` aren't available.
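The suggested fallback can be sketched without importing sacrebleu at all. The `available` set below is a stand-in for sacrebleu's real tokenizer registry, which the real code would query; everything here is an illustrative assumption:

```python
# Language-specific tokenizers preferred for BLEU, keyed by 2-letter code.
PREFERRED = {"ja": "ja-mecab", "zh": "zh", "ko": "ko-mecab"}

def pick_tokenizer(tgt_lang: str, available: set) -> str:
    """Pick a language-specific tokenizer only if installed, else default to 13a."""
    preferred = PREFERRED.get(tgt_lang[:2], "13a")
    return preferred if preferred in available else "13a"

print(pick_tokenizer("ja_JP", {"13a", "ja-mecab"}))  # ja-mecab
print(pick_tokenizer("ko_KR", {"13a"}))              # 13a (fallback)
```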
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- requirements/main.txt (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: Lint and Format
requirements/main.txt
[error] 1-1: pre-commit: requirements-txt-fixer sorted/updated dependencies; hook failed.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: unit-tests
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
…o-Skills into ohrinchuk/mt_datasets
Actionable comments posted: 1
🧹 Nitpick comments (4)
nemo_skills/evaluation/metrics/translation_metrics.py (3)
30-36: Consider using a library for language code normalization. The current approach using `tgt_lang[:2]` to determine tokenization is fragile and assumes language codes follow a specific format. Language codes can be:
- 2-character (ISO 639-1): "ja", "zh", "ko"
- 3-character (ISO 639-3): "jpn", "cmn", "kor"
- With region variants: "zh-CN", "zh-TW"

Consider using the `langcodes` library (already a dependency per the PR) to normalize language codes before selecting tokenization:

```python
from langcodes import Language

# In get_metrics method:
tgt_lang_code = Language.get(tgt_lang).language  # Gets base language code
tokenize = "13a"
if tgt_lang_code == "ja":
    tokenize = "ja-mecab"
elif tgt_lang_code == "zh":
    tokenize = "zh"
elif tgt_lang_code == "ko":
    tokenize = "ko-mecab"
```

This would handle variants like "zh-CN", "zh-TW", and 3-character codes more reliably.
44-45: Add defensive check for empty aggregation. While unlikely in practice, if `self.aggregation_dict[key]` is empty, line 45 would cause a division by zero error. Apply this diff to add a defensive check:

```diff
 for key in self.aggregation_dict:
+    if not self.aggregation_dict[key]:
+        continue
     metrics_dict[key] = {"bleu": sum(self.aggregation_dict[key]) / len(self.aggregation_dict[key])}
```
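The guarded aggregation can be exercised in isolation. The names mirror the snippet above but the surrounding metrics class is omitted, so this is only an illustrative sketch:

```python
from collections import defaultdict

aggregation_dict = defaultdict(list)
aggregation_dict["en->de"].extend([30.2, 28.7])
aggregation_dict["en->ja"]  # touched but never filled -> empty bucket

metrics_dict = {}
for key, scores in aggregation_dict.items():
    # Skip empty buckets so the mean never divides by zero.
    if not scores:
        continue
    metrics_dict[key] = {"bleu": sum(scores) / len(scores)}

print(metrics_dict)
```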
58-65: Consider adding input validation. The method assumes each prediction contains the expected keys (`source_language`, `target_language`, `generation`, `translation`). Adding validation would make debugging easier if the input format is incorrect. Apply this diff to add validation:

```diff
 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    missing_keys = [key for key in required_keys if key not in pred]
+    if missing_keys:
+        raise ValueError(f"Prediction missing required keys: {missing_keys}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
```

nemo_skills/dataset/wmt24pp/prepare.py (1)
33-33: Consider more robust language code parsing. The slicing `tgt_lang[:2]` assumes a specific format (e.g., "de_DE") and will fail or produce incorrect results if the format varies. The FLORES-200 preparation script uses `Language(lang).display_name()` which is more robust. Consider this alternative:

```diff
- "target_lang_name": Language(tgt_lang[:2]).display_name(),
+ "target_lang_name": Language(tgt_lang.split('_')[0]).display_name(),
```

Or validate the format at the start of the function:

```python
def write_data_to_file(output_file, datasets, tgt_languages):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for tgt_lang in tgt_languages:
            lang_code = tgt_lang.split('_')[0] if '_' in tgt_lang else tgt_lang
            for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
                json_dict = {
                    "text": src,
                    "translation": tgt,
                    "source_language": "en",
                    "target_language": tgt_lang,
                    "source_lang_name": "English",
                    "target_lang_name": Language(lang_code).display_name(),
                }
                json.dump(json_dict, fout)
                fout.write("\n")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- nemo_skills/dataset/flores200/prepare.py (1 hunks)
- nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
- nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
- nemo_skills/prompt/config/multilingual/segment-translation.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/prompt/config/multilingual/segment-translation.yaml
🧰 Additional context used
🧬 Code graph analysis (3)
nemo_skills/dataset/wmt24pp/prepare.py (1)
- nemo_skills/dataset/flores200/prepare.py (2): write_data_to_file (23-38), main (41-54)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
- nemo_skills/evaluation/metrics/base.py (1): as_float (449-452)
nemo_skills/dataset/flores200/prepare.py (1)
- nemo_skills/dataset/wmt24pp/prepare.py (2): write_data_to_file (23-36), main (39-47)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (7)
nemo_skills/evaluation/metrics/translation_metrics.py (3)
1-22: LGTM! The imports and class definition are correct.
67-70: LGTM! The reset method correctly initializes the data structures.
76-78: LGTM! The `metrics_to_print` method correctly returns the BLEU metric with the `as_float` formatter.
nemo_skills/dataset/flores200/prepare.py (3)
23-38: LGTM! Good implementation with proper error handling. The function correctly:
- Iterates over all source-target language pairs
- Skips identical language pairs (line 27)
- Uses `strict=True` in zip to ensure data integrity
- Properly derives language display names using the `Language` API
57-73: LGTM! Clear CLI interface with sensible defaults. The argument parser provides:
- Appropriate split choices for the FLORES-200 dataset
- Flexible source and target language configuration
- Clear help messages
- Reasonable default languages
41-54: Add error handling for invalid language codes and dataset loads

Wrap the ISO conversion (`Language(...).to_alpha3()`) and each `load_dataset` call in try/except, and verify returned datasets aren't empty before writing.

nemo_skills/dataset/wmt24pp/prepare.py (1)
39-47: Critical: Replace hardcoded split with `args.split`

The CLI `--split` argument is defined but unused; your code always loads `"train"`. Update:

```diff
- datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+ datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")[args.split]
```

Then confirm that `google/wmt24pp` actually provides the requested split (e.g. `"test"`), or adjust the CLI choices to match the dataset's available splits.
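A self-contained sketch of wiring `--split` through to the loader. The HF loader is replaced by a stub here, since the real `load_dataset` call needs network access; only the `args.split` indexing is the point, and all names are illustrative:

```python
import argparse

def fake_load_dataset(name, config):
    # Stub standing in for datasets.load_dataset; returns a split-keyed dict.
    return {"train": ["train rows"], "test": ["test rows"]}

def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--split", default="test", choices=("train", "test"))
    args = parser.parse_args(argv)
    # The fix: index by args.split instead of a hardcoded "train".
    return fake_load_dataset("google/wmt24pp", "en-de_DE")[args.split]

print(main(["--split", "test"]))
```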
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
30-36: Consider more robust language code detection. Using a `[:2]` slice assumes all language codes follow the 2-letter ISO 639-1 format. This may not handle region-specific codes (e.g., "zh-CN", "pt-BR") or longer codes correctly. Since `langcodes` is already a project dependency (added in this PR), consider using it for more robust language detection:

```diff
-tokenize = "13a"
-if tgt_lang[:2] == "ja":
-    tokenize = "ja-mecab"
-if tgt_lang[:2] == "zh":
-    tokenize = "zh"
-if tgt_lang[:2] == "ko":
-    tokenize = "ko-mecab"
+import langcodes
+
+tokenize = "13a"
+lang_code = langcodes.Language.get(tgt_lang).language
+if lang_code == "ja":
+    tokenize = "ja-mecab"
+elif lang_code == "zh":
+    tokenize = "zh"
+elif lang_code == "ko":
+    tokenize = "ko-mecab"
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
- nemo_skills/evaluation/metrics/base.py (2): BaseMetrics (23-434), as_float (449-452)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (4)
nemo_skills/evaluation/metrics/translation_metrics.py (4)
1-20: LGTM! The imports are appropriate and the license header is correct.
67-70: LGTM! The reset logic correctly initializes the nested data structures.
72-74: Past review comment addressed. The docstring has been updated from the previous version and no longer references irrelevant metrics like "majority/rm/pass". The current description accurately reflects the method's purpose.
76-78: LGTM! The metrics formatting is correctly configured using the `as_float` formatter from the base module.
```python
for pred in predictions:
    src_lang = pred["source_language"]
    tgt_lang = pred["target_language"]
    generation = pred["generation"]
    ground_truth = pred["translation"]

    self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation)
    self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth)
```
Add validation for required prediction keys.
The code assumes all predictions contain the `source_language`, `target_language`, `generation`, and `translation` keys. If these keys are missing, a `KeyError` will be raised at runtime, potentially crashing the evaluation pipeline.
Consider adding validation:
```diff
 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    if not all(key in pred for key in required_keys):
+        raise ValueError(f"Prediction missing required keys. Expected {required_keys}, got {list(pred.keys())}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
     self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation)
     self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth)
```

Alternatively, document the expected prediction schema in the docstring if validation is considered too strict for your use case.
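The same check can also live in a small standalone helper (hypothetical name; the point is raising a descriptive `ValueError` instead of a bare `KeyError` deep in the loop):

```python
# Keys every prediction record is expected to carry (per the discussion above).
REQUIRED_KEYS = ("source_language", "target_language", "generation", "translation")

def validate_prediction(pred: dict) -> dict:
    """Raise early, with the missing key names, if the schema is violated."""
    missing = [k for k in REQUIRED_KEYS if k not in pred]
    if missing:
        raise ValueError(f"Prediction missing required keys: {missing}")
    return pred

ok = validate_prediction({
    "source_language": "en", "target_language": "de",
    "generation": "Hallo", "translation": "Hallo Welt",
})
print(ok["target_language"])  # de
```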
🤖 Prompt for AI Agents
In nemo_skills/evaluation/metrics/translation_metrics.py around lines 58 to 65,
the loop assumes each prediction dict contains source_language, target_language,
generation, and translation which can raise KeyError; add validation at the top
of the loop to verify those keys exist (e.g., check set membership or try/except
KeyError), and handle missing keys by either skipping that prediction with a
warning/log entry that includes the missing keys and the offending prediction,
or by raising a clear ValueError with a descriptive message; ensure downstream
code only appends when all required fields are present.
Kipok left a comment:
just a few small comments from me. @shuoyangd could you please also do another review and if all looks good we can merge?
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Actionable comments posted: 0
♻️ Duplicate comments (2)
nemo_skills/evaluation/metrics/translation_metrics.py (2)
22-24: Create GitHub issue to track TODOs. The TODO comments mention future refactoring and additional metrics. As requested in the past review, please create a GitHub issue to track these tasks.
61-68: Consider adding key validation. The code assumes all predictions contain `source_language`, `target_language`, `generation`, and `translation` keys. If any key is missing, a `KeyError` will be raised. As suggested in the previous review, consider adding validation:

```diff
 for pred in predictions:
+    required_keys = ["source_language", "target_language", "generation", "translation"]
+    missing_keys = [k for k in required_keys if k not in pred]
+    if missing_keys:
+        raise ValueError(f"Prediction missing required keys: {missing_keys}. Got: {list(pred.keys())}")
     src_lang = pred["source_language"]
     tgt_lang = pred["target_language"]
     generation = pred["generation"]
     ground_truth = pred["translation"]
```
🧹 Nitpick comments (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
75-77: Docstring improved but could be more specific. The docstring has been updated from the previous misleading version, which is good. However, it could be more precise about what the method returns. Consider this more explicit docstring:

```diff
 def evaluations_to_print(self):
-    """Returns all translation pairs and aggregated multilingual dictionaries."""
+    """Returns all translation pair keys (e.g., 'en->de') and aggregation keys (e.g., 'xx->xx', 'en->xx') for logging."""
     return list(self.translation_dict.keys()) + list(self.aggregation_dict.keys())
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- docs/evaluation/index.md (2 hunks)
- docs/evaluation/long-context.md (1 hunks)
- docs/evaluation/multilingual.md (3 hunks)
- docs/index.md (1 hunks)
- nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- docs/evaluation/long-context.md
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/index.md
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
- nemo_skills/evaluation/metrics/base.py (2): BaseMetrics (23-434), as_float (449-452)
🪛 markdownlint-cli2 (0.18.1)
docs/evaluation/multilingual.md
73-73: Code block style
Expected: fenced; Actual: indented
(MD046, code-block-style)
79-79: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
107-107: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (10)
nemo_skills/evaluation/metrics/translation_metrics.py (3)
26-50: LGTM with a note on robustness. The BLEU computation logic is clear and follows a reasonable pattern for language-specific tokenization. The aggregation into multilingual buckets (xx->xx, src->xx, xx->tgt) is well-structured.

Consider adding a length check to ensure `preds` and `gts` have the same length before calling `corpus_bleu` at line 41. While `sacrebleu.corpus_bleu` likely handles this gracefully, explicit validation would make debugging easier if length mismatches occur.

```python
if len(preds) != len(gts):
    raise ValueError(f"Length mismatch for {key}: {len(preds)} predictions vs {len(gts)} ground truths")
```
70-73: LGTM! The reset method correctly initializes the data structures using `defaultdict` for automatic key creation.
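A minimal sketch of that reset/update pattern (illustrative names; `defaultdict` auto-creates a per-language-pair bucket of prediction and reference lists on first access):

```python
from collections import defaultdict

# Each "src->tgt" key gets its own {"preds": [...], "gts": [...]} bucket.
translation_dict = defaultdict(lambda: {"preds": [], "gts": []})

predictions = [{
    "source_language": "en", "target_language": "de",
    "generation": "Hallo!", "translation": "Hallo",
}]
for pred in predictions:
    key = f"{pred['source_language']}->{pred['target_language']}"
    translation_dict[key]["preds"].append(pred["generation"])
    translation_dict[key]["gts"].append(pred["translation"])

print(dict(translation_dict))
```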
79-81: LGTM! The `metrics_to_print` method correctly maps the BLEU metric to the `as_float` formatter, consistent with the base metrics pattern.

docs/evaluation/index.md (2)
12-12: LGTM! The addition of FLORES-200 and wmt24pp references to the multilingual benchmarks list is accurate and properly formatted with correct links.
249-249: LGTM! The documentation for creating new metrics classes remains accurate and helpful for contributors adding new benchmarks.
docs/evaluation/multilingual.md (5)
3-3: LGTM! The updated description accurately reflects that multilingual benchmarks now include machine translation in addition to multilingual reasoning.
12-12: LGTM! The path reference has been correctly updated to `mmlu-prox/__init__.py`.
71-72: LGTM! The code block closing has been properly formatted with an additional newline for clarity.
73-144: Excellent comprehensive documentation for FLORES-200! The FLORES-200 section provides:
- Clear benchmark definition and source reference
- Reference numbers table with multiple models
- Runnable example commands for 4 different models with appropriate configurations

This addresses the past review comment requesting example commands and follows the established documentation pattern.
146-217: Excellent comprehensive documentation for wmt24pp! The wmt24pp section provides:
- Clear benchmark definition and source reference
- Reference numbers table showing per-language and average BLEU scores
- Runnable example commands for 4 different models

The documentation is consistent with the FLORES-200 section and provides all necessary information for users to run evaluations.
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com> Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com> Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com> Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com> Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com> Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
PR (#870) recreated from branch
Summary by CodeRabbit
- New Features
- Documentation
- Chores