
MT datasets FLORES200 and WMT24pp#870

Closed
AlexGrinch wants to merge 5 commits into NVIDIA-NeMo:main from AlexGrinch:flores200

Conversation


@AlexGrinch AlexGrinch commented Sep 30, 2025

  1. Default text translation prompt for LLMs
  2. Support for two machine translation datasets — FLORES200 and WMT24pp.

Summary by CodeRabbit

  • New Features
    • Translation evaluation: per-language-pair BLEU and aggregated scores (with Japanese tokenization).
    • New translation prompt template for clearer target-language outputs.
    • Dataset preparation tools to export JSONL translation splits for FLORES-like and WMT24pp corpora.
    • Default evaluation/generation presets for translation datasets to streamline setup.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

coderabbitai bot commented Sep 30, 2025

Walkthrough

Adds module-level dataset defaults and prepare scripts for Flores200 and WMT24pp, a new TranslationMetrics class computing BLEU and registering it, and a generic translation prompt template; new scripts serialize HuggingFace splits into per-split JSONL with language metadata.

Changes

  • Dataset default configs (nemo_skills/dataset/flores200/__init__.py, nemo_skills/dataset/wmt24pp/__init__.py): new module-level constants PROMPT_CONFIG, DATASET_GROUP, METRICS_TYPE, EVAL_ARGS, and GENERATION_ARGS as default evaluation/generation settings.
  • Dataset preparation scripts (nemo_skills/dataset/flores200/prepare.py, nemo_skills/dataset/wmt24pp/prepare.py): new CLI utilities that load parallel corpora from HuggingFace, normalize language codes/names, build language-pair mappings, and write split-based JSONL records containing text, translation, source_language, target_language, and readable language names.
  • Metrics integration (nemo_skills/evaluation/metrics/map_metrics.py, nemo_skills/evaluation/metrics/translation_metrics.py): adds TranslationMetrics (BLEU via sacrebleu.corpus_bleu, with special-case tokenization for Japanese), per-pair storage and aggregation (per-source, per-target, overall), and registers it under "translation" in METRICS_MAP.
  • Prompt config (nemo_skills/prompt/config/generic/translation.yaml): new prompt template: "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}".
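The prompt config above is a plain format string, so rendering it is just str.format; a minimal sketch (the language name and segment below are illustrative, not from the datasets):

```python
# The generic/translation template added in this PR, rendered with str.format.
# target_lang_name and text values here are illustrative only.
TEMPLATE = "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}"

prompt = TEMPLATE.format(target_lang_name="German", text="The weather is nice today.")
print(prompt)
```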

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor User
  participant CLI as prepare.py (CLI)
  participant HF as HuggingFace Datasets
  participant Writer as JSONL Writer

  User->>CLI: run with --split / --source_languages / --target_languages
  CLI->>HF: load_dataset(...) per language or pair
  HF-->>CLI: dataset split (texts)
  loop for each language pair and example
    CLI->>CLI: normalize codes & names, build record
    CLI->>Writer: append JSONL line {text, translation, src/tgt codes, names}
  end
  Note over Writer: Output file: <split>.jsonl (one JSON object per line)
```
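The writer step in the diagram can be sketched with the standard library alone; the field names follow the record schema described in the summary, while the sample data and output path are illustrative:

```python
import json
import tempfile
from pathlib import Path

# Illustrative parallel record; the real scripts build these from HuggingFace splits.
record = {
    "text": "Hello.",
    "translation": "Hallo.",
    "source_language": "en",
    "target_language": "de",
    "source_lang_name": "English",
    "target_lang_name": "German",
}

# Write one JSON object per line, mirroring the <split>.jsonl layout.
out = Path(tempfile.mkdtemp()) / "test.jsonl"
with out.open("w", encoding="utf-8") as fout:
    for rec in [record]:
        fout.write(json.dumps(rec, ensure_ascii=False) + "\n")

lines = out.read_text(encoding="utf-8").splitlines()
```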
```mermaid
sequenceDiagram
  autonumber
  participant Eval as Evaluator
  participant TM as TranslationMetrics
  participant SB as sacrebleu

  Eval->>TM: reset()
  loop for each batch
    Eval->>TM: update(predictions)  %% store per-pair hypotheses/references
  end
  Eval->>TM: get_metrics()
  TM->>SB: corpus_bleu(hypotheses, references, tokenize=auto or ja-mecab)
  SB-->>TM: BLEU scores per pair
  TM-->>Eval: {per-pair BLEU, per-src/tgt aggregates, overall BLEU}
  Note over TM: Aggregation buckets by source, target, and overall
```
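The aggregation buckets in the note above can be sketched with plain dictionaries; the per-pair scores below are hypothetical stand-ins for sacrebleu.corpus_bleu output, and the metric-key naming is an assumption based on the summary, not the class's actual keys:

```python
from collections import defaultdict

# Hypothetical per-pair BLEU scores; in TranslationMetrics these would come
# from sacrebleu.corpus_bleu over the stored hypotheses/references.
pair_bleu = {("en", "de"): 30.0, ("en", "ja"): 20.0, ("de", "en"): 40.0}

# Bucket scores by source language, target language, and overall.
per_src = defaultdict(list)
per_tgt = defaultdict(list)
for (src, tgt), score in pair_bleu.items():
    per_src[src].append(score)
    per_tgt[tgt].append(score)

metrics = {f"bleu_{s}-{t}": v for (s, t), v in pair_bleu.items()}
metrics.update({f"bleu_src_{s}": sum(v) / len(v) for s, v in per_src.items()})
metrics.update({f"bleu_tgt_{t}": sum(v) / len(v) for t, v in per_tgt.items()})
metrics["bleu_overall"] = sum(pair_bleu.values()) / len(pair_bleu)
```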

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my nose at BLEU so bright,
JSONL carrots stacked just right.
From Flores fields to WMT lanes,
I hop through pairs — en to ja, oui!
Prompt in paw, I test and bite.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title succinctly and accurately reflects the primary focus of the pull request (adding support for the MT datasets FLORES200 and WMT24pp) and is concise and clear for a teammate reviewing the history.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f2a8cad and cba9e46.

📒 Files selected for processing (1)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/evaluation/metrics/map_metrics.py


Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (5)
nemo_skills/prompt/config/generic/translation.yaml (1)

3-3: Tighten the prompt to reduce chatty outputs.

Consider explicitly asking for only the translation text (no quotes, no prefix/suffix) to avoid LLM boilerplate.

Apply this tweak if desired:

```diff
-user: "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}"
+user: "Translate the following segment into {target_lang_name}. Return only the translated text (no quotes, no extra words).\n\n{text}"
```
nemo_skills/evaluation/metrics/translation_metrics.py (2)

31-34: Avoid re-deriving tgt_lang; use the parsed variable.

You already have tgt_lang from the split; reassigning risks drift if key parsing changes.

```diff
-            tokenize = "13a"
-            tgt_lang = key.split("_")[-1]
+            tokenize = "13a"
             if tgt_lang == "ja":
                 tokenize = "ja-mecab"
```

26-27: Key parsing via split("_") is brittle for codes containing underscores.

Prefer tuple keys (src_lang, tgt_lang) to avoid delimiter parsing bugs.

Minimal change:

```diff
-        for key in self.translation_dict:
-            src_lang, tgt_lang = key.split("_")
-            preds = self.translation_dict[key]["preds"]
-            gts = self.translation_dict[key]["gts"]
+        for (src_lang, tgt_lang), pair_dict in self.translation_dict.items():
+            preds = pair_dict["preds"]
+            gts = pair_dict["gts"]
```

And in update:

```diff
-            self.translation_dict[f"{src_lang}_{tgt_lang}"]["preds"].append(generation)
-            self.translation_dict[f"{src_lang}_{tgt_lang}"]["gts"].append(ground_truth)
+            self.translation_dict[(src_lang, tgt_lang)]["preds"].append(generation)
+            self.translation_dict[(src_lang, tgt_lang)]["gts"].append(ground_truth)
```

This avoids future failures if a language like zh_Hant ever appears.
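The failure mode is easy to reproduce with plain strings; the codes below are illustrative:

```python
# String keys break once a language code itself contains an underscore.
key = "en_zh_Hant"  # source "en", target "zh_Hant" (illustrative)

try:
    src_lang, tgt_lang = key.split("_")
    parsed = (src_lang, tgt_lang)
except ValueError:
    parsed = None  # "too many values to unpack": split produced 3 parts

# Tuple keys sidestep delimiter parsing entirely.
tuple_key = ("en", "zh_Hant")
src_lang, tgt_lang = tuple_key
```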

nemo_skills/dataset/flores200/prepare.py (2)

32-51: Consider adding error handling for language code conversion and dataset loading.

The language code conversion (lines 38-40) and dataset loading (line 41) assume valid inputs and successful operations. If an invalid language code is provided or if dataset loading fails, the script will crash without helpful error messages.

Consider adding try-except blocks to provide clearer error messages for:

  • Invalid language codes that langcodes.Language cannot parse
  • Missing language variants in the FLORES+ dataset
  • Empty or malformed datasets

Example error handling:

```python
for lang in all_languages:
    try:
        iso_639_3 = Language(lang).to_alpha3()
        iso_15924 = Language(lang).maximize().script
        lang_code = f"{iso_639_3}_{iso_15924}"
        datasets[lang] = load_dataset("openlanguagedata/flores_plus", lang_code, split=args.split)['text']
    except Exception as e:
        raise ValueError(f"Failed to load dataset for language '{lang}': {e}")
```

54-66: LGTM with optional help text improvement.

The CLI argument parsing is well-structured. The default language lists will generate all permutations excluding self-translation pairs (e.g., en→de, de→en, etc.), which is appropriate for evaluation.

Optionally, consider clarifying in the help text that identical source and target language lists will generate all cross-language pairs:

```diff
     parser.add_argument(
         "--source_languages", default=["en", "de", "es", "fr", "it", "ja"],
-        nargs="+", help="Languages to translate from."
+        nargs="+", help="Languages to translate from. Will generate pairs with all target languages (excluding self-pairs)."
     )
```
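The pair-generation behavior described above (all ordered cross-language pairs, self-pairs excluded) can be checked directly with the default language list:

```python
# Default languages from the CLI above.
langs = ["en", "de", "es", "fr", "it", "ja"]

# All ordered cross-language pairs, excluding self-translation (en->en etc.).
pairs = [(src, tgt) for src in langs for tgt in langs if src != tgt]
```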
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74b8ed8 and 85f0a13.

📒 Files selected for processing (7)
  • nemo_skills/dataset/flores200/__init__.py (1 hunks)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/__init__.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (2 hunks)
  • nemo_skills/evaluation/metrics/translation_metrics.py (1 hunks)
  • nemo_skills/prompt/config/generic/translation.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
  • TranslationMetrics (22-80)
nemo_skills/evaluation/metrics/translation_metrics.py (1)
nemo_skills/evaluation/metrics/base.py (2)
  • BaseMetrics (21-332)
  • as_float (343-344)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (16-29)
  • main (32-46)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (14-29)
  • main (32-51)
🪛 Ruff (0.13.1)
nemo_skills/evaluation/metrics/translation_metrics.py

38-38: f-string without any placeholders

Remove extraneous f prefix

(F541)

nemo_skills/dataset/flores200/prepare.py

19-19: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/dataset/wmt24pp/prepare.py

19-19: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (5)
nemo_skills/dataset/wmt24pp/__init__.py (1)

18-22: Confirm defaults for group/eval args align with intended pipeline.

Current defaults set DATASET_GROUP="chat" and EVAL_ARGS="++eval_type=no-op". For a translation dataset, did you intend DATASET_GROUP="translation" and empty EVAL_ARGS? If "chat"/no-op is intentional (e.g., to reuse chat scaffolding), no change needed—just confirming.

nemo_skills/dataset/wmt24pp/prepare.py (1)

35-38: Verify WMT24pp schema and language codes.

Please confirm that:

  • The dataset columns are named "source" and "target" for config en-{LANG2CODE[lang]}.
  • The configs en-de_DE, en-es_MX, etc., exist as expected.

If column names differ (e.g., a dict translation with language keys), we should adapt the extraction.

nemo_skills/evaluation/metrics/translation_metrics.py (1)

33-36: ja tokenization dependency check.

sacrebleu’s ja-mecab tokenizer requires extra deps (e.g., fugashi/unidic-lite). If missing, this may raise at runtime. Consider a graceful fallback to "ja-mecab" if available else "13a" with a warning.
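One way to implement the suggested fallback; this sketch assumes (as the comment does) that sacrebleu's ja-mecab tokenizer needs fugashi, which should be verified against the installed sacrebleu version:

```python
import warnings


def pick_tokenizer(tgt_lang):
    """Return a sacrebleu tokenize= value, degrading gracefully for Japanese."""
    if tgt_lang != "ja":
        return "13a"
    try:
        import fugashi  # noqa: F401  # assumed requirement for ja-mecab
        return "ja-mecab"
    except ImportError:
        warnings.warn("fugashi not installed; using 13a tokenization for Japanese")
        return "13a"
```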

nemo_skills/evaluation/metrics/map_metrics.py (1)

31-51: LGTM — metric wired into the map cleanly.

nemo_skills/dataset/flores200/__init__.py (1)

18-22: LGTM!

The module-level constants are well-structured and appropriate for a translation dataset. The defaults align with the translation-focused evaluation flow added in this PR.

Comment on lines +1 to +11
```python
import argparse
import importlib.util
import json
import tempfile
import urllib.request
from pathlib import Path

from langcodes import Language

from datasets import load_dataset
from tqdm import tqdm
```

⚠️ Potential issue | 🟡 Minor

Remove unused imports.

Several imports are unused in this file: importlib.util, tempfile, urllib.request, and tqdm.

Apply this diff to remove the unused imports:

```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path

 from langcodes import Language

 from datasets import load_dataset
-from tqdm import tqdm
```

Comment on lines +14 to +29
```python
def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
    with open(output_file, "wt", encoding="utf-8") as fout:
        for src_lang in src_languages:
            for tgt_lang in tgt_languages:
                if src_lang != tgt_lang:
                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
                        json_dict = {
                            "text": src,
                            "translation": tgt,
                            "source_language": src_lang,
                            "target_language": tgt_lang,
                            "source_lang_name": Language(src_lang).display_name(),
                            "target_lang_name": Language(tgt_lang).display_name(),
                        }
                        json.dump(json_dict, fout)
                        fout.write("\n")
```

⚠️ Potential issue | 🟠 Major

Add strict=True to zip() to prevent silent data loss.

The zip() call on line 19 lacks an explicit strict= parameter. If the source and target language datasets have mismatched lengths, zip() will silently truncate to the shorter sequence, potentially losing translation pairs without warning.

For parallel corpora like FLORES200, all language datasets should have identical lengths. Adding strict=True will raise a ValueError if lengths mismatch, making data integrity issues immediately visible.

Apply this diff:

```diff
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
```

Note: This same issue exists in the WMT24pp prepare.py script and should be addressed there as well.


```python
    datasets = {}
    for lang in args.target_languages:
        lang_code = f"en-{LANG2CODE[lang]}"
        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
```

⚠️ Potential issue | 🔴 Critical

Hardcoded split causes incorrect data selection.

You load "train" regardless of --split. Use the CLI split for loading.

```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (5)
nemo_skills/dataset/flores200/prepare.py (2)

15-25: Remove unused imports.

Several imports are unused: importlib.util, tempfile, urllib.request, and tqdm.

Apply this diff:

```diff
 import argparse
-import importlib.util
 import json
-import tempfile
-import urllib.request
 from pathlib import Path

 from langcodes import Language

 from datasets import load_dataset
-from tqdm import tqdm
```

33-33: Add strict=True to zip() to prevent silent data loss.

For parallel corpora like FLORES200, all language datasets should have identical lengths. Without strict=True, zip() will silently truncate to the shorter sequence if lengths mismatch.

Apply this diff:

```diff
-                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang]):
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
```
nemo_skills/dataset/wmt24pp/prepare.py (3)

33-33: Add strict=True to zip() to prevent silent data loss.

Without strict=True, zip() will silently truncate if source and target lists have different lengths.

Apply this diff:

```diff
-                for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"]):
+                for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
```

51-51: Hardcoded split causes incorrect data selection.

The code always loads the "train" split regardless of the --split argument value. This means users cannot select other splits.

Apply this diff:

```diff
-        datasets[lang] = load_dataset("google/wmt24pp", lang_code)["train"]
+        datasets[lang] = load_dataset("google/wmt24pp", lang_code, split=args.split)
```

65-65: Fix argparse choices bug.

choices=("test") is a string, not a tuple. This means argparse will accept any single character from "test" ('t', 'e', 's', 't') as valid input.

Apply this diff:

```diff
-    parser.add_argument("--split", default="test", choices=("test"), help="Dataset split to process.")
+    parser.add_argument("--split", default="test", choices=("test",), help="Dataset split to process.")
```
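The bug comes from Python treating ("test") as a plain string, so argparse's `in` membership check matches any substring of "test"; a short demonstration:

```python
import argparse

# choices=("test") is just the string "test": membership is substring-based.
buggy = argparse.ArgumentParser()
buggy.add_argument("--split", default="test", choices=("test"))
args = buggy.parse_args(["--split", "t"])  # accepted, surprisingly

# choices=("test",) is a one-element tuple: only "test" itself is valid.
fixed = argparse.ArgumentParser()
fixed.add_argument("--split", default="test", choices=("test",))
try:
    fixed.parse_args(["--split", "t"])  # argparse errors out via sys.exit
    rejected = False
except SystemExit:
    rejected = True
```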
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85f0a13 and dadb627.

📒 Files selected for processing (2)
  • nemo_skills/dataset/flores200/prepare.py (1 hunks)
  • nemo_skills/dataset/wmt24pp/prepare.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/dataset/flores200/prepare.py (1)
nemo_skills/dataset/wmt24pp/prepare.py (2)
  • write_data_to_file (30-43)
  • main (46-60)
nemo_skills/dataset/wmt24pp/prepare.py (1)
nemo_skills/dataset/flores200/prepare.py (2)
  • write_data_to_file (28-43)
  • main (46-65)
🪛 Ruff (0.13.1)
nemo_skills/dataset/flores200/prepare.py

33-33: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/dataset/wmt24pp/prepare.py

33-33: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (1)
nemo_skills/dataset/flores200/prepare.py (1)

46-65: LGTM!

The main function correctly uses args.split when loading datasets and properly handles the union of source and target languages.

Signed-off-by: AlexGrinch <grinchuk.alexey@gmail.com>
Signed-off-by: Aleksey Grinchuk (Oleksii Hrinchuk) <grinchuk.alexey@gmail.com>

@Kipok Kipok left a comment


Thanks @AlexGrinch ! A few things to change besides the comments

```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)

PROMPT_CONFIG = "generic/translation"
DATASET_GROUP = "chat"
```

maybe we add a new group for multilingual datasets? Can also update to that type in mmlu-prox

```python
from collections import Counter, defaultdict

from sacrebleu import corpus_bleu
# from comet import download_model, load_from_checkpoint
```

remove?


```python
        return metrics_dict

    def update(self, predictions):
```

do you think it's feasible to refactor this to reuse the parent method functions for computing pass@k? That way more code could be shared, and pass@k can be defined as the highest BLEU across k attempts (this should work out of the box, as we have similar logic for ifeval)
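If the refactor goes that way, a best-of-k aggregation might look like the sketch below; BaseMetrics internals are not shown in this PR page, so the function name and data layout are stand-ins:

```python
def bleu_at_k(attempt_scores):
    """pass@k-style aggregation for a continuous metric: take the highest
    BLEU across the k sampled generations.

    attempt_scores: list of corpus-level BLEU values, one per attempt
    (hypothetical layout; the real class would compute these via sacrebleu).
    """
    return max(attempt_scores)


# Illustrative: 3 attempts, best one wins.
best = bleu_at_k([27.4, 31.2, 29.8])
```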


@shuoyangd shuoyangd left a comment


I have a few other minor questions and/or requested changes aside from what Igor commented earlier.


```python
from datasets import load_dataset

LANG2CODE = {"de": "de_DE", "es": "es_MX", "fr": "fr_FR", "it": "it_IT", "ja": "ja_JP"}
```

Can we just directly use the long code instead of adding this conversion dict? This dict limits the utility of the script to just the 5 languages above, and with two-letter codes we will have an extensibility problem once we include zh_CN and zh_TW.

```python
# from comet import download_model, load_from_checkpoint

# +
class TranslationMetrics(BaseMetrics):
```

Do you intend to extend this to compute COMET as well? If it's going to be in a different class we should just call this BLEU.


```python
            tokenize = "13a"
            tgt_lang = key.split("_")[-1]
            if tgt_lang == "ja":
```

I think if you pass tokenize=None in L36 then you don't have to do this (it will default to jieba for Chinese, ja-mecab for Japanese, ko-mecab for Korean, and 13a otherwise).

@shuoyangd

One more thing: we need two more dependencies here: langcodes and sacrebleu


Kipok commented Oct 3, 2025

let me close this one - @AlexGrinch please recreate from a branch
