diff --git a/README.md b/README.md
index 8fc8fc5167..54ec693ee0 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ Here are some of the features we support:
   - [**Instruction following**](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following): e.g. [ifbench](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifbench), [ifeval](https://nvidia.github.io/NeMo-Skills/evaluation/instruction-following/#ifeval)
   - [**Long-context**](https://nvidia.github.io/NeMo-Skills/evaluation/long-context): e.g. [ruler](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#ruler), [mrcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#mrcr), [aalcr](https://nvidia.github.io/NeMo-Skills/evaluation/long-context/#aalcr)
   - [**Tool-calling**](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling): e.g. [bfcl_v3](https://nvidia.github.io/NeMo-Skills/evaluation/tool-calling/#bfcl_v3)
-  - [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox)
+  - [**Multilingual**](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual): e.g. [mmlu-prox](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#mmlu-prox), [FLORES-200](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#FLORES-200), [wmt24pp](https://nvidia.github.io/NeMo-Skills/evaluation/multilingual/#wmt24pp)
 - Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
 - [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).
diff --git a/docs/evaluation/index.md b/docs/evaluation/index.md
index f5b6f3f345..41fdadfe28 100644
--- a/docs/evaluation/index.md
+++ b/docs/evaluation/index.md
@@ -9,7 +9,7 @@ We support many popular benchmarks and it's easy to add new ones in the future.
 - [**Instruction following**](./instruction-following.md): e.g. [ifbench](./instruction-following.md#ifbench), [ifeval](./instruction-following.md#ifeval)
 - [**Long-context**](./long-context.md): e.g. [ruler](./long-context.md#ruler), [mrcr](./long-context.md#mrcr)
 - [**Tool-calling**](./tool-calling.md): e.g. [bfcl_v3](./tool-calling.md#bfcl_v3)
-- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox)
+- [**Multilingual**](./multilingual.md): e.g. [mmlu-prox](./multilingual.md#mmlu-prox), [flores-200](./multilingual.md#FLORES-200), [wmt24pp](./multilingual.md#wmt24pp)
 
 See [nemo_skills/dataset](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset) where each folder is a benchmark we support.
@@ -246,4 +246,4 @@ To create a new benchmark follow this process:
    prompt config in `GENERATION_ARGS` and evaluation / metric parameters. But if extra customization is needed for the generation,
    you can provide a fully custom generation module. See [scicode](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/scicode/__init__.py) or [swe-bench](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/swe-bench/__init__.py) for examples of this.
 4. Create a new [evaluation class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/evaluator/__init__.py) (if cannot re-use existing one).
-5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) ( if cannot re-use existing one).
\ No newline at end of file
+5. Create a new [metrics class](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/evaluation/metrics/map_metrics.py) (if you cannot re-use an existing one).
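The benchmark-creation steps in the hunk above boil down to a small definition module per benchmark. As a hedged sketch (the constant values below mirror the `flores200/__init__.py` added later in this patch; for a different benchmark they are placeholders, not required values):

```python
# Minimal nemo_skills/dataset/<benchmark>/__init__.py sketch.
# These module-level constants are how a benchmark declares its defaults;
# the values here are illustrative, taken from the flores200 benchmark.

PROMPT_CONFIG = "multilingual/segment-translation"  # which prompt template to use
DATASET_GROUP = "chat"                              # generation group for the benchmark
METRICS_TYPE = "translation"                        # key looked up in METRICS_MAP
EVAL_ARGS = "++eval_type=no-op"                     # default evaluator arguments
GENERATION_ARGS = ""                                # extra generation overrides, if any
```

All of these can still be overridden from the command line, as the docs note.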
diff --git a/docs/evaluation/long-context.md b/docs/evaluation/long-context.md
index 52c3aae59e..506bfb10e9 100644
--- a/docs/evaluation/long-context.md
+++ b/docs/evaluation/long-context.md
@@ -49,4 +49,4 @@ ns eval \
 The results, including per-category scores, are stored in metrics.json. Detailed breakdowns by category and sequence length are also available via
 ```
 ns summarize_results --cluster=
-```
\ No newline at end of file
+```
diff --git a/docs/evaluation/multilingual.md b/docs/evaluation/multilingual.md
index 6ace23aafd..bebe25d981 100644
--- a/docs/evaluation/multilingual.md
+++ b/docs/evaluation/multilingual.md
@@ -1,6 +1,6 @@
 # Multilingual
 
-Our multilingual benchmarks cover things like multilingual reasoning as well as machine translation (to be added).
+Our multilingual benchmarks cover multilingual reasoning as well as machine translation.
 
 All benchmarks in this category have an extra `--language` argument in their associated `ns prepare` command, which allows you to choose which language(s) of the benchmark to run.
 
@@ -9,7 +9,7 @@ Once prepared, the `ns eval` command will run on all languages prepared, and the
 
 ### mmlu-prox
 
-- Benchmark is defined in [`nemo_skills/dataset/mmlu-pro/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
+- Benchmark is defined in [`nemo_skills/dataset/mmlu-prox/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/mmlu-prox/__init__.py)
 - Original benchmark source is [here](https://huggingface.co/datasets/li-lab/MMLU-ProX). Our evaluation template and answer extraction mechanism try to match the configuration in [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/mmlu_prox).
@@ -68,4 +68,150 @@ Some reference numbers and commands for reproduction:
     ++inference.temperature=0.6 \
     ++inference.top_k=20 \
     ++inference.tokens_to_generate=38912
-    ```
\ No newline at end of file
+    ```
+
+### FLORES-200
+
+- Benchmark is defined in [`nemo_skills/dataset/flores200/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/flores200/__init__.py)
+- Original benchmark source is [here](https://huggingface.co/datasets/openlanguagedata/flores_plus).
+
+Some reference numbers for the devtest split (xx corresponds to the average over 5 languages: de, es, fr, it, ja):
+
+| Model                 | en->xx | xx->en | xx->xx |
+|:----------------------|-------:|-------:|-------:|
+| Nemotron-NanoV2-9B-v2 |   32.5 |   34.0 |   25.9 |
+| Qwen3-8B              |   31.5 |   34.6 |   25.7 |
+| Qwen3-30B-A3B         |   33.3 |   35.5 |   27.1 |
+| gpt-oss-20B           |   32.4 |   34.1 |   25.0 |
+
+=== "Nemotron-NanoV2-9B-v2"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=NVIDIA/Nemotron-Nano-9B-v2 \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=512 \
+        ++system_message='/no_think'
+    ```
+
+=== "Qwen3-8B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-8B \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "Qwen3-30B-A3B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-30B-A3B \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "gpt-oss-20B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=openai/gpt-oss-20b \
+        --benchmarks flores200 \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=devtest \
+        ++inference.tokens_to_generate=2048
+    ```
+
+### wmt24pp
+
+- Benchmark is defined in [`nemo_skills/dataset/wmt24pp/__init__.py`](https://github.com/NVIDIA/NeMo-Skills/blob/main/nemo_skills/dataset/wmt24pp/__init__.py)
+- Original benchmark source is [here](https://huggingface.co/datasets/google/wmt24pp).
+
+Some reference numbers for the test split (xx corresponds to the average over 5 languages: de, es, fr, it, ja):
+
+| Model                 | en->de | en->es | en->fr | en->it | en->ja | en->xx |
+|:----------------------|-------:|-------:|-------:|-------:|-------:|-------:|
+| Nemotron-NanoV2-9B-v2 |   25.3 |   37.7 |   33.4 |   33.8 |   20.9 |   30.2 |
+| Qwen3-8B              |   26.2 |   38.5 |   33.1 |   33.1 |   21.7 |   30.5 |
+| Qwen3-30B-A3B         |   28.5 |   40.0 |   35.1 |   36.0 |   23.2 |   32.5 |
+| gpt-oss-20B           |   27.3 |   42.3 |   32.8 |   34.9 |   25.2 |   32.5 |
+
+=== "Nemotron-NanoV2-9B-v2"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=NVIDIA/Nemotron-Nano-9B-v2 \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=512 \
+        ++system_message='/no_think'
+    ```
+
+=== "Qwen3-8B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-8B \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "Qwen3-30B-A3B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=Qwen/Qwen3-30B-A3B \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=512 \
+        ++prompt_suffix='/no_think'
+    ```
+
+=== "gpt-oss-20B"
+
+    ```bash
+    ns eval \
+        --cluster=[cluster] \
+        --model=openai/gpt-oss-20b \
+        --benchmarks wmt24pp \
+        --output_dir=[output dir] \
+        --server_type=vllm \
+        --server_gpus=8 \
+        --split=test \
+        ++inference.tokens_to_generate=2048
+    ```
diff --git a/docs/index.md b/docs/index.md
index ea6a986218..e96ac847b9 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -21,7 +21,8 @@ Here are some of the features we support:
   - [**Instruction following**](./evaluation/instruction-following.md): e.g. [ifbench](./evaluation/instruction-following.md#ifbench), [ifeval](./evaluation/instruction-following.md#ifeval)
   - [**Long-context**](./evaluation/long-context.md): e.g. [ruler](./evaluation/long-context.md#ruler), [mrcr](./evaluation/long-context.md#mrcr)
   - [**Tool-calling**](./evaluation/tool-calling.md): e.g. [bfcl_v3](./evaluation/tool-calling.md#bfcl_v3)
-  - [**Robustness Evaluation**](./evaluation/robustness.md): Evaluate model sensitvity against changes in prompt.
+  - [**Multilingual capabilities**](./evaluation/multilingual.md): e.g. [mmlu-prox](./evaluation/multilingual.md#mmlu-prox), [flores-200](./evaluation/multilingual.md#FLORES-200), [wmt24pp](./evaluation/multilingual.md#wmt24pp)
+  - [**Robustness evaluation**](./evaluation/robustness.md): Evaluate model sensitivity to changes in the prompt.
 - Easily parallelize each evaluation across many Slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way.
 - [Model training](pipelines/training.md): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl).
diff --git a/nemo_skills/dataset/flores200/__init__.py b/nemo_skills/dataset/flores200/__init__.py
new file mode 100644
index 0000000000..86a7f76717
--- /dev/null
+++ b/nemo_skills/dataset/flores200/__init__.py
@@ -0,0 +1,22 @@
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# settings that define how evaluation should be done by default (all can be changed from cmdline)
+
+PROMPT_CONFIG = "multilingual/segment-translation"
+DATASET_GROUP = "chat"
+METRICS_TYPE = "translation"
+EVAL_ARGS = "++eval_type=no-op"
+GENERATION_ARGS = ""
diff --git a/nemo_skills/dataset/flores200/prepare.py b/nemo_skills/dataset/flores200/prepare.py
new file mode 100644
index 0000000000..7a427e0f1f
--- /dev/null
+++ b/nemo_skills/dataset/flores200/prepare.py
@@ -0,0 +1,73 @@
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import json
+from pathlib import Path
+
+from datasets import load_dataset
+from langcodes import Language
+
+
+def write_data_to_file(output_file, datasets, src_languages, tgt_languages):
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for src_lang in src_languages:
+            for tgt_lang in tgt_languages:
+                if src_lang != tgt_lang:
+                    for src, tgt in zip(datasets[src_lang], datasets[tgt_lang], strict=True):
+                        json_dict = {
+                            "text": src,
+                            "translation": tgt,
+                            "source_language": src_lang,
+                            "target_language": tgt_lang,
+                            "source_lang_name": Language(src_lang).display_name(),
+                            "target_lang_name": Language(tgt_lang).display_name(),
+                        }
+                        json.dump(json_dict, fout)
+                        fout.write("\n")
+
+
+def main(args):
+    all_languages = list(set(args.source_languages).union(set(args.target_languages)))
+
+    datasets = {}
+    for lang in all_languages:
+        iso_639_3 = Language(lang).to_alpha3()
+        iso_15924 = Language(lang).maximize().script
+        lang_code = f"{iso_639_3}_{iso_15924}"
+        datasets[lang] = load_dataset("openlanguagedata/flores_plus", lang_code, split=args.split)["text"]
+
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+    output_file = data_dir / f"{args.split}.jsonl"
+    write_data_to_file(output_file, datasets, src_languages=args.source_languages, tgt_languages=args.target_languages)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--split", default="dev", choices=("dev", "devtest"), help="Dataset split to process.")
+    parser.add_argument(
+        "--source_languages",
+        default=["en", "de", "es", "fr", "it", "ja"],
+        nargs="+",
+        help="Languages to translate from.",
+    )
+    parser.add_argument(
+        "--target_languages",
+        default=["en", "de", "es", "fr", "it", "ja"],
+        nargs="+",
+        help="Languages to translate to.",
+    )
+    args = parser.parse_args()
+    main(args)
diff --git a/nemo_skills/dataset/wmt24pp/__init__.py b/nemo_skills/dataset/wmt24pp/__init__.py
new file mode 100644
index 0000000000..86a7f76717
--- /dev/null
+++ b/nemo_skills/dataset/wmt24pp/__init__.py
@@ -0,0 +1,22 @@
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+# settings that define how evaluation should be done by default (all can be changed from cmdline)
+
+PROMPT_CONFIG = "multilingual/segment-translation"
+DATASET_GROUP = "chat"
+METRICS_TYPE = "translation"
+EVAL_ARGS = "++eval_type=no-op"
+GENERATION_ARGS = ""
diff --git a/nemo_skills/dataset/wmt24pp/prepare.py b/nemo_skills/dataset/wmt24pp/prepare.py
new file mode 100644
index 0000000000..c97ee351c0
--- /dev/null
+++ b/nemo_skills/dataset/wmt24pp/prepare.py
@@ -0,0 +1,60 @@
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import argparse
+import json
+from pathlib import Path
+
+from datasets import load_dataset
+from langcodes import Language
+
+
+def write_data_to_file(output_file, datasets, tgt_languages):
+    with open(output_file, "wt", encoding="utf-8") as fout:
+        for tgt_lang in tgt_languages:
+            for src, tgt in zip(datasets[tgt_lang]["source"], datasets[tgt_lang]["target"], strict=True):
+                json_dict = {
+                    "text": src,
+                    "translation": tgt,
+                    "source_language": "en",
+                    "target_language": tgt_lang,
+                    "source_lang_name": "English",
+                    "target_lang_name": Language(tgt_lang[:2]).display_name(),
+                }
+                json.dump(json_dict, fout)
+                fout.write("\n")
+
+
+def main(args):
+    datasets = {}
+    for lang in args.target_languages:
+        datasets[lang] = load_dataset("google/wmt24pp", f"en-{lang}")["train"]
+
+    data_dir = Path(__file__).absolute().parent
+    data_dir.mkdir(exist_ok=True)
+    output_file = data_dir / f"{args.split}.jsonl"
+    write_data_to_file(output_file, datasets, tgt_languages=args.target_languages)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--split", default="test", choices=("test",), help="Dataset split to process.")
+    parser.add_argument(
+        "--target_languages",
+        default=["de_DE", "es_MX", "fr_FR", "it_IT", "ja_JP"],
+        nargs="+",
+        help="Languages to translate to.",
+    )
+    args = parser.parse_args()
+    main(args)
diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py
index 90f3feebea..8eea47930e 100644
--- a/nemo_skills/evaluation/metrics/map_metrics.py
+++ b/nemo_skills/evaluation/metrics/map_metrics.py
@@ -33,6 +33,7 @@
 from nemo_skills.evaluation.metrics.mrcr_metrics import MRCRMetrics
 from nemo_skills.evaluation.metrics.ruler_metrics import RulerMetrics
 from nemo_skills.evaluation.metrics.simpleqa_metrics import SimpleQAMetrics
+from nemo_skills.evaluation.metrics.translation_metrics import TranslationMetrics
 
 METRICS_MAP = {
     "math": MathMetrics,
@@ -56,6 +57,7 @@
     "aalcr": AALCRMetrics,
     "livebench_coding": LiveCodeBenchMetrics,
     "ojbench": OJBenchMetrics,
+    "translation": TranslationMetrics,
     "human_eval_infilling": HumanEvalInfillingMetrics,
 }
diff --git a/nemo_skills/evaluation/metrics/translation_metrics.py b/nemo_skills/evaluation/metrics/translation_metrics.py
new file mode 100644
index 0000000000..dbeadbede8
--- /dev/null
+++ b/nemo_skills/evaluation/metrics/translation_metrics.py
@@ -0,0 +1,81 @@
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+
+# http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from collections import defaultdict
+
+from sacrebleu import corpus_bleu
+
+from nemo_skills.evaluation.metrics.base import BaseMetrics, as_float
+
+
+class TranslationMetrics(BaseMetrics):
+    # TODO: refactor BLEU computation so it reuses parent method functions from pass@k
+    # TODO: add support for other translation metrics, such as COMET and MetricX
+
+    def get_metrics(self):
+        metrics_dict = {}
+        for key in self.translation_dict:
+            src_lang, tgt_lang = key.split("->")
+            preds = self.translation_dict[key]["preds"]
+            gts = self.translation_dict[key]["gts"]
+
+            tokenize = "13a"
+            if tgt_lang[:2] == "ja":
+                tokenize = "ja-mecab"
+            if tgt_lang[:2] == "zh":
+                tokenize = "zh"
+            if tgt_lang[:2] == "ko":
+                tokenize = "ko-mecab"
+
+            bleu_score = corpus_bleu(preds, [gts], tokenize=tokenize).score
+            metrics_dict[key] = {"bleu": bleu_score}
+            self.aggregation_dict["xx->xx"].append(bleu_score)
+            self.aggregation_dict[f"{src_lang}->xx"].append(bleu_score)
+            self.aggregation_dict[f"xx->{tgt_lang}"].append(bleu_score)
+
+        for key in self.aggregation_dict:
+            metrics_dict[key] = {"bleu": sum(self.aggregation_dict[key]) / len(self.aggregation_dict[key])}
+
+        return metrics_dict
+
+    def update(self, predictions):
+        """Updating the evaluation results with the current element.
+
+        Args:
+            predictions (list[dict]): aggregated predictions across all generations.
+                The content of the file is benchmark specific.
+ """ + super().update(predictions) + + for pred in predictions: + src_lang = pred["source_language"] + tgt_lang = pred["target_language"] + generation = pred["generation"] + ground_truth = pred["translation"] + + self.translation_dict[f"{src_lang}->{tgt_lang}"]["preds"].append(generation) + self.translation_dict[f"{src_lang}->{tgt_lang}"]["gts"].append(ground_truth) + + def reset(self): + super().reset() + self.translation_dict = defaultdict(lambda: defaultdict(list)) + self.aggregation_dict = defaultdict(list) + + def evaluations_to_print(self): + """Returns all translation pairs and aggregated multilingual dictionaries.""" + return list(self.translation_dict.keys()) + list(self.aggregation_dict.keys()) + + def metrics_to_print(self): + metrics_to_print = {"bleu": as_float} + return metrics_to_print diff --git a/nemo_skills/prompt/config/multilingual/__init__.py b/nemo_skills/prompt/config/multilingual/__init__.py new file mode 100644 index 0000000000..341a77c5bc --- /dev/null +++ b/nemo_skills/prompt/config/multilingual/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
diff --git a/nemo_skills/prompt/config/multilingual/segment-translation.yaml b/nemo_skills/prompt/config/multilingual/segment-translation.yaml
new file mode 100644
index 0000000000..57e1bdc750
--- /dev/null
+++ b/nemo_skills/prompt/config/multilingual/segment-translation.yaml
@@ -0,0 +1,3 @@
+# Default prompt for text translation.
+
+user: "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}"
diff --git a/requirements/main.txt b/requirements/main.txt
index a623f88fa6..27fd526960 100644
--- a/requirements/main.txt
+++ b/requirements/main.txt
@@ -26,6 +26,7 @@ huggingface_hub
 hydra-core
 ipython
 iso639-lang
+langcodes
 litellm[caching] < 1.75.0 # some bug with asyncio.run hanging forever
 math-verify[antlr4_9_3]
 mcp
@@ -35,6 +36,7 @@ openai
 pyyaml
 rank_bm25
 requests
+sacrebleu
 scikit-learn
 sdp @ git+https://github.com/NVIDIA/NeMo-speech-data-processor@29b9b1ec0ceaf3ffa441c1d01297371b3f8e11d2
 sympy
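The `segment-translation.yaml` prompt above is filled from the `target_lang_name` and `text` fields that the prepare scripts write to each `.jsonl` record; rendering it is plain string formatting. A sketch with a made-up record (the exact rendering machinery in nemo_skills may differ):

```python
# The user template from segment-translation.yaml; the record mimics one line
# of the prepared .jsonl files (values invented for illustration).
template = "Translate the following segment into {target_lang_name}, without additional explanation.\n\n{text}"
record = {"text": "Hello, world!", "target_lang_name": "German"}

prompt = template.format(**record)
print(prompt.splitlines()[0])
# Translate the following segment into German, without additional explanation.
```

The instruction and the source segment are separated by a blank line, so the model sees the segment as a distinct block to translate.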