From f41501259f732ebebf60b0165577b6a737535867 Mon Sep 17 00:00:00 2001
From: Shubham Toshniwal
Date: Mon, 11 Aug 2025 14:32:34 -0700
Subject: [PATCH 01/20] Docs for Llama Nemotron

Signed-off-by: Shubham Toshniwal
---
 docs/releases/llamanemotron_super_v1.5/evaluation.md |  0
 docs/releases/llamanemotron_super_v1.5/index.md      | 11 +++++++++++
 2 files changed, 11 insertions(+)
 create mode 100644 docs/releases/llamanemotron_super_v1.5/evaluation.md
 create mode 100644 docs/releases/llamanemotron_super_v1.5/index.md

diff --git a/docs/releases/llamanemotron_super_v1.5/evaluation.md b/docs/releases/llamanemotron_super_v1.5/evaluation.md
new file mode 100644
index 0000000000..e69de29bb2
diff --git a/docs/releases/llamanemotron_super_v1.5/index.md b/docs/releases/llamanemotron_super_v1.5/index.md
new file mode 100644
index 0000000000..12e08574f8
--- /dev/null
+++ b/docs/releases/llamanemotron_super_v1.5/index.md
@@ -0,0 +1,11 @@
+# Llama Nemotron Super 49B v1.5
+
+This section has instructions for evaluating the
+[Llama Nemotron Super 49B v1.5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5).
+
+Please note that unless you have access to a large GPU cluster, it might take a very long time
+for some of the commands to complete!
+
+- [Model evaluation](evaluation.md)
+
+

From 0d1f6863df4e4e8d97fbf38e18f59404895526b9 Mon Sep 17 00:00:00 2001
From: Shubham Toshniwal
Date: Fri, 15 Aug 2025 13:24:04 -0400
Subject: [PATCH 02/20] Reasoning results added

Signed-off-by: Shubham Toshniwal
---
 .../llamanemotron_super_v1.5/evaluation.md    |   0
 .../llamanemotron_super_v1.5/index.md         |  11 -
 .../posts/llama-nemotron-super-v1.5-evals.md  | 260 ++++++++++++++++++
 3 files changed, 260 insertions(+), 11 deletions(-)
 delete mode 100644 docs/releases/llamanemotron_super_v1.5/evaluation.md
 delete mode 100644 docs/releases/llamanemotron_super_v1.5/index.md
 create mode 100644 docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md

diff --git a/docs/releases/llamanemotron_super_v1.5/evaluation.md b/docs/releases/llamanemotron_super_v1.5/evaluation.md
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/docs/releases/llamanemotron_super_v1.5/index.md b/docs/releases/llamanemotron_super_v1.5/index.md
deleted file mode 100644
index 12e08574f8..0000000000
--- a/docs/releases/llamanemotron_super_v1.5/index.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# Llama Nemotron Super 49B v1.5
-
-This section has instructions for evaluating the
-[Llama Nemotron Super 49B v1.5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5).
-
-Please note that unless you have access to a large GPU cluster, it might take a very long time
-for some of the commands to complete!
-
-- [Model evaluation](evaluation.md)
-
-
diff --git a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md
new file mode 100644
index 0000000000..ad3e72ac2c
--- /dev/null
+++ b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md
@@ -0,0 +1,260 @@
+---
+date: 2025-08-12
+readtime: 20 # TODO: Revisit this number
+---
+
+# Reproducing Llama-Nemotron-Super-49B-V1.5 Evals
+
+In this tutorial, we will reproduce the evals for the Llama-3.3-Nemotron-Super-49B-v1.5 model using NeMo-Skills.
+For an introduction to the NeMo-Skills framework, we recommend going over [our introductory tutorial](./omr-simple-recipe.md).
+
+
+We assume you have `/workspace` defined in your [cluster config](../../basics/cluster-configs.md) and are
+executing all commands from that folder locally. Change all commands accordingly if running on slurm or using different paths.
+
+
+
+## Download the model
+
+Get the model from HF.
+```bash
+pip install -U "huggingface_hub[cli]"
+huggingface-cli download nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 --local-dir /workspace/Llama-3_3-Nemotron-Super-49B-v1_5
+```
+
+## Prepare evaluation data
+
+We will evaluate the model on the following:
+
+- Science & General reasoning benchmarks:
+    - GPQA
+    - MMLU-Pro
+    - HLE
+
+- Coding reasoning benchmarks:
+    - LiveCodeBench
+    - SciCode
+
+- Math reasoning benchmarks:
+    - MATH-500
+    - AIME24
+    - AIME25
+
+- Tool-calling:
+    - BFCL v3
+
+
+Here is the command to prepare these datasets using NeMo-Skills:
+
+```bash
+ns prepare_data gpqa mmlu-pro hle livecodebench scicode bfcl_v3 math-500 aime24 aime25
+```
+
+
+## Evaluation commands
+
+Llama-3.3-Nemotron-Super-49B-v1.5 can perform inference in both reasoning-on and reasoning-off modes.
+We detail the evaluation commands and results for both modes.
+Note that you might not get exactly the same numbers as reported here because of the stochastic nature of LLM generations.
+
+!!! note
+    The commands provided here assume you're working on a local machine, where benchmarks/subsets are evaluated sequentially, which will take a very long time. If running on slurm, you can set `--num_jobs` to a bigger number, or set it to -1 to run each benchmark (and each of its random seeds) as an independent job, which in the case of Llama-Nemotron-Super-49B-V1.5 requires one node per job.
+
+
+
+### Reasoning-on Evals
+
+For the reasoning mode evals, we follow the recommended recipe of setting:
+
+- temperature to 0.6
+- top-p to 0.95
+- system_message to an empty string, i.e.
''
+- maximum number of generated tokens to 65536
+
+#### Command for Math, Code, and Science Reasoning Eval (Reasoning on)
+
+The following command evaluates the model on GPQA, MMLU-Pro, SciCode, MATH-500, AIME24, and AIME25, with 16 runs for each benchmark. We have highlighted the inference settings recommended above in the following command:
+
+
+```bash hl_lines="9-13"
+ns eval \
+    --cluster=local \
+    --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \
+    --server_type=vllm \
+    --output_dir=/workspace/llama_nemotron_49b_1_5/ \
+    --benchmarks=gpqa:16,mmlu-pro:16,scicode:16,math-500:16,aime24:16,aime25:16 \
+    --server_gpus=8 \
+    --num_jobs=1 \
+    ++inference.tokens_to_generate=65536 \
+    ++inference.temperature=0.6 \
+    ++inference.top_p=0.95 \
+    ++system_message=''
+```
+
+For LiveCodeBench, we additionally specify the exact split on which we evaluate the benchmark. In the following command, we evaluate the model on the 166 problems in the 1 October 2024 to 1 March 2025 subset of release_v5. To evaluate on the Artificial Analysis Index (AAI) split, set split to `test_v5_2407_2412`:
+
+```bash hl_lines="7"
+ns eval \
+    --cluster=local \
+    --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \
+    --server_type=vllm \
+    --output_dir=/workspace/llama_nemotron_49b_1_5/ \
+    --benchmarks=livecodebench:16 \
+    --split=test_v5_2410_2502 \
+    --server_gpus=8 \
+    --num_jobs=1 \
+    ++inference.tokens_to_generate=65536 \
+    ++inference.temperature=0.6 \
+    ++inference.top_p=0.95 \
+    ++system_message=''
+```
+
+#### Command for HLE Eval (Reasoning on)
+
+
+For HLE, because symbolic comparison is not sufficient to determine the correctness of the output, we use the recommended `o3-mini-20250131` model as the judge. Note that this model is the default in NeMo-Skills, and we have added this argument here only for illustration purposes.
To evaluate for the Artificial Analysis Index (AAI) setting, please use the [gpt-4o-20240806 model as the judge](https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-overview){target="_blank"}.
+
+```bash hl_lines="9-10"
+ns eval \
+    --cluster=local \
+    --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \
+    --server_type=vllm \
+    --output_dir=/workspace/llama_nemotron_49b_1_5/ \
+    --benchmarks=hle:16 \
+    --server_gpus=8 \
+    --num_jobs=1 \
+    --judge_model="o3-mini-20250131" \
+    --extra_judge_args="++inference.tokens_to_generate=4096 ++max_concurrent_requests=8" \
+    ++inference.tokens_to_generate=65536 \
+    ++inference.temperature=0.6 \
+    ++inference.top_p=0.95 \
+    ++system_message=''
+```
+
+!!! note
+    For Llama-Nemotron-Super-49B-V1.5, we found that the difference in judge models can result in a 0.8-1% performance difference. Our earlier experiments with GPT-4.1 as the judge gave a performance of 6.8%. This can explain why [AAI reports a performance of 6.8%](https://artificialanalysis.ai/models/llama-nemotron-super-49b-v1-5-reasoning#intelligence-evaluations){target="_blank"} vs our reproduced performance of 7.75%.
+
+!!! note
+    If the OpenAI API throws the `Rate limit exceeded` error, please reduce the `max_concurrent_requests` value in the `extra_judge_args` argument and restart the job.
+
+
+#### Command for BFCL Eval (Reasoning on)
+
+Tool-calling benchmarks require tool-call parsing and execution. NeMo-Skills supports both client-side parsing (default) and server-side parsing.
For server-side parsing, the vLLM server requires the parsing details, as highlighted in the command below:
+```bash hl_lines="13-17"
+ns eval \
+    --cluster=local \
+    --benchmarks=bfcl_v3 \
+    --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \
+    --server_gpus=8 \
+    --server_type=vllm \
+    --num_jobs=1 \
+    --output_dir=/workspace/llama_nemotron_49b_1_5_tool_calling/ \
+    ++inference.tokens_to_generate=65536 \
+    ++inference.temperature=0.6 \
+    ++inference.top_p=0.95 \
+    ++system_message='' \
+    ++use_client_parsing=False \
+    --server_args="--tool-parser-plugin \"/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/llama_nemotron_toolcall_parser_no_streaming.py\" \
+    --tool-call-parser \"llama_nemotron_json\" \
+    --enable-auto-tool-choice"
+
+```
+
+
+
+### Reasoning-on Results
+
+We use the `summarize_results` pipeline to calculate the evaluation metrics for all benchmarks except BFCL, where the metrics are calculated as part of the evaluation job itself.
+The following results were obtained by running the command:
+
+
+```bash
+ns summarize_results --cluster=local /workspace/llama_nemotron_49b_1_5/eval-results/{BENCHMARK}
+```
+
+
+#### Results for Science & General Reasoning benchmarks (Reasoning on)
+
+```
+------------------------------------------ gpqa -----------------------------------------
+evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
+pass@1[avg-of-16] | 198         | 11046      | 1986        | 74.65%           | 0.60%
+majority@16       | 198         | 11046      | 1986        | 78.28%           | 0.00%
+pass@16           | 198         | 11046      | 1986        | 92.93%           | 0.00%
+
+---------------------------------------- mmlu-pro ---------------------------------------
+evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
+pass@1[avg-of-16] | 12032       | 4879       | 12516       | 81.44%           | 0.05%
+majority@16       | 12032       | 4879       | 12516       | 83.05%           | 0.00%
+pass@16           | 12032       | 4879       | 12516       | 91.32%           | 0.00%
+
+-------------------------------------------------- hle --------------------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | judge_correct | symbolic_correct | no_answer
+pass@1[avg-of-15] | 2158        | 12111      | 7782        | 7.75%         | 2.40%            | 64.13%
+majority@15       | 2158        | 12111      | 7782        | 4.31%         | 3.43%            | 49.91%
+pass@15           | 2158        | 12111      | 7782        | 27.80%        | 10.10%           | 49.91%
+```
+
+
+#### Results for Code Reasoning benchmarks (Reasoning on)
+
+```
+--------------------------- livecodebench ---------------------------
+evaluation_mode   | num_entries | avg_tokens | gen_seconds | accuracy
+pass@1[avg-of-16] | 166         | 18881      | 1552        | 71.72%
+pass@16           | 166         | 18881      | 1552        | 87.35%
+
+--------------------------------------------------- scicode ----------------------------------------------------
+evaluation_mode   | avg_tokens | gen_seconds | problem_accuracy | subtask_accuracy | num_problems | num_subtasks
+pass@1[avg-of-16] | 43481      | 69963      | 3.08%            | 28.91%           | 65           | 288
+pass@16           | 43481      | 69963      | 7.69%            | 40.97%           | 65           | 288
+```
+
+#### Results for Math Reasoning benchmarks (Reasoning on)
+
+```
+---------------------------------------- math-500 ---------------------------------------
+evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
+pass@1[avg-of-16] | 500         | 5807       | 2828        | 97.79%           | 0.28%
+majority@16       | 500         | 5807       | 2828        | 99.00%           | 0.00%
+pass@16           | 500         | 5807       | 2828        | 99.40%           | 0.00%
+
+
+----------------------------------------- aime24 ----------------------------------------
+evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
+pass@1[avg-of-16] | 30          | 19875      | 2042        | 88.54%           | 1.88%
+majority@16       | 30          | 19875      | 2042        | 93.33%           | 0.00%
+pass@16           | 30          | 19875      | 2042        | 93.33%           | 0.00%
+
+
+----------------------------------------- aime25 ----------------------------------------
+evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
+pass@1[avg-of-16] | 30          | 23366      | 832         | 84.38%           | 3.96%
+majority@16       | 30          | 23366      | 832         | 93.33%           | 0.00%
+pass@16           | 30          | 23366      | 832         |
93.33% | 0.00%
+```
+
+
+#### Results for Tool Calling (Reasoning on)
+
+```
+----------------------- bfcl_v3 ------------------------
+| Category                    | num_entries | accuracy |
+|-----------------------------|-------------|----------|
+| overall_accuracy            | 4441        | 72.64%   |
+| overall_non_live            | 1390        | 88.20%   |
+| non_live_ast                | 1150        | 88.58%   |
+| irrelevance                 | 240         | 86.67%   |
+| overall_live                | 2251        | 83.34%   |
+| live_ast                    | 1351        | 82.68%   |
+| live_irrelevance            | 882         | 84.47%   |
+| live_relevance              | 18          | 77.78%   |
+| overall_multi_turn          | 800         | 46.38%   |
+
+```
+
+!!! note
+    Currently `summarize_results` doesn't support benchmarks like BFCL v3, which have their own logic for combining subset scores to arrive at the overall score. This table was created by formatting the `metrics.json` file from `/workspace/llama_nemotron_49b_1_5_tool_calling/bfcl_v3/metrics.json` using an LLM.


From f02be7891836b3f5aa60aa0f071498b18b449940 Mon Sep 17 00:00:00 2001
From: fayejf <36722593+fayejf@users.noreply.github.com>
Date: Mon, 11 Aug 2025 14:26:06 -0700
Subject: [PATCH 03/20] Adding long context benchmark MRCR (#634)

Signed-off-by: fayejf
Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com>
Signed-off-by: Igor Gitman
Co-authored-by: Igor Gitman
Signed-off-by: Shubham Toshniwal
---
 README.md                                     |   2 +-
 nemo_skills/dataset/mrcr/__init__.py          |  19 ++++
 nemo_skills/dataset/mrcr/prepare.py           | 102 ++++++++++++++++++
 nemo_skills/evaluation/evaluator/__init__.py  |   2 +
 nemo_skills/evaluation/evaluator/mrcr.py      |  48 +++++++++
 nemo_skills/evaluation/metrics/map_metrics.py |   2 +
 .../evaluation/metrics/mrcr_metrics.py        |  26 +++++
 7 files changed, 200 insertions(+), 1 deletion(-)
 create mode 100644 nemo_skills/dataset/mrcr/__init__.py
 create mode 100644 nemo_skills/dataset/mrcr/prepare.py
 create mode 100644 nemo_skills/evaluation/evaluator/mrcr.py
 create mode 100644 nemo_skills/evaluation/metrics/mrcr_metrics.py

diff --git a/README.md b/README.md
index b730121436..41c1105867 100644
---
a/README.md +++ b/README.md @@ -15,7 +15,7 @@ Here are some of the features we support: - Coding skills: scicode, livecodebench, human-eval, mbpp - Chat/instruction following: ifbench, ifeval, arena-hard - General knowledge: mmlu, mmlu-pro, gpqa - - Long context: ruler + - Long context: ruler, mrcr - Easily parallelize each evaluation across many slurm jobs, self-host LLM judges, bring your own prompts or change benchmark configuration in any other way. - [Model training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl). diff --git a/nemo_skills/dataset/mrcr/__init__.py b/nemo_skills/dataset/mrcr/__init__.py new file mode 100644 index 0000000000..5b2ff73cc0 --- /dev/null +++ b/nemo_skills/dataset/mrcr/__init__.py @@ -0,0 +1,19 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +EVAL_SPLIT = 'all' +PROMPT_CONFIG = 'null' +DATASET_GROUP = 'long-context' +METRICS_TYPE = 'mrcr' +EVAL_ARGS = '++eval_type=mrcr' +GENERATION_ARGS = '++prompt_format=openai' diff --git a/nemo_skills/dataset/mrcr/prepare.py b/nemo_skills/dataset/mrcr/prepare.py new file mode 100644 index 0000000000..848d89b3fd --- /dev/null +++ b/nemo_skills/dataset/mrcr/prepare.py @@ -0,0 +1,102 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import json +import subprocess +from pathlib import Path + +import tiktoken +from datasets import load_dataset +from tqdm import tqdm + +""" +Usage +# default. setup is all. +python prepare.py + +# prepare subset needle2_128k. +python prepare.py --max_context_window 131072 --needles_subset 2 --setup needle2_128k +python prepare.py --max_context_window 131072 --needles_subset 2 4 --setup needle2_needle_4_128k +""" + + +def count_n_tokens(messages: list[dict]) -> int: + """ + Follow the official way to count tokens in messages. 
+ with tokenizer o200k_base + """ + enc = tiktoken.get_encoding("o200k_base") + return sum([len(enc.encode(m["content"])) for m in messages]) + + +def write_data_to_file(output_file, data, max_context_window, needles_subset): + with open(output_file, "wt", encoding="utf-8") as fout: + for idx, entry in tqdm(enumerate(data), desc=f"Writing {output_file.name}"): + messages = json.loads(entry["prompt"]) + + if entry['n_needles'] not in needles_subset: + print(f"Skipping {idx} because it has {entry['n_needles']} needle") + continue + + # find n_tokens + n_tokens = count_n_tokens(messages) + if max_context_window is not None: + if n_tokens > max_context_window: + print(f"Skipping {idx} because it has {n_tokens} tokens") + continue + + entry['messages'] = entry.pop('prompt') + entry['expected_answer'] = entry.pop('answer') + entry['n_tokens'] = n_tokens + json.dump(entry, fout) + fout.write("\n") + + +def get_mrcr_data(needles_subset, setup, max_context_window): + dataset = load_dataset("openai/mrcr")['train'] + data_dir = Path(__file__).absolute().parent + + output_file = data_dir / f"{setup}.jsonl" + write_data_to_file(output_file, dataset, max_context_window, needles_subset) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description="Prepare MRCR dataset.") + parser.add_argument( + "--max_context_window", + type=int, + default=None, + help="Maximum context window size.", + ) + parser.add_argument( + "--needles_subset", + nargs="+", + type=int, + choices=[2, 4, 8], + default=[2, 4, 8], + help="Needles subset to include.", + ) + + parser.add_argument( + "--setup", + type=str, + default="all", + help="setup name. e.g. all or _<128k>", + ) + + args = parser.parse_args() + + print(f"Preparing MRCR dataset with additional arguments: {args}") + get_mrcr_data(args.needles_subset, args.setup, args.max_context_window) + print(f"MRCR dataset preparation with setup {args.setup} completed. 
Use --split=${args.setup} to evaluate!") diff --git a/nemo_skills/evaluation/evaluator/__init__.py b/nemo_skills/evaluation/evaluator/__init__.py index c6e8c6b413..d297ed57e5 100644 --- a/nemo_skills/evaluation/evaluator/__init__.py +++ b/nemo_skills/evaluation/evaluator/__init__.py @@ -22,6 +22,7 @@ from nemo_skills.evaluation.evaluator.mcq import eval_mcq from nemo_skills.evaluation.evaluator.ruler import eval_ruler from nemo_skills.evaluation.evaluator.scicode import eval_scicode +from nemo_skills.evaluation.evaluator.mrcr import eval_mrcr def dummy_eval(cfg): @@ -43,6 +44,7 @@ def dummy_eval(cfg): 'livecodebench': eval_livecodebench, 'livecodebench_pro': eval_livecodebench_pro, 'scicode': eval_scicode, + 'mrcr': eval_mrcr, } diff --git a/nemo_skills/evaluation/evaluator/mrcr.py b/nemo_skills/evaluation/evaluator/mrcr.py new file mode 100644 index 0000000000..efdf463443 --- /dev/null +++ b/nemo_skills/evaluation/evaluator/mrcr.py @@ -0,0 +1,48 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import json +import logging +from tqdm import tqdm +from nemo_skills.utils import get_logger_name, unroll_files +from difflib import SequenceMatcher + +LOG = logging.getLogger(get_logger_name(__file__)) + + +def eval_mrcr(cfg): + def grade(response, answer, random_string_to_prepend) -> float: + """ + Compare response and answer. 
+        # Official grading function: https://huggingface.co/datasets/openai/mrcr
+        """
+        if not response.startswith(random_string_to_prepend):
+            return 0
+        response = response.removeprefix(random_string_to_prepend)
+        answer = answer.removeprefix(random_string_to_prepend)
+        return float(SequenceMatcher(None, response, answer).ratio())
+
+
+
+    for file in unroll_files(cfg.input_files):
+        with open(file, 'rt', encoding='utf-8') as fin:
+            data = [json.loads(line) for line in fin]
+        with open(file, 'wt', encoding='utf-8') as fout:
+            for sample in tqdm(data):
+                sample['seq_match_ratio'] = grade(
+                    sample['generation'],
+                    sample['expected_answer'],
+                    sample['random_string_to_prepend']
+                )
+                fout.write(json.dumps(sample) + "\n")
diff --git a/nemo_skills/evaluation/metrics/map_metrics.py b/nemo_skills/evaluation/metrics/map_metrics.py
index b5b1afb7d4..c3c754f777 100644
--- a/nemo_skills/evaluation/metrics/map_metrics.py
+++ b/nemo_skills/evaluation/metrics/map_metrics.py
@@ -19,6 +19,7 @@
 from nemo_skills.evaluation.metrics.lean4_metrics import Lean4Metrics
 from nemo_skills.evaluation.metrics.math_metrics import MathMetrics
 from nemo_skills.evaluation.metrics.ruler_metrics import RulerMetrics
+from nemo_skills.evaluation.metrics.mrcr_metrics import MRCRMetrics

 METRICS_MAP = {
     "math": MathMetrics,
@@ -33,6 +34,7 @@
     "ruler": RulerMetrics,
     "livecodebench": LiveCodeBenchMetrics,
     "scicode": SciCodeMetrics,
+    "mrcr": MRCRMetrics,
 }

diff --git a/nemo_skills/evaluation/metrics/mrcr_metrics.py b/nemo_skills/evaluation/metrics/mrcr_metrics.py
new file mode 100644
index 0000000000..7b04e0b791
--- /dev/null
+++ b/nemo_skills/evaluation/metrics/mrcr_metrics.py
@@ -0,0 +1,26 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from nemo_skills.evaluation.metrics.base import BaseMetrics + + +class MRCRMetrics(BaseMetrics): + """Metrics for MRCR (Multi-Round Coreference) evaluation.""" + + def _get_score_dict(self, prediction: dict) -> dict[str, bool | int | float]: + return {"accuracy": prediction['seq_match_ratio']} + + def update(self, predictions): + super().update(predictions) + self._compute_pass_at_k(predictions=predictions) \ No newline at end of file From 4de54b1974a0c0a5a32d3b79aae3bdf9b2cabce8 Mon Sep 17 00:00:00 2001 From: Feng Chen <42473790+fchen97@users.noreply.github.com> Date: Mon, 11 Aug 2025 17:03:16 -0700 Subject: [PATCH 04/20] Fix a small bug in generation with chunks (#661) Signed-off-by: Feng Chen <42473790+fchen97@users.noreply.github.com> Signed-off-by: Shubham Toshniwal --- nemo_skills/pipeline/utils/generation.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/nemo_skills/pipeline/utils/generation.py b/nemo_skills/pipeline/utils/generation.py index 45d9f8a372..c122a240cc 100644 --- a/nemo_skills/pipeline/utils/generation.py +++ b/nemo_skills/pipeline/utils/generation.py @@ -76,7 +76,7 @@ def get_remaining_jobs(cluster_config, output_dir, random_seeds, chunk_ids, reru check_commands.append(f'if [ ! 
-f "{unmounted_path}" ]; then echo "MISSING:{seed_str}:{chunk_str}"; fi') # If random_seeds has more than N elements, split commands into groups of N request_size = len(check_commands[0]) // 10 - if len(random_seeds) > request_size: + if len(expected_files) > request_size: outputs = [] for i in range(0, len(check_commands), request_size): group = check_commands[i : i + request_size] From 3b7aa334d2580cbf02e4c5521732f069cc835106 Mon Sep 17 00:00:00 2001 From: fayejf <36722593+fayejf@users.noreply.github.com> Date: Tue, 12 Aug 2025 08:56:42 -0700 Subject: [PATCH 05/20] Small fix for mrcr prepare.py (#662) Signed-off-by: fayejf <36722593+fayejf@users.noreply.github.com> Signed-off-by: Shubham Toshniwal --- nemo_skills/dataset/mrcr/prepare.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/nemo_skills/dataset/mrcr/prepare.py b/nemo_skills/dataset/mrcr/prepare.py index 848d89b3fd..4aea15ca3e 100644 --- a/nemo_skills/dataset/mrcr/prepare.py +++ b/nemo_skills/dataset/mrcr/prepare.py @@ -43,7 +43,7 @@ def count_n_tokens(messages: list[dict]) -> int: def write_data_to_file(output_file, data, max_context_window, needles_subset): with open(output_file, "wt", encoding="utf-8") as fout: for idx, entry in tqdm(enumerate(data), desc=f"Writing {output_file.name}"): - messages = json.loads(entry["prompt"]) + messages = json.loads(entry.pop("prompt")) if entry['n_needles'] not in needles_subset: print(f"Skipping {idx} because it has {entry['n_needles']} needle") @@ -56,7 +56,7 @@ def write_data_to_file(output_file, data, max_context_window, needles_subset): print(f"Skipping {idx} because it has {n_tokens} tokens") continue - entry['messages'] = entry.pop('prompt') + entry['messages'] = messages entry['expected_answer'] = entry.pop('answer') entry['n_tokens'] = n_tokens json.dump(entry, fout) From 13613098b72375bc090152a95e44bc721fe8ab4e Mon Sep 17 00:00:00 2001 From: Igor Gitman Date: Tue, 12 Aug 2025 13:34:55 -0700 Subject: [PATCH 06/20] Fix base checkpoints 
in the docs Signed-off-by: Igor Gitman Signed-off-by: Shubham Toshniwal --- docs/releases/openreasoning/training.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/releases/openreasoning/training.md b/docs/releases/openreasoning/training.md index 7c1cac6b32..d47281bf04 100644 --- a/docs/releases/openreasoning/training.md +++ b/docs/releases/openreasoning/training.md @@ -141,11 +141,12 @@ dataset.to_json("open-reasoning-science-cot.jsonl") We mostly use the same training commands as for [OpenMathReasoning models](../openmathreasoning/training.md#run-training). The only difference is that we pack sequences to 49152 length and use a little different hyperparameters detailed in the following table. +Note that unlike OpenMathReasoning, we are not starting from *Math* models, but are using standard base models for all model sizes. | | **lr** | **min_lr** | **TP** | **PP** | **CP** | | --------------------- | ------ | ---------- | ------ | ------ | ------ | -| **Qwen2.5-Math-1.5B** | 1e-4 | 1e-7 | 1 | 1 | 4 | -| **Qwen2.5-Math-7B** | 1e-4 | 1e-7 | 4 | 1 | 4 | +| **Qwen2.5-1.5B** | 1e-4 | 1e-7 | 1 | 1 | 4 | +| **Qwen2.5-7B** | 1e-4 | 1e-7 | 4 | 1 | 4 | | **Qwen2.5-14B** | 1e-4 | 1e-7 | 8 | 1 | 4 | | **Qwen2.5-32B** | 1e-4 | 1e-7 | 8 | 2 | 4 | From c12e802ac870ead62d720f03cb4cd21895d37a6a Mon Sep 17 00:00:00 2001 From: Igor Gitman Date: Tue, 12 Aug 2025 13:38:56 -0700 Subject: [PATCH 07/20] Fix formatting Signed-off-by: Igor Gitman Signed-off-by: Shubham Toshniwal --- docs/releases/openreasoning/training.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/releases/openreasoning/training.md b/docs/releases/openreasoning/training.md index d47281bf04..ce6ce78e5c 100644 --- a/docs/releases/openreasoning/training.md +++ b/docs/releases/openreasoning/training.md @@ -145,8 +145,8 @@ Note that unlike OpenMathReasoning, we are not starting from *Math* models, but | | **lr** | **min_lr** | **TP** | **PP** | **CP** | | 
--------------------- | ------ | ---------- | ------ | ------ | ------ | -| **Qwen2.5-1.5B** | 1e-4 | 1e-7 | 1 | 1 | 4 | -| **Qwen2.5-7B** | 1e-4 | 1e-7 | 4 | 1 | 4 | +| **Qwen2.5-1.5B** | 1e-4 | 1e-7 | 1 | 1 | 4 | +| **Qwen2.5-7B** | 1e-4 | 1e-7 | 4 | 1 | 4 | | **Qwen2.5-14B** | 1e-4 | 1e-7 | 8 | 1 | 4 | | **Qwen2.5-32B** | 1e-4 | 1e-7 | 8 | 2 | 4 | From 275f196ca870985f36c8b9d8a4d42e6e0327eccc Mon Sep 17 00:00:00 2001 From: Igor Gitman Date: Tue, 12 Aug 2025 18:10:26 -0700 Subject: [PATCH 08/20] Fix type mismatch for max code executions (#665) Signed-off-by: Igor Gitman Signed-off-by: Shubham Toshniwal --- nemo_skills/inference/generate.py | 3 +-- nemo_skills/inference/model/code_execution.py | 4 ---- 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/nemo_skills/inference/generate.py b/nemo_skills/inference/generate.py index 06e1dc9e3e..5b85663812 100644 --- a/nemo_skills/inference/generate.py +++ b/nemo_skills/inference/generate.py @@ -458,8 +458,7 @@ async def process_single_datapoint(self, data_point, all_data): if self.cfg.code_execution: if self.cfg.override_max_code_executions and self.cfg.total_code_executions_in_prompt is not None: - max_code_executions_values = [data_point['total_code_executions']] - generation_params['max_code_executions'] = max_code_executions_values + generation_params['max_code_executions'] = data_point['total_code_executions'] return await self.llm.generate_async(**generation_params) diff --git a/nemo_skills/inference/model/code_execution.py b/nemo_skills/inference/model/code_execution.py index 6be61818e3..499c32e8cc 100644 --- a/nemo_skills/inference/model/code_execution.py +++ b/nemo_skills/inference/model/code_execution.py @@ -13,20 +13,16 @@ # limitations under the License. 
-import asyncio import copy import logging import time -from concurrent.futures import ThreadPoolExecutor from dataclasses import field from nemo_skills.code_execution import extract_code_to_execute, format_code_output from nemo_skills.code_execution.sandbox import Sandbox -from nemo_skills.inference.model.utils import trim_after_stop_phrases from nemo_skills.utils import get_logger_name, nested_dataclass from .base import BaseModel -from .utils import trim_after_stop_phrases LOG = logging.getLogger(get_logger_name(__file__)) From 69049ba781bf59e6acc4cb14e55d55d407e10046 Mon Sep 17 00:00:00 2001 From: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com> Date: Wed, 13 Aug 2025 15:54:28 -0400 Subject: [PATCH 09/20] Allow generation type or custom module in `eval` pipeline (#666) Signed-off-by: Sanyam Kapoor Signed-off-by: Igor Gitman Co-authored-by: Igor Gitman Signed-off-by: Shubham Toshniwal --- nemo_skills/inference/__init__.py | 4 ++++ nemo_skills/inference/factory.py | 14 ++++++++++++++ nemo_skills/inference/generate.py | 2 +- nemo_skills/pipeline/eval.py | 10 ++++++++++ nemo_skills/pipeline/generate.py | 17 +---------------- nemo_skills/pipeline/utils/eval.py | 14 +++++++++++--- 6 files changed, 41 insertions(+), 20 deletions(-) create mode 100644 nemo_skills/inference/factory.py diff --git a/nemo_skills/inference/__init__.py b/nemo_skills/inference/__init__.py index d9155f923f..1cac6632c0 100644 --- a/nemo_skills/inference/__init__.py +++ b/nemo_skills/inference/__init__.py @@ -11,3 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+
+from .factory import GenerationType, GENERATION_MODULE_MAP
+
+__all__ = ["GenerationType", "GENERATION_MODULE_MAP"]
diff --git a/nemo_skills/inference/factory.py b/nemo_skills/inference/factory.py
new file mode 100644
index 0000000000..df9106eabb
--- /dev/null
+++ b/nemo_skills/inference/factory.py
@@ -0,0 +1,14 @@
+from enum import Enum
+
+
+class GenerationType(str, Enum):
+    generate = "generate"
+    math_judge = "math_judge"
+    check_contamination = "check_contamination"
+
+
+GENERATION_MODULE_MAP = {
+    GenerationType.generate: "nemo_skills.inference.generate",
+    GenerationType.math_judge: "nemo_skills.inference.llm_math_judge",
+    GenerationType.check_contamination: "nemo_skills.inference.check_contamination",
+}
diff --git a/nemo_skills/inference/generate.py b/nemo_skills/inference/generate.py
index 5b85663812..e6134f16fb 100644
--- a/nemo_skills/inference/generate.py
+++ b/nemo_skills/inference/generate.py
@@ -24,7 +24,7 @@
 from typing import Any
 
 import hydra
-from omegaconf import ListConfig, OmegaConf, open_dict
+from omegaconf import ListConfig, OmegaConf
 from tqdm import tqdm
 
 from nemo_skills.code_execution.sandbox import get_sandbox, sandbox_params
diff --git a/nemo_skills/pipeline/eval.py b/nemo_skills/pipeline/eval.py
index c66af0d0e9..2ced97e02b 100644
--- a/nemo_skills/pipeline/eval.py
+++ b/nemo_skills/pipeline/eval.py
@@ -22,6 +22,7 @@
 
 import nemo_skills.pipeline.utils as pipeline_utils
 from nemo_skills.dataset.utils import ExtraDatasetType
+from nemo_skills.inference import GenerationType
 from nemo_skills.pipeline.app import app, typer_unpacker
 from nemo_skills.pipeline.generate import generate as _generate
 from nemo_skills.pipeline.utils.eval import prepare_eval_commands
@@ -54,6 +55,13 @@ def eval(
         "If you want to use multiple benchmarks, separate them with comma. E.g. gsm8k:4,human-eval",
     ),
     expname: str = typer.Option("eval", help="Name of the experiment"),
+    generation_type: GenerationType | None = typer.Option(None, help="Type of generation to perform"),
+    generation_module: str = typer.Option(
+        None,
+        help="Path to the generation module to use. "
+        "If not specified, will use the registered generation module for the "
+        "generation type (which is required in this case).",
+    ),
     model: str = typer.Option(None, help="Path to the model to be evaluated"),
     server_address: str = typer.Option(None, help="Address of the server hosting the model"),
     server_type: pipeline_utils.SupportedServers = typer.Option(..., help="Type of server to use"),
@@ -261,6 +269,8 @@ def eval(
         with_sandbox,
         wandb_parameters,
         extra_eval_args,
+        generation_type=generation_type,
+        generation_module=generation_module,
     )
     get_random_port = pipeline_utils.should_get_random_port(server_gpus, exclusive, server_type)
     should_package_extra_datasets = extra_datasets and extra_datasets_type == ExtraDatasetType.local
diff --git a/nemo_skills/pipeline/generate.py b/nemo_skills/pipeline/generate.py
index 0ec64945be..dfa94a8a3d 100644
--- a/nemo_skills/pipeline/generate.py
+++ b/nemo_skills/pipeline/generate.py
@@ -13,7 +13,6 @@
 # limitations under the License.
 import importlib
 import logging
-from enum import Enum
 from typing import List
 
 import typer
@@ -21,26 +20,12 @@
 
 import nemo_skills.pipeline.utils as pipeline_utils
 from nemo_skills.pipeline.app import app, typer_unpacker
 from nemo_skills.utils import compute_chunk_ids, get_logger_name, setup_logging, str_ids_to_list
+from nemo_skills.inference import GenerationType, GENERATION_MODULE_MAP
 
 LOG = logging.getLogger(get_logger_name(__file__))
 
 # TODO: add num_jobs here for consistency with eval?
-
-class GenerationType(str, Enum):
-    generate = "generate"
-    reward = "reward"
-    math_judge = "math_judge"
-    check_contamination = "check_contamination"
-
-
-GENERATION_MODULE_MAP = {
-    GenerationType.generate: "nemo_skills.inference.generate",
-    GenerationType.math_judge: "nemo_skills.inference.llm_math_judge",
-    GenerationType.check_contamination: "nemo_skills.inference.check_contamination",
-}
-
-
 @app.command(context_settings={"allow_extra_args": True, "ignore_unknown_options": True})
 @typer_unpacker
 def generate(
diff --git a/nemo_skills/pipeline/utils/eval.py b/nemo_skills/pipeline/utils/eval.py
index b2d8a30d11..452ae2d16d 100644
--- a/nemo_skills/pipeline/utils/eval.py
+++ b/nemo_skills/pipeline/utils/eval.py
@@ -15,13 +15,13 @@
 import importlib
 import logging
 import os
-from collections import defaultdict
 from copy import deepcopy
 from dataclasses import dataclass, field
 from pathlib import Path
 
 import nemo_skills.pipeline.utils as pipeline_utils
 from nemo_skills.dataset.utils import get_dataset_module
+from nemo_skills.inference import GENERATION_MODULE_MAP
 from nemo_skills.inference.generate import GenerationTask
 from nemo_skills.utils import compute_chunk_ids, get_logger_name
@@ -218,11 +218,19 @@ def prepare_eval_commands(
     with_sandbox,
     wandb_parameters,
     extra_eval_args,
+    generation_type=None,
+    generation_module=None,
 ):
     # TODO: there is a bit too much code duplication here and logic is quite dense, should try to refactor
     # TODO: should we allow setting num chunks per benchmark when not using groups? Maybe benchmark:rs_num:num_chunks?
 
+    if generation_type is not None:
+        if generation_module is not None:
+            raise ValueError("Cannot specify both generation_module and generation_type.")
+
+        generation_module = GENERATION_MODULE_MAP[generation_type]
+
     benchmarks_or_groups = {
         k: int(v) for k, v in [b.split(":") if ":" in b else (b, -1) for b in benchmarks_or_groups.split(",")]
     }
@@ -338,10 +346,10 @@ def prepare_eval_commands(
         for chunk_id in benchmark_chunk_ids:
             job_benchmarks.add(benchmark)
 
-            generation_task = importlib.import_module(benchmark_args.generation_module)
+            generation_task = importlib.import_module(generation_module or benchmark_args.generation_module)
             if not hasattr(generation_task, 'GENERATION_TASK_CLASS'):
                 raise ValueError(
-                    f"Module {benchmark_args.generation_module} does not have a GENERATION_TASK_CLASS attribute. "
+                    f"Module {generation_module or benchmark_args.generation_module} does not have a GENERATION_TASK_CLASS attribute. "
                     "Please provide a valid generation module."
                 )
             generation_task = generation_task.GENERATION_TASK_CLASS

From 64526b70d1eaab1ca4a690b418df93e33e240478 Mon Sep 17 00:00:00 2001
From: Wei Du
Date: Wed, 13 Aug 2025 15:49:23 -0500
Subject: [PATCH 10/20] update grpo with megatron backend (#653)

Signed-off-by: Wei Du
Signed-off-by: Shubham Toshniwal
---
 nemo_skills/pipeline/nemo_rl/grpo.py          | 49 ++++++++++-
 .../training/nemo_rl/configs/grpo.yaml        | 82 ++++++++++++++++++-
 nemo_skills/training/nemo_rl/configs/sft.yaml | 14 ++--
 tests/gpu-tests/test_train.py                 |  6 +-
 4 files changed, 136 insertions(+), 15 deletions(-)

diff --git a/nemo_skills/pipeline/nemo_rl/grpo.py b/nemo_skills/pipeline/nemo_rl/grpo.py
index e8928a3585..70a8c04da6 100644
--- a/nemo_skills/pipeline/nemo_rl/grpo.py
+++ b/nemo_skills/pipeline/nemo_rl/grpo.py
@@ -14,6 +14,7 @@
 import logging
 from dataclasses import dataclass
+from enum import Enum
 from typing import List
 
 import typer
@@ -24,6 +25,7 @@
     add_task,
     check_mounts,
     get_cluster_config,
+    get_env_variables,
     get_exp,
     get_mounted_path,
     get_timeout,
@@ -34,6 +36,10 @@
 
 LOG = logging.getLogger(get_logger_name(__file__))
 
+# Define supported backend options using Enum
+class SupportedBackends(str,
Enum): + fsdp = "fsdp" + megatron = "megatron" @dataclass class NemoRLTask: @@ -49,6 +55,8 @@ class NemoRLTask: wandb_group: str timeout: str log_dir: str + env_variables: dict + backend: str extra_arguments: str = "" def format_train_args(self): @@ -56,9 +64,16 @@ def format_train_args(self): f"++policy.model_name={self.model} " f"++cluster.gpus_per_node={self.num_gpus} " f"++cluster.num_nodes={self.num_nodes} " + f"++checkpointing.checkpoint_must_save_by={self.timeout} " f"++logger.log_dir={self.log_dir} " f"++checkpointing.checkpoint_dir={self.output_dir}/checkpoints " ) + if self.backend == "megatron": + cmd += " ++policy.dtensor_cfg.enabled=false ++policy.megatron_cfg.enabled=true " + cmd += " ++policy.optimizer=None ++policy.dynamic_batching.enabled=false " + else: + cmd += " ++policy.dtensor_cfg.enabled=true ++policy.megatron_cfg.enabled=false " + return cmd def format_data_args(self): @@ -108,6 +123,8 @@ def get_training_cmd( wandb_group, extra_arguments, log_dir, + env_variables, + backend, ): timeout = get_timeout(cluster_config, partition) @@ -125,22 +142,32 @@ def get_training_cmd( timeout=timeout, extra_arguments=extra_arguments, log_dir=log_dir, + env_variables=env_variables, + backend=backend, ) return task.get_cmd() -def get_checkpoint_convert_cmd(output_dir, final_hf_path, step): +def get_checkpoint_convert_cmd(output_dir, final_hf_path, step, backend): cmd = ( f"export PYTHONPATH=$PYTHONPATH:/nemo_run/code && " f"export UV_PROJECT=/opt/NeMo-RL && " f"cd /nemo_run/code && " - f"uv run --active python -m nemo_skills.training.nemo_rl.convert_dcp_to_hf " - f" --training-folder={output_dir} " - f" --hf-ckpt-path={final_hf_path} " ) + if backend == "fsdp": + cmd += "uv run --active python -m nemo_skills.training.nemo_rl.convert_dcp_to_hf " + elif backend == "megatron": + cmd += "uv run --extra mcore python -m nemo_skills.training.nemo_rl.convert_megatron_to_hf " + else: + raise ValueError("Invalid backend: must be 'fsdp' or 'megatron'") + + cmd += f" 
--training-folder={output_dir} " + cmd += f" --hf-ckpt-path={final_hf_path} " + if step is not None: cmd += f" --step {step} " + return cmd @@ -174,6 +201,9 @@ def grpo_nemo_rl( None, help="Can specify if need interactive jobs or a specific non-default partition" ), time_min: str = typer.Option(None, help="If specified, will use as a time-min slurm parameter"), + backend: SupportedBackends = typer.Option( + ..., "--backend", help="Choose backend. Supported options: fsdp, megatron" # Required + ), run_after: List[str] = typer.Option( None, help="Can specify a list of expnames that need to be completed before this one starts" ), @@ -237,6 +267,14 @@ def grpo_nemo_rl( check_mounted_paths=check_mounted_paths, ) + env_variables = get_env_variables(cluster_config) + if backend == "megatron": + if "HF_HOME" not in env_variables: + raise typer.BadParameter( + "Missing required environment variable 'HF_HOME' for 'megatron' backend.\n" + "You can set it in your cluster config like this:\n" + ' env_vars: ["HF_HOME=/your/path/to/hf_home"]' + ) if num_training_jobs > 0: if training_data is None: raise ValueError("training_data is required when num_training_jobs > 0") @@ -262,6 +300,8 @@ def grpo_nemo_rl( wandb_group=wandb_group, extra_arguments=extra_arguments, log_dir=f"{log_dir}/training-logs", + env_variables=env_variables, + backend=backend, ) server_config = None @@ -297,6 +337,7 @@ def grpo_nemo_rl( output_dir=output_dir, final_hf_path=final_hf_path or f"{output_dir}/final_hf_model", step=conversion_step, + backend=backend, ), task_name=f"{expname}-convert-final-ckpt", log_dir=f"{log_dir}/convert-final-ckpt", diff --git a/nemo_skills/training/nemo_rl/configs/grpo.yaml b/nemo_skills/training/nemo_rl/configs/grpo.yaml index dacb91f2a0..b0d34652e5 100644 --- a/nemo_skills/training/nemo_rl/configs/grpo.yaml +++ b/nemo_skills/training/nemo_rl/configs/grpo.yaml @@ -46,14 +46,17 @@ policy: fsdp_offload_enabled: false activation_checkpointing_enabled: false refit_buffer_size_gb: 
4 # used for refitting inference engine, the unit is GB + tensor_model_parallel_size: 1 + pipeline_model_parallel_size: 1 + context_parallel_size: 1 dtensor_cfg: enabled: true cpu_offload: False sequence_parallel: false activation_checkpointing: false - tensor_parallel_size: 1 - context_parallel_size: 1 + tensor_parallel_size: ${policy.tensor_model_parallel_size} + context_parallel_size: ${policy.context_parallel_size} custom_parallel_plan: null # dynamic_batching improves performance by ensuring logprob and training microbatches @@ -61,11 +64,20 @@ policy: # responses are sorted by sequence length and bucketed into microbatches with a total # amount of tokens is approximately close to 'train_mb_tokens' and 'logprob_mb_tokens' for the # training and logprob stages respectively. + # We disable dynamic batching for Megatron as it is incompatible with Pipeline parallelism. + # Instead, we use sequence packing. dynamic_batching: - enabled: false + enabled: False + train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}} + logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}} + sequence_length_round: 64 sequence_packing: - enabled: False + enabled: True + train_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.train_micro_batch_size}} + logprob_mb_tokens: ${mul:${policy.max_total_sequence_length}, ${policy.logprob_batch_size}} + algorithm: "modified_first_fit_decreasing" + sequence_length_round: 64 # makes the training sequence length divisible by the tensor parallel size # this is useful for sequence parallel training @@ -84,6 +96,68 @@ policy: foreach: False fused: False + + megatron_cfg: + enabled: true + empty_unused_memory_level: 0 + activation_checkpointing: false + converter_type: "Qwen2ForCausalLM" + tensor_model_parallel_size: ${policy.tensor_model_parallel_size} + expert_tensor_parallel_size: 1 + expert_model_parallel_size: 1 + pipeline_model_parallel_size: 
${policy.pipeline_model_parallel_size} + num_layers_in_first_pipeline_stage: null + num_layers_in_last_pipeline_stage: null + context_parallel_size: ${policy.context_parallel_size} + pipeline_dtype: ${policy.precision} + sequence_parallel: false + freeze_moe_router: true + moe_router_dtype: "fp64" + moe_router_load_balancing_type: "none" # "seq_aux_loss" causes logprob error divergence for grpo + moe_router_bias_update_rate: 0.0 # by default, disable bias updates for grpo + #gives ~20% training perf speedup with sequence packing + apply_rope_fusion: True + + optimizer: + optimizer: "adam" + lr: 1.0e-6 + min_lr: 1.0e-6 + weight_decay: 0.01 + bf16: true + fp16: false + params_dtype: "float32" + + #adam + adam_beta1: 0.9 + adam_beta2: 0.999 + adam_eps: 1e-8 + + #sgd + sgd_momentum: 0.9 + + #distributed optimizer + use_distributed_optimizer: true + use_precision_aware_optimizer: true + + clip_grad: ${policy.max_grad_norm} + + scheduler: + start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay} + end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay} + weight_decay_incr_style: "constant" + lr_decay_style: "constant" + lr_decay_iters: null + lr_warmup_iters: 0 + lr_warmup_init: 1.0e-6 + + distributed_data_parallel_config: + grad_reduce_in_fp32: false + overlap_grad_reduce: true + overlap_param_gather: true + average_in_collective: true + use_custom_fsdp: false + data_parallel_sharding_strategy: "optim_grads_params" + scheduler: - name: "torch.optim.lr_scheduler.LinearLR" kwargs: diff --git a/nemo_skills/training/nemo_rl/configs/sft.yaml b/nemo_skills/training/nemo_rl/configs/sft.yaml index e15c06d2af..f279c9c2c6 100644 --- a/nemo_skills/training/nemo_rl/configs/sft.yaml +++ b/nemo_skills/training/nemo_rl/configs/sft.yaml @@ -34,14 +34,18 @@ policy: precision: "bfloat16" fsdp_offload_enabled: false activation_checkpointing_enabled: false + tensor_model_parallel_size: 1 + pipeline_model_parallel_size: 1 + context_parallel_size: 1 + dtensor_cfg: enabled: 
true cpu_offload: False sequence_parallel: false activation_checkpointing: false - tensor_parallel_size: 1 - context_parallel_size: 1 + tensor_parallel_size: ${policy.tensor_model_parallel_size} + context_parallel_size: ${policy.context_parallel_size} custom_parallel_plan: null @@ -49,11 +53,11 @@ policy: enabled: false empty_unused_memory_level: 1 activation_checkpointing: false - tensor_model_parallel_size: 1 + tensor_model_parallel_size: ${policy.tensor_model_parallel_size} expert_tensor_parallel_size: 1 expert_model_parallel_size: 1 - pipeline_model_parallel_size: 1 - context_parallel_size: 1 + pipeline_model_parallel_size: ${policy.pipeline_model_parallel_size} + context_parallel_size: ${policy.context_parallel_size} pipeline_dtype: ${policy.precision} num_layers_in_first_pipeline_stage: null num_layers_in_last_pipeline_stage: null diff --git a/tests/gpu-tests/test_train.py b/tests/gpu-tests/test_train.py index b551c77680..bfb1446a4c 100644 --- a/tests/gpu-tests/test_train.py +++ b/tests/gpu-tests/test_train.py @@ -83,7 +83,8 @@ def test_sft_nemo_rl(backend): @pytest.mark.gpu -def test_grpo_nemo_rl(): +@pytest.mark.parametrize("backend", ["fsdp", "megatron"]) +def test_grpo_nemo_rl(backend): model_path = os.getenv('NEMO_SKILLS_TEST_HF_MODEL') if not model_path: pytest.skip("Define NEMO_SKILLS_TEST_HF_MODEL to run this test") @@ -92,7 +93,7 @@ def test_grpo_nemo_rl(): pytest.skip("Define NEMO_SKILLS_TEST_MODEL_TYPE to run this test") prompt_template = 'llama3-instruct' if model_type == 'llama' else 'qwen-instruct' - output_dir = f"/tmp/nemo-skills-tests/{model_type}/test-grpo-nemo-rl" + output_dir = f"/tmp/nemo-skills-tests/{model_type}/test-grpo-nemo-rl/{backend}" # need to clean up current cluster configuration as we mount /tmp and it causes problems docker_rm(['/tmp/ray/ray_current_cluster', output_dir]) @@ -119,6 +120,7 @@ def test_grpo_nemo_rl(): num_gpus=1, num_training_jobs=1, training_data="/nemo_run/code/tests/data/small-grpo-data.test", + 
backend=backend,
+        disable_wandb=True,
     )

From 7f373d89f9dcf3c486b7805f9c555373a6cfca3f Mon Sep 17 00:00:00 2001
From: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com>
Date: Wed, 13 Aug 2025 18:15:12 -0400
Subject: [PATCH 11/20] bugfix: missing generation module arg in eval pipeline
 cmd script (#668)

Signed-off-by: Sanyam Kapoor
Signed-off-by: Shubham Toshniwal
---
 nemo_skills/pipeline/utils/eval.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/nemo_skills/pipeline/utils/eval.py b/nemo_skills/pipeline/utils/eval.py
index 452ae2d16d..48bd223915 100644
--- a/nemo_skills/pipeline/utils/eval.py
+++ b/nemo_skills/pipeline/utils/eval.py
@@ -374,7 +374,7 @@ def prepare_eval_commands(
                 eval_args=f"{benchmark_args.eval_args} {extra_eval_args}",
                 chunk_id=chunk_id,
                 num_chunks=benchmark_args.num_chunks,
-                script=benchmark_args.generation_module,
+                script=generation_module or benchmark_args.generation_module,
                 # only logging for the first seed
                 wandb_parameters=wandb_parameters if seed_idx == 0 else None,
             )

From 463760236856f52e8ee2407f48636397293a8454 Mon Sep 17 00:00:00 2001
From: Wei Du
Date: Wed, 13 Aug 2025 19:14:11 -0500
Subject: [PATCH 12/20] add support for nsys profile (#667)

Signed-off-by: Wei Du
Signed-off-by: Shubham Toshniwal
---
 nemo_skills/pipeline/nemo_rl/grpo.py   | 63 ++++++++++++--------
 nemo_skills/pipeline/nemo_rl/sft.py    | 80 +++++++++++++++-----------
 nemo_skills/pipeline/utils/__init__.py |  1 +
 nemo_skills/pipeline/utils/exp.py      | 12 ++++
 4 files changed, 100 insertions(+), 56 deletions(-)

diff --git a/nemo_skills/pipeline/nemo_rl/grpo.py b/nemo_skills/pipeline/nemo_rl/grpo.py
index 70a8c04da6..7845db6d26 100644
--- a/nemo_skills/pipeline/nemo_rl/grpo.py
+++ b/nemo_skills/pipeline/nemo_rl/grpo.py
@@ -31,6 +31,8 @@
     get_timeout,
     resolve_mount_paths,
     run_exp,
+    temporary_env_update,
+    get_nsight_cmd,
 )
 from nemo_skills.utils import get_logger_name, setup_logging
 
@@ -57,6 +59,7 @@ class NemoRLTask:
     log_dir: str
     env_variables: dict
backend: str + profile_step_range: str extra_arguments: str = "" def format_train_args(self): @@ -94,10 +97,11 @@ def format_wandb_args(self): def get_cmd(self): self.logging_params = self.format_wandb_args() - + nsight_cmd = get_nsight_cmd(self.profile_step_range) cmd = ( f"export PYTHONPATH=$PYTHONPATH:/nemo_run/code:/opt/NeMo-RL && " f"export UV_PROJECT=/opt/NeMo-RL && " + f"{nsight_cmd}" f"echo 'Starting training' && " f"uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/start_grpo.py " f" {self.format_train_args()} " @@ -125,6 +129,7 @@ def get_training_cmd( log_dir, env_variables, backend, + profile_step_range, ): timeout = get_timeout(cluster_config, partition) @@ -144,6 +149,7 @@ def get_training_cmd( log_dir=log_dir, env_variables=env_variables, backend=backend, + profile_step_range=profile_step_range, ) return task.get_cmd() @@ -197,6 +203,12 @@ def grpo_nemo_rl( wandb_project: str = typer.Option("nemo-skills", help="Weights & Biases project name"), wandb_group: str = typer.Option(None, help="Weights & Biases group name."), disable_wandb: bool = typer.Option(False, help="Disable wandb logging"), + profile_step_range: str = typer.Option( + None, + help="Controls which training steps the nsys profiler captures. " + "Format: START:STOP (1-indexed, STOP exclusive, same as slice syntax arr[start:stop]). " + "Example: '3:5' profiles steps 3 and 4 only. NOTE: START must be ≥ 1, so '0:10' is invalid." 
+ ), partition: str = typer.Option( None, help="Can specify if need interactive jobs or a specific non-default partition" ), @@ -302,34 +314,37 @@ def grpo_nemo_rl( log_dir=f"{log_dir}/training-logs", env_variables=env_variables, backend=backend, + profile_step_range=profile_step_range, ) server_config = None + env_update = {"RAY_LOG_SYNC_FREQUENCY": 20} if profile_step_range else {} with get_exp(expname, cluster_config, _reuse_exp) as exp: prev_task = _task_dependencies - for job_id in range(num_training_jobs): - prev_task = add_task( - exp, - cmd=train_cmd, - task_name=f'{expname}-grpo-{job_id}', - log_dir=f"{log_dir}/training-logs", - container=cluster_config["containers"]["nemo-rl"], - num_gpus=num_gpus, - num_nodes=num_nodes, - cluster_config=cluster_config, - server_config=server_config, - partition=partition, - time_min=time_min, - run_after=run_after, - reuse_code=reuse_code, - reuse_code_exp=reuse_code_exp, - task_dependencies=[prev_task] if prev_task is not None else None, - slurm_kwargs={"exclusive": exclusive} if exclusive else None, - heterogeneous=True if server_config is not None else False, - with_sandbox=with_sandbox, - with_ray=True, - installation_command=installation_command, - ) + with temporary_env_update(cluster_config, env_update): + for job_id in range(num_training_jobs): + prev_task = add_task( + exp, + cmd=train_cmd, + task_name=f'{expname}-grpo-{job_id}', + log_dir=f"{log_dir}/training-logs", + container=cluster_config["containers"]["nemo-rl"], + num_gpus=num_gpus, + num_nodes=num_nodes, + cluster_config=cluster_config, + server_config=server_config, + partition=partition, + time_min=time_min, + run_after=run_after, + reuse_code=reuse_code, + reuse_code_exp=reuse_code_exp, + task_dependencies=[prev_task] if prev_task is not None else None, + slurm_kwargs={"exclusive": exclusive} if exclusive else None, + heterogeneous=True if server_config is not None else False, + with_sandbox=with_sandbox, + with_ray=True, + 
installation_command=installation_command, + ) prev_task = add_task( exp, diff --git a/nemo_skills/pipeline/nemo_rl/sft.py b/nemo_skills/pipeline/nemo_rl/sft.py index 350a1ad633..7809c356e3 100644 --- a/nemo_skills/pipeline/nemo_rl/sft.py +++ b/nemo_skills/pipeline/nemo_rl/sft.py @@ -31,6 +31,8 @@ get_timeout, resolve_mount_paths, run_exp, + temporary_env_update, + get_nsight_cmd, ) from nemo_skills.utils import get_logger_name, setup_logging @@ -59,6 +61,7 @@ class NemoRLTask: log_dir: str env_variables: dict backend: str + profile_step_range: str extra_arguments: str = "" def format_train_args(self): @@ -94,20 +97,22 @@ def format_wandb_args(self): def get_cmd(self): self.logging_params = self.format_wandb_args() + + nsight_cmd = get_nsight_cmd(self.profile_step_range) cmd = ( - f"export PYTHONPATH=$PYTHONPATH:/nemo_run/code:/opt/NeMo-RL && " - f"export UV_PROJECT=/opt/NeMo-RL && " - f"echo 'Starting training' && " - f"NRL_FORCE_REBUILD_VENVS=true uv run --active python /nemo_run/code/nemo_skills/training/nemo_rl/start_sft.py " - f" {self.format_train_args()} " - f" {self.format_data_args()} " - f" {self.logging_params} " - f" {self.extra_arguments} " + "export PYTHONPATH=$PYTHONPATH:/nemo_run/code:/opt/NeMo-RL && " + "export UV_PROJECT=/opt/NeMo-RL && " + f"{nsight_cmd}" + "echo 'Starting training' && " + "NRL_FORCE_REBUILD_VENVS=true uv run --active " + "python /nemo_run/code/nemo_skills/training/nemo_rl/start_sft.py " + f"{self.format_train_args()} {self.format_data_args()} " + f"{self.logging_params} {self.extra_arguments}" ) - return cmd + def get_training_cmd( cluster_config, partition, @@ -125,6 +130,7 @@ def get_training_cmd( log_dir, env_variables, backend, + profile_step_range, ): timeout = get_timeout(cluster_config, partition) @@ -144,6 +150,7 @@ def get_training_cmd( log_dir=log_dir, env_variables=env_variables, backend=backend, + profile_step_range=profile_step_range, ) return task.get_cmd() @@ -197,6 +204,12 @@ def sft_nemo_rl( wandb_project: str = 
typer.Option("nemo-skills", help="Weights & Biases project name"), wandb_group: str = typer.Option(None, help="Weights & Biases group name."), disable_wandb: bool = typer.Option(False, help="Disable wandb logging"), + profile_step_range: str = typer.Option( + None, + help="Controls which training steps the nsys profiler captures. " + "Format: START:STOP (1-indexed, STOP exclusive, same as slice syntax arr[start:stop]). " + "Example: '3:5' profiles steps 3 and 4 only. NOTE: START must be ≥ 1, so '0:10' is invalid." + ), partition: str = typer.Option( None, help="Can specify if need interactive jobs or a specific non-default partition" ), @@ -301,34 +314,37 @@ def sft_nemo_rl( log_dir=f"{log_dir}/training-logs", env_variables=env_variables, backend=backend, + profile_step_range=profile_step_range, ) server_config = None + env_update = {"RAY_LOG_SYNC_FREQUENCY": 20} if profile_step_range else {} with get_exp(expname, cluster_config, _reuse_exp) as exp: prev_task = _task_dependencies - for job_id in range(num_training_jobs): - prev_task = add_task( - exp, - cmd=train_cmd, - task_name=f'{expname}-sft-{job_id}', - log_dir=f"{log_dir}/training-logs", - container=cluster_config["containers"]["nemo-rl"], - num_gpus=num_gpus, - num_nodes=num_nodes, - cluster_config=cluster_config, - server_config=server_config, - partition=partition, - time_min=time_min, - run_after=run_after, - reuse_code=reuse_code, - reuse_code_exp=reuse_code_exp, - task_dependencies=[prev_task] if prev_task is not None else None, - slurm_kwargs={"exclusive": exclusive} if exclusive else None, - heterogeneous=True if server_config is not None else False, - with_sandbox=False, - with_ray=True, - installation_command=installation_command, - ) + with temporary_env_update(cluster_config, env_update): + for job_id in range(num_training_jobs): + prev_task = add_task( + exp, + cmd=train_cmd, + task_name=f'{expname}-sft-{job_id}', + log_dir=f"{log_dir}/training-logs", + 
container=cluster_config["containers"]["nemo-rl"], + num_gpus=num_gpus, + num_nodes=num_nodes, + cluster_config=cluster_config, + server_config=server_config, + partition=partition, + time_min=time_min, + run_after=run_after, + reuse_code=reuse_code, + reuse_code_exp=reuse_code_exp, + task_dependencies=[prev_task] if prev_task is not None else None, + slurm_kwargs={"exclusive": exclusive} if exclusive else None, + heterogeneous=True if server_config is not None else False, + with_sandbox=False, + with_ray=True, + installation_command=installation_command, + ) prev_task = add_task( exp, diff --git a/nemo_skills/pipeline/utils/__init__.py b/nemo_skills/pipeline/utils/__init__.py index 531fcbec11..a82e085dd7 100644 --- a/nemo_skills/pipeline/utils/__init__.py +++ b/nemo_skills/pipeline/utils/__init__.py @@ -38,6 +38,7 @@ get_exp_handles, get_sandbox_command, run_exp, + get_nsight_cmd, ) from nemo_skills.pipeline.utils.generation import ( configure_client, diff --git a/nemo_skills/pipeline/utils/exp.py b/nemo_skills/pipeline/utils/exp.py index 2c0172be65..f0b4e35037 100644 --- a/nemo_skills/pipeline/utils/exp.py +++ b/nemo_skills/pipeline/utils/exp.py @@ -633,3 +633,15 @@ def get_exp(expname, cluster_config, _reuse_exp=None): if cluster_config['executor'] == 'local': return run.Experiment(expname, clean_mode=True) return run.Experiment(expname, clean_mode=True, log_level="WARN") + + +def get_nsight_cmd(profile_step_range): + cmd = '' + if profile_step_range is not None: + cmd = ( + f'export LD_LIBRARY_PATH="/usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu" && ' + f"export NRL_NSYS_PROFILE_STEP_RANGE={profile_step_range} && " + 'export NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" && ' + + ) + return cmd \ No newline at end of file From 017d945ebf92de66015aa6df86cd98beff322ea5 Mon Sep 17 00:00:00 2001 From: Shubham Toshniwal Date: Thu, 14 Aug 2025 00:07:31 -0400 Subject: [PATCH 13/20] Fixing BFCL (#669) 
Signed-off-by: Shubham Toshniwal --- nemo_skills/inference/eval/bfcl.py | 36 ++++++++++++++++-------------- 1 file changed, 19 insertions(+), 17 deletions(-) diff --git a/nemo_skills/inference/eval/bfcl.py b/nemo_skills/inference/eval/bfcl.py index 52126e2745..53b845ab89 100644 --- a/nemo_skills/inference/eval/bfcl.py +++ b/nemo_skills/inference/eval/bfcl.py @@ -21,7 +21,6 @@ import hydra import openai - from omegaconf import OmegaConf from nemo_skills.dataset.bfcl_v3.utils import convert_to_tool, func_doc_language_specific_pre_processing @@ -121,11 +120,11 @@ def _get_disallowed_params(self): class BFCLGenerationTask(GenerationTask): def __init__(self, cfg: BFCLGenerationConfig): super().__init__(cfg) - + def log_example_prompt(self, data): """BFCL is a multi-turn benchmark, so we can't print a single prompt.""" return - + def setup_prompt(self): return None @@ -137,7 +136,7 @@ async def _generate_single_assistant_turn(self, inference_state_dict): if self.cfg.system_message: messages = [{"role": "system", "content": self.cfg.system_message}] + messages - # Step 1: Construct the prompt + # Step 1: Construct the prompt if self.cfg.use_client_parsing: fmted_prompt = self.cfg.message_formatter(messages, tools=tools) input_dict = { @@ -149,21 +148,24 @@ async def _generate_single_assistant_turn(self, inference_state_dict): else: input_dict = { "prompt": messages, - "tools": [tools], + "tools": tools, "include_response": True, **asdict(self.cfg.inference), **self.extra_generate_params, } - # Step 2: Query the LLM server # Enable soft-fail when the models run out of context try: output = await self.llm.generate_async(**input_dict) # TODO: Currently we're assuming an openai interface which is not true for all servers except openai.BadRequestError as e: - if "Requested token count exceeds the model's maximum context length" in str(e) or "is longer than the model's context length" in str(e): - LOG.warning("BFCL generation failed due to running out of context. 
") + error_str = str(e) + context_error = "is longer than the model's context length" in error_str + token_error = "Requested token count exceeds model's maximum context length" in error_str + + if context_error or token_error: + LOG.warning(f"BFCL generation failed due to running out of context. {error_str}") return {"message": None, "generation": ""} else: raise @@ -187,10 +189,12 @@ async def _generate_single_assistant_turn(self, inference_state_dict): "tool_calls": parsed_response.get("tool_calls", []), "num_generated_tokens": output["num_generated_tokens"], } - else: - if "tool_calls" not in output: - output["tool_calls"] = [] + else: output["message"] = output["response"].choices[0].message + output["tool_calls"] = [] + if output["message"].tool_calls: + output["tool_calls"] = output["message"].tool_calls + return output async def generate_single_data_point_single_turn(self, data_point): @@ -198,9 +202,10 @@ async def generate_single_data_point_single_turn(self, data_point): state_dict = {"messages": data_point["question"][0], "tools": data_point["tools"]} model_response = await self._generate_single_assistant_turn(state_dict) + if model_response["message"] is None: # Ran out of context - return {"generation": "", "num_generated_tokens": 0, "error": "_ran_out_of_context_"} + return {"generation": "", "num_generated_tokens": 0, "error": "_ran_out_of_context_"} else: proc_model_response = self._process_model_response(model_response) return { @@ -319,14 +324,11 @@ async def generate_single_data_point_multi_turn(self, data_point): if force_quit or out_of_context: break - output_dict = { - "generation": all_model_response, - "num_generated_tokens": output_dict["num_generated_tokens"] - } + output_dict["generation"] = all_model_response if out_of_context: output_dict["error"] = "_ran_out_of_context_" - + return output_dict async def process_single_datapoint(self, data_point, all_data): From 2cf6a2f2744cfe86690bce8b372ed116254afbc9 Mon Sep 17 00:00:00 2001 From: 
Shubham Toshniwal Date: Thu, 14 Aug 2025 15:00:30 -0400 Subject: [PATCH 14/20] Minor fixes to dataset defaults (#672) Signed-off-by: Shubham Toshniwal --- nemo_skills/dataset/hle/__init__.py | 12 +++++++++--- nemo_skills/dataset/hmmt_feb25/prepare.py | 1 - nemo_skills/inference/model/azure.py | 3 ++- 3 files changed, 11 insertions(+), 5 deletions(-) diff --git a/nemo_skills/dataset/hle/__init__.py b/nemo_skills/dataset/hle/__init__.py index 435975c67f..a76ad58e49 100644 --- a/nemo_skills/dataset/hle/__init__.py +++ b/nemo_skills/dataset/hle/__init__.py @@ -20,7 +20,13 @@ GENERATION_ARGS = "" EVAL_SPLIT = "text" -# some answers are not possible to compare symbolically, so have to use a judge model -# setting openai judge by default, but can be overriden from command line for a locally hosted model -JUDGE_PIPELINE_ARGS = {"model": "gpt-4.1", "server_type": "openai", "server_address": "https://api.openai.com/v1"} +# Some answers are not possible to compare symbolically, so have to use a judge model +# Setting openai judge by default, but can be overridden from command line for a locally hosted model +# Currently using o3-mini-20250131 which is used by the official leaderboard - https://agi.safe.ai/ +# To approximate the Artificial Analysis Index results, we suggest using gpt-4o - https://artificialanalysis.ai/methodology/intelligence-benchmarking#evaluation-suite-details +JUDGE_PIPELINE_ARGS = { + "model": "o3-mini-20250131", + "server_type": "openai", + "server_address": "https://api.openai.com/v1", +} JUDGE_ARGS = "++prompt_config=judge/hle ++generation_key=judgement ++add_generation_stats=False" diff --git a/nemo_skills/dataset/hmmt_feb25/prepare.py b/nemo_skills/dataset/hmmt_feb25/prepare.py index c5622dbbbf..3e21c8aeb7 100644 --- a/nemo_skills/dataset/hmmt_feb25/prepare.py +++ b/nemo_skills/dataset/hmmt_feb25/prepare.py @@ -22,7 +22,6 @@ def write_data_to_file(output_file, data): with open(output_file, "wt", encoding="utf-8") as fout: for entry in tqdm(data,
desc=f"Writing {output_file.name}"): - print(entry) entry['expected_answer'] = entry.pop('answer') json.dump(entry, fout) fout.write("\n") diff --git a/nemo_skills/inference/model/azure.py b/nemo_skills/inference/model/azure.py index 2af5643856..bede01e570 100644 --- a/nemo_skills/inference/model/azure.py +++ b/nemo_skills/inference/model/azure.py @@ -13,6 +13,7 @@ # limitations under the License. import os + from .openai import OpenAIModel @@ -23,7 +24,7 @@ def __init__( self, *args, api_key: str | None = None, - api_version: str = "2024-02-15-preview", + api_version: str = "2024-12-01-preview", **kwargs, ): if api_key is None: From de2c9a444e8c1b4bc406beb9b1eef34de9769d04 Mon Sep 17 00:00:00 2001 From: Igor Gitman Date: Thu, 14 Aug 2025 16:12:40 -0700 Subject: [PATCH 15/20] Enable system_message for openai prompt format (#670) Signed-off-by: Igor Gitman Signed-off-by: Shubham Toshniwal --- nemo_skills/inference/generate.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/nemo_skills/inference/generate.py b/nemo_skills/inference/generate.py index e6134f16fb..c517568157 100644 --- a/nemo_skills/inference/generate.py +++ b/nemo_skills/inference/generate.py @@ -178,7 +178,6 @@ def _post_init_validate_params(self): if self.prompt_format == "openai": assert self.prompt_config is None, "prompt_config is not supported for prompt_format == 'openai'" assert self.prompt_template is None, "prompt_template is not supported for prompt_format == 'openai'" - assert self.system_message is None, "system_message is not supported for prompt_format == 'openai'" else: assert self.prompt_config is not None, "prompt_config is required when prompt_format == 'ns'" for param, default_value in self._get_disallowed_params(): @@ -305,7 +304,7 @@ def log_example_prompt(self, data): if self.cfg.prompt_format == "openai": # print the prompt in openai format - LOG.info("Example prompt in OpenAI format: \nData dictionary: %s", data_point) + LOG.info("Example prompt in 
OpenAI format: %s", self.fill_prompt(data_point, data)) return if self.cfg.multi_turn_key is None: @@ -388,6 +387,11 @@ def fill_prompt(self, data_point, data): if self.cfg.prompt_format == "openai": if self.cfg.prompt_suffix: data_point["messages"][-1]["content"] += self.cfg.prompt_suffix + if self.cfg.system_message: + if data_point["messages"][0]["role"] != "system": + data_point["messages"].insert(0, {"role": "system", "content": self.cfg.system_message}) + else: + data_point["messages"][0]["content"] = self.cfg.system_message return data_point["messages"] total_code_executions_in_prompt = self.cfg.total_code_executions_in_prompt From 822dddd6c45a19b375bb67175da55d0439559501 Mon Sep 17 00:00:00 2001 From: Shubham Toshniwal Date: Fri, 15 Aug 2025 14:00:21 -0400 Subject: [PATCH 16/20] Reasoning off results Signed-off-by: Shubham Toshniwal Signed-off-by: Shubham Toshniwal --- .../posts/llama-nemotron-super-v1.5-evals.md | 222 ++++++++++++++++-- 1 file changed, 201 insertions(+), 21 deletions(-) diff --git a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md index ad3e72ac2c..e6a604e8df 100644 --- a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md +++ b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md @@ -1,12 +1,12 @@ --- -date: 2025-08-12 -readtime: 20 # TODO: Revisit this number +date: 2025-08-15 +readtime: 15 --- # Reproducing Llama-Nemotron-Super-49B-V1.5 Evals In this tutorial, we will reproduce the evals for the Llama-3.3-Nemotron-Super-49B-v1.5 model using NeMo-Skills. -For an introduction to the NeMo-Skills framework, we recommed going over [our introductory tutorial](./omr-simple-recipe.md). +For an introduction to the NeMo-Skills framework, we recommend going over [our introductory tutorial](./omr-simple-recipe.md). 
We assume you have `/workspace` defined in your [cluster config](../../basics/cluster-configs.md) and are @@ -128,7 +128,7 @@ ns eval \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.6 \ ++inference.top_p=0.95 \ - ++system_message='' \ + ++system_message='' ``` !!! note @@ -138,27 +138,26 @@ ns eval \ If the OpenAI API throws the `Rate limit exceeded` error, please reduce the `max_concurrent_requests` value in the `extra_judge_args` argument and restart the job. -### Command for BFCL Eval (Reasoning on) +#### Command for BFCL Eval (Reasoning on) Tool-calling benchmarks require tool-call parsing and execution. NeMo-Skills supports both client-side parsing (default) and server-side parsing. For server-side parsing, the vLLM server requires the parsing details as highlighted in the below command: ```bash hl_lines="13-17" ns eval \ - --cluster=local \ - --benchmarks=bfcl_v3 \ - --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \ - --server_gpus=8 \ - --server_type=vllm \ - --num_jobs=1 \ - --output_dir=/workspace/llama_nemotron_49b_1_5_tool_calling/ \ - ++inference.tokens_to_generate=65536 \ - ++inference.temperature=0.6 \ - ++inference.top_p=0.95 \ - ++system_message='' \ - ++use_client_parsing=False \ - --server_args="--tool-parser-plugin \"/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/llama_nemotron_toolcall_parser_no_streaming.py\" \ - --tool-call-parser \"llama_nemotron_json\" \ - --enable-auto-tool-choice" - + --cluster=local \ + --benchmarks=bfcl_v3 \ + --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \ + --server_gpus=8 \ + --server_type=vllm \ + --num_jobs=1 \ + --output_dir=/workspace/llama_nemotron_49b_1_5_tool_calling/ \ + ++inference.tokens_to_generate=65536 \ + ++inference.temperature=0.6 \ + ++inference.top_p=0.95 \ + ++system_message='' \ + ++use_client_parsing=False \ + --server_args="--tool-parser-plugin \"/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/llama_nemotron_toolcall_parser_no_streaming.py\" \ + --tool-call-parser 
\"llama_nemotron_json\" \ + --enable-auto-tool-choice" ``` @@ -196,6 +195,9 @@ majority@15 | 2158 | 12111 | 7782 | 4.31% | 3.4 pass@15 | 2158 | 12111 | 7782 | 27.80% | 10.10% | 49.91% ``` +!!!note + The `majority` metric for most reasoning benchmarks typically improves over the corresponding `pass@1` numbers. For HLE, the `majority` number is lower than `pass@1` which can be counterintuitive but it has to with our metric calculation logic. For HLE, the final answer is contained in the generated solution but it is not easily extractable by rule-based systems as in the case of math where the model is instructed to put the final answer in \boxed{}. Thus, for certain questions the `predicted_answer` field is null but the LLM-as-a-judge is still able to evaluate the generated solution. The majority metric performs clustering over `predicted_answer` which currently incorrectly removes from consideration some of the correct solutions for which the `predicted_answer` is None. + #### Results for Code Reasoning benchmarks (Reasoning on) @@ -258,3 +260,181 @@ pass@16 | 30 | 23366 | 832 | 93.33% | Currently `summarize_results` doesn't support benchmarks like BFCL v3 which have their specific logic of combining subset scores to arrive at the overall score. This table was created by formatting the `metrics.json` file from `/workspace/llama_nemotron_49b_1_5_tool_calling/bfcl_v3/metrics.json` using an LLM. + +### Reasoning-off Evals + +For the non-reasoning mode evals, we follow the recommended recipe of setting: + +- temperature to 0.0 +- top-p to 1.0 +- system_message to '/no_think' +- keep the maximum number of generated tokens to 65536 + +#### Command for Math, Code, and Science Reasoning Eval (Reasoning off) + +The following command evaluates the model on GPQA, MMLU-Pro, Scicode, MATH-500, AIME24, and AIME25 across 16 different runs for all benchmarks. 
We have highlighted the inference settings recommended above in the following command: + + +```bash hl_lines="10-13" +ns eval \ + --cluster=local \ + --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \ + --server_type=vllm \ + --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off/ \ + --benchmarks=gpqa:16,mmlu-pro:16,scicode:16,math-500:16,aime24:16,aime25:16 \ + --server_gpus=8 \ + --num_jobs=1 \ + ++inference.tokens_to_generate=65536 \ + ++inference.temperature=0.0 \ + ++inference.top_p=1.0 \ + ++system_message='/no_think' +``` + +For LiveCodeBench, the command is: + +```bash +ns eval \ + --cluster=local \ + --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \ + --server_type=vllm \ + --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off/ \ + --benchmarks=livecodebench:16 \ + --split=test_v5_2410_2502 \ + --server_gpus=8 \ + --num_jobs=1 \ + ++inference.tokens_to_generate=65536 \ + ++inference.temperature=0.0 \ + ++inference.top_p=1.0 \ + ++system_message='/no_think' +``` + +#### Command for HLE Eval (Reasoning off) + + +```bash +ns eval \ + --cluster=local \ + --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \ + --server_type=vllm \ + --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off/ \ + --benchmarks=hle:16 \ + --server_gpus=8 \ + --num_jobs=1 \ + --judge_model="o3-mini-20250131" \ + --extra_judge_args="++inference.tokens_to_generate=4096 ++max_concurrent_requests=8" \ + ++inference.tokens_to_generate=65536 \ + ++inference.temperature=0.0 \ + ++inference.top_p=1.0 \ + ++system_message='/no_think' +``` + +#### Command for BFCL Eval (Reasoning off) + +```bash +ns eval \ + --cluster=local \ + --benchmarks=bfcl_v3 \ + --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \ + --server_gpus=8 \ + --server_type=vllm \ + --num_jobs=1 \ + --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off_tool_calling/ \ + ++inference.tokens_to_generate=65536 \ + ++inference.temperature=0.0 \ + ++inference.top_p=1.0 \ + 
++system_message='/no_think' \ + ++use_client_parsing=False \ + --server_args="--tool-parser-plugin \"/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/llama_nemotron_toolcall_parser_no_streaming.py\" \ + --tool-call-parser \"llama_nemotron_json\" \ + --enable-auto-tool-choice" +``` + + +### Reasoning-off Results + + +We use the `summarize_results` on the reasoning_off results directory as follows: + +```bash +ns summarize_results --cluster=local /workspace/llama_nemotron_49b_1_5_reasoning_off/eval-results/{BENCHMARK} +``` + + +#### Results for Science & General Reasoning benchmarks (Reasoning off) + +``` +------------------------------------------ gpqa ----------------------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer +pass@1[avg-of-16] | 198 | 853 | 1552 | 51.61% | 0.25% +majority@16 | 198 | 853 | 1552 | 52.53% | 0.00% +pass@16 | 198 | 853 | 1552 | 74.75% | 0.00% + +---------------------------------------- mmlu-pro --------------------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer +pass@1[avg-of-16] | 12032 | 625 | 5684 | 69.19% | 0.34% +majority@16 | 12032 | 625 | 5684 | 69.94% | 0.01% +pass@16 | 12032 | 625 | 5684 | 77.67% | 0.01% + +-------------------------------------------------- hle -------------------------------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | judge_correct | symbolic_correct | no_answer +pass@1[avg-of-16] | 2158 | 1349 | 2667 | 3.92% | 1.30% | 59.09% +majority@16 | 2158 | 1349 | 2667 | 1.53% | 1.44% | 47.03% +pass@16 | 2158 | 1349 | 2667 | 12.09% | 3.29% | 47.03% +``` + + +#### Results for Code Reasoning benchmarks (Reasoning off) + +``` +--------------------------- livecodebench --------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | accuracy +pass@1[avg-of-16] | 166 | 609 | 1156 | 29.89% +pass@16 | 166 | 609 | 1156 | 33.73% + 
+--------------------------------------------------- scicode ---------------------------------------------------- +evaluation_mode | avg_tokens | gen_seconds | problem_accuracy | subtask_accuracy | num_problems | num_subtasks +pass@1[avg-of-16] | 3067 | 66547 | 0.00% | 19.44% | 65 | 288 +pass@16 | 3067 | 66547 | 0.00% | 29.51% | 65 | 288 +``` + +#### Results for Math Reasoning benchmarks (Reasoning off) + +``` +---------------------------------------- math-500 --------------------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer +pass@1[avg-of-16] | 500 | 765 | 1185 | 75.55% | 0.26% +majority@16 | 500 | 765 | 1185 | 76.00% | 0.00% +pass@16 | 500 | 765 | 1185 | 84.00% | 0.00% + +----------------------------------------- aime24 ---------------------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer +pass@1[avg-of-16] | 30 | 3611 | 1165 | 16.88% | 3.75% +majority@16 | 30 | 3611 | 1165 | 16.67% | 0.00% +pass@16 | 30 | 3611 | 1165 | 33.33% | 0.00% + +----------------------------------------- aime25 ---------------------------------------- +evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer +pass@1[avg-of-16] | 30 | 1720 | 1149 | 5.42% | 1.25% +majority@16 | 30 | 1720 | 1149 | 6.67% | 0.00% +pass@16 | 30 | 1720 | 1149 | 10.00% | 0.00% +``` + +#### Results for Tool Calling (Reasoning off) + + +``` +----------------------- bfcl_v3 ------------------------ +| Category | num_entries | accuracy | +|-----------------------------|-------------|----------| +| overall_accuracy | 4441 | 68.52% | +| overall_non_live | 1390 | 87.55% | +| non_live_ast | 1150 | 87.35% | +| irrelevance | 240 | 88.33% | +| overall_live | 2251 | 81.87% | +| live_ast | 1351 | 79.79% | +| live_irrelevance | 882 | 85.60% | +| live_relevance | 18 | 55.56% | +| overall_multi_turn | 800 | 36.13% | +``` + +The reasoning-on vs reasoning-off comparison shows 
inference-time scaling's impact: higher accuracy at the cost of more tokens and longer generation times. \ No newline at end of file From ac7d4e176b0b4d06e95adb1a80a8f3bd436880bf Mon Sep 17 00:00:00 2001 From: Shubham Toshniwal Date: Fri, 15 Aug 2025 11:05:19 -0700 Subject: [PATCH 17/20] Precommit hook Signed-off-by: Shubham Toshniwal --- docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md index e6a604e8df..d766d52871 100644 --- a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md +++ b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md @@ -3,6 +3,7 @@ date: 2025-08-15 readtime: 15 --- + # Reproducing Llama-Nemotron-Super-49B-V1.5 Evals In this tutorial, we will reproduce the evals for the Llama-3.3-Nemotron-Super-49B-v1.5 model using NeMo-Skills. @@ -437,4 +438,4 @@ pass@16 | 30 | 1720 | 1149 | 10.00% | | overall_multi_turn | 800 | 36.13% | ``` -The reasoning-on vs reasoning-off comparison shows inference-time scaling's impact: higher accuracy at the cost of more tokens and longer generation times. \ No newline at end of file +The reasoning-on vs reasoning-off comparison shows inference-time scaling's impact: higher accuracy at the cost of more tokens and longer generation times. 
From 587279484bfa40e2b4ce60fb809e2144e3ca6810 Mon Sep 17 00:00:00 2001 From: Shubham Toshniwal Date: Fri, 15 Aug 2025 16:48:54 -0400 Subject: [PATCH 18/20] Update docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md Co-authored-by: Igor Gitman Signed-off-by: Shubham Toshniwal --- docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md index d766d52871..e2b21587ec 100644 --- a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md +++ b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md @@ -258,7 +258,7 @@ pass@16 | 30 | 23366 | 832 | 93.33% | ``` !!! note - Currently `summarize_results` doesn't support benchmarks like BFCL v3 which have their specific logic of combining subset scores to arrive at the overall score. This table was created by formatting the `metrics.json` file from `/workspace/llama_nemotron_49b_1_5_tool_calling/bfcl_v3/metrics.json` using an LLM. + Currently `summarize_results` doesn't support benchmarks like BFCL v3 which have their specific logic of combining subset scores to arrive at the overall score. This table was created by formatting the `metrics.json` file from `/workspace/llama_nemotron_49b_1_5_tool_calling/bfcl_v3/metrics.json`. 
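Since `summarize_results` does not cover BFCL v3, the `metrics.json` formatting mentioned in the note above can also be scripted instead of delegated to an LLM. The sketch below is a minimal, hypothetical formatter: it assumes a flat `{category: {"num_entries": ..., "accuracy": ...}}` layout, which may differ from the actual structure of the file.

```python
import json


def format_bfcl_metrics(metrics: dict) -> str:
    """Render a category -> stats mapping as a markdown table.

    Assumes a flat {category: {"num_entries": int, "accuracy": float}} layout;
    the real metrics.json may be nested differently.
    """
    lines = [
        "| Category | num_entries | accuracy |",
        "|----------|-------------|----------|",
    ]
    for category, stats in metrics.items():
        lines.append(f"| {category} | {stats['num_entries']} | {stats['accuracy']:.2f}% |")
    return "\n".join(lines)


if __name__ == "__main__":
    # In practice the dict would come from something like:
    # with open(".../bfcl_v3/metrics.json") as f: metrics = json.load(f)
    example = {
        "overall_accuracy": {"num_entries": 4441, "accuracy": 73.42},
        "overall_multi_turn": {"num_entries": 800, "accuracy": 36.13},
    }
    print(format_bfcl_metrics(example))
```

If the actual file is nested, only the iteration over `metrics.items()` would need adapting; the table rendering stays the same.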
From 2d491a647d92380dcb17e38a0f8e8976b2eb7aaf Mon Sep 17 00:00:00 2001 From: Shubham Toshniwal Date: Fri, 15 Aug 2025 18:40:03 -0400 Subject: [PATCH 19/20] Resolving comments --- .../posts/llama-nemotron-super-v1.5-evals.md | 59 +++++++++---------- 1 file changed, 29 insertions(+), 30 deletions(-) diff --git a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md index e2b21587ec..aaf746bed7 100644 --- a/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md +++ b/docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md @@ -7,7 +7,7 @@ readtime: 15 # Reproducing Llama-Nemotron-Super-49B-V1.5 Evals In this tutorial, we will reproduce the evals for the Llama-3.3-Nemotron-Super-49B-v1.5 model using NeMo-Skills. -For an introduction to the NeMo-Skills framework, we recommend going over [our introductory tutorial](./omr-simple-recipe.md). +For an introduction to the NeMo-Skills framework, we recommend going over [our introductory tutorial](../../basics/index.md). We assume you have `/workspace` defined in your [cluster config](../../basics/cluster-configs.md) and are @@ -17,12 +17,15 @@ executing all commands from that folder locally. Change all commands accordingly ## Download the model -Get the model from HF. +Get the model from HF. ```bash pip install -U "huggingface_hub[cli]" huggingface-cli download nvidia/Llama-3_3-Nemotron-Super-49B-v1_5 --local-dir /workspace/Llama-3_3-Nemotron-Super-49B-v1_5 ``` +!!!note + In most cases, we can define `HF_HOME` in the cluster config to a mounted directory, and refer to models by their huggingface names such as `nvidia/Llama-3_3-Nemotron-Super-49B-v1_5` in this case. However, in this example, we download the model to an explicit location because we rely on the tool parsing script which is part of the huggingface repo. 
Alternatively, users can download the model to `HF_HOME` and separately download the [tool parsing script](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5/blob/main/llama_nemotron_toolcall_parser_no_streaming.py){target="_blank"} to another mounted location. + ## Prepare evaluation data We will evaluate the model on the following: @@ -59,7 +62,7 @@ We detail the evaluation commands and results for both the modes. Note that you might not get exactly the same numbers as reported here because of the stochastic nature of LLM generations. !!! note - The commands provided here assume you're working with a local machine where benchmarks/subsets are evaluated sequentially which will take a very long time. If running on slurm, you can set `--num_jobs` to a bigger number or just set it to -1 to run each benchmark and their random seeds as an independent job which in case of Llama-Nemotron-Super-49B-V1.5 requires one node per job. + The commands provided here assume you're working with a local machine where benchmarks/subsets are evaluated sequentially, which will take a very long time. If running on slurm, by default each benchmark and each of its random seeds is run as an independent job. @@ -77,15 +80,14 @@ For the reasoning mode evals, we follow the recommended recipe of setting: The following command evaluates the model on GPQA, MMLU-Pro, Scicode, MATH-500, AIME24, and AIME25 across 16 different runs for all benchmarks.
We have highlighted the inference settings recommended above in the following command: -```bash hl_lines="9-13" +```bash hl_lines="8-12" ns eval \ --cluster=local \ --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \ --server_type=vllm \ --output_dir=/workspace/llama_nemotron_49b_1_5/ \ --benchmarks=gpqa:16,mmlu-pro:16,scicode:16,math-500:16,aime24:16,aime25:16 \ - --server_gpus=8 \ - --num_jobs=1 \ + --server_gpus=2 \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.6 \ ++inference.top_p=0.95 \ @@ -102,8 +104,7 @@ ns eval \ --output_dir=/workspace/llama_nemotron_49b_1_5/ \ --benchmarks=livecodebench:16 \ --split=test_v5_2410_2502 \ - --server_gpus=8 \ - --num_jobs=1 \ + --server_gpus=2 \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.6 \ ++inference.top_p=0.95 \ @@ -113,17 +114,19 @@ ns eval \ #### Command for HLE Eval (Reasoning on) -For HLE, because symbolic comparison is not sufficient to determine the correctness of the output, we use the recommended `o3-mini-20250131` model as the judge. Note that this model is the default in NeMo-Skills, and we have just added this argument for illustration purposes. To evaluate for the [Artificial Analysis Index (AAI) setting, please use the gpt-4o-20240806 model as the judge](https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-overview){target="_blank"}. +For HLE, because symbolic comparison is not sufficient to determine the correctness of the output, we use the recommended `o3-mini-20250131` model as the judge. Note that this model is the default in NeMo-Skills, and we have just added this argument for illustration purposes. To evaluate for the [Artificial Analysis Index (AAI) setting, please use the gpt-4o-20240806 model as the judge](https://artificialanalysis.ai/methodology/intelligence-benchmarking#intelligence-index-evaluation-suite-overview){target="_blank"}. 
+ +Note that using any of the OpenAI hosted models requires `OPENAI_API_KEY`. Alternatively, a self-hosted judge model can also be used for judgement. For example, `--judge_model="/workspace/Llama-3_3-Nemotron-Super-49B-v1_5"` in tandem with `--judge_server_type="vllm" --judge_server_gpus 2` will use the `Llama-3_3-Nemotron-Super-49B-v1_5` itself as a judge. -```bash hl_lines="9-10" + +```bash hl_lines="8-9" ns eval \ --cluster=local \ --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \ --server_type=vllm \ --output_dir=/workspace/llama_nemotron_49b_1_5/ \ --benchmarks=hle:16 \ - --server_gpus=8 \ - --num_jobs=1 \ + --server_gpus=2 \ --judge_model="o3-mini-20250131" \ --extra_judge_args="++inference.tokens_to_generate=4096 ++max_concurrent_requests=8" \ ++inference.tokens_to_generate=65536 \ @@ -142,14 +145,13 @@ ns eval \ #### Command for BFCL Eval (Reasoning on) Tool-calling benchmarks require tool-call parsing and execution. NeMo-Skills supports both client-side parsing (default) and server-side parsing. For server-side parsing, the vLLM server requires the parsing details as highlighted in the below command: -```bash hl_lines="13-17" +```bash hl_lines="12-16" ns eval \ --cluster=local \ --benchmarks=bfcl_v3 \ --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \ - --server_gpus=8 \ + --server_gpus=2 \ --server_type=vllm \ - --num_jobs=1 \ --output_dir=/workspace/llama_nemotron_49b_1_5_tool_calling/ \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.6 \ @@ -165,10 +167,11 @@ ns eval \ ### Reasoning-on Results -We use the `summarize_results` pipeline to calculate the evaluation metrics, for all but BFCL where the metrics are calculated as part of the evaluation job itself. -The following results were obtained by running the command: - +The eval jobs also launch a dependent job to perform metrics calculation and store the result in a file called `metrics.json`. 
+In our running example, for a benchmark such as aime25, the `metrics.json` would be located at `/workspace/llama_nemotron_49b_1_5/eval-results/aime25/metrics.json`. +This metrics calculation is typically done by the `summarize_results` pipeline, except in the case of BFCL, where the metrics are computed by a dedicated script because BFCL combines subtask accuracies in its own way to obtain the overall accuracy. +To print the results for these benchmarks (except for BFCL), we could rerun the `summarize_results` script manually as follows: ```bash ns summarize_results --cluster=local /workspace/llama_nemotron_49b_1_5/eval-results/{BENCHMARK} ``` @@ -191,9 +194,9 @@ pass@16 | 12032 | 4879 | 12516 | 91.32% | -------------------------------------------------- hle -------------------------------------------------- evaluation_mode | num_entries | avg_tokens | gen_seconds | judge_correct | symbolic_correct | no_answer -pass@1[avg-of-15] | 2158 | 12111 | 7782 | 7.75% | 2.40% | 64.13% -majority@15 | 2158 | 12111 | 7782 | 4.31% | 3.43% | 49.91% -pass@15 | 2158 | 12111 | 7782 | 27.80% | 10.10% | 49.91% +pass@1[avg-of-16] | 2158 | 12111 | 7782 | 7.75% | 2.40% | 64.13% +majority@16 | 2158 | 12111 | 7782 | 4.31% | 3.43% | 49.91% +pass@16 | 2158 | 12111 | 7782 | 27.80% | 10.10% | 49.91% ``` !!!note @@ -276,15 +279,14 @@ For the non-reasoning mode evals, we follow the recommended recipe of setting: The following command evaluates the model on GPQA, MMLU-Pro, Scicode, MATH-500, AIME24, and AIME25 across 16 different runs for all benchmarks.
We have highlighted the inference settings recommended above in the following command: -```bash hl_lines="10-13" +```bash hl_lines="9-12" ns eval \ --cluster=local \ --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5 \ --server_type=vllm \ --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off/ \ --benchmarks=gpqa:16,mmlu-pro:16,scicode:16,math-500:16,aime24:16,aime25:16 \ - --server_gpus=8 \ - --num_jobs=1 \ + --server_gpus=2 \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.0 \ ++inference.top_p=1.0 \ @@ -301,8 +303,7 @@ ns eval \ --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off/ \ --benchmarks=livecodebench:16 \ --split=test_v5_2410_2502 \ - --server_gpus=8 \ - --num_jobs=1 \ + --server_gpus=2 \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.0 \ ++inference.top_p=1.0 \ @@ -319,8 +320,7 @@ ns eval \ --server_type=vllm \ --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off/ \ --benchmarks=hle:16 \ - --server_gpus=8 \ - --num_jobs=1 \ + --server_gpus=2 \ --judge_model="o3-mini-20250131" \ --extra_judge_args="++inference.tokens_to_generate=4096 ++max_concurrent_requests=8" \ ++inference.tokens_to_generate=65536 \ @@ -336,9 +336,8 @@ ns eval \ --cluster=local \ --benchmarks=bfcl_v3 \ --model=/workspace/Llama-3_3-Nemotron-Super-49B-v1_5/ \ - --server_gpus=8 \ + --server_gpus=2 \ --server_type=vllm \ - --num_jobs=1 \ --output_dir=/workspace/llama_nemotron_49b_1_5_reasoning_off_tool_calling/ \ ++inference.tokens_to_generate=65536 \ ++inference.temperature=0.0 \ From f3b862bf1af18496914d9d75a29a8623f0d074a6 Mon Sep 17 00:00:00 2001 From: Shubham Toshniwal Date: Fri, 15 Aug 2025 18:58:01 -0400 Subject: [PATCH 20/20] Added news Signed-off-by: Shubham Toshniwal --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 41c1105867..45d3b9c3f5 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ Here are some of the features we support: - [Model 
training](https://nvidia.github.io/NeMo-Skills/pipelines/training): Train models using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner/), [NeMo-RL](https://github.com/NVIDIA/NeMo-RL/) or [verl](https://github.com/volcengine/verl). ## News - +* [08/15/2025]: Added details for [reproducing evals](docs/tutorials/posts/llama-nemotron-super-v1.5-evals.md) for the [Llama-3_3-Nemotron-Super-49B-v1_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5) model by NVIDIA. * [07/30/2025]: The datasets used to train OpenReasoning models are released! Math and code are available as part of [Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1) and science is available in [OpenScienceReasoning-2](https://huggingface.co/datasets/nvidia/OpenScienceReasoning-2). See our [documentation](https://nvidia.github.io/NeMo-Skills/releases/openreasoning/training) for more details.