Conversation
📝 Walkthrough

This pull request introduces a comprehensive RULER2 benchmark dataset preparation and evaluation pipeline. The changes add a new dataset module with multiple data preparation scripts (NIAH, MMLU, QA), tokenizer abstractions supporting multiple frameworks (NeMo, HuggingFace, OpenAI, Gemini), an orchestration script that handles dataset generation in parallel, evaluation infrastructure for RULER2 benchmarks, and scoring logic. The pipeline is integrated into the evaluation and metrics systems.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as Command Line
    participant Orchestrator as prepare_dataset<br/>(ruler2/prepare.py)
    participant PackageMgr as Pip
    participant TaskSetup as prepare_task_for_ns
    participant Executor as ThreadPoolExecutor
    participant TaskFunc as Task Functions<br/>(prepare_*_*)
    participant Subprocess as Subprocess Runner
    participant Tokenizer as Tokenizer
    participant FileSystem as File System
    participant EvalInit as Top-level __init__.py
    CLI->>Orchestrator: prepare_dataset(tasks, setup, ...)
    Orchestrator->>PackageMgr: pip install (wonderwords,<br/>html2text, tenacity)
    PackageMgr-->>Orchestrator: ✓
    loop For each task
        Orchestrator->>TaskSetup: prepare_task_for_ns(folder, task)
        TaskSetup->>FileSystem: Write per-task __init__.py<br/>(metrics, eval_args)
        FileSystem-->>TaskSetup: ✓
        TaskSetup-->>Orchestrator: ✓
    end
    Orchestrator->>Executor: submit() each task function
    par Task Execution (Parallel)
        TaskFunc->>Subprocess: Run data prep script<br/>with args (tokenizer,<br/>max_seq_length, etc.)
        Subprocess->>Tokenizer: select_tokenizer(type, path)
        Tokenizer-->>Subprocess: tokenizer instance
        Subprocess->>Tokenizer: text_to_tokens() for<br/>length estimation & binary search
        Tokenizer-->>Subprocess: token count
        Subprocess->>FileSystem: Write test.jsonl<br/>(samples)
        FileSystem-->>Subprocess: ✓
        Subprocess-->>TaskFunc: exit code
    end
    Executor-->>Orchestrator: results (surface errors)
    Orchestrator->>EvalInit: Write top-level __init__.py<br/>(score_module, BENCHMARKS)
    EvalInit->>FileSystem: Persist metadata
    FileSystem-->>EvalInit: ✓
    EvalInit-->>Orchestrator: ✓
    Orchestrator-->>CLI: Print completion message
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ 1 warning, ✅ 2 passed.
Actionable comments posted: 15
🧹 Nitpick comments (4)
nemo_skills/dataset/ruler2/prepare_niah.py (1)
40-65: Avoid import-time side effects (argparse + nltk.download).
Having `parser.parse_args()` and `nltk.download()` at module import time makes `python -m ...` fine, but breaks any library-style import and can hang CI/offline runs. Move downloads and arg parsing into `main()` (or guard with `if __name__ == "__main__":`).

Also applies to: 29-35
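A minimal sketch of that structure (argument names are hypothetical, and the `nltk.download` call is shown only as a comment since the real script's download list isn't reproduced here):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_samples", type=int, default=10)
    return parser


def main(argv=None) -> None:
    # Parsing (and any nltk.download calls) happen only here, so importing
    # the module has no side effects and cannot hang CI or offline runs.
    args = build_parser().parse_args(argv)
    # nltk.download("punkt")  # hypothetical: fetch data lazily, not at import time
    print(f"preparing {args.num_samples} samples")


if __name__ == "__main__":
    main()
```

With this layout, `python -m ...` still works unchanged, while `import prepare_niah` from a test or library stays side-effect free.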
nemo_skills/dataset/ruler2/prepare_mmlu.py (1)
59-120: Prompt branching looks consistent; consider consolidating string-building for maintainability.
A lot of `PROBLEM_PROMPT` variants differ by small deltas; a dict/table-driven approach would reduce branching.

nemo_skills/dataset/ruler2/prepare_qa.py (1)
116-130: Remove no-op `f` prefixes; add `strict=` to `zip()` for safer assumptions.

```diff
- data = load_dataset(f"hotpotqa/hotpot_qa", "distractor")["train"]
+ data = load_dataset("hotpotqa/hotpot_qa", "distractor")["train"]
...
- data = load_dataset(f"hotpotqa/hotpot_qa", "distractor")["validation"]
+ data = load_dataset("hotpotqa/hotpot_qa", "distractor")["validation"]

- haystack = [f"{t}\n{''.join(s)}" for d in data for t, s in zip(d['context']['title'], d['context']['sentences'])]
+ haystack = [f"{t}\n{''.join(s)}" for d in data for t, s in zip(d["context"]["title"], d["context"]["sentences"], strict=False)]
```

Also applies to: 124-130
nemo_skills/dataset/ruler2/prepare.py (1)
299-325: Consider allowing an explicit output base dir instead of writing under the source tree.
`output_folder = Path(__file__).parent / setup` is awkward for installed packages (site-packages is often read-only).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- nemo_skills/dataset/ruler/prepare.py (1 hunks)
- nemo_skills/dataset/ruler2/__init__.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare_mmlu.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare_niah.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare_qa.py (1 hunks)
- nemo_skills/dataset/ruler2/ruler2_score.py (1 hunks)
- nemo_skills/dataset/ruler2/tokenizer.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/ruler.py (2 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/ruler.py (2)
- `eval_ruler` (35-84)
- `eval_ruler2` (87-176)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/ruler_metrics.py (1)
RulerMetrics(18-29)
nemo_skills/dataset/ruler2/prepare_mmlu.py (1)
nemo_skills/dataset/ruler2/tokenizer.py (6)
- `select_tokenizer` (26-41)
- `text_to_tokens` (52-54), (69-71), (86-88), (103-105), (122-124)
nemo_skills/dataset/ruler2/prepare.py (3)
nemo_skills/pipeline/utils/declarative.py (1)
- `run` (346-483)

nemo_skills/dataset/ruler/prepare.py (1)
- `prepare_task` (114-122)

nemo_skills/pipeline/utils/cluster.py (1)
- `submit` (463-466)
🪛 Ruff (0.14.8)
nemo_skills/evaluation/evaluator/ruler.py
108-108: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
124-126: Prefer next(...) over single element slice
Replace with next(...)
(RUF015)
125-125: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
134-136: Prefer next(...) over single element slice
Replace with next(...)
(RUF015)
135-135: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
146-146: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/ruler2/prepare_mmlu.py
153-153: Loop control variable index not used within loop body
Rename unused index to _index
(B007)
228-228: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
267-267: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
273-273: Loop control variable i not used within loop body
(B007)
392-392: Do not use bare except
(E722)
nemo_skills/dataset/ruler2/ruler2_score.py
34-34: Prefer next(iter(metrics.keys())) over single element slice
Replace with next(iter(metrics.keys()))
(RUF015)
nemo_skills/dataset/ruler2/prepare_qa.py
117-117: f-string without any placeholders
Remove extraneous f prefix
(F541)
118-118: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
124-124: f-string without any placeholders
Remove extraneous f prefix
(F541)
128-128: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
129-129: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
169-169: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
317-317: Do not use bare except
(E722)
nemo_skills/dataset/ruler2/tokenizer.py
33-33: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/dataset/ruler2/prepare_niah.py
88-88: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
91-91: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
97-97: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
121-121: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
131-133: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
164-164: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
219-219: Unpacked variable save_dict is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
244-244: Do not use bare except
(E722)
nemo_skills/dataset/ruler2/prepare.py
33-33: subprocess call with shell=True identified, security issue
(S602)
53-53: subprocess call with shell=True identified, security issue
(S602)
72-72: subprocess call with shell=True identified, security issue
(S602)
91-91: subprocess call with shell=True identified, security issue
(S602)
110-110: subprocess call with shell=True identified, security issue
(S602)
130-130: subprocess call with shell=True identified, security issue
(S602)
149-149: subprocess call with shell=True identified, security issue
(S602)
168-168: subprocess call with shell=True identified, security issue
(S602)
188-188: subprocess call with shell=True identified, security issue
(S602)
206-206: subprocess call with shell=True identified, security issue
(S602)
225-225: subprocess call with shell=True identified, security issue
(S602)
243-243: subprocess call with shell=True identified, security issue
(S602)
302-302: subprocess call with shell=True identified, security issue
(S602)
302-302: Starting a process with a partial executable path
(S607)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/ruler/prepare.py (1)
212-212: EOF-only change looks fine.

nemo_skills/dataset/ruler2/__init__.py (1)

1-15: `DATASET_GROUP = "long-context"` is clear and consistent with the intended grouping.

nemo_skills/evaluation/evaluator/__init__.py (1)

41-53: Evaluator registration for `ruler2` is straightforward and consistent with existing patterns.

nemo_skills/evaluation/metrics/map_metrics.py (1)

58-60: `"ruler2": RulerMetrics` alias is correctly implemented. Both `eval_ruler` and `eval_ruler2` emit `prediction["is_correct"]` in identical format (float values from match_type_funcs), which `RulerMetrics` handles properly in its `_get_score_dict` method.
```python
# Define Needle/Haystack Format
needle = "One of the special magic {type_needle_v} for {key} is: {value}."
if args.type_haystack == 'needle':
    haystack = needle
else:
    raise NotImplementedError(f'{args.type_haystack} is not implemented.')
```
Critical: `--type_haystack` handling is internally inconsistent and currently broken for essay/noise.

- Lines 71-75: only `'needle'` is allowed, but the argparse help advertises `noise, essay, needle`.
- In `generate_input_output()` the essay/noise/needle branches exist, but `haystack` is a string in the only supported mode (`haystack = needle`), so essay mode would iterate characters (`" ".join(haystack[:num_haystack])`).

Either (a) implement essay/noise properly (with a list of sentences/docs), or (b) remove those modes from the help, delete the dead branches, and enforce `choices=["needle"]`.
```diff
-parser.add_argument("--type_haystack", type=str, default='essay', help='[Options] noise, essay, needle.')
+parser.add_argument("--type_haystack", type=str, default="needle", choices=["needle"])
```

```diff
-needle = "One of the special magic {type_needle_v} for {key} is: {value}."
-if args.type_haystack == 'needle':
-    haystack = needle
-else:
-    raise NotImplementedError(f'{args.type_haystack} is not implemented.')
+needle_sentence = "One of the special magic {type_needle_v} for {key} is: {value}."
```

```diff
-        needles.append(needle.format(
+        needles.append(needle_sentence.format(
             type_needle_v=args.type_needle_v,
             key=keys[-1],
             value=value[-1],
         ))
```

```diff
-    if args.type_haystack == 'essay':
-        ...
-    else:
-        if args.type_haystack == 'noise':
-            ...
-        elif args.type_haystack == 'needle':
+    if args.type_haystack == "needle":
             ...
-    context = "\n".join(sentences)
+    context = "\n".join(sentences)
+    else:
+        raise NotImplementedError(f"{args.type_haystack} is not implemented.")
```

Also applies to: 123-167
🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/prepare_niah.py around lines 69 to 75 (and also
review lines 123-167), the --type_haystack handling is inconsistent: only
'needle' is currently supported but argparse documents 'noise, essay, needle',
and when other modes are used haystack is a string so downstream code will
iterate characters. Fix by either (A) implementing proper haystack as a list for
'essay' and 'noise' modes (produce a list of full sentences/documents and ensure
generate_input_output() uses list semantics, e.g., build haystack = [sent1,
sent2, ...] and adjust joins/sampling accordingly), or (B) remove 'essay' and
'noise' from argparse help and enforce choices=['needle'] and delete dead
branches in generate_input_output(); pick one approach and make corresponding
unit/test updates and update help text/comments.
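To make the character-iteration failure concrete, here is a small standalone demo (the strings are hypothetical, not the PR's data):

```python
# When the haystack is a single string, slicing selects CHARACTERS:
haystack_str = "Sentence one. Sentence two."
picked = " ".join(haystack_str[:5])        # joins the first 5 characters
print(picked)                              # S e n t e

# With a list of sentences, the same slice behaves as intended:
haystack_list = ["Sentence one.", "Sentence two.", "Sentence three."]
picked_list = " ".join(haystack_list[:2])
print(picked_list)                         # Sentence one. Sentence two.
```

This is why option (a) in the comment above requires switching `haystack` to a list of sentences/documents before the essay/noise modes can work.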
```python
def compute_score(metrics: dict):
    # just an average of all metrics. Here we assume that all tasks are present.
    # if that's not the case, users shouldn't run as a group
    tasks = [
        "mk_niah_basic",
        "mk_niah_easy",
        "mk_niah_medium",
        "mk_niah_hard",
        "mv_niah_basic",
        "mv_niah_easy",
        "mv_niah_medium",
        "mv_niah_hard",
        "qa_basic",
        "qa_easy",
        "qa_medium",
        "qa_hard",
    ]
    setup = list(metrics.keys())[0].rsplit(".", 1)[0]
    metrics[setup] = {}

    for aggregation in metrics[f"{setup}.mk_niah_basic"]:
        metrics[setup][aggregation] = {
            "accuracy": sum(metrics[f"{setup}.{task}"][aggregation].get("accuracy", (metrics[f"{setup}.{task}"][aggregation].get("symbolic_correct", 0))) for task in tasks) / len(tasks)
        }

    return metrics
```
Make `compute_score()` resilient to key order, collisions, and partial runs. Right now it (1) derives `setup` from the "first" key, (2) writes into `metrics[setup]` (which can collide with an existing key), and (3) will throw `KeyError` if any task/aggregation is missing.
Suggested patch (keeps current behavior but fails fast with clearer errors + removes ordering footgun):
```diff
 def compute_score(metrics: dict):
 @@
-    setup = list(metrics.keys())[0].rsplit(".", 1)[0]
-    metrics[setup] = {}
-
-    for aggregation in metrics[f"{setup}.mk_niah_basic"]:
-        metrics[setup][aggregation] = {
-            "accuracy": sum(metrics[f"{setup}.{task}"][aggregation].get("accuracy", (metrics[f"{setup}.{task}"][aggregation].get("symbolic_correct", 0))) for task in tasks) / len(tasks)
-        }
+    first_key = next(iter(metrics))
+    setup = first_key.rsplit(".", 1)[0]
+
+    agg_key = f"{setup}.mk_niah_basic"
+    if agg_key not in metrics:
+        raise KeyError(f"Missing aggregation source key: {agg_key}")
+
+    out_key = setup
+    if out_key in metrics and not isinstance(metrics[out_key], dict):
+        raise TypeError(f"Refusing to overwrite non-dict metrics[{out_key!r}]")
+    metrics.setdefault(out_key, {})
+
+    for aggregation in metrics[agg_key]:
+        total = 0.0
+        for task in tasks:
+            task_key = f"{setup}.{task}"
+            if task_key not in metrics or aggregation not in metrics[task_key]:
+                raise KeyError(f"Missing metrics for {task_key}.{aggregation}")
+            cell = metrics[task_key][aggregation]
+            total += cell.get("accuracy", cell.get("symbolic_correct", 0))
+        metrics[out_key][aggregation] = {"accuracy": total / len(tasks)}
     return metrics
```

🧰 Tools
🪛 Ruff (0.14.8)
34-34: Prefer next(iter(metrics.keys())) over single element slice
Replace with next(iter(metrics.keys()))
(RUF015)
🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/ruler2_score.py around lines 16–42, the function
derives setup from the first dict key, blindly writes into metrics[setup]
(risking collision), and assumes every task/aggregation exists which causes
KeyError; fix by (1) determine the unique setup by scanning all metric keys and
validating exactly one setup is present (split on last '.'), (2) refuse to
overwrite an existing metrics key by raising a clear ValueError if metrics
already contains the chosen summary key, (3) enumerate available aggregations
for the setup by inspecting one task entry or collecting aggregation names
across tasks, and (4) when computing per-aggregation averages either (a) fail
fast with clear errors if expected tasks/aggregations are missing or (b) compute
averages only over tasks present and document behavior; ensure error messages
include the missing task/aggregation and the derived setup so callers can debug.
```python
class HFTokenizer:
    """
    Tokenizer from HF models
    """
    def __init__(self, model_path) -> None:
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    def text_to_tokens(self, text: str) -> List[str]:
        tokens = self.tokenizer.tokenize(text)
        return tokens

    def tokens_to_text(self, tokens: List[int]) -> str:
        text = self.tokenizer.convert_tokens_to_string(tokens)
        return text
```
🧩 Analysis chain
🌐 Web query:
HuggingFace Transformers trust_remote_code=True security RCE implications recommended defaults
💡 Result:
Short answer
- trust_remote_code=True lets Transformers execute arbitrary Python code from a model repo (modeling/config/loading code). That can lead to arbitrary code execution / RCE if the remote repo is malicious or compromised — it is effectively running untrusted code in your process. [1][2][3].
- Recommended default: keep trust_remote_code=False unless you explicitly trust the model repository, and follow strict mitigations when you must enable it (pin revision, vet files, run in isolated environment, least privilege). [1]
Why it’s risky (brief)
- Enabling it loads and runs Python modules from the model repo (not just weights), so malicious code in those files runs with your process privileges. This class of risk has led to multiple Transformers/supply‑chain vulnerabilities and CVEs (deserialization and config/model‑code RCEs). [2][3][4]
Practical recommendations (minimal, high‑value)
- Default to trust_remote_code=False. Only set True for repos you’ve audited or that are from a fully trusted party. [1]
- Pin a commit/revision (don’t allow mutable “latest” branches) when you must trust remote code. [1]
- Inspect the repo’s Python files before enabling (modeling_*.py, config, init, setup). Look for imports, subprocess/os usage, deserialization (pickle), or dynamic eval/exec. [1][2]
- Run any code that requires trust_remote_code=True in an isolated sandbox: container, VM, or dedicated build host with network/filesystem restrictions and unprivileged user. Prefer ephemeral environments.
- Minimize privileges: avoid running as root; restrict network and filesystem access; use seccomp/AppArmor/SELinux or container runtime restrictions when possible.
- Prefer safe weight formats (safetensors) and avoid loading untrusted pickled checkpoints. Use use_safetensors where supported. [1][2]
- Keep transformers and related packages up to date; subscribe to security advisories and patch quickly. [2][3][4]
- For teams: enforce review policy for model imports, provide a curated internal model registry, and consider automated static checks on remote model code before allowing trust.
References
- Transformers security notes / documentation (trust_remote_code guidance). [1]
- CVE examples: deserialization / RCE issues in Transformers (CVE-2024-3568, CVE-2024-11392). [2][3]
- Supply‑chain / sandbox escape example (smolagents sandbox escape). [4]
If you want, I can: 1) produce a short checklist/script to vet a model repo’s Python files, or 2) give an example Docker seccomp profile and minimal container command to run a trusted remote model safely.
🏁 Script executed:
```shell
# First, understand the file structure and context
git ls-files | grep -E "(tokenizer|ruler2)" | head -20
```
Repository: NVIDIA-NeMo/Skills
Length of output: 348

🏁 Script executed:
```shell
# Check how HFTokenizer is instantiated - is model_path user-controlled?
rg "HFTokenizer" -A 3 -B 3
```
Repository: NVIDIA-NeMo/Skills
Length of output: 1023

🏁 Script executed:
```shell
# Look for other tokenizer implementations for comparison
fd "tokenizer" -t f | head -20
```
Repository: NVIDIA-NeMo/Skills
Length of output: 101

🏁 Script executed:
```shell
# Check the factory function and how tokenizer_type is determined
rg "get_tokenizer|tokenizer_type" -A 5 -B 2 | head -50
```
Repository: NVIDIA-NeMo/Skills
Length of output: 4251

🏁 Script executed:
```shell
# Look at the prepare scripts to see what model_path values are used
rg "model_path|tokenizer" nemo_skills/dataset/ruler2/prepare.py -A 2 -B 2 | head -100
```
Repository: NVIDIA-NeMo/Skills
Length of output: 3675

🏁 Script executed:
```shell
# Check if there's any CLI argument or environment variable that controls trust_remote_code
rg "trust_remote_code|argparse|cli"
```
Repository: NVIDIA-NeMo/Skills
Length of output: 50375
trust_remote_code=True creates an avoidable RCE risk; change the default to False.
HuggingFace's own security guidance recommends keeping trust_remote_code=False as the default and enabling it only for explicitly trusted model repositories. Enabling remote code execution allows arbitrary Python code from the model repository to run during tokenizer initialization—a known attack vector with multiple CVEs.
Since model_path is user-controlled (passed via CLI argument), users can inadvertently point to malicious or compromised model repositories without explicit intent to do so.
```diff
- self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)
```

🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/tokenizer.py around lines 78 to 93, the tokenizer
is initialized with trust_remote_code=True which introduces an unnecessary RCE
risk when model_path is user-controlled; change the default to
trust_remote_code=False and only enable True when an explicit, documented flag
or parameter (e.g., trust_remote_code or trust_remote_code_env) is passed by the
caller, update the __init__ signature to accept that flag (default False), and
ensure any code paths calling HFTokenizer are updated to pass True explicitly
when the model source is trusted.
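One possible shape for that opt-in flag, sketched under the assumption that callers can thread a CLI option through to the tokenizer (illustrative only, not the PR's code):

```python
from typing import List


class HFTokenizer:
    """Sketch of an opt-in trust_remote_code flag.

    The flag defaults to False per HuggingFace security guidance; callers
    must opt in explicitly for audited model repositories.
    """

    def __init__(self, model_path: str, trust_remote_code: bool = False) -> None:
        # Deferred import: merely defining/importing this class does not
        # require transformers to be installed.
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=trust_remote_code
        )

    def text_to_tokens(self, text: str) -> List[str]:
        return self.tokenizer.tokenize(text)
```

A factory like `select_tokenizer` would then pass the flag through from an explicit `--trust_remote_code` argument rather than hard-coding `True`.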
```python
class GeminiTokenizer:
    """
    Tokenizer from gemini
    """
    def __init__(self, model_path="gemini-1.5-pro-latest") -> None:
        import google.generativeai as genai
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.model = genai.GenerativeModel(model_path)

    @retry(wait=wait_fixed(60) + wait_random(0, 10), stop=stop_after_attempt(3))
    def text_to_tokens(self, text: str) -> List[int]:
        tokens = list(range(self.model.count_tokens(text).total_tokens))
        return tokens

    def tokens_to_text(self, tokens: List[int]) -> str:
        pass
```
Make `GeminiTokenizer` fail fast and avoid allocating huge lists; don't leave `pass`.

- A `KeyError` on a missing `GEMINI_API_KEY` is opaque.
- `list(range(total_tokens))` is a large allocation for long contexts; `range(total_tokens)` still supports `len()`.
- `tokens_to_text()` should raise `NotImplementedError`.
```diff
-    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
+    api_key = os.environ.get("GEMINI_API_KEY")
+    if not api_key:
+        raise RuntimeError("GEMINI_API_KEY is not set")
+    genai.configure(api_key=api_key)
 ...
-    tokens = list(range(self.model.count_tokens(text).total_tokens))
-    return tokens
+    return range(self.model.count_tokens(text).total_tokens)
 ...
     def tokens_to_text(self, tokens: List[int]) -> str:
-        pass
+        raise NotImplementedError("GeminiTokenizer.tokens_to_text is not supported")
```

🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/tokenizer.py around lines 112-127, the initializer
and token methods need hardening and a real implementation for tokens_to_text:
check for GEMINI_API_KEY explicitly and raise a clear RuntimeError (or
ValueError) if missing instead of allowing a KeyError, avoid creating huge lists
by returning a range(total_tokens) (or another lazy/iterable) from
text_to_tokens rather than list(range(...)), and replace the tokens_to_text pass
with raising NotImplementedError to signal it's intentionally unimplemented.
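The `range` suggestion works because `range` objects are lazy sequences; a quick standalone check:

```python
# range(n) supports len() and indexing in O(1) memory, so it can stand in
# as a token-count placeholder without materializing n integers the way
# list(range(n)) would (which for 10**12 elements would need terabytes).
fake_tokens = range(10**12)
print(len(fake_tokens))   # 1000000000000
print(fake_tokens[5])     # 5
```

Callers that only ever do `len(tokens)` on the result see no behavioral difference.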
Kipok left a comment:

I haven't reviewed the underlying logic as I don't know the details of ruler2. @fayejf could you have a look? Or @hsiehjackson, if you think it's ok to merge without Fei's review, we can do that as well; let me know.

Please add the benchmark to docs, and also add an example command with some reference results for any model.
```python
def select_tokenizer(tokenizer_type, tokenizer_path):
    if tokenizer_type == 'nemo':
```
do we really need this? I think all recent nemotrons use normal hf tokenizers
Greptile Summary

This PR adds the RULERv2 benchmark for evaluating long-context language models. It introduces a comprehensive framework with three dataset types (NIAH, MMLU, QA) across four difficulty levels each (basic, easy, medium, hard), totaling 12 benchmark tasks. The implementation includes:
Key Changes:
Issues Addressed in Previous Review Threads

Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant Tokenizer as tokenizer.py
    participant PrepareNIAH as prepare_niah.py
    participant PrepareMMLU as prepare_mmlu.py
    participant PrepareQA as prepare_qa.py
    participant Evaluator as ruler.py
    participant Scorer as ruler2_score.py
    User->>PrepareScript: ns prepare_data ruler2 --setup --tokenizer_path --max_seq_length
    PrepareScript->>Tokenizer: select_tokenizer(type, path)
    Tokenizer-->>PrepareScript: tokenizer instance
    par Parallel Dataset Generation
        PrepareScript->>PrepareNIAH: subprocess call for mk_niah_basic/mv_niah_basic
        PrepareNIAH->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareNIAH: token list
        PrepareNIAH-->>PrepareScript: NIAH dataset generated
    and
        PrepareScript->>PrepareMMLU: subprocess call for mk_niah_easy/medium/hard
        PrepareMMLU->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareMMLU: token list
        PrepareMMLU-->>PrepareScript: MMLU dataset generated
    and
        PrepareScript->>PrepareQA: subprocess call for qa_basic/easy/medium/hard
        PrepareQA->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareQA: token list
        PrepareQA-->>PrepareScript: QA dataset generated
    end
    PrepareScript-->>User: Dataset preparation completed
    User->>Evaluator: ns eval --benchmarks=ruler2.setup --model
    loop For each sample
        Evaluator->>Evaluator: default_parse(prediction)
        Evaluator->>Evaluator: string_match_*/wer scoring
        Evaluator->>Evaluator: compute is_correct
    end
    Evaluator-->>User: Evaluation results per task
    User->>Scorer: compute_score(metrics)
    Scorer->>Scorer: aggregate all 12 task scores
    Scorer-->>User: Overall RULER2 score
```
Additional Comments (1)
```python
output_folder = Path(__file__).parent / setup

# 1. installing necessary packages
subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
```
The use of `shell=True` with `subprocess.run` is dangerous because it can lead to command injection vulnerabilities. The package names in the command are hardcoded here, but `shell=True` should generally be avoided. Use a list of arguments instead:

```diff
- subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+ subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)
```
```python
subprocess.run(
    f"python -m nemo_skills.dataset.ruler2.prepare_niah "
    f"--output_folder {output_folder} "
    f"--tokenizer_type {tokenizer_type} "
    f"--tokenizer_path {tokenizer_path} "
    f"--max_seq_length {length} "
    f"--num_samples {dataset_size} "
    f"--random_seed 42 "
    f"--num_needle_k 1 "
    f"--num_needle_v 1 "
    f"--num_needle_q 1 "
    f"--type_haystack needle "
    f"--type_needle_k words "
    f"--type_needle_v numbers "
    f"--num_digits_v 10",
    shell=True,
    check=True,
)
```
Command injection vulnerability: all the `subprocess.run` calls in this file use `shell=True` with f-strings that include user-provided arguments like `output_folder`, `tokenizer_path`, etc. If these contain shell metacharacters (e.g., semicolons, backticks), arbitrary commands could be executed.

The proper fix is to use `shell=False` and pass arguments as a list. For example:
```python
def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
    subprocess.run(
        [
            "python", "-m", "nemo_skills.dataset.ruler2.prepare_niah",
            "--output_folder", output_folder,
            "--tokenizer_type", tokenizer_type,
            "--tokenizer_path", tokenizer_path,
            "--max_seq_length", str(length),
            "--num_samples", str(dataset_size),
            "--random_seed", "42",
            "--num_needle_k", "1",
            "--num_needle_v", "1",
            "--num_needle_q", "1",
            "--type_haystack", "needle",
            "--type_needle_k", "words",
            "--type_needle_v", "numbers",
            "--num_digits_v", "10",
        ],
        check=True,
    )
```
This pattern should be applied to ALL subprocess.run calls in this file (all 12 prepare functions).
```diff
@@ -0,0 +1,86 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```

The copyright year is 2024 but should be 2025 to match the other new files in this PR:

```diff
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
```python
    def tokens_to_text(self, tokens: List[int]) -> str:
        pass
```

The `tokens_to_text` method in `GeminiTokenizer` is not implemented: it only contains `pass`, which means it returns `None`. If this method is called, it will cause errors downstream. Either implement this method properly or raise `NotImplementedError`:

```diff
     def tokens_to_text(self, tokens: List[int]) -> str:
-        pass
+        raise NotImplementedError("GeminiTokenizer does not support tokens_to_text conversion")
```
```python
    """
    def __init__(self, model_path) -> None:
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```

Using `trust_remote_code=True` allows arbitrary code execution from the model files, which is a security risk. Only use this if you fully trust the model source and have verified its contents. Consider adding a warning or making this configurable:

```diff
-        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        # WARNING: trust_remote_code=True allows arbitrary code execution from model files
+        # Only use with trusted model sources
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```
| )) | ||
| values.append(value) | ||
| random.Random(args.random_seed).shuffle(needles) |
creating a new Random instance with a fixed seed (random.Random(args.random_seed)) on every call to generate_input_output will cause the needles to be shuffled in the exact same order for every sample. This defeats the purpose of randomization and makes all samples have the same needle ordering pattern.
The fix is to use the global random that was already seeded at the module level:
| random.Random(args.random_seed).shuffle(needles) | |
| random.shuffle(needles) |
This ensures each call gets a different shuffle while still maintaining reproducibility through the global seed.
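A small self-contained sketch of the difference being described; `shuffled` and the explicit `rng` argument are illustrative (the PR uses the module-level `random`):

```python
import random

def shuffled(items, rng):
    out = list(items)
    rng.shuffle(out)
    return out

needles = list(range(10))

# A fresh Random(seed) restarts the stream, so every call
# produces the identical ordering.
a = shuffled(needles, random.Random(42))
b = shuffled(needles, random.Random(42))

# One seeded generator advances its stream between calls,
# staying reproducible across runs while varying per sample.
rng = random.Random(42)
c = shuffled(needles, rng)  # same as the first draw above
d = shuffled(needles, rng)  # next draw from the stream
```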
| }) | ||
| questions.add(d['problem']) | ||
| from math_verify import parse |
the import of math_verify.parse is inside a conditional block that only executes when args.dataset == "math500". This means:
- If the import fails, you won't know until runtime when processing math500 data
- The dependency on math_verify might not be obvious to users
Consider moving this import to the top of the file with a try/except to give a clear error message:
| from math_verify import parse | |
| try: | |
| from math_verify import parse | |
| except ImportError: | |
| parse = None # Will be checked when needed |
Then add a check where it's used:
if parse is None:
    raise ImportError("math_verify is required for the math500 dataset. Install with: pip install math-verify")
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
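A sketch of the optional-import pattern suggested above; `load_optional` and `require` are hypothetical helper names, not part of the PR:

```python
import importlib

def load_optional(name: str):
    """Import an optional dependency, returning None if missing
    so the failure can be reported clearly at the point of use."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

def require(module, feature: str, hint: str):
    """Raise an actionable error for a missing optional dependency."""
    if module is None:
        raise ImportError(f"{feature} requires an extra package. Install with: {hint}")
    return module

# math_verify is only needed when processing the math500 dataset
math_verify = load_optional("math_verify")
```

At the use site: `parse = require(math_verify, "math500 dataset", "pip install math-verify").parse`.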
| if words != 0: | ||
| wer = 1.0 * scores / words | ||
| else: | ||
| wer = float('inf') | ||
| return wer |
when the reference text is empty (words == 0), the WER function returns float('inf'). This causes issues in the scoring functions (lines 125, 135, 146) where 1 - wer(...) would become -inf, and max(1.0, -inf) would always be 1.0, potentially giving a perfect score for empty references.
Consider handling this edge case explicitly:
| if words != 0: | |
| wer = 1.0 * scores / words | |
| else: | |
| wer = float('inf') | |
| return wer | |
| if words != 0: | |
| wer = 1.0 * scores / words | |
| else: | |
| # Empty reference - return maximum error rate instead of inf | |
| wer = 1.0 if scores > 0 else 0.0 | |
| return wer |
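For reference, a self-contained WER sketch with the empty-reference guard applied. This is a standard word-level Levenshtein distance, assumed to match the intent of the PR's `wer` but not its exact implementation:

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length.
    An empty reference yields 1.0 when the hypothesis has words
    (maximum error) and 0.0 when both are empty, avoiding inf."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 1.0 if hyp else 0.0
    # classic dynamic-programming edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```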
| """ | ||
| def __init__(self, model_path="gemini-1.5-pro-latest") -> None: | ||
| import google.generativeai as genai | ||
| genai.configure(api_key=os.environ["GEMINI_API_KEY"]) |
accessing environment variables with bracket notation will raise a KeyError if the variable is not set, causing the program to crash with an unhelpful error message. Use .get() with proper error handling to provide a clear message to users about what's missing.
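A minimal sketch of the suggested pattern; `get_required_env` is a hypothetical helper and the error wording is illustrative:

```python
import os

def get_required_env(name: str) -> str:
    """Fetch a required environment variable with a clear error
    instead of an opaque KeyError from os.environ[name]."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            f"Export it before running, e.g. export {name}=<your-key>."
        )
    return value
```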
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
| ] | ||
| check=True, | ||
| ) |
missing comma between the list closing bracket and check=True causes a SyntaxError. The closing bracket ] on line 49 should be followed by a comma before check=True on line 50.
| ] | |
| check=True, | |
| ) | |
| ], | |
| check=True, | |
| ) |
This same syntax error appears in all 12 prepare functions (lines 50, 70, 90, 110, 131, 151, 171, 191, 211, 230, 250, 269). The code will fail to run until all instances are fixed.
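The failure mode is easy to demonstrate without running anything, since `compile()` surfaces it at parse time; the snippets below are simplified stand-ins for the PR's calls:

```python
# missing comma after the closing bracket: a parse-time SyntaxError
bad = """
subprocess.run([
    "echo", "hi"
]
    check=True,
)
"""

# with the comma restored, the source parses fine
good = """
subprocess.run([
    "echo", "hi"
],
    check=True,
)
"""

def parses(src: str) -> bool:
    # compile() only parses, so the undefined `subprocess` name is irrelevant
    try:
        compile(src, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False
```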
| def text_to_tokens(self, text: str) -> List[str]: | ||
| tokens = self.tokenizer.tokenize(text) | ||
| return tokens | ||
| def tokens_to_text(self, tokens: List[int]) -> str: | ||
| text = self.tokenizer.convert_tokens_to_string(tokens) | ||
| return text |
type annotation mismatch in HFTokenizer: text_to_tokens returns List[str] (line 45) but tokens_to_text expects List[int] as input (line 49). This is inconsistent with the OpenAITokenizer which correctly uses List[int] for both methods.
The HFTokenizer should be consistent - either both methods should use List[str] (if working with string tokens) or both should use List[int] (if working with token IDs). Based on the implementation using tokenizer.tokenize() which returns string tokens, the fix should be:
| def text_to_tokens(self, text: str) -> List[str]: | |
| tokens = self.tokenizer.tokenize(text) | |
| return tokens | |
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| text = self.tokenizer.convert_tokens_to_string(tokens) | |
| return text | |
| def tokens_to_text(self, tokens: List[str]) -> str: | |
| text = self.tokenizer.convert_tokens_to_string(tokens) | |
| return text |
| def tokens_to_text(self, tokens: List[int]) -> str: | ||
| pass |
the tokens_to_text method is not implemented and will return None. This will cause errors if this method is ever called. Since GeminiTokenizer's text_to_tokens returns fake sequential numbers rather than actual tokens, there's no way to reverse this operation. Either implement proper token decoding or raise NotImplementedError to make it explicit:
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| pass | |
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| raise NotImplementedError("GeminiTokenizer does not support converting tokens back to text") |
| def text_to_tokens(self, text: str) -> List[int]: | ||
| tokens = list(range(self.model.count_tokens(text).total_tokens)) | ||
| return tokens |
GeminiTokenizer returns fake tokens (sequential numbers from 0 to token_count) instead of actual token IDs. The line list(range(self.model.count_tokens(text).total_tokens)) creates [0, 1, 2, ..., n-1] which are not real tokens. This breaks the tokenizer abstraction since these fake tokens cannot be decoded back to text and don't represent the actual tokenization.
If the Gemini API doesn't provide access to actual tokens, this should be documented clearly or the tokenizer should raise an error for operations that require real tokens.
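One way to make the limitation explicit, sketched as a generic wrapper. `CountOnlyTokenizer` and `count_fn` are illustrative names; the real class would wrap the Gemini token-counting call:

```python
from typing import Callable, List

class CountOnlyTokenizer:
    """For APIs that expose only a token count. text_to_tokens
    returns placeholder ids usable solely for length arithmetic;
    decoding is impossible, so tokens_to_text fails loudly
    instead of silently returning None."""

    def __init__(self, count_fn: Callable[[str], int]) -> None:
        self._count = count_fn  # e.g. a wrapper around the API's count call

    def text_to_tokens(self, text: str) -> List[int]:
        # placeholder ids 0..n-1, NOT real token ids
        return list(range(self._count(text)))

    def tokens_to_text(self, tokens: List[int]) -> str:
        raise NotImplementedError(
            "count-only tokenizer cannot decode tokens back to text"
        )
```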
| if words != 0: | ||
| wer = 1.0 * scores / words | ||
| else: | ||
| wer = float('inf') | ||
| return wer | ||
| def string_match_all_single(preds, refs): | ||
| """the metric function with input (predictions: [str], references: [[str]]) to compute score.""" | ||
| preds = post_process_preds(preds) | ||
| preds = [preds] | ||
| refs = [refs] | ||
| score = [ | ||
| sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - wer([pred.lower()], [r.lower()])) for r in ref]) / len(ref) for pred, ref in zip(preds, refs) | ||
| ][0] | ||
| return score |
when WER returns float('inf') (for empty references on line 116), the score calculation 1 - wer([pred.lower()], [r.lower()]) on line 125 becomes 1 - inf = -inf. The max(1.0 if ..., -inf) will always choose the first value (0.0 or 1.0), but this behavior may not be intended.
Consider handling the infinity case explicitly:
| if words != 0: | |
| wer = 1.0 * scores / words | |
| else: | |
| wer = float('inf') | |
| return wer | |
| def string_match_all_single(preds, refs): | |
| """the metric function with input (predictions: [str], references: [[str]]) to compute score.""" | |
| preds = post_process_preds(preds) | |
| preds = [preds] | |
| refs = [refs] | |
| score = [ | |
| sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - wer([pred.lower()], [r.lower()])) for r in ref]) / len(ref) for pred, ref in zip(preds, refs) | |
| ][0] | |
| return score | |
| def string_match_all_single(preds, refs): | |
| """the metric function with input (predictions: [str], references: [[str]]) to compute score.""" | |
| preds = post_process_preds(preds) | |
| preds = [preds] | |
| refs = [refs] | |
| score = [ | |
| sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - min(wer([pred.lower()], [r.lower()]), 1.0)) for r in ref]) / len(ref) for pred, ref in zip(preds, refs) | |
| ][0] | |
| return score |
This clamps WER to [0, 1] before computing 1 - wer, preventing negative infinity. The same issue exists in string_match_2steps_single (line 135) and string_match_part_single (line 146).
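The clamping logic, pulled out into a standalone sketch for clarity; `match_score` and the injected `wer_fn` are illustrative (the PR inlines this in a list comprehension):

```python
from typing import Callable, List

def match_score(pred: str, refs: List[str],
                wer_fn: Callable[[str, str], float]) -> float:
    """Exact containment wins outright; otherwise WER is clamped
    to [0, 1] so 1 - wer never goes negative (or to -inf when the
    reference is empty and wer returns float('inf'))."""
    total = 0.0
    for r in refs:
        exact = 1.0 if r.lower() in pred.lower() else 0.0
        fuzzy = 1.0 - min(wer_fn(pred.lower(), r.lower()), 1.0)
        total += max(exact, fuzzy)
    return total / len(refs)
```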
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| + except AssertionError: | ||
| + if used_qs > incremental: | ||
| + used_qs -= incremental | ||
| + else: | ||
| + raise |
the exception handler only catches AssertionError but generate_input_output can raise other exceptions like ValueError (from line 262 when fewshot > curr_context) - these uncaught exceptions will crash the sample generation loop
| def main(): | ||
| output_file = Path(args.output_folder) / "test.jsonl" |
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file.parent.mkdir(parents=True, exist_ok=True) |
| def main(): | ||
| output_file = Path(args.output_folder) / "test.jsonl" |
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file.parent.mkdir(parents=True, exist_ok=True) |
| def main(): | ||
| output_file = Path(args.output_folder) / "test.jsonl" |
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file.parent.mkdir(parents=True, exist_ok=True) |
| curr_needle["distractor"] = [{**c, "random_index": generate_random_number()} for c in curr_needle["distractor"]] | ||
| if args.fewshot > 0: | ||
| fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot) |
the random.sample call will raise ValueError if args.fewshot > len(needle) - 1 (since we exclude the current index) - this should be validated or handled
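The validation can be sketched as follows; `pick_fewshot` is a hypothetical helper mirroring the fewshot sampling above:

```python
import random

def pick_fewshot(pool_size: int, exclude: int, fewshot: int) -> list:
    """random.sample raises ValueError when k exceeds the
    population size, so validate first with a clearer message."""
    candidates = [i for i in range(pool_size) if i != exclude]
    if fewshot > len(candidates):
        raise ValueError(
            f"fewshot ({fewshot}) exceeds available samples ({len(candidates)})"
        )
    return random.sample(candidates, fewshot)
```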
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size are passed directly to subprocess.run, which requires all arguments to be strings. This will cause a TypeError at runtime
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| tokenizer_path, | ||
| "--max_seq_length", |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| tokenizer_path, | |
| "--max_seq_length", | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| length, | ||
| "--num_samples", |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| length, | |
| "--num_samples", | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| ] | ||
| random.shuffle(sentences) | ||
| indexes = sorted(random.sample(range(num_haystack), len(needles)), reverse=True) |
This line will raise a ValueError when len(needles) > num_haystack because random.sample() cannot sample more elements than the population size.
This can occur when args.num_needle_k * args.num_needle_v is large but num_haystack is small (e.g., during binary search or when max_seq_length is very restrictive).
| indexes = sorted(random.sample(range(num_haystack), len(needles)), reverse=True) | |
| indexes = sorted(random.sample(range(num_haystack), min(len(needles), num_haystack)), reverse=True) |
| setup = list(metrics.keys())[0].rsplit(".", 1)[0] | ||
| metrics[setup] = {} | ||
| for aggregation in metrics[f"{setup}.mk_niah_basic"]: |
This code assumes metrics is non-empty and that specific keys exist without validation. If metrics is empty, line 33 raises IndexError. If expected tasks are missing, line 36 raises KeyError.
| setup = list(metrics.keys())[0].rsplit(".", 1)[0] | |
| metrics[setup] = {} | |
| for aggregation in metrics[f"{setup}.mk_niah_basic"]: | |
| if not metrics: | |
| raise ValueError("metrics dictionary cannot be empty") | |
| setup = list(metrics.keys())[0].rsplit(".", 1)[0] | |
| # Validate that mk_niah_basic exists | |
| if f"{setup}.mk_niah_basic" not in metrics: | |
| raise KeyError(f"Expected key '{setup}.mk_niah_basic' not found in metrics") | |
| metrics[setup] = {} |
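A standalone sketch of the guarded derivation; `derive_setup` is an illustrative name, as the PR performs this inline:

```python
def derive_setup(metrics: dict, anchor_task: str = "mk_niah_basic") -> str:
    """Guard the key derivation instead of letting an empty dict
    surface as IndexError or a missing task as KeyError."""
    if not metrics:
        raise ValueError("metrics dictionary cannot be empty")
    setup = next(iter(metrics)).rsplit(".", 1)[0]
    if f"{setup}.{anchor_task}" not in metrics:
        raise KeyError(f"expected key '{setup}.{anchor_task}' not found in metrics")
    return setup
```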
| output_folder = Path(__file__).parent / setup | ||
| # 1. installing necessary packages | ||
| subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) |
The dependencies wonderwords, html2text, and tenacity are installed at runtime via pip, but they are not listed in requirements/main.txt. This creates an inconsistency where:
- The dependencies won't be installed during normal package installation
- Runtime pip installation may fail in restricted environments
- It makes dependency management unclear
These dependencies should be added to requirements/main.txt. Note that tenacity is used in tokenizer.py but not declared anywhere in requirements.
| q["random_index"] = random_numbers[i] | ||
| random.shuffle(curr_context) | ||
| examples = random.sample(curr_context, args.fewshot) |
This line will raise a ValueError if args.fewshot > len(curr_context). While unlikely in typical usage, it could occur with specific parameter combinations or edge cases during binary search when num_qs is very small.
| examples = random.sample(curr_context, args.fewshot) | |
| examples = random.sample(curr_context, min(args.fewshot, len(curr_context))) |
| curr_needle["distractor"] = [{**c, "random_index": generate_random_number()} for c in curr_needle["distractor"]] | ||
| if args.fewshot > 0: | ||
| fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot) |
This line will raise a ValueError if args.fewshot >= len(needle). The code samples from indices excluding the current index, so if the dataset only has a few samples and fewshot is configured too high, this will fail.
| fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot) | |
| available_samples = [i for i in range(len(needle)) if i != index] | |
| if args.fewshot > len(available_samples): | |
| raise ValueError(f"fewshot ({args.fewshot}) cannot exceed available samples ({len(available_samples)})") | |
| fewshot_examples = random.sample(available_samples, args.fewshot) |
| def __init__(self, model_path="gemini-1.5-pro-latest") -> None: | ||
| import google.generativeai as genai | ||
| genai.configure(api_key=os.environ["GEMINI_API_KEY"]) |
Accessing os.environ["GEMINI_API_KEY"] will raise a KeyError if the environment variable is not set. This should be handled gracefully with a clear error message.
Consider checking if the environment variable exists first and raising a descriptive error if it's missing.
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License |
The copyright notice is missing a period at the end. It should end with "limitations under the License."
| # limitations under the License | |
| # limitations under the License. |
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License |
The copyright notice is missing a period at the end. It should end with "limitations under the License."
| # limitations under the License | |
| # limitations under the License. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License |
The copyright notice is missing a period at the end. It should end with "limitations under the License."
| # limitations under the License | |
| # limitations under the License. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Kipok
left a comment
Please add this to the documentation (ideally with an example command / expected results for any model).
If it takes a long time to prepare data (or data is very large) please also add this benchmark to https://github.com/NVIDIA-NeMo/Skills/blob/main/tests/gpu-tests/test_eval.py#L30, so that we don't automatically run it in ci
| output_folder = Path(__file__).parent / setup | ||
| # 1. installing necessary packages | ||
| # subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) |
logic: the dependencies wonderwords, html2text, and tenacity are commented out but actually required by the code (prepare_niah.py imports wonderwords, tokenizer.py imports tenacity). These should either be uncommented or added to requirements/main.txt
| # subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) | |
| subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) |
| def tokens_to_text(self, tokens: List[int]) -> str: | ||
| pass |
logic: tokens_to_text is not implemented and returns None. Since GeminiTokenizer's text_to_tokens returns fake sequential integers rather than actual tokens, decoding is impossible. If this functionality isn't needed for Gemini, raise NotImplementedError to make the limitation explicit
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| pass | |
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| raise NotImplementedError("GeminiTokenizer does not support decoding tokens back to text") |
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Additional Comments (1)
- requirements/main.txt, line 54: logic: missing required dependencies for RULER2: `wonderwords`, `html2text`, and `tenacity` are imported in `prepare_niah.py` and `tokenizer.py` but not listed here
13 files reviewed, 1 comment
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: Wasi Ahmad <wasiahmad@ucla.edu> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> 
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sean Naren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com> Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Co-authored-by: Jocelyn <jocelynh@nvidia.com> Co-authored-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features