
Add RULERv2 #1106

Merged
Kipok merged 94 commits into main from chsieh/ruler2
Jan 18, 2026
Conversation

@hsiehjackson
Collaborator

@hsiehjackson hsiehjackson commented Dec 12, 2025

Summary by CodeRabbit

New Features

  • Added RULER2 benchmark with multiple dataset types (NIAH, MMLU, QA) at configurable difficulty levels
  • New RULER2 evaluator with scoring capabilities for benchmark assessment
  • Introduced tokenizer abstraction supporting multiple backend providers
  • Added automated dataset preparation workflows for benchmark setup

✏️ Tip: You can customize this high-level summary in your review settings.

@hsiehjackson hsiehjackson requested a review from fayejf December 12, 2025 21:07
@coderabbitai
Contributor

coderabbitai bot commented Dec 12, 2025

📝 Walkthrough

This pull request introduces a comprehensive RULER2 benchmark dataset preparation and evaluation pipeline. The changes add a new dataset module with multiple data preparation scripts (NIAH, MMLU, QA), tokenizer abstractions supporting multiple frameworks (NeMo, HuggingFace, OpenAI, Gemini), an orchestration script that handles dataset generation in parallel, evaluation infrastructure for RULER2 benchmarks, and scoring logic. The pipeline is integrated into the evaluation and metrics systems.

Changes

Cohort / File(s) Summary
RULER2 Module Initialization
nemo_skills/dataset/ruler2/__init__.py
Adds DATASET_GROUP = "long-context" constant to mark the module as a long-context benchmark group.
RULER2 Tokenizer Utilities
nemo_skills/dataset/ruler2/tokenizer.py
Introduces tokenizer factory (select_tokenizer) and adapter classes for NeMo TikToken, NeMo SentencePiece, HuggingFace, OpenAI, and Gemini tokenizers with unified text_to_tokens and tokens_to_text interfaces.
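The unified interface can be sketched as follows. This is illustrative only: a trivial whitespace backend stands in for the real NeMo/HuggingFace/OpenAI/Gemini adapters, and only the method names (select_tokenizer, text_to_tokens, tokens_to_text) mirror the summary above.

```python
class WhitespaceTokenizer:
    """Stand-in backend that splits on whitespace; the real adapters
    wrap NeMo, HuggingFace, OpenAI, or Gemini tokenizers."""

    def text_to_tokens(self, text):
        return text.split()

    def tokens_to_text(self, tokens):
        return " ".join(tokens)


def select_tokenizer(tokenizer_type, tokenizer_path=None):
    """Factory mirroring the dispatch described above: map a backend
    name to an adapter class exposing the same two methods."""
    backends = {"whitespace": WhitespaceTokenizer}
    try:
        return backends[tokenizer_type]()
    except KeyError:
        raise ValueError(f"Unknown tokenizer type: {tokenizer_type}") from None
```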
RULER2 Data Preparation — NIAH
nemo_skills/dataset/ruler2/prepare_niah.py
Generates synthetic QA data using randomized numbers, words, or UUIDs; supports essay and needle-noise contexts; uses binary search to optimize haystack size within token limits; outputs JSONL samples.
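The token-aware binary search mentioned here can be sketched generically; the function and parameter names are hypothetical (the actual prep scripts inline this logic), but the idea is to find the largest haystack size whose rendered context still fits the token budget.

```python
def find_max_haystack(build_context, count_tokens, max_tokens, upper_bound):
    """Binary search for the largest n such that the context built from n
    haystack units tokenizes to at most max_tokens.

    build_context(n) -> str renders a candidate context;
    count_tokens(text) -> int is the tokenizer's length function.
    """
    lo, hi, best = 0, upper_bound, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if count_tokens(build_context(mid)) <= max_tokens:
            best = mid      # fits: remember it and try larger
            lo = mid + 1
        else:
            hi = mid - 1    # too long: shrink
    return best
```

Each probe costs one tokenizer call, so the search needs O(log upper_bound) tokenizations instead of one per candidate size.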
RULER2 Data Preparation — MMLU
nemo_skills/dataset/ruler2/prepare_mmlu.py
Prepares multiple datasets (GSM8K, MATH500, MMLU, MBPP) into haystack and needle collections; constructs contextual prompts for retrieve/NIAH/solve task types; generates input-output pairs with token-aware sampling via binary search.
RULER2 Data Preparation — QA
nemo_skills/dataset/ruler2/prepare_qa.py
Generates QA prompt-output pairs from SQuAD v2, HotpotQA, and MusiQue; supports flexible prompt types, few-shot examples, and multiple query types; uses tokenizer-based binary search to determine optimal haystack sizes.
RULER2 Dataset Orchestration
nemo_skills/dataset/ruler2/prepare.py
Central coordinator that installs dependencies, creates per-task __init__.py files via prepare_task_for_ns, executes all task preparation functions in parallel via ThreadPoolExecutor, and generates a top-level __init__.py with benchmark metadata and BENCHMARKS mapping.
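A minimal sketch of that orchestration pattern, assuming one subprocess command per task (the real prepare.py builds richer commands and also writes __init__.py metadata; argument-list commands are used here rather than the shell=True strings that Ruff flags later in this review):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_task(cmd):
    """Run one data-prep script; check=True raises on nonzero exit."""
    subprocess.run(cmd, check=True)


def prepare_all(commands, max_workers=4):
    """Submit every task to the pool, then call result() on each future
    so any subprocess failure propagates instead of being swallowed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_task, cmd) for cmd in commands]
        for future in futures:
            future.result()  # re-raises CalledProcessError from the worker
```

Calling result() on every future is what "surfaces errors" in the sequence diagram below: without it, a failed prep script would fail silently.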
RULER2 Scoring
nemo_skills/dataset/ruler2/ruler2_score.py
Implements compute_score function that aggregates accuracy metrics across predefined tasks by deriving setup name and computing task-level averages.
Existing RULER Module Update
nemo_skills/dataset/ruler/prepare.py
Adjusts end-of-file newline formatting; no functional changes.
Evaluation Integration
nemo_skills/evaluation/evaluator/__init__.py
Imports eval_ruler2 and adds mapping entry "ruler2": eval_ruler2 to EVALUATOR_MAP.
Evaluation Logic
nemo_skills/evaluation/evaluator/ruler.py
Adds eval_ruler2 function with internal helpers for prediction parsing, word error rate computation, and three string-matching strategies (all, 2-step, partial); reads JSONL, computes correctness, and writes results.
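The two scoring ingredients described here can be sketched as follows; a small stdlib DP stands in for the editdistance package the PR uses for WER, and string_match_all is an illustrative name for the "all" matching strategy, not the real function's signature.

```python
def word_error_rate(hyp, ref):
    """Word-level edit distance divided by reference length
    (classic Levenshtein DP over word lists)."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(h)][len(r)] / max(len(r), 1)


def string_match_all(prediction, references):
    """'all' strategy: fraction of expected answers found verbatim
    in the prediction (partial credit per needle)."""
    return sum(ref in prediction for ref in references) / len(references)
```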
Metrics Registration
nemo_skills/evaluation/metrics/map_metrics.py
Adds mapping entry "ruler2": RulerMetrics to METRICS_MAP.

Sequence Diagram

sequenceDiagram
    participant CLI as Command Line
    participant Orchestrator as prepare_dataset<br/>(ruler2/prepare.py)
    participant PackageMgr as Pip
    participant TaskSetup as prepare_task_for_ns
    participant Executor as ThreadPoolExecutor
    participant TaskFunc as Task Functions<br/>(prepare_*_*)
    participant Subprocess as Subprocess Runner
    participant Tokenizer as Tokenizer
    participant FileSystem as File System
    participant EvalInit as Top-level __init__.py

    CLI->>Orchestrator: prepare_dataset(tasks, setup, ...)
    Orchestrator->>PackageMgr: pip install (wonderwords,<br/>html2text, tenacity)
    PackageMgr-->>Orchestrator: ✓
    
    loop For each task
        Orchestrator->>TaskSetup: prepare_task_for_ns(folder, task)
        TaskSetup->>FileSystem: Write per-task __init__.py<br/>(metrics, eval_args)
        FileSystem-->>TaskSetup: ✓
        TaskSetup-->>Orchestrator: ✓
    end
    
    Orchestrator->>Executor: submit() each task function
    par Task Execution (Parallel)
        TaskFunc->>Subprocess: Run data prep script<br/>with args (tokenizer,<br/>max_seq_length, etc.)
        Subprocess->>Tokenizer: select_tokenizer(type, path)
        Tokenizer-->>Subprocess: tokenizer instance
        Subprocess->>Tokenizer: text_to_tokens() for<br/>length estimation & binary search
        Tokenizer-->>Subprocess: token count
        Subprocess->>FileSystem: Write test.jsonl<br/>(samples)
        FileSystem-->>Subprocess: ✓
        Subprocess-->>TaskFunc: exit code
    end
    
    Executor-->>Orchestrator: results (surface errors)
    Orchestrator->>EvalInit: Write top-level __init__.py<br/>(score_module, BENCHMARKS)
    EvalInit->>FileSystem: Persist metadata
    FileSystem-->>EvalInit: ✓
    EvalInit-->>Orchestrator: ✓
    Orchestrator-->>CLI: Print completion message

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • tokenizer.py: Five distinct tokenizer implementations (NeMo TikToken, NeMo SentencePiece, HuggingFace, OpenAI, Gemini) with different import patterns, error handling for format validation, and retry logic for Gemini API calls.
  • prepare.py: ThreadPoolExecutor-based parallel task execution with subprocess error propagation, dynamic __init__.py generation via string templates, and multi-step orchestration logic.
  • prepare_niah.py, prepare_mmlu.py, prepare_qa.py: Complex binary search algorithms for haystack sizing, diverse prompt construction logic across multiple task types and datasets, dataset-specific reader implementations, and token-length aware sample generation with retry mechanisms.
  • eval_ruler2: Multiple evaluation strategies (word error rate, string matching variants) and metric computation dispatch logic.
  • Integration points: Tokenizer factory used across data prep modules, errors in subprocess calls must propagate correctly, per-task vs. top-level __init__.py generation order and structure.

Possibly related PRs

  • PR #825: Modifies evaluation registration and evaluator codepaths similarly; both PRs update nemo_skills/evaluation/evaluator/__init__.py and add evaluator entries to registration maps.
  • PR #958: Refactors evaluator registration logic in nemo_skills/evaluation/evaluator/ module; directly overlaps with this PR's evaluation infrastructure additions (eval_ruler2, evaluator imports).

Suggested reviewers

  • shuoyangd
  • Kipok

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.13% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add RULERv2' accurately summarizes the main change in the pull request, which introduces a complete new RULER v2 implementation with multiple dataset preparation modules, evaluation utilities, and scoring mechanisms.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 15

🧹 Nitpick comments (4)
nemo_skills/dataset/ruler2/prepare_niah.py (1)

40-65: Avoid import-time side effects (argparse + nltk.download).
Running parser.parse_args() and nltk.download() at module import time works for python -m ... invocation, but breaks any library-style import and can hang CI/offline runs. Move downloads + arg parsing into main() (or guard with if __name__ == "__main__":).

Also applies to: 29-35
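The suggested restructuring looks roughly like this (the argument name is illustrative, and the nltk download is shown as a comment since it requires network access):

```python
import argparse


def main(argv=None):
    """All side effects (arg parsing, downloads) live here, so importing
    the module stays cheap and safe for library use and offline CI."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_seq_length", type=int, default=4096)
    args = parser.parse_args(argv)
    # nltk.download("punkt", quiet=True) would also go here, not at module scope
    return args


if __name__ == "__main__":  # script entry point; nothing runs on import
    main([])  # in the real script: main() reading sys.argv
```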

nemo_skills/dataset/ruler2/prepare_mmlu.py (1)

59-120: Prompt branching looks consistent; consider consolidating string-building for maintainability.
A lot of PROBLEM_PROMPT variants differ by small deltas; a dict/table-driven approach would reduce branching.
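A sketch of the dict-driven approach (the template strings here are invented for illustration; the real PROBLEM_PROMPT variants differ):

```python
# Each task type maps to one template, so adding a variant is a
# single dict entry rather than another if/elif branch.
PROMPT_TEMPLATES = {
    "retrieve": "Find the problem matching this description:\n{problem}",
    "niah": "Somewhere in the context below is this problem:\n{problem}",
    "solve": "Solve the following problem:\n{problem}",
}


def build_prompt(task_type, problem):
    try:
        template = PROMPT_TEMPLATES[task_type]
    except KeyError:
        raise ValueError(f"Unknown task type: {task_type}") from None
    return template.format(problem=problem)
```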

nemo_skills/dataset/ruler2/prepare_qa.py (1)

116-130: Remove no-op f prefixes; add strict= to zip() for safer assumptions.

-    data = load_dataset(f"hotpotqa/hotpot_qa", "distractor")["train"]
+    data = load_dataset("hotpotqa/hotpot_qa", "distractor")["train"]
 ...
-    data = load_dataset(f"hotpotqa/hotpot_qa", "distractor")["validation"]
+    data = load_dataset("hotpotqa/hotpot_qa", "distractor")["validation"]
-    haystack = [f"{t}\n{''.join(s)}" for d in data for t, s in zip(d['context']['title'], d['context']['sentences'])]
+    haystack = [f"{t}\n{''.join(s)}" for d in data for t, s in zip(d["context"]["title"], d["context"]["sentences"], strict=False)]

Also applies to: 124-130

nemo_skills/dataset/ruler2/prepare.py (1)

299-325: Consider allowing an explicit output base dir instead of writing under the source tree.
output_folder = Path(__file__).parent / setup is awkward for installed packages (site-packages is often read-only).
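One way to sketch the suggested change (names hypothetical; the current-directory fallback here stands in for the module-relative default):

```python
from pathlib import Path


def resolve_output_folder(setup, output_dir=None):
    """Prefer an explicit output_dir over writing into the source tree,
    which may be read-only when the package lives in site-packages."""
    base = Path(output_dir) if output_dir else Path.cwd()
    return base / setup
```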

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d3963d and fe9e068.

📒 Files selected for processing (11)
  • nemo_skills/dataset/ruler/prepare.py (1 hunks)
  • nemo_skills/dataset/ruler2/__init__.py (1 hunks)
  • nemo_skills/dataset/ruler2/prepare.py (1 hunks)
  • nemo_skills/dataset/ruler2/prepare_mmlu.py (1 hunks)
  • nemo_skills/dataset/ruler2/prepare_niah.py (1 hunks)
  • nemo_skills/dataset/ruler2/prepare_qa.py (1 hunks)
  • nemo_skills/dataset/ruler2/ruler2_score.py (1 hunks)
  • nemo_skills/dataset/ruler2/tokenizer.py (1 hunks)
  • nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
  • nemo_skills/evaluation/evaluator/ruler.py (2 hunks)
  • nemo_skills/evaluation/metrics/map_metrics.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/ruler.py (2)
  • eval_ruler (35-84)
  • eval_ruler2 (87-176)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/ruler_metrics.py (1)
  • RulerMetrics (18-29)
nemo_skills/dataset/ruler2/prepare_mmlu.py (1)
nemo_skills/dataset/ruler2/tokenizer.py (6)
  • select_tokenizer (26-41)
  • text_to_tokens (52-54)
  • text_to_tokens (69-71)
  • text_to_tokens (86-88)
  • text_to_tokens (103-105)
  • text_to_tokens (122-124)
nemo_skills/dataset/ruler2/prepare.py (3)
nemo_skills/pipeline/utils/declarative.py (1)
  • run (346-483)
nemo_skills/dataset/ruler/prepare.py (1)
  • prepare_task (114-122)
nemo_skills/pipeline/utils/cluster.py (1)
  • submit (463-466)
🪛 Ruff (0.14.8)
nemo_skills/evaluation/evaluator/ruler.py

108-108: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


124-126: Prefer next(...) over single element slice

Replace with next(...)

(RUF015)


125-125: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


134-136: Prefer next(...) over single element slice

Replace with next(...)

(RUF015)


135-135: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


146-146: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/dataset/ruler2/prepare_mmlu.py

153-153: Loop control variable index not used within loop body

Rename unused index to _index

(B007)


228-228: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


267-267: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


273-273: Loop control variable i not used within loop body

(B007)


392-392: Do not use bare except

(E722)

nemo_skills/dataset/ruler2/ruler2_score.py

34-34: Prefer next(iter(metrics.keys())) over single element slice

Replace with next(iter(metrics.keys()))

(RUF015)

nemo_skills/dataset/ruler2/prepare_qa.py

117-117: f-string without any placeholders

Remove extraneous f prefix

(F541)


118-118: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


124-124: f-string without any placeholders

Remove extraneous f prefix

(F541)


128-128: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


129-129: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


169-169: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


317-317: Do not use bare except

(E722)

nemo_skills/dataset/ruler2/tokenizer.py

33-33: Avoid specifying long messages outside the exception class

(TRY003)


41-41: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/dataset/ruler2/prepare_niah.py

88-88: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


91-91: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


97-97: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


121-121: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


131-133: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)


164-164: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


219-219: Unpacked variable save_dict is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


244-244: Do not use bare except

(E722)

nemo_skills/dataset/ruler2/prepare.py

33-33: subprocess call with shell=True identified, security issue

(S602)


53-53: subprocess call with shell=True identified, security issue

(S602)


72-72: subprocess call with shell=True identified, security issue

(S602)


91-91: subprocess call with shell=True identified, security issue

(S602)


110-110: subprocess call with shell=True identified, security issue

(S602)


130-130: subprocess call with shell=True identified, security issue

(S602)


149-149: subprocess call with shell=True identified, security issue

(S602)


168-168: subprocess call with shell=True identified, security issue

(S602)


188-188: subprocess call with shell=True identified, security issue

(S602)


206-206: subprocess call with shell=True identified, security issue

(S602)


225-225: subprocess call with shell=True identified, security issue

(S602)


243-243: subprocess call with shell=True identified, security issue

(S602)


302-302: subprocess call with shell=True identified, security issue

(S602)


302-302: Starting a process with a partial executable path

(S607)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/ruler/prepare.py (1)

212-212: EOF-only change looks fine.

nemo_skills/dataset/ruler2/__init__.py (1)

1-15: DATASET_GROUP = "long-context" is clear and consistent with the intended grouping.

nemo_skills/evaluation/evaluator/__init__.py (1)

41-53: Evaluator registration for ruler2 is straightforward and consistent with existing patterns.

nemo_skills/evaluation/metrics/map_metrics.py (1)

58-60: "ruler2": RulerMetrics alias is correctly implemented. Both eval_ruler and eval_ruler2 emit prediction["is_correct"] in identical format (float values from match_type_funcs), which RulerMetrics handles properly in its _get_score_dict method.

Comment on lines +69 to +75
# Define Needle/Haystack Format
needle = "One of the special magic {type_needle_v} for {key} is: {value}."
if args.type_haystack == 'needle':
    haystack = needle
else:
    raise NotImplementedError(f'{args.type_haystack} is not implemented.')

Contributor


⚠️ Potential issue | 🔴 Critical

Critical: --type_haystack handling is internally inconsistent and currently broken for essay/noise.

  • Lines 71-75: only 'needle' is allowed, but argparse help advertises noise, essay, needle.
  • In generate_input_output() the essay/noise/needle branches exist, but haystack is a string in the only supported mode (haystack = needle), so essay mode would iterate characters (" ".join(haystack[:num_haystack])).
    Either (a) implement essay/noise properly (with a list of sentences/docs), or (b) remove those modes from help + delete dead branches and enforce choices=["needle"].
-parser.add_argument("--type_haystack", type=str, default='essay', help='[Options] noise, essay, needle.')
+parser.add_argument("--type_haystack", type=str, default="needle", choices=["needle"])

-needle = "One of the special magic {type_needle_v} for {key} is: {value}."
-if args.type_haystack == 'needle':
-    haystack = needle
-else:
-    raise NotImplementedError(f'{args.type_haystack} is not implemented.')
+needle_sentence = "One of the special magic {type_needle_v} for {key} is: {value}."
-            needles.append(needle.format(
+            needles.append(needle_sentence.format(
                 type_needle_v=args.type_needle_v,
                 key=keys[-1], 
                 value=value[-1],
             ))
-    if args.type_haystack == 'essay':
-        ...
-    else:
-        if args.type_haystack == 'noise':
-            ...
-        elif args.type_haystack == 'needle':
+    if args.type_haystack == "needle":
         ...
-        context = "\n".join(sentences)
+        context = "\n".join(sentences)
+    else:
+        raise NotImplementedError(f"{args.type_haystack} is not implemented.")

Also applies to: 123-167

🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/prepare_niah.py around lines 69 to 75 (and also
review lines 123-167), the --type_haystack handling is inconsistent: only
'needle' is currently supported but argparse documents 'noise, essay, needle',
and when other modes are used haystack is a string so downstream code will
iterate characters. Fix by either (A) implementing proper haystack as a list for
'essay' and 'noise' modes (produce a list of full sentences/documents and ensure
generate_input_output() uses list semantics, e.g., build haystack = [sent1,
sent2, ...] and adjust joins/sampling accordingly), or (B) remove 'essay' and
'noise' from argparse help and enforce choices=['needle'] and delete dead
branches in generate_input_output(); pick one approach and make corresponding
unit/test updates and update help text/comments.

Comment on lines +16 to +42
def compute_score(metrics: dict):
    # just an average of all metrics. Here we assume that all tasks are present.
    # if that's not the case, users shouldn't run as a group
    tasks = [
        "mk_niah_basic",
        "mk_niah_easy",
        "mk_niah_medium",
        "mk_niah_hard",
        "mv_niah_basic",
        "mv_niah_easy",
        "mv_niah_medium",
        "mv_niah_hard",
        "qa_basic",
        "qa_easy",
        "qa_medium",
        "qa_hard",
    ]
    setup = list(metrics.keys())[0].rsplit(".", 1)[0]
    metrics[setup] = {}

    for aggregation in metrics[f"{setup}.mk_niah_basic"]:
        metrics[setup][aggregation] = {
            "accuracy": sum(metrics[f"{setup}.{task}"][aggregation].get("accuracy", (metrics[f"{setup}.{task}"][aggregation].get("symbolic_correct", 0))) for task in tasks) / len(tasks)
        }

    return metrics
Contributor


⚠️ Potential issue | 🟠 Major

Make compute_score() resilient to key order, collisions, and partial runs. Right now it (1) derives setup from the “first” key, (2) writes into metrics[setup] (can collide with an existing key), and (3) will throw KeyError if any task/aggregation is missing.

Suggested patch (keeps current behavior but fails fast with clearer errors + removes ordering footgun):

 def compute_score(metrics: dict):
@@
-    setup = list(metrics.keys())[0].rsplit(".", 1)[0]
-    metrics[setup] = {}
-
-    for aggregation in metrics[f"{setup}.mk_niah_basic"]:
-        metrics[setup][aggregation] = {
-            "accuracy": sum(metrics[f"{setup}.{task}"][aggregation].get("accuracy", (metrics[f"{setup}.{task}"][aggregation].get("symbolic_correct", 0))) for task in tasks) / len(tasks)
-        }
+    first_key = next(iter(metrics))
+    setup = first_key.rsplit(".", 1)[0]
+
+    agg_key = f"{setup}.mk_niah_basic"
+    if agg_key not in metrics:
+        raise KeyError(f"Missing aggregation source key: {agg_key}")
+
+    out_key = setup
+    if out_key in metrics and not isinstance(metrics[out_key], dict):
+        raise TypeError(f"Refusing to overwrite non-dict metrics[{out_key!r}]")
+    metrics.setdefault(out_key, {})
+
+    for aggregation in metrics[agg_key]:
+        total = 0.0
+        for task in tasks:
+            task_key = f"{setup}.{task}"
+            if task_key not in metrics or aggregation not in metrics[task_key]:
+                raise KeyError(f"Missing metrics for {task_key}.{aggregation}")
+            cell = metrics[task_key][aggregation]
+            total += cell.get("accuracy", cell.get("symbolic_correct", 0))
+        metrics[out_key][aggregation] = {"accuracy": total / len(tasks)}
 
     return metrics
🧰 Tools
🪛 Ruff (0.14.8)

34-34: Prefer next(iter(metrics.keys())) over single element slice

Replace with next(iter(metrics.keys()))

(RUF015)

🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/ruler2_score.py around lines 16–42, the function
derives setup from the first dict key, blindly writes into metrics[setup]
(risking collision), and assumes every task/aggregation exists which causes
KeyError; fix by (1) determine the unique setup by scanning all metric keys and
validating exactly one setup is present (split on last '.'), (2) refuse to
overwrite an existing metrics key by raising a clear ValueError if metrics
already contains the chosen summary key, (3) enumerate available aggregations
for the setup by inspecting one task entry or collecting aggregation names
across tasks, and (4) when computing per-aggregation averages either (a) fail
fast with clear errors if expected tasks/aggregations are missing or (b) compute
averages only over tasks present and document behavior; ensure error messages
include the missing task/aggregation and the derived setup so callers can debug.

Comment on lines +78 to +93
class HFTokenizer:
    """
    Tokenizer from HF models
    """
    def __init__(self, model_path) -> None:
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    def text_to_tokens(self, text: str) -> List[str]:
        tokens = self.tokenizer.tokenize(text)
        return tokens

    def tokens_to_text(self, tokens: List[int]) -> str:
        text = self.tokenizer.convert_tokens_to_string(tokens)
        return text

Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

HuggingFace Transformers trust_remote_code=True security RCE implications recommended defaults

💡 Result:

Short answer

  • trust_remote_code=True lets Transformers execute arbitrary Python code from a model repo (modeling/config/loading code). That can lead to arbitrary code execution / RCE if the remote repo is malicious or compromised — it is effectively running untrusted code in your process. [1][2][3].
  • Recommended default: keep trust_remote_code=False unless you explicitly trust the model repository, and follow strict mitigations when you must enable it (pin revision, vet files, run in isolated environment, least privilege). [1]

Why it’s risky (brief)

  • Enabling it loads and runs Python modules from the model repo (not just weights), so malicious code in those files runs with your process privileges. This class of risk has led to multiple Transformers/supply‑chain vulnerabilities and CVEs (deserialization and config/model‑code RCEs). [2][3][4]

Practical recommendations (minimal, high‑value)

  1. Default to trust_remote_code=False. Only set True for repos you’ve audited or that are from a fully trusted party. [1]
  2. Pin a commit/revision (don’t allow mutable “latest” branches) when you must trust remote code. [1]
  3. Inspect the repo’s Python files before enabling (modeling_*.py, config, init, setup). Look for imports, subprocess/os usage, deserialization (pickle), or dynamic eval/exec. [1][2]
  4. Run any code that requires trust_remote_code=True in an isolated sandbox: container, VM, or dedicated build host with network/filesystem restrictions and unprivileged user. Prefer ephemeral environments.
  5. Minimize privileges: avoid running as root; restrict network and filesystem access; use seccomp/AppArmor/SELinux or container runtime restrictions when possible.
  6. Prefer safe weight formats (safetensors) and avoid loading untrusted pickled checkpoints. Use use_safetensors where supported. [1][2]
  7. Keep transformers and related packages up to date; subscribe to security advisories and patch quickly. [2][3][4]
  8. For teams: enforce review policy for model imports, provide a curated internal model registry, and consider automated static checks on remote model code before allowing trust.

References

  • Transformers security notes / documentation (trust_remote_code guidance). [1]
  • CVE examples: deserialization / RCE issues in Transformers (CVE-2024-3568, CVE-2024-11392). [2][3]
  • Supply‑chain / sandbox escape example (smolagents sandbox escape). [4]

If you want, I can: 1) produce a short checklist/script to vet a model repo’s Python files, or 2) give an example Docker seccomp profile and minimal container command to run a trusted remote model safely.


🏁 Script executed:

# First, understand the file structure and context
git ls-files | grep -E "(tokenizer|ruler2)" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 348


🏁 Script executed:

# Check how HFTokenizer is instantiated - is model_path user-controlled?
rg "HFTokenizer" -A 3 -B 3

Repository: NVIDIA-NeMo/Skills

Length of output: 1023


🏁 Script executed:

# Look for other tokenizer implementations for comparison
fd "tokenizer" -t f | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 101


🏁 Script executed:

# Check the factory function and how tokenizer_type is determined
rg "get_tokenizer|tokenizer_type" -A 5 -B 2 | head -50

Repository: NVIDIA-NeMo/Skills

Length of output: 4251


🏁 Script executed:

# Look at the prepare scripts to see what model_path values are used
rg "model_path|tokenizer" nemo_skills/dataset/ruler2/prepare.py -A 2 -B 2 | head -100

Repository: NVIDIA-NeMo/Skills

Length of output: 3675


🏁 Script executed:

# Check if there's any CLI argument or environment variable that controls trust_remote_code
rg "trust_remote_code|argparse|cli"

Repository: NVIDIA-NeMo/Skills

Length of output: 50375


trust_remote_code=True creates an avoidable RCE risk; change the default to False.

HuggingFace's own security guidance recommends keeping trust_remote_code=False as the default and enabling it only for explicitly trusted model repositories. Enabling remote code execution allows arbitrary Python code from the model repository to run during tokenizer initialization—a known attack vector with multiple CVEs.

Since model_path is user-controlled (passed via CLI argument), users can inadvertently point to malicious or compromised model repositories without explicit intent to do so.

-        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)
🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/tokenizer.py around lines 78 to 93, the tokenizer
is initialized with trust_remote_code=True which introduces an unnecessary RCE
risk when model_path is user-controlled; change the default to
trust_remote_code=False and only enable True when an explicit, documented flag
or parameter (e.g., trust_remote_code or trust_remote_code_env) is passed by the
caller, update the __init__ signature to accept that flag (default False), and
ensure any code paths calling HFTokenizer are updated to pass True explicitly
when the model source is trusted.
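One way to sketch the suggested opt-in (HF_TRUST_REMOTE_CODE is a hypothetical environment variable name; the real change would thread the resulting flag through HFTokenizer.__init__ into AutoTokenizer.from_pretrained):

```python
import os


def resolve_trust_remote_code(explicit=None):
    """Default to False and require an explicit argument or an
    env-var opt-in before enabling remote code execution."""
    if explicit is not None:
        return explicit
    return os.environ.get("HF_TRUST_REMOTE_CODE", "").lower() in ("1", "true")
```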

Comment on lines +112 to +127
class GeminiTokenizer:
    """
    Tokenizer from gemini
    """
    def __init__(self, model_path="gemini-1.5-pro-latest") -> None:
        import google.generativeai as genai
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.model = genai.GenerativeModel(model_path)

    @retry(wait=wait_fixed(60) + wait_random(0, 10), stop=stop_after_attempt(3))
    def text_to_tokens(self, text: str) -> List[int]:
        tokens = list(range(self.model.count_tokens(text).total_tokens))
        return tokens

    def tokens_to_text(self, tokens: List[int]) -> str:
        pass
Contributor


⚠️ Potential issue | 🟠 Major

Make GeminiTokenizer fail fast + avoid allocating huge lists; don’t leave pass.

  • KeyError on missing GEMINI_API_KEY is opaque.
  • list(range(total_tokens)) is a large allocation for long contexts; range(total_tokens) still supports len().
  • tokens_to_text() should raise NotImplementedError.
-        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
+        api_key = os.environ.get("GEMINI_API_KEY")
+        if not api_key:
+            raise RuntimeError("GEMINI_API_KEY is not set")
+        genai.configure(api_key=api_key)
...
-        tokens = list(range(self.model.count_tokens(text).total_tokens))
-        return tokens
+        return range(self.model.count_tokens(text).total_tokens)
...
     def tokens_to_text(self, tokens: List[int]) -> str:
-        pass
+        raise NotImplementedError("GeminiTokenizer.tokens_to_text is not supported")
🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/tokenizer.py around lines 112-127, the initializer
and token methods need hardening and a real implementation for tokens_to_text:
check for GEMINI_API_KEY explicitly and raise a clear RuntimeError (or
ValueError) if missing instead of allowing a KeyError, avoid creating huge lists
by returning a range(total_tokens) (or another lazy/iterable) from
text_to_tokens rather than list(range(...)), and replace the tokens_to_text pass
with raising NotImplementedError to signal it's intentionally unimplemented.

Collaborator

@Kipok Kipok left a comment


I haven't reviewed the underlying logic as I don't know the details of ruler2. @fayejf could you have a look? Or @hsiehjackson if you think it's ok to merge without Fei's review, we can do that as well, let me know.

Please add the benchmark to docs though and also please add an example command with some reference results for any model



def select_tokenizer(tokenizer_type, tokenizer_path):
if tokenizer_type == 'nemo':
do we really need this? I think all recent nemotrons use normal hf tokenizers

greptile-apps bot commented Jan 13, 2026

Greptile Summary

This PR adds the RULERv2 benchmark for evaluating long-context language models. It introduces a comprehensive framework with three dataset types (NIAH, MMLU, QA) across four difficulty levels each (basic, easy, medium, hard), totaling 12 benchmark tasks. The implementation includes:

  • Tokenizer abstraction supporting HuggingFace, OpenAI, and Gemini backends
  • Automated dataset preparation with parallel execution
  • Evaluation logic using WER-based scoring and string matching
  • Score aggregation across all tasks
  • Integration with existing NeMo-Skills infrastructure

Key Changes:

  • New ruler2 dataset module with preparation scripts for NIAH, MMLU, and QA benchmarks
  • Tokenizer abstraction layer for flexible backend support
  • RULER2 evaluator with configurable matching strategies (all, part, 2steps)
  • Documentation with usage examples and baseline scores
  • Added editdistance dependency for WER calculation

Issues Addressed in Previous Review Threads:
Most critical issues from previous review rounds have been noted, including infinite loop vulnerabilities, command injection risks, type inconsistencies in tokenizers, and missing dependencies. These are tracked separately.

Confidence Score: 4/5

  • This PR is mostly safe to merge with minor issues remaining
  • The implementation is well-structured with clear separation of concerns and good test coverage. The core benchmark logic appears sound. However, there are several implementation issues noted in previous review threads that should be addressed, including type inconsistencies in the tokenizer abstraction, incomplete GeminiTokenizer implementation, and potential edge cases in error handling. The PR adds significant new functionality without breaking existing code.
  • Pay attention to nemo_skills/dataset/ruler2/tokenizer.py for type consistency and GeminiTokenizer implementation

Important Files Changed

Filename Overview
nemo_skills/dataset/ruler2/prepare.py Main dataset preparation orchestration script with parallel task execution. Contains subprocess calls with proper string conversion.
nemo_skills/dataset/ruler2/tokenizer.py Tokenizer abstraction with type inconsistencies and incomplete GeminiTokenizer implementation.
nemo_skills/dataset/ruler2/prepare_niah.py NIAH benchmark dataset generator for needle-in-haystack tasks.
nemo_skills/dataset/ruler2/prepare_mmlu.py MMLU-based long-context benchmark preparation with multiple task types.
nemo_skills/dataset/ruler2/prepare_qa.py QA dataset preparation using SQuAD and HotpotQA for retrieval and solving tasks.
nemo_skills/evaluation/evaluator/ruler.py RULER evaluation logic with WER-based scoring and string matching.

Sequence Diagram

sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant Tokenizer as tokenizer.py
    participant PrepareNIAH as prepare_niah.py
    participant PrepareMMLU as prepare_mmlu.py
    participant PrepareQA as prepare_qa.py
    participant Evaluator as ruler.py
    participant Scorer as ruler2_score.py

    User->>PrepareScript: ns prepare_data ruler2 --setup --tokenizer_path --max_seq_length
    PrepareScript->>Tokenizer: select_tokenizer(type, path)
    Tokenizer-->>PrepareScript: tokenizer instance
    
    par Parallel Dataset Generation
        PrepareScript->>PrepareNIAH: subprocess call for mk_niah_basic/mv_niah_basic
        PrepareNIAH->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareNIAH: token list
        PrepareNIAH-->>PrepareScript: NIAH dataset generated
    and
        PrepareScript->>PrepareMMLU: subprocess call for mk_niah_easy/medium/hard
        PrepareMMLU->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareMMLU: token list
        PrepareMMLU-->>PrepareScript: MMLU dataset generated
    and
        PrepareScript->>PrepareQA: subprocess call for qa_basic/easy/medium/hard
        PrepareQA->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareQA: token list
        PrepareQA-->>PrepareScript: QA dataset generated
    end
    
    PrepareScript-->>User: Dataset preparation completed

    User->>Evaluator: ns eval --benchmarks=ruler2.setup --model
    loop For each sample
        Evaluator->>Evaluator: default_parse(prediction)
        Evaluator->>Evaluator: string_match_*/wer scoring
        Evaluator->>Evaluator: compute is_correct
    end
    Evaluator-->>User: Evaluation results per task
    
    User->>Scorer: compute_score(metrics)
    Scorer->>Scorer: aggregate all 12 task scores
    Scorer-->>User: Overall RULER2 score

@greptile-apps greptile-apps bot left a comment

12 files reviewed, 12 comments


greptile-apps bot commented Jan 13, 2026

Additional Comments (1)

nemo_skills/evaluation/metrics/map_metrics.py
the copyright notice is truncated - it ends with "See the License for the specific lang" instead of "See the License for the specific language governing permissions and"

# See the License for the specific language governing permissions and

@greptile-apps greptile-apps bot left a comment

9 files reviewed, 9 comments


output_folder = Path(__file__).parent / setup

# 1. installing necessary packages
subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
the use of shell=True with subprocess.run is dangerous because it can lead to command injection vulnerabilities. The package names in the command are hardcoded here, but shell=True should generally be avoided. Use a list of arguments instead:

Suggested change
subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)

Comment on lines +33 to +50
subprocess.run(
f"python -m nemo_skills.dataset.ruler2.prepare_niah "
f"--output_folder {output_folder} "
f"--tokenizer_type {tokenizer_type} "
f"--tokenizer_path {tokenizer_path} "
f"--max_seq_length {length} "
f"--num_samples {dataset_size} "
f"--random_seed 42 "
f"--num_needle_k 1 "
f"--num_needle_v 1 "
f"--num_needle_q 1 "
f"--type_haystack needle "
f"--type_needle_k words "
f"--type_needle_v numbers "
f"--num_digits_v 10",
shell=True,
check=True,
)
command injection vulnerability: all the subprocess.run calls in this file use shell=True with f-strings that include user-provided arguments like output_folder, tokenizer_path, etc. If these contain shell metacharacters (e.g., semicolons, backticks), arbitrary commands could be executed.

The proper fix is to use shell=False and pass arguments as a list. For example:

Suggested change
subprocess.run(
f"python -m nemo_skills.dataset.ruler2.prepare_niah "
f"--output_folder {output_folder} "
f"--tokenizer_type {tokenizer_type} "
f"--tokenizer_path {tokenizer_path} "
f"--max_seq_length {length} "
f"--num_samples {dataset_size} "
f"--random_seed 42 "
f"--num_needle_k 1 "
f"--num_needle_v 1 "
f"--num_needle_q 1 "
f"--type_haystack needle "
f"--type_needle_k words "
f"--type_needle_v numbers "
f"--num_digits_v 10",
shell=True,
check=True,
)
def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
subprocess.run(
[
"python", "-m", "nemo_skills.dataset.ruler2.prepare_niah",
"--output_folder", output_folder,
"--tokenizer_type", tokenizer_type,
"--tokenizer_path", tokenizer_path,
"--max_seq_length", str(length),
"--num_samples", str(dataset_size),
"--random_seed", "42",
"--num_needle_k", "1",
"--num_needle_v", "1",
"--num_needle_q", "1",
"--type_haystack", "needle",
"--type_needle_k", "words",
"--type_needle_v", "numbers",
"--num_digits_v", "10",
],
check=True,
)

This pattern should be applied to ALL subprocess.run calls in this file (all 12 prepare functions).

@@ -0,0 +1,86 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
copyright year is 2024 but should be 2025 to match the other new files in this PR

Suggested change
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

Comment on lines +85 to +86
def tokens_to_text(self, tokens: List[int]) -> str:
pass
the tokens_to_text method in GeminiTokenizer is not implemented - it only contains pass, which means it returns None. If this method is called, it will cause errors downstream. Either implement this method properly or raise NotImplementedError:

Suggested change
def tokens_to_text(self, tokens: List[int]) -> str:
pass
def tokens_to_text(self, tokens: List[int]) -> str:
raise NotImplementedError("GeminiTokenizer does not support tokens_to_text conversion")

"""
def __init__(self, model_path) -> None:
from transformers import AutoTokenizer
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
using trust_remote_code=True allows arbitrary code execution from the model files, which is a security risk. Only use this if you fully trust the model source and have verified its contents. Consider adding a warning or making this configurable:

Suggested change
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# WARNING: trust_remote_code=True allows arbitrary code execution from model files
# Only use with trusted model sources
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

))
values.append(value)

random.Random(args.random_seed).shuffle(needles)
creating a new Random instance with a fixed seed (random.Random(args.random_seed)) on every call to generate_input_output will cause the needles to be shuffled in the exact same order for every sample. This defeats the purpose of randomization and makes all samples have the same needle ordering pattern.

The fix is to use the global random that was already seeded at the module level:

Suggested change
random.Random(args.random_seed).shuffle(needles)
random.shuffle(needles)

This ensures each call gets a different shuffle while still maintaining reproducibility through the global seed.
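The difference is easy to demonstrate in isolation (a hypothetical sketch, not the PR's code):

```python
import random

items = list(range(10))

def shuffled_fresh(seed=42):
    out = list(items)
    random.Random(seed).shuffle(out)  # fresh RNG, same seed -> same order every call
    return out

print(shuffled_fresh() == shuffled_fresh())  # True: every sample gets the same permutation

random.seed(42)  # seed the global RNG once, at module level

def shuffled_global():
    out = list(items)
    random.shuffle(out)  # advances global state between calls
    return out

a, b = shuffled_global(), shuffled_global()  # typically different permutations, still reproducible
```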

})
questions.add(d['problem'])

from math_verify import parse
the import of math_verify.parse is inside a conditional block that only executes when args.dataset == "math500". This means:

  1. If the import fails, you won't know until runtime when processing math500 data
  2. The dependency on math_verify might not be obvious to users

Consider moving this import to the top of the file with a try/except to give a clear error message:

Suggested change
from math_verify import parse
try:
from math_verify import parse
except ImportError:
parse = None # Will be checked when needed

Then add a check where it's used:

if parse is None:
    raise ImportError("math_verify is required for math500 dataset. Install with: pip install math-verify")


Comment on lines +113 to +117
if words != 0:
wer = 1.0 * scores / words
else:
wer = float('inf')
return wer
when the reference text is empty (words == 0), the WER function returns float('inf'). This causes issues in the scoring functions (lines 125, 135, 146) where 1 - wer(...) becomes -inf, so the max(...) silently falls back to the exact-containment term (0.0 or 1.0) and the WER component never contributes for empty references.

Consider handling this edge case explicitly:

Suggested change
if words != 0:
wer = 1.0 * scores / words
else:
wer = float('inf')
return wer
if words != 0:
wer = 1.0 * scores / words
else:
# Empty reference - return maximum error rate instead of inf
wer = 1.0 if scores > 0 else 0.0
return wer

"""
def __init__(self, model_path="gemini-1.5-pro-latest") -> None:
import google.generativeai as genai
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
accessing environment variables with bracket notation will raise a KeyError if the variable is not set, causing the program to crash with an unhelpful error message. Use .get() with proper error handling to provide a clear message to users about what's missing.

hsiehjackson and others added 21 commits January 13, 2026 15:22
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
@greptile-apps greptile-apps bot left a comment

5 files reviewed, 5 comments


Comment on lines +49 to +51
]
check=True,
)
missing comma between the list closing bracket and check=True causes a SyntaxError. The closing bracket ] on line 49 should be followed by a comma before check=True on line 50.

Suggested change
]
check=True,
)
],
check=True,
)

This same syntax error appears in all 12 prepare functions (lines 50, 70, 90, 110, 131, 151, 171, 191, 211, 230, 250, 269). The code will fail to run until all instances are fixed.

Comment on lines +45 to +51
def text_to_tokens(self, text: str) -> List[str]:
tokens = self.tokenizer.tokenize(text)
return tokens

def tokens_to_text(self, tokens: List[int]) -> str:
text = self.tokenizer.convert_tokens_to_string(tokens)
return text
type annotation mismatch in HFTokenizer: text_to_tokens returns List[str] (line 45) but tokens_to_text expects List[int] as input (line 49). This is inconsistent with the OpenAITokenizer which correctly uses List[int] for both methods.

The HFTokenizer should be consistent - either both methods should use List[str] (if working with string tokens) or both should use List[int] (if working with token IDs). Based on the implementation using tokenizer.tokenize() which returns string tokens, the fix should be:

Suggested change
def text_to_tokens(self, text: str) -> List[str]:
tokens = self.tokenizer.tokenize(text)
return tokens
def tokens_to_text(self, tokens: List[int]) -> str:
text = self.tokenizer.convert_tokens_to_string(tokens)
return text
def tokens_to_text(self, tokens: List[str]) -> str:
text = self.tokenizer.convert_tokens_to_string(tokens)
return text

Comment on lines +85 to +86
def tokens_to_text(self, tokens: List[int]) -> str:
pass
the tokens_to_text method is not implemented and will return None. This will cause errors if this method is ever called. Since GeminiTokenizer's text_to_tokens returns fake sequential numbers rather than actual tokens, there's no way to reverse this operation. Either implement proper token decoding or raise NotImplementedError to make it explicit:

Suggested change
def tokens_to_text(self, tokens: List[int]) -> str:
pass
def tokens_to_text(self, tokens: List[int]) -> str:
raise NotImplementedError("GeminiTokenizer does not support converting tokens back to text")

Comment on lines +81 to +83
def text_to_tokens(self, text: str) -> List[int]:
tokens = list(range(self.model.count_tokens(text).total_tokens))
return tokens
GeminiTokenizer returns fake tokens (sequential numbers from 0 to token_count) instead of actual token IDs. The line list(range(self.model.count_tokens(text).total_tokens)) creates [0, 1, 2, ..., n-1] which are not real tokens. This breaks the tokenizer abstraction since these fake tokens cannot be decoded back to text and don't represent the actual tokenization.

If the Gemini API doesn't provide access to actual tokens, this should be documented clearly or the tokenizer should raise an error for operations that require real tokens.

Comment on lines +113 to +127
if words != 0:
wer = 1.0 * scores / words
else:
wer = float('inf')
return wer

def string_match_all_single(preds, refs):
"""the metric function with input (predictions: [str], references: [[str]]) to compute score."""
preds = post_process_preds(preds)
preds = [preds]
refs = [refs]
score = [
sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - wer([pred.lower()], [r.lower()])) for r in ref]) / len(ref) for pred, ref in zip(preds, refs)
][0]
return score
when WER returns float('inf') (for empty references on line 116), the score calculation 1 - wer([pred.lower()], [r.lower()]) on line 125 becomes 1 - inf = -inf. The max(1.0 if ..., -inf) will always choose the first value (0.0 or 1.0), but this behavior may not be intended.

Consider handling the infinity case explicitly:

Suggested change
if words != 0:
wer = 1.0 * scores / words
else:
wer = float('inf')
return wer
def string_match_all_single(preds, refs):
"""the metric function with input (predictions: [str], references: [[str]]) to compute score."""
preds = post_process_preds(preds)
preds = [preds]
refs = [refs]
score = [
sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - wer([pred.lower()], [r.lower()])) for r in ref]) / len(ref) for pred, ref in zip(preds, refs)
][0]
return score
def string_match_all_single(preds, refs):
"""the metric function with input (predictions: [str], references: [[str]]) to compute score."""
preds = post_process_preds(preds)
preds = [preds]
refs = [refs]
score = [
sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - min(wer([pred.lower()], [r.lower()]), 1.0)) for r in ref]) / len(ref) for pred, ref in zip(preds, refs)
][0]
return score

This clamps WER to [0, 1] before computing 1 - wer, preventing negative infinity. The same issue exists in string_match_2steps_single (line 135) and string_match_part_single (line 146).
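A minimal check of the clamping behavior (illustrative only, outside the PR's code):

```python
wer_value = float('inf')   # what wer() returns for an empty reference

unclamped = 1 - wer_value            # -inf: poisons the max(...) comparison
clamped = 1 - min(wer_value, 1.0)    # 0.0: worst score, but finite

print(unclamped)  # -inf
print(clamped)    # 0.0

# For a normal WER in [0, 1] the clamp is a no-op:
print(1 - min(0.25, 1.0))  # 0.75
```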

hsiehjackson and others added 2 commits January 13, 2026 15:37
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@greptile-apps greptile-apps bot left a comment

22 files reviewed, 22 comments


Comment on lines +39 to +40
"--max_seq_length", length,
"--num_samples", dataset_size,
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime

Suggested change
"--max_seq_length", length,
"--num_samples", dataset_size,
"--max_seq_length", str(length),
"--num_samples", str(dataset_size),

Comment on lines +60 to +61
"--max_seq_length", length,
"--num_samples", dataset_size,
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime

Suggested change
"--max_seq_length", length,
"--num_samples", dataset_size,
"--max_seq_length", str(length),
"--num_samples", str(dataset_size),

Comment on lines +80 to +81
"--max_seq_length", length,
"--num_samples", dataset_size,
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime

Suggested change
"--max_seq_length", length,
"--num_samples", dataset_size,
"--max_seq_length", str(length),
"--num_samples", str(dataset_size),

Comment on lines +100 to +101
"--max_seq_length", length,
"--num_samples", dataset_size,
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime

Suggested change
"--max_seq_length", length,
"--num_samples", dataset_size,
"--max_seq_length", str(length),
"--num_samples", str(dataset_size),

Comment on lines +120 to +121
"--max_seq_length", length,
"--num_samples", dataset_size,
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime

Suggested change
"--max_seq_length", length,
"--num_samples", dataset_size,
"--max_seq_length", str(length),
"--num_samples", str(dataset_size),

Comment on lines +392 to +396
+ except AssertionError:
+ if used_qs > incremental:
+ used_qs -= incremental
+ else:
+ raise
the exception handler only catches AssertionError but generate_input_output can raise other exceptions like ValueError (from line 262 when fewshot > curr_context) - these uncaught exceptions will crash the sample generation loop



def main():
output_file = Path(args.output_folder) / "test.jsonl"
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing

Suggested change
output_file = Path(args.output_folder) / "test.jsonl"
output_file = Path(args.output_folder) / "test.jsonl"
output_file.parent.mkdir(parents=True, exist_ok=True)
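The one-line `mkdir` guard can be verified in a throwaway directory (standalone sketch; the nested path names are hypothetical):

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    output_file = Path(tmp) / "ruler2" / "mk_niah_basic" / "test.jsonl"

    # Without this, open() below raises FileNotFoundError for the missing dirs
    output_file.parent.mkdir(parents=True, exist_ok=True)

    with open(output_file, "w") as f:
        f.write('{"index": 0}\n')

    print(output_file.exists())  # True
```

`exist_ok=True` also makes the call safe to repeat, so re-running a prepare script never fails on an already-created folder.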



def main():
output_file = Path(args.output_folder) / "test.jsonl"
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing

Suggested change
output_file = Path(args.output_folder) / "test.jsonl"
output_file = Path(args.output_folder) / "test.jsonl"
output_file.parent.mkdir(parents=True, exist_ok=True)



def main():
output_file = Path(args.output_folder) / "test.jsonl"
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing

Suggested change
output_file = Path(args.output_folder) / "test.jsonl"
output_file = Path(args.output_folder) / "test.jsonl"
output_file.parent.mkdir(parents=True, exist_ok=True)

curr_needle["distractor"] = [{**c, "random_index": generate_random_number()} for c in curr_needle["distractor"]]

if args.fewshot > 0:
fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot)
the random.sample call will raise ValueError if args.fewshot > len(needle) - 1 (since we exclude the current index) - this should be validated or handled

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@greptile-apps greptile-apps bot left a comment

12 files reviewed, 12 comments


Comment on lines +43 to +44
"--max_seq_length",
length,
the integer arguments length and dataset_size are passed directly to subprocess.run, which requires all arguments to be strings. This will cause a TypeError at runtime

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +80 to +81
"--max_seq_length",
length,
the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +114 to +115
tokenizer_path,
"--max_seq_length",
the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
tokenizer_path,
"--max_seq_length",
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +150 to +151
"--max_seq_length",
length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +185 to +186
"--max_seq_length",
length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +292 to +293
"--max_seq_length",
length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +327 to +328
"--max_seq_length",
length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +361 to +362
length,
"--num_samples",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
length,
"--num_samples",
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +393 to +394
"--max_seq_length",
length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Comment on lines +426 to +427
"--max_seq_length",
length,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the integer arguments length and dataset_size must be converted to strings for subprocess.run

Suggested change
"--max_seq_length",
length,
str(length),
"--num_samples",
str(dataset_size),

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@greptile-apps greptile-apps bot left a comment

9 files reviewed, 9 comments


]
random.shuffle(sentences)

indexes = sorted(random.sample(range(num_haystack), len(needles)), reverse=True)
This line will raise a ValueError when len(needles) > num_haystack because random.sample() cannot sample more elements than the population size.

This can occur when args.num_needle_k * args.num_needle_v is large but num_haystack is small (e.g., during binary search or when max_seq_length is very restrictive).

Suggested change
-indexes = sorted(random.sample(range(num_haystack), len(needles)), reverse=True)
+indexes = sorted(random.sample(range(num_haystack), min(len(needles), num_haystack)), reverse=True)
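To illustrate the failure and the clamped fix (the values here are hypothetical, not from the PR): random.sample raises ValueError whenever k exceeds the population size, so clamping with min() keeps the call valid, at the cost of silently placing fewer needles than requested.

```python
import random

random.seed(0)
num_haystack = 3                      # hypothetical: very small haystack
needles = ["a", "b", "c", "d", "e"]   # more needles than haystack slots

# Unclamped: k=5 > population=3 raises ValueError.
try:
    random.sample(range(num_haystack), len(needles))
except ValueError as err:
    print(f"raises: {type(err).__name__}")

# Clamped as in the suggestion: samples at most num_haystack positions.
indexes = sorted(
    random.sample(range(num_haystack), min(len(needles), num_haystack)),
    reverse=True,
)
print(indexes)
```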

Comment on lines +33 to +36
setup = list(metrics.keys())[0].rsplit(".", 1)[0]
metrics[setup] = {}

for aggregation in metrics[f"{setup}.mk_niah_basic"]:

This code assumes metrics is non-empty and that specific keys exist without validation. If metrics is empty, line 33 raises IndexError. If expected tasks are missing, line 36 raises KeyError.

Suggested change
-setup = list(metrics.keys())[0].rsplit(".", 1)[0]
-metrics[setup] = {}
-for aggregation in metrics[f"{setup}.mk_niah_basic"]:
+if not metrics:
+    raise ValueError("metrics dictionary cannot be empty")
+setup = list(metrics.keys())[0].rsplit(".", 1)[0]
+# Validate that mk_niah_basic exists
+if f"{setup}.mk_niah_basic" not in metrics:
+    raise KeyError(f"Expected key '{setup}.mk_niah_basic' not found in metrics")
+metrics[setup] = {}
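A runnable sketch of the guarded pattern, with a hypothetical metrics dict standing in for the evaluator's real input (key names and score shape are illustrative):

```python
# Hypothetical metrics dict; real keys come from the RULER2 evaluator.
metrics = {
    "ruler2.mk_niah_basic": {"pass@1": {"score": 0.9}},
    "ruler2.qa": {"pass@1": {"score": 0.8}},
}

if not metrics:
    raise ValueError("metrics dictionary cannot be empty")

# rsplit(".", 1)[0] strips the task suffix to recover the shared setup prefix.
setup = list(metrics.keys())[0].rsplit(".", 1)[0]

if f"{setup}.mk_niah_basic" not in metrics:
    raise KeyError(f"Expected key '{setup}.mk_niah_basic' not found in metrics")

metrics[setup] = {}
print(setup)
```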

output_folder = Path(__file__).parent / setup

# 1. installing necessary packages
subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)

The dependencies wonderwords, html2text, and tenacity are installed at runtime via pip, but they are not listed in requirements/main.txt. This creates an inconsistency where:

  1. The dependencies won't be installed during normal package installation
  2. Runtime pip installation may fail in restricted environments
  3. It makes dependency management unclear

These dependencies should be added to requirements/main.txt. Note that tenacity is used in tokenizer.py but not declared anywhere in requirements.

q["random_index"] = random_numbers[i]

random.shuffle(curr_context)
examples = random.sample(curr_context, args.fewshot)

This line will raise a ValueError if args.fewshot > len(curr_context). While unlikely in typical usage, it could occur with specific parameter combinations or edge cases during binary search when num_qs is very small.

Suggested change
-examples = random.sample(curr_context, args.fewshot)
+examples = random.sample(curr_context, min(args.fewshot, len(curr_context)))

curr_needle["distractor"] = [{**c, "random_index": generate_random_number()} for c in curr_needle["distractor"]]

if args.fewshot > 0:
fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot)

This line will raise a ValueError if args.fewshot >= len(needle). The code samples from indices excluding the current index, so if the dataset only has a few samples and fewshot is configured too high, this will fail.

Suggested change
-fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot)
+available_samples = [i for i in range(len(needle)) if i != index]
+if args.fewshot > len(available_samples):
+    raise ValueError(f"fewshot ({args.fewshot}) cannot exceed available samples ({len(available_samples)})")
+fewshot_examples = random.sample(available_samples, args.fewshot)

def __init__(self, model_path="gemini-1.5-pro-latest") -> None:
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

Accessing os.environ["GEMINI_API_KEY"] will raise a KeyError if the environment variable is not set. This should be handled gracefully with a clear error message.

Consider checking if the environment variable exists first and raising a descriptive error if it's missing.
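One way to do that (a sketch; the require_env helper name is made up for illustration, not part of the PR):

```python
import os

def require_env(name: str) -> str:
    # Fail with an actionable message instead of a bare KeyError.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            f"export it before constructing the tokenizer."
        )
    return value

os.environ["GEMINI_API_KEY"] = "dummy-key-for-demo"  # demo only
print(require_env("GEMINI_API_KEY"))
```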

# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License

The copyright notice is missing a period at the end. It should end with "limitations under the License."

Suggested change
-# limitations under the License
+# limitations under the License.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License

The copyright notice is missing a period at the end. It should end with "limitations under the License."

Suggested change
-# limitations under the License
+# limitations under the License.


# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License

The copyright notice is missing a period at the end. It should end with "limitations under the License."

Suggested change
-# limitations under the License
+# limitations under the License.



@Kipok Kipok left a comment


Please add this to the documentation (ideally with an example command / expected results for any model).

If it takes a long time to prepare data (or the data is very large), please also add this benchmark to https://github.com/NVIDIA-NeMo/Skills/blob/main/tests/gpu-tests/test_eval.py#L30, so that we don't automatically run it in CI.

hsiehjackson and others added 2 commits January 14, 2026 12:48
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

@greptile-apps greptile-apps bot left a comment


13 files reviewed, 2 comments


output_folder = Path(__file__).parent / setup

# 1. installing necessary packages
# subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)

logic: the dependencies wonderwords, html2text, and tenacity are commented out but actually required by the code (prepare_niah.py imports wonderwords, tokenizer.py imports tenacity). These should either be uncommented or added to requirements/main.txt

Suggested change
-# subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)
+subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)

Comment on lines +92 to +93
def tokens_to_text(self, tokens: List[int]) -> str:
pass

logic: tokens_to_text is not implemented and returns None. Since GeminiTokenizer's text_to_tokens returns fake sequential integers rather than actual tokens, decoding is impossible. If this functionality isn't needed for Gemini, raise NotImplementedError to make the limitation explicit

Suggested change
 def tokens_to_text(self, tokens: List[int]) -> str:
-    pass
+    raise NotImplementedError("GeminiTokenizer does not support decoding tokens back to text")
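A self-contained sketch of why failing loudly beats returning None; the whitespace-based text_to_tokens below is a stand-in for the real API-backed implementation, not the PR's code:

```python
from typing import List

class GeminiTokenizerSketch:
    """Stand-in: the real class asks the Gemini API only for token counts."""

    def text_to_tokens(self, text: str) -> List[int]:
        # Fake sequential ids -- enough for length accounting, useless for decoding.
        return list(range(len(text.split())))

    def tokens_to_text(self, tokens: List[int]) -> str:
        # Decoding fake ids is impossible; make the limitation explicit.
        raise NotImplementedError(
            "GeminiTokenizer does not support decoding tokens back to text"
        )

tok = GeminiTokenizerSketch()
print(len(tok.text_to_tokens("needle in a haystack")))  # 4 fake token ids
try:
    tok.tokens_to_text([0, 1])
except NotImplementedError as err:
    print(err)
```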

@hsiehjackson hsiehjackson requested a review from Kipok January 14, 2026 05:03
Kipok and others added 2 commits January 14, 2026 15:21
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>

@greptile-apps greptile-apps bot left a comment


Additional Comments (1)

  1. requirements/main.txt, line 54 (link)

    logic: missing required dependencies for RULER2: wonderwords, html2text, and tenacity are imported in prepare_niah.py and tokenizer.py but not listed here

13 files reviewed, 1 comment


Kipok added 3 commits January 16, 2026 14:55
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok merged commit d77caab into main Jan 18, 2026
6 checks passed
@Kipok Kipok deleted the chsieh/ruler2 branch January 18, 2026 17:45
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
Signed-off-by: i-vainn <imoshkov@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Stephen Ge <stepheng@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
Signed-off-by: Wei Du <wedu@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com>
Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>
Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Dan Lord <blahblahasdf@gmail.com>
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Co-authored-by: Wasi Ahmad <wasiahmad@ucla.edu>
Co-authored-by: Ivan <imoshkov@nvidia.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
Co-authored-by: Stephen Ge <stepheng@nvidia.com>
Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
Co-authored-by: Wei Du <wedu@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Co-authored-by: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com>
Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: Dan Lord <blahblahasdf@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: dgitman <dgitman@nvidia.com>