Conversation
📝 Walkthrough

This pull request introduces a comprehensive RULER2 benchmark dataset preparation and evaluation pipeline. The changes add a new dataset module with multiple data preparation scripts (NIAH, MMLU, QA), tokenizer abstractions supporting multiple frameworks (NeMo, HuggingFace, OpenAI, Gemini), an orchestration script that handles dataset generation in parallel, evaluation infrastructure for RULER2 benchmarks, and scoring logic. The pipeline is integrated into the evaluation and metrics systems.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as Command Line
    participant Orchestrator as prepare_dataset<br/>(ruler2/prepare.py)
    participant PackageMgr as Pip
    participant TaskSetup as prepare_task_for_ns
    participant Executor as ThreadPoolExecutor
    participant TaskFunc as Task Functions<br/>(prepare_*_*)
    participant Subprocess as Subprocess Runner
    participant Tokenizer as Tokenizer
    participant FileSystem as File System
    participant EvalInit as Top-level __init__.py
    CLI->>Orchestrator: prepare_dataset(tasks, setup, ...)
    Orchestrator->>PackageMgr: pip install (wonderwords,<br/>html2text, tenacity)
    PackageMgr-->>Orchestrator: ✓
    loop For each task
        Orchestrator->>TaskSetup: prepare_task_for_ns(folder, task)
        TaskSetup->>FileSystem: Write per-task __init__.py<br/>(metrics, eval_args)
        FileSystem-->>TaskSetup: ✓
        TaskSetup-->>Orchestrator: ✓
    end
    Orchestrator->>Executor: submit() each task function
    par Task Execution (Parallel)
        TaskFunc->>Subprocess: Run data prep script<br/>with args (tokenizer,<br/>max_seq_length, etc.)
        Subprocess->>Tokenizer: select_tokenizer(type, path)
        Tokenizer-->>Subprocess: tokenizer instance
        Subprocess->>Tokenizer: text_to_tokens() for<br/>length estimation & binary search
        Tokenizer-->>Subprocess: token count
        Subprocess->>FileSystem: Write test.jsonl<br/>(samples)
        FileSystem-->>Subprocess: ✓
        Subprocess-->>TaskFunc: exit code
    end
    Executor-->>Orchestrator: results (surface errors)
    Orchestrator->>EvalInit: Write top-level __init__.py<br/>(score_module, BENCHMARKS)
    EvalInit->>FileSystem: Persist metadata
    FileSystem-->>EvalInit: ✓
    EvalInit-->>Orchestrator: ✓
    Orchestrator-->>CLI: Print completion message
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ 1 warning, ✅ 2 passed.
Actionable comments posted: 15
🧹 Nitpick comments (4)
nemo_skills/dataset/ruler2/prepare_niah.py (1)
40-65: Avoid import-time side effects (argparse + nltk.download).
Having `parser.parse_args()` and `nltk.download()` at module import time makes `python -m ...` fine, but breaks any library-style import and can hang CI/offline runs. Move downloads and arg parsing into `main()` (or guard with `if __name__ == "__main__":`).

Also applies to: 29-35
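A minimal sketch of that structure (argument names are hypothetical, and the `nltk.download` call is shown only as a comment since the real script's download list isn't reproduced here):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--num_samples", type=int, default=10)
    return parser


def main(argv=None) -> None:
    # Parsing (and any nltk.download calls) happen only here, so importing
    # the module has no side effects and cannot hang CI or offline runs.
    args = build_parser().parse_args(argv)
    # nltk.download("punkt")  # hypothetical: fetch data lazily, not at import time
    print(f"preparing {args.num_samples} samples")


if __name__ == "__main__":
    main()
```

With this layout, `python -m ...` still works unchanged, while `import prepare_niah` from a test or library stays side-effect free.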
nemo_skills/dataset/ruler2/prepare_mmlu.py (1)
59-120: Prompt branching looks consistent; consider consolidating string-building for maintainability.
A lot of `PROBLEM_PROMPT` variants differ by small deltas; a dict/table-driven approach would reduce branching.

nemo_skills/dataset/ruler2/prepare_qa.py (1)
116-130: Remove no-op `f` prefixes; add `strict=` to `zip()` for safer assumptions.

```diff
- data = load_dataset(f"hotpotqa/hotpot_qa", "distractor")["train"]
+ data = load_dataset("hotpotqa/hotpot_qa", "distractor")["train"]
...
- data = load_dataset(f"hotpotqa/hotpot_qa", "distractor")["validation"]
+ data = load_dataset("hotpotqa/hotpot_qa", "distractor")["validation"]

- haystack = [f"{t}\n{''.join(s)}" for d in data for t, s in zip(d['context']['title'], d['context']['sentences'])]
+ haystack = [f"{t}\n{''.join(s)}" for d in data for t, s in zip(d["context"]["title"], d["context"]["sentences"], strict=False)]
```

Also applies to: 124-130
nemo_skills/dataset/ruler2/prepare.py (1)
299-325: Consider allowing an explicit output base dir instead of writing under the source tree.
`output_folder = Path(__file__).parent / setup` is awkward for installed packages (site-packages is often read-only).
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- nemo_skills/dataset/ruler/prepare.py (1 hunks)
- nemo_skills/dataset/ruler2/__init__.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare_mmlu.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare_niah.py (1 hunks)
- nemo_skills/dataset/ruler2/prepare_qa.py (1 hunks)
- nemo_skills/dataset/ruler2/ruler2_score.py (1 hunks)
- nemo_skills/dataset/ruler2/tokenizer.py (1 hunks)
- nemo_skills/evaluation/evaluator/__init__.py (2 hunks)
- nemo_skills/evaluation/evaluator/ruler.py (2 hunks)
- nemo_skills/evaluation/metrics/map_metrics.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
nemo_skills/evaluation/evaluator/__init__.py (1)
nemo_skills/evaluation/evaluator/ruler.py (2)
- `eval_ruler` (35-84)
- `eval_ruler2` (87-176)
nemo_skills/evaluation/metrics/map_metrics.py (1)
nemo_skills/evaluation/metrics/ruler_metrics.py (1)
RulerMetrics(18-29)
nemo_skills/dataset/ruler2/prepare_mmlu.py (1)
nemo_skills/dataset/ruler2/tokenizer.py (6)
- `select_tokenizer` (26-41)
- `text_to_tokens` (52-54), (69-71), (86-88), (103-105), (122-124)
nemo_skills/dataset/ruler2/prepare.py (3)
nemo_skills/pipeline/utils/declarative.py (1)
- `run` (346-483)

nemo_skills/dataset/ruler/prepare.py (1)
- `prepare_task` (114-122)

nemo_skills/pipeline/utils/cluster.py (1)
- `submit` (463-466)
🪛 Ruff (0.14.8)
nemo_skills/evaluation/evaluator/ruler.py
108-108: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
124-126: Prefer next(...) over single element slice
Replace with next(...)
(RUF015)
125-125: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
134-136: Prefer next(...) over single element slice
Replace with next(...)
(RUF015)
135-135: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
146-146: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
nemo_skills/dataset/ruler2/prepare_mmlu.py
153-153: Loop control variable index not used within loop body
Rename unused index to _index
(B007)
228-228: Avoid specifying long messages outside the exception class
(TRY003)
240-240: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
267-267: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
273-273: Loop control variable i not used within loop body
(B007)
392-392: Do not use bare except
(E722)
nemo_skills/dataset/ruler2/ruler2_score.py
34-34: Prefer next(iter(metrics.keys())) over single element slice
Replace with next(iter(metrics.keys()))
(RUF015)
nemo_skills/dataset/ruler2/prepare_qa.py
117-117: f-string without any placeholders
Remove extraneous f prefix
(F541)
118-118: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
124-124: f-string without any placeholders
Remove extraneous f prefix
(F541)
128-128: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
129-129: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
169-169: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
317-317: Do not use bare except
(E722)
nemo_skills/dataset/ruler2/tokenizer.py
33-33: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
nemo_skills/dataset/ruler2/prepare_niah.py
88-88: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
91-91: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
97-97: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
121-121: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
131-133: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
164-164: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
219-219: Unpacked variable save_dict is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
244-244: Do not use bare except
(E722)
nemo_skills/dataset/ruler2/prepare.py
33-33: subprocess call with shell=True identified, security issue
(S602)
53-53: subprocess call with shell=True identified, security issue
(S602)
72-72: subprocess call with shell=True identified, security issue
(S602)
91-91: subprocess call with shell=True identified, security issue
(S602)
110-110: subprocess call with shell=True identified, security issue
(S602)
130-130: subprocess call with shell=True identified, security issue
(S602)
149-149: subprocess call with shell=True identified, security issue
(S602)
168-168: subprocess call with shell=True identified, security issue
(S602)
188-188: subprocess call with shell=True identified, security issue
(S602)
206-206: subprocess call with shell=True identified, security issue
(S602)
225-225: subprocess call with shell=True identified, security issue
(S602)
243-243: subprocess call with shell=True identified, security issue
(S602)
302-302: subprocess call with shell=True identified, security issue
(S602)
302-302: Starting a process with a partial executable path
(S607)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pre-commit
- GitHub Check: unit-tests
🔇 Additional comments (4)
nemo_skills/dataset/ruler/prepare.py (1)
212-212: EOF-only change looks fine.

nemo_skills/dataset/ruler2/__init__.py (1)

1-15: `DATASET_GROUP = "long-context"` is clear and consistent with the intended grouping.

nemo_skills/evaluation/evaluator/__init__.py (1)

41-53: Evaluator registration for `ruler2` is straightforward and consistent with existing patterns.

nemo_skills/evaluation/metrics/map_metrics.py (1)

58-60: `"ruler2": RulerMetrics` alias is correctly implemented. Both `eval_ruler` and `eval_ruler2` emit `prediction["is_correct"]` in identical format (float values from match_type_funcs), which `RulerMetrics` handles properly in its `_get_score_dict` method.
```python
# Define Needle/Haystack Format
needle = "One of the special magic {type_needle_v} for {key} is: {value}."
if args.type_haystack == 'needle':
    haystack = needle
else:
    raise NotImplementedError(f'{args.type_haystack} is not implemented.')
```
Critical: `--type_haystack` handling is internally inconsistent and currently broken for essay/noise.

- Lines 71-75: only `'needle'` is allowed, but the argparse help advertises `noise, essay, needle`.
- In `generate_input_output()` the essay/noise/needle branches exist, but `haystack` is a string in the only supported mode (`haystack = needle`), so essay mode would iterate characters (`" ".join(haystack[:num_haystack])`).

Either (a) implement essay/noise properly (with a list of sentences/docs), or (b) remove those modes from the help, delete the dead branches, and enforce `choices=["needle"]`.
```diff
-parser.add_argument("--type_haystack", type=str, default='essay', help='[Options] noise, essay, needle.')
+parser.add_argument("--type_haystack", type=str, default="needle", choices=["needle"])
```

```diff
-needle = "One of the special magic {type_needle_v} for {key} is: {value}."
-if args.type_haystack == 'needle':
-    haystack = needle
-else:
-    raise NotImplementedError(f'{args.type_haystack} is not implemented.')
+needle_sentence = "One of the special magic {type_needle_v} for {key} is: {value}."
```

```diff
-        needles.append(needle.format(
+        needles.append(needle_sentence.format(
             type_needle_v=args.type_needle_v,
             key=keys[-1],
             value=value[-1],
         ))
```

```diff
-    if args.type_haystack == 'essay':
-        ...
-    else:
-        if args.type_haystack == 'noise':
-            ...
-        elif args.type_haystack == 'needle':
+    if args.type_haystack == "needle":
             ...
-    context = "\n".join(sentences)
+    context = "\n".join(sentences)
+    else:
+        raise NotImplementedError(f"{args.type_haystack} is not implemented.")
```

Also applies to: 123-167
🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/prepare_niah.py around lines 69 to 75 (and also
review lines 123-167), the --type_haystack handling is inconsistent: only
'needle' is currently supported but argparse documents 'noise, essay, needle',
and when other modes are used haystack is a string so downstream code will
iterate characters. Fix by either (A) implementing proper haystack as a list for
'essay' and 'noise' modes (produce a list of full sentences/documents and ensure
generate_input_output() uses list semantics, e.g., build haystack = [sent1,
sent2, ...] and adjust joins/sampling accordingly), or (B) remove 'essay' and
'noise' from argparse help and enforce choices=['needle'] and delete dead
branches in generate_input_output(); pick one approach and make corresponding
unit/test updates and update help text/comments.
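To make the character-iteration failure concrete, here is a small standalone demo (the strings are hypothetical, not the PR's data):

```python
# When the haystack is a single string, slicing selects CHARACTERS:
haystack_str = "Sentence one. Sentence two."
picked = " ".join(haystack_str[:5])        # joins the first 5 characters
print(picked)                              # S e n t e

# With a list of sentences, the same slice behaves as intended:
haystack_list = ["Sentence one.", "Sentence two.", "Sentence three."]
picked_list = " ".join(haystack_list[:2])
print(picked_list)                         # Sentence one. Sentence two.
```

This is why option (a) in the comment above requires switching `haystack` to a list of sentences/documents before the essay/noise modes can work.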
```python
def compute_score(metrics: dict):
    # just an average of all metrics. Here we assume that all tasks are present.
    # if that's not the case, users shouldn't run as a group
    tasks = [
        "mk_niah_basic",
        "mk_niah_easy",
        "mk_niah_medium",
        "mk_niah_hard",
        "mv_niah_basic",
        "mv_niah_easy",
        "mv_niah_medium",
        "mv_niah_hard",
        "qa_basic",
        "qa_easy",
        "qa_medium",
        "qa_hard",
    ]
    setup = list(metrics.keys())[0].rsplit(".", 1)[0]
    metrics[setup] = {}

    for aggregation in metrics[f"{setup}.mk_niah_basic"]:
        metrics[setup][aggregation] = {
            "accuracy": sum(metrics[f"{setup}.{task}"][aggregation].get("accuracy", (metrics[f"{setup}.{task}"][aggregation].get("symbolic_correct", 0))) for task in tasks) / len(tasks)
        }

    return metrics
```
Make `compute_score()` resilient to key order, collisions, and partial runs. Right now it (1) derives `setup` from the "first" key, (2) writes into `metrics[setup]` (which can collide with an existing key), and (3) will throw `KeyError` if any task/aggregation is missing.
Suggested patch (keeps current behavior but fails fast with clearer errors + removes ordering footgun):
```diff
 def compute_score(metrics: dict):
 @@
-    setup = list(metrics.keys())[0].rsplit(".", 1)[0]
-    metrics[setup] = {}
-
-    for aggregation in metrics[f"{setup}.mk_niah_basic"]:
-        metrics[setup][aggregation] = {
-            "accuracy": sum(metrics[f"{setup}.{task}"][aggregation].get("accuracy", (metrics[f"{setup}.{task}"][aggregation].get("symbolic_correct", 0))) for task in tasks) / len(tasks)
-        }
+    first_key = next(iter(metrics))
+    setup = first_key.rsplit(".", 1)[0]
+
+    agg_key = f"{setup}.mk_niah_basic"
+    if agg_key not in metrics:
+        raise KeyError(f"Missing aggregation source key: {agg_key}")
+
+    out_key = setup
+    if out_key in metrics and not isinstance(metrics[out_key], dict):
+        raise TypeError(f"Refusing to overwrite non-dict metrics[{out_key!r}]")
+    metrics.setdefault(out_key, {})
+
+    for aggregation in metrics[agg_key]:
+        total = 0.0
+        for task in tasks:
+            task_key = f"{setup}.{task}"
+            if task_key not in metrics or aggregation not in metrics[task_key]:
+                raise KeyError(f"Missing metrics for {task_key}.{aggregation}")
+            cell = metrics[task_key][aggregation]
+            total += cell.get("accuracy", cell.get("symbolic_correct", 0))
+        metrics[out_key][aggregation] = {"accuracy": total / len(tasks)}
     return metrics
```

🧰 Tools
🪛 Ruff (0.14.8)
34-34: Prefer next(iter(metrics.keys())) over single element slice
Replace with next(iter(metrics.keys()))
(RUF015)
🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/ruler2_score.py around lines 16–42, the function
derives setup from the first dict key, blindly writes into metrics[setup]
(risking collision), and assumes every task/aggregation exists which causes
KeyError; fix by (1) determine the unique setup by scanning all metric keys and
validating exactly one setup is present (split on last '.'), (2) refuse to
overwrite an existing metrics key by raising a clear ValueError if metrics
already contains the chosen summary key, (3) enumerate available aggregations
for the setup by inspecting one task entry or collecting aggregation names
across tasks, and (4) when computing per-aggregation averages either (a) fail
fast with clear errors if expected tasks/aggregations are missing or (b) compute
averages only over tasks present and document behavior; ensure error messages
include the missing task/aggregation and the derived setup so callers can debug.
```python
class HFTokenizer:
    """
    Tokenizer from HF models
    """
    def __init__(self, model_path) -> None:
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    def text_to_tokens(self, text: str) -> List[str]:
        tokens = self.tokenizer.tokenize(text)
        return tokens

    def tokens_to_text(self, tokens: List[int]) -> str:
        text = self.tokenizer.convert_tokens_to_string(tokens)
        return text
```
🧩 Analysis chain
🌐 Web query:
HuggingFace Transformers trust_remote_code=True security RCE implications recommended defaults
💡 Result:
Short answer
- trust_remote_code=True lets Transformers execute arbitrary Python code from a model repo (modeling/config/loading code). That can lead to arbitrary code execution / RCE if the remote repo is malicious or compromised — it is effectively running untrusted code in your process. [1][2][3].
- Recommended default: keep trust_remote_code=False unless you explicitly trust the model repository, and follow strict mitigations when you must enable it (pin revision, vet files, run in isolated environment, least privilege). [1]
Why it’s risky (brief)
- Enabling it loads and runs Python modules from the model repo (not just weights), so malicious code in those files runs with your process privileges. This class of risk has led to multiple Transformers/supply‑chain vulnerabilities and CVEs (deserialization and config/model‑code RCEs). [2][3][4]
Practical recommendations (minimal, high‑value)
- Default to trust_remote_code=False. Only set True for repos you’ve audited or that are from a fully trusted party. [1]
- Pin a commit/revision (don’t allow mutable “latest” branches) when you must trust remote code. [1]
- Inspect the repo’s Python files before enabling (modeling_*.py, config, init, setup). Look for imports, subprocess/os usage, deserialization (pickle), or dynamic eval/exec. [1][2]
- Run any code that requires trust_remote_code=True in an isolated sandbox: container, VM, or dedicated build host with network/filesystem restrictions and unprivileged user. Prefer ephemeral environments.
- Minimize privileges: avoid running as root; restrict network and filesystem access; use seccomp/AppArmor/SELinux or container runtime restrictions when possible.
- Prefer safe weight formats (safetensors) and avoid loading untrusted pickled checkpoints. Use use_safetensors where supported. [1][2]
- Keep transformers and related packages up to date; subscribe to security advisories and patch quickly. [2][3][4]
- For teams: enforce review policy for model imports, provide a curated internal model registry, and consider automated static checks on remote model code before allowing trust.
References
- Transformers security notes / documentation (trust_remote_code guidance). [1]
- CVE examples: deserialization / RCE issues in Transformers (CVE-2024-3568, CVE-2024-11392). [2][3]
- Supply‑chain / sandbox escape example (smolagents sandbox escape). [4]
If you want, I can: 1) produce a short checklist/script to vet a model repo’s Python files, or 2) give an example Docker seccomp profile and minimal container command to run a trusted remote model safely.
🏁 Script executed:
```shell
# First, understand the file structure and context
git ls-files | grep -E "(tokenizer|ruler2)" | head -20
```
Repository: NVIDIA-NeMo/Skills
Length of output: 348

🏁 Script executed:
```shell
# Check how HFTokenizer is instantiated - is model_path user-controlled?
rg "HFTokenizer" -A 3 -B 3
```
Repository: NVIDIA-NeMo/Skills
Length of output: 1023

🏁 Script executed:
```shell
# Look for other tokenizer implementations for comparison
fd "tokenizer" -t f | head -20
```
Repository: NVIDIA-NeMo/Skills
Length of output: 101

🏁 Script executed:
```shell
# Check the factory function and how tokenizer_type is determined
rg "get_tokenizer|tokenizer_type" -A 5 -B 2 | head -50
```
Repository: NVIDIA-NeMo/Skills
Length of output: 4251

🏁 Script executed:
```shell
# Look at the prepare scripts to see what model_path values are used
rg "model_path|tokenizer" nemo_skills/dataset/ruler2/prepare.py -A 2 -B 2 | head -100
```
Repository: NVIDIA-NeMo/Skills
Length of output: 3675

🏁 Script executed:
```shell
# Check if there's any CLI argument or environment variable that controls trust_remote_code
rg "trust_remote_code|argparse|cli"
```
Repository: NVIDIA-NeMo/Skills
Length of output: 50375
trust_remote_code=True creates an avoidable RCE risk; change the default to False.
HuggingFace's own security guidance recommends keeping trust_remote_code=False as the default and enabling it only for explicitly trusted model repositories. Enabling remote code execution allows arbitrary Python code from the model repository to run during tokenizer initialization—a known attack vector with multiple CVEs.
Since model_path is user-controlled (passed via CLI argument), users can inadvertently point to malicious or compromised model repositories without explicit intent to do so.
```diff
- self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+ self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)
```

🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/tokenizer.py around lines 78 to 93, the tokenizer
is initialized with trust_remote_code=True which introduces an unnecessary RCE
risk when model_path is user-controlled; change the default to
trust_remote_code=False and only enable True when an explicit, documented flag
or parameter (e.g., trust_remote_code or trust_remote_code_env) is passed by the
caller, update the __init__ signature to accept that flag (default False), and
ensure any code paths calling HFTokenizer are updated to pass True explicitly
when the model source is trusted.
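One possible shape for that opt-in flag, sketched under the assumption that callers can thread a CLI option through to the tokenizer (illustrative only, not the PR's code):

```python
from typing import List


class HFTokenizer:
    """Sketch of an opt-in trust_remote_code flag.

    The flag defaults to False per HuggingFace security guidance; callers
    must opt in explicitly for audited model repositories.
    """

    def __init__(self, model_path: str, trust_remote_code: bool = False) -> None:
        # Deferred import: merely defining/importing this class does not
        # require transformers to be installed.
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, trust_remote_code=trust_remote_code
        )

    def text_to_tokens(self, text: str) -> List[str]:
        return self.tokenizer.tokenize(text)
```

A factory like `select_tokenizer` would then pass the flag through from an explicit `--trust_remote_code` argument rather than hard-coding `True`.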
```python
class GeminiTokenizer:
    """
    Tokenizer from gemini
    """
    def __init__(self, model_path="gemini-1.5-pro-latest") -> None:
        import google.generativeai as genai
        genai.configure(api_key=os.environ["GEMINI_API_KEY"])
        self.model = genai.GenerativeModel(model_path)

    @retry(wait=wait_fixed(60) + wait_random(0, 10), stop=stop_after_attempt(3))
    def text_to_tokens(self, text: str) -> List[int]:
        tokens = list(range(self.model.count_tokens(text).total_tokens))
        return tokens

    def tokens_to_text(self, tokens: List[int]) -> str:
        pass
```
Make `GeminiTokenizer` fail fast and avoid allocating huge lists; don't leave `pass`.

- A `KeyError` on a missing `GEMINI_API_KEY` is opaque.
- `list(range(total_tokens))` is a large allocation for long contexts; `range(total_tokens)` still supports `len()`.
- `tokens_to_text()` should raise `NotImplementedError`.
```diff
-    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
+    api_key = os.environ.get("GEMINI_API_KEY")
+    if not api_key:
+        raise RuntimeError("GEMINI_API_KEY is not set")
+    genai.configure(api_key=api_key)
 ...
-    tokens = list(range(self.model.count_tokens(text).total_tokens))
-    return tokens
+    return range(self.model.count_tokens(text).total_tokens)
 ...
     def tokens_to_text(self, tokens: List[int]) -> str:
-        pass
+        raise NotImplementedError("GeminiTokenizer.tokens_to_text is not supported")
```

🤖 Prompt for AI Agents
In nemo_skills/dataset/ruler2/tokenizer.py around lines 112-127, the initializer
and token methods need hardening and a real implementation for tokens_to_text:
check for GEMINI_API_KEY explicitly and raise a clear RuntimeError (or
ValueError) if missing instead of allowing a KeyError, avoid creating huge lists
by returning a range(total_tokens) (or another lazy/iterable) from
text_to_tokens rather than list(range(...)), and replace the tokens_to_text pass
with raising NotImplementedError to signal it's intentionally unimplemented.
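The `range` suggestion works because `range` objects are lazy sequences; a quick standalone check:

```python
# range(n) supports len() and indexing in O(1) memory, so it can stand in
# as a token-count placeholder without materializing n integers the way
# list(range(n)) would (which for 10**12 elements would need terabytes).
fake_tokens = range(10**12)
print(len(fake_tokens))   # 1000000000000
print(fake_tokens[5])     # 5
```

Callers that only ever do `len(tokens)` on the result see no behavioral difference.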
Kipok left a comment:

I haven't reviewed the underlying logic as I don't know the details of ruler2. @fayejf could you have a look? Or @hsiehjackson, if you think it's ok to merge without Fei's review, we can do that as well; let me know.

Please add the benchmark to docs, and also add an example command with some reference results for any model.
```python
def select_tokenizer(tokenizer_type, tokenizer_path):
    if tokenizer_type == 'nemo':
```
do we really need this? I think all recent nemotrons use normal hf tokenizers
Greptile Summary

This PR adds the RULERv2 benchmark for evaluating long-context language models. It introduces a comprehensive framework with three dataset types (NIAH, MMLU, QA) across four difficulty levels each (basic, easy, medium, hard), totaling 12 benchmark tasks. The implementation includes:
Key Changes:
Issues Addressed in Previous Review Threads

Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PrepareScript as prepare.py
    participant Tokenizer as tokenizer.py
    participant PrepareNIAH as prepare_niah.py
    participant PrepareMMLU as prepare_mmlu.py
    participant PrepareQA as prepare_qa.py
    participant Evaluator as ruler.py
    participant Scorer as ruler2_score.py
    User->>PrepareScript: ns prepare_data ruler2 --setup --tokenizer_path --max_seq_length
    PrepareScript->>Tokenizer: select_tokenizer(type, path)
    Tokenizer-->>PrepareScript: tokenizer instance
    par Parallel Dataset Generation
        PrepareScript->>PrepareNIAH: subprocess call for mk_niah_basic/mv_niah_basic
        PrepareNIAH->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareNIAH: token list
        PrepareNIAH-->>PrepareScript: NIAH dataset generated
    and
        PrepareScript->>PrepareMMLU: subprocess call for mk_niah_easy/medium/hard
        PrepareMMLU->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareMMLU: token list
        PrepareMMLU-->>PrepareScript: MMLU dataset generated
    and
        PrepareScript->>PrepareQA: subprocess call for qa_basic/easy/medium/hard
        PrepareQA->>Tokenizer: text_to_tokens(text)
        Tokenizer-->>PrepareQA: token list
        PrepareQA-->>PrepareScript: QA dataset generated
    end
    PrepareScript-->>User: Dataset preparation completed
    User->>Evaluator: ns eval --benchmarks=ruler2.setup --model
    loop For each sample
        Evaluator->>Evaluator: default_parse(prediction)
        Evaluator->>Evaluator: string_match_*/wer scoring
        Evaluator->>Evaluator: compute is_correct
    end
    Evaluator-->>User: Evaluation results per task
    User->>Scorer: compute_score(metrics)
    Scorer->>Scorer: aggregate all 12 task scores
    Scorer-->>User: Overall RULER2 score
```
Additional Comments (1)
```python
output_folder = Path(__file__).parent / setup

# 1. installing necessary packages
subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
```
The use of `shell=True` with `subprocess.run` is dangerous because it can lead to command injection vulnerabilities. The package names in the command are hardcoded here, but `shell=True` should generally be avoided. Use a list of arguments instead:

```diff
- subprocess.run(["pip install wonderwords html2text tenacity"], check=True, shell=True)
+ subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True)
```
```python
subprocess.run(
    f"python -m nemo_skills.dataset.ruler2.prepare_niah "
    f"--output_folder {output_folder} "
    f"--tokenizer_type {tokenizer_type} "
    f"--tokenizer_path {tokenizer_path} "
    f"--max_seq_length {length} "
    f"--num_samples {dataset_size} "
    f"--random_seed 42 "
    f"--num_needle_k 1 "
    f"--num_needle_v 1 "
    f"--num_needle_q 1 "
    f"--type_haystack needle "
    f"--type_needle_k words "
    f"--type_needle_v numbers "
    f"--num_digits_v 10",
    shell=True,
    check=True,
)
```
Command injection vulnerability: all the `subprocess.run` calls in this file use `shell=True` with f-strings that include user-provided arguments like `output_folder`, `tokenizer_path`, etc. If these contain shell metacharacters (e.g., semicolons, backticks), arbitrary commands could be executed.

The proper fix is to use `shell=False` and pass arguments as a list. For example:
```python
def prepare_mk_niah_basic(output_folder, tokenizer_type, tokenizer_path, length, dataset_size):
    subprocess.run(
        [
            "python", "-m", "nemo_skills.dataset.ruler2.prepare_niah",
            "--output_folder", output_folder,
            "--tokenizer_type", tokenizer_type,
            "--tokenizer_path", tokenizer_path,
            "--max_seq_length", str(length),
            "--num_samples", str(dataset_size),
            "--random_seed", "42",
            "--num_needle_k", "1",
            "--num_needle_v", "1",
            "--num_needle_q", "1",
            "--type_haystack", "needle",
            "--type_needle_k", "words",
            "--type_needle_v", "numbers",
            "--num_digits_v", "10",
        ],
        check=True,
    )
```
This pattern should be applied to ALL subprocess.run calls in this file (all 12 prepare functions).
```diff
@@ -0,0 +1,86 @@
+# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
```

The copyright year is 2024 but should be 2025 to match the other new files in this PR:

```diff
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
```
```python
    def tokens_to_text(self, tokens: List[int]) -> str:
        pass
```

The `tokens_to_text` method in `GeminiTokenizer` is not implemented: it only contains `pass`, which means it returns `None`. If this method is called, it will cause errors downstream. Either implement this method properly or raise `NotImplementedError`:

```diff
     def tokens_to_text(self, tokens: List[int]) -> str:
-        pass
+        raise NotImplementedError("GeminiTokenizer does not support tokens_to_text conversion")
```
```python
    """
    def __init__(self, model_path) -> None:
        from transformers import AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```

Using `trust_remote_code=True` allows arbitrary code execution from the model files, which is a security risk. Only use this if you fully trust the model source and have verified its contents. Consider adding a warning or making this configurable:

```diff
-        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+        # WARNING: trust_remote_code=True allows arbitrary code execution from model files
+        # Only use with trusted model sources
+        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```
| )) | ||
| values.append(value) | ||
| random.Random(args.random_seed).shuffle(needles) |
creating a new Random instance with a fixed seed (random.Random(args.random_seed)) on every call to generate_input_output will cause the needles to be shuffled in the exact same order for every sample. This defeats the purpose of randomization and makes all samples have the same needle ordering pattern.
The fix is to use the global random that was already seeded at the module level:
| random.Random(args.random_seed).shuffle(needles) | |
| random.shuffle(needles) |
This ensures each call gets a different shuffle while still maintaining reproducibility through the global seed.
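A small self-contained sketch of the difference being described; `shuffled` and the explicit `rng` argument are illustrative (the PR uses the module-level `random`):

```python
import random

def shuffled(items, rng):
    out = list(items)
    rng.shuffle(out)
    return out

needles = list(range(10))

# A fresh Random(seed) restarts the stream, so every call
# produces the identical ordering.
a = shuffled(needles, random.Random(42))
b = shuffled(needles, random.Random(42))

# One seeded generator advances its stream between calls,
# staying reproducible across runs while varying per sample.
rng = random.Random(42)
c = shuffled(needles, rng)  # same as the first draw above
d = shuffled(needles, rng)  # next draw from the stream
```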
| }) | ||
| questions.add(d['problem']) | ||
| from math_verify import parse |
the import of math_verify.parse is inside a conditional block that only executes when args.dataset == "math500". This means:
- If the import fails, you won't know until runtime when processing math500 data
- The dependency on math_verify might not be obvious to users
Consider moving this import to the top of the file with a try/except to give a clear error message:
| from math_verify import parse | |
| try: | |
| from math_verify import parse | |
| except ImportError: | |
| parse = None # Will be checked when needed |
Then add a check where it's used:
if parse is None:
    raise ImportError("math_verify is required for the math500 dataset. Install with: pip install math-verify")
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
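A sketch of the optional-import pattern suggested above; `load_optional` and `require` are hypothetical helper names, not part of the PR:

```python
import importlib

def load_optional(name: str):
    """Import an optional dependency, returning None if missing
    so the failure can be reported clearly at the point of use."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None

def require(module, feature: str, hint: str):
    """Raise an actionable error for a missing optional dependency."""
    if module is None:
        raise ImportError(f"{feature} requires an extra package. Install with: {hint}")
    return module

# math_verify is only needed when processing the math500 dataset
math_verify = load_optional("math_verify")
```

At the use site: `parse = require(math_verify, "math500 dataset", "pip install math-verify").parse`.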
| if words != 0: | ||
| wer = 1.0 * scores / words | ||
| else: | ||
| wer = float('inf') | ||
| return wer |
when the reference text is empty (words == 0), the WER function returns float('inf'). This causes issues in the scoring functions (lines 125, 135, 146) where 1 - wer(...) would become -inf, and max(1.0, -inf) would always be 1.0, potentially giving a perfect score for empty references.
Consider handling this edge case explicitly:
| if words != 0: | |
| wer = 1.0 * scores / words | |
| else: | |
| wer = float('inf') | |
| return wer | |
| if words != 0: | |
| wer = 1.0 * scores / words | |
| else: | |
| # Empty reference - return maximum error rate instead of inf | |
| wer = 1.0 if scores > 0 else 0.0 | |
| return wer |
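For reference, a self-contained WER sketch with the empty-reference guard applied. This is a standard word-level Levenshtein distance, assumed to match the intent of the PR's `wer` but not its exact implementation:

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length.
    An empty reference yields 1.0 when the hypothesis has words
    (maximum error) and 0.0 when both are empty, avoiding inf."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 1.0 if hyp else 0.0
    # classic dynamic-programming edit distance over words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```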
| """ | ||
| def __init__(self, model_path="gemini-1.5-pro-latest") -> None: | ||
| import google.generativeai as genai | ||
| genai.configure(api_key=os.environ["GEMINI_API_KEY"]) |
accessing environment variables with bracket notation will raise a KeyError if the variable is not set, causing the program to crash with an unhelpful error message. Use .get() with proper error handling to provide a clear message to users about what's missing.
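A minimal sketch of the suggested pattern; `get_required_env` is a hypothetical helper and the error wording is illustrative:

```python
import os

def get_required_env(name: str) -> str:
    """Fetch a required environment variable with a clear error
    instead of an opaque KeyError from os.environ[name]."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            f"Export it before running, e.g. export {name}=<your-key>."
        )
    return value
```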
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
| ] | ||
| check=True, | ||
| ) |
missing comma between the list closing bracket and check=True causes a SyntaxError. The closing bracket ] on line 49 should be followed by a comma before check=True on line 50.
| ] | |
| check=True, | |
| ) | |
| ], | |
| check=True, | |
| ) |
This same syntax error appears in all 12 prepare functions (lines 50, 70, 90, 110, 131, 151, 171, 191, 211, 230, 250, 269). The code will fail to run until all instances are fixed.
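The failure mode is easy to demonstrate without running anything, since `compile()` surfaces it at parse time; the snippets below are simplified stand-ins for the PR's calls:

```python
# missing comma after the closing bracket: a parse-time SyntaxError
bad = """
subprocess.run([
    "echo", "hi"
]
    check=True,
)
"""

# with the comma restored, the source parses fine
good = """
subprocess.run([
    "echo", "hi"
],
    check=True,
)
"""

def parses(src: str) -> bool:
    # compile() only parses, so the undefined `subprocess` name is irrelevant
    try:
        compile(src, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False
```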
| def text_to_tokens(self, text: str) -> List[str]: | ||
| tokens = self.tokenizer.tokenize(text) | ||
| return tokens | ||
| def tokens_to_text(self, tokens: List[int]) -> str: | ||
| text = self.tokenizer.convert_tokens_to_string(tokens) | ||
| return text |
type annotation mismatch in HFTokenizer: text_to_tokens returns List[str] (line 45) but tokens_to_text expects List[int] as input (line 49). This is inconsistent with the OpenAITokenizer which correctly uses List[int] for both methods.
The HFTokenizer should be consistent - either both methods should use List[str] (if working with string tokens) or both should use List[int] (if working with token IDs). Based on the implementation using tokenizer.tokenize() which returns string tokens, the fix should be:
| def text_to_tokens(self, text: str) -> List[str]: | |
| tokens = self.tokenizer.tokenize(text) | |
| return tokens | |
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| text = self.tokenizer.convert_tokens_to_string(tokens) | |
| return text | |
| def tokens_to_text(self, tokens: List[str]) -> str: | |
| text = self.tokenizer.convert_tokens_to_string(tokens) | |
| return text |
| def tokens_to_text(self, tokens: List[int]) -> str: | ||
| pass |
the tokens_to_text method is not implemented and will return None. This will cause errors if this method is ever called. Since GeminiTokenizer's text_to_tokens returns fake sequential numbers rather than actual tokens, there's no way to reverse this operation. Either implement proper token decoding or raise NotImplementedError to make it explicit:
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| pass | |
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| raise NotImplementedError("GeminiTokenizer does not support converting tokens back to text") |
| def text_to_tokens(self, text: str) -> List[int]: | ||
| tokens = list(range(self.model.count_tokens(text).total_tokens)) | ||
| return tokens |
GeminiTokenizer returns fake tokens (sequential numbers from 0 to token_count) instead of actual token IDs. The line list(range(self.model.count_tokens(text).total_tokens)) creates [0, 1, 2, ..., n-1] which are not real tokens. This breaks the tokenizer abstraction since these fake tokens cannot be decoded back to text and don't represent the actual tokenization.
If the Gemini API doesn't provide access to actual tokens, this should be documented clearly or the tokenizer should raise an error for operations that require real tokens.
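One way to make the limitation explicit, sketched as a generic wrapper. `CountOnlyTokenizer` and `count_fn` are illustrative names; the real class would wrap the Gemini token-counting call:

```python
from typing import Callable, List

class CountOnlyTokenizer:
    """For APIs that expose only a token count. text_to_tokens
    returns placeholder ids usable solely for length arithmetic;
    decoding is impossible, so tokens_to_text fails loudly
    instead of silently returning None."""

    def __init__(self, count_fn: Callable[[str], int]) -> None:
        self._count = count_fn  # e.g. a wrapper around the API's count call

    def text_to_tokens(self, text: str) -> List[int]:
        # placeholder ids 0..n-1, NOT real token ids
        return list(range(self._count(text)))

    def tokens_to_text(self, tokens: List[int]) -> str:
        raise NotImplementedError(
            "count-only tokenizer cannot decode tokens back to text"
        )
```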
| if words != 0: | ||
| wer = 1.0 * scores / words | ||
| else: | ||
| wer = float('inf') | ||
| return wer | ||
| def string_match_all_single(preds, refs): | ||
| """the metric function with input (predictions: [str], references: [[str]]) to compute score.""" | ||
| preds = post_process_preds(preds) | ||
| preds = [preds] | ||
| refs = [refs] | ||
| score = [ | ||
| sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - wer([pred.lower()], [r.lower()])) for r in ref]) / len(ref) for pred, ref in zip(preds, refs) | ||
| ][0] | ||
| return score |
when WER returns float('inf') (for empty references on line 116), the score calculation 1 - wer([pred.lower()], [r.lower()]) on line 125 becomes 1 - inf = -inf. The max(1.0 if ..., -inf) will always choose the first value (0.0 or 1.0), but this behavior may not be intended.
Consider handling the infinity case explicitly:
| if words != 0: | |
| wer = 1.0 * scores / words | |
| else: | |
| wer = float('inf') | |
| return wer | |
| def string_match_all_single(preds, refs): | |
| """the metric function with input (predictions: [str], references: [[str]]) to compute score.""" | |
| preds = post_process_preds(preds) | |
| preds = [preds] | |
| refs = [refs] | |
| score = [ | |
| sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - wer([pred.lower()], [r.lower()])) for r in ref]) / len(ref) for pred, ref in zip(preds, refs) | |
| ][0] | |
| return score | |
| def string_match_all_single(preds, refs): | |
| """the metric function with input (predictions: [str], references: [[str]]) to compute score.""" | |
| preds = post_process_preds(preds) | |
| preds = [preds] | |
| refs = [refs] | |
| score = [ | |
| sum([max(1.0 if r.lower() in pred.lower() else 0.0, 1 - min(wer([pred.lower()], [r.lower()]), 1.0)) for r in ref]) / len(ref) for pred, ref in zip(preds, refs) | |
| ][0] | |
| return score |
This clamps WER to [0, 1] before computing 1 - wer, preventing negative infinity. The same issue exists in string_match_2steps_single (line 135) and string_match_part_single (line 146).
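The clamping logic, pulled out into a standalone sketch for clarity; `match_score` and the injected `wer_fn` are illustrative (the PR inlines this in a list comprehension):

```python
from typing import Callable, List

def match_score(pred: str, refs: List[str],
                wer_fn: Callable[[str, str], float]) -> float:
    """Exact containment wins outright; otherwise WER is clamped
    to [0, 1] so 1 - wer never goes negative (or to -inf when the
    reference is empty and wer returns float('inf'))."""
    total = 0.0
    for r in refs:
        exact = 1.0 if r.lower() in pred.lower() else 0.0
        fuzzy = 1.0 - min(wer_fn(pred.lower(), r.lower()), 1.0)
        total += max(exact, fuzzy)
    return total / len(refs)
```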
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| "--max_seq_length", length, | ||
| "--num_samples", dataset_size, |
the parameters length and dataset_size are integers but subprocess.run requires all command arguments to be strings - this will cause a TypeError at runtime
| "--max_seq_length", length, | |
| "--num_samples", dataset_size, | |
| "--max_seq_length", str(length), | |
| "--num_samples", str(dataset_size), |
| + except AssertionError: | ||
| + if used_qs > incremental: | ||
| + used_qs -= incremental | ||
| + else: | ||
| + raise |
the exception handler only catches AssertionError but generate_input_output can raise other exceptions like ValueError (from line 262 when fewshot > curr_context) - these uncaught exceptions will crash the sample generation loop
| def main(): | ||
| output_file = Path(args.output_folder) / "test.jsonl" |
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file.parent.mkdir(parents=True, exist_ok=True) |
| def main(): | ||
| output_file = Path(args.output_folder) / "test.jsonl" |
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file.parent.mkdir(parents=True, exist_ok=True) |
| def main(): | ||
| output_file = Path(args.output_folder) / "test.jsonl" |
the output directory is not created before writing - if args.output_folder doesn't exist, this will raise FileNotFoundError when opening the file for writing
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file = Path(args.output_folder) / "test.jsonl" | |
| output_file.parent.mkdir(parents=True, exist_ok=True) |
| curr_needle["distractor"] = [{**c, "random_index": generate_random_number()} for c in curr_needle["distractor"]] | ||
| if args.fewshot > 0: | ||
| fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot) |
the random.sample call will raise ValueError if args.fewshot > len(needle) - 1 (since we exclude the current index) - this should be validated or handled
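The validation can be sketched as follows; `pick_fewshot` is a hypothetical helper mirroring the fewshot sampling above:

```python
import random

def pick_fewshot(pool_size: int, exclude: int, fewshot: int) -> list:
    """random.sample raises ValueError when k exceeds the
    population size, so validate first with a clearer message."""
    candidates = [i for i in range(pool_size) if i != exclude]
    if fewshot > len(candidates):
        raise ValueError(
            f"fewshot ({fewshot}) exceeds available samples ({len(candidates)})"
        )
    return random.sample(candidates, fewshot)
```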
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size are passed directly to subprocess.run, which requires all arguments to be strings. This will cause a TypeError at runtime
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| tokenizer_path, | ||
| "--max_seq_length", |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| tokenizer_path, | |
| "--max_seq_length", | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| length, | ||
| "--num_samples", |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| length, | |
| "--num_samples", | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| "--max_seq_length", | ||
| length, |
the integer arguments length and dataset_size must be converted to strings for subprocess.run
| "--max_seq_length", | |
| length, | |
| str(length), | |
| "--num_samples", | |
| str(dataset_size), |
| ] | ||
| random.shuffle(sentences) | ||
| indexes = sorted(random.sample(range(num_haystack), len(needles)), reverse=True) |
This line will raise a ValueError when len(needles) > num_haystack because random.sample() cannot sample more elements than the population size.
This can occur when args.num_needle_k * args.num_needle_v is large but num_haystack is small (e.g., during binary search or when max_seq_length is very restrictive).
| indexes = sorted(random.sample(range(num_haystack), len(needles)), reverse=True) | |
| indexes = sorted(random.sample(range(num_haystack), min(len(needles), num_haystack)), reverse=True) |
| setup = list(metrics.keys())[0].rsplit(".", 1)[0] | ||
| metrics[setup] = {} | ||
| for aggregation in metrics[f"{setup}.mk_niah_basic"]: |
This code assumes metrics is non-empty and that specific keys exist without validation. If metrics is empty, line 33 raises IndexError. If expected tasks are missing, line 36 raises KeyError.
| setup = list(metrics.keys())[0].rsplit(".", 1)[0] | |
| metrics[setup] = {} | |
| for aggregation in metrics[f"{setup}.mk_niah_basic"]: | |
| if not metrics: | |
| raise ValueError("metrics dictionary cannot be empty") | |
| setup = list(metrics.keys())[0].rsplit(".", 1)[0] | |
| # Validate that mk_niah_basic exists | |
| if f"{setup}.mk_niah_basic" not in metrics: | |
| raise KeyError(f"Expected key '{setup}.mk_niah_basic' not found in metrics") | |
| metrics[setup] = {} |
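A standalone sketch of the guarded derivation; `derive_setup` is an illustrative name, as the PR performs this inline:

```python
def derive_setup(metrics: dict, anchor_task: str = "mk_niah_basic") -> str:
    """Guard the key derivation instead of letting an empty dict
    surface as IndexError or a missing task as KeyError."""
    if not metrics:
        raise ValueError("metrics dictionary cannot be empty")
    setup = next(iter(metrics)).rsplit(".", 1)[0]
    if f"{setup}.{anchor_task}" not in metrics:
        raise KeyError(f"expected key '{setup}.{anchor_task}' not found in metrics")
    return setup
```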
| output_folder = Path(__file__).parent / setup | ||
| # 1. installing necessary packages | ||
| subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) |
The dependencies wonderwords, html2text, and tenacity are installed at runtime via pip, but they are not listed in requirements/main.txt. This creates an inconsistency where:
- The dependencies won't be installed during normal package installation
- Runtime pip installation may fail in restricted environments
- It makes dependency management unclear
These dependencies should be added to requirements/main.txt. Note that tenacity is used in tokenizer.py but not declared anywhere in requirements.
| q["random_index"] = random_numbers[i] | ||
| random.shuffle(curr_context) | ||
| examples = random.sample(curr_context, args.fewshot) |
This line will raise a ValueError if args.fewshot > len(curr_context). While unlikely in typical usage, it could occur with specific parameter combinations or edge cases during binary search when num_qs is very small.
| examples = random.sample(curr_context, args.fewshot) | |
| examples = random.sample(curr_context, min(args.fewshot, len(curr_context))) |
| curr_needle["distractor"] = [{**c, "random_index": generate_random_number()} for c in curr_needle["distractor"]] | ||
| if args.fewshot > 0: | ||
| fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot) |
This line will raise a ValueError if args.fewshot >= len(needle). The code samples from indices excluding the current index, so if the dataset only has a few samples and fewshot is configured too high, this will fail.
| fewshot_examples = random.sample([i for i in range(len(needle)) if i != index], args.fewshot) | |
| available_samples = [i for i in range(len(needle)) if i != index] | |
| if args.fewshot > len(available_samples): | |
| raise ValueError(f"fewshot ({args.fewshot}) cannot exceed available samples ({len(available_samples)})") | |
| fewshot_examples = random.sample(available_samples, args.fewshot) |
| def __init__(self, model_path="gemini-1.5-pro-latest") -> None: | ||
| import google.generativeai as genai | ||
| genai.configure(api_key=os.environ["GEMINI_API_KEY"]) |
Accessing os.environ["GEMINI_API_KEY"] will raise a KeyError if the environment variable is not set. This should be handled gracefully with a clear error message.
Consider checking if the environment variable exists first and raising a descriptive error if it's missing.
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License |
The copyright notice is missing a period at the end. It should end with "limitations under the License."
| # limitations under the License | |
| # limitations under the License. |
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License |
The copyright notice is missing a period at the end. It should end with "limitations under the License."
| # limitations under the License | |
| # limitations under the License. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License |
The copyright notice is missing a period at the end. It should end with "limitations under the License."
| # limitations under the License | |
| # limitations under the License. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Kipok
left a comment
Please add this to the documentation (ideally with an example command / expected results for any model).
If it takes a long time to prepare data (or data is very large) please also add this benchmark to https://github.com/NVIDIA-NeMo/Skills/blob/main/tests/gpu-tests/test_eval.py#L30, so that we don't automatically run it in ci
| output_folder = Path(__file__).parent / setup | ||
| # 1. installing necessary packages | ||
| # subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) |
logic: the dependencies wonderwords, html2text, and tenacity are commented out but actually required by the code (prepare_niah.py imports wonderwords, tokenizer.py imports tenacity). These should either be uncommented or added to requirements/main.txt
| # subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) | |
| subprocess.run(["pip", "install", "wonderwords", "html2text", "tenacity"], check=True) |
| def tokens_to_text(self, tokens: List[int]) -> str: | ||
| pass |
logic: tokens_to_text is not implemented and returns None. Since GeminiTokenizer's text_to_tokens returns fake sequential integers rather than actual tokens, decoding is impossible. If this functionality isn't needed for Gemini, raise NotImplementedError to make the limitation explicit
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| pass | |
| def tokens_to_text(self, tokens: List[int]) -> str: | |
| raise NotImplementedError("GeminiTokenizer does not support decoding tokens back to text") |
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Additional Comments (1)
- requirements/main.txt, line 54: logic: missing required dependencies for RULER2: `wonderwords`, `html2text`, and `tenacity` are imported in `prepare_niah.py` and `tokenizer.py` but not listed here
13 files reviewed, 1 comment
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com> Signed-off-by: wasiahmad <wasiahmad@ucla.edu> Signed-off-by: i-vainn <imoshkov@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com> Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Stephen Ge <stepheng@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com> Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com> Signed-off-by: Wei Du <wedu@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: SeanNaren <snarenthiran@nvidia.com> Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Signed-off-by: Sanyam Kapoor <sanyamk@nvidia.com> Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com> Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com> Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com> Signed-off-by: Dan Lord <blahblahasdf@gmail.com> Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com> Co-authored-by: Wasi Ahmad <wasiahmad@ucla.edu> Co-authored-by: Ivan <imoshkov@nvidia.com> Co-authored-by: George Armstrong <georgea@nvidia.com> Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com> Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Stephen Ge <stepheng@nvidia.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Nick Ludwig <nliudvig@nvidia.com> 
Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com> Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com> Co-authored-by: Wei Du <wedu@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sean Naren <snarenthiran@nvidia.com> Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com> Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com> Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com> Co-authored-by: Sanyam Kapoor <3909933+activatedgeek@users.noreply.github.com> Co-authored-by: Valentin Mendelev <vmendelev@nvidia.com> Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com> Co-authored-by: Jocelyn <jocelynh@nvidia.com> Co-authored-by: Dan Lord <blahblahasdf@gmail.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features