
Add SPEED-bench #1276

Closed
talorabr wants to merge 10 commits into NVIDIA-NeMo:main from talorabr:speed-bench

Conversation

@talorabr
Collaborator

@talorabr talorabr commented Feb 25, 2026

PR to support speculative decoding (SD) evaluation, and specifically the SPEED-Bench dataset (a new standard benchmark for speculative decoding):

  • Add SD metrics.
  • Add SD evaluator and generation flows.
  • Add SPEED-Bench dataset for SD evaluation.
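
The SD metrics added here (e.g., acceptance rate and acceptance length) can be derived from raw draft/accept counters. The sketch below is illustrative, not the PR's actual API; `acceptance_length` uses the common definition of accepted tokens per draft step plus the one token the target model always emits:

```python
def acceptance_metrics(num_drafts: int, draft_tokens: int, accepted_tokens: int) -> dict:
    """Derive speculative-decoding acceptance statistics from raw counters.

    acceptance_rate: fraction of drafted tokens the target model accepted.
    acceptance_length: average tokens emitted per draft step (accepted tokens
    plus the one guaranteed target-model token).
    """
    if num_drafts == 0 or draft_tokens == 0:
        return {"acceptance_rate": 0.0, "acceptance_length": 0.0}
    return {
        "acceptance_rate": accepted_tokens / draft_tokens,
        "acceptance_length": (accepted_tokens + num_drafts) / num_drafts,
    }

# e.g. 10 draft steps, 40 drafted tokens, 30 accepted
# -> acceptance_rate 0.75, acceptance_length 4.0
```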

Summary by CodeRabbit

  • New Features

    • End-to-end Speculative Decoding (SD) evaluation: metric collection, delta computation, generation-time injection, and evaluator integration
    • SPEED-Bench data preparation CLI to generate per-config evaluation datasets
    • New SD reporting metrics: acceptance length, acceptance rate, and per-position statistics
  • Documentation

    • Added Speculative Decoding evaluation guide with examples and benchmark setup
  • Chores

    • Improved shell-argument quoting in server launch scripts

Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f42d153 and 8b4a5d3.

📒 Files selected for processing (1)
  • nemo_skills/dataset/speed-bench/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/speed-bench/__init__.py

📝 Walkthrough

Walkthrough

Adds end-to-end Speculative Decoding (SD) evaluation: docs and SPEED-Bench prep, new SpecdecGenerationTask to capture/compute pre/post spec-decoding metrics from vLLM or SGLang (Prometheus or per-request files), metrics collector and evaluator registration, and two minor server CLI quoting fixes.

Changes

Cohort / File(s) Summary
Documentation
docs/evaluation/index.md, docs/evaluation/speculative-decoding.md, mkdocs.yml
Add speculative decoding docs, index nav entry, SPEED-Bench usage and examples.
SPEED-Bench dataset prep
nemo_skills/dataset/speed-bench/__init__.py, nemo_skills/dataset/speed-bench/prepare.py
New dataset package constants and comprehensive CLI script to fetch external datasets, generate/pad prompts for many benchmarks, resolve turns to messages, and emit per-config JSONL outputs.
Generation / inference (specdec)
nemo_skills/inference/eval/specdec.py
Major feature: SpecdecGenerationTask, SpecDecodeMetrics model, vLLM and SGLang Prometheus and per-request parsers, delta computation, injection of specdec_stats into eval_config, Hydra entrypoint and helpers.
Evaluator & metrics integration
nemo_skills/evaluation/evaluator/specdec.py, nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/metrics/specdec_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
New specdec evaluator that stamps specdec_stats, SpecdecMetrics implementation, and registration in evaluator/metrics maps.
Server CLI quoting fixes
nemo_skills/inference/server/serve_sglang.py, nemo_skills/inference/server/serve_vllm.py
Use shlex.join for shell arg quoting when composing extra server arguments.
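
For context, `shlex.join` quotes each argument individually, so values containing spaces or shell metacharacters survive being composed into one command string:

```python
import shlex

# Hypothetical extra server arguments; "my template.jinja" contains a space
extra_args = ["--chat-template", "my template.jinja", "--seed", "42"]
cmd = f"python -m server {shlex.join(extra_args)}"
print(cmd)
# -> python -m server --chat-template 'my template.jinja' --seed 42
```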

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GenTask as SpecdecGenerationTask
    participant Server as vLLM/SGLang
    participant Metrics as Prometheus/metrics file
    participant Evaluator

    User->>GenTask: start generation+evaluation(cfg)
    GenTask->>Metrics: fetch_metrics("before")
    Metrics-->>GenTask: before_snapshot
    GenTask->>Server: send_prompts / start_inference
    Server-->>GenTask: completions (+ optional per-request metrics files)
    GenTask->>Metrics: fetch_metrics("after") or read metrics files
    Metrics-->>GenTask: after_snapshot
    GenTask->>GenTask: compute_delta(before_snapshot, after_snapshot)
    GenTask->>GenTask: inject specdec_stats into eval_config
    GenTask->>Evaluator: run eval_specdec(config_with_stats)
    Evaluator-->>User: evaluation_results (data stamped with specdec metrics)
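
The compute_delta step in the diagram amounts to subtracting monotonically increasing counters between the two snapshots. A minimal sketch (counter names are illustrative, not the actual vLLM/SGLang metric names):

```python
def compute_delta(before: dict, after: dict) -> dict:
    """Difference of counter snapshots; counters absent before start at zero."""
    return {key: after[key] - before.get(key, 0) for key in after}

before = {"num_drafts": 100, "num_accepted_tokens": 250}
after = {"num_drafts": 160, "num_accepted_tokens": 430}
delta = compute_delta(before, after)
# -> {"num_drafts": 60, "num_accepted_tokens": 180}
```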

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

enhancement

Suggested reviewers

  • gwarmstrong
  • ekmb
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 48.89%, which is below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check name: Title check
Status: ✅ Passed
Explanation: The title 'Add SPEED-bench' directly and specifically describes the main change: adding support for the SPEED-bench dataset and speculative decoding evaluation infrastructure throughout the codebase.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (3)
nemo_skills/dataset/speed-bench/prepare.py (1)

293-299: Redundant encoder.encode call.

Line 298 re-encodes prompt even though the result of the same call is already in tokens from line 293.

♻️ Proposed fix
     tokens = encoder.encode(prompt, disallowed_special=())
     current_num_tokens = len(tokens)
 
     if current_num_tokens > target_num_tokens:
         # Truncate if too long
-        tokens = encoder.encode(prompt, disallowed_special=())
         return encoder.decode(tokens[:target_num_tokens])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 293 - 299, The code
redundantly calls encoder.encode(prompt, disallowed_special=()) twice; use the
already-computed tokens variable when truncating instead of re-encoding. In the
block that checks if current_num_tokens > target_num_tokens (around variables
tokens, current_num_tokens, target_num_tokens), remove the second encoder.encode
call and simply return encoder.decode(tokens[:target_num_tokens]) so the
existing tokens are reused.
nemo_skills/inference/eval/specdec.py (1)

440-448: Temp directory is never cleaned up.

_ensure_sglang_metrics_dir creates a temporary directory that persists after the process exits. Consider using atexit or a context manager to clean up, or at minimum document that cleanup is the caller's responsibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 440 - 448, The temp dir
created by _ensure_sglang_metrics_dir (uses cls._sglang_metrics_dir) is never
removed; register an atexit cleanup that removes the directory (use
shutil.rmtree) when the process exits, handle and log OSErrors, and ensure
registration only happens once (e.g., when you first create
cls._sglang_metrics_dir) so the temp dir is cleaned up automatically;
alternatively, add a documented method (e.g., cleanup_sglang_metrics_dir) to
explicitly delete cls._sglang_metrics_dir and call it from atexit if you prefer
manual control.
nemo_skills/evaluation/metrics/specdec_metrics.py (1)

87-98: no_answer branch is unreachable given compute_no_answer=False.

Since __init__ hardcodes compute_no_answer=False, the block at lines 96-97 can never execute. This is fine as defensive code for potential subclasses, but worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 87 - 98, The
metrics_to_print method contains an unreachable branch because __init__
hardcodes self.compute_no_answer=False; either make compute_no_answer
configurable in __init__ (add a parameter and assign to self.compute_no_answer)
so the "no_answer" key (as_percentage) can be enabled, or remove the dead branch
entirely; update the constructor (the __init__ method) and references to
compute_no_answer or delete the conditional block that adds "no_answer" from
metrics_to_print (affecting metrics_to_print, compute_no_answer, __init__, and
the "no_answer"/as_percentage handling) to keep behavior consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 62-73: The multi-line shell invocation for ns eval in
docs/evaluation/speculative-decoding.md is broken because some lines (notably
the line containing the flag --server_type=sglang and the line before the
++inference.tokens_to_generate stanza) are missing trailing backslashes, causing
the following lines (e.g., ++inference.tokens_to_generate=1024) to be executed
as separate commands; fix by adding the missing trailing "\" to the end of the
broken lines so the entire ns eval invocation (including flags like
--server_type=sglang, --server_gpus=8, and the subsequent
++inference.tokens_to_generate entry) is one continuous multi-line command, and
apply the same fix to the other affected block noted in the comment (the block
covering the range that includes ++inference.tokens_to_generate).
- Line 7: Typo in the metric name: replace "accetpance" with "acceptance" in the
phrase "conditional acceptance rate (or per-position accetpance rate)" so the
text reads "conditional acceptance rate (or per-position acceptance rate)";
update the same phrase wherever it appears in
docs/evaluation/speculative-decoding.md (look for "conditional acceptance rate"
/ "per-position accetpance rate") to keep terminology consistent.

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Line 363: The LOG.warning call is passing character as a positional argument
but the message has no %s placeholder; update the LOG.warning invocation (the
LOG object usage in prepare.py) to include a format placeholder (e.g., add " %s"
in the message) so the character value is interpolated (i.e., change the call
that currently reads LOG.warning("no motivation provided for character",
character) to use a format string like "no motivation provided for character %s"
with character as the second argument).
- Around line 196-211: The parameter named answer is being shadowed by the loop
variables in the loops "for i, answer in enumerate(answers)" and "for i, answer
in enumerate(answers_to_add)"; rename the loop variables (for example to
candidate or opt_answer) and update all uses inside those loops
(encoder.encode(...) and prompt += f"A{i + 1}...") so they reference the new
loop name, preserving the existing logic that skips the correct_answer_i and
builds answers_to_add, to avoid clobbering the outer parameter value.
- Around line 414-420: The while loop that samples from hle_train_category can
fail or behave unexpectedly if hle_train_category is empty; before entering the
loop in prepare.py (where hle_train_category.sample(1, random_state=rng) is
used), add a guard to check hle_train_category.empty and handle it explicitly
(e.g., raise a clear ValueError or return a sensible fallback) so the function
(which later calls encoder.decode with prompt_tokens and example_tokens) fails
with an informative message rather than crashing inside sample() or producing
wrong output; reference hle_train_category, the while current_num_tokens <
num_tokens loop, and encoder.decode when implementing the fix.
- Around line 136-137: Remove the incorrect `@staticmethod` decorator lines from
all module-level functions (e.g., _generate_stackselect_prompt and the other
functions declared at the noted occurrences) so they are plain top-level
functions; locate each def decorated with `@staticmethod` (the ones reported at
lines 137, 217, 273, 290, 315, 321, 332, 382, 400, 424 in the review) and delete
only the decorator line, leaving the def signature and body unchanged.
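
The LOG.warning fix above follows the standard logging idiom: put a %s placeholder in the message and pass the value separately, so interpolation is deferred until the record is actually rendered (the character value here is illustrative):

```python
import logging

LOG = logging.getLogger(__name__)
character = "example_character"  # illustrative value

# Broken: no placeholder, so the extra argument is never interpolated and
# logging reports a formatting error when the record is rendered.
# LOG.warning("no motivation provided for character", character)

# Fixed: %s placeholder with the value as a separate argument.
LOG.warning("no motivation provided for character %s", character)
```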

In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Around line 69-71: Check and validate eval_config.input_file before attempting
file I/O: ensure jsonl_file = eval_config.input_file is not None, is a non-empty
string (and optionally a valid path), and if not raise a clear ValueError (e.g.
"input_file must be a non-empty string") before the with open(...) block in
specdec.py so the open call no longer raises a generic TypeError; update any
callers or docstring if needed to reflect the validation on
eval_config.input_file.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 67-85: In get_metrics(), guard the speculative-decoding division
by checking self.total before dividing (refer to method get_metrics, the loop
handling metric_key.startswith("spec_") and attribute self.total); if self.total
is zero, set the speculative metric value to 0 (or another sentinel like
None/float("nan") if preferred) instead of performing metric_value / self.total,
then continue to call update_common_metrics and _add_std_metrics as before.

In `@nemo_skills/inference/eval/specdec.py`:
- Around line 601-633: In process_single_datapoint, current_response sometimes
lacks the "response" key so later list comprehension accessing
response["response_id"] can KeyError; ensure every current_response includes a
response_id (e.g., after computing current_response check if "response" not in
current_response then set current_response["response_id"]=None) or change the
aggregation to use response.get("response_id") when building "response_ids";
update the logic around current_response, responses, and the final return to
consistently provide/expect "response_id" (refer to process_single_datapoint,
current_response, responses, and the final return's "response_ids"
construction).
- Around line 546-599: The code writes data points that lack metric fields and
then aggregates over all data_points causing KeyError; fix by only writing and
aggregating matched data points: in the loop that checks if all(response_id in
metrics for response_id in data_point["response_ids"]) (the block that updates
data_point with num_drafts, draft_tokens, etc.) add matched_data_points = [] (or
append to an existing list) and write/append only when the metrics check passes,
keep the LOG.warning for unmatched ones but do NOT write them to fout or include
them in the list used for the final return_value sums; finally, compute
return_value by summing over matched_data_points (not data_points) and handle
empty matched_data_points (return None or zeros) to avoid division by zero.
- Around line 720-734: specdec_stats may be None after calling
compute_sglang_spec_decode_delta (and similarly for the VLLM path), so guard the
logging that dereferences specdec_stats: after assigning specdec_stats =
compute_sglang_spec_decode_delta(...) (and for the VLLM equivalent
compute_vllm_spec_decode_delta/variable), check "if specdec_stats is None" and
in that branch log a clear fallback message (e.g., "no delta available, skipping
Prometheus delta logging") and skip the formatted access; otherwise log using
specdec_stats["acceptance_length"], ["acceptance_rate"], ["num_drafts"],
["draft_tokens"] as before. Ensure you apply the same None-guard to the VLLM
block to prevent TypeError.
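
The None-guard described above might be factored into a small helper like this sketch (field names mirror the review comment; the PR's actual `compute_*_spec_decode_delta` functions are not reproduced here):

```python
import logging

LOG = logging.getLogger(__name__)


def log_specdec_stats(specdec_stats) -> bool:
    """Log delta stats when available; return whether stats were logged."""
    if specdec_stats is None:
        LOG.warning("no delta available, skipping Prometheus delta logging")
        return False
    LOG.info(
        "acceptance_length=%.3f acceptance_rate=%.3f num_drafts=%d draft_tokens=%d",
        specdec_stats["acceptance_length"],
        specdec_stats["acceptance_rate"],
        specdec_stats["num_drafts"],
        specdec_stats["draft_tokens"],
    )
    return True
```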

---

Nitpick comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 293-299: The code redundantly calls encoder.encode(prompt,
disallowed_special=()) twice; use the already-computed tokens variable when
truncating instead of re-encoding. In the block that checks if
current_num_tokens > target_num_tokens (around variables tokens,
current_num_tokens, target_num_tokens), remove the second encoder.encode call
and simply return encoder.decode(tokens[:target_num_tokens]) so the existing
tokens are reused.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 87-98: The metrics_to_print method contains an unreachable branch
because __init__ hardcodes self.compute_no_answer=False; either make
compute_no_answer configurable in __init__ (add a parameter and assign to
self.compute_no_answer) so the "no_answer" key (as_percentage) can be enabled,
or remove the dead branch entirely; update the constructor (the __init__ method)
and references to compute_no_answer or delete the conditional block that adds
"no_answer" from metrics_to_print (affecting metrics_to_print,
compute_no_answer, __init__, and the "no_answer"/as_percentage handling) to keep
behavior consistent.

In `@nemo_skills/inference/eval/specdec.py`:
- Around line 440-448: The temp dir created by _ensure_sglang_metrics_dir (uses
cls._sglang_metrics_dir) is never removed; register an atexit cleanup that
removes the directory (use shutil.rmtree) when the process exits, handle and log
OSErrors, and ensure registration only happens once (e.g., when you first create
cls._sglang_metrics_dir) so the temp dir is cleaned up automatically;
alternatively, add a documented method (e.g., cleanup_sglang_metrics_dir) to
explicitly delete cls._sglang_metrics_dir and call it from atexit if you prefer
manual control.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a32b13 and 15df539.

📒 Files selected for processing (11)
  • docs/evaluation/index.md
  • docs/evaluation/speculative-decoding.md
  • nemo_skills/dataset/speed-bench/__init__.py
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/specdec.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/specdec_metrics.py
  • nemo_skills/inference/eval/specdec.py
  • nemo_skills/inference/server/serve_sglang.py
  • nemo_skills/inference/server/serve_vllm.py

Comment on lines +196 to +211
    for i, answer in enumerate(answers):
        if i == correct_answer_i:
            continue
        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
        answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
        if all_tokens + answer_to_add_tokens > num_tokens:
            break
        answers_to_add_stop = i
    answers_to_add = (
        answers[: answers_to_add_stop + 1]
        if answers_to_add_stop >= correct_answer_i
        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
    )
    random.shuffle(answers_to_add)
    for i, answer in enumerate(answers_to_add):
        prompt += f"A{i + 1}:\n\n{answer}\n\n"

⚠️ Potential issue | 🟡 Minor

Variable answer is shadowed by loop variables.

The parameter answer (line 137) is shadowed by the loop variable on line 196 (for i, answer in enumerate(answers)) and again on line 210 (for i, answer in enumerate(answers_to_add)). This makes the code confusing and error-prone if the original answer parameter is needed after these loops.

🛡️ Proposed fix — rename loop variables
-    for i, answer in enumerate(answers):
+    for i, ans in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, ans in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{ans}\n\n"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    for i, answer in enumerate(answers):
+    for i, ans in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
         answers_to_add_stop = i
     answers_to_add = (
         answers[: answers_to_add_stop + 1]
         if answers_to_add_stop >= correct_answer_i
         else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
     )
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, ans in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{ans}\n\n"
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 207-207: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 196 - 211, The
parameter named answer is being shadowed by the loop variables in the loops "for
i, answer in enumerate(answers)" and "for i, answer in
enumerate(answers_to_add)"; rename the loop variables (for example to candidate
or opt_answer) and update all uses inside those loops (encoder.encode(...) and
prompt += f"A{i + 1}...") so they reference the new loop name, preserving the
existing logic that skips the correct_answer_i and builds answers_to_add, to
avoid clobbering the outer parameter value.

Comment on lines +69 to +71
    jsonl_file = eval_config.input_file
    with open(jsonl_file, "rt", encoding="utf-8") as fin:
        data = [json.loads(line) for line in fin]

⚠️ Potential issue | 🟡 Minor

Add explicit input_file validation before file I/O.

If input_file is absent, this currently fails with a generic TypeError at open-time. A direct check yields clearer failure semantics.

Suggested fix
     jsonl_file = eval_config.input_file
+    if not jsonl_file:
+        raise ValueError("`input_file` must be provided for specdec evaluation")
     with open(jsonl_file, "rt", encoding="utf-8") as fin:
         data = [json.loads(line) for line in fin]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     jsonl_file = eval_config.input_file
+    if not jsonl_file:
+        raise ValueError("`input_file` must be provided for specdec evaluation")
     with open(jsonl_file, "rt", encoding="utf-8") as fin:
         data = [json.loads(line) for line in fin]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 69 - 71, Check and
validate eval_config.input_file before attempting file I/O: ensure jsonl_file =
eval_config.input_file is not None, is a non-empty string (and optionally a
valid path), and if not raise a clear ValueError (e.g. "input_file must be a
non-empty string") before the with open(...) block in specdec.py so the open
call no longer raises a generic TypeError; update any callers or docstring if
needed to reflect the validation on eval_config.input_file.

Comment on lines +67 to +85
    def get_metrics(self) -> dict:
        """Get all computed metrics including speculative decoding statistics.

        Returns:
            Nested dict of evaluation mode → metric name → value.
        """
        metrics_dict = {}

        for agg_mode, agg_metric_dict in self.eval_dict.items():
            metrics_dict[agg_mode] = {}
            self.update_common_metrics(metrics_dict[agg_mode])
            for metric_key, metric_value in agg_metric_dict.items():
                if metric_key.startswith("spec_"):
                    metrics_dict[agg_mode][metric_key] = metric_value / self.total
                else:
                    metrics_dict[agg_mode][metric_key] = metric_value
        self._add_std_metrics(metrics_dict)

        return metrics_dict

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's explore the structure of BaseMetrics and related classes
cd nemo_skills && find . -type f -name "*metrics*.py" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 1727


🏁 Script executed:

# Search for BaseMetrics class definition and self.total usage
rg -n "class BaseMetrics" --type=py -A 20

Repository: NVIDIA-NeMo/Skills

Length of output: 1978


🏁 Script executed:

# Check how self.total is initialized and used throughout the codebase
rg -n "self\.total\s*=" --type=py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 372


🏁 Script executed:

# Look for get_metrics implementations in other metric classes
rg -n "def get_metrics" --type=py -B 5 -A 20

Repository: NVIDIA-NeMo/Skills

Length of output: 45882


Guard against self.total == 0 before dividing by it in speculative decoding metrics.

If get_metrics() is called before any predictions are ingested (or after an empty update), the metric_value / self.total on line 80 will raise a ZeroDivisionError.

Proposed fix
     def get_metrics(self) -> dict:
         metrics_dict = {}
+        if self.total == 0:
+            return metrics_dict
 
         for agg_mode, agg_metric_dict in self.eval_dict.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 67 - 85, In
get_metrics(), guard the speculative-decoding division by checking self.total
before dividing (refer to method get_metrics, the loop handling
metric_key.startswith("spec_") and attribute self.total); if self.total is zero,
set the speculative metric value to 0 (or another sentinel like
None/float("nan") if preferred) instead of performing metric_value / self.total,
then continue to call update_common_metrics and _add_std_metrics as before.

Comment on lines +546 to +599
        data_points = []
        with open(self.cfg.output_file, "rt", encoding="utf-8") as fin:
            for i, line in enumerate(fin):
                try:
                    data_points.append(json.loads(line))
                except json.JSONDecodeError:
                    LOG.warning(f"Failed to parse JSON line {i} for metrics injection, skipping")
                    continue

        with open(self.cfg.output_file, "wt", encoding="utf-8") as fout:
            for data_point in data_points:
                if all(response_id in metrics for response_id in data_point["response_ids"]):
                    data_point.update(
                        {
                            "num_drafts": sum(
                                metrics[response_id]["num_drafts"] for response_id in data_point["response_ids"]
                            ),
                            "draft_tokens": sum(
                                metrics[response_id]["draft_tokens"] for response_id in data_point["response_ids"]
                            ),
                            "accepted_tokens": sum(
                                metrics[response_id]["accepted_tokens"] for response_id in data_point["response_ids"]
                            ),
                            "acceptance_rate": sum(
                                metrics[response_id]["acceptance_rate"] for response_id in data_point["response_ids"]
                            )
                            / len(data_point["response_ids"]),
                            "acceptance_length": sum(
                                metrics[response_id]["acceptance_length"] for response_id in data_point["response_ids"]
                            )
                            / len(data_point["response_ids"]),
                        }
                    )
                else:
                    LOG.warning(
                        "No metrics found for response_ids: %s, skipping data point", data_point["response_ids"]
                    )
                fout.write(json.dumps(data_point) + "\n")

        try:
            return_value = {
                "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
                "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
                "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
                "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
                / len(data_points),
                "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
                / len(data_points),
                "per_position_acceptance_rates": [],
            }
        except KeyError:
            LOG.warning("Metrics injection failed for some data points, skipping")
            return None
        return return_value

⚠️ Potential issue | 🟠 Major

Data points missing metrics are written without spec fields, causing the final aggregation to KeyError.

When response_ids are not all found in metrics (line 557), the data point is written to the output file without the spec-decode keys (line 583 runs unconditionally). Then lines 587-591 iterate over all data_points including those missing the keys, which raises a KeyError caught on line 596, discarding the entire result.

Consider either skipping unmatched data points from the aggregation or not writing them to the output file.

🐛 Proposed fix — track matched data points for aggregation
+        matched_data_points = []
         with open(self.cfg.output_file, "wt", encoding="utf-8") as fout:
             for data_point in data_points:
                 if all(response_id in metrics for response_id in data_point["response_ids"]):
                     data_point.update(
                         {
                             ...
                         }
                     )
+                    matched_data_points.append(data_point)
                 else:
                     LOG.warning(
                         "No metrics found for response_ids: %s, skipping data point", data_point["response_ids"]
                     )
                 fout.write(json.dumps(data_point) + "\n")
 
+        if not matched_data_points:
+            LOG.warning("No data points matched metrics, skipping aggregation")
+            return None
+
         try:
             return_value = {
-                "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
-                "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
-                "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
-                "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
-                / len(data_points),
-                "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
-                / len(data_points),
+                "num_drafts": sum(dp["num_drafts"] for dp in matched_data_points),
+                "draft_tokens": sum(dp["draft_tokens"] for dp in matched_data_points),
+                "accepted_tokens": sum(dp["accepted_tokens"] for dp in matched_data_points),
+                "acceptance_rate": sum(dp["acceptance_rate"] for dp in matched_data_points)
+                / len(matched_data_points),
+                "acceptance_length": sum(dp["acceptance_length"] for dp in matched_data_points)
+                / len(matched_data_points),
                 "per_position_acceptance_rates": [],
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 546 - 599, The code
writes data points that lack metric fields and then aggregates over all
data_points causing KeyError; fix by only writing and aggregating matched data
points: in the loop that checks if all(response_id in metrics for response_id in
data_point["response_ids"]) (the block that updates data_point with num_drafts,
draft_tokens, etc.) add matched_data_points = [] (or append to an existing list)
and write/append only when the metrics check passes, keep the LOG.warning for
unmatched ones but do NOT write them to fout or include them in the list used
for the final return_value sums; finally, compute return_value by summing over
matched_data_points (not data_points) and handle empty matched_data_points
(return None or zeros) to avoid division by zero.
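The aggregation the fix describes can be exercised in isolation. A minimal sketch of the intended behavior (the helper name and the None-on-empty convention are assumptions for illustration, not code from the PR):

```python
def aggregate_specdec(matched_data_points):
    """Aggregate per-request speculative-decoding stats over matched data
    points only; return None when nothing matched, so callers never divide
    by zero."""
    if not matched_data_points:
        return None
    n = len(matched_data_points)
    return {
        "num_drafts": sum(dp["num_drafts"] for dp in matched_data_points),
        "draft_tokens": sum(dp["draft_tokens"] for dp in matched_data_points),
        "accepted_tokens": sum(dp["accepted_tokens"] for dp in matched_data_points),
        # rates and lengths are per-request averages, not sums
        "acceptance_rate": sum(dp["acceptance_rate"] for dp in matched_data_points) / n,
        "acceptance_length": sum(dp["acceptance_length"] for dp in matched_data_points) / n,
    }
```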

Comment on lines +601 to +633
async def process_single_datapoint(self, data_point, all_data, prompt_format="openai"):
    """Handle single-turn and multi-turn generation.

    For single-turn data points (or when ``multiturn`` is ``False``), this
    falls through to the base-class implementation.

    For multi-turn data points, it loops through every turn, building up
    the conversation and accumulating token counts.

    Also tracks request IDs for SGLang metrics matching.
    """

    messages = []
    responses = []

    for message in data_point["messages"]:
        messages.append(message)
        new_data_point = {"messages": messages}
        current_response = await super().process_single_datapoint(
            new_data_point, all_data, prompt_format=prompt_format
        )
        if "response" in current_response:
            raw_response = current_response.pop("response")
            current_response["response_id"] = raw_response.id
        responses.append(current_response)
        messages.append({"role": "assistant", "content": current_response["generation"]})

    # Return aggregated results
    return {
        "generation": [response["generation"] for response in responses],
        "num_generated_tokens": sum([response["num_generated_tokens"] for response in responses]),
        "response_ids": [response["response_id"] for response in responses],
    }

⚠️ Potential issue | 🟠 Major

response_id may be missing if the response object lacks a "response" key.

On line 622, if "response" is not in current_response, the response_id field is never set. However, line 632 unconditionally accesses response["response_id"] for every response, which would raise a KeyError.

🛡️ Proposed fix
-            if "response" in current_response:
-                raw_response = current_response.pop("response")
-                current_response["response_id"] = raw_response.id
+            raw_response = current_response.pop("response", None)
+            current_response["response_id"] = raw_response.id if raw_response is not None else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 601 - 633, In
process_single_datapoint, current_response sometimes lacks the "response" key so
later list comprehension accessing response["response_id"] can KeyError; ensure
every current_response includes a response_id (e.g., after computing
current_response check if "response" not in current_response then set
current_response["response_id"]=None) or change the aggregation to use
response.get("response_id") when building "response_ids"; update the logic
around current_response, responses, and the final return to consistently
provide/expect "response_id" (refer to process_single_datapoint,
current_response, responses, and the final return's "response_ids"
construction).

Comment on lines +720 to +734
# ----- Strategy 2: Prometheus before/after fallback -----
if specdec_stats is None:
    LOG.info("Falling back to SGLang Prometheus before/after delta…")
    LOG.info("Fetching AFTER SGLang spec-decode metrics from %s/metrics", server_address)
    after_sglang = fetch_sglang_spec_decode_metrics(server_address)

    specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
    LOG.info(
        "SGLang Prometheus delta: acceptance_length=%.4f, "
        "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
        specdec_stats["acceptance_length"],
        specdec_stats["acceptance_rate"],
        specdec_stats["num_drafts"],
        specdec_stats["draft_tokens"],
    )

⚠️ Potential issue | 🔴 Critical

specdec_stats can be None when logged on lines 730-734, causing TypeError.

compute_sglang_spec_decode_delta can return None (e.g. when delta_requests <= 0). The code on line 726 assigns the result to specdec_stats, but immediately accesses specdec_stats["acceptance_length"] on line 730 without a null check. The same pattern affects the VLLM path on lines 741-749.

🐛 Proposed fix — guard both logging blocks
             # ----- Strategy 2: Prometheus before/after fallback -----
             if specdec_stats is None:
                 LOG.info("Falling back to SGLang Prometheus before/after delta…")
                 LOG.info("Fetching AFTER SGLang spec-decode metrics from %s/metrics", server_address)
                 after_sglang = fetch_sglang_spec_decode_metrics(server_address)
 
                 specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
-                LOG.info(
-                    "SGLang Prometheus delta: acceptance_length=%.4f, "
-                    "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
-                    specdec_stats["acceptance_length"],
-                    specdec_stats["acceptance_rate"],
-                    specdec_stats["num_drafts"],
-                    specdec_stats["draft_tokens"],
-                )
+                if specdec_stats is not None:
+                    LOG.info(
+                        "SGLang Prometheus delta: acceptance_length=%.4f, "
+                        "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
+                        specdec_stats["acceptance_length"],
+                        specdec_stats["acceptance_rate"],
+                        specdec_stats["num_drafts"],
+                        specdec_stats["draft_tokens"],
+                    )
+                else:
+                    LOG.warning("SGLang: failed to compute spec-decode delta.")
         else:
             # VLLM: Fetch "after" snapshot and compute delta
             LOG.info("Fetching AFTER spec-decode metrics from %s/metrics", server_address)
             after_metrics = fetch_vllm_spec_decode_metrics(server_address)
 
             specdec_stats = compute_vllm_spec_decode_delta(self._before_metrics, after_metrics)
-            LOG.info(
-                "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
-                "acceptance_rate=%.2f%%, acceptance_length=%.4f",
-                specdec_stats["num_drafts"],
-                specdec_stats["draft_tokens"],
-                specdec_stats["accepted_tokens"],
-                specdec_stats["acceptance_rate"],
-                specdec_stats["acceptance_length"],
-            )
-            if specdec_stats["per_position_acceptance_rates"]:
+            if specdec_stats is not None:
                 LOG.info(
-                    "Per-position acceptance rates: %s",
-                    [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+                    "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
+                    "acceptance_rate=%.2f%%, acceptance_length=%.4f",
+                    specdec_stats["num_drafts"],
+                    specdec_stats["draft_tokens"],
+                    specdec_stats["accepted_tokens"],
+                    specdec_stats["acceptance_rate"],
+                    specdec_stats["acceptance_length"],
                 )
+                if specdec_stats["per_position_acceptance_rates"]:
+                    LOG.info(
+                        "Per-position acceptance rates: %s",
+                        [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+                    )
+            else:
+                LOG.warning("VLLM: failed to compute spec-decode delta.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 720 - 734, specdec_stats
may be None after calling compute_sglang_spec_decode_delta (and similarly for
the VLLM path), so guard the logging that dereferences specdec_stats: after
assigning specdec_stats = compute_sglang_spec_decode_delta(...) (and for the
VLLM equivalent compute_vllm_spec_decode_delta/variable), check "if
specdec_stats is None" and in that branch log a clear fallback message (e.g.,
"no delta available, skipping Prometheus delta logging") and skip the formatted
access; otherwise log using specdec_stats["acceptance_length"],
["acceptance_rate"], ["num_drafts"], ["draft_tokens"] as before. Ensure you
apply the same None-guard to the VLLM block to prevent TypeError.
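The before/after delta that the guard protects can be sketched in isolation. A minimal version assuming plain counter dicts rather than real Prometheus snapshots — the key names and the acceptance-length formula are assumptions inferred from the logged fields, not the PR's actual helpers:

```python
def compute_spec_decode_delta(before: dict, after: dict):
    """Difference of monotonic spec-decode counters between two metric
    snapshots; returns None when no draft steps ran in between."""
    num_drafts = after["num_drafts"] - before["num_drafts"]
    if num_drafts <= 0:
        return None  # nothing ran between snapshots; callers must handle None
    draft_tokens = after["draft_tokens"] - before["draft_tokens"]
    accepted_tokens = after["accepted_tokens"] - before["accepted_tokens"]
    return {
        "num_drafts": num_drafts,
        "draft_tokens": draft_tokens,
        "accepted_tokens": accepted_tokens,
        "acceptance_rate": 100.0 * accepted_tokens / draft_tokens if draft_tokens else 0.0,
        # accepted draft tokens plus the one token the target model emits per step
        "acceptance_length": (accepted_tokens + num_drafts) / num_drafts,
    }


stats = compute_spec_decode_delta(
    {"num_drafts": 0, "draft_tokens": 0, "accepted_tokens": 0},
    {"num_drafts": 10, "draft_tokens": 30, "accepted_tokens": 18},
)
if stats is None:  # the None guard the review asks for
    print("no spec-decode activity between snapshots")
```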

talorabr and others added 2 commits February 25, 2026 16:36
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
nemo_skills/dataset/speed-bench/prepare.py (2)

136-136: ⚠️ Potential issue | 🟠 Major

Remove module-level @staticmethod decorators.

These are top-level functions, not class members. Keeping @staticmethod here is misleading and can create avoidable descriptor/type-introspection issues.

#!/bin/bash
# Verify minimum Python requirement and all module-level `@staticmethod` occurrences.

fd pyproject.toml setup.cfg setup.py | xargs -r rg -n "requires-python|python_requires"
rg -n '^\s*@staticmethod\s*$' nemo_skills/dataset/speed-bench/prepare.py
Suggested fix (apply to all listed occurrences)
-@staticmethod
 def _generate_stackselect_prompt(question: str, answers: list[str], answer: str, num_tokens: int) -> str:
-@staticmethod
 def _generate_textsort_prompt(prompt: str) -> str:

Also applies to: 216-216, 272-272, 289-289, 314-314, 320-320, 331-331, 381-381, 399-399, 423-423

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` at line 136, Remove the
module-level `@staticmethod` decorators that were applied to top-level functions
in nemo_skills/dataset/speed-bench/prepare.py; find each bare "@staticmethod"
above top-level function definitions (the occurrences flagged in the review) and
delete the decorator lines so the functions are plain module-level functions,
then run lint/tests to ensure no descriptor/type-introspection code relies on
them and update any references that assumed a staticmethod descriptor.

412-416: ⚠️ Potential issue | 🟡 Minor

Guard empty HLE category before sampling.

If filtering leaves hle_train_category empty, sample(1) raises and the failure message is not actionable.

Suggested fix
     hle_train_category = hle_train[hle_train["category"] == example["category"]]
+    if hle_train_category.empty:
+        raise ValueError(
+            f"No HLE training examples found for category '{example['category']}' after filtering"
+        )
 
     while current_num_tokens < num_tokens:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 412 - 416, Before
calling hle_train_category.sample(1) guard against an empty filtered DataFrame:
check hle_train_category.empty after filtering by example["category"] and either
(a) fallback to a broader sample (e.g., sample from hle_train with same rng) or
(b) raise a clear, actionable error that includes the missing category and
context (example["category"], example id) so the caller can diagnose the data
issue; ensure the subsequent code that appends to prompt still expects a string
demonstration from hle_train_category_sample.iloc[0].
docs/evaluation/speculative-decoding.md (1)

71-72: ⚠️ Potential issue | 🟠 Major

Fix broken multiline eval commands (missing trailing \).

Without a trailing backslash on Line 71 and Line 95, ++inference.tokens_to_generate=1024 runs as a separate command.

Suggested fix
-    --server_type=sglang
+    --server_type=sglang \
     ++inference.tokens_to_generate=1024
-    --server_type=vllm
+    --server_type=vllm \
     ++inference.tokens_to_generate=1024

Also applies to: 95-96

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` around lines 71 - 72, The multiline
shell flag sequences are missing trailing backslashes so that the next line
(e.g., the changed token "++inference.tokens_to_generate=1024") is executed as a
separate command; update the broken multiline commands (the lines containing
"--server_type=sglang" and the later block around
"++inference.tokens_to_generate=1024") to end with a trailing backslash on the
previous lines to continue the command, making sure the backslash is the last
character (no trailing spaces) and that each flag line uses the shell
continuation syntax correctly.
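The continuation rule can be checked with any multi-line command; a trivial sketch where `printf` stands in for the actual eval invocation:

```shell
# The backslash must be the very last character on the line; a trailing
# space after it breaks the continuation and the next line runs alone.
printf '%s\n' \
    --server_type=sglang \
    ++inference.tokens_to_generate=1024
```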
🧹 Nitpick comments (2)
docs/evaluation/speculative-decoding.md (1)

77-81: Add a language identifier to the fenced metrics block.

This keeps markdown lint clean and improves rendering consistency.

Suggested fix
-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs/evaluation/speculative-decoding.md around lines 77 - 81, The fenced
metrics block for the "speed-bench" table is missing a language identifier;
update the triple-backtick fence that precedes the table (the block starting
with the "--------------------------------------------- speed-bench
----------------------------------------------" header) to include "text" (i.e.,
replace ``` with ```text) so the code fence becomes ```text and keep the closing ``` unchanged to satisfy markdown lint and improve rendering.

nemo_skills/dataset/speed-bench/prepare.py (1)

457-465: Avoid recomputing HLE train preprocessing per row in dataset.map.

The to_pandas() conversion plus demonstration/tokens construction runs inside the per-example flow, which is expensive and scales poorly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 457 - 465, The
current code calls _get_external_dataset(...), then converts to_pandas() and
builds "demonstration" and "tokens" per-example inside a mapping flow which is
expensive; move the HLE preprocessing out of the per-example map by operating on
the full dataset once: fetch hle_train via
_get_external_dataset(BenchmarkDataset.HLE.value, ...), convert it once to a
pandas DataFrame (hle_train.to_pandas()) or use dataset.map with batched=True,
then create the "demonstration" column (using the lambda that concatenates
question and rationale) and compute "tokens" with a single
tiktoken.get_encoding("o200k_base").encode call applied vectorized/batched;
ensure you replace the per-example creation of "demonstration" and "tokens" in
the map with the precomputed columns so the dataset.map no longer recomputes
to_pandas or encoding per row (reference symbols: _get_external_dataset,
BenchmarkDataset.HLE.value, hle_train, "demonstration", "tokens",
tiktoken.get_encoding).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15df539 and e326fde.

📒 Files selected for processing (2)
  • docs/evaluation/speculative-decoding.md
  • nemo_skills/dataset/speed-bench/prepare.py

Comment on lines +194 to +208
all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
answers_to_add_stop = 0
for i, answer in enumerate(answers):
    if i == correct_answer_i:
        continue
    answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
    answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
    if all_tokens + answer_to_add_tokens > num_tokens:
        break
    answers_to_add_stop = i
answers_to_add = (
    answers[: answers_to_add_stop + 1]
    if answers_to_add_stop >= correct_answer_i
    else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
)

⚠️ Potential issue | 🟠 Major

Fix token-budget accounting in stackselect prompt construction.

all_tokens is never incremented after adding a candidate answer, so selection does not enforce cumulative budget. Also, answers_to_add_stop logic can include an answer that already exceeded budget.

Suggested fix
-    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
-    answers_to_add_stop = 0
-    for i, answer in enumerate(answers):
+    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
+    answers_to_add = [answers[correct_answer_i]]
+    for i, candidate_answer in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
-        answers_to_add_stop = i
-    answers_to_add = (
-        answers[: answers_to_add_stop + 1]
-        if answers_to_add_stop >= correct_answer_i
-        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
-    )
+        all_tokens += answer_to_add_tokens
+        answers_to_add.append(candidate_answer)
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, candidate_answer in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{candidate_answer}\n\n"

Also applies to: 210-211

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 207-207: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 194 - 208, The loop
that chooses which candidate answers to include doesn't update the running token
count and can incorrectly include an answer that exceeds the budget; change
answers_to_add_stop initialization from 0 to -1, and inside the for-loop when an
answer fits (i.e., all_tokens + answer_to_add_tokens <= num_tokens) increment
all_tokens by answer_to_add_tokens and then set answers_to_add_stop = i; keep
the break when it doesn't fit. Apply the same token-counting and
answers_to_add_stop initialization fix to the analogous block referenced at
lines 210-211 so cumulative budget is enforced.
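A self-contained version of the cumulative accounting the fix calls for — the function name and the `count_tokens` callback are illustrative (the real code uses tiktoken's encoder):

```python
def select_answers_within_budget(answers, correct_i, base_tokens, num_tokens, count_tokens):
    """Greedily keep distractor answers while the running total, which
    already includes the correct answer, stays within num_tokens."""
    total = base_tokens + count_tokens(answers[correct_i])
    selected = [answers[correct_i]]
    for i, answer in enumerate(answers):
        if i == correct_i:
            continue
        cost = count_tokens(f"A{i + 1}:\n\n{answer}\n\n")
        if total + cost > num_tokens:
            break
        total += cost  # the cumulative update the original loop was missing
        selected.append(answer)
    return selected
```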

@Kipok Kipok left a comment

@talorabr, thanks for this contribution! Please re-create from a branch so that tests can run

if key not in sample:
    sample[key] = value
else:
    LOG.warning(
Collaborator

should we fail instead? If this is a spec decoding benchmark, I think it's better to just fail with that error

Collaborator Author

Changed the configuration to check whether specdec_stats is a dictionary and to raise an error if it is not

        }
    )
else:
    LOG.warning(
Collaborator

should we fail here instead?

Collaborator Author

Not necessarily. There was a bug in SGLang where concurrent writes to the metrics file could corrupt it; this was fixed in sgl-project/sglang#19011.
In addition, we have a fallback mechanism that retrieves metrics from the /metrics endpoint if reading from the file fails.
The benefit of using the metrics file is that it provides more detailed metrics than the /metrics endpoint.
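The file-first, endpoint-fallback behavior described here can be sketched roughly (the function and argument names are illustrative, not the PR's actual helpers):

```python
import json


def load_specdec_metrics(metrics_file, fetch_from_endpoint):
    """Prefer the detailed per-request metrics file; fall back to the
    coarser /metrics endpoint if the file is missing or corrupt."""
    try:
        with open(metrics_file, "rt", encoding="utf-8") as fin:
            return [json.loads(line) for line in fin]
    except (OSError, json.JSONDecodeError):
        # file absent or corrupted (e.g. by concurrent writes): use endpoint
        return fetch_from_endpoint()
```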

Signed-off-by: talora <talora@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
nemo_skills/dataset/speed-bench/prepare.py (2)

193-207: ⚠️ Potential issue | 🟠 Major

Fix cumulative token-budget accounting in StackSelect prompt assembly.

all_tokens is never incremented after accepting candidates, so budget checks are non-cumulative and can overshoot num_tokens.

Proposed fix
-    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
-    answers_to_add_stop = 0
-    for i, answer in enumerate(answers):
+    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
+    answers_to_add = [answers[correct_answer_i]]
+    for i, candidate_answer in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
-        answers_to_add_stop = i
-    answers_to_add = (
-        answers[: answers_to_add_stop + 1]
-        if answers_to_add_stop >= correct_answer_i
-        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
-    )
+        all_tokens += answer_to_add_tokens
+        answers_to_add.append(candidate_answer)
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, candidate_answer in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{candidate_answer}\n\n"

Also applies to: 209-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 193 - 207, The
token-budget check is non-cumulative because all_tokens is never updated after
accepting an answer candidate; modify the loop in prepare.py (the block that
iterates over answers and computes answer_to_add_tokens) to increment all_tokens
by answer_to_add_tokens whenever an answer is accepted (i.e., when you do not
break) so subsequent iterations account for previously added tokens, and apply
the same cumulative-update fix to the similar logic around lines 209-210 that
assembles additional answers (ensure any branch that appends tokens also
increases all_tokens before continuing).

403-407: ⚠️ Potential issue | 🟡 Minor

Add an explicit empty-category guard in HLE prompt generation.

If category filtering yields zero rows, .sample(1) fails with a low-context pandas error.

Proposed fix
     hle_train_category = hle_train[hle_train["category"] == example["category"]]
+    if hle_train_category.empty:
+        raise ValueError(f"No HLE training examples found for category '{example['category']}'")
 
     while current_num_tokens < num_tokens:
         hle_train_category_sample = hle_train_category.sample(1, random_state=rng)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 403 - 407, When
building the HLE prompt, guard against empty category results by checking
hle_train_category.empty before calling hle_train_category.sample(1); if empty,
fall back to sampling from the full hle_train (e.g., hle_train.sample(1,
random_state=rng)) or handle by breaking the loop/continuing to avoid calling
.sample on an empty frame — update the loop that uses hle_train_category,
example, rng, current_num_tokens, num_tokens, and prompt to perform this
empty-category check and use the fallback branch so .sample(1) cannot be invoked
on an empty DataFrame.
nemo_skills/evaluation/evaluator/specdec.py (1)

69-71: ⚠️ Potential issue | 🟡 Minor

Validate input_file before file I/O.

If input_file is unset, this fails with a generic open-time error instead of a clear config failure.

Proposed fix
     jsonl_file = eval_config.input_file
+    if not isinstance(jsonl_file, str) or not jsonl_file:
+        raise ValueError("`input_file` must be a non-empty string for specdec evaluation")
     with open(jsonl_file, "rt", encoding="utf-8") as fin:
         data = [json.loads(line) for line in fin]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 69 - 71, Verify
that eval_config.input_file (jsonl_file) is set and points to a non-empty string
before attempting to open it; if it's missing or falsy, raise a clear config
error (e.g., ValueError("input_file must be set in eval_config")) rather than
letting open() throw, then proceed to open and load lines into data as before
(refer to symbols eval_config.input_file, jsonl_file, data).
nemo_skills/evaluation/metrics/specdec_metrics.py (1)

79-80: ⚠️ Potential issue | 🟠 Major

Guard speculative metric averaging when no entries exist.

Division by self.total can crash when no predictions were ingested.

Proposed fix
             for metric_key, metric_value in agg_metric_dict.items():
                 if metric_key.startswith("spec_"):
-                    metrics_dict[agg_mode][metric_key] = metric_value / self.total
+                    metrics_dict[agg_mode][metric_key] = (metric_value / self.total) if self.total else 0.0
                 else:
                     metrics_dict[agg_mode][metric_key] = metric_value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 79 - 80, The
speculative-metric averaging code uses metric_value / self.total and will raise
when no entries were ingested (self.total == 0); in the block where you handle
metric_key.startswith("spec_") (and write into
metrics_dict[agg_mode][metric_key]), guard the division by checking self.total >
0 and only perform metric_value / self.total when positive, otherwise assign a
safe default (e.g., 0 or None) or skip adding the metric; update the code around
metric_key.startswith("spec_") to use this guard so division-by-zero cannot
occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Line 9: Replace the vague "here" link text with a descriptive label (e.g.,
"Speculative decoding evaluation framework on GitHub") and update the metrics
output code fence to include a language specifier (use ```text) so the table is
rendered correctly; specifically, find the markdown link whose anchor text is
"here" and replace it with a descriptive label, and change the fenced block that
shows the "speed-bench" metrics to start with ```text (apply the same fixes to
the other occurrences mentioned in the comment).

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 620-622: The current direct call to dataset.to_json(output_path)
can truncate or corrupt an existing file if serialization fails; change this to
write to a temporary file in the same directory (e.g.,
output_path.with_suffix(".jsonl.tmp") or similar), perform the full
dataset.to_json to that temp path, flush/close it, then atomically move/replace
the temp file to output_path (use Path.replace or os.replace) and finally
LOG.info the final output_path; reference output_path, args.output_dir,
dataset.to_json and LOG.info when making the change.

---

Duplicate comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 193-207: The token-budget check is non-cumulative because
all_tokens is never updated after accepting an answer candidate; modify the loop
in prepare.py (the block that iterates over answers and computes
answer_to_add_tokens) to increment all_tokens by answer_to_add_tokens whenever
an answer is accepted (i.e., when you do not break) so subsequent iterations
account for previously added tokens, and apply the same cumulative-update fix to
the similar logic around lines 209-210 that assembles additional answers (ensure
any branch that appends tokens also increases all_tokens before continuing).
- Around line 403-407: When building the HLE prompt, guard against empty
category results by checking hle_train_category.empty before calling
hle_train_category.sample(1); if empty, fall back to sampling from the full
hle_train (e.g., hle_train.sample(1, random_state=rng)) or handle by breaking
the loop/continuing to avoid calling .sample on an empty frame — update the loop
that uses hle_train_category, example, rng, current_num_tokens, num_tokens, and
prompt to perform this empty-category check and use the fallback branch so
.sample(1) cannot be invoked on an empty DataFrame.

In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Around line 69-71: Verify that eval_config.input_file (jsonl_file) is set and
points to a non-empty string before attempting to open it; if it's missing or
falsy, raise a clear config error (e.g., ValueError("input_file must be set in
eval_config")) rather than letting open() throw, then proceed to open and load
lines into data as before (refer to symbols eval_config.input_file, jsonl_file,
data).

In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 79-80: The speculative-metric averaging code uses metric_value /
self.total and will raise when no entries were ingested (self.total == 0); in
the block where you handle metric_key.startswith("spec_") (and write into
metrics_dict[agg_mode][metric_key]), guard the division by checking self.total >
0 and only perform metric_value / self.total when positive, otherwise assign a
safe default (e.g., 0 or None) or skip adding the metric; update the code around
metric_key.startswith("spec_") to use this guard so division-by-zero cannot
occur.
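One of the duplicated findings above concerns sampling from a possibly empty category slice. A toy sketch of that guard (the DataFrame contents, `sample_from_category`, and the fallback policy are illustrative assumptions, not the `prepare.py` code):

```python
import pandas as pd

hle_train = pd.DataFrame(
    {"category": ["math", "math", "bio"], "question": ["q1", "q2", "q3"]}
)

def sample_from_category(df, category, rng):
    """Sample one row from the requested category, falling back to the
    full frame when the category slice is empty, so .sample(1) is never
    called on an empty DataFrame."""
    subset = df[df["category"] == category]
    if subset.empty:
        subset = df  # fallback: category has no rows
    return subset.sample(1, random_state=rng)

# "physics" does not exist in the frame, so this falls back safely
# instead of raising from an empty .sample(1) call.
row = sample_from_category(hle_train, "physics", rng=0)
print(row["question"].iloc[0])
```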

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e326fde and f42d153.

📒 Files selected for processing (6)
  • docs/evaluation/speculative-decoding.md
  • mkdocs.yml
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/specdec.py
  • nemo_skills/evaluation/metrics/specdec_metrics.py

In all SD benchmarks we want to measure two qualitative metrics for draft accuracy/quality: acceptance length (AL) and acceptance rate (AR).
Another metric in this group is the conditional acceptance rate (or per-position acceptance rate), which measures the acceptance rate at a given draft position, conditioned on all previous tokens in that step having been accepted.

For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
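To make these definitions concrete, here is a minimal standalone sketch (illustrative only, not the benchmark's implementation) that computes AL, AR, and the per-position conditional acceptance rate from per-step accepted-token counts, assuming each speculation step proposes `k` draft tokens:

```python
def specdec_metrics(accepted_per_step, k):
    """Compute acceptance length (AL), acceptance rate (AR), and the
    per-position conditional acceptance rate from a list with one
    accepted-token count per speculation step of k proposed tokens."""
    steps = len(accepted_per_step)
    total_accepted = sum(accepted_per_step)
    # AL: average number of accepted draft tokens per step.
    al = total_accepted / steps
    # AR: fraction of all proposed draft tokens that were accepted.
    ar = total_accepted / (steps * k)
    # Conditional acceptance rate at position i: among steps where the
    # first i draft tokens were all accepted, the fraction that also
    # accepted the token at position i.
    conditional = []
    for pos in range(k):
        reached = sum(1 for a in accepted_per_step if a >= pos)
        passed = sum(1 for a in accepted_per_step if a >= pos + 1)
        conditional.append(passed / reached if reached else 0.0)
    return al, ar, conditional

al, ar, cond = specdec_metrics([3, 1, 2, 0], k=3)
print(f"AL={al}, AR={ar}, conditional={cond}")  # AL=1.5, AR=0.5
```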
⚠️ Potential issue | 🟡 Minor

Tighten doc lint compliance (link text + code fence language).

Use descriptive link labels instead of “here”, and set a language for the metrics output block.

Proposed fix
-For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
+For more advanced evaluation of SD, including throughput and per-category metrics, see the [Model Optimizer specdec_bench examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).

-- Original benchmark source is [here](https://huggingface.co/datasets/nvidia/SPEED-Bench).
+- Original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).

-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38


Also applies to: 32-32, 77-81

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>

[warning] 9-9: Link text should be descriptive

(MD059, descriptive-link-text)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/evaluation/speculative-decoding.md at line 9, Replace the vague "here"
link text with a descriptive label (e.g., "Speculative decoding evaluation
framework on GitHub") and update the metrics output code fence to include a
language specifier (use ```text) so the table is rendered correctly;
specifically, find the markdown link whose anchor text is "here" and replace it
with a descriptive label, and change the fenced block that shows the
"speed-bench" metrics to start with ```text (apply the same fixes to the other
occurrences mentioned in the comment).


</details>


Comment on lines +620 to +622
output_path = args.output_dir / f"{config}.jsonl"
dataset.to_json(output_path)
LOG.info(f" -> Saved to {output_path}")

⚠️ Potential issue | 🟠 Major

Write prepared benchmark files atomically to avoid partial-output corruption.

Direct writes to the final path risk truncating existing outputs if serialization fails mid-write.

Proposed fix
         output_path = args.output_dir / f"{config}.jsonl"
-        dataset.to_json(output_path)
+        args.output_dir.mkdir(parents=True, exist_ok=True)
+        tmp_output_path = output_path.with_suffix(output_path.suffix + ".tmp")
+        dataset.to_json(tmp_output_path)
+        tmp_output_path.replace(output_path)
         LOG.info(f"  -> Saved to {output_path}")

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 620 - 622, The
current direct call to dataset.to_json(output_path) can truncate or corrupt an
existing file if serialization fails; change this to write to a temporary file
in the same directory (e.g., output_path.with_suffix(".jsonl.tmp") or similar),
perform the full dataset.to_json to that temp path, flush/close it, then
atomically move/replace the temp file to output_path (use Path.replace or
os.replace) and finally LOG.info the final output_path; reference output_path,
args.output_dir, dataset.to_json and LOG.info when making the change.
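The suggested write-then-replace pattern can be sketched generically; `atomic_write_jsonl` and its arguments below are illustrative helpers, not code from the PR:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_jsonl(records, output_path: Path):
    """Serialize all records to a temp file in the target directory,
    then atomically replace the final path, so a failure mid-write
    never leaves a truncated or corrupted output file."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        # Atomic on POSIX when source and target share a filesystem.
        os.replace(tmp_name, output_path)
    except BaseException:
        os.unlink(tmp_name)
        raise

atomic_write_jsonl([{"question": "q1"}, {"question": "q2"}], Path("out/config.jsonl"))
```

Because the temp file lives in the same directory as the target, `os.replace` is a rename rather than a cross-device copy, which is what makes the swap atomic.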

Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
@gwarmstrong
Collaborator

closing in favor of #1279
