
Add SPEED-bench #1276

Closed
talorabr wants to merge 10 commits into NVIDIA-NeMo:main from talorabr:speed-bench

Conversation

@talorabr
Collaborator

@talorabr talorabr commented Feb 25, 2026

PR to support speculative decoding (SD) evaluation, and specifically the SPEED-Bench dataset (a new standard benchmark for speculative decoding):

  • Add SD metrics.
  • Add SD evaluator and generation flows.
  • Add SPEED-Bench dataset for SD evaluation.
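
The SD metrics added here (e.g., acceptance rate and acceptance length) can be derived from raw draft/accept counters. The sketch below is illustrative, not the PR's actual API; `acceptance_length` uses the common definition of accepted tokens per draft step plus the one token the target model always emits:

```python
def acceptance_metrics(num_drafts: int, draft_tokens: int, accepted_tokens: int) -> dict:
    """Derive speculative-decoding acceptance statistics from raw counters.

    acceptance_rate: fraction of drafted tokens the target model accepted.
    acceptance_length: average tokens emitted per draft step (accepted tokens
    plus the one guaranteed target-model token).
    """
    if num_drafts == 0 or draft_tokens == 0:
        return {"acceptance_rate": 0.0, "acceptance_length": 0.0}
    return {
        "acceptance_rate": accepted_tokens / draft_tokens,
        "acceptance_length": (accepted_tokens + num_drafts) / num_drafts,
    }

# e.g. 10 draft steps, 40 drafted tokens, 30 accepted
# -> acceptance_rate 0.75, acceptance_length 4.0
```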

Summary by CodeRabbit

  • New Features

    • End-to-end Speculative Decoding (SD) evaluation: metric collection, delta computation, generation-time injection, and evaluator integration
    • SPEED-Bench data preparation CLI to generate per-config evaluation datasets
    • New SD reporting metrics: acceptance length, acceptance rate, and per-position statistics
  • Documentation

    • Added Speculative Decoding evaluation guide with examples and benchmark setup
  • Chores

    • Improved shell-argument quoting in server launch scripts

Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f42d153 and 8b4a5d3.

📒 Files selected for processing (1)
  • nemo_skills/dataset/speed-bench/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/speed-bench/__init__.py

📝 Walkthrough

Walkthrough

Adds end-to-end Speculative Decoding (SD) evaluation: docs and SPEED-Bench prep, new SpecdecGenerationTask to capture/compute pre/post spec-decoding metrics from vLLM or SGLang (Prometheus or per-request files), metrics collector and evaluator registration, and two minor server CLI quoting fixes.

Changes

Cohort / File(s) Summary
Documentation
docs/evaluation/index.md, docs/evaluation/speculative-decoding.md, mkdocs.yml
Add speculative decoding docs, index nav entry, SPEED-Bench usage and examples.
SPEED-Bench dataset prep
nemo_skills/dataset/speed-bench/__init__.py, nemo_skills/dataset/speed-bench/prepare.py
New dataset package constants and comprehensive CLI script to fetch external datasets, generate/pad prompts for many benchmarks, resolve turns to messages, and emit per-config JSONL outputs.
Generation / inference (specdec)
nemo_skills/inference/eval/specdec.py
Major feature: SpecdecGenerationTask, SpecDecodeMetrics model, vLLM and SGLang Prometheus and per-request parsers, delta computation, injection of specdec_stats into eval_config, Hydra entrypoint and helpers.
Evaluator & metrics integration
nemo_skills/evaluation/evaluator/specdec.py, nemo_skills/evaluation/evaluator/__init__.py, nemo_skills/evaluation/metrics/specdec_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
New specdec evaluator that stamps specdec_stats, SpecdecMetrics implementation, and registration in evaluator/metrics maps.
Server CLI quoting fixes
nemo_skills/inference/server/serve_sglang.py, nemo_skills/inference/server/serve_vllm.py
Use shlex.join for shell arg quoting when composing extra server arguments.
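
For context, `shlex.join` quotes each argument individually, so values containing spaces or shell metacharacters survive being composed into one command string:

```python
import shlex

# Hypothetical extra server arguments; "my template.jinja" contains a space
extra_args = ["--chat-template", "my template.jinja", "--seed", "42"]
cmd = f"python -m server {shlex.join(extra_args)}"
print(cmd)
# -> python -m server --chat-template 'my template.jinja' --seed 42
```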

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GenTask as SpecdecGenerationTask
    participant Server as vLLM/SGLang
    participant Metrics as Prometheus/metrics file
    participant Evaluator

    User->>GenTask: start generation+evaluation(cfg)
    GenTask->>Metrics: fetch_metrics("before")
    Metrics-->>GenTask: before_snapshot
    GenTask->>Server: send_prompts / start_inference
    Server-->>GenTask: completions (+ optional per-request metrics files)
    GenTask->>Metrics: fetch_metrics("after") or read metrics files
    Metrics-->>GenTask: after_snapshot
    GenTask->>GenTask: compute_delta(before_snapshot, after_snapshot)
    GenTask->>GenTask: inject specdec_stats into eval_config
    GenTask->>Evaluator: run eval_specdec(config_with_stats)
    Evaluator-->>User: evaluation_results (data stamped with specdec metrics)
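
The compute_delta step in the diagram amounts to subtracting monotonically increasing counters between the two snapshots. A minimal sketch (counter names are illustrative, not the actual vLLM/SGLang metric names):

```python
def compute_delta(before: dict, after: dict) -> dict:
    """Difference of counter snapshots; counters absent before start at zero."""
    return {key: after[key] - before.get(key, 0) for key in after}

before = {"num_drafts": 100, "num_accepted_tokens": 250}
after = {"num_drafts": 160, "num_accepted_tokens": 430}
delta = compute_delta(before, after)
# -> {"num_drafts": 60, "num_accepted_tokens": 180}
```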

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

enhancement

Suggested reviewers

  • gwarmstrong
  • ekmb
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 48.89%, which is below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check name: Title check
Status: ✅ Passed
Explanation: The title 'Add SPEED-bench' directly and specifically describes the main change: adding support for the SPEED-bench dataset and speculative decoding evaluation infrastructure throughout the codebase.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (3)
nemo_skills/dataset/speed-bench/prepare.py (1)

293-299: Redundant encoder.encode call.

Line 298 re-encodes prompt even though the result of the same call is already in tokens from line 293.

♻️ Proposed fix
     tokens = encoder.encode(prompt, disallowed_special=())
     current_num_tokens = len(tokens)
 
     if current_num_tokens > target_num_tokens:
         # Truncate if too long
-        tokens = encoder.encode(prompt, disallowed_special=())
         return encoder.decode(tokens[:target_num_tokens])
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 293 - 299, The code
redundantly calls encoder.encode(prompt, disallowed_special=()) twice; use the
already-computed tokens variable when truncating instead of re-encoding. In the
block that checks if current_num_tokens > target_num_tokens (around variables
tokens, current_num_tokens, target_num_tokens), remove the second encoder.encode
call and simply return encoder.decode(tokens[:target_num_tokens]) so the
existing tokens are reused.
nemo_skills/inference/eval/specdec.py (1)

440-448: Temp directory is never cleaned up.

_ensure_sglang_metrics_dir creates a temporary directory that persists after the process exits. Consider using atexit or a context manager to clean up, or at minimum document that cleanup is the caller's responsibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 440 - 448, The temp dir
created by _ensure_sglang_metrics_dir (uses cls._sglang_metrics_dir) is never
removed; register an atexit cleanup that removes the directory (use
shutil.rmtree) when the process exits, handle and log OSErrors, and ensure
registration only happens once (e.g., when you first create
cls._sglang_metrics_dir) so the temp dir is cleaned up automatically;
alternatively, add a documented method (e.g., cleanup_sglang_metrics_dir) to
explicitly delete cls._sglang_metrics_dir and call it from atexit if you prefer
manual control.
nemo_skills/evaluation/metrics/specdec_metrics.py (1)

87-98: no_answer branch is unreachable given compute_no_answer=False.

Since __init__ hardcodes compute_no_answer=False, the block at lines 96-97 can never execute. This is fine as defensive code for potential subclasses, but worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 87 - 98, The
metrics_to_print method contains an unreachable branch because __init__
hardcodes self.compute_no_answer=False; either make compute_no_answer
configurable in __init__ (add a parameter and assign to self.compute_no_answer)
so the "no_answer" key (as_percentage) can be enabled, or remove the dead branch
entirely; update the constructor (the __init__ method) and references to
compute_no_answer or delete the conditional block that adds "no_answer" from
metrics_to_print (affecting metrics_to_print, compute_no_answer, __init__, and
the "no_answer"/as_percentage handling) to keep behavior consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 62-73: The multi-line shell invocation for ns eval in
docs/evaluation/speculative-decoding.md is broken because some lines (notably
the line containing the flag --server_type=sglang and the line before the
++inference.tokens_to_generate stanza) are missing trailing backslashes, causing
the following lines (e.g., ++inference.tokens_to_generate=1024) to be executed
as separate commands; fix by adding the missing trailing "\" to the end of the
broken lines so the entire ns eval invocation (including flags like
--server_type=sglang, --server_gpus=8, and the subsequent
++inference.tokens_to_generate entry) is one continuous multi-line command, and
apply the same fix to the other affected block noted in the comment (the block
covering the range that includes ++inference.tokens_to_generate).
- Line 7: Typo in the metric name: replace "accetpance" with "acceptance" in the
phrase "conditional acceptance rate (or per-position accetpance rate)" so the
text reads "conditional acceptance rate (or per-position acceptance rate)";
update the same phrase wherever it appears in
docs/evaluation/speculative-decoding.md (look for "conditional acceptance rate"
/ "per-position accetpance rate") to keep terminology consistent.

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Line 363: The LOG.warning call is passing character as a positional argument
but the message has no %s placeholder; update the LOG.warning invocation (the
LOG object usage in prepare.py) to include a format placeholder (e.g., add " %s"
in the message) so the character value is interpolated (i.e., change the call
that currently reads LOG.warning("no motivation provided for character",
character) to use a format string like "no motivation provided for character %s"
with character as the second argument).
- Around line 196-211: The parameter named answer is being shadowed by the loop
variables in the loops "for i, answer in enumerate(answers)" and "for i, answer
in enumerate(answers_to_add)"; rename the loop variables (for example to
candidate or opt_answer) and update all uses inside those loops
(encoder.encode(...) and prompt += f"A{i + 1}...") so they reference the new
loop name, preserving the existing logic that skips the correct_answer_i and
builds answers_to_add, to avoid clobbering the outer parameter value.
- Around line 414-420: The while loop that samples from hle_train_category can
fail or behave unexpectedly if hle_train_category is empty; before entering the
loop in prepare.py (where hle_train_category.sample(1, random_state=rng) is
used), add a guard to check hle_train_category.empty and handle it explicitly
(e.g., raise a clear ValueError or return a sensible fallback) so the function
(which later calls encoder.decode with prompt_tokens and example_tokens) fails
with an informative message rather than crashing inside sample() or producing
wrong output; reference hle_train_category, the while current_num_tokens <
num_tokens loop, and encoder.decode when implementing the fix.
- Around line 136-137: Remove the incorrect `@staticmethod` decorator lines from
all module-level functions (e.g., _generate_stackselect_prompt and the other
functions declared at the noted occurrences) so they are plain top-level
functions; locate each def decorated with `@staticmethod` (the ones reported at
lines 137, 217, 273, 290, 315, 321, 332, 382, 400, 424 in the review) and delete
only the decorator line, leaving the def signature and body unchanged.
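
The LOG.warning fix above follows the standard logging idiom: put a %s placeholder in the message and pass the value separately, so interpolation is deferred until the record is actually rendered (the character value here is illustrative):

```python
import logging

LOG = logging.getLogger(__name__)
character = "example_character"  # illustrative value

# Broken: no placeholder, so the extra argument is never interpolated and
# logging reports a formatting error when the record is rendered.
# LOG.warning("no motivation provided for character", character)

# Fixed: %s placeholder with the value as a separate argument.
LOG.warning("no motivation provided for character %s", character)
```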

In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Around line 69-71: Check and validate eval_config.input_file before attempting
file I/O: ensure jsonl_file = eval_config.input_file is not None, is a non-empty
string (and optionally a valid path), and if not raise a clear ValueError (e.g.
"input_file must be a non-empty string") before the with open(...) block in
specdec.py so the open call no longer raises a generic TypeError; update any
callers or docstring if needed to reflect the validation on
eval_config.input_file.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 67-85: In get_metrics(), guard the speculative-decoding division
by checking self.total before dividing (refer to method get_metrics, the loop
handling metric_key.startswith("spec_") and attribute self.total); if self.total
is zero, set the speculative metric value to 0 (or another sentinel like
None/float("nan") if preferred) instead of performing metric_value / self.total,
then continue to call update_common_metrics and _add_std_metrics as before.

In `@nemo_skills/inference/eval/specdec.py`:
- Around line 601-633: In process_single_datapoint, current_response sometimes
lacks the "response" key so later list comprehension accessing
response["response_id"] can KeyError; ensure every current_response includes a
response_id (e.g., after computing current_response check if "response" not in
current_response then set current_response["response_id"]=None) or change the
aggregation to use response.get("response_id") when building "response_ids";
update the logic around current_response, responses, and the final return to
consistently provide/expect "response_id" (refer to process_single_datapoint,
current_response, responses, and the final return's "response_ids"
construction).
- Around line 546-599: The code writes data points that lack metric fields and
then aggregates over all data_points causing KeyError; fix by only writing and
aggregating matched data points: in the loop that checks if all(response_id in
metrics for response_id in data_point["response_ids"]) (the block that updates
data_point with num_drafts, draft_tokens, etc.) add matched_data_points = [] (or
append to an existing list) and write/append only when the metrics check passes,
keep the LOG.warning for unmatched ones but do NOT write them to fout or include
them in the list used for the final return_value sums; finally, compute
return_value by summing over matched_data_points (not data_points) and handle
empty matched_data_points (return None or zeros) to avoid division by zero.
- Around line 720-734: specdec_stats may be None after calling
compute_sglang_spec_decode_delta (and similarly for the VLLM path), so guard the
logging that dereferences specdec_stats: after assigning specdec_stats =
compute_sglang_spec_decode_delta(...) (and for the VLLM equivalent
compute_vllm_spec_decode_delta/variable), check "if specdec_stats is None" and
in that branch log a clear fallback message (e.g., "no delta available, skipping
Prometheus delta logging") and skip the formatted access; otherwise log using
specdec_stats["acceptance_length"], ["acceptance_rate"], ["num_drafts"],
["draft_tokens"] as before. Ensure you apply the same None-guard to the VLLM
block to prevent TypeError.
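
The None-guard described above might be factored into a small helper like this sketch (field names mirror the review comment; the PR's actual `compute_*_spec_decode_delta` functions are not reproduced here):

```python
import logging

LOG = logging.getLogger(__name__)


def log_specdec_stats(specdec_stats) -> bool:
    """Log delta stats when available; return whether stats were logged."""
    if specdec_stats is None:
        LOG.warning("no delta available, skipping Prometheus delta logging")
        return False
    LOG.info(
        "acceptance_length=%.3f acceptance_rate=%.3f num_drafts=%d draft_tokens=%d",
        specdec_stats["acceptance_length"],
        specdec_stats["acceptance_rate"],
        specdec_stats["num_drafts"],
        specdec_stats["draft_tokens"],
    )
    return True
```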

---

Nitpick comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 293-299: The code redundantly calls encoder.encode(prompt,
disallowed_special=()) twice; use the already-computed tokens variable when
truncating instead of re-encoding. In the block that checks if
current_num_tokens > target_num_tokens (around variables tokens,
current_num_tokens, target_num_tokens), remove the second encoder.encode call
and simply return encoder.decode(tokens[:target_num_tokens]) so the existing
tokens are reused.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 87-98: The metrics_to_print method contains an unreachable branch
because __init__ hardcodes self.compute_no_answer=False; either make
compute_no_answer configurable in __init__ (add a parameter and assign to
self.compute_no_answer) so the "no_answer" key (as_percentage) can be enabled,
or remove the dead branch entirely; update the constructor (the __init__ method)
and references to compute_no_answer or delete the conditional block that adds
"no_answer" from metrics_to_print (affecting metrics_to_print,
compute_no_answer, __init__, and the "no_answer"/as_percentage handling) to keep
behavior consistent.

In `@nemo_skills/inference/eval/specdec.py`:
- Around line 440-448: The temp dir created by _ensure_sglang_metrics_dir (uses
cls._sglang_metrics_dir) is never removed; register an atexit cleanup that
removes the directory (use shutil.rmtree) when the process exits, handle and log
OSErrors, and ensure registration only happens once (e.g., when you first create
cls._sglang_metrics_dir) so the temp dir is cleaned up automatically;
alternatively, add a documented method (e.g., cleanup_sglang_metrics_dir) to
explicitly delete cls._sglang_metrics_dir and call it from atexit if you prefer
manual control.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a32b13 and 15df539.

📒 Files selected for processing (11)
  • docs/evaluation/index.md
  • docs/evaluation/speculative-decoding.md
  • nemo_skills/dataset/speed-bench/__init__.py
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/specdec.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/specdec_metrics.py
  • nemo_skills/inference/eval/specdec.py
  • nemo_skills/inference/server/serve_sglang.py
  • nemo_skills/inference/server/serve_vllm.py

Comment on lines +196 to +211
    for i, answer in enumerate(answers):
        if i == correct_answer_i:
            continue
        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
        answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
        if all_tokens + answer_to_add_tokens > num_tokens:
            break
        answers_to_add_stop = i
    answers_to_add = (
        answers[: answers_to_add_stop + 1]
        if answers_to_add_stop >= correct_answer_i
        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
    )
    random.shuffle(answers_to_add)
    for i, answer in enumerate(answers_to_add):
        prompt += f"A{i + 1}:\n\n{answer}\n\n"

⚠️ Potential issue | 🟡 Minor

Variable answer is shadowed by loop variables.

The parameter answer (line 137) is shadowed by the loop variable on line 196 (for i, answer in enumerate(answers)) and again on line 210 (for i, answer in enumerate(answers_to_add)). This makes the code confusing and error-prone if the original answer parameter is needed after these loops.

🛡️ Proposed fix — rename loop variables
-    for i, answer in enumerate(answers):
+    for i, ans in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, ans in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{ans}\n\n"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    for i, answer in enumerate(answers):
+    for i, ans in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
         answers_to_add_stop = i
     answers_to_add = (
         answers[: answers_to_add_stop + 1]
         if answers_to_add_stop >= correct_answer_i
         else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
     )
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, ans in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{ans}\n\n"
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 207-207: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 196 - 211, The
parameter named answer is being shadowed by the loop variables in the loops "for
i, answer in enumerate(answers)" and "for i, answer in
enumerate(answers_to_add)"; rename the loop variables (for example to candidate
or opt_answer) and update all uses inside those loops (encoder.encode(...) and
prompt += f"A{i + 1}...") so they reference the new loop name, preserving the
existing logic that skips the correct_answer_i and builds answers_to_add, to
avoid clobbering the outer parameter value.

Comment on lines +69 to +71
    jsonl_file = eval_config.input_file
    with open(jsonl_file, "rt", encoding="utf-8") as fin:
        data = [json.loads(line) for line in fin]

⚠️ Potential issue | 🟡 Minor

Add explicit input_file validation before file I/O.

If input_file is absent, this currently fails with a generic TypeError at open-time. A direct check yields clearer failure semantics.

Suggested fix
     jsonl_file = eval_config.input_file
+    if not jsonl_file:
+        raise ValueError("`input_file` must be provided for specdec evaluation")
     with open(jsonl_file, "rt", encoding="utf-8") as fin:
         data = [json.loads(line) for line in fin]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     jsonl_file = eval_config.input_file
+    if not jsonl_file:
+        raise ValueError("`input_file` must be provided for specdec evaluation")
     with open(jsonl_file, "rt", encoding="utf-8") as fin:
         data = [json.loads(line) for line in fin]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 69 - 71, Check and
validate eval_config.input_file before attempting file I/O: ensure jsonl_file =
eval_config.input_file is not None, is a non-empty string (and optionally a
valid path), and if not raise a clear ValueError (e.g. "input_file must be a
non-empty string") before the with open(...) block in specdec.py so the open
call no longer raises a generic TypeError; update any callers or docstring if
needed to reflect the validation on eval_config.input_file.

Comment on lines +67 to +85
    def get_metrics(self) -> dict:
        """Get all computed metrics including speculative decoding statistics.

        Returns:
            Nested dict of evaluation mode → metric name → value.
        """
        metrics_dict = {}

        for agg_mode, agg_metric_dict in self.eval_dict.items():
            metrics_dict[agg_mode] = {}
            self.update_common_metrics(metrics_dict[agg_mode])
            for metric_key, metric_value in agg_metric_dict.items():
                if metric_key.startswith("spec_"):
                    metrics_dict[agg_mode][metric_key] = metric_value / self.total
                else:
                    metrics_dict[agg_mode][metric_key] = metric_value
        self._add_std_metrics(metrics_dict)

        return metrics_dict

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's explore the structure of BaseMetrics and related classes
cd nemo_skills && find . -type f -name "*metrics*.py" | head -20

Repository: NVIDIA-NeMo/Skills

Length of output: 1727


🏁 Script executed:

# Search for BaseMetrics class definition and self.total usage
rg -n "class BaseMetrics" --type=py -A 20

Repository: NVIDIA-NeMo/Skills

Length of output: 1978


🏁 Script executed:

# Check how self.total is initialized and used throughout the codebase
rg -n "self\.total\s*=" --type=py -B 2 -A 2

Repository: NVIDIA-NeMo/Skills

Length of output: 372


🏁 Script executed:

# Look for get_metrics implementations in other metric classes
rg -n "def get_metrics" --type=py -B 5 -A 20

Repository: NVIDIA-NeMo/Skills

Length of output: 45882


Guard against self.total == 0 before dividing by it in speculative decoding metrics.

If get_metrics() is called before any predictions are ingested (or after an empty update), the metric_value / self.total on line 80 will raise a ZeroDivisionError.

Proposed fix
     def get_metrics(self) -> dict:
         metrics_dict = {}
+        if self.total == 0:
+            return metrics_dict
 
         for agg_mode, agg_metric_dict in self.eval_dict.items():
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 67 - 85, In
get_metrics(), guard the speculative-decoding division by checking self.total
before dividing (refer to method get_metrics, the loop handling
metric_key.startswith("spec_") and attribute self.total); if self.total is zero,
set the speculative metric value to 0 (or another sentinel like
None/float("nan") if preferred) instead of performing metric_value / self.total,
then continue to call update_common_metrics and _add_std_metrics as before.

Comment on lines +546 to +599
        data_points = []
        with open(self.cfg.output_file, "rt", encoding="utf-8") as fin:
            for i, line in enumerate(fin):
                try:
                    data_points.append(json.loads(line))
                except json.JSONDecodeError:
                    LOG.warning(f"Failed to parse JSON line {i} for metrics injection, skipping")
                    continue

        with open(self.cfg.output_file, "wt", encoding="utf-8") as fout:
            for data_point in data_points:
                if all(response_id in metrics for response_id in data_point["response_ids"]):
                    data_point.update(
                        {
                            "num_drafts": sum(
                                metrics[response_id]["num_drafts"] for response_id in data_point["response_ids"]
                            ),
                            "draft_tokens": sum(
                                metrics[response_id]["draft_tokens"] for response_id in data_point["response_ids"]
                            ),
                            "accepted_tokens": sum(
                                metrics[response_id]["accepted_tokens"] for response_id in data_point["response_ids"]
                            ),
                            "acceptance_rate": sum(
                                metrics[response_id]["acceptance_rate"] for response_id in data_point["response_ids"]
                            )
                            / len(data_point["response_ids"]),
                            "acceptance_length": sum(
                                metrics[response_id]["acceptance_length"] for response_id in data_point["response_ids"]
                            )
                            / len(data_point["response_ids"]),
                        }
                    )
                else:
                    LOG.warning(
                        "No metrics found for response_ids: %s, skipping data point", data_point["response_ids"]
                    )
                fout.write(json.dumps(data_point) + "\n")

        try:
            return_value = {
                "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
                "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
                "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
                "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
                / len(data_points),
                "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
                / len(data_points),
                "per_position_acceptance_rates": [],
            }
        except KeyError:
            LOG.warning("Metrics injection failed for some data points, skipping")
            return None
        return return_value

⚠️ Potential issue | 🟠 Major

Data points missing metrics are written without spec fields, causing the final aggregation to KeyError.

When response_ids are not all found in metrics (line 557), the data point is written to the output file without the spec-decode keys (line 583 runs unconditionally). Then lines 587-591 iterate over all data_points including those missing the keys, which raises a KeyError caught on line 596, discarding the entire result.

Consider either skipping unmatched data points from the aggregation or not writing them to the output file.

🐛 Proposed fix — track matched data points for aggregation
+        matched_data_points = []
         with open(self.cfg.output_file, "wt", encoding="utf-8") as fout:
             for data_point in data_points:
                 if all(response_id in metrics for response_id in data_point["response_ids"]):
                     data_point.update(
                         {
                             ...
                         }
                     )
+                    matched_data_points.append(data_point)
                 else:
                     LOG.warning(
                         "No metrics found for response_ids: %s, skipping data point", data_point["response_ids"]
                     )
                 fout.write(json.dumps(data_point) + "\n")
 
+        if not matched_data_points:
+            LOG.warning("No data points matched metrics, skipping aggregation")
+            return None
+
         try:
             return_value = {
-                "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
-                "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
-                "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
-                "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
-                / len(data_points),
-                "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
-                / len(data_points),
+                "num_drafts": sum(dp["num_drafts"] for dp in matched_data_points),
+                "draft_tokens": sum(dp["draft_tokens"] for dp in matched_data_points),
+                "accepted_tokens": sum(dp["accepted_tokens"] for dp in matched_data_points),
+                "acceptance_rate": sum(dp["acceptance_rate"] for dp in matched_data_points)
+                / len(matched_data_points),
+                "acceptance_length": sum(dp["acceptance_length"] for dp in matched_data_points)
+                / len(matched_data_points),
                 "per_position_acceptance_rates": [],
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 546 - 599, The code
writes data points that lack metric fields and then aggregates over all
data_points causing KeyError; fix by only writing and aggregating matched data
points: in the loop that checks if all(response_id in metrics for response_id in
data_point["response_ids"]) (the block that updates data_point with num_drafts,
draft_tokens, etc.) add matched_data_points = [] (or append to an existing list)
and write/append only when the metrics check passes, keep the LOG.warning for
unmatched ones but do NOT write them to fout or include them in the list used
for the final return_value sums; finally, compute return_value by summing over
matched_data_points (not data_points) and handle empty matched_data_points
(return None or zeros) to avoid division by zero.
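The aggregation the fix describes can be exercised in isolation. A minimal sketch of the intended behavior (the helper name and the None-on-empty convention are assumptions for illustration, not code from the PR):

```python
def aggregate_specdec(matched_data_points):
    """Aggregate per-request speculative-decoding stats over matched data
    points only; return None when nothing matched, so callers never divide
    by zero."""
    if not matched_data_points:
        return None
    n = len(matched_data_points)
    return {
        "num_drafts": sum(dp["num_drafts"] for dp in matched_data_points),
        "draft_tokens": sum(dp["draft_tokens"] for dp in matched_data_points),
        "accepted_tokens": sum(dp["accepted_tokens"] for dp in matched_data_points),
        # rates and lengths are per-request averages, not sums
        "acceptance_rate": sum(dp["acceptance_rate"] for dp in matched_data_points) / n,
        "acceptance_length": sum(dp["acceptance_length"] for dp in matched_data_points) / n,
    }
```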

Comment on lines +601 to +633
async def process_single_datapoint(self, data_point, all_data, prompt_format="openai"):
    """Handle single-turn and multi-turn generation.

    For single-turn data points (or when ``multiturn`` is ``False``), this
    falls through to the base-class implementation.

    For multi-turn data points, it loops through every turn, building up
    the conversation and accumulating token counts.

    Also tracks request IDs for SGLang metrics matching.
    """

    messages = []
    responses = []

    for message in data_point["messages"]:
        messages.append(message)
        new_data_point = {"messages": messages}
        current_response = await super().process_single_datapoint(
            new_data_point, all_data, prompt_format=prompt_format
        )
        if "response" in current_response:
            raw_response = current_response.pop("response")
            current_response["response_id"] = raw_response.id
        responses.append(current_response)
        messages.append({"role": "assistant", "content": current_response["generation"]})

    # Return aggregated results
    return {
        "generation": [response["generation"] for response in responses],
        "num_generated_tokens": sum([response["num_generated_tokens"] for response in responses]),
        "response_ids": [response["response_id"] for response in responses],
    }

⚠️ Potential issue | 🟠 Major

response_id may be missing if the response object lacks a "response" key.

On line 622, if "response" is not in current_response, the response_id field is never set. However, line 632 unconditionally accesses response["response_id"] for every response, which would raise a KeyError.

🛡️ Proposed fix
-            if "response" in current_response:
-                raw_response = current_response.pop("response")
-                current_response["response_id"] = raw_response.id
+            raw_response = current_response.pop("response", None)
+            current_response["response_id"] = raw_response.id if raw_response is not None else None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 601 - 633, In
process_single_datapoint, current_response sometimes lacks the "response" key so
later list comprehension accessing response["response_id"] can KeyError; ensure
every current_response includes a response_id (e.g., after computing
current_response check if "response" not in current_response then set
current_response["response_id"]=None) or change the aggregation to use
response.get("response_id") when building "response_ids"; update the logic
around current_response, responses, and the final return to consistently
provide/expect "response_id" (refer to process_single_datapoint,
current_response, responses, and the final return's "response_ids"
construction).

Comment on lines +720 to +734
# ----- Strategy 2: Prometheus before/after fallback -----
if specdec_stats is None:
    LOG.info("Falling back to SGLang Prometheus before/after delta…")
    LOG.info("Fetching AFTER SGLang spec-decode metrics from %s/metrics", server_address)
    after_sglang = fetch_sglang_spec_decode_metrics(server_address)

    specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
    LOG.info(
        "SGLang Prometheus delta: acceptance_length=%.4f, "
        "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
        specdec_stats["acceptance_length"],
        specdec_stats["acceptance_rate"],
        specdec_stats["num_drafts"],
        specdec_stats["draft_tokens"],
    )

⚠️ Potential issue | 🔴 Critical

specdec_stats can be None when logged on lines 730-734, causing TypeError.

compute_sglang_spec_decode_delta can return None (e.g. when delta_requests <= 0). The code on line 726 assigns the result to specdec_stats, but immediately accesses specdec_stats["acceptance_length"] on line 730 without a null check. The same pattern affects the VLLM path on lines 741-749.

🐛 Proposed fix — guard both logging blocks
             # ----- Strategy 2: Prometheus before/after fallback -----
             if specdec_stats is None:
                 LOG.info("Falling back to SGLang Prometheus before/after delta…")
                 LOG.info("Fetching AFTER SGLang spec-decode metrics from %s/metrics", server_address)
                 after_sglang = fetch_sglang_spec_decode_metrics(server_address)
 
                 specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
-                LOG.info(
-                    "SGLang Prometheus delta: acceptance_length=%.4f, "
-                    "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
-                    specdec_stats["acceptance_length"],
-                    specdec_stats["acceptance_rate"],
-                    specdec_stats["num_drafts"],
-                    specdec_stats["draft_tokens"],
-                )
+                if specdec_stats is not None:
+                    LOG.info(
+                        "SGLang Prometheus delta: acceptance_length=%.4f, "
+                        "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
+                        specdec_stats["acceptance_length"],
+                        specdec_stats["acceptance_rate"],
+                        specdec_stats["num_drafts"],
+                        specdec_stats["draft_tokens"],
+                    )
+                else:
+                    LOG.warning("SGLang: failed to compute spec-decode delta.")
         else:
             # VLLM: Fetch "after" snapshot and compute delta
             LOG.info("Fetching AFTER spec-decode metrics from %s/metrics", server_address)
             after_metrics = fetch_vllm_spec_decode_metrics(server_address)
 
             specdec_stats = compute_vllm_spec_decode_delta(self._before_metrics, after_metrics)
-            LOG.info(
-                "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
-                "acceptance_rate=%.2f%%, acceptance_length=%.4f",
-                specdec_stats["num_drafts"],
-                specdec_stats["draft_tokens"],
-                specdec_stats["accepted_tokens"],
-                specdec_stats["acceptance_rate"],
-                specdec_stats["acceptance_length"],
-            )
-            if specdec_stats["per_position_acceptance_rates"]:
+            if specdec_stats is not None:
                 LOG.info(
-                    "Per-position acceptance rates: %s",
-                    [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+                    "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
+                    "acceptance_rate=%.2f%%, acceptance_length=%.4f",
+                    specdec_stats["num_drafts"],
+                    specdec_stats["draft_tokens"],
+                    specdec_stats["accepted_tokens"],
+                    specdec_stats["acceptance_rate"],
+                    specdec_stats["acceptance_length"],
                 )
+                if specdec_stats["per_position_acceptance_rates"]:
+                    LOG.info(
+                        "Per-position acceptance rates: %s",
+                        [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+                    )
+            else:
+                LOG.warning("VLLM: failed to compute spec-decode delta.")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 720 - 734, specdec_stats
may be None after calling compute_sglang_spec_decode_delta (and similarly for
the VLLM path), so guard the logging that dereferences specdec_stats: after
assigning specdec_stats = compute_sglang_spec_decode_delta(...) (and for the
VLLM equivalent compute_vllm_spec_decode_delta/variable), check "if
specdec_stats is None" and in that branch log a clear fallback message (e.g.,
"no delta available, skipping Prometheus delta logging") and skip the formatted
access; otherwise log using specdec_stats["acceptance_length"],
["acceptance_rate"], ["num_drafts"], ["draft_tokens"] as before. Ensure you
apply the same None-guard to the VLLM block to prevent TypeError.
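The before/after delta that the guard protects can be sketched in isolation. A minimal version assuming plain counter dicts rather than real Prometheus snapshots — the key names and the acceptance-length formula are assumptions inferred from the logged fields, not the PR's actual helpers:

```python
def compute_spec_decode_delta(before: dict, after: dict):
    """Difference of monotonic spec-decode counters between two metric
    snapshots; returns None when no draft steps ran in between."""
    num_drafts = after["num_drafts"] - before["num_drafts"]
    if num_drafts <= 0:
        return None  # nothing ran between snapshots; callers must handle None
    draft_tokens = after["draft_tokens"] - before["draft_tokens"]
    accepted_tokens = after["accepted_tokens"] - before["accepted_tokens"]
    return {
        "num_drafts": num_drafts,
        "draft_tokens": draft_tokens,
        "accepted_tokens": accepted_tokens,
        "acceptance_rate": 100.0 * accepted_tokens / draft_tokens if draft_tokens else 0.0,
        # accepted draft tokens plus the one token the target model emits per step
        "acceptance_length": (accepted_tokens + num_drafts) / num_drafts,
    }


stats = compute_spec_decode_delta(
    {"num_drafts": 0, "draft_tokens": 0, "accepted_tokens": 0},
    {"num_drafts": 10, "draft_tokens": 30, "accepted_tokens": 18},
)
if stats is None:  # the None guard the review asks for
    print("no spec-decode activity between snapshots")
```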

talorabr and others added 2 commits February 25, 2026 16:36
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
nemo_skills/dataset/speed-bench/prepare.py (2)

136-136: ⚠️ Potential issue | 🟠 Major

Remove module-level @staticmethod decorators.

These are top-level functions, not class members. Keeping @staticmethod here is misleading and can create avoidable descriptor/type-introspection issues.

#!/bin/bash
# Verify minimum Python requirement and all module-level `@staticmethod` occurrences.

fd pyproject.toml setup.cfg setup.py | xargs -r rg -n "requires-python|python_requires"
rg -n '^\s*@staticmethod\s*$' nemo_skills/dataset/speed-bench/prepare.py
Suggested fix (apply to all listed occurrences)
-@staticmethod
 def _generate_stackselect_prompt(question: str, answers: list[str], answer: str, num_tokens: int) -> str:
-@staticmethod
 def _generate_textsort_prompt(prompt: str) -> str:

Also applies to: 216-216, 272-272, 289-289, 314-314, 320-320, 331-331, 381-381, 399-399, 423-423

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` at line 136, Remove the
module-level `@staticmethod` decorators that were applied to top-level functions
in nemo_skills/dataset/speed-bench/prepare.py; find each bare "@staticmethod"
above top-level function definitions (the occurrences flagged in the review) and
delete the decorator lines so the functions are plain module-level functions,
then run lint/tests to ensure no descriptor/type-introspection code relies on
them and update any references that assumed a staticmethod descriptor.

412-416: ⚠️ Potential issue | 🟡 Minor

Guard empty HLE category before sampling.

If filtering leaves hle_train_category empty, sample(1) raises and the failure message is not actionable.

Suggested fix
     hle_train_category = hle_train[hle_train["category"] == example["category"]]
+    if hle_train_category.empty:
+        raise ValueError(
+            f"No HLE training examples found for category '{example['category']}' after filtering"
+        )
 
     while current_num_tokens < num_tokens:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 412 - 416, Before
calling hle_train_category.sample(1) guard against an empty filtered DataFrame:
check hle_train_category.empty after filtering by example["category"] and either
(a) fallback to a broader sample (e.g., sample from hle_train with same rng) or
(b) raise a clear, actionable error that includes the missing category and
context (example["category"], example id) so the caller can diagnose the data
issue; ensure the subsequent code that appends to prompt still expects a string
demonstration from hle_train_category_sample.iloc[0].
docs/evaluation/speculative-decoding.md (1)

71-72: ⚠️ Potential issue | 🟠 Major

Fix broken multiline eval commands (missing trailing \).

Without a trailing backslash on Line 71 and Line 95, ++inference.tokens_to_generate=1024 runs as a separate command.

Suggested fix
-    --server_type=sglang
+    --server_type=sglang \
     ++inference.tokens_to_generate=1024
-    --server_type=vllm
+    --server_type=vllm \
     ++inference.tokens_to_generate=1024

Also applies to: 95-96

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` around lines 71 - 72, The multiline
shell flag sequences are missing trailing backslashes so that the next line
(e.g., the changed token "++inference.tokens_to_generate=1024") is executed as a
separate command; update the broken multiline commands (the lines containing
"--server_type=sglang" and the later block around
"++inference.tokens_to_generate=1024") to end with a trailing backslash on the
previous lines to continue the command, making sure the backslash is the last
character (no trailing spaces) and that each flag line uses the shell
continuation syntax correctly.
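The continuation rule can be checked with any multi-line command; a trivial sketch where `printf` stands in for the actual eval invocation:

```shell
# The backslash must be the very last character on the line; a trailing
# space after it breaks the continuation and the next line runs alone.
printf '%s\n' \
    --server_type=sglang \
    ++inference.tokens_to_generate=1024
```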
🧹 Nitpick comments (2)
docs/evaluation/speculative-decoding.md (1)

77-81: Add a language identifier to the fenced metrics block.

This keeps markdown lint clean and improves rendering consistency.

Suggested fix
-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @docs/evaluation/speculative-decoding.md around lines 77 - 81, The fenced
metrics block for the "speed-bench" table is missing a language identifier;
update the triple-backtick fence that precedes the table (the block starting
with the "--------------------------------------------- speed-bench
----------------------------------------------" header) to include "text" (i.e.,
replace ``` with ```text) so the code fence becomes ```text and keep the closing ``` unchanged to satisfy markdown lint and improve rendering.

nemo_skills/dataset/speed-bench/prepare.py (1)

457-465: Avoid recomputing HLE train preprocessing per row in dataset.map.

The to_pandas() conversion plus demonstration/tokens construction runs inside the per-example flow, which is expensive and scales poorly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 457 - 465, The
current code calls _get_external_dataset(...), then converts to_pandas() and
builds "demonstration" and "tokens" per-example inside a mapping flow which is
expensive; move the HLE preprocessing out of the per-example map by operating on
the full dataset once: fetch hle_train via
_get_external_dataset(BenchmarkDataset.HLE.value, ...), convert it once to a
pandas DataFrame (hle_train.to_pandas()) or use dataset.map with batched=True,
then create the "demonstration" column (using the lambda that concatenates
question and rationale) and compute "tokens" with a single
tiktoken.get_encoding("o200k_base").encode call applied vectorized/batched;
ensure you replace the per-example creation of "demonstration" and "tokens" in
the map with the precomputed columns so the dataset.map no longer recomputes
to_pandas or encoding per row (reference symbols: _get_external_dataset,
BenchmarkDataset.HLE.value, hle_train, "demonstration", "tokens",
tiktoken.get_encoding).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15df539 and e326fde.

📒 Files selected for processing (2)
  • docs/evaluation/speculative-decoding.md
  • nemo_skills/dataset/speed-bench/prepare.py

Comment on lines +194 to +208
all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
answers_to_add_stop = 0
for i, answer in enumerate(answers):
    if i == correct_answer_i:
        continue
    answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
    answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
    if all_tokens + answer_to_add_tokens > num_tokens:
        break
    answers_to_add_stop = i
answers_to_add = (
    answers[: answers_to_add_stop + 1]
    if answers_to_add_stop >= correct_answer_i
    else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
)

⚠️ Potential issue | 🟠 Major

Fix token-budget accounting in stackselect prompt construction.

all_tokens is never incremented after adding a candidate answer, so selection does not enforce cumulative budget. Also, answers_to_add_stop logic can include an answer that already exceeded budget.

Suggested fix
-    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
-    answers_to_add_stop = 0
-    for i, answer in enumerate(answers):
+    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
+    answers_to_add = [answers[correct_answer_i]]
+    for i, candidate_answer in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
-        answers_to_add_stop = i
-    answers_to_add = (
-        answers[: answers_to_add_stop + 1]
-        if answers_to_add_stop >= correct_answer_i
-        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
-    )
+        all_tokens += answer_to_add_tokens
+        answers_to_add.append(candidate_answer)
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, candidate_answer in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{candidate_answer}\n\n"

Also applies to: 210-211

🧰 Tools
🪛 Ruff (0.15.2)

[warning] 207-207: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 194 - 208, The loop
that chooses which candidate answers to include doesn't update the running token
count and can incorrectly include an answer that exceeds the budget; change
answers_to_add_stop initialization from 0 to -1, and inside the for-loop when an
answer fits (i.e., all_tokens + answer_to_add_tokens <= num_tokens) increment
all_tokens by answer_to_add_tokens and then set answers_to_add_stop = i; keep
the break when it doesn't fit. Apply the same token-counting and
answers_to_add_stop initialization fix to the analogous block referenced at
lines 210-211 so cumulative budget is enforced.
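A self-contained version of the cumulative accounting the fix calls for — the function name and the `count_tokens` callback are illustrative (the real code uses tiktoken's encoder):

```python
def select_answers_within_budget(answers, correct_i, base_tokens, num_tokens, count_tokens):
    """Greedily keep distractor answers while the running total, which
    already includes the correct answer, stays within num_tokens."""
    total = base_tokens + count_tokens(answers[correct_i])
    selected = [answers[correct_i]]
    for i, answer in enumerate(answers):
        if i == correct_i:
            continue
        cost = count_tokens(f"A{i + 1}:\n\n{answer}\n\n")
        if total + cost > num_tokens:
            break
        total += cost  # the cumulative update the original loop was missing
        selected.append(answer)
    return selected
```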

@Kipok Kipok left a comment

@talorabr, thanks for this contribution! Please re-create from a branch so that tests can run

if key not in sample:
    sample[key] = value
else:
    LOG.warning(
Collaborator

should we fail instead? If this is a spec decoding benchmark, I think it's better to just fail with that error

Collaborator Author

Changed the configuration to check whether specdec_stats is a dictionary and to raise an error if it is not

        }
    )
else:
    LOG.warning(
Collaborator

should we fail here instead?

Collaborator Author

Not necessarily. There was a bug in SGLang where concurrent writes to the metrics file could corrupt it; this was fixed in sgl-project/sglang#19011.
In addition, we have a fallback mechanism that retrieves metrics from the /metrics endpoint if reading from the file fails.
The benefit of using the metrics file is that it provides more detailed metrics than the /metrics endpoint.
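The file-first, endpoint-fallback behavior described here can be sketched roughly (the function and argument names are illustrative, not the PR's actual helpers):

```python
import json


def load_specdec_metrics(metrics_file, fetch_from_endpoint):
    """Prefer the detailed per-request metrics file; fall back to the
    coarser /metrics endpoint if the file is missing or corrupt."""
    try:
        with open(metrics_file, "rt", encoding="utf-8") as fin:
            return [json.loads(line) for line in fin]
    except (OSError, json.JSONDecodeError):
        # file absent or corrupted (e.g. by concurrent writes): use endpoint
        return fetch_from_endpoint()
```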

Signed-off-by: talora <talora@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
nemo_skills/dataset/speed-bench/prepare.py (2)

193-207: ⚠️ Potential issue | 🟠 Major

Fix cumulative token-budget accounting in StackSelect prompt assembly.

all_tokens is never incremented after accepting candidates, so budget checks are non-cumulative and can overshoot num_tokens.

Proposed fix
-    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
-    answers_to_add_stop = 0
-    for i, answer in enumerate(answers):
+    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
+    answers_to_add = [answers[correct_answer_i]]
+    for i, candidate_answer in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
-        answers_to_add_stop = i
-    answers_to_add = (
-        answers[: answers_to_add_stop + 1]
-        if answers_to_add_stop >= correct_answer_i
-        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
-    )
+        all_tokens += answer_to_add_tokens
+        answers_to_add.append(candidate_answer)
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, candidate_answer in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{candidate_answer}\n\n"

Also applies to: 209-210

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 193 - 207, The
token-budget check is non-cumulative because all_tokens is never updated after
accepting an answer candidate; modify the loop in prepare.py (the block that
iterates over answers and computes answer_to_add_tokens) to increment all_tokens
by answer_to_add_tokens whenever an answer is accepted (i.e., when you do not
break) so subsequent iterations account for previously added tokens, and apply
the same cumulative-update fix to the similar logic around lines 209-210 that
assembles additional answers (ensure any branch that appends tokens also
increases all_tokens before continuing).

403-407: ⚠️ Potential issue | 🟡 Minor

Add an explicit empty-category guard in HLE prompt generation.

If category filtering yields zero rows, .sample(1) fails with a low-context pandas error.

Proposed fix
     hle_train_category = hle_train[hle_train["category"] == example["category"]]
+    if hle_train_category.empty:
+        raise ValueError(f"No HLE training examples found for category '{example['category']}'")
 
     while current_num_tokens < num_tokens:
         hle_train_category_sample = hle_train_category.sample(1, random_state=rng)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 403 - 407, When
building the HLE prompt, guard against empty category results by checking
hle_train_category.empty before calling hle_train_category.sample(1); if empty,
fall back to sampling from the full hle_train (e.g., hle_train.sample(1,
random_state=rng)) or handle by breaking the loop/continuing to avoid calling
.sample on an empty frame — update the loop that uses hle_train_category,
example, rng, current_num_tokens, num_tokens, and prompt to perform this
empty-category check and use the fallback branch so .sample(1) cannot be invoked
on an empty DataFrame.
nemo_skills/evaluation/evaluator/specdec.py (1)

69-71: ⚠️ Potential issue | 🟡 Minor

Validate input_file before file I/O.

If input_file is unset, this fails with a generic open-time error instead of a clear config failure.

Proposed fix
     jsonl_file = eval_config.input_file
+    if not isinstance(jsonl_file, str) or not jsonl_file:
+        raise ValueError("`input_file` must be a non-empty string for specdec evaluation")
     with open(jsonl_file, "rt", encoding="utf-8") as fin:
         data = [json.loads(line) for line in fin]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 69 - 71, Verify
that eval_config.input_file (jsonl_file) is set and points to a non-empty string
before attempting to open it; if it's missing or falsy, raise a clear config
error (e.g., ValueError("input_file must be set in eval_config")) rather than
letting open() throw, then proceed to open and load lines into data as before
(refer to symbols eval_config.input_file, jsonl_file, data).
nemo_skills/evaluation/metrics/specdec_metrics.py (1)

79-80: ⚠️ Potential issue | 🟠 Major

Guard speculative metric averaging when no entries exist.

Division by self.total can crash when no predictions were ingested.

Proposed fix
             for metric_key, metric_value in agg_metric_dict.items():
                 if metric_key.startswith("spec_"):
-                    metrics_dict[agg_mode][metric_key] = metric_value / self.total
+                    metrics_dict[agg_mode][metric_key] = (metric_value / self.total) if self.total else 0.0
                 else:
                     metrics_dict[agg_mode][metric_key] = metric_value
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 79 - 80, The
speculative-metric averaging code uses metric_value / self.total and will raise
when no entries were ingested (self.total == 0); in the block where you handle
metric_key.startswith("spec_") (and write into
metrics_dict[agg_mode][metric_key]), guard the division by checking self.total >
0 and only perform metric_value / self.total when positive, otherwise assign a
safe default (e.g., 0 or None) or skip adding the metric; update the code around
metric_key.startswith("spec_") to use this guard so division-by-zero cannot
occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Line 9: Replace the vague "here" link text with a descriptive label (e.g.,
"Speculative decoding evaluation framework on GitHub") and update the metrics
output code fence to include a language specifier (use ```text) so the table is
rendered correctly; specifically, find the markdown link whose anchor text is
"here" and replace it with a descriptive label, and change the fenced block that
shows the "speed-bench" metrics to start with ```text (apply the same fixes to
the other occurrences mentioned in the comment).

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 620-622: The current direct call to dataset.to_json(output_path)
can truncate or corrupt an existing file if serialization fails; change this to
write to a temporary file in the same directory (e.g.,
output_path.with_suffix(".jsonl.tmp") or similar), perform the full
dataset.to_json to that temp path, flush/close it, then atomically move/replace
the temp file to output_path (use Path.replace or os.replace) and finally
LOG.info the final output_path; reference output_path, args.output_dir,
dataset.to_json and LOG.info when making the change.

---

Duplicate comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 193-207: The token-budget check is non-cumulative because
all_tokens is never updated after accepting an answer candidate; modify the loop
in prepare.py (the block that iterates over answers and computes
answer_to_add_tokens) to increment all_tokens by answer_to_add_tokens whenever
an answer is accepted (i.e., when you do not break) so subsequent iterations
account for previously added tokens, and apply the same cumulative-update fix to
the similar logic around lines 209-210 that assembles additional answers (ensure
any branch that appends tokens also increases all_tokens before continuing).
- Around line 403-407: When building the HLE prompt, guard against empty
category results by checking hle_train_category.empty before calling
hle_train_category.sample(1); if empty, fall back to sampling from the full
hle_train (e.g., hle_train.sample(1, random_state=rng)) or handle by breaking
the loop/continuing to avoid calling .sample on an empty frame — update the loop
that uses hle_train_category, example, rng, current_num_tokens, num_tokens, and
prompt to perform this empty-category check and use the fallback branch so
.sample(1) cannot be invoked on an empty DataFrame.

In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Around line 69-71: Verify that eval_config.input_file (jsonl_file) is set and
points to a non-empty string before attempting to open it; if it's missing or
falsy, raise a clear config error (e.g., ValueError("input_file must be set in
eval_config")) rather than letting open() throw, then proceed to open and load
lines into data as before (refer to symbols eval_config.input_file, jsonl_file,
data).

In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 79-80: The speculative-metric averaging code uses metric_value /
self.total and will raise when no entries were ingested (self.total == 0); in
the block where you handle metric_key.startswith("spec_") (and write into
metrics_dict[agg_mode][metric_key]), guard the division by checking self.total >
0 and only perform metric_value / self.total when positive, otherwise assign a
safe default (e.g., 0 or None) or skip adding the metric; update the code around
metric_key.startswith("spec_") to use this guard so division-by-zero cannot
occur.
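One of the duplicated findings above concerns sampling from a possibly empty category slice. A toy sketch of that guard (the DataFrame contents, `sample_from_category`, and the fallback policy are illustrative assumptions, not the `prepare.py` code):

```python
import pandas as pd

hle_train = pd.DataFrame(
    {"category": ["math", "math", "bio"], "question": ["q1", "q2", "q3"]}
)

def sample_from_category(df, category, rng):
    """Sample one row from the requested category, falling back to the
    full frame when the category slice is empty, so .sample(1) is never
    called on an empty DataFrame."""
    subset = df[df["category"] == category]
    if subset.empty:
        subset = df  # fallback: category has no rows
    return subset.sample(1, random_state=rng)

# "physics" does not exist in the frame, so this falls back safely
# instead of raising from an empty .sample(1) call.
row = sample_from_category(hle_train, "physics", rng=0)
print(row["question"].iloc[0])
```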

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e326fde and f42d153.

📒 Files selected for processing (6)
  • docs/evaluation/speculative-decoding.md
  • mkdocs.yml
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/specdec.py
  • nemo_skills/evaluation/metrics/specdec_metrics.py

In all SD benchmarks we want to measure two qualitative metrics for draft accuracy/quality: acceptance length (AL) and acceptance rate (AR).
Another metric in this group is the conditional acceptance rate (or per-position acceptance rate), which measures the acceptance rate at a given draft position, conditioned on all previous tokens in that step having been accepted.

For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
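To make these definitions concrete, here is a minimal standalone sketch (illustrative only, not the benchmark's implementation) that computes AL, AR, and the per-position conditional acceptance rate from per-step accepted-token counts, assuming each speculation step proposes `k` draft tokens:

```python
def specdec_metrics(accepted_per_step, k):
    """Compute acceptance length (AL), acceptance rate (AR), and the
    per-position conditional acceptance rate from a list with one
    accepted-token count per speculation step of k proposed tokens."""
    steps = len(accepted_per_step)
    total_accepted = sum(accepted_per_step)
    # AL: average number of accepted draft tokens per step.
    al = total_accepted / steps
    # AR: fraction of all proposed draft tokens that were accepted.
    ar = total_accepted / (steps * k)
    # Conditional acceptance rate at position i: among steps where the
    # first i draft tokens were all accepted, the fraction that also
    # accepted the token at position i.
    conditional = []
    for pos in range(k):
        reached = sum(1 for a in accepted_per_step if a >= pos)
        passed = sum(1 for a in accepted_per_step if a >= pos + 1)
        conditional.append(passed / reached if reached else 0.0)
    return al, ar, conditional

al, ar, cond = specdec_metrics([3, 1, 2, 0], k=3)
print(f"AL={al}, AR={ar}, conditional={cond}")  # AL=1.5, AR=0.5
```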
⚠️ Potential issue | 🟡 Minor

Tighten doc lint compliance (link text + code fence language).

Use descriptive link labels instead of “here”, and set a language for the metrics output block.

Proposed fix
-For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
+For more advanced evaluation of SD, including throughput and per-category metrics, see the [Model Optimizer specdec_bench examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).

-- Original benchmark source is [here](https://huggingface.co/datasets/nvidia/SPEED-Bench).
+- Original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).

-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38


Also applies to: 32-32, 77-81

<details>
<summary>🧰 Tools</summary>

<details>
<summary>🪛 markdownlint-cli2 (0.21.0)</summary>

[warning] 9-9: Link text should be descriptive

(MD059, descriptive-link-text)

</details>

</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/evaluation/speculative-decoding.md at line 9, Replace the vague "here"
link text with a descriptive label (e.g., "Speculative decoding evaluation
framework on GitHub") and update the metrics output code fence to include a
language specifier (use ```text) so the table is rendered correctly;
specifically, find the markdown link whose anchor text is "here" and replace it
with a descriptive label, and change the fenced block that shows the
"speed-bench" metrics to start with ```text (apply the same fixes to the other
occurrences mentioned in the comment).


</details>


Comment on lines +620 to +622
output_path = args.output_dir / f"{config}.jsonl"
dataset.to_json(output_path)
LOG.info(f" -> Saved to {output_path}")

⚠️ Potential issue | 🟠 Major

Write prepared benchmark files atomically to avoid partial-output corruption.

Direct writes to the final path risk truncating existing outputs if serialization fails mid-write.

Proposed fix
         output_path = args.output_dir / f"{config}.jsonl"
-        dataset.to_json(output_path)
+        args.output_dir.mkdir(parents=True, exist_ok=True)
+        tmp_output_path = output_path.with_suffix(output_path.suffix + ".tmp")
+        dataset.to_json(tmp_output_path)
+        tmp_output_path.replace(output_path)
         LOG.info(f"  -> Saved to {output_path}")

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 620 - 622, The
current direct call to dataset.to_json(output_path) can truncate or corrupt an
existing file if serialization fails; change this to write to a temporary file
in the same directory (e.g., output_path.with_suffix(".jsonl.tmp") or similar),
perform the full dataset.to_json to that temp path, flush/close it, then
atomically move/replace the temp file to output_path (use Path.replace or
os.replace) and finally LOG.info the final output_path; reference output_path,
args.output_dir, dataset.to_json and LOG.info when making the change.
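The suggested write-then-replace pattern can be sketched generically; `atomic_write_jsonl` and its arguments below are illustrative helpers, not code from the PR:

```python
import json
import os
import tempfile
from pathlib import Path

def atomic_write_jsonl(records, output_path: Path):
    """Serialize all records to a temp file in the target directory,
    then atomically replace the final path, so a failure mid-write
    never leaves a truncated or corrupted output file."""
    output_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=output_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")
        # Atomic on POSIX when source and target share a filesystem.
        os.replace(tmp_name, output_path)
    except BaseException:
        os.unlink(tmp_name)
        raise

atomic_write_jsonl([{"question": "q1"}, {"question": "q2"}], Path("out/config.jsonl"))
```

Because the temp file lives in the same directory as the target, `os.replace` is a rename rather than a cross-device copy, which is what makes the swap atomic.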

Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
@gwarmstrong
Collaborator

closing in favor of #1279
