Conversation
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
📝 Walkthrough

Adds end-to-end Speculative Decoding (SD) evaluation: docs and SPEED-Bench prep, a new SpecdecGenerationTask to capture and compute pre/post spec-decoding metrics from vLLM or SGLang (via Prometheus or per-request files), metrics collector and evaluator registration, and two minor server CLI quoting fixes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant GenTask as SpecdecGenerationTask
    participant Server as vLLM/SGLang
    participant Metrics as Prometheus/metrics file
    participant Evaluator
    User->>GenTask: start generation+evaluation(cfg)
    GenTask->>Metrics: fetch_metrics("before")
    Metrics-->>GenTask: before_snapshot
    GenTask->>Server: send_prompts / start_inference
    Server-->>GenTask: completions (+ optional per-request metrics files)
    GenTask->>Metrics: fetch_metrics("after") or read metrics files
    Metrics-->>GenTask: after_snapshot
    GenTask->>GenTask: compute_delta(before_snapshot, after_snapshot)
    GenTask->>GenTask: inject specdec_stats into eval_config
    GenTask->>Evaluator: run eval_specdec(config_with_stats)
    Evaluator-->>User: evaluation_results (data stamped with specdec metrics)
```
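The compute_delta step in the diagram can be sketched as follows (a minimal illustration; the snapshot field names and the +1 bonus-token convention in `acceptance_length` are assumptions for the sketch, not necessarily what the PR implements):

```python
def compute_delta(before, after):
    """Difference of two cumulative Prometheus-style counter snapshots,
    plus the derived speculative-decoding ratios."""
    num_drafts = after["num_drafts"] - before["num_drafts"]
    draft_tokens = after["draft_tokens"] - before["draft_tokens"]
    accepted_tokens = after["accepted_tokens"] - before["accepted_tokens"]
    if num_drafts <= 0 or draft_tokens <= 0:
        # No spec-decode activity between the two snapshots.
        return None
    return {
        "num_drafts": num_drafts,
        "draft_tokens": draft_tokens,
        "accepted_tokens": accepted_tokens,
        # Percentage of drafted tokens the target model accepted.
        "acceptance_rate": 100.0 * accepted_tokens / draft_tokens,
        # Mean tokens emitted per draft step (+1 for the target's own token).
        "acceptance_length": accepted_tokens / num_drafts + 1.0,
    }
```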
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

Suggested reviewers

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 11
🧹 Nitpick comments (3)
nemo_skills/dataset/speed-bench/prepare.py (1)
293-299: Redundant `encoder.encode` call.

Line 298 re-encodes `prompt` even though the result of the same call is already in `tokens` from line 293.

♻️ Proposed fix

```diff
  tokens = encoder.encode(prompt, disallowed_special=())
  current_num_tokens = len(tokens)
  if current_num_tokens > target_num_tokens:
      # Truncate if too long
-     tokens = encoder.encode(prompt, disallowed_special=())
      return encoder.decode(tokens[:target_num_tokens])
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 293 - 299, The code redundantly calls encoder.encode(prompt, disallowed_special=()) twice; use the already-computed tokens variable when truncating instead of re-encoding. In the block that checks if current_num_tokens > target_num_tokens (around variables tokens, current_num_tokens, target_num_tokens), remove the second encoder.encode call and simply return encoder.decode(tokens[:target_num_tokens]) so the existing tokens are reused.

nemo_skills/inference/eval/specdec.py (1)
440-448: Temp directory is never cleaned up.
`_ensure_sglang_metrics_dir` creates a temporary directory that persists after the process exits. Consider using `atexit` or a context manager to clean up, or at minimum document that cleanup is the caller's responsibility.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/specdec.py` around lines 440 - 448, The temp dir created by _ensure_sglang_metrics_dir (uses cls._sglang_metrics_dir) is never removed; register an atexit cleanup that removes the directory (use shutil.rmtree) when the process exits, handle and log OSErrors, and ensure registration only happens once (e.g., when you first create cls._sglang_metrics_dir) so the temp dir is cleaned up automatically; alternatively, add a documented method (e.g., cleanup_sglang_metrics_dir) to explicitly delete cls._sglang_metrics_dir and call it from atexit if you prefer manual control.

nemo_skills/evaluation/metrics/specdec_metrics.py (1)
87-98: `no_answer` branch is unreachable given `compute_no_answer=False`.

Since `__init__` hardcodes `compute_no_answer=False`, the block at lines 96-97 can never execute. This is fine as defensive code for potential subclasses, but worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 87 - 98, The metrics_to_print method contains an unreachable branch because __init__ hardcodes self.compute_no_answer=False; either make compute_no_answer configurable in __init__ (add a parameter and assign to self.compute_no_answer) so the "no_answer" key (as_percentage) can be enabled, or remove the dead branch entirely; update the constructor (the __init__ method) and references to compute_no_answer or delete the conditional block that adds "no_answer" from metrics_to_print (affecting metrics_to_print, compute_no_answer, __init__, and the "no_answer"/as_percentage handling) to keep behavior consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 62-73: The multi-line shell invocation for ns eval in
docs/evaluation/speculative-decoding.md is broken because some lines (notably
the line containing the flag --server_type=sglang and the line before the
++inference.tokens_to_generate stanza) are missing trailing backslashes, causing
the following lines (e.g., ++inference.tokens_to_generate=1024) to be executed
as separate commands; fix by adding the missing trailing "\" to the end of the
broken lines so the entire ns eval invocation (including flags like
--server_type=sglang, --server_gpus=8, and the subsequent
++inference.tokens_to_generate entry) is one continuous multi-line command, and
apply the same fix to the other affected block noted in the comment (the block
covering the range that includes ++inference.tokens_to_generate).
- Line 7: Typo in the metric name: replace "accetpance" with "acceptance" in the
phrase "conditional acceptance rate (or per-position accetpance rate)" so the
text reads "conditional acceptance rate (or per-position acceptance rate)";
update the same phrase wherever it appears in
docs/evaluation/speculative-decoding.md (look for "conditional acceptance rate"
/ "per-position accetpance rate") to keep terminology consistent.
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Line 363: The LOG.warning call is passing character as a positional argument
but the message has no %s placeholder; update the LOG.warning invocation (the
LOG object usage in prepare.py) to include a format placeholder (e.g., add " %s"
in the message) so the character value is interpolated (i.e., change the call
that currently reads LOG.warning("no motivation provided for character",
character) to use a format string like "no motivation provided for character %s"
with character as the second argument).
- Around line 196-211: The parameter named answer is being shadowed by the loop
variables in the loops "for i, answer in enumerate(answers)" and "for i, answer
in enumerate(answers_to_add)"; rename the loop variables (for example to
candidate or opt_answer) and update all uses inside those loops
(encoder.encode(...) and prompt += f"A{i + 1}...") so they reference the new
loop name, preserving the existing logic that skips the correct_answer_i and
builds answers_to_add, to avoid clobbering the outer parameter value.
- Around line 414-420: The while loop that samples from hle_train_category can
fail or behave unexpectedly if hle_train_category is empty; before entering the
loop in prepare.py (where hle_train_category.sample(1, random_state=rng) is
used), add a guard to check hle_train_category.empty and handle it explicitly
(e.g., raise a clear ValueError or return a sensible fallback) so the function
(which later calls encoder.decode with prompt_tokens and example_tokens) fails
with an informative message rather than crashing inside sample() or producing
wrong output; reference hle_train_category, the while current_num_tokens <
num_tokens loop, and encoder.decode when implementing the fix.
- Around line 136-137: Remove the incorrect `@staticmethod` decorator lines from
all module-level functions (e.g., _generate_stackselect_prompt and the other
functions declared at the noted occurrences) so they are plain top-level
functions; locate each def decorated with `@staticmethod` (the ones reported at
lines 137, 217, 273, 290, 315, 321, 332, 382, 400, 424 in the review) and delete
only the decorator line, leaving the def signature and body unchanged.
In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Around line 69-71: Check and validate eval_config.input_file before attempting
file I/O: ensure jsonl_file = eval_config.input_file is not None, is a non-empty
string (and optionally a valid path), and if not raise a clear ValueError (e.g.
"input_file must be a non-empty string") before the with open(...) block in
specdec.py so the open call no longer raises a generic TypeError; update any
callers or docstring if needed to reflect the validation on
eval_config.input_file.
In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 67-85: In get_metrics(), guard the speculative-decoding division
by checking self.total before dividing (refer to method get_metrics, the loop
handling metric_key.startswith("spec_") and attribute self.total); if self.total
is zero, set the speculative metric value to 0 (or another sentinel like
None/float("nan") if preferred) instead of performing metric_value / self.total,
then continue to call update_common_metrics and _add_std_metrics as before.
In `@nemo_skills/inference/eval/specdec.py`:
- Around line 601-633: In process_single_datapoint, current_response sometimes
lacks the "response" key so later list comprehension accessing
response["response_id"] can KeyError; ensure every current_response includes a
response_id (e.g., after computing current_response check if "response" not in
current_response then set current_response["response_id"]=None) or change the
aggregation to use response.get("response_id") when building "response_ids";
update the logic around current_response, responses, and the final return to
consistently provide/expect "response_id" (refer to process_single_datapoint,
current_response, responses, and the final return's "response_ids"
construction).
- Around line 546-599: The code writes data points that lack metric fields and
then aggregates over all data_points causing KeyError; fix by only writing and
aggregating matched data points: in the loop that checks if all(response_id in
metrics for response_id in data_point["response_ids"]) (the block that updates
data_point with num_drafts, draft_tokens, etc.) add matched_data_points = [] (or
append to an existing list) and write/append only when the metrics check passes,
keep the LOG.warning for unmatched ones but do NOT write them to fout or include
them in the list used for the final return_value sums; finally, compute
return_value by summing over matched_data_points (not data_points) and handle
empty matched_data_points (return None or zeros) to avoid division by zero.
- Around line 720-734: specdec_stats may be None after calling
compute_sglang_spec_decode_delta (and similarly for the VLLM path), so guard the
logging that dereferences specdec_stats: after assigning specdec_stats =
compute_sglang_spec_decode_delta(...) (and for the VLLM equivalent
compute_vllm_spec_decode_delta/variable), check "if specdec_stats is None" and
in that branch log a clear fallback message (e.g., "no delta available, skipping
Prometheus delta logging") and skip the formatted access; otherwise log using
specdec_stats["acceptance_length"], ["acceptance_rate"], ["num_drafts"],
["draft_tokens"] as before. Ensure you apply the same None-guard to the VLLM
block to prevent TypeError.
---
Nitpick comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 293-299: The code redundantly calls encoder.encode(prompt,
disallowed_special=()) twice; use the already-computed tokens variable when
truncating instead of re-encoding. In the block that checks if
current_num_tokens > target_num_tokens (around variables tokens,
current_num_tokens, target_num_tokens), remove the second encoder.encode call
and simply return encoder.decode(tokens[:target_num_tokens]) so the existing
tokens are reused.
In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 87-98: The metrics_to_print method contains an unreachable branch
because __init__ hardcodes self.compute_no_answer=False; either make
compute_no_answer configurable in __init__ (add a parameter and assign to
self.compute_no_answer) so the "no_answer" key (as_percentage) can be enabled,
or remove the dead branch entirely; update the constructor (the __init__ method)
and references to compute_no_answer or delete the conditional block that adds
"no_answer" from metrics_to_print (affecting metrics_to_print,
compute_no_answer, __init__, and the "no_answer"/as_percentage handling) to keep
behavior consistent.
In `@nemo_skills/inference/eval/specdec.py`:
- Around line 440-448: The temp dir created by _ensure_sglang_metrics_dir (uses
cls._sglang_metrics_dir) is never removed; register an atexit cleanup that
removes the directory (use shutil.rmtree) when the process exits, handle and log
OSErrors, and ensure registration only happens once (e.g., when you first create
cls._sglang_metrics_dir) so the temp dir is cleaned up automatically;
alternatively, add a documented method (e.g., cleanup_sglang_metrics_dir) to
explicitly delete cls._sglang_metrics_dir and call it from atexit if you prefer
manual control.
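For the atexit suggestion above, a minimal sketch of what the cleanup could look like (the class and attribute names are taken from the review text, not verified against the repository):

```python
import atexit
import logging
import shutil
import tempfile

LOG = logging.getLogger(__name__)


class SpecdecGenerationTask:  # illustrative stand-in for the real class
    _sglang_metrics_dir = None

    @classmethod
    def _ensure_sglang_metrics_dir(cls):
        if cls._sglang_metrics_dir is None:
            cls._sglang_metrics_dir = tempfile.mkdtemp(prefix="sglang_metrics_")
            # Register exactly once, when the directory is first created.
            atexit.register(cls.cleanup_sglang_metrics_dir)
        return cls._sglang_metrics_dir

    @classmethod
    def cleanup_sglang_metrics_dir(cls):
        if cls._sglang_metrics_dir is None:
            return
        try:
            shutil.rmtree(cls._sglang_metrics_dir)
        except OSError as exc:
            LOG.warning("Failed to remove %s: %s", cls._sglang_metrics_dir, exc)
        finally:
            cls._sglang_metrics_dir = None
```

Calling `cleanup_sglang_metrics_dir` a second time is a no-op, so the atexit hook stays safe even after a manual cleanup.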
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- docs/evaluation/index.md
- docs/evaluation/speculative-decoding.md
- nemo_skills/dataset/speed-bench/__init__.py
- nemo_skills/dataset/speed-bench/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/specdec.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/specdec_metrics.py
- nemo_skills/inference/eval/specdec.py
- nemo_skills/inference/server/serve_sglang.py
- nemo_skills/inference/server/serve_vllm.py
```python
for i, answer in enumerate(answers):
    if i == correct_answer_i:
        continue
    answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
    answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
    if all_tokens + answer_to_add_tokens > num_tokens:
        break
    answers_to_add_stop = i
answers_to_add = (
    answers[: answers_to_add_stop + 1]
    if answers_to_add_stop >= correct_answer_i
    else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
)
random.shuffle(answers_to_add)
for i, answer in enumerate(answers_to_add):
    prompt += f"A{i + 1}:\n\n{answer}\n\n"
```
Variable `answer` is shadowed by loop variables.

The parameter `answer` (line 137) is shadowed by the loop variable on line 196 (`for i, answer in enumerate(answers)`) and again on line 210 (`for i, answer in enumerate(answers_to_add)`). This makes the code confusing and error-prone if the original `answer` parameter is needed after these loops.
🛡️ Proposed fix — rename loop variables

```diff
- for i, answer in enumerate(answers):
+ for i, ans in enumerate(answers):
      if i == correct_answer_i:
          continue
-     answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+     answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
```

```diff
- for i, answer in enumerate(answers_to_add):
-     prompt += f"A{i + 1}:\n\n{answer}\n\n"
+ for i, ans in enumerate(answers_to_add):
+     prompt += f"A{i + 1}:\n\n{ans}\n\n"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
for i, ans in enumerate(answers):
    if i == correct_answer_i:
        continue
    answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
    answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
    if all_tokens + answer_to_add_tokens > num_tokens:
        break
    answers_to_add_stop = i
answers_to_add = (
    answers[: answers_to_add_stop + 1]
    if answers_to_add_stop >= correct_answer_i
    else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
)
random.shuffle(answers_to_add)
for i, ans in enumerate(answers_to_add):
    prompt += f"A{i + 1}:\n\n{ans}\n\n"
```
🧰 Tools

🪛 Ruff (0.15.2)

[warning] 207-207: Consider iterable unpacking instead of concatenation. Replace with iterable unpacking. (RUF005)
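For context, RUF005 flags list concatenation that can be written with iterable unpacking; a minimal illustration with made-up values:

```python
answers = ["alpha", "beta", "gamma"]
correct_answer_i = 1
answers_to_add_stop = 0

# Concatenation, as in the code under review (flagged by RUF005):
concat = [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]

# Iterable unpacking, as Ruff suggests; same result, no intermediate list:
unpacked = [answers[correct_answer_i], *answers[: answers_to_add_stop + 1]]

assert concat == unpacked == ["beta", "alpha"]
```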
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 196 - 211, The
parameter named answer is being shadowed by the loop variables in the loops "for
i, answer in enumerate(answers)" and "for i, answer in
enumerate(answers_to_add)"; rename the loop variables (for example to candidate
or opt_answer) and update all uses inside those loops (encoder.encode(...) and
prompt += f"A{i + 1}...") so they reference the new loop name, preserving the
existing logic that skips the correct_answer_i and builds answers_to_add, to
avoid clobbering the outer parameter value.
```python
jsonl_file = eval_config.input_file
with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]
```
Add explicit `input_file` validation before file I/O.

If `input_file` is absent, this currently fails with a generic `TypeError` at open-time. A direct check yields clearer failure semantics.
Suggested fix

```diff
  jsonl_file = eval_config.input_file
+ if not jsonl_file:
+     raise ValueError("`input_file` must be provided for specdec evaluation")
  with open(jsonl_file, "rt", encoding="utf-8") as fin:
      data = [json.loads(line) for line in fin]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```suggestion
jsonl_file = eval_config.input_file
if not jsonl_file:
    raise ValueError("`input_file` must be provided for specdec evaluation")
with open(jsonl_file, "rt", encoding="utf-8") as fin:
    data = [json.loads(line) for line in fin]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 69 - 71, Check and
validate eval_config.input_file before attempting file I/O: ensure jsonl_file =
eval_config.input_file is not None, is a non-empty string (and optionally a
valid path), and if not raise a clear ValueError (e.g. "input_file must be a
non-empty string") before the with open(...) block in specdec.py so the open
call no longer raises a generic TypeError; update any callers or docstring if
needed to reflect the validation on eval_config.input_file.
```python
def get_metrics(self) -> dict:
    """Get all computed metrics including speculative decoding statistics.

    Returns:
        Nested dict of evaluation mode → metric name → value.
    """
    metrics_dict = {}

    for agg_mode, agg_metric_dict in self.eval_dict.items():
        metrics_dict[agg_mode] = {}
        self.update_common_metrics(metrics_dict[agg_mode])
        for metric_key, metric_value in agg_metric_dict.items():
            if metric_key.startswith("spec_"):
                metrics_dict[agg_mode][metric_key] = metric_value / self.total
            else:
                metrics_dict[agg_mode][metric_key] = metric_value
    self._add_std_metrics(metrics_dict)

    return metrics_dict
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's explore the structure of BaseMetrics and related classes
cd nemo_skills && find . -type f -name "*metrics*.py" | head -20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1727

🏁 Script executed:

```shell
# Search for BaseMetrics class definition and self.total usage
rg -n "class BaseMetrics" --type=py -A 20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 1978

🏁 Script executed:

```shell
# Check how self.total is initialized and used throughout the codebase
rg -n "self\.total\s*=" --type=py -B 2 -A 2
```

Repository: NVIDIA-NeMo/Skills
Length of output: 372

🏁 Script executed:

```shell
# Look for get_metrics implementations in other metric classes
rg -n "def get_metrics" --type=py -B 5 -A 20
```

Repository: NVIDIA-NeMo/Skills
Length of output: 45882
Guard against `self.total == 0` before dividing by it in speculative decoding metrics.

If `get_metrics()` is called before any predictions are ingested (or after an empty update), the `metric_value / self.total` on line 80 will raise a `ZeroDivisionError`.
Proposed fix
```diff
  def get_metrics(self) -> dict:
      metrics_dict = {}
+     if self.total == 0:
+         return metrics_dict
      for agg_mode, agg_metric_dict in self.eval_dict.items():
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 67 - 85, In
get_metrics(), guard the speculative-decoding division by checking self.total
before dividing (refer to method get_metrics, the loop handling
metric_key.startswith("spec_") and attribute self.total); if self.total is zero,
set the speculative metric value to 0 (or another sentinel like
None/float("nan") if preferred) instead of performing metric_value / self.total,
then continue to call update_common_metrics and _add_std_metrics as before.
```python
data_points = []
with open(self.cfg.output_file, "rt", encoding="utf-8") as fin:
    for i, line in enumerate(fin):
        try:
            data_points.append(json.loads(line))
        except json.JSONDecodeError:
            LOG.warning(f"Failed to parse JSON line {i} for metrics injection, skipping")
            continue

with open(self.cfg.output_file, "wt", encoding="utf-8") as fout:
    for data_point in data_points:
        if all(response_id in metrics for response_id in data_point["response_ids"]):
            data_point.update(
                {
                    "num_drafts": sum(
                        metrics[response_id]["num_drafts"] for response_id in data_point["response_ids"]
                    ),
                    "draft_tokens": sum(
                        metrics[response_id]["draft_tokens"] for response_id in data_point["response_ids"]
                    ),
                    "accepted_tokens": sum(
                        metrics[response_id]["accepted_tokens"] for response_id in data_point["response_ids"]
                    ),
                    "acceptance_rate": sum(
                        metrics[response_id]["acceptance_rate"] for response_id in data_point["response_ids"]
                    )
                    / len(data_point["response_ids"]),
                    "acceptance_length": sum(
                        metrics[response_id]["acceptance_length"] for response_id in data_point["response_ids"]
                    )
                    / len(data_point["response_ids"]),
                }
            )
        else:
            LOG.warning(
                "No metrics found for response_ids: %s, skipping data point", data_point["response_ids"]
            )
        fout.write(json.dumps(data_point) + "\n")

try:
    return_value = {
        "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
        "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
        "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
        "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
        / len(data_points),
        "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
        / len(data_points),
        "per_position_acceptance_rates": [],
    }
except KeyError:
    LOG.warning("Metrics injection failed for some data points, skipping")
    return None
return return_value
```
Data points missing metrics are written without spec fields, causing the final aggregation to `KeyError`.

When `response_ids` are not all found in `metrics` (line 557), the data point is written to the output file without the spec-decode keys (line 583 runs unconditionally). Then lines 587-591 iterate over all `data_points`, including those missing the keys, which raises a `KeyError` caught on line 596, discarding the entire result.

Consider either skipping unmatched data points from the aggregation or not writing them to the output file.
🐛 Proposed fix — track matched data points for aggregation

```diff
+ matched_data_points = []
  with open(self.cfg.output_file, "wt", encoding="utf-8") as fout:
      for data_point in data_points:
          if all(response_id in metrics for response_id in data_point["response_ids"]):
              data_point.update(
                  {
                      ...
                  }
              )
+             matched_data_points.append(data_point)
          else:
              LOG.warning(
                  "No metrics found for response_ids: %s, skipping data point", data_point["response_ids"]
              )
          fout.write(json.dumps(data_point) + "\n")

+ if not matched_data_points:
+     LOG.warning("No data points matched metrics, skipping aggregation")
+     return None
+
  try:
      return_value = {
-         "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
-         "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
-         "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
-         "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
-         / len(data_points),
-         "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
-         / len(data_points),
+         "num_drafts": sum(dp["num_drafts"] for dp in matched_data_points),
+         "draft_tokens": sum(dp["draft_tokens"] for dp in matched_data_points),
+         "accepted_tokens": sum(dp["accepted_tokens"] for dp in matched_data_points),
+         "acceptance_rate": sum(dp["acceptance_rate"] for dp in matched_data_points)
+         / len(matched_data_points),
+         "acceptance_length": sum(dp["acceptance_length"] for dp in matched_data_points)
+         / len(matched_data_points),
          "per_position_acceptance_rates": [],
      }
```
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/inference/eval/specdec.py` around lines 546 - 599, The code
writes data points that lack metric fields and then aggregates over all
data_points causing KeyError; fix by only writing and aggregating matched data
points: in the loop that checks if all(response_id in metrics for response_id in
data_point["response_ids"]) (the block that updates data_point with num_drafts,
draft_tokens, etc.) add matched_data_points = [] (or append to an existing list)
and write/append only when the metrics check passes, keep the LOG.warning for
unmatched ones but do NOT write them to fout or include them in the list used
for the final return_value sums; finally, compute return_value by summing over
matched_data_points (not data_points) and handle empty matched_data_points
(return None or zeros) to avoid division by zero.
```python
async def process_single_datapoint(self, data_point, all_data, prompt_format="openai"):
    """Handle single-turn and multi-turn generation.

    For single-turn data points (or when ``multiturn`` is ``False``), this
    falls through to the base-class implementation.

    For multi-turn data points, it loops through every turn, building up
    the conversation and accumulating token counts.

    Also tracks request IDs for SGLang metrics matching.
    """

    messages = []
    responses = []

    for message in data_point["messages"]:
        messages.append(message)
        new_data_point = {"messages": messages}
        current_response = await super().process_single_datapoint(
            new_data_point, all_data, prompt_format=prompt_format
        )
        if "response" in current_response:
            raw_response = current_response.pop("response")
            current_response["response_id"] = raw_response.id
        responses.append(current_response)
        messages.append({"role": "assistant", "content": current_response["generation"]})

    # Return aggregated results
    return {
        "generation": [response["generation"] for response in responses],
        "num_generated_tokens": sum([response["num_generated_tokens"] for response in responses]),
        "response_ids": [response["response_id"] for response in responses],
    }
```
`response_id` may be missing if the response object lacks a `"response"` key.

On line 622, if `"response"` is not in `current_response`, the `response_id` field is never set. However, line 632 unconditionally accesses `response["response_id"]` for every response, which would raise a `KeyError`.
🛡️ Proposed fix
```diff
- if "response" in current_response:
-     raw_response = current_response.pop("response")
-     current_response["response_id"] = raw_response.id
+ raw_response = current_response.pop("response", None)
+ current_response["response_id"] = raw_response.id if raw_response is not None else None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/inference/eval/specdec.py` around lines 601 - 633, In
process_single_datapoint, current_response sometimes lacks the "response" key so
later list comprehension accessing response["response_id"] can KeyError; ensure
every current_response includes a response_id (e.g., after computing
current_response check if "response" not in current_response then set
current_response["response_id"]=None) or change the aggregation to use
response.get("response_id") when building "response_ids"; update the logic
around current_response, responses, and the final return to consistently
provide/expect "response_id" (refer to process_single_datapoint,
current_response, responses, and the final return's "response_ids"
construction).
```python
# ----- Strategy 2: Prometheus before/after fallback -----
if specdec_stats is None:
    LOG.info("Falling back to SGLang Prometheus before/after delta…")
    LOG.info("Fetching AFTER SGLang spec-decode metrics from %s/metrics", server_address)
    after_sglang = fetch_sglang_spec_decode_metrics(server_address)

    specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
    LOG.info(
        "SGLang Prometheus delta: acceptance_length=%.4f, "
        "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
        specdec_stats["acceptance_length"],
        specdec_stats["acceptance_rate"],
        specdec_stats["num_drafts"],
        specdec_stats["draft_tokens"],
    )
```
`specdec_stats` can be `None` when logged on lines 730-734, causing a `TypeError`.

`compute_sglang_spec_decode_delta` can return `None` (e.g. when `delta_requests <= 0`). The code on line 726 assigns the result to `specdec_stats`, but immediately accesses `specdec_stats["acceptance_length"]` on line 730 without a null check. The same pattern affects the vLLM path on lines 741-749.
🐛 Proposed fix — guard both logging blocks
# ----- Strategy 2: Prometheus before/after fallback -----
if specdec_stats is None:
LOG.info("Falling back to SGLang Prometheus before/after delta…")
LOG.info("Fetching AFTER SGLang spec-decode metrics from %s/metrics", server_address)
after_sglang = fetch_sglang_spec_decode_metrics(server_address)
specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
- LOG.info(
- "SGLang Prometheus delta: acceptance_length=%.4f, "
- "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
- specdec_stats["acceptance_length"],
- specdec_stats["acceptance_rate"],
- specdec_stats["num_drafts"],
- specdec_stats["draft_tokens"],
- )
+ if specdec_stats is not None:
+ LOG.info(
+ "SGLang Prometheus delta: acceptance_length=%.4f, "
+ "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
+ specdec_stats["acceptance_length"],
+ specdec_stats["acceptance_rate"],
+ specdec_stats["num_drafts"],
+ specdec_stats["draft_tokens"],
+ )
+ else:
+ LOG.warning("SGLang: failed to compute spec-decode delta.")
else:
# VLLM: Fetch "after" snapshot and compute delta
LOG.info("Fetching AFTER spec-decode metrics from %s/metrics", server_address)
after_metrics = fetch_vllm_spec_decode_metrics(server_address)
specdec_stats = compute_vllm_spec_decode_delta(self._before_metrics, after_metrics)
- LOG.info(
- "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
- "acceptance_rate=%.2f%%, acceptance_length=%.4f",
- specdec_stats["num_drafts"],
- specdec_stats["draft_tokens"],
- specdec_stats["accepted_tokens"],
- specdec_stats["acceptance_rate"],
- specdec_stats["acceptance_length"],
- )
- if specdec_stats["per_position_acceptance_rates"]:
+ if specdec_stats is not None:
LOG.info(
- "Per-position acceptance rates: %s",
- [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+ "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
+ "acceptance_rate=%.2f%%, acceptance_length=%.4f",
+ specdec_stats["num_drafts"],
+ specdec_stats["draft_tokens"],
+ specdec_stats["accepted_tokens"],
+ specdec_stats["acceptance_rate"],
+ specdec_stats["acceptance_length"],
)
+ if specdec_stats["per_position_acceptance_rates"]:
+ LOG.info(
+ "Per-position acceptance rates: %s",
+ [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+ )
+ else:
+ LOG.warning("VLLM: failed to compute spec-decode delta.")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/inference/eval/specdec.py` around lines 720 - 734, specdec_stats
may be None after calling compute_sglang_spec_decode_delta (and similarly for
the VLLM path), so guard the logging that dereferences specdec_stats: after
assigning specdec_stats = compute_sglang_spec_decode_delta(...) (and for the
VLLM equivalent compute_vllm_spec_decode_delta/variable), check "if
specdec_stats is None" and in that branch log a clear fallback message (e.g.,
"no delta available, skipping Prometheus delta logging") and skip the formatted
access; otherwise log using specdec_stats["acceptance_length"],
["acceptance_rate"], ["num_drafts"], ["draft_tokens"] as before. Ensure you
apply the same None-guard to the VLLM block to prevent TypeError.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Talor Abramovich <talor19@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Talor Abramovich <talor19@gmail.com>
Actionable comments posted: 1
♻️ Duplicate comments (3)
nemo_skills/dataset/speed-bench/prepare.py (2)
136-136: ⚠️ Potential issue | 🟠 Major

Remove module-level `@staticmethod` decorators.

These are top-level functions, not class members. Keeping `@staticmethod` here is misleading and can create avoidable descriptor/type-introspection issues.

```bash
#!/bin/bash
# Verify minimum Python requirement and all module-level `@staticmethod` occurrences.
fd pyproject.toml setup.cfg setup.py | xargs -r rg -n "requires-python|python_requires"
rg -n '^\s*@staticmethod\s*$' nemo_skills/dataset/speed-bench/prepare.py
```

Suggested fix (apply to all listed occurrences)

```diff
-@staticmethod
 def _generate_stackselect_prompt(question: str, answers: list[str], answer: str, num_tokens: int) -> str:
```

```diff
-@staticmethod
 def _generate_textsort_prompt(prompt: str) -> str:
```

Also applies to: 216-216, 272-272, 289-289, 314-314, 320-320, 331-331, 381-381, 399-399, 423-423
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/speed-bench/prepare.py` at line 136, Remove the module-level `@staticmethod` decorators that were applied to top-level functions in nemo_skills/dataset/speed-bench/prepare.py; find each bare "@staticmethod" above top-level function definitions (the occurrences flagged in the review) and delete the decorator lines so the functions are plain module-level functions, then run lint/tests to ensure no descriptor/type-introspection code relies on them and update any references that assumed a staticmethod descriptor.
412-416: ⚠️ Potential issue | 🟡 Minor

Guard empty HLE category before sampling.

If filtering leaves `hle_train_category` empty, `sample(1)` raises and the failure message is not actionable.

Suggested fix

```diff
 hle_train_category = hle_train[hle_train["category"] == example["category"]]
+if hle_train_category.empty:
+    raise ValueError(
+        f"No HLE training examples found for category '{example['category']}' after filtering"
+    )
 while current_num_tokens < num_tokens:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 412 - 416, Before calling hle_train_category.sample(1) guard against an empty filtered DataFrame: check hle_train_category.empty after filtering by example["category"] and either (a) fallback to a broader sample (e.g., sample from hle_train with same rng) or (b) raise a clear, actionable error that includes the missing category and context (example["category"], example id) so the caller can diagnose the data issue; ensure the subsequent code that appends to prompt still expects a string demonstration from hle_train_category_sample.iloc[0].

docs/evaluation/speculative-decoding.md (1)
71-72: ⚠️ Potential issue | 🟠 Major

Fix broken multiline eval commands (missing trailing `\`).

Without a trailing backslash on Line 71 and Line 95, `++inference.tokens_to_generate=1024` runs as a separate command.

Suggested fix

```diff
- --server_type=sglang
+ --server_type=sglang \
  ++inference.tokens_to_generate=1024
```

```diff
- --server_type=vllm
+ --server_type=vllm \
  ++inference.tokens_to_generate=1024
```

Also applies to: 95-96
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/evaluation/speculative-decoding.md` around lines 71 - 72, The multiline shell flag sequences are missing trailing backslashes so that the next line (e.g., the changed token "++inference.tokens_to_generate=1024") is executed as a separate command; update the broken multiline commands (the lines containing "--server_type=sglang" and the later block around "++inference.tokens_to_generate=1024") to end with a trailing backslash on the previous lines to continue the command, making sure the backslash is the last character (no trailing spaces) and that each flag line uses the shell continuation syntax correctly.
🧹 Nitpick comments (2)
docs/evaluation/speculative-decoding.md (1)
77-81: Add a language identifier to the fenced metrics block.

This keeps markdown lint clean and improves rendering consistency.

Suggested fix

````diff
-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38
````

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.
In `@docs/evaluation/speculative-decoding.md` around lines 77 - 81, The fenced
metrics block for the "speed-bench" table is missing a language identifier;
update the triple-backtick fence that precedes the table (the block starting
with the "speed-bench" header) to include "text" (i.e., replace ``` with ```text)
so the code fence becomes ```text, and keep the closing ``` unchanged to satisfy
markdown lint and improve rendering.

nemo_skills/dataset/speed-bench/prepare.py (1)
457-465: Avoid recomputing HLE train preprocessing per row in `dataset.map`.

The `to_pandas()` conversion plus `demonstration`/`tokens` construction is inside per-example flow, which is expensive and scales poorly.

🤖 Prompt for AI Agents
to_pandas()conversion plusdemonstration/tokensconstruction is inside per-example flow, which is expensive and scales poorly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 457 - 465, The current code calls _get_external_dataset(...), then converts to_pandas() and builds "demonstration" and "tokens" per-example inside a mapping flow which is expensive; move the HLE preprocessing out of the per-example map by operating on the full dataset once: fetch hle_train via _get_external_dataset(BenchmarkDataset.HLE.value, ...), convert it once to a pandas DataFrame (hle_train.to_pandas()) or use dataset.map with batched=True, then create the "demonstration" column (using the lambda that concatenates question and rationale) and compute "tokens" with a single tiktoken.get_encoding("o200k_base").encode call applied vectorized/batched; ensure you replace the per-example creation of "demonstration" and "tokens" in the map with the precomputed columns so the dataset.map no longer recomputes to_pandas or encoding per row (reference symbols: _get_external_dataset, BenchmarkDataset.HLE.value, hle_train, "demonstration", "tokens", tiktoken.get_encoding).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 194-208: The loop that chooses which candidate answers to include
doesn't update the running token count and can incorrectly include an answer
that exceeds the budget; change answers_to_add_stop initialization from 0 to -1,
and inside the for-loop when an answer fits (i.e., all_tokens +
answer_to_add_tokens <= num_tokens) increment all_tokens by answer_to_add_tokens
and then set answers_to_add_stop = i; keep the break when it doesn't fit. Apply
the same token-counting and answers_to_add_stop initialization fix to the
analogous block referenced at lines 210-211 so cumulative budget is enforced.
---
Duplicate comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 71-72: The multiline shell flag sequences are missing trailing
backslashes so that the next line (e.g., the changed token
"++inference.tokens_to_generate=1024") is executed as a separate command; update
the broken multiline commands (the lines containing "--server_type=sglang" and
the later block around "++inference.tokens_to_generate=1024") to end with a
trailing backslash on the previous lines to continue the command, making sure
the backslash is the last character (no trailing spaces) and that each flag line
uses the shell continuation syntax correctly.
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Line 136: Remove the module-level `@staticmethod` decorators that were applied
to top-level functions in nemo_skills/dataset/speed-bench/prepare.py; find each
bare "@staticmethod" above top-level function definitions (the occurrences
flagged in the review) and delete the decorator lines so the functions are plain
module-level functions, then run lint/tests to ensure no
descriptor/type-introspection code relies on them and update any references that
assumed a staticmethod descriptor.
- Around line 412-416: Before calling hle_train_category.sample(1) guard against
an empty filtered DataFrame: check hle_train_category.empty after filtering by
example["category"] and either (a) fallback to a broader sample (e.g., sample
from hle_train with same rng) or (b) raise a clear, actionable error that
includes the missing category and context (example["category"], example id) so
the caller can diagnose the data issue; ensure the subsequent code that appends
to prompt still expects a string demonstration from
hle_train_category_sample.iloc[0].
---
Nitpick comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 77-81: The fenced metrics block for the "speed-bench" table is
missing a language identifier; update the triple-backtick fence that precedes
the table (the block starting with the
"--------------------------------------------- speed-bench
----------------------------------------------" header) to include "text" (i.e.,
replace ``` with ```text) so the code fence becomes ```text and keep the closing
``` unchanged to satisfy markdown lint and improve rendering.
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 457-465: The current code calls _get_external_dataset(...), then
converts to_pandas() and builds "demonstration" and "tokens" per-example inside
a mapping flow which is expensive; move the HLE preprocessing out of the
per-example map by operating on the full dataset once: fetch hle_train via
_get_external_dataset(BenchmarkDataset.HLE.value, ...), convert it once to a
pandas DataFrame (hle_train.to_pandas()) or use dataset.map with batched=True,
then create the "demonstration" column (using the lambda that concatenates
question and rationale) and compute "tokens" with a single
tiktoken.get_encoding("o200k_base").encode call applied vectorized/batched;
ensure you replace the per-example creation of "demonstration" and "tokens" in
the map with the precomputed columns so the dataset.map no longer recomputes
to_pandas or encoding per row (reference symbols: _get_external_dataset,
BenchmarkDataset.HLE.value, hle_train, "demonstration", "tokens",
tiktoken.get_encoding).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- docs/evaluation/speculative-decoding.md
- nemo_skills/dataset/speed-bench/prepare.py
```python
all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
answers_to_add_stop = 0
for i, answer in enumerate(answers):
    if i == correct_answer_i:
        continue
    answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
    answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
    if all_tokens + answer_to_add_tokens > num_tokens:
        break
    answers_to_add_stop = i
answers_to_add = (
    answers[: answers_to_add_stop + 1]
    if answers_to_add_stop >= correct_answer_i
    else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
)
```
Fix token-budget accounting in stackselect prompt construction.
all_tokens is never incremented after adding a candidate answer, so selection does not enforce cumulative budget. Also, answers_to_add_stop logic can include an answer that already exceeded budget.
Suggested fix
- all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
- answers_to_add_stop = 0
- for i, answer in enumerate(answers):
+ all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
+ answers_to_add = [answers[correct_answer_i]]
+ for i, candidate_answer in enumerate(answers):
if i == correct_answer_i:
continue
- answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+ answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
if all_tokens + answer_to_add_tokens > num_tokens:
break
- answers_to_add_stop = i
- answers_to_add = (
- answers[: answers_to_add_stop + 1]
- if answers_to_add_stop >= correct_answer_i
- else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
- )
+ all_tokens += answer_to_add_tokens
+ answers_to_add.append(candidate_answer)
random.shuffle(answers_to_add)
- for i, answer in enumerate(answers_to_add):
- prompt += f"A{i + 1}:\n\n{answer}\n\n"
+ for i, candidate_answer in enumerate(answers_to_add):
+ prompt += f"A{i + 1}:\n\n{candidate_answer}\n\n"

Also applies to: 210-211
🧰 Tools
🪛 Ruff (0.15.2)
[warning] 207-207: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 194 - 208, The loop
that chooses which candidate answers to include doesn't update the running token
count and can incorrectly include an answer that exceeds the budget; change
answers_to_add_stop initialization from 0 to -1, and inside the for-loop when an
answer fits (i.e., all_tokens + answer_to_add_tokens <= num_tokens) increment
all_tokens by answer_to_add_tokens and then set answers_to_add_stop = i; keep
the break when it doesn't fit. Apply the same token-counting and
answers_to_add_stop initialization fix to the analogous block referenced at
lines 210-211 so cumulative budget is enforced.
```python
if key not in sample:
    sample[key] = value
else:
    LOG.warning(
```
should we fail instead? If this is a spec decoding benchmark, I think it's better to just fail with that error
Changed the configuration to check whether specdec_stats is a dictionary and raise an error if not
```python
        }
    )
else:
    LOG.warning(
```
should we fail here instead?
Not necessarily. There was a bug in SGLang where concurrent writes to the metrics file could corrupt it. This has been fixed in sgl-project/sglang#19011.
In addition, we have a fallback mechanism that retrieves metrics from the /metrics endpoint if reading from the file fails.
The benefit of using the metrics file is that it provides more detailed metrics compared to the /metrics endpoint.
Signed-off-by: talora <talora@nvidia.com>
Actionable comments posted: 2
♻️ Duplicate comments (4)
nemo_skills/dataset/speed-bench/prepare.py (2)
193-207: ⚠️ Potential issue | 🟠 Major

Fix cumulative token-budget accounting in StackSelect prompt assembly.

`all_tokens` is never incremented after accepting candidates, so budget checks are non-cumulative and can overshoot `num_tokens`.

Proposed fix

```diff
-    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
-    answers_to_add_stop = 0
-    for i, answer in enumerate(answers):
+    all_tokens = tokens_prompt + end_prompt_tokens + correct_answer_tokens
+    answers_to_add = [answers[correct_answer_i]]
+    for i, candidate_answer in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
-        answers_to_add_stop = i
-    answers_to_add = (
-        answers[: answers_to_add_stop + 1]
-        if answers_to_add_stop >= correct_answer_i
-        else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
-    )
+        all_tokens += answer_to_add_tokens
+        answers_to_add.append(candidate_answer)
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, candidate_answer in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{candidate_answer}\n\n"
```

Also applies to: 209-210
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 193 - 207, The token-budget check is non-cumulative because all_tokens is never updated after accepting an answer candidate; modify the loop in prepare.py (the block that iterates over answers and computes answer_to_add_tokens) to increment all_tokens by answer_to_add_tokens whenever an answer is accepted (i.e., when you do not break) so subsequent iterations account for previously added tokens, and apply the same cumulative-update fix to the similar logic around lines 209-210 that assembles additional answers (ensure any branch that appends tokens also increases all_tokens before continuing).
403-407: ⚠️ Potential issue | 🟡 Minor

Add an explicit empty-category guard in HLE prompt generation.

If category filtering yields zero rows, `.sample(1)` fails with a low-context pandas error.

Proposed fix

```diff
 hle_train_category = hle_train[hle_train["category"] == example["category"]]
+if hle_train_category.empty:
+    raise ValueError(f"No HLE training examples found for category '{example['category']}'")
 while current_num_tokens < num_tokens:
     hle_train_category_sample = hle_train_category.sample(1, random_state=rng)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 403 - 407, When building the HLE prompt, guard against empty category results by checking hle_train_category.empty before calling hle_train_category.sample(1); if empty, fall back to sampling from the full hle_train (e.g., hle_train.sample(1, random_state=rng)) or handle by breaking the loop/continuing to avoid calling .sample on an empty frame — update the loop that uses hle_train_category, example, rng, current_num_tokens, num_tokens, and prompt to perform this empty-category check and use the fallback branch so .sample(1) cannot be invoked on an empty DataFrame.

nemo_skills/evaluation/evaluator/specdec.py (1)
69-71: ⚠️ Potential issue | 🟡 Minor

Validate `input_file` before file I/O.

If `input_file` is unset, this fails with a generic open-time error instead of a clear config failure.

Proposed fix

```diff
 jsonl_file = eval_config.input_file
+if not isinstance(jsonl_file, str) or not jsonl_file:
+    raise ValueError("`input_file` must be a non-empty string for specdec evaluation")
 with open(jsonl_file, "rt", encoding="utf-8") as fin:
     data = [json.loads(line) for line in fin]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 69 - 71, Verify that eval_config.input_file (jsonl_file) is set and points to a non-empty string before attempting to open it; if it's missing or falsy, raise a clear config error (e.g., ValueError("input_file must be set in eval_config")) rather than letting open() throw, then proceed to open and load lines into data as before (refer to symbols eval_config.input_file, jsonl_file, data).

nemo_skills/evaluation/metrics/specdec_metrics.py (1)
79-80: ⚠️ Potential issue | 🟠 Major

Guard speculative metric averaging when no entries exist.

Division by `self.total` can crash when no predictions were ingested.

Proposed fix

```diff
 for metric_key, metric_value in agg_metric_dict.items():
     if metric_key.startswith("spec_"):
-        metrics_dict[agg_mode][metric_key] = metric_value / self.total
+        metrics_dict[agg_mode][metric_key] = (metric_value / self.total) if self.total else 0.0
     else:
         metrics_dict[agg_mode][metric_key] = metric_value
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/specdec_metrics.py` around lines 79 - 80, The speculative-metric averaging code uses metric_value / self.total and will raise when no entries were ingested (self.total == 0); in the block where you handle metric_key.startswith("spec_") (and write into metrics_dict[agg_mode][metric_key]), guard the division by checking self.total > 0 and only perform metric_value / self.total when positive, otherwise assign a safe default (e.g., 0 or None) or skip adding the metric; update the code around metric_key.startswith("spec_") to use this guard so division-by-zero cannot occur.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Line 9: Replace the vague "here" link text with a descriptive label (e.g.,
"Speculative decoding evaluation framework on GitHub") and update the metrics
output code fence to include a language specifier (use ```text) so the table is
rendered correctly; specifically, find the markdown link whose anchor text is
"here" and replace it with a descriptive label, and change the fenced block that
shows the "speed-bench" metrics to start with ```text (apply the same fixes to
the other occurrences mentioned in the comment).
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 620-622: The current direct call to dataset.to_json(output_path)
can truncate or corrupt an existing file if serialization fails; change this to
write to a temporary file in the same directory (e.g.,
output_path.with_suffix(".jsonl.tmp") or similar), perform the full
dataset.to_json to that temp path, flush/close it, then atomically move/replace
the temp file to output_path (use Path.replace or os.replace) and finally
LOG.info the final output_path; reference output_path, args.output_dir,
dataset.to_json and LOG.info when making the change.
---
Duplicate comments:
In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 193-207: The token-budget check is non-cumulative because
all_tokens is never updated after accepting an answer candidate; modify the loop
in prepare.py (the block that iterates over answers and computes
answer_to_add_tokens) to increment all_tokens by answer_to_add_tokens whenever
an answer is accepted (i.e., when you do not break) so subsequent iterations
account for previously added tokens, and apply the same cumulative-update fix to
the similar logic around lines 209-210 that assembles additional answers (ensure
any branch that appends tokens also increases all_tokens before continuing).
- Around line 403-407: When building the HLE prompt, guard against empty
category results by checking hle_train_category.empty before calling
hle_train_category.sample(1); if empty, fall back to sampling from the full
hle_train (e.g., hle_train.sample(1, random_state=rng)) or handle by breaking
the loop/continuing to avoid calling .sample on an empty frame — update the loop
that uses hle_train_category, example, rng, current_num_tokens, num_tokens, and
prompt to perform this empty-category check and use the fallback branch so
.sample(1) cannot be invoked on an empty DataFrame.
In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Around line 69-71: Verify that eval_config.input_file (jsonl_file) is set and
points to a non-empty string before attempting to open it; if it's missing or
falsy, raise a clear config error (e.g., ValueError("input_file must be set in
eval_config")) rather than letting open() throw, then proceed to open and load
lines into data as before (refer to symbols eval_config.input_file, jsonl_file,
data).
In `@nemo_skills/evaluation/metrics/specdec_metrics.py`:
- Around line 79-80: The speculative-metric averaging code uses metric_value /
self.total and will raise when no entries were ingested (self.total == 0); in
the block where you handle metric_key.startswith("spec_") (and write into
metrics_dict[agg_mode][metric_key]), guard the division by checking self.total >
0 and only perform metric_value / self.total when positive, otherwise assign a
safe default (e.g., 0 or None) or skip adding the metric; update the code around
metric_key.startswith("spec_") to use this guard so division-by-zero cannot
occur.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/evaluation/speculative-decoding.md
- mkdocs.yml
- nemo_skills/dataset/speed-bench/prepare.py
- nemo_skills/evaluation/evaluator/__init__.py
- nemo_skills/evaluation/evaluator/specdec.py
- nemo_skills/evaluation/metrics/specdec_metrics.py
```markdown
In all SD benchmarks we want to measure two qualitative metrics for draft accuracy/quality: acceptance length (AL), acceptance rate (AR).
Other metric in this group is conditional acceptance rate (or per-position acceptance rate), which measures the acceptance rate in a given position conditioned that all previous tokens were accepted.

For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
```
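The AL/AR/conditional-rate definitions in the doc lines above can be illustrated with a toy computation. This is not the repo's implementation; the function name and input shape are assumptions made for the example:

```python
def acceptance_metrics(per_position_accepts, num_drafts):
    """Toy speculative-decoding acceptance metrics.

    per_position_accepts[k] = number of draft steps whose token at draft
    position k was accepted (which implies all earlier positions were too).
    """
    draft_tokens = num_drafts * len(per_position_accepts)
    accepted_tokens = sum(per_position_accepts)
    # Acceptance rate (AR): fraction of drafted tokens accepted, in percent.
    ar = 100.0 * accepted_tokens / draft_tokens if draft_tokens else 0.0
    # Acceptance length (AL): average tokens emitted per draft step
    # (the one guaranteed target-model token plus accepted draft tokens).
    al = 1.0 + accepted_tokens / num_drafts if num_drafts else 0.0
    # Conditional (per-position) acceptance rate: accepted at position k
    # divided by draft steps that survived to position k.
    survivors = [num_drafts] + per_position_accepts[:-1]
    conditional = [a / s if s else 0.0 for a, s in zip(per_position_accepts, survivors)]
    return al, ar, conditional
```

For example, 10 draft steps of 2 tokens each, with 8 first-position and 4 second-position acceptances, give AL = 2.2 and AR = 60%.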
Tighten doc lint compliance (link text + code fence language).
Use descriptive link labels instead of “here”, and set a language for the metrics output block.
Proposed fix
-For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
+For more advanced evaluation of SD, including throughput and per-category metrics, see the [Model Optimizer specdec_bench examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
-- Original benchmark source is [here](https://huggingface.co/datasets/nvidia/SPEED-Bench).
+- Original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).
-```
+```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 464 | 139 | 2.78 | 69.38
Also applies to: 32-32, 77-81
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 9-9: Link text should be descriptive
(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @docs/evaluation/speculative-decoding.md at line 9, Replace the vague "here"
link text with a descriptive label (e.g., "Speculative decoding evaluation
framework on GitHub") and update the metrics output code fence to include a
language specifier (use ```text) so the table is rendered correctly; specifically,
find the markdown link whose anchor text is "here" and replace it with a
descriptive label, and change the fenced block that shows the "speed-bench"
metrics to start with ```text (apply the same fixes to the other
occurrences mentioned in the comment).
```python
output_path = args.output_dir / f"{config}.jsonl"
dataset.to_json(output_path)
LOG.info(f" -> Saved to {output_path}")
```
Write prepared benchmark files atomically to avoid partial-output corruption.
Direct writes to the final path risk truncating existing outputs if serialization fails mid-write.
Proposed fix
output_path = args.output_dir / f"{config}.jsonl"
- dataset.to_json(output_path)
+ args.output_dir.mkdir(parents=True, exist_ok=True)
+ tmp_output_path = output_path.with_suffix(output_path.suffix + ".tmp")
+ dataset.to_json(tmp_output_path)
+ tmp_output_path.replace(output_path)
 LOG.info(f" -> Saved to {output_path}")

As per coding guidelines: “When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 620 - 622, The
current direct call to dataset.to_json(output_path) can truncate or corrupt an
existing file if serialization fails; change this to write to a temporary file
in the same directory (e.g., output_path.with_suffix(".jsonl.tmp") or similar),
perform the full dataset.to_json to that temp path, flush/close it, then
atomically move/replace the temp file to output_path (use Path.replace or
os.replace) and finally LOG.info the final output_path; reference output_path,
args.output_dir, dataset.to_json and LOG.info when making the change.
Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com> Signed-off-by: Talor Abramovich <talor19@gmail.com>
closing in favor of #1279
PR to support speculative decoding (SD) evaluation, and specifically the SPEED-Bench dataset (a new standard dataset for speculative decoding):
Summary by CodeRabbit
New Features
Documentation
Chores