
Add SPEED-Bench (within repo) #1279

Merged: gwarmstrong merged 20 commits into main from talora/speed-bench on Mar 4, 2026

Conversation


@talorabr talorabr commented Feb 26, 2026

PR to support speculative decoding (SD) evaluation, and specifically the SPEED-Bench dataset (a new standard dataset for speculative decoding):

  • Add SD metrics.
  • Add SD evaluator and generation flows.
  • Add SPEED-Bench dataset for SD evaluation.

Summary by CodeRabbit

  • New Features

    • Speculative decoding evaluation: acceptance-rate, acceptance-length, per-position metrics, and end-to-end generation/evaluation workflow with automatic metric capture and injection.
  • New Tools

    • SPEED‑Bench integration with a CLI to prepare evaluation-ready test sets across multiple benchmarks and configurations.
  • Documentation

    • Detailed speculative decoding guide with example commands, server setups, data prep, and sample outputs.
  • Improvements

    • Safer shell argument handling when launching servers.
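For orientation, the acceptance metrics listed above can be derived from raw draft counters. A minimal sketch, assuming one common convention (the function name and exact formulas here are illustrative; the PR's `specdec_metrics.py` is authoritative):

```python
def specdec_metrics(num_drafts: int, draft_tokens: int, accepted_tokens: int) -> tuple[float, float]:
    # Acceptance rate: share of drafted tokens the target model accepted, in percent.
    acceptance_rate = 100.0 * accepted_tokens / draft_tokens
    # Acceptance length: average tokens emitted per draft step; the +1 accounts
    # for the one token the target model always generates itself.
    acceptance_length = 1.0 + accepted_tokens / num_drafts
    return acceptance_rate, acceptance_length


print(specdec_metrics(num_drafts=10, draft_tokens=40, accepted_tokens=25))  # (62.5, 3.5)
```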

talorabr and others added 7 commits February 25, 2026 12:48
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>

coderabbitai bot commented Feb 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds speculative-decoding evaluation: docs, SPEED‑Bench data preparation, new specdec metrics and evaluator, inference tooling to capture/compute Prometheus deltas for SGLang and vLLM and inject per-request metrics, plus minor server CLI escaping fixes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: `docs/evaluation/index.md`, `docs/evaluation/speculative-decoding.md` | Adds a speculative decoding guide entry and a new detailed evaluation doc describing metrics, the SPEED-Bench workflow, and server-specific notes. |
| Dataset: SPEED-Bench: `nemo_skills/dataset/speed-bench/__init__.py`, `nemo_skills/dataset/speed-bench/prepare.py` | New dataset package and CLI: dataset enums/loaders, many dataset adapters, token-aware prompt generators, external-data resolution, and JSONL output preparation for SPEED-Bench configs. |
| Evaluation Metrics: `nemo_skills/evaluation/metrics/specdec_metrics.py`, `nemo_skills/evaluation/metrics/map_metrics.py` | Adds SpecdecMetrics, registers "specdec" in METRICS_MAP, and implements aggregation/printing for spec-decoding metrics (acceptance length/rate, std dev). |
| Evaluator: `nemo_skills/evaluation/evaluator/specdec.py`, `nemo_skills/evaluation/evaluator/__init__.py` | Adds SpecdecEvaluatorConfig and the eval_specdec function; registers "specdec" in EVALUATOR_MAP; the evaluator augments JSONL entries with specdec_stats atomically. |
| Inference evaluation & generation task: `nemo_skills/inference/eval/specdec.py` | Large new module: Prometheus text fetch/parsers for vLLM and SGLang, a SpecDecodeMetrics dataclass, delta computations, SGLang metrics-file handling, _build_specdec_stats, SpecdecGenerationConfig/SpecdecGenerationTask with server wait/run hooks, per-request metrics injection, and Hydra config registration. |
| Server CLI escaping: `nemo_skills/inference/server/serve_sglang.py`, `nemo_skills/inference/server/serve_vllm.py` | Use shlex.join to shell-quote extra CLI args when building server commands. |
| Site config: `mkdocs.yml` | Adds evaluation/speculative-decoding.md to the docs navigation. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Runner as Runner (Client)
    participant Server as Model Server (SGLang / vLLM)
    participant Metrics as Metrics Endpoint / Files
    participant Evaluator as Evaluator

    Runner->>Server: wait_for_server() — capture "before" metrics
    Server->>Metrics: expose metrics endpoint / write metrics file
    Metrics-->>Runner: return initial counters/gauges or snapshot

    Runner->>Server: run_batch_evaluation() — send inference requests
    Server->>Server: perform speculative decoding (drafts, accepts)
    Server->>Metrics: update counters/gauges or append per-request metrics

    Runner->>Metrics: fetch "after" metrics or read metrics files
    Metrics-->>Runner: return updated counters/gauges or per-request data

    Runner->>Runner: compute_*_spec_decode_delta(before, after) / aggregate files
    Runner->>Evaluator: attach specdec_stats to eval config
    Evaluator->>Evaluator: eval_specdec() — stamp data points with stats
    Evaluator-->>Runner: write updated evaluation outputs
```
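The before/after metric capture in the diagram boils down to parsing the Prometheus text exposition twice and subtracting counters. A simplified sketch (the metric names are placeholders, and this parser ignores label and timestamp edge cases that the PR's real parsers would need to handle):

```python
def parse_counters(prometheus_text: str) -> dict[str, float]:
    """Parse 'name{labels} value' lines, skipping HELP/TYPE comments."""
    counters: dict[str, float] = {}
    for line in prometheus_text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            counters[name] = float(value)
        except ValueError:
            continue  # not a simple 'name value' sample line
    return counters


def counter_delta(before: dict[str, float], after: dict[str, float], key: str) -> float:
    # Prometheus counters only increase, so the difference isolates this run.
    return after.get(key, 0.0) - before.get(key, 0.0)


before = parse_counters("# TYPE spec_accepted_total counter\nspec_accepted_total 10")
after = parse_counters("spec_accepted_total 25\nspec_draft_total 40")
print(counter_delta(before, after, "spec_accepted_total"))  # 15.0
```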

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

enhancement

Suggested reviewers

  • gwarmstrong
  • ekmb
  • Kipok
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.83%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title 'Add SPEED-Bench (within repo)' clearly and concisely summarizes the main change: introducing SPEED-Bench dataset integration for speculative decoding evaluation. |



coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (4)
docs/evaluation/speculative-decoding.md (1)

9-9: Use descriptive link text instead of “here”.

The current link labels are vague and trigger markdownlint MD059.

💡 Proposed doc wording
-For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
+For more advanced evaluation of SD, including throughput and per-category metrics, see the [Model Optimizer speculative decoding benchmark examples](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
-- Original benchmark source is [here](https://huggingface.co/datasets/nvidia/SPEED-Bench).
+- Original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).

Also applies to: 32-32

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` at line 9, Replace the vague "here"
link text with descriptive link text (e.g., "speculative decoding evaluation
framework" or "SpecDec benchmark repository") in the sentence referencing the
GitHub repo URL so it satisfies markdownlint MD059; update the link label in the
line containing the GitHub URL in docs/evaluation/speculative-decoding.md and
make the same replacement for the other occurrence noted (also at the second
instance around line 32) to ensure both links are descriptive.
nemo_skills/dataset/speed-bench/__init__.py (1)

16-21: Please include SPEED-Bench in Slurm/CI benchmark coverage.

Given this introduces a new evaluation modality (specdec) plus dataset defaults, it would be good to add at least one SPEED-Bench path to default Slurm test coverage to catch integration regressions early.

Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/__init__.py` around lines 16 - 21, This
change adds a new dataset/modality (REQUIRES_DATA_DIR,
DATASET_GROUP="speculative-decoding", METRICS_TYPE="specdec", EVAL_SPLIT,
GENERATION_ARGS, GENERATION_MODULE) but no CI/Slurm test covers it; add a
SPEED-Bench entry to the default Slurm/CI benchmark matrix so at least one path
runs during tests. Concretely: update the Slurm/CI benchmark list to include a
SPEED-Bench job that references the new dataset (dataset name or DATASET_GROUP
"speculative-decoding"), ensure the job uses the correct generation flags
(include GENERATION_ARGS with ++eval_type=specdec and
++inference.include_response=true) and imports the GENERATION_MODULE
"nemo_skills.inference.eval.specdec", and add this job to the default test suite
so integration/regression failures for the new modality are caught.
nemo_skills/dataset/speed-bench/prepare.py (2)

393-396: Add strict=True to zip() to catch length mismatches.

If question and options columns have different lengths, zip() silently truncates to the shorter one. Per coding guidelines, code should fail clearly rather than silently corrupting data.

♻️ Proposed fix
     return [first_question] + [
         get_question_and_options(question, options)
-        for question, options in zip(external_dataset["question"][1:], external_dataset["options"][1:])
+        for question, options in zip(external_dataset["question"][1:], external_dataset["options"][1:], strict=True)
     ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 393 - 396, The list
comprehension using zip(external_dataset["question"][1:],
external_dataset["options"][1:]) can silently truncate if the two columns differ
in length; update that call to zip(..., strict=True) to force an immediate error
on length mismatch so data corruption is not silent—specifically modify the
expression inside the return that builds the list after first_question (the zip
used with get_question_and_options and external_dataset["question"/"options"])
to pass strict=True.

196-211: Variable shadowing: answer loop variable shadows the function parameter.

The loop variable answer shadows the function parameter of the same name. While the current logic happens to work, this is confusing and error-prone.

♻️ Proposed fix
-    for i, answer in enumerate(answers):
+    for i, ans in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
         answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
         if all_tokens + answer_to_add_tokens > num_tokens:
             break
         answers_to_add_stop = i
     answers_to_add = (
         answers[: answers_to_add_stop + 1]
         if answers_to_add_stop >= correct_answer_i
         else [answers[correct_answer_i]] + answers[: answers_to_add_stop + 1]
     )
     random.shuffle(answers_to_add)
-    for i, answer in enumerate(answers_to_add):
-        prompt += f"A{i + 1}:\n\n{answer}\n\n"
+    for i, ans in enumerate(answers_to_add):
+        prompt += f"A{i + 1}:\n\n{ans}\n\n"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 196 - 211, The loops
in prepare.py use the loop variable name "answer", which shadows the function
parameter named "answer"; rename the loop variables in both occurrences (the
first "for i, answer in enumerate(answers):" and the second "for i, answer in
enumerate(answers_to_add):") to a non-conflicting name like "candidate" or "ans"
and update all uses inside those loops (e.g., replace "answer" -> "candidate"
and "answer_to_add" construction, "answer_to_add_tokens", and prompt appends) so
the function parameter "answer" is not shadowed and behavior is preserved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 77-81: The markdown code fence in
docs/evaluation/speculative-decoding.md containing the speed-bench metrics is
missing a language identifier which triggers MD040; update the opening
triple-backtick for that block to include a language tag (e.g., use ```text) so
the fenced block reads as a text code block and satisfies the linter, ensuring
the metrics table remains unchanged.
- Around line 71-72: The multiline shell examples break because the line with
the flag inference.tokens_to_generate=1024 is not joined to the previous line;
update both occurrences where
--server_type=sglang/++inference.tokens_to_generate=1024 appear so the command
continuation is preserved by adding a trailing backslash on the previous line
(or otherwise joining the lines) before the inference.tokens_to_generate=1024
token so copy-paste execution works.

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 136-137: Remove the invalid module-level `@staticmethod` decorators:
delete the decorator lines above all listed functions so they are plain
functions (specifically _generate_stackselect_prompt, _generate_textsort_prompt,
_generate_writing_prompt, _pad_or_truncate_prompt, _generate_bamboo_prompt,
_generate_chatrag_bench_prompt, _generate_coser_prompt,
_generate_mmlu_pro_prompt, _generate_hle_prompt, and
_get_num_tokens_from_config); ensure any call sites that previously expected a
function (e.g., calls to _generate_stackselect_prompt) now invoke the plain
function and no longer rely on staticmethod objects, and run tests/lint to
confirm no remaining NameError or staticmethod misuse.
- Around line 432-436: The function _fetch_all_turns_data currently accesses
turns[0] without ensuring turns is non-empty, which can raise IndexError; update
_fetch_all_turns_data to first check that example["turns"] (the turns variable)
is a non-empty list (e.g., if not turns: return example or handle appropriately)
before evaluating turns[0]. Keep the existing TURNS_PLACEHOLDER check but
perform it only after verifying turns has elements, and ensure any downstream
logic that assumes at least one turn is similarly guarded.

In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Line 64: After constructing eval_config via SpecdecEvaluatorConfig(**cfg), add
explicit fail-fast validation that cfg contains the required 'input_file' key
and that the parsed stats object includes all required specdec metric keys (do
not rely on dict.get); if any are missing raise a clear ValueError mentioning
the missing key(s). Locate usages around SpecdecEvaluatorConfig and the
stats-parsing/metrics extraction logic (previously using .get at the referenced
areas) and replace .get() calls with direct key access or an explicit
presence-check + informative exception so malformed or missing stats cannot
silently degrade results.

In `@nemo_skills/inference/eval/specdec.py`:
- Around line 726-734: compute_sglang_spec_decode_delta can return None, so
before using specdec_stats in the LOG.info call you must null-check it; locate
the call that assigns specdec_stats from compute_sglang_spec_decode_delta (uses
self._before_metrics and after_sglang) and if specdec_stats is None, skip or log
a safe message (e.g., "no delta to report") instead of dereferencing keys like
"acceptance_length", "acceptance_rate", "num_drafts", "draft_tokens"; ensure the
LOG.info only runs when specdec_stats is a dict with those keys to avoid
TypeError.
- Around line 740-754: compute_vllm_spec_decode_delta can return None (e.g.,
when delta_draft_tokens <= 0), but the code immediately dereferences
specdec_stats for logging; update the VLLM path to guard against None by
checking if specdec_stats is truthy before logging (or early-return/skip logging
when None). Locate the call site around specdec_stats =
compute_vllm_spec_decode_delta(...) and wrap the subsequent LOG.info(...) blocks
in a conditional like if specdec_stats: so you only access specdec_stats["..."]
when compute_vllm_spec_decode_delta returned a valid dict.
- Around line 585-599: The aggregation can KeyError when some entries in
data_points lack metric keys; update the logic that builds return_value to first
filter data_points into a metric_points list containing only items with the
expected keys (e.g., "num_drafts", "draft_tokens", "accepted_tokens",
"acceptance_rate", "acceptance_length"), or alternatively use
data_point.get(key, 0) and compute averages against the count of metric_points
(not original data_points); if metric_points is empty return None (keeping the
LOG.warning path) and then compute the sums/averages from metric_points when
constructing return_value so the aggregation never raises KeyError.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b38cce and e326fde.

📒 Files selected for processing (11)
  • docs/evaluation/index.md
  • docs/evaluation/speculative-decoding.md
  • nemo_skills/dataset/speed-bench/__init__.py
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/__init__.py
  • nemo_skills/evaluation/evaluator/specdec.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/specdec_metrics.py
  • nemo_skills/inference/eval/specdec.py
  • nemo_skills/inference/server/serve_sglang.py
  • nemo_skills/inference/server/serve_vllm.py

Comment on lines +77 to +81
```
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 464 | 139 | 2.78 | 69.38
```

⚠️ Potential issue | 🟡 Minor

Add a language identifier to the metrics output code fence.

This block should specify a language (e.g., text) to satisfy MD040.

💡 Proposed markdownlint fix
-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 77-77: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


Comment on lines +432 to +436
def _fetch_all_turns_data(example: dict[str, Any], speed_config: DATASET_CONFIG | str) -> dict[str, Any]:
turns = example["turns"]
if not turns[0].startswith(TURNS_PLACEHOLDER):
return example

⚠️ Potential issue | 🟡 Minor

Potential IndexError if turns list is empty.

Accessing turns[0] without checking if the list is empty will raise an IndexError. Consider adding a guard.

🛡️ Proposed fix
 def _fetch_all_turns_data(example: dict[str, Any], speed_config: DATASET_CONFIG | str) -> dict[str, Any]:
     turns = example["turns"]
-    if not turns[0].startswith(TURNS_PLACEHOLDER):
+    if not turns or not turns[0].startswith(TURNS_PLACEHOLDER):
         return example

cfg: Evaluator configuration dict. Must contain ``input_file`` and
``specdec_stats`` keys.
"""
eval_config = SpecdecEvaluatorConfig(**cfg)

⚠️ Potential issue | 🟠 Major

Fail fast on required evaluator inputs and stats schema.

input_file and required specdec metrics should be validated explicitly. Current fallback defaults at Line 82–84 can hide malformed stats and silently degrade results.

💡 Proposed fail-fast validation
 def eval_specdec(cfg: dict[str, Any]) -> None:
@@
     eval_config = SpecdecEvaluatorConfig(**cfg)
+    if not eval_config.input_file:
+        raise ValueError("`input_file` is required for specdec evaluation.")
@@
     stats = eval_config.specdec_stats
     if stats is not None:
+        required_keys = {"acceptance_length", "acceptance_rate", "num_drafts", "draft_tokens", "accepted_tokens"}
+        missing = required_keys - set(stats.keys())
+        if missing:
+            raise ValueError(f"Missing required `specdec_stats` keys: {sorted(missing)}")
         LOG.info(
             "Stamping spec-decode stats onto %d data points: "
             "acceptance_length=%.4f, acceptance_rate=%.2f%%, num_drafts=%d",
             len(data),
-            stats.get("acceptance_length", 0.0),
-            stats.get("acceptance_rate", 0.0),
-            stats.get("num_drafts", 0),
+            stats["acceptance_length"],
+            stats["acceptance_rate"],
+            stats["num_drafts"],
         )
As per coding guidelines, "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data."

Also applies to: 69-70, 76-85


Comment on lines +585 to +599
try:
return_value = {
"num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
"draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
"accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
"acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
/ len(data_points),
"acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
/ len(data_points),
"per_position_acceptance_rates": [],
}
except KeyError:
LOG.warning("Metrics injection failed for some data points, skipping")
return None
return return_value

⚠️ Potential issue | 🟠 Major

Aggregation may fail if any data point lacks metrics keys.

The aggregation at lines 587-594 accesses data_point["num_drafts"] etc. directly, but line 579-582 only logs a warning when metrics aren't found—the data point is still written without these keys. This will cause a KeyError in the final aggregation.

🐛 Proposed fix — filter data points or use .get() with fallback
         try:
-            return_value = {
-                "num_drafts": sum([data_point["num_drafts"] for data_point in data_points]),
-                "draft_tokens": sum([data_point["draft_tokens"] for data_point in data_points]),
-                "accepted_tokens": sum([data_point["accepted_tokens"] for data_point in data_points]),
-                "acceptance_rate": sum([data_point["acceptance_rate"] for data_point in data_points])
-                / len(data_points),
-                "acceptance_length": sum([data_point["acceptance_length"] for data_point in data_points])
-                / len(data_points),
-                "per_position_acceptance_rates": [],
-            }
+            valid_data_points = [dp for dp in data_points if "num_drafts" in dp]
+            if not valid_data_points:
+                LOG.warning("No data points with metrics found")
+                return None
+            return_value = {
+                "num_drafts": sum(dp["num_drafts"] for dp in valid_data_points),
+                "draft_tokens": sum(dp["draft_tokens"] for dp in valid_data_points),
+                "accepted_tokens": sum(dp["accepted_tokens"] for dp in valid_data_points),
+                "acceptance_rate": sum(dp["acceptance_rate"] for dp in valid_data_points) / len(valid_data_points),
+                "acceptance_length": sum(dp["acceptance_length"] for dp in valid_data_points) / len(valid_data_points),
+                "per_position_acceptance_rates": [],
+            }
         except KeyError:
             LOG.warning("Metrics injection failed for some data points, skipping")
             return None

Comment on lines +726 to +734
specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
LOG.info(
"SGLang Prometheus delta: acceptance_length=%.4f, "
"acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
specdec_stats["acceptance_length"],
specdec_stats["acceptance_rate"],
specdec_stats["num_drafts"],
specdec_stats["draft_tokens"],
)

⚠️ Potential issue | 🟠 Major

Missing null check — specdec_stats can be None from compute_sglang_spec_decode_delta.

compute_sglang_spec_decode_delta returns None when there are no new requests (line 295) or when weighted delta computation fails (line 311). Accessing specdec_stats["acceptance_length"] on line 730 will raise a TypeError in these cases.

🐛 Proposed fix
                 specdec_stats = compute_sglang_spec_decode_delta(self._before_metrics, after_sglang)
+                if specdec_stats is None:
+                    LOG.warning("Failed to compute SGLang spec-decode delta from Prometheus metrics")
+                else:
-                LOG.info(
-                    "SGLang Prometheus delta: acceptance_length=%.4f, "
-                    "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
-                    specdec_stats["acceptance_length"],
-                    specdec_stats["acceptance_rate"],
-                    specdec_stats["num_drafts"],
-                    specdec_stats["draft_tokens"],
-                )
+                    LOG.info(
+                        "SGLang Prometheus delta: acceptance_length=%.4f, "
+                        "acceptance_rate=%.2f%%, requests=%d, gen_tokens=%d",
+                        specdec_stats["acceptance_length"],
+                        specdec_stats["acceptance_rate"],
+                        specdec_stats["num_drafts"],
+                        specdec_stats["draft_tokens"],
+                    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 726 - 734,
compute_sglang_spec_decode_delta can return None, so before using specdec_stats
in the LOG.info call you must null-check it; locate the call that assigns
specdec_stats from compute_sglang_spec_decode_delta (uses self._before_metrics
and after_sglang) and if specdec_stats is None, skip or log a safe message
(e.g., "no delta to report") instead of dereferencing keys like
"acceptance_length", "acceptance_rate", "num_drafts", "draft_tokens"; ensure the
LOG.info only runs when specdec_stats is a dict with those keys to avoid
TypeError.

Comment on lines +740 to +754
specdec_stats = compute_vllm_spec_decode_delta(self._before_metrics, after_metrics)
LOG.info(
"Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
"acceptance_rate=%.2f%%, acceptance_length=%.4f",
specdec_stats["num_drafts"],
specdec_stats["draft_tokens"],
specdec_stats["accepted_tokens"],
specdec_stats["acceptance_rate"],
specdec_stats["acceptance_length"],
)
if specdec_stats["per_position_acceptance_rates"]:
LOG.info(
"Per-position acceptance rates: %s",
[f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
)

⚠️ Potential issue | 🟠 Major

Same null-check issue for VLLM path.

compute_vllm_spec_decode_delta returns None when delta_draft_tokens <= 0 (line 375). The subsequent logging on lines 741-754 will crash.

🐛 Proposed fix
             specdec_stats = compute_vllm_spec_decode_delta(self._before_metrics, after_metrics)
+            if specdec_stats is None:
+                LOG.warning("No speculative decoding activity detected during generation")
+            else:
-            LOG.info(
-                "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
-                "acceptance_rate=%.2f%%, acceptance_length=%.4f",
-                specdec_stats["num_drafts"],
-                specdec_stats["draft_tokens"],
-                specdec_stats["accepted_tokens"],
-                specdec_stats["acceptance_rate"],
-                specdec_stats["acceptance_length"],
-            )
-            if specdec_stats["per_position_acceptance_rates"]:
                 LOG.info(
-                    "Per-position acceptance rates: %s",
-                    [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+                    "Spec-decode delta: drafts=%d, draft_tokens=%d, accepted=%d, "
+                    "acceptance_rate=%.2f%%, acceptance_length=%.4f",
+                    specdec_stats["num_drafts"],
+                    specdec_stats["draft_tokens"],
+                    specdec_stats["accepted_tokens"],
+                    specdec_stats["acceptance_rate"],
+                    specdec_stats["acceptance_length"],
                 )
+                if specdec_stats["per_position_acceptance_rates"]:
+                    LOG.info(
+                        "Per-position acceptance rates: %s",
+                        [f"{r:.4f}" for r in specdec_stats["per_position_acceptance_rates"]],
+                    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/specdec.py` around lines 740 - 754,
compute_vllm_spec_decode_delta can return None (e.g., when delta_draft_tokens <=
0), but the code immediately dereferences specdec_stats for logging; update the
VLLM path to guard against None by checking if specdec_stats is truthy before
logging (or early-return/skip logging when None). Locate the call site around
specdec_stats = compute_vllm_spec_decode_delta(...) and wrap the subsequent
LOG.info(...) blocks in a conditional like if specdec_stats: so you only access
specdec_stats["..."] when compute_vllm_spec_decode_delta returned a valid dict.

talorabr and others added 3 commits February 26, 2026 01:44
Signed-off-by: talora <talora@nvidia.com>
Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
docs/evaluation/speculative-decoding.md (3)

77-81: ⚠️ Potential issue | 🟡 Minor

Add language identifier to code fence.

The fenced code block showing metrics output should specify a language (e.g., text) to satisfy MD040.

💡 Proposed fix
-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` around lines 77 - 81, The fenced
code block that shows the metrics output is missing a language specifier which
triggers MD040; update the opening fence for that block to include a language
identifier (e.g., use "text") so the block starts with ```text instead of ``` —
locate the metrics block showing "speed-bench" and change only the fence marker
to include the language identifier.

85-97: ⚠️ Potential issue | 🟡 Minor

Missing line continuation backslash in vLLM example.

Same issue as the SGLang example: line 95 is missing the trailing backslash before line 96.

💡 Proposed fix
-    --server_type=vllm
+    --server_type=vllm \
     ++inference.tokens_to_generate=1024
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` around lines 85 - 97, The bash
example is missing a line-continuation backslash before the final argument;
update the multiline command so the line containing --server_type=vllm ends with
a trailing backslash to join the next line that sets
++inference.tokens_to_generate=1024, ensuring the --server_type and
++inference.tokens_to_generate lines are part of the same continued command.

61-73: ⚠️ Potential issue | 🟡 Minor

Missing line continuation backslash breaks shell command.

Line 71 ends with --server_type=sglang but is missing the trailing backslash \ needed to continue to line 72's ++inference.tokens_to_generate=1024. Copy-paste execution will fail.

💡 Proposed fix
-    --server_type=sglang
+    --server_type=sglang \
     ++inference.tokens_to_generate=1024
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` around lines 61 - 73, The shell
example for the ns eval command is missing a line-continuation backslash after
the --server_type=sglang token, causing the ++inference.tokens_to_generate=1024
flag to be treated as a new shell command; update the example invocation of ns
eval so the --server_type=sglang line ends with a trailing backslash (or place
++inference.tokens_to_generate=1024 on the previous line) so the entire command
including ++inference.tokens_to_generate=1024 is passed to ns eval.
nemo_skills/dataset/speed-bench/prepare.py (1)

422-426: ⚠️ Potential issue | 🟡 Minor

Potential IndexError if turns list is empty.

Accessing turns[0] without checking if the list is empty will raise an IndexError.

🛡️ Proposed fix
 def _fetch_all_turns_data(example: dict[str, Any], speed_config: DATASET_CONFIG | str) -> dict[str, Any]:
     turns = example["turns"]
-    if not turns[0].startswith(TURNS_PLACEHOLDER):
+    if not turns or not turns[0].startswith(TURNS_PLACEHOLDER):
         return example
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 422 - 426, In
_fetch_all_turns_data, accessing turns[0] can raise IndexError when turns is
empty; add a guard to check that turns is non-empty (e.g., if not turns or not
turns[0].startswith(TURNS_PLACEHOLDER): return example) so the function returns
early for empty lists and avoids indexing into turns[0]; update any related
checks in the same function to use the safe non-empty check rather than assuming
turns has at least one element.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3143529 and f42d153.

📒 Files selected for processing (5)
  • docs/evaluation/speculative-decoding.md
  • mkdocs.yml
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/specdec.py
  • nemo_skills/evaluation/metrics/specdec_metrics.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/evaluation/evaluator/specdec.py

Comment on lines +195 to +202
for i, answer in enumerate(answers):
if i == correct_answer_i:
continue
answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
answer_to_add_tokens = len(encoder.encode(answer_to_add, disallowed_special=()))
if all_tokens + answer_to_add_tokens > num_tokens:
break
answers_to_add_stop = i

⚠️ Potential issue | 🟡 Minor

Variable shadowing: answer loop variable shadows function parameter.

The loop variable answer on line 195 shadows the answer parameter from line 136. While the code currently works because answer is used at line 189 before the loop, this shadowing is confusing and error-prone for future maintenance.

🛠️ Proposed fix
-    for i, answer in enumerate(answers):
+    for i, ans in enumerate(answers):
         if i == correct_answer_i:
             continue
-        answer_to_add = f"A{i + 1}:\n\n{answer}\n\n"
+        answer_to_add = f"A{i + 1}:\n\n{ans}\n\n"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 195 - 202, The
for-loop in prepare.py shadows the function parameter named answer; rename the
loop variable (e.g., to candidate_answer or other_answer) in the loop "for i,
answer in enumerate(answers)" and update all uses inside that loop (including
building answer_to_add, computing answer_to_add_tokens with encoder.encode,
checking correct_answer_i and updating answers_to_add_stop) so the function
parameter answer remains unshadowed and clear.

Comment on lines +385 to +388
return [first_question] + [
get_question_and_options(question, options)
for question, options in zip(external_dataset["question"][1:], external_dataset["options"][1:])
]

⚠️ Potential issue | 🟡 Minor

Add strict=True to zip() to catch length mismatches.

Per coding guidelines, code should fail when something unexpected happens. If question and options have different lengths, silent truncation would corrupt data.

🛠️ Proposed fix
     return [first_question] + [
         get_question_and_options(question, options)
-        for question, options in zip(external_dataset["question"][1:], external_dataset["options"][1:])
+        for question, options in zip(external_dataset["question"][1:], external_dataset["options"][1:], strict=True)
     ]
🧰 Tools
🪛 Ruff (0.15.2)

[warning] 387-387: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 385 - 388, The list
construction silently truncates if external_dataset["question"] and
external_dataset["options"] have different lengths; update the comprehension
that builds the return value (the expression using
zip(external_dataset["question"][1:], external_dataset["options"][1:])) to call
zip with strict=True so a ValueError is raised on length mismatch, keeping
first_question and the helper get_question_and_options unchanged.

Signed-off-by: talora <talora@nvidia.com>
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
nemo_skills/evaluation/evaluator/specdec.py (1)

70-77: ⚠️ Potential issue | 🟡 Minor

Add explicit validation that input_file is not None.

BaseEvaluatorConfig.input_file defaults to None. If not provided, open(jsonl_file, ...) at Line 76 will raise an unclear TypeError. A fail-fast check with a descriptive message would improve debuggability.

💡 Proposed fix
     eval_config = SpecdecEvaluatorConfig(**cfg)

+    if not eval_config.input_file:
+        raise ValueError("`input_file` is required for specdec evaluation.")
+
     # ------------------------------------------------------------------
     # 1. Read output file
     # ------------------------------------------------------------------
     jsonl_file = eval_config.input_file
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/specdec.py` around lines 70 - 77, The code
currently uses eval_config = SpecdecEvaluatorConfig(**cfg) then opens jsonl_file
= eval_config.input_file without checking for None; add an explicit fail-fast
check after constructing SpecdecEvaluatorConfig (check eval_config.input_file is
not None and non-empty) and if it is None raise a ValueError with a clear
message like "input_file must be provided for SpecdecEvaluatorConfig"; reference
the symbols SpecdecEvaluatorConfig, eval_config.input_file, jsonl_file and the
subsequent open(...) call so the check sits immediately before the with
open(jsonl_file, ...) block.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f42d153 and 858317a.

📒 Files selected for processing (2)
  • nemo_skills/dataset/speed-bench/__init__.py
  • nemo_skills/evaluation/evaluator/specdec.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/dataset/speed-bench/__init__.py

@gwarmstrong gwarmstrong mentioned this pull request Feb 27, 2026
@gwarmstrong gwarmstrong left a comment

largely looks good, just a few comments/questions

Comment on lines +34 to +40
#### License

GOVERNING TERMS: This dataset is governed by the NVIDIA Evaluation Dataset License Agreement.

ADDITIONAL INFORMATION: MIT for bigcode/humanevalpack, RUCAIBox/MMATH, RUCAIBox/BAMBOO and EQ-Bench. Apache 2.0 for Writing Bench and Spec-Bench. CC BY 4.0 for FBK-MT/MCIF. MIT and Apache 2.0 for tianyang/repobench_python_v1.1, JetBrains-Research/lca-project-level-code-completion and tianyang/repobench_java_v1.1.

NOTICE: For each dataset a user elects to use, the user is responsible for checking if the dataset license is fit for the intended purpose. The `prepare_data.py` script automatically fetches data from all the source datasets.

we don't typically include this information in the docs, maybe just the "additional details" comment below will suffice?

talorabr and others added 4 commits March 1, 2026 06:01
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
nemo_skills/evaluation/evaluator/specdec.py (1)

68-68: ⚠️ Potential issue | 🟠 Major

Fail fast on required input_file and full specdec_stats schema.

Missing input_file currently fails later with a less-clear error, and partial specdec_stats can silently stamp incomplete metrics.

🛠️ Proposed fix
 def eval_specdec(cfg: dict[str, Any]) -> None:
@@
     eval_config = SpecdecEvaluatorConfig(**cfg)
+    if not eval_config.input_file:
+        raise ValueError("`input_file` is required for specdec evaluation.")

@@
     stats = eval_config.specdec_stats
+    required_stats_keys = {
+        "acceptance_length",
+        "acceptance_rate",
+        "num_drafts",
+        "draft_tokens",
+        "accepted_tokens",
+        "per_position_acceptance_rates",
+    }
+    missing = required_stats_keys - set(stats)
+    if missing:
+        raise ValueError(f"Missing required `specdec_stats` keys: {sorted(missing)}")
+
     LOG.info(
         "Stamping spec-decode stats onto %d data points: "
         "acceptance_length=%.4f, acceptance_rate=%.2f%%, num_drafts=%d",

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

Also applies to: 73-75, 80-93

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/specdec.py` at line 68, The code currently
constructs eval_config = SpecdecEvaluatorConfig(**cfg) but does not validate
that required fields (notably input_file) are present nor that specdec_stats
conforms to the full expected schema; add explicit validation immediately after
construction to fail fast: check eval_config.input_file is non-empty and raise a
clear error if missing, and validate eval_config.specdec_stats contains all
required keys/fields (or run the config's schema validator) and raise if any
required metric is absent or has wrong type; update the logic used around
SpecdecEvaluatorConfig (and the related blocks referenced near lines 73-75 and
80-93) to enforce these checks so user-supplied unsupported/partial parameters
are rejected early with informative errors.
nemo_skills/dataset/speed-bench/prepare.py (2)

415-417: ⚠️ Potential issue | 🟡 Minor

Guard against empty turns before indexing turns[0].

Line 416 can raise IndexError when turns is empty.

🛠️ Proposed fix
 def _fetch_all_turns_data(example: dict[str, Any], speed_config: DATASET_CONFIG | str) -> dict[str, Any]:
     turns = example["turns"]
-    if not turns[0].startswith(TURNS_PLACEHOLDER):
+    if not turns or not turns[0].startswith(TURNS_PLACEHOLDER):
         return example
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 415 - 417, Guard
against empty turns before indexing: ensure you check that the list turns is
non-empty before calling turns[0].startswith(TURNS_PLACEHOLDER). Update the
conditional around turns/TURNS_PLACEHOLDER in prepare.py so it first returns
example if turns is empty (e.g., check not turns or len(turns) == 0), and only
then evaluate startswith on turns[0]; keep the existing behavior of returning
example when the placeholder check fails.

377-380: ⚠️ Potential issue | 🟠 Major

Use zip(..., strict=True) to prevent silent truncation of question/option pairs.

If lengths diverge, current code truncates silently and can corrupt prompt construction.

🛠️ Proposed fix
-        for question, options in zip(external_dataset["question"][1:], external_dataset["options"][1:])
+        for question, options in zip(
+            external_dataset["question"][1:],
+            external_dataset["options"][1:],
+            strict=True,
+        )
#!/bin/bash
# Verify all zip() call sites in this file and confirm strict=True is present where paired sequences must align.
rg -n -C2 'zip\(' nemo_skills/dataset/speed-bench/prepare.py

As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 377 - 380, The list
construction at the end of the function that returns [first_question] + [...]
uses zip(external_dataset["question"][1:], external_dataset["options"][1:])
which can silently truncate mismatched sequences; update that call to
zip(external_dataset["question"][1:], external_dataset["options"][1:],
strict=True) so mismatched lengths raise immediately, and similarly audit other
zip(...) usages in this module (e.g., any other calls near
get_question_and_options) and add strict=True where paired sequences must align.
docs/evaluation/speculative-decoding.md (1)

68-72: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to both metrics code fences.

Both fenced blocks are missing a language tag (MD040). This was already flagged earlier and is still present.

💡 Proposed markdownlint fix
-```
+```text
 --------------------------------------------- speed-bench ----------------------------------------------
 evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
 pass@1          | 880         | 464        | 139         | 2.78                   | 69.38

-```
+```text
--------------------------------------------- speed-bench ----------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | spec_acceptance_length | spec_acceptance_rate
pass@1 | 880 | 463 | 104 | 2.37 | 45.52

Also applies to: 92-96

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` around lines 68 - 72, The two fenced
blocks that render the "---------------------------------------------
speed-bench ----------------------------------------------" table are missing
language identifiers; update each opening fence (the triple backticks before the
table rows) to include a language tag such as "text" or "ansi" so markdownlint
MD040 is satisfied. Locate the two code fences that begin with the speed-bench
header (the block showing evaluation_mode | num_entries | avg_tokens ...) and
change ``` to ```text for both occurrences.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/evaluation/speculative-decoding.md`:
- Line 9: Replace the non-descriptive link text "here" with meaningful,
accessible link text that describes the target (e.g., "Model-Optimizer
speculative decoding benchmark" or "SpecDec benchmark repository") for both
occurrences of the GitHub URL (the evaluation framework link to
NVIDIA/Model-Optimizer/examples/specdec_bench) so the link text is descriptive
and resolves the markdownlint MD059 warning.

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 187-201: The loop assembling stackselect answers fails to update
the running token count and initializes answers_to_add_stop to 0 causing an
off-by-one inclusion; fix by initializing answers_to_add_stop to -1, and inside
the for loop (in the block iterating answers with index i and variable
answer_to_add_tokens) when the new answer fits (all_tokens +
answer_to_add_tokens <= num_tokens) set answers_to_add_stop = i and increment
all_tokens by answer_to_add_tokens, otherwise break; retain the existing skip of
the correct answer (i == correct_answer_i) so the final answers_to_add selection
logic works correctly.
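A minimal sketch of the corrected loop described above (standalone and illustrative: `count_tokens` stands in for the tiktoken `encoder.encode` call, and the function signature mirrors the prompt rather than the actual `prepare.py` code):

```python
def select_answers_to_add(answers, correct_answer_i, all_tokens, num_tokens, count_tokens):
    """Return the last distractor index that fits the token budget, or -1.

    `count_tokens` is a stand-in for the tiktoken encoder used in prepare.py.
    """
    answers_to_add_stop = -1  # -1 means no distractor fits the budget
    for i, candidate_answer in enumerate(answers):
        if i == correct_answer_i:
            continue  # the correct answer is handled separately
        answer_to_add = f"A{i + 1}:\n\n{candidate_answer}\n\n"
        answer_to_add_tokens = count_tokens(answer_to_add)
        if all_tokens + answer_to_add_tokens > num_tokens:
            break
        answers_to_add_stop = i
        all_tokens += answer_to_add_tokens  # keep the running budget current
    return answers_to_add_stop
```

Initializing to `-1` distinguishes "nothing fits" from "only index 0 fits", and incrementing `all_tokens` on every accepted answer is what makes the budget check meaningful across iterations.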
- Around line 612-613: The code constructs output_path from args.output_dir and
calls dataset.to_json(output_path) without ensuring the directory exists; before
creating output_path, ensure args.output_dir (or output_path.parent) is created
(e.g., call pathlib.Path(...).mkdir(parents=True, exist_ok=True)) so writing via
dataset.to_json(output_path) cannot fail due to a missing directory. Reference:
output_path, args.output_dir, dataset.to_json.
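The directory guard can be sketched as a small wrapper (the `write_dataset_json` name and default filename are illustrative; the one essential line is the `mkdir(parents=True, exist_ok=True)` call before writing):

```python
from pathlib import Path


def write_dataset_json(dataset, output_dir, filename="test.jsonl"):
    """Ensure the output directory exists before dataset.to_json writes into it."""
    output_path = Path(output_dir) / filename
    # Safe whether or not the directory (or its parents) already exists.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    dataset.to_json(output_path)
    return output_path
```

Without the `mkdir` call, pointing `--output_dir` at a fresh nested path would fail with `FileNotFoundError` at write time.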

---

Duplicate comments:
In `@docs/evaluation/speculative-decoding.md`:
- Around line 68-72: The two fenced blocks that render the
"--------------------------------------------- speed-bench
----------------------------------------------" table are missing language
identifiers; update each opening fence (the triple backticks before the table
rows) to include a language tag such as "text" or "ansi" so markdownlint MD040
is satisfied. Locate the two code fences that begin with the speed-bench header
(the block showing evaluation_mode | num_entries | avg_tokens ...) and change
``` to ```text for both occurrences.

In `@nemo_skills/dataset/speed-bench/prepare.py`:
- Around line 415-417: Guard against empty turns before indexing: ensure you
check that the list turns is non-empty before calling
turns[0].startswith(TURNS_PLACEHOLDER). Update the conditional around
turns/TURNS_PLACEHOLDER in prepare.py so it first returns example if turns is
empty (e.g., check not turns or len(turns) == 0), and only then evaluate
startswith on turns[0]; keep the existing behavior of returning example when the
placeholder check fails.
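The guarded check described above can be sketched like this; `TURNS_PLACEHOLDER`, the function name, and the `example`/`turns` shapes are assumptions from the review prompt, not the actual `prepare.py` code.

```python
# Hypothetical sketch: return early on an empty turns list before
# indexing turns[0], then keep the existing placeholder check.
TURNS_PLACEHOLDER = "<|turns|>"  # assumed marker, not the real value

def maybe_fill_turns(example: dict, turns: list) -> dict:
    if not turns:  # guard the empty list before touching turns[0]
        return example
    if not turns[0].startswith(TURNS_PLACEHOLDER):  # existing behavior
        return example
    return {**example, "turns": turns}
```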
- Around line 377-380: The list construction at the end of the function that
returns [first_question] + [...] uses zip(external_dataset["question"][1:],
external_dataset["options"][1:]) which can silently truncate mismatched
sequences; update that call to zip(external_dataset["question"][1:],
external_dataset["options"][1:], strict=True) so mismatched lengths raise
immediately, and similarly audit other zip(...) usages in this module (e.g., any
other calls near get_question_and_options) and add strict=True where paired
sequences must align.

In `@nemo_skills/evaluation/evaluator/specdec.py`:
- Line 68: The code currently constructs eval_config =
SpecdecEvaluatorConfig(**cfg) but does not validate that required fields
(notably input_file) are present nor that specdec_stats conforms to the full
expected schema; add explicit validation immediately after construction to fail
fast: check eval_config.input_file is non-empty and raise a clear error if
missing, and validate eval_config.specdec_stats contains all required
keys/fields (or run the config's schema validator) and raise if any required
metric is absent or has wrong type; update the logic used around
SpecdecEvaluatorConfig (and the related blocks referenced near lines 73-75 and
80-93) to enforce these checks so user-supplied unsupported/partial parameters
are rejected early with informative errors.
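The fail-fast validation requested above can be sketched as follows. The field names mirror the review prompt, but the required-metric keys and the shape of the real `SpecdecEvaluatorConfig` are assumptions, not the actual `specdec.py` code.

```python
from dataclasses import dataclass, field

# Assumed required metric keys for illustration only.
REQUIRED_STATS = ("acceptance_rate", "acceptance_length")

@dataclass
class SpecdecEvaluatorConfig:
    input_file: str = ""
    specdec_stats: dict = field(default_factory=dict)

def build_config(cfg: dict) -> SpecdecEvaluatorConfig:
    eval_config = SpecdecEvaluatorConfig(**cfg)
    # Validate immediately after construction so bad user input
    # fails with a clear message instead of surfacing downstream.
    if not eval_config.input_file:
        raise ValueError("specdec evaluator requires a non-empty 'input_file'")
    missing = [k for k in REQUIRED_STATS if k not in eval_config.specdec_stats]
    if missing:
        raise ValueError(f"specdec_stats is missing required metrics: {missing}")
    return eval_config
```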

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 858317a and 7715161.

📒 Files selected for processing (3)
  • docs/evaluation/speculative-decoding.md
  • nemo_skills/dataset/speed-bench/prepare.py
  • nemo_skills/evaluation/evaluator/specdec.py

In all SD benchmarks we want to measure two qualitative metrics of draft accuracy/quality: acceptance length (AL) and acceptance rate (AR).
Another metric in this group is the conditional acceptance rate (or per-position acceptance rate), which measures the acceptance rate at a given draft position, conditioned on all previous tokens having been accepted.
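These three metrics can be computed from per-step draft outcomes. The sketch below is a hypothetical helper (not part of nemo_skills), assuming each decoding step records how many leading draft tokens the target model accepted, and using the common AL convention of counting the one bonus token the target emits per step.

```python
# Minimal sketch: AL, AR, and per-position (conditional) acceptance
# rate from a list of accepted-token counts, one entry per step.
def specdec_metrics(accepted_per_step, draft_len):
    num_steps = len(accepted_per_step)
    total_accepted = sum(accepted_per_step)

    # Acceptance length (AL): mean tokens produced per decoding step,
    # counting the one bonus token the target model always emits.
    al = total_accepted / num_steps + 1

    # Acceptance rate (AR): fraction of drafted tokens accepted.
    ar = total_accepted / (num_steps * draft_len)

    # Per-position acceptance rate: P(position i accepted | all
    # earlier positions accepted). Position i is "reached" in a step
    # only if at least i tokens were accepted in that step.
    per_position = []
    for i in range(draft_len):
        reached = sum(1 for a in accepted_per_step if a >= i)
        accepted = sum(1 for a in accepted_per_step if a >= i + 1)
        per_position.append(accepted / reached if reached else 0.0)
    return al, ar, per_position

al, ar, cond = specdec_metrics([3, 1, 2, 0], draft_len=3)
# al == 2.5, ar == 0.5, cond == [0.75, 2/3, 0.5]
```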

For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).

⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of “here”.

The links on Line 9 and Line 32 are non-descriptive and trigger markdownlint MD059.

💡 Proposed doc fix
-For more advanced evaluation of SD, including throughput and per-category metrics, please use the evaluation framework [here](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).
+For more advanced evaluation of SD, including throughput and per-category metrics, please use the [Model Optimizer speculative-decoding benchmark framework](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/specdec_bench).

-- Original benchmark source, is [here](https://huggingface.co/datasets/nvidia/SPEED-Bench).
+- Original benchmark source is the [SPEED-Bench dataset on Hugging Face](https://huggingface.co/datasets/nvidia/SPEED-Bench).

Also applies to: 32-32

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 9-9: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/speculative-decoding.md` at line 9, Replace the
non-descriptive link text "here" with meaningful, accessible link text that
describes the target (e.g., "Model-Optimizer speculative decoding benchmark" or
"SpecDec benchmark repository") for both occurrences of the GitHub URL (the
evaluation framework link to NVIDIA/Model-Optimizer/examples/specdec_bench) so
the link text is descriptive and resolves the markdownlint MD059 warning.

Comment on lines +612 to +613
output_path = args.output_dir / f"{config}.jsonl"
dataset.to_json(output_path)

⚠️ Potential issue | 🟡 Minor

Ensure output_dir exists before writing JSONL files.

Writing to a non-existent user-provided directory will fail at runtime.

🛠️ Proposed fix
 def prepare_data(args: argparse.Namespace) -> None:
@@
     configs = get_args(DATASET_CONFIG) if args.config == "all" else [args.config]
+    args.output_dir.mkdir(parents=True, exist_ok=True)

     for config in configs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/speed-bench/prepare.py` around lines 612 - 613, The code
constructs output_path from args.output_dir and calls
dataset.to_json(output_path) without ensuring the directory exists; before
creating output_path, ensure args.output_dir (or output_path.parent) is created
(e.g., call pathlib.Path(...).mkdir(parents=True, exist_ok=True)) so writing via
dataset.to_json(output_path) cannot fail due to a missing directory. Reference:
output_path, args.output_dir, dataset.to_json.

@gwarmstrong
Copy link
Collaborator

LGTM, let's just run GPU tests before merge

SPEED-Bench downloads dozens of large external HuggingFace datasets
during preparation, which exhausts the CI runner's disk space.

Signed-off-by: George Armstrong <georgea@nvidia.com>
@gwarmstrong gwarmstrong enabled auto-merge (squash) March 4, 2026 00:13
@gwarmstrong gwarmstrong merged commit 5ac8609 into main Mar 4, 2026
5 checks passed
@gwarmstrong gwarmstrong deleted the talora/speed-bench branch March 4, 2026 00:30
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Talor Abramovich <talora@nvidia.com>
Signed-off-by: talora <talora@nvidia.com>
Signed-off-by: Talor Abramovich <talor19@gmail.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: George Armstrong <georgea@nvidia.com>
Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>
Signed-off-by: dgitman <dgitman@nvidia.com>