Eval kit support #1239

Merged
Jorjeous merged 10 commits into main from eval-kit on Mar 6, 2026

Conversation


@Jorjeous Jorjeous commented Feb 12, 2026

Adds two new generation modules for running NeMo Skills benchmarks via VLMEvalKit's MultiModalMCore inference engine:

  • eval_kit: uses VLMEvalKit's native dataset/evaluation pipeline with NeMo Skills output formatting
  • mcore_skills: reads NeMo Skills JSONL data, runs inference through MultiModalMCore in-process with data parallelism, computes metrics using VLMEvalKit functions (e.g. asr_wer), and writes results in NeMo Skills format

Key changes:

  • New inference modules: eval_kit.py, mcore_skills.py
  • New dataset module: eval_kit/ for VLMEvalKit benchmark resolution
  • New metrics class: EvalKitMetrics for pre-computed VLMEvalKit metrics
  • Pipeline extensions: self-contained task support, METRICS_TYPE_OVERRIDE, eval_kit container/mounts handling
  • Lazy imports for sacrebleu to avoid missing-dep crashes

Summary by CodeRabbit

  • New Features

    • VLMEvalKit integration for end-to-end multimodal generation, incremental JSONL outputs, and a self-contained in-process generation mode.
    • Self-contained task support with per-task GPU allocation and serverless execution options.
  • Improvements

    • Pre-computed metrics loading for faster aggregation and reporting.
    • Expanded and normalized audio/translation evaluation handling.
    • Smarter pipeline packaging: env-prefix support, torchrun orchestration, container selection, extra package dir handling, and optional input-file workflows.
  • Documentation

    • Comprehensive docs added for VLMEvalKit usage, modes, and troubleshooting.

@Jorjeous Jorjeous changed the title from "Adds two new generation modules for running NeMo Skills benchmarks via…" to "Eval kit support" on Feb 12, 2026
@Jorjeous Jorjeous marked this pull request as draft February 12, 2026 18:01

@greptile-apps greptile-apps bot left a comment


17 files reviewed, 4 comments



coderabbitai bot commented Feb 12, 2026

📝 Walkthrough

Adds VLMEvalKit integration: dataset-level generation config, two generation backends (vLLM client and in-process Megatron/MCore), async JSONL writer, EvalKit metrics reader, pipeline wiring for self-contained generation tasks, and expanded ASR/translation evaluation handling.

Changes

Cohort / File(s) / Summary
EvalKit dataset integration
nemo_skills/dataset/eval_kit/__init__.py, nemo_skills/dataset/utils.py
Add dataset-level constants (GENERATION_MODULE, METRICS_TYPE, GENERATION_ARGS, NUM_SAMPLES, SKIP_INPUT_FILE) and get_extra_generation_args(benchmark); make get_default_dataset_module handle eval_kit.* dotted names via early return.
VLMEvalKit generation task
nemo_skills/inference/eval/eval_kit.py
New EvalKitConfig and EvalKitGenerationTask implementing dataset build, model init (mcore or vllm), rank-aware dataset prep, async JSONL writer, inference dispatch, evaluation, conversion to NeMo Skills output, and Hydra main.
In-process Megatron/MCore backend
nemo_skills/inference/mcore_skills.py
New MegatronMCoreConfig and MegatronMCoreGenerationTask providing in-process MultiModalMCore generation, prompt/tokenizer handling, per-rank outputs and merging, optional in-process evaluation, and Hydra entry.
Generation framework extensions
nemo_skills/inference/generate.py, nemo_skills/inference/factory.py
Add GenerationTask declarative attributes CONTAINER_KEY, USE_TORCHRUN, and classmethods is_self_contained, get_env_prefix, get_extra_package_dirs; add GenerationType.mcore_skills mapping.
Pipeline orchestration and batching
nemo_skills/pipeline/eval.py, nemo_skills/pipeline/utils/eval.py, nemo_skills/pipeline/utils/generation.py
Add _apply_task_overrides and _resolve_generation_task_class; extend BenchmarkArgs (optional input_file, self_contained_task, num_gpus, generation_task_class); support per-benchmark self-contained tasks, per-benchmark GPU allocation, extra package dirs, and optional omission of input_file when building commands.
Metrics: EvalKit reader & registration
nemo_skills/evaluation/metrics/eval_kit_metrics.py, nemo_skills/evaluation/metrics/map_metrics.py
Add EvalKitMetrics that reads pre-computed eval_kit_metrics.json (instance and shared-file support) and register it in METRICS_MAP under "eval_kit".
Evaluation/evaluator tweaks
nemo_skills/evaluation/evaluator/audio.py, nemo_skills/evaluation/metrics/translation_metrics.py
Normalize ASR/translation task-type handling, add ASR_LEADERBOARD and unified translation types, handle missing generations, and move corpus_bleu import into get_metrics.
Docs & requirements placeholder
docs/evaluation/eval-kit.md, docs/evaluation/index.md, requirements/eval-kit.txt
Add comprehensive VLMEvalKit documentation and explanatory lines in requirements placeholder describing installation at job start.

Sequence Diagram(s)

sequenceDiagram
    participant Pipeline as Eval Pipeline
    participant Resolver as Task Resolver
    participant Dataset as VLMEvalKit Dataset
    participant Model as Model Init (mcore/vLLM)
    participant Generator as Generation Task
    participant Writer as Async JSONL Writer
    participant Evaluator as VLMEvalKit Eval
    participant Converter as Result Converter

    Pipeline->>Resolver: resolve generation_task_class, flags, extra_args
    Resolver-->>Pipeline: class, self_contained, num_gpus, extra_args
    Pipeline->>Dataset: request/build dataset (rank-aware)
    Dataset-->>Pipeline: dataset ready / metadata
    Pipeline->>Model: initialize model interface (mcore or vLLM)
    Model-->>Generator: model client/interface
    Generator->>Writer: start async writer thread
    loop per-sample
        Generator->>Generator: run inference -> prediction
        Generator->>Writer: enqueue prediction
    end
    Generator->>Writer: stop & flush
    Writer-->>Generator: final JSONL
    Generator->>Evaluator: run eval_kit evaluation
    Evaluator-->>Converter: metrics + ordered results
    Converter-->>Pipeline: NeMo Skills JSONL + eval_kit_metrics.json

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

enhancement, run GPU tests

Suggested reviewers

  • melllinia
  • gwarmstrong
  • Kipok
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 58.06%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Title check — ❓ Inconclusive: the PR title "Eval kit support" is vague and does not clearly describe the main changes. Consider a more descriptive title, such as "Add VLMEvalKit integration with MultiModalMCore in-process and vLLM inference modes" or "Integrate VLMEvalKit inference pipeline with NeMo Skills".
✅ Passed checks (1 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)

131-131: ⚠️ Potential issue | 🟡 Minor

Remove debug print statement.

This looks like a leftover debug artifact. Either remove it or replace with proper logging.

-        print(f"evaluator: {evaluator}")
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 30-33: The current top-level try/except hides import errors for
ComputeEvalEvaluator; instead capture the ImportError into a module-level
variable (e.g. _compute_eval_import_error) and set ComputeEvalEvaluator = None,
then in get_evaluator_class (or the evaluator registration lookup) check if
eval_type == "compute-eval" and if ComputeEvalEvaluator is None raise a clear
ImportError that includes _compute_eval_import_error; this defers the failure to
the point of use and gives an actionable message when someone requests the
"compute-eval" evaluator.
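A minimal sketch of that deferral pattern; the module path `compute_eval` and the function shape are assumptions for illustration:

```python
_compute_eval_import_error = None
try:
    from compute_eval import ComputeEvalEvaluator  # optional dependency (assumed path)
except ImportError as e:
    ComputeEvalEvaluator = None
    _compute_eval_import_error = e


def get_evaluator_class(eval_type: str):
    # Failure is deferred to the point of use, with an actionable message
    # that includes the original import error.
    if eval_type == "compute-eval":
        if ComputeEvalEvaluator is None:
            raise ImportError(
                "compute-eval evaluator requested but its import failed: "
                f"{_compute_eval_import_error}"
            )
        return ComputeEvalEvaluator
    raise ValueError(f"unknown eval_type: {eval_type!r}")
```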

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 264-275: In _flush_pkl_to_jsonl, replace the broad "except
Exception" around pickle.load with a narrower handler for the transient errors
that indicate a mid-write pickle (e.g., (EOFError, pickle.UnpicklingError,
BlockingIOError)) so those are safely skipped; for other unexpected errors
(e.g., PermissionError, MemoryError) let them propagate or log them explicitly
before re-raising so they aren't silently swallowed—update the exception clause
in _flush_pkl_to_jsonl accordingly and include a clear processLogger.error(...)
call when re-raising non-transient exceptions.

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 467-522: _evaluate_results currently always imports
vlmeval.dataset.avlm.utils.asr_wer and writes eval_kit_metrics.json with only
WER which is wrong for non-ASR datasets; update the method to consult a config
flag (e.g., self.cfg.metrics_type or self.cfg.dataset_type) or a new
self.cfg.eval_function setting before importing/using asr_wer so you only
compute WER when the dataset is ASR-type, otherwise skip creating
eval_kit_metrics.json or call a configurable evaluator; reference the
METRICS_TYPE_OVERRIDE constant and the EvalKitMetrics consumer to ensure the
written file matches the selected evaluator, and add a clear guard in
_evaluate_results (or load evaluator via a registry/lookup) to avoid the broad
Exception path producing meaningless metrics.

In `@nemo_skills/pipeline/eval.py`:
- Around line 40-67: The code mixes direct access and .get() for
cluster_config["containers"] in _apply_task_overrides which is inconsistent;
update the membership check to use direct access so failures are explicit.
Replace cluster_config.get("containers", {}) with cluster_config["containers"]
in the for-loop that selects container (i.e., change if key and key in
cluster_config.get("containers", {}): to if key and key in
cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 32-42: The _resolve_generation_task_class function currently
swallows all exceptions; change its error handling to only catch ImportError and
ModuleNotFoundError (so syntax/runtime errors in the module propagate) and, when
catching these import-related errors, log a warning including the exception
details and the module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.
🧹 Nitpick comments (13)
nemo_skills/evaluation/evaluator/audio.py (2)

499-516: Mutable _TRANSLATION_TYPES set is fragile if ever hoisted to module scope.

_TRANSLATION_TYPES is mutated on line 504 (_TRANSLATION_TYPES.add(task_type)). This works correctly today because it's re-created on every call, but the _ALL_CAPS naming convention strongly suggests a module-level constant. If a future refactor moves it to module scope (like _FAILURE_RESPONSES or VALID_NORMALIZATION_MODES above), the set would accumulate task types across calls — a subtle, hard-to-detect bug.

Consider making the intent clearer by either:

  1. Keeping the sets local but using lowercase naming (asr_types, translation_types), or
  2. Making the module-level sets truly immutable (frozensets) and building the union inline.
Option 2: immutable module-level sets + inline union

Define at module level:

_ASR_TYPES = frozenset({"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"})
_TRANSLATION_TYPES = frozenset({"AST", "Translation"})

Then inside evaluate_sample:

-    _ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"}
-    _TRANSLATION_TYPES = {"AST", "Translation"}
-    # AudioBench speech translation types: ST-{src}-{tgt}
-    if task_type.startswith("ST-"):
-        _TRANSLATION_TYPES.add(task_type)
-
-    if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
+    is_translation = task_type in _TRANSLATION_TYPES or task_type.startswith("ST-")
+    is_asr = task_type in _ASR_TYPES
+
+    if (is_asr or is_translation or task_type == "CER") and not generation:

And similarly replace task_type in _TRANSLATION_TYPES checks with is_translation.


557-562: MathQA exact-match may be too strict for numerical answers.

Pure string equality after lowercasing won't match equivalent representations like "3.0" vs "3", "1/2" vs "0.5", or whitespace/formatting differences in expressions. If the AudioBench MathQA dataset guarantees canonical answer forms, this is fine — but worth a brief comment in the code to document that assumption.
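If relaxing the exact match were ever desired, one hedged option (not what the PR does) is to compare numerically when both answers parse as numbers, falling back to string equality otherwise:

```python
from fractions import Fraction


def answers_match(pred: str, ref: str) -> bool:
    pred, ref = pred.strip().lower(), ref.strip().lower()
    try:
        # Fraction parses "3", "3.0", and "1/2" into exact rationals,
        # so "3.0" == "3" and "1/2" == "0.5" compare equal.
        return Fraction(pred) == Fraction(ref)
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers: plain string equality, as today.
        return pred == ref
```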

nemo_skills/inference/generate.py (1)

296-306: Consider extracting the shared get_env_prefix / get_extra_package_dirs logic into a mixin or the base class.

Both mcore_skills.py (lines 143–164) and eval_kit.py (lines 128–149) have identical implementations of get_env_prefix() and get_extra_package_dirs(). Since the base class already defines these hooks, the shared VLMEvalKit environment setup could live in a common mixin (e.g., VLMEvalKitMixin) or as a utility that both subclasses delegate to, avoiding the copy-paste.
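The suggested extraction could look roughly like this; the mixin name and the body of each hook are placeholders, not the PR's real values:

```python
class VLMEvalKitMixin:
    """Shared VLMEvalKit environment setup for both generation tasks."""

    @classmethod
    def get_env_prefix(cls) -> str:
        # Placeholder: whatever env vars VLMEvalKit jobs need up front.
        return "export LMUData=/data/lmudata && "

    @classmethod
    def get_extra_package_dirs(cls) -> list:
        # Placeholder: extra source dirs to package with the job.
        return ["VLMEvalKit"]


class EvalKitGenerationTask(VLMEvalKitMixin):
    pass


class MegatronMCoreGenerationTask(VLMEvalKitMixin):
    pass
```

Both subclasses now inherit one implementation, so the two copies cannot drift apart.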

nemo_skills/dataset/eval_kit/__init__.py (1)

33-42: Add type hints to the function signature.

As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Suggested fix
-def get_extra_generation_args(benchmark):
+def get_extra_generation_args(benchmark: str) -> str:
nemo_skills/evaluation/metrics/eval_kit_metrics.py (3)

42-44: **kwargs is silently swallowed — unsupported arguments won't raise errors.

get_metrics() in map_metrics.py forwards **kwargs from the user's metrics_kwargs. Silently discarding them here means typos or invalid metric options will be ignored rather than failing. Consider either validating that no unexpected kwargs are passed or removing **kwargs entirely if no extra arguments are needed. As per coding guidelines, "Avoid silently ignoring unused user-passed parameters".

Suggested fix
-    def __init__(self, **kwargs):
+    def __init__(self, **kwargs):
+        if kwargs:
+            LOG.warning("EvalKitMetrics ignores extra kwargs: %s", list(kwargs.keys()))
         super().__init__(compute_no_answer=False)

Or, if no extra kwargs should ever be passed:

-    def __init__(self, **kwargs):
+    def __init__(self):
         super().__init__(compute_no_answer=False)

56-58: update() skips super().update(predictions) — intentional but worth documenting inline.

BaseMetrics.update() handles token counting, timing stats, and max_k tracking. Skipping it means those metrics will be absent. This is correct for pre-computed VLMEvalKit aggregates, but a brief inline comment would help future maintainers understand the deliberate deviation.


38-40: Class-level mutable state _shared_metrics_file persists across test runs.

_shared_metrics_file is a class variable that survives across multiple instantiations and test cases. If tests create EvalKitMetrics instances for different benchmarks, a stale path from a prior test could leak. Consider resetting it in __init__ or setup() when an instance-level path is found, or documenting the expected lifecycle.
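One hedged way to avoid the stale-path leak is to reset both references whenever the candidate file is absent; this is a simplified stand-in for the real class:

```python
from pathlib import Path


class EvalKitMetrics:
    _shared_metrics_file = None  # class-level cache shared across instances

    def setup(self, input_files: list):
        candidate = Path(input_files[0]).parent / "eval_kit_metrics.json"
        if candidate.exists():
            self.eval_kit_metrics_file = candidate
            EvalKitMetrics._shared_metrics_file = candidate
        else:
            # Explicitly clear both so a path cached by a previous
            # benchmark cannot leak into this run.
            self.eval_kit_metrics_file = None
            EvalKitMetrics._shared_metrics_file = None
```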

nemo_skills/pipeline/utils/eval.py (2)

365-379: Duplicate module import logic — _resolve_generation_task_class vs lines 461-474.

_resolve_generation_task_class (lines 32-42) performs the same import-and-get-GENERATION_TASK_CLASS as lines 461-474 later in prepare_eval_commands. The inline version raises on missing GENERATION_TASK_CLASS, while the helper silently returns None. Consider consolidating into one function with an option to raise or return None.
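A hedged sketch of that consolidation: one resolver with a `required` switch replacing the duplicated inline logic (name and signature are suggestions, not the PR's code):

```python
import importlib


def resolve_generation_task_class(module_name: str, required: bool = False):
    """Import module_name and return its GENERATION_TASK_CLASS.

    With required=False, import failures and a missing attribute yield None;
    with required=True, both raise so misconfiguration fails loudly.
    """
    try:
        module = importlib.import_module(module_name)
    except (ImportError, ModuleNotFoundError):
        if required:
            raise
        return None
    task_class = getattr(module, "GENERATION_TASK_CLASS", None)
    if task_class is None and required:
        raise ValueError(f"{module_name} does not define GENERATION_TASK_CLASS")
    return task_class
```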


413-417: Forcing num_jobs = total_evals when self-contained tasks are present may over-parallelize mixed workloads.

If the benchmark list contains both self-contained and server-based benchmarks, this forces every benchmark into its own job. The comment at line 413-414 explains the rationale for self-contained tasks, but it may be an unexpected side effect for the non-self-contained benchmarks in the same eval run.

Consider documenting this behavior in the --num_jobs help text or logging which benchmarks are affected.

nemo_skills/inference/eval/eval_kit.py (2)

162-227: Dataset is built twice on rank 0 when world_size > 1.

Lines 218 and 221 both call build_dataset(cfg.vlm_dataset, **dataset_kwargs). The first is rank-0 only (for download), the second is on all ranks. Rank 0 redundantly builds the dataset a second time. If dataset construction is expensive (beyond just downloading), consider caching the result from the first call.

Suggested optimization
         if world_size > 1:
             import torch.distributed as dist

             if rank == 0:
-                build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
             dist.barrier()
+            if rank != 0:
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+        else:
+            self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

-        self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)

337-361: Hardcoded dataset name lists will silently become stale as VLMEvalKit evolves.

These lists mirror VLMEvalKit/run.py but must be manually kept in sync. If VLMEvalKit adds a new dataset that needs special kwargs (e.g., nframe), it won't get them here. Consider importing or referencing these lists from VLMEvalKit if possible, or adding a comment with the VLMEvalKit version/commit these were copied from for traceability.

nemo_skills/inference/mcore_skills.py (2)

126-165: get_env_prefix() and get_extra_package_dirs() are duplicated verbatim from eval_kit.py.

Both EvalKitGenerationTask (in eval_kit.py, lines 128-150) and MegatronMCoreGenerationTask share identical implementations for get_env_prefix() and get_extra_package_dirs(). Extract these into a shared mixin or utility to avoid drift.


406-428: Redirecting sys.stdout to /dev/null is fragile — prefer suppressing at the logging/tqdm level.

Replacing sys.stdout globally on non-primary DP ranks suppresses all stdout, including potential error messages or debug info that isn't from VLMEvalKit. This pattern can also confuse debuggers and profilers. Consider using contextlib.redirect_stdout for a cleaner scope, or configuring VLMEvalKit's verbosity directly if possible.

That said, the finally cleanup at lines 425-428 is correct and ensures sys.stdout is restored even on exceptions.
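A scoped alternative along the lines suggested, using `contextlib.redirect_stdout`; the function name and rank flag are illustrative:

```python
import contextlib
import os


def run_inference_quietly(fn, *args, is_primary_rank: bool = True, **kwargs):
    """Run fn, discarding its stdout only on non-primary data-parallel ranks."""
    if is_primary_rank:
        return fn(*args, **kwargs)
    # redirect_stdout restores sys.stdout on exit, even if fn raises,
    # so the suppression cannot leak past this call.
    with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
        return fn(*args, **kwargs)
```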

@Jorjeous Jorjeous marked this pull request as ready for review March 5, 2026 15:15
@Jorjeous Jorjeous requested a review from melllinia March 5, 2026 15:23
Jorjeous and others added 5 commits March 5, 2026 07:23
Adds two new generation modules for running NeMo Skills benchmarks via VLMEvalKit's MultiModalMCore inference engine:

- eval_kit: uses VLMEvalKit's native dataset/evaluation pipeline with
  NeMo Skills output formatting
- mcore_skills: reads NeMo Skills JSONL data, runs inference through
  MultiModalMCore in-process with data parallelism, computes metrics
  using VLMEvalKit functions (e.g. asr_wer), and writes results in
  NeMo Skills format

Key changes:
- New inference modules: eval_kit.py, mcore_skills.py
- New dataset module: eval_kit/ for VLMEvalKit benchmark resolution
- New metrics class: EvalKitMetrics for pre-computed VLMEvalKit metrics
- Pipeline extensions: self-contained task support, METRICS_TYPE_OVERRIDE,
  eval_kit container/mounts handling
- Lazy imports for sacrebleu to avoid missing-dep crashes

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
- Remove redundant metric_type field from BenchmarkArgs; consolidate
  into the single metrics_type field that the summarize step reads.
  METRICS_TYPE_OVERRIDE from task classes now writes to metrics_type
  so the override actually takes effect for non-eval_kit benchmarks.
- Sort job_benchmarks before passing to job_batches for deterministic
  task names across runs.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
EvalKitConfig uses flat fields (server_url, model_name) while the
pipeline's configure_client() injects nested ++server.* Hydra
overrides.  This mismatch crashes Hydra at startup for vllm mode.

Fix by making eval_kit always self-contained.  For vllm mode the
user passes ++server_url and ++model_name directly as extra args.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (3)
nemo_skills/inference/mcore_skills.py (1)

480-523: ⚠️ Potential issue | 🟠 Major

_evaluate_results() is ASR-specific and masks failures.

This path always computes WER via asr_wer and then swallows all other runtime errors. That can silently generate incorrect/missing eval_kit_metrics.json for non-ASR use cases.

#!/bin/bash
# Verify whether mcore_skills is used outside ASR contexts and confirm hardcoded WER path.

set -euo pipefail

echo "== mcore_skills wiring =="
rg -n "mcore_skills|METRICS_TYPE_OVERRIDE" nemo_skills -g '*.py'

echo
echo "== hardcoded evaluator in mcore_skills =="
rg -n "asr_wer|_evaluate_results|except Exception" nemo_skills/inference/mcore_skills.py -C 3

echo
echo "== dataset modules pointing to mcore/eval-kit generation =="
rg -n "GENERATION_MODULE|mcore_skills|eval_kit" nemo_skills/dataset -g '*.py'

Based on learnings: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 480 - 523, The
_evaluate_results() block currently always calls asr_wer and swallows all
runtime errors via a broad except Exception, which hides failures and misapplies
WER for non‑ASR tasks; change it to only run the asr_wer path when the
job/dataset/metrics type explicitly indicates ASR (check METRICS_TYPE_OVERRIDE
or task/dataset metadata before calling asr_wer), remove or narrow the broad
except Exception (either let unexpected exceptions propagate or re-raise after
logging), and ensure failures produce no silent missing eval_kit_metrics.json
(i.e., only write eval_kit_metrics.json when asr_wer succeeded); update symbols:
_evaluate_results, asr_wer, eval_kit_metrics.json, METRICS_TYPE_OVERRIDE
accordingly.
nemo_skills/inference/eval/eval_kit.py (1)

277-282: ⚠️ Potential issue | 🟠 Major

Narrow pickle-read exception handling to transient cases only.

Catching Exception here can hide non-transient failures and silently mask broken runs.

Proposed fix
-        except Exception:
-            # pkl may be mid-write; skip this cycle
-            return
+        except (EOFError, pickle.UnpicklingError, BlockingIOError):
+            # pkl may be mid-write; skip this cycle
+            return
Based on learnings: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/eval_kit.py` around lines 277 - 282, The current
blanket except in the pickle load block silently swallows all errors; narrow it
to transient/expected read failures only by catching specific exceptions (e.g.,
EOFError, pickle.UnpicklingError, and OSError) around the with open(pkl_path,
"rb") / pickle.load(f) call and keep the early return for those cases, but allow
any other unexpected exceptions to propagate (i.e., remove the generic except
Exception and only return on the specific transient exceptions).
nemo_skills/pipeline/utils/eval.py (1)

33-43: ⚠️ Potential issue | 🟠 Major

Don’t swallow non-import failures when resolving generation task classes.

except Exception masks real module bugs (syntax/runtime errors), causing silent fallback to None and downstream misconfiguration.

Proposed fix
 def _resolve_generation_task_class(module_name: str):
@@
-    except Exception:
+    except (ImportError, ModuleNotFoundError) as e:
+        LOG.warning("Could not import generation module '%s': %s", module_name, e)
         return None
#!/bin/bash
set -euo pipefail
# Verify broad exception handling in generation task resolution.
rg -n -C3 --type=py 'def _resolve_generation_task_class|except Exception' nemo_skills/pipeline/utils/eval.py

Based on learnings: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 33 - 43, The
_resolve_generation_task_class function currently swallows all exceptions;
change the broad "except Exception" to catch only import-related failures (e.g.,
ImportError, ModuleNotFoundError, FileNotFoundError) so real module errors
(syntax/runtime) propagate; keep the existing return None behavior inside that
narrow except block and let any other exceptions raised by import_from_path or
importlib.import_module bubble up. Reference: _resolve_generation_task_class,
import_from_path, importlib.import_module, and GENERATION_TASK_CLASS.
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/audio.py (1)

515-532: Consider adding wer_c and wer_pc defaults for ASR-PC missing generation.

When task_type is "ASR-PC" and generation is missing, this returns only wer: 1.0. However, the normal ASR-PC path (line 537-543) returns additional metrics: wer_c, wer_pc, and per.

Looking at the metrics aggregation in audio_metrics.py, it filters with if "wer_c" in pred and pred["wer_c"] is not None, meaning missing-generation samples won't contribute to wer_c/wer_pc averages. This could cause those metrics to appear artificially better than wer since they exclude the worst samples.

If parity is desired, you could return default worst-case values for all metrics:

if task_type == "ASR-PC":
    return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/audio.py` around lines 515 - 532, When
handling missing generation in the branch that checks task_type in (_ASR_TYPES |
_TRANSLATION_TYPES | {"CER"}) and not generation, add worst-case defaults for
ASR-PC so it returns all ASR-PC metrics instead of only "wer": specifically
detect if task_type == "ASR-PC" and return {**base, "wer": 1.0, "wer_c": 1.0,
"wer_pc": 1.0, "per": 1.0}; keep the existing CER and translation branches
unchanged and ensure you reference the existing
_ASR_TYPES/_TRANSLATION_TYPES/task_type variables and the base dict when
constructing the return value.
nemo_skills/pipeline/utils/eval.py (1)

397-403: Reuse the already-resolved generation_task_class to reduce duplication.

You store ba.generation_task_class here, but the loop later re-imports modules again per eval item. Reusing the cached class would simplify flow and avoid redundant imports.

As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks..."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 397 - 403, The loop that
sets ba.generation_task_class using _resolve_generation_task_class already
resolves and caches the task class; update the later per-eval-item logic to
reuse ba.generation_task_class instead of re-importing or calling
_resolve_generation_task_class again. Concretely, in the code that iterates eval
items (where it currently recomputes generation task classes from
generation_module or ba.generation_module), first check ba.generation_task_class
and use it when present, falling back to _resolve_generation_task_class only if
the cached attribute is None; this removes duplicate imports and extra
conditional checks while preserving existing behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/dataset/eval_kit/__init__.py`:
- Around line 39-42: The code currently returns an empty string when the
incoming benchmark string lacks a dot, silently dropping the required
vlm_dataset; update the branch that handles the missing suffix (the snippet that
checks "if '.' in benchmark" and returns "" otherwise) to instead raise a clear
exception (e.g., ValueError) indicating the expected format
"eval_kit.<dataset_name>" and include the actual benchmark value in the message
so callers fail fast; keep the existing behavior of extracting sub =
benchmark.split(".", 1)[1] and returning f" ++vlm_dataset={sub} " when a dot is
present.
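The fix described above can be sketched as follows (a hedged reconstruction of the function, following the prompt rather than the merged code):

```python
def get_extra_generation_args(benchmark: str) -> str:
    # Fail fast on a malformed benchmark name instead of silently
    # returning an empty string and dropping the required vlm_dataset.
    if "." not in benchmark:
        raise ValueError(
            f"expected benchmark in 'eval_kit.<dataset_name>' format, got {benchmark!r}"
        )
    sub = benchmark.split(".", 1)[1]
    return f" ++vlm_dataset={sub} "
```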

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 46-55: In EvalKitMetrics.setup, if the candidate metrics file is
not found, clear any previous references by setting both
self.eval_kit_metrics_file and EvalKitMetrics._shared_metrics_file to None (or
appropriate empty value) so a stale path isn't reused; locate the setup method
in the EvalKitMetrics class and ensure the branch where candidate.exists() is
false explicitly resets these attributes (and still sets them when candidate
exists).

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 124-127: is_self_contained currently only checks extra_arguments
for "++model_type=mcore" and returns False when extra_arguments is empty,
misclassifying defaults; update is_self_contained(cls, extra_arguments: str =
"") to first look for a "++model_type=" token in extra_arguments and, if found,
return True when its value is "mcore", otherwise if no token is present consult
EvalKitConfig.model_type (the config default) and return True when that default
equals "mcore"; reference the is_self_contained method and
EvalKitConfig.model_type when making this change.
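A sketch of the suggested logic; `DEFAULT_MODEL_TYPE` stands in for `EvalKitConfig.model_type` and is an assumption here:

```python
DEFAULT_MODEL_TYPE = "mcore"  # stand-in for EvalKitConfig.model_type


def is_self_contained(extra_arguments: str = "") -> bool:
    # An explicit ++model_type= token in the extra args wins...
    for token in extra_arguments.split():
        if token.startswith("++model_type="):
            return token.split("=", 1)[1] == "mcore"
    # ...otherwise fall back to the config default, so an empty
    # extra_arguments string is no longer misclassified.
    return DEFAULT_MODEL_TYPE == "mcore"
```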

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 404-405: The per-rank output filename using
output_rank{dp_rank}.jsonl can collide across concurrent runs; update the code
that constructs rank_file (and the similar constructions used at the other
occurrences where output_rank{dp_rank}.jsonl is created) to include a run-unique
suffix (e.g., uuid, process id, or timestamp) or use a safe temporary-file API
so each run gets a unique per-rank path; specifically change the places that set
rank_file (and the two other spots noted) to append a unique_run_id (or obtain a
tempfile from tempfile.NamedTemporaryFile/tmpdir) so concurrent seeds/chunks
cannot overwrite each other.
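
One way to build a collision-free per-rank path, assuming a hypothetical `rank_output_path` helper (the name pattern is illustrative, not the repo's exact one):

```python
import uuid
from pathlib import Path


def rank_output_path(output_dir, dp_rank, run_id=None):
    # A run-unique suffix keeps concurrent seeds/chunks from clobbering
    # each other's output_rank{dp_rank}.jsonl files.
    run_id = run_id or uuid.uuid4().hex[:8]
    return Path(output_dir) / f"output_rank{dp_rank}.{run_id}.jsonl"
```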

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 526-527: Replace the silent .get() calls that convert None to
empty strings with direct key access so missing keys fail loudly: change model =
server_parameters.get("model", "") and server_type =
server_parameters.get("server_type", "") to model = server_parameters["model"]
and server_type = server_parameters["server_type"]; this aligns with the rest of
the file’s direct bracket access to server_parameters and prevents producing
invalid client overrides like "++model_name=".

---

Duplicate comments:
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 277-282: The current blanket except in the pickle load block
silently swallows all errors; narrow it to transient/expected read failures only
by catching specific exceptions (e.g., EOFError, pickle.UnpicklingError, and
OSError) around the with open(pkl_path, "rb") / pickle.load(f) call and keep the
early return for those cases, but allow any other unexpected exceptions to
propagate (i.e., remove the generic except Exception and only return on the
specific transient exceptions).
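
A sketch of the narrowed handler, with only the transient read failures swallowed (a pickle mid-write typically surfaces as `EOFError` or `UnpicklingError`); anything else propagates:

```python
import pickle


def try_load_snapshot(pkl_path):
    """Return the unpickled snapshot, or None if the file is mid-write."""
    try:
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
    except (EOFError, pickle.UnpicklingError, OSError):
        # Expected while the writer holds the file; caller retries next cycle.
        return None
```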

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 480-523: The _evaluate_results() block currently always calls
asr_wer and swallows all runtime errors via a broad except Exception, which
hides failures and misapplies WER for non‑ASR tasks; change it to only run the
asr_wer path when the job/dataset/metrics type explicitly indicates ASR (check
METRICS_TYPE_OVERRIDE or task/dataset metadata before calling asr_wer), remove
or narrow the broad except Exception (either let unexpected exceptions propagate
or re-raise after logging), and ensure failures produce no silent missing
eval_kit_metrics.json (i.e., only write eval_kit_metrics.json when asr_wer
succeeded); update symbols: _evaluate_results, asr_wer, eval_kit_metrics.json,
METRICS_TYPE_OVERRIDE accordingly.

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 33-43: The _resolve_generation_task_class function currently
swallows all exceptions; change the broad "except Exception" to catch only
import-related failures (e.g., ImportError, ModuleNotFoundError,
FileNotFoundError) so real module errors (syntax/runtime) propagate; keep the
existing return None behavior inside that narrow except block and let any other
exceptions raised by import_from_path or importlib.import_module bubble up.
Reference: _resolve_generation_task_class, import_from_path,
importlib.import_module, and GENERATION_TASK_CLASS.
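
The narrowing could be sketched as below, assuming a simplified resolver (the real one also supports `import_from_path`): missing modules return None, but a syntax or runtime error inside an imported module still propagates:

```python
import importlib


def resolve_generation_task_class(module_name):
    try:
        module = importlib.import_module(module_name)
    except (ImportError, ModuleNotFoundError, FileNotFoundError):
        # Module genuinely absent: fall back to default behavior.
        return None
    # Real bugs inside the module raised above; only lookup remains.
    return getattr(module, "GENERATION_TASK_CLASS", None)
```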

---

Nitpick comments:
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 515-532: When handling missing generation in the branch that
checks task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not
generation, add worst-case defaults for ASR-PC so it returns all ASR-PC metrics
instead of only "wer": specifically detect if task_type == "ASR-PC" and return
{**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}; keep the existing
CER and translation branches unchanged and ensure you reference the existing
_ASR_TYPES/_TRANSLATION_TYPES/task_type variables and the base dict when
constructing the return value.
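
A sketch of the worst-case defaults per task type. The `_ASR_TYPES`/`_TRANSLATION_TYPES` memberships here are illustrative assumptions, not the repo's exact sets:

```python
_TRANSLATION_TYPES = {"ST-EN", "ST-ZH"}  # illustrative values


def missing_generation_metrics(task_type):
    base = {"is_correct": False, "error": "missing_generation"}
    if task_type in _TRANSLATION_TYPES:
        return {**base, "bleu": 0.0}
    if task_type == "CER":
        return {**base, "cer": 1.0}
    if task_type == "ASR-PC":
        # ASR-PC aggregation expects all four metrics, not just wer.
        return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
    return {**base, "wer": 1.0}  # ASR / ASR-ZH
```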

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 397-403: The loop that sets ba.generation_task_class using
_resolve_generation_task_class already resolves and caches the task class;
update the later per-eval-item logic to reuse ba.generation_task_class instead
of re-importing or calling _resolve_generation_task_class again. Concretely, in
the code that iterates eval items (where it currently recomputes generation task
classes from generation_module or ba.generation_module), first check
ba.generation_task_class and use it when present, falling back to
_resolve_generation_task_class only if the cached attribute is None; this
removes duplicate imports and extra conditional checks while preserving existing
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 434bc35b-a3d5-475b-b898-e9b449c88afe

📥 Commits

Reviewing files that changed from the base of the PR and between b4629e6 and 86b40cb.

📒 Files selected for processing (13)
  • nemo_skills/dataset/eval_kit/__init__.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/metrics/eval_kit_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/translation_metrics.py
  • nemo_skills/inference/eval/eval_kit.py
  • nemo_skills/inference/factory.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/mcore_skills.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • nemo_skills/pipeline/eval.py

Jorjeous added 2 commits March 5, 2026 07:26
Only catch ImportError/ModuleNotFoundError instead of bare Exception.
Syntax errors, missing dependencies, and other real bugs in generation
modules now propagate instead of being silently swallowed.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member Author

Jorjeous commented Mar 5, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (4)
nemo_skills/dataset/eval_kit/__init__.py (1)

39-45: ⚠️ Potential issue | 🟠 Major

Also fail when the dataset suffix is empty (eval_kit.).

Line 44 can produce ++vlm_dataset= with an empty value. Please validate that the suffix exists.

Proposed fix
 def get_extra_generation_args(benchmark):
@@
-    if "." not in benchmark:
+    if not benchmark.startswith("eval_kit.") or "." not in benchmark:
         raise ValueError(
             f"eval_kit benchmark must be in 'eval_kit.<dataset_name>' format, got '{benchmark}'. "
             f"Example: eval_kit.MMBench_DEV_EN, eval_kit.LibriSpeech_test_clean"
         )
     sub = benchmark.split(".", 1)[1]
+    if not sub:
+        raise ValueError(
+            f"eval_kit benchmark must include a dataset name after 'eval_kit.', got '{benchmark}'."
+        )
     return f" ++vlm_dataset={sub} "

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/eval_kit/__init__.py` around lines 39 - 45, The function
currently accepts a benchmark string and splits on the first dot to produce sub
and return " ++vlm_dataset={sub} ", but it does not validate that the suffix
exists (so "eval_kit." yields an empty value); update the validation for the
input variable benchmark to ensure there is a non-empty suffix after the dot
(i.e., after benchmark.split(".", 1)[1]) and raise a ValueError with a clear
message if the suffix is empty; modify the logic around the existing benchmark
check and the variable sub to perform this empty-string check before returning
the formatted " ++vlm_dataset={sub} " value.
nemo_skills/evaluation/metrics/eval_kit_metrics.py (1)

45-56: ⚠️ Potential issue | 🟠 Major

Clear instance-level metrics path when setup does not find eval_kit_metrics.json.

Line 55 resets only EvalKitMetrics._shared_metrics_file. self.eval_kit_metrics_file can still point to a stale file and override the reset in get_metrics().

Proposed fix
 def setup(self, input_files):
     """Find the eval_kit_metrics.json in the same directory as the input files."""
+    self.eval_kit_metrics_file = None
+    EvalKitMetrics._shared_metrics_file = None
     if input_files:
         # input_files are like ['/path/to/eval-results/eval_kit.MMBench_DEV_EN/output.jsonl']
         metrics_dir = Path(input_files[0]).parent
         candidate = metrics_dir / "eval_kit_metrics.json"
         if candidate.exists():
             self.eval_kit_metrics_file = candidate
             EvalKitMetrics._shared_metrics_file = candidate
-        else:
-            # Reset stale shared path so a previous run's file isn't reused.
-            EvalKitMetrics._shared_metrics_file = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py` around lines 45 - 56, The
setup method currently clears only EvalKitMetrics._shared_metrics_file when
eval_kit_metrics.json isn't found, leaving self.eval_kit_metrics_file pointing
to a stale path; update the setup() branch where candidate.exists() is false to
also set self.eval_kit_metrics_file = None so instance-level state is cleared
and get_metrics() won't reuse a stale file (refer to setup,
self.eval_kit_metrics_file, EvalKitMetrics._shared_metrics_file, and
get_metrics()).
nemo_skills/inference/eval/eval_kit.py (2)

124-132: ⚠️ Potential issue | 🟠 Major

is_self_contained() misclassifies default config runs.

Line 131 returns False unless ++model_type=mcore is explicitly passed, even though EvalKitConfig.model_type defaults to "mcore".

Proposed fix
 `@classmethod`
 def is_self_contained(cls, extra_arguments: str = "") -> bool:
-    """Self-contained only when user explicitly requests mcore mode.
-
-    Note: EvalKitConfig.model_type defaults to "mcore" at runtime, but
-    at submission time we check explicit user intent.  Without the flag
-    the pipeline assumes vllm (server-based) mode.
-    """
-    return "++model_type=mcore" in extra_arguments
+    """Self-contained in mcore mode."""
+    for token in extra_arguments.split():
+        if token.startswith("++model_type="):
+            return token.split("=", 1)[1] == "mcore"
+    return EvalKitConfig.model_type == "mcore"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/eval_kit.py` around lines 124 - 132, The
is_self_contained method currently only looks for the explicit
"++model_type=mcore" token in extra_arguments and thus misclassifies runs where
EvalKitConfig.model_type defaults to "mcore"; update is_self_contained to return
True if either the explicit flag is present in extra_arguments OR
EvalKitConfig.model_type == "mcore" (safely handling cases where the config may
be None or unset). Locate the is_self_contained(cls, extra_arguments: str = "")
definition and add a secondary check against EvalKitConfig.model_type (or the
appropriate config accessor) so both explicit user intent and the default config
are honored.

282-287: ⚠️ Potential issue | 🟠 Major

Narrow exception handling when reading pickle snapshots.

Line 285 catches all exceptions and silently skips, which can hide non-transient errors.

Proposed fix
-        except Exception:
+        except (EOFError, pickle.UnpicklingError, BlockingIOError, OSError):
             # pkl may be mid-write; skip this cycle
             return

As per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/eval_kit.py` around lines 282 - 287, The
try/except around pickle.load(pkl_path) is too broad; narrow it to only handle
transient, expected errors (e.g., EOFError, pickle.UnpicklingError, and OSError)
so real bugs surface. Replace "except Exception:" with "except (EOFError,
pickle.UnpicklingError, OSError) as e:" and keep the sleep/skip/return behavior
(and optionally a debug log using pkl_path and e); let any other exceptions
propagate. Ensure pickle.UnpicklingError is referenced/imported and keep the
variable names pkl_path and data unchanged.
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/audio.py (1)

515-589: Add end-to-end coverage for new task routing and fallback paths

Please add/extend SLURM or integration benchmark tests for ST-* translation routing, MathQA, and missing-generation behavior to prevent silent metric regressions.

Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/audio.py` around lines 515 - 589, The PR
lacks integration/SLURM test coverage for new routing and fallback logic (ST-*
translation routing, MathQA handling, and missing-generation behavior); add
end-to-end tests that feed sample records exercising task_type values starting
with "ST-", "MathQA", and cases where generation is empty/None to assert correct
metric outputs (e.g., that ST-* is treated as translation via
_TRANSLATION_TYPES, MathQA sets "is_correct" and "predicted_answer", and missing
generation returns the "missing_generation" base with BLEU/WER/CER as
appropriate). Implement tests that call the evaluator path hitting the routing
logic (the block using task_type checks and functions evaluate_translation,
evaluate_asr/evaluate_asr_pc, evaluate_cer, evaluate_hallucination,
evaluate_pc_rate) and validate returned metrics and fallback fields, including
the ASR_LEADERBOARD reference_fields branch to ensure extra wer_* and
is_correct_* metrics are produced when reference_fields are present.
nemo_skills/inference/mcore_skills.py (1)

468-526: Add a slurm e2e test for this new inline-eval path.

This file now owns non-trivial generation + metric persistence behavior; a benchmark-level slurm test would catch regressions in _evaluate_results and eval_kit_metrics.json production.

Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 468 - 526, Add a
slurm-level end-to-end test that exercises the new inline evaluation path in
_evaluate_results: create a small benchmark job that writes an output_file with
entries (including some with <think> tags), runs the skill so _evaluate_results
executes (triggering import of asr_wer), and assert that the cleaned output_file
is rewritten, that eval_kit_metrics.json is created next to the output file with
a "wer" key, and that LOG.info for "ASR WER" is emitted; use the same helper(s)
used by other slurm tests to schedule a job, point the job at a fixture dataset
and config that triggers _strip_thinking_tags and metric computation, and fail
the test if eval_kit_metrics.json is missing or malformed or generations remain
uncleaned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 515-533: The code calls generation.strip() before the
missing-generation guard, which raises when generation is None; update the usage
of generation in the function (the place that currently does generation.strip())
to defensively handle None by using (generation or "").strip() or by explicitly
checking generation is not None before calling .strip(), and ensure this check
runs before the missing-generation handling block that uses task_type and
generation so missing-generation returns (is_correct False / error
"missing_generation") work as intended for None/empty generation values.
- Around line 522-533: The missing-generation branch for ASR-related tasks
returns only "wer" for ASR-PC variants, which omits metrics required for ASR-PC
aggregation; update the conditional that checks task_type in (_ASR_TYPES |
_TRANSLATION_TYPES | {"CER"}) and not generation to include the additional
ASR-PC default fields when task_type corresponds to ASR-PC (e.g., add "wer_c",
"wer_pc", and "per" alongside "wer"); use the existing base dict ("is_correct":
False, "error": "missing_generation") and return {**base, "wer": 1.0, "wer_c":
1.0, "wer_pc": 1.0, "per": 1.0} for the ASR-PC case while leaving the existing
branches for _TRANSLATION_TYPES and "CER" unchanged.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 41-43: The constructor currently takes **kwargs and drops them;
update the __init__ method (the constructor that calls
super().__init__(compute_no_answer=False) and sets self.eval_kit_metrics_file)
to fail fast: if kwargs is not empty, raise a TypeError listing the unexpected
kwarg names (e.g., raise TypeError(f"Unexpected constructor arguments: {',
'.join(kwargs.keys())}")), otherwise proceed to call super and set
self.eval_kit_metrics_file as before.
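
A sketch of only the constructor guard this prompt describes (the real class also calls `super().__init__(compute_no_answer=False)`):

```python
class EvalKitMetrics:
    def __init__(self, **kwargs):
        # Fail fast on unsupported metrics_kwargs instead of dropping them.
        if kwargs:
            raise TypeError(
                f"Unexpected constructor arguments: {', '.join(sorted(kwargs))}"
            )
        self.eval_kit_metrics_file = None
```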

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 463-467: The .done file is created before running inline metrics
which are then caught and swallowed, allowing failed evaluations to be marked
complete; modify the flow so Path(f"{self.cfg.output_file}.done").touch() is
executed only after self._evaluate_results() completes successfully, and remove
or rework the broad try/except that silences errors around _evaluate_results
(and the similar block handling inline metrics referenced near the 519-526 area)
so exceptions propagate instead of being swallowed; ensure any
metric-evaluation-specific exceptions are either handled explicitly with proper
logging and re-raise, or not caught at all, so failed runs are not marked done
and can be retried.
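
The ordering fix can be sketched as a small wrapper, assuming a hypothetical `finalize` helper: evaluation runs first, any exception propagates, and the `.done` marker is only touched on success:

```python
from pathlib import Path


def finalize(output_file, evaluate_fn):
    # If evaluate_fn raises, no .done file is written and the job
    # stays eligible for retry instead of being marked complete.
    evaluate_fn()
    Path(f"{output_file}.done").touch()
```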
- Around line 503-507: The current code reopens output_file and truncates
output.jsonl before writing cleaned entries, risking data loss if writing fails;
modify the cleanup/write to first write all JSONL lines to a temporary file
(e.g., using tempfile.NamedTemporaryFile(delete=False) or creating a tmp path
like f"{output_file}.tmp"), flush and close it, then atomically replace the
original by calling os.replace(tmp_path, output_file); reference the existing
variables output_file and entries and ensure proper encoding ("utf-8") and error
handling around the replace so the original file remains intact on failures.
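
The temp-file-then-`os.replace` pattern could look like this sketch (the helper name is hypothetical; `os.replace` is atomic on POSIX when source and destination share a filesystem, which is why the temp file is created in the same directory):

```python
import json
import os
import tempfile


def atomic_write_jsonl(output_file, entries):
    dir_name = os.path.dirname(output_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for entry in entries:
                f.write(json.dumps(entry) + "\n")
        os.replace(tmp_path, output_file)  # original intact until this succeeds
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```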

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 528-533: The current split(":") on job_server_address is brittle;
update the parsing before host, port assignment to robustly handle URLs and
IPv6: if job_server_address starts with a scheme (e.g., "http://" or "https://")
use URL parsing (e.g., urlparse) to extract hostname and port; otherwise handle
host:port and IPv6 literal forms by splitting on the last colon (rsplit(":", 1))
and stripping surrounding brackets from IPv6 hosts; fall back to defaults
("localhost", "5000") when parsing fails, then pass host and int(port) into
generation_task.configure_client_overrides to replace the fragile host/port =
(job_server_address or "localhost:5000").split(":") logic.
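
A sketch of the robust parsing, assuming bare IPv6 literals with a port are bracketed (the usual convention). URLs go through `urlparse`; everything else splits on the last colon:

```python
from urllib.parse import urlparse


def parse_server_address(address):
    address = address or "localhost:5000"
    if "://" in address:
        parsed = urlparse(address)
        return parsed.hostname or "localhost", parsed.port or 5000
    host, sep, port = address.rpartition(":")
    if not sep:
        return address, 5000  # no port at all
    if host.startswith("[") and host.endswith("]"):
        return host[1:-1], int(port)  # bracketed IPv6 with port
    if ":" in host:
        return address, 5000  # unbracketed IPv6 literal, no port
    return host, int(port)
```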

---

Duplicate comments:
In `@nemo_skills/dataset/eval_kit/__init__.py`:
- Around line 39-45: The function currently accepts a benchmark string and
splits on the first dot to produce sub and return " ++vlm_dataset={sub} ", but
it does not validate that the suffix exists (so "eval_kit." yields an empty
value); update the validation for the input variable benchmark to ensure there
is a non-empty suffix after the dot (i.e., after benchmark.split(".", 1)[1]) and
raise a ValueError with a clear message if the suffix is empty; modify the logic
around the existing benchmark check and the variable sub to perform this
empty-string check before returning the formatted " ++vlm_dataset={sub} " value.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 45-56: The setup method currently clears only
EvalKitMetrics._shared_metrics_file when eval_kit_metrics.json isn't found,
leaving self.eval_kit_metrics_file pointing to a stale path; update the setup()
branch where candidate.exists() is false to also set self.eval_kit_metrics_file
= None so instance-level state is cleared and get_metrics() won't reuse a stale
file (refer to setup, self.eval_kit_metrics_file,
EvalKitMetrics._shared_metrics_file, and get_metrics()).

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 124-132: The is_self_contained method currently only looks for the
explicit "++model_type=mcore" token in extra_arguments and thus misclassifies
runs where EvalKitConfig.model_type defaults to "mcore"; update
is_self_contained to return True if either the explicit flag is present in
extra_arguments OR EvalKitConfig.model_type == "mcore" (safely handling cases
where the config may be None or unset). Locate the is_self_contained(cls,
extra_arguments: str = "") definition and add a secondary check against
EvalKitConfig.model_type (or the appropriate config accessor) so both explicit
user intent and the default config are honored.
- Around line 282-287: The try/except around pickle.load(pkl_path) is too broad;
narrow it to only handle transient, expected errors (e.g., EOFError,
pickle.UnpicklingError, and OSError) so real bugs surface. Replace "except
Exception:" with "except (EOFError, pickle.UnpicklingError, OSError) as e:" and
keep the sleep/skip/return behavior (and optionally a debug log using pkl_path
and e); let any other exceptions propagate. Ensure pickle.UnpicklingError is
referenced/imported and keep the variable names pkl_path and data unchanged.

---

Nitpick comments:
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 515-589: The PR lacks integration/SLURM test coverage for new
routing and fallback logic (ST-* translation routing, MathQA handling, and
missing-generation behavior); add end-to-end tests that feed sample records
exercising task_type values starting with "ST-", "MathQA", and cases where
generation is empty/None to assert correct metric outputs (e.g., that ST-* is
treated as translation via _TRANSLATION_TYPES, MathQA sets "is_correct" and
"predicted_answer", and missing generation returns the "missing_generation" base
with BLEU/WER/CER as appropriate). Implement tests that call the evaluator path
hitting the routing logic (the block using task_type checks and functions
evaluate_translation, evaluate_asr/evaluate_asr_pc, evaluate_cer,
evaluate_hallucination, evaluate_pc_rate) and validate returned metrics and
fallback fields, including the ASR_LEADERBOARD reference_fields branch to ensure
extra wer_* and is_correct_* metrics are produced when reference_fields are
present.

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 468-526: Add a slurm-level end-to-end test that exercises the new
inline evaluation path in _evaluate_results: create a small benchmark job that
writes an output_file with entries (including some with <think> tags), runs the
skill so _evaluate_results executes (triggering import of asr_wer), and assert
that the cleaned output_file is rewritten, that eval_kit_metrics.json is created
next to the output file with a "wer" key, and that LOG.info for "ASR WER" is
emitted; use the same helper(s) used by other slurm tests to schedule a job,
point the job at a fixture dataset and config that triggers _strip_thinking_tags
and metric computation, and fail the test if eval_kit_metrics.json is missing or
malformed or generations remain uncleaned.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bb5eeba-7f94-4068-b6aa-dccf9bc3b02a

📥 Commits

Reviewing files that changed from the base of the PR and between 86b40cb and a0f5f54.

📒 Files selected for processing (14)
  • nemo_skills/dataset/eval_kit/__init__.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/metrics/eval_kit_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/translation_metrics.py
  • nemo_skills/inference/eval/eval_kit.py
  • nemo_skills/inference/factory.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/mcore_skills.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • requirements/eval-kit.txt
🚧 Files skipped from review as they are similar to previous changes (5)
  • requirements/eval-kit.txt
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/translation_metrics.py
  • nemo_skills/inference/factory.py
  • nemo_skills/dataset/utils.py

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (11)
nemo_skills/dataset/eval_kit/__init__.py (1)

39-45: ⚠️ Potential issue | 🟠 Major

Validate non-empty eval_kit dataset suffix.

Line 44 accepts eval_kit. and returns ++vlm_dataset=; this should fail fast because dataset name is required.

Proposed fix
-    if "." not in benchmark:
+    if not benchmark.startswith("eval_kit.") or "." not in benchmark:
         raise ValueError(
             f"eval_kit benchmark must be in 'eval_kit.<dataset_name>' format, got '{benchmark}'. "
             f"Example: eval_kit.MMBench_DEV_EN, eval_kit.LibriSpeech_test_clean"
         )
     sub = benchmark.split(".", 1)[1]
+    if not sub:
+        raise ValueError(
+            f"eval_kit benchmark must include a dataset name after 'eval_kit.', got '{benchmark}'."
+        )
     return f" ++vlm_dataset={sub} "

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/dataset/eval_kit/__init__.py` around lines 39 - 45, The code
currently allows "eval_kit." and returns an empty vlm_dataset; update the
validation in the function handling the benchmark string so that after splitting
(where sub = benchmark.split(".", 1)[1]) you check that sub is non-empty and
raise a ValueError with a clear message if it is empty; ensure the error
mentions the required 'eval_kit.<dataset_name>' format and that the function
(where benchmark and sub are used and that returns f" ++vlm_dataset={sub} ")
fails fast when no dataset suffix is provided.
nemo_skills/evaluation/metrics/eval_kit_metrics.py (2)

45-57: ⚠️ Potential issue | 🟠 Major

Reset instance file path in setup() to avoid stale metrics reuse.

If setup() runs after a prior successful run, self.eval_kit_metrics_file can remain stale and still win at Line 70 even when the new candidate is missing.

Proposed fix
 def setup(self, input_files):
     """Find the eval_kit_metrics.json in the same directory as the input files."""
+    self.eval_kit_metrics_file = None
+    EvalKitMetrics._shared_metrics_file = None
     if input_files:
         metrics_dir = Path(input_files[0]).parent
         candidate = metrics_dir / "eval_kit_metrics.json"
         if candidate.exists():
             self.eval_kit_metrics_file = candidate
             EvalKitMetrics._shared_metrics_file = candidate
-        else:
-            # Reset stale shared path so a previous run's file isn't reused.
-            EvalKitMetrics._shared_metrics_file = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py` around lines 45 - 57, The
setup() method can leave self.eval_kit_metrics_file pointing at a previous run
when the new candidate doesn't exist; update EvalKitMetrics.setup so that when
the candidate file is missing you explicitly clear the instance path (set
self.eval_kit_metrics_file = None) in addition to resetting
EvalKitMetrics._shared_metrics_file, checking the candidate (metrics_dir /
"eval_kit_metrics.json") and only assigning both when it exists.

41-43: ⚠️ Potential issue | 🟠 Major

Fail fast on unsupported constructor kwargs.

Line 41 accepts **kwargs but silently discards them, which can hide invalid metrics_kwargs usage.

Proposed fix
 def __init__(self, **kwargs):
+    if kwargs:
+        unsupported = ", ".join(sorted(kwargs))
+        raise TypeError(f"Unsupported EvalKitMetrics kwargs: {unsupported}")
     super().__init__(compute_no_answer=False)
     self.eval_kit_metrics_file = None

Based on learnings: "Applies to **/*.py : Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py` around lines 41 - 43, The
constructor currently swallows **kwargs silently; update the __init__ in the
class that defines __init__(self, **kwargs) so unexpected user arguments fail
fast: either (a) replace **kwargs with explicit parameters (e.g.,
eval_kit_metrics_file=None) and pass known values to
super().__init__(compute_no_answer=False), or (b) validate kwargs at the start
of __init__ by extracting any supported keys (e.g., "eval_kit_metrics_file") and
if any keys remain raise TypeError("Unexpected keyword arguments: ..."); ensure
you still call super().__init__(compute_no_answer=False) and set
self.eval_kit_metrics_file from the validated argument.
nemo_skills/pipeline/eval.py (1)

57-61: ⚠️ Potential issue | 🟡 Minor

Use direct cluster_config["containers"] access consistently.

Line 60 still uses .get() even though Line 57 already assumes cluster_config["containers"] must exist.

Proposed fix
-        if key and key in cluster_config.get("containers", {}):
+        if key and key in cluster_config["containers"]:
             container = cluster_config["containers"][key]

As per coding guidelines: "Don't use .get() for accessing dictionary keys if the code expects them to be present; use direct access data[key_name] to fail with a clear error instead of silently corrupting data."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 57 - 61, The loop inconsistently
uses cluster_config.get("containers", {}) while earlier accessing
cluster_config["containers"] directly; change the lookup inside the for-loop to
use direct access (cluster_config["containers"]) when checking membership of key
so that missing container data fails loudly; update the condition that currently
uses cluster_config.get("containers", {}) to reference
cluster_config["containers"] when evaluating key in the containers mapping
(relating to variables/container assignment, task_classes loop, tc and its
CONTAINER_KEY).
nemo_skills/evaluation/evaluator/audio.py (2)

522-533: ⚠️ Potential issue | 🟠 Major

ASR-PC missing-generation defaults are still incomplete.

Line 532 currently returns only wer for ASR-PC, but ASR-PC outputs should include wer, wer_c, wer_pc, and per for consistent aggregation.

Proposed fix
         if task_type in _TRANSLATION_TYPES:
             return {**base, "bleu": 0.0}
         if task_type == "CER":
             return {**base, "cer": 1.0}
+        if task_type == "ASR-PC":
+            return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
         # ASR / ASR-PC / ASR-ZH
         return {**base, "wer": 1.0}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/audio.py` around lines 522 - 533, The
missing-generation branch handling when task_type is in (_ASR_TYPES |
_TRANSLATION_TYPES | {"CER"}) currently returns only "wer" for ASR-PC; update
the branch so that when task_type corresponds to ASR-PC (identify via the value
used for ASR-PC in _ASR_TYPES or by name if present) you return the full set of
default metrics {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
instead of just "wer" so aggregation sees consistent keys; modify the final ASR
/ ASR-PC / ASR-ZH return logic to branch on ASR-PC and include these additional
fields while keeping existing behavior for other ASR variants.

522-533: ⚠️ Potential issue | 🔴 Critical

missing_generation handling is still bypassed for None generations.

Line 508 calls .strip() unconditionally, so None raises before Lines 522-533 can return fallback metrics.

Proposed fix
-    generation = sample["generation"].strip()
+    generation_raw = sample["generation"]
+    generation = generation_raw.strip() if isinstance(generation_raw, str) else ""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/evaluation/evaluator/audio.py` around lines 522 - 533, The code
currently calls generation.strip() earlier which raises when generation is None
and prevents the fallback in the block checking task_type in (_ASR_TYPES |
_TRANSLATION_TYPES | {"CER"}) and not generation from returning the intended
"missing_generation" metrics; fix by adding a None check before any .strip() use
or by moving the missing-generation branch earlier: if generation is None (or
falsy) and task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}), return the
base missing_generation dict and then add the task-specific metric keys (bleu
for _TRANSLATION_TYPES, cer for "CER", wer for ASR variants) just like the
existing returns so .strip() is never invoked on None.
nemo_skills/pipeline/utils/eval.py (1)

528-533: ⚠️ Potential issue | 🟠 Major

Parse job_server_address robustly (URL/IPv6-safe).

Line 528 uses .split(":"), which breaks valid inputs like http://host:8000 and IPv6 literals.

Proposed fix
+                    from urllib.parse import urlsplit
-                    host, port = (job_server_address or "localhost:5000").split(":")
+                    raw_address = job_server_address or "localhost:5000"
+                    parsed = urlsplit(raw_address if "://" in raw_address else f"http://{raw_address}")
+                    if parsed.hostname is None or parsed.port is None:
+                        raise ValueError(f"Invalid server address: {raw_address}")
+                    host, port = parsed.hostname, parsed.port
                     model = server_parameters["model"]
                     server_type = server_parameters["server_type"]
                     task_overrides = generation_task.configure_client_overrides(
                         host=host,
-                        port=int(port),
+                        port=port,
                         model=model,
                         server_type=server_type,
                     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 528 - 533, The code that
extracts host and port from job_server_address using .split(":") is brittle for
URLs and IPv6; update the parsing in the block that assigns host, port and calls
generation_task.configure_client_overrides to robustly parse job_server_address
using urllib.parse.urlparse (or prepend '//' when no scheme) and then use
parsed.netloc (handling IPv6 brackets) or fallback to rsplit(":", 1) to separate
host and port, defaulting port to 5000 and casting port to int before passing to
generation_task.configure_client_overrides.
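A urlsplit-based parser along the lines proposed above might look like this (function name is illustrative). `urlsplit` only populates `hostname`/`port` when a netloc is present, hence the `//` prefix for bare `host:port` strings:

```python
from urllib.parse import urlsplit


def parse_server_address(address, default="localhost:5000"):
    """Extract (host, port) from 'host:port', 'http://host:port', or '[::1]:port'."""
    raw = address or default
    # Without a scheme, prefix '//' so urlsplit treats the string as a netloc.
    parsed = urlsplit(raw if "://" in raw else f"//{raw}")
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"Invalid server address: {raw}")
    return parsed.hostname, parsed.port
```

Unlike `.split(":")`, this handles scheme-prefixed URLs and bracketed IPv6 literals, and returns the port as an `int` directly.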
nemo_skills/inference/mcore_skills.py (2)

503-507: ⚠️ Potential issue | 🟠 Major

Avoid truncating output.jsonl before replacement; write atomically.

If writing fails mid-way, the current in-place rewrite loses the only output file.

Proposed fix
-            with open(output_file, "w", encoding="utf-8") as fout:
-                for entry in entries:
-                    fout.write(json.dumps(entry) + "\n")
+            tmp_output = output_path.with_suffix(output_path.suffix + ".tmp")
+            with open(tmp_output, "w", encoding="utf-8") as fout:
+                for entry in entries:
+                    fout.write(json.dumps(entry) + "\n")
+            os.replace(tmp_output, output_file)

Based on learnings: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 503 - 507, The current
logic rewrites output_file in-place and can lose data if the write fails;
instead, write the JSONL content for entries to a temporary file (e.g.,
output_file + ".tmp") and only once the write completes successfully atomically
replace the original using os.replace (and optionally fsync/flush the temp file
before replace). Locate the block that opens output_file for writing (the loop
writing entries via fout.write(json.dumps(entry) + "\n")) and modify it to write
to a temp path, ensure the write completes and file is closed, then call
os.replace(temp_path, output_file) to atomically swap in the new output.
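The write-to-temp-then-replace pattern from the prompt above, as a self-contained helper (name illustrative):

```python
import json
import os
from pathlib import Path


def write_jsonl_atomic(entries, output_file):
    """Write all entries to a sibling temp file, then atomically swap it in."""
    output_path = Path(output_file)
    tmp_path = output_path.with_suffix(output_path.suffix + ".tmp")
    with open(tmp_path, "w", encoding="utf-8") as fout:
        for entry in entries:
            fout.write(json.dumps(entry) + "\n")
        fout.flush()
        os.fsync(fout.fileno())
    # os.replace is atomic on POSIX: readers see either the old file or the
    # complete new one, never a truncated intermediate state.
    os.replace(tmp_path, output_path)
```

The temp file lives next to the target so the rename stays on one filesystem, which is what makes `os.replace` atomic.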

463-466: ⚠️ Potential issue | 🟠 Major

Create .done only after successful evaluation, and don’t swallow unexpected metric failures.

Line 463 marks completion before Line 466 evaluation, while Lines 524-525 suppress failures. This can mark failed runs as complete and skip reruns.

Proposed fix
-            Path(f"{self.cfg.output_file}.done").touch()
-
-            # Evaluate using VLMEvalKit (same as eval_kit.py does).
-            self._evaluate_results()
+            # Evaluate using VLMEvalKit (same as eval_kit.py does).
+            self._evaluate_results()
+            Path(f"{self.cfg.output_file}.done").touch()
@@
-        except Exception:
-            LOG.exception("Inline metrics computation failed")
+        except Exception:
+            LOG.exception("Inline metrics computation failed")
+            raise

As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

Also applies to: 519-526

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 463 - 466, The code
currently creates the completion marker
Path(f"{self.cfg.output_file}.done").touch() before calling
self._evaluate_results() and also suppresses unexpected failures around
evaluation (see the try/except block near where metrics are handled), which can
mark failed runs as complete; move the touch call so the .done file is created
only after self._evaluate_results() returns successfully, and remove or narrow
the broad exception swallowing (remove bare except/except Exception that merely
passes) in the evaluation/metric handling block (the try/except around
self._evaluate_results() / metric processing) so unexpected exceptions propagate
(or re-raise them) instead of being ignored. Ensure any deliberate, expected
metric errors are handled explicitly with targeted exception types and clear
logging while still preventing creation of the .done file on failure.
nemo_skills/inference/eval/eval_kit.py (2)

282-287: ⚠️ Potential issue | 🟠 Major

Narrow transient pickle-read errors; let unexpected errors surface.

Catching Exception here suppresses non-transient failures and can silently stall async output.

Proposed fix
-        except Exception:
-            # pkl may be mid-write; skip this cycle
-            return
+        except (EOFError, pickle.UnpicklingError, BlockingIOError):
+            # pkl may be mid-write; skip this cycle
+            return

As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/eval_kit.py` around lines 282 - 287, The broad
except around the pickle.load of pkl_path hides unexpected errors; change the
handler to only catch transient/read-related exceptions (e.g., EOFError,
pickle.UnpicklingError, OSError) when opening/reading pkl_path and return in
those cases, while allowing all other exceptions to propagate (i.e., re-raise)
so non-transient failures surface; keep the try around the with open(...) /
pickle.load(...) block and reference pkl_path and pickle.load when implementing
the narrower except.
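A narrowed read helper matching the proposed exception tuple might look like this (name illustrative). Anything outside the transient set, such as a permissions error, still propagates:

```python
import pickle


def try_read_pkl(pkl_path):
    """Return the unpickled object, or None if the file looks mid-write."""
    try:
        with open(pkl_path, "rb") as fin:
            return pickle.load(fin)
    except (EOFError, pickle.UnpicklingError, BlockingIOError):
        # A concurrent writer may hold the file mid-write; skip this cycle
        # and let the next polling pass retry.
        return None
```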

124-132: ⚠️ Potential issue | 🟠 Major

is_self_contained() misclassifies default mcore runs.

With default model_type="mcore", empty extra_arguments returns False and can incorrectly trigger server-based flow.

Proposed fix
-        return "++model_type=mcore" in extra_arguments
+        # Default is mcore unless explicitly overridden.
+        return "++model_type=vllm" not in extra_arguments
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/eval/eval_kit.py` around lines 124 - 132, The
is_self_contained(cls, extra_arguments: str = "") currently returns True only
when the explicit flag "++model_type=mcore" is present, which misclassifies runs
where the runtime default EvalKitConfig.model_type is "mcore" and
extra_arguments is empty; update is_self_contained to also return True when
extra_arguments is empty AND the configured/default model type equals "mcore"
(e.g., check EvalKitConfig.model_type or a class-level default like
cls.default_model_type), preserving the original explicit-flag check so either
the explicit "++model_type=mcore" in extra_arguments or the runtime/default
model_type == "mcore" yields True.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 97-100: The dataclass currently declares skip_filled and
eval_config as accepted but unused, which hides unsupported user args; either
remove these fields so Hydra/arg parsing fails on unknown parameters, or add
explicit validation in the class's initializer (e.g., __post_init__ of the class
that defines skip_filled/eval_config or VLMEvalKit) that raises a clear error if
skip_filled or eval_config are provided with non-default values; reference the
skip_filled and eval_config symbols and the class (the dataclass that contains
them / VLMEvalKit) when implementing the change so callers cannot silently pass
unsupported arguments.
- Around line 528-536: The current loop writes directly to self.cfg.output_file
with "w" which can leave a partial file on failure; instead, serialize all rows
into the target JSONL content first and write atomically by writing to a
temporary file in the same directory (e.g., using tempfile.NamedTemporaryFile or
creating a .tmp path), close it, then os.replace(temp_path,
self.cfg.output_file) to atomically move it into place; update the block that
iterates df (reference: df, self.cfg.output_file, LOG) to build or stream into
the temp file and call os.replace so the final JSONL is either complete or
untouched, and keep the LOG.info after the atomic replace.

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 106-112: These fields silently accept user overrides; instead add
explicit validation (e.g., in the class's __post_init__ or initializer) that
raises an error if any of eval_config is non-empty, or eval_type or
prompt_format is not None, or enable_audio is True, so user-specified
unsupported args fail fast. Locate the dataclass or class that declares
eval_config, eval_type, prompt_format, enable_audio in mcore_skills.py and
implement checks that raise a clear ValueError mentioning the offending symbol
(eval_config/eval_type/prompt_format/enable_audio) when they are set, preventing
silent acceptance of unsupported pipeline overrides.
- Around line 491-500: The evaluation path is hardcoding the "generation" key
while generate() uses self.cfg.generation_key, causing mismatches; update the
block in question to use self.cfg.generation_key everywhere (when reading,
stripping via _strip_thinking_tags, assigning back to entry, and when building
the results dict) and replace .get(...) usages with direct indexing
(entry[self.cfg.generation_key] and entry["expected_answer"] as appropriate) so
missing keys fail loudly and evaluation uses the configured generation field
consistently.

---

Duplicate comments:
In `@nemo_skills/dataset/eval_kit/__init__.py`:
- Around line 39-45: The code currently allows "eval_kit." and returns an empty
vlm_dataset; update the validation in the function handling the benchmark string
so that after splitting (where sub = benchmark.split(".", 1)[1]) you check that
sub is non-empty and raise a ValueError with a clear message if it is empty;
ensure the error mentions the required 'eval_kit.<dataset_name>' format and that
the function (where benchmark and sub are used and that returns f"
++vlm_dataset={sub} ") fails fast when no dataset suffix is provided.

In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 522-533: The missing-generation branch handling when task_type is
in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) currently returns only "wer" for
ASR-PC; update the branch so that when task_type corresponds to ASR-PC (identify
via the value used for ASR-PC in _ASR_TYPES or by name if present) you return
the full set of default metrics {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc":
1.0, "per": 1.0} instead of just "wer" so aggregation sees consistent keys;
modify the final ASR / ASR-PC / ASR-ZH return logic to branch on ASR-PC and
include these additional fields while keeping existing behavior for other ASR
variants.
- Around line 522-533: The code currently calls generation.strip() earlier which
raises when generation is None and prevents the fallback in the block checking
task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation from
returning the intended "missing_generation" metrics; fix by adding a None check
before any .strip() use or by moving the missing-generation branch earlier: if
generation is None (or falsy) and task_type in (_ASR_TYPES | _TRANSLATION_TYPES
| {"CER"}), return the base missing_generation dict and then add the
task-specific metric keys (bleu for _TRANSLATION_TYPES, cer for "CER", wer for
ASR variants) just like the existing returns so .strip() is never invoked on
None.

In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 45-57: The setup() method can leave self.eval_kit_metrics_file
pointing at a previous run when the new candidate doesn't exist; update
EvalKitMetrics.setup so that when the candidate file is missing you explicitly
clear the instance path (set self.eval_kit_metrics_file = None) in addition to
resetting EvalKitMetrics._shared_metrics_file, checking the candidate
(metrics_dir / "eval_kit_metrics.json") and only assigning both when it exists.
- Around line 41-43: The constructor currently swallows **kwargs silently;
update the __init__ in the class that defines __init__(self, **kwargs) so
unexpected user arguments fail fast: either (a) replace **kwargs with explicit
parameters (e.g., eval_kit_metrics_file=None) and pass known values to
super().__init__(compute_no_answer=False), or (b) validate kwargs at the start
of __init__ by extracting any supported keys (e.g., "eval_kit_metrics_file") and
if any keys remain raise TypeError("Unexpected keyword arguments: ..."); ensure
you still call super().__init__(compute_no_answer=False) and set
self.eval_kit_metrics_file from the validated argument.

In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 282-287: The broad except around the pickle.load of pkl_path hides
unexpected errors; change the handler to only catch transient/read-related
exceptions (e.g., EOFError, pickle.UnpicklingError, OSError) when
opening/reading pkl_path and return in those cases, while allowing all other
exceptions to propagate (i.e., re-raise) so non-transient failures surface; keep
the try around the with open(...) / pickle.load(...) block and reference
pkl_path and pickle.load when implementing the narrower except.
- Around line 124-132: The is_self_contained(cls, extra_arguments: str = "")
currently returns True only when the explicit flag "++model_type=mcore" is
present, which misclassifies runs where the runtime default
EvalKitConfig.model_type is "mcore" and extra_arguments is empty; update
is_self_contained to also return True when extra_arguments is empty AND the
configured/default model type equals "mcore" (e.g., check
EvalKitConfig.model_type or a class-level default like cls.default_model_type),
preserving the original explicit-flag check so either the explicit
"++model_type=mcore" in extra_arguments or the runtime/default model_type ==
"mcore" yields True.

In `@nemo_skills/inference/mcore_skills.py`:
- Around line 503-507: The current logic rewrites output_file in-place and can
lose data if the write fails; instead, write the JSONL content for entries to a
temporary file (e.g., output_file + ".tmp") and only once the write completes
successfully atomically replace the original using os.replace (and optionally
fsync/flush the temp file before replace). Locate the block that opens
output_file for writing (the loop writing entries via
fout.write(json.dumps(entry) + "\n")) and modify it to write to a temp path,
ensure the write completes and file is closed, then call os.replace(temp_path,
output_file) to atomically swap in the new output.
- Around line 463-466: The code currently creates the completion marker
Path(f"{self.cfg.output_file}.done").touch() before calling
self._evaluate_results() and also suppresses unexpected failures around
evaluation (see the try/except block near where metrics are handled), which can
mark failed runs as complete; move the touch call so the .done file is created
only after self._evaluate_results() returns successfully, and remove or narrow
the broad exception swallowing (remove bare except/except Exception that merely
passes) in the evaluation/metric handling block (the try/except around
self._evaluate_results() / metric processing) so unexpected exceptions propagate
(or re-raise them) instead of being ignored. Ensure any deliberate, expected
metric errors are handled explicitly with targeted exception types and clear
logging while still preventing creation of the .done file on failure.

In `@nemo_skills/pipeline/eval.py`:
- Around line 57-61: The loop inconsistently uses
cluster_config.get("containers", {}) while earlier accessing
cluster_config["containers"] directly; change the lookup inside the for-loop to
use direct access (cluster_config["containers"]) when checking membership of key
so that missing container data fails loudly; update the condition that currently
uses cluster_config.get("containers", {}) to reference
cluster_config["containers"] when evaluating key in the containers mapping
(relating to variables/container assignment, task_classes loop, tc and its
CONTAINER_KEY).

In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 528-533: The code that extracts host and port from
job_server_address using .split(":") is brittle for URLs and IPv6; update the
parsing in the block that assigns host, port and calls
generation_task.configure_client_overrides to robustly parse job_server_address
using urllib.parse.urlparse (or prepend '//' when no scheme) and then use
parsed.netloc (handling IPv6 brackets) or fallback to rsplit(":", 1) to separate
host and port, defaulting port to 5000 and casting port to int before passing to
generation_task.configure_client_overrides.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 68b4b1ff-f123-4a46-9a68-756b464a14ad

📥 Commits

Reviewing files that changed from the base of the PR and between 86b40cb and a0f5f54.

📒 Files selected for processing (14)
  • nemo_skills/dataset/eval_kit/__init__.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/evaluation/metrics/eval_kit_metrics.py
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/evaluation/metrics/translation_metrics.py
  • nemo_skills/inference/eval/eval_kit.py
  • nemo_skills/inference/factory.py
  • nemo_skills/inference/generate.py
  • nemo_skills/inference/mcore_skills.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • requirements/eval-kit.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • requirements/eval-kit.txt
  • nemo_skills/evaluation/metrics/map_metrics.py
  • nemo_skills/dataset/utils.py
  • nemo_skills/evaluation/metrics/translation_metrics.py

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous
Member Author

Jorjeous commented Mar 5, 2026

@melllinia
let's test this. i'll share a quick start

@Jorjeous Jorjeous removed the not ready label Mar 5, 2026
@melllinia
Member

@Jorjeous are we considering adding the container to the nemo-skills containers?

Member

@melllinia melllinia left a comment


Looks good, can you please add some simple instructions on how to run it?

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@Jorjeous Jorjeous enabled auto-merge (squash) March 6, 2026 15:58
Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
nemo_skills/inference/mcore_skills.py (1)

521-527: ⚠️ Potential issue | 🟠 Major

Bare Exception catch swallows failures silently.

The ImportError catch at line 521 is appropriate, but the bare except Exception at line 526 logs the failure but allows the run to complete and be marked .done. This can produce incomplete/invalid eval_kit_metrics.json without users noticing.

Per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens."

🔧 Proposed fix: Re-raise after logging
         except ImportError:
             LOG.warning(
                 "VLMEvalKit asr_wer not available — skipping eval-kit-style metrics. "
                 "The summarize_results job will compute metrics separately."
             )
         except Exception:
             LOG.exception("Inline metrics computation failed")
+            raise
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 521 - 527, In the
exception handling block for inline metrics computation, the bare `except
Exception` clause at line 526 logs the failure but allows execution to continue
silently. After the LOG.exception call in this block, add a re-raise statement
to propagate the exception up the call stack. This ensures that unexpected
failures in inline metrics computation will cause the run to fail rather than
completing with incomplete or invalid eval_kit_metrics.json, while still
allowing the ImportError catch to handle the expected case of VLMEvalKit not
being available.
🧹 Nitpick comments (4)
nemo_skills/inference/mcore_skills.py (2)

42-55: Fallback pattern for missing GenerationTask works but has unused cls parameter.

The fallback _get_server_command_fn is decorated with @classmethod but defines a standalone function. The cls parameter is unused because the decorator is applied incorrectly for this context.

💡 Suggested fix
 if GenerationTask is not None:
     _get_server_command_fn = GenerationTask.get_server_command_fn
 else:
-    @classmethod
-    def _get_server_command_fn(cls):
+    def _get_server_command_fn():
         from nemo_skills.pipeline.utils import get_server_command

         return get_server_command

Note: This may require adjusting how it's assigned to the class attribute at line 128.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 42 - 55, The fallback
_get_server_command_fn is declared with `@classmethod` but the function signature
doesn't use cls; fix by either removing the `@classmethod` decorator and defining
a plain function def _get_server_command_fn(): ... that returns
get_server_command, or keep it as a proper classmethod def
_get_server_command_fn(cls): ... and reference cls if needed; then ensure the
attribute assignment/override for GenerationTask.get_server_command_fn (or the
class that expects this method) uses the correctly-typed callable so the
fallback is invoked without an unused cls parameter.

106-111: Silently accepting unused pipeline args may hide misconfiguration.

These fields accept user-passed overrides that are documented as unused. Per coding guidelines, code should fail if user specifies an unsupported argument.

Consider adding validation in __init__ or __post_init__ to warn or fail if these are set to non-default values:

💡 Suggested validation
def __post_init__(self):
    if self.eval_config:
        LOG.warning("eval_config is ignored by mcore_skills generation")
    if self.eval_type is not None:
        LOG.warning("eval_type is ignored by mcore_skills generation")
    # etc.

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/inference/mcore_skills.py` around lines 106 - 111, Add a
__post_init__ to the mcore_skills dataclass that validates the pipeline override
fields (eval_config, eval_type, prompt_format, enable_audio): if any of these
are set to non-default values (eval_config non-empty, eval_type or prompt_format
not None, enable_audio True) raise a ValueError listing the offending field
names (or alternatively LOG.warning then raise) so user-supplied unsupported
arguments fail fast; implement this check inside __post_init__ and reference the
exact field names (eval_config, eval_type, prompt_format, enable_audio) in the
error message.
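The fail-fast variant of that validation can be sketched as a dataclass `__post_init__` (class and field defaults are illustrative, mirroring the fields named in the prompt):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class McoreSkillsConfig:
    # Fields accepted only for pipeline compatibility; unsupported here.
    eval_config: dict = field(default_factory=dict)
    eval_type: Optional[str] = None
    prompt_format: Optional[str] = None
    enable_audio: bool = False

    def __post_init__(self):
        # Reject any non-default value so unsupported overrides fail fast.
        offending = [
            name
            for name, is_set in [
                ("eval_config", bool(self.eval_config)),
                ("eval_type", self.eval_type is not None),
                ("prompt_format", self.prompt_format is not None),
                ("enable_audio", self.enable_audio),
            ]
            if is_set
        ]
        if offending:
            raise ValueError(f"Unsupported arguments for mcore_skills: {offending}")
```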
nemo_skills/pipeline/utils/eval.py (1)

528-531: Improved URL parsing, but still fragile for edge cases.

Using rsplit(":", 1) is better than split(":") for URLs like http://host:8000, but it can still fail for:

  • IPv6 addresses: [::1]:8000 → would split incorrectly
  • URLs with scheme: http://host:8000 → host="http://host", port="8000"

If these edge cases are expected usage, consider using urllib.parse:

💡 More robust URL parsing
+                    from urllib.parse import urlsplit
                     # rsplit to handle URLs like http://host:port (takes last colon)
-                    host, port = (job_server_address or "localhost:5000").rsplit(":", 1)
+                    raw_address = job_server_address or "localhost:5000"
+                    if "://" in raw_address:
+                        parsed = urlsplit(raw_address)
+                        host, port = parsed.hostname, str(parsed.port)
+                    else:
+                        host, port = raw_address.rsplit(":", 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/eval.py` around lines 528 - 531, The current
rsplit-based parsing for job_server_address (which sets host, port used later
with server_parameters["model"] and server_parameters["server_type"]) is fragile
for URLs with schemes and IPv6; replace that rsplit logic with
urllib.parse.urlparse: if job_server_address lacks a scheme, prepend "tcp://" or
"http://" to ensure urlparse recognizes netloc, then extract parsed.hostname and
parsed.port (which correctly handles IPv6 brackets and strips schemes); if
parsed.port is None, default to 5000 and if parsed.hostname is None default to
"localhost"; finally assign host and port from these parsed values before using
them.
docs/evaluation/eval-kit.md (1)

1-282: Documentation looks comprehensive, but consider adding expected results.

The documentation provides clear instructions and example commands for running eval_kit benchmarks. However, as per coding guidelines, when adding new benchmarks, documentation should include "expected results for tested models."

Consider adding a section with baseline metrics (e.g., expected WER for LibriSpeech with a tested model) so users can validate their setup is working correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/evaluation/eval-kit.md` around lines 1 - 282, Add a new "Expected
Results / Baselines" subsection to the eval_kit docs that lists baseline metrics
for representative benchmarks (e.g., eval_kit.LibriSpeech_test_clean) and
example output files (eval_kit_metrics.json, metrics.json, output.jsonl) so
users can validate runs; include specific baseline numbers (e.g., WER for the
tested model) and the metric format/schema to compare against, and place this
under "Understanding Results" near the existing output directory example so it's
discoverable when users open eval_kit.LibriSpeech_test_clean results.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 521-527: In the exception handling block for inline metrics
computation, the bare `except Exception` clause at line 526 logs the failure but
allows execution to continue silently. After the LOG.exception call in this
block, add a re-raise statement to propagate the exception up the call stack.
This ensures that unexpected failures in inline metrics computation will cause
the run to fail rather than completing with incomplete or invalid
eval_kit_metrics.json, while still allowing the ImportError catch to handle the
expected case of VLMEvalKit not being available.

---

Nitpick comments:
In `@docs/evaluation/eval-kit.md`:
- Around line 1-282: Add a new "Expected Results / Baselines" subsection to the
eval_kit docs that lists baseline metrics for representative benchmarks (e.g.,
eval_kit.LibriSpeech_test_clean) and example output files
(eval_kit_metrics.json, metrics.json, output.jsonl) so users can validate runs;
include specific baseline numbers (e.g., WER for the tested model) and the
metric format/schema to compare against, and place this under "Understanding
Results" near the existing output directory example so it's discoverable when
users open eval_kit.LibriSpeech_test_clean results.

In `@nemo_skills/inference/mcore_skills.py`:
- Around lines 42-55: The fallback _get_server_command_fn is declared with `@classmethod`, but the function signature doesn't use cls. Fix this either by removing the `@classmethod` decorator and defining a plain function `def _get_server_command_fn(): ...` that returns get_server_command, or by keeping it as a proper classmethod `def _get_server_command_fn(cls): ...` and referencing cls if needed. Then ensure the attribute assignment/override for GenerationTask.get_server_command_fn (or whichever class expects this method) uses the correctly typed callable, so the fallback is invoked without an unused cls parameter.
- Around lines 106-111: Add a __post_init__ to the mcore_skills dataclass that validates the pipeline-override fields (eval_config, eval_type, prompt_format, enable_audio): if any of these are set to non-default values (eval_config non-empty, eval_type or prompt_format not None, enable_audio True), raise a ValueError listing the offending field names (or LOG.warning and then raise) so unsupported user-supplied arguments fail fast. Implement this check inside __post_init__ and reference the exact field names (eval_config, eval_type, prompt_format, enable_audio) in the error message.
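Both suggested options for the fallback can be sketched as follows (`get_server_command` and the `GenerationTask` stub are hypothetical stand-ins for the real nemo_skills definitions):

```python
def get_server_command():
    # Hypothetical stand-in for the real server-command helper.
    return "python -m server"


# Option A: a plain function, with no unused `cls` parameter.
def _get_server_command_fn():
    return get_server_command


# Option B: a proper classmethod on the task class.
class GenerationTask:
    @classmethod
    def _get_server_command_fn(cls):
        # `cls` is available here if subclasses need to vary the command.
        return get_server_command
```

Either form gives callers a correctly typed callable; what matters is not mixing the two, i.e. not decorating a zero-argument function with `@classmethod`.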
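A minimal sketch of such a `__post_init__` check (the `McoreSkillsConfig` dataclass below is a hypothetical subset of the real config, not the actual mcore_skills definition):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class McoreSkillsConfig:
    # Hypothetical subset of the mcore_skills config fields.
    eval_config: str = ""
    eval_type: Optional[str] = None
    prompt_format: Optional[str] = None
    enable_audio: bool = False

    def __post_init__(self):
        # Fail fast on pipeline-override fields this mode does not support.
        offending = [
            name
            for name, is_set in [
                ("eval_config", bool(self.eval_config)),
                ("eval_type", self.eval_type is not None),
                ("prompt_format", self.prompt_format is not None),
                ("enable_audio", self.enable_audio),
            ]
            if is_set
        ]
        if offending:
            raise ValueError(
                f"mcore_skills does not support overriding: {', '.join(offending)}"
            )
```

Constructing the dataclass with default values succeeds, while any unsupported override raises immediately with the offending field names in the message.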

In `@nemo_skills/pipeline/utils/eval.py`:
- Around lines 528-531: The current rsplit-based parsing of job_server_address (which sets the host and port later used with server_parameters["model"] and server_parameters["server_type"]) is fragile for URLs with schemes and for IPv6 addresses. Replace the rsplit logic with urllib.parse.urlparse: if job_server_address lacks a scheme, prepend "tcp://" or "http://" so urlparse recognizes the netloc, then extract parsed.hostname and parsed.port (which correctly handle IPv6 brackets and strip schemes). If parsed.port is None, default to 5000; if parsed.hostname is None, default to "localhost". Finally, assign host and port from these parsed values before using them.
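The suggested urlparse-based parsing can be sketched as follows (`parse_server_address` is a hypothetical helper illustrating the fix, not the actual eval.py code):

```python
from urllib.parse import urlparse


def parse_server_address(job_server_address: str, default_port: int = 5000):
    """Extract (host, port) from an address that may carry a scheme or IPv6 brackets."""
    addr = job_server_address
    if "://" not in addr:
        # Prepend a scheme so urlparse treats the address as a netloc.
        addr = "tcp://" + addr
    parsed = urlparse(addr)
    host = parsed.hostname or "localhost"
    port = parsed.port if parsed.port is not None else default_port
    return host, port
```

Unlike rsplit on ":", `parsed.hostname` strips IPv6 brackets and `parsed.port` ignores any colons inside them, so addresses like `[::1]:6000` parse correctly.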

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0287a47-0f73-49e5-b5d5-160e3b694409

📥 Commits

Reviewing files that changed from the base of the PR and between a0f5f54 and c4f4b72.

📒 Files selected for processing (5)
  • docs/evaluation/eval-kit.md
  • docs/evaluation/index.md
  • nemo_skills/evaluation/evaluator/audio.py
  • nemo_skills/inference/mcore_skills.py
  • nemo_skills/pipeline/utils/eval.py
✅ Files skipped from review due to trivial changes (1)
  • docs/evaluation/index.md

@Jorjeous Jorjeous merged commit b237e33 into main Mar 6, 2026
5 checks passed
@Jorjeous Jorjeous deleted the eval-kit branch March 6, 2026 16:25
Kipok added a commit that referenced this pull request Mar 6, 2026
Kipok added a commit that referenced this pull request Mar 6, 2026
This reverts commit b237e33.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Kipok added a commit that referenced this pull request Mar 6, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs form test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    update vllm miltimodal for api calls convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

     Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>