Conversation
📝 Walkthrough

Adds VLMEvalKit integration: dataset-level generation config, two generation backends (vLLM client and in-process Megatron/MCore), async JSONL writer, EvalKit metrics reader, pipeline wiring for self-contained generation tasks, and expanded ASR/translation evaluation handling.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Pipeline as Eval Pipeline
    participant Resolver as Task Resolver
    participant Dataset as VLMEvalKit Dataset
    participant Model as Model Init (mcore/vLLM)
    participant Generator as Generation Task
    participant Writer as Async JSONL Writer
    participant Evaluator as VLMEvalKit Eval
    participant Converter as Result Converter
    Pipeline->>Resolver: resolve generation_task_class, flags, extra_args
    Resolver-->>Pipeline: class, self_contained, num_gpus, extra_args
    Pipeline->>Dataset: request/build dataset (rank-aware)
    Dataset-->>Pipeline: dataset ready / metadata
    Pipeline->>Model: initialize model interface (mcore or vLLM)
    Model-->>Generator: model client/interface
    Generator->>Writer: start async writer thread
    loop per-sample
        Generator->>Generator: run inference -> prediction
        Generator->>Writer: enqueue prediction
    end
    Generator->>Writer: stop & flush
    Writer-->>Generator: final JSONL
    Generator->>Evaluator: run eval_kit evaluation
    Evaluator-->>Converter: metrics + ordered results
    Converter-->>Pipeline: NeMo Skills JSONL + eval_kit_metrics.json
```
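The enqueue/flush flow in the diagram (producer enqueues predictions, a background thread appends them to the output file) can be sketched as a minimal queue-backed writer. This is an illustrative sketch, not the PR's actual class; the name `AsyncJsonlWriter` and its API are assumptions.

```python
import json
import queue
import threading


class AsyncJsonlWriter:
    """Minimal sketch of a background JSONL writer: producers enqueue dicts,
    a single worker thread serializes and appends them to the output file."""

    _SENTINEL = object()

    def __init__(self, path: str):
        self.path = path
        self.q: queue.Queue = queue.Queue()
        self.thread = threading.Thread(target=self._worker, daemon=True)
        self.thread.start()

    def _worker(self):
        with open(self.path, "a", encoding="utf-8") as f:
            while True:
                item = self.q.get()
                if item is self._SENTINEL:
                    break
                f.write(json.dumps(item) + "\n")
                f.flush()  # keep partial results visible while the run is in flight

    def put(self, record: dict):
        """Enqueue one prediction record; returns immediately."""
        self.q.put(record)

    def stop(self):
        """Signal the worker and block until everything is flushed to disk."""
        self.q.put(self._SENTINEL)
        self.thread.join()
```

A single worker thread keeps writes ordered without any locking on the inference side, which is why the diagram shows a dedicated "stop & flush" step before evaluation reads the final JSONL.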
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/evaluation/evaluator/__init__.py (1)
131-131: ⚠️ Potential issue | 🟡 Minor
Remove debug print
This looks like a leftover debug artifact. Either remove it or replace with proper logging.

```diff
-    print(f"evaluator: {evaluator}")
```
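If the diagnostic is worth keeping rather than deleting, the usual replacement is a module-level logger at DEBUG level. The function name and registry shape below are illustrative, not taken from the actual file:

```python
import logging

LOG = logging.getLogger(__name__)


def resolve_evaluator(eval_type: str, registry: dict):
    """Illustrative lookup: log at DEBUG instead of printing to stdout,
    so the message is filterable and silenced by default."""
    evaluator = registry[eval_type]
    LOG.debug("evaluator: %s", evaluator)  # replaces print(f"evaluator: {evaluator}")
    return evaluator
```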
🤖 Fix all issues with AI agents
In `@nemo_skills/evaluation/evaluator/__init__.py`:
- Around line 30-33: The current top-level try/except hides import errors for
ComputeEvalEvaluator; instead capture the ImportError into a module-level
variable (e.g. _compute_eval_import_error) and set ComputeEvalEvaluator = None,
then in get_evaluator_class (or the evaluator registration lookup) check if
eval_type == "compute-eval" and if ComputeEvalEvaluator is None raise a clear
ImportError that includes _compute_eval_import_error; this defers the failure to
the point of use and gives an actionable message when someone requests the
"compute-eval" evaluator.
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 264-275: In _flush_pkl_to_jsonl, replace the broad "except
Exception" around pickle.load with a narrower handler for the transient errors
that indicate a mid-write pickle (e.g., (EOFError, pickle.UnpicklingError,
BlockingIOError)) so those are safely skipped; for other unexpected errors
(e.g., PermissionError, MemoryError) let them propagate or log them explicitly
before re-raising so they aren't silently swallowed—update the exception clause
in _flush_pkl_to_jsonl accordingly and include a clear processLogger.error(...)
call when re-raising non-transient exceptions.
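A minimal sketch of the suggested narrowing, assuming a hypothetical `flush_pkl_to_jsonl`-style poller: transient mid-write errors are skipped for the next cycle, anything else is logged and re-raised.

```python
import json
import logging
import pickle

LOG = logging.getLogger(__name__)


def flush_pkl_to_jsonl(pkl_path: str, jsonl_path: str) -> bool:
    """Try to convert a pickle of records to JSONL. Returns False when the
    pickle looks mid-write (transient), re-raises anything unexpected."""
    try:
        with open(pkl_path, "rb") as f:
            records = pickle.load(f)
    except (EOFError, pickle.UnpicklingError, BlockingIOError):
        # Writer may still be mid-write; try again on the next cycle.
        return False
    except Exception:
        # Unexpected (PermissionError, MemoryError, ...): surface it loudly.
        LOG.error("Unexpected error while reading %s", pkl_path)
        raise
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for rec in records:
            out.write(json.dumps(rec) + "\n")
    return True
```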
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 467-522: _evaluate_results currently always imports
vlmeval.dataset.avlm.utils.asr_wer and writes eval_kit_metrics.json with only
WER which is wrong for non-ASR datasets; update the method to consult a config
flag (e.g., self.cfg.metrics_type or self.cfg.dataset_type) or a new
self.cfg.eval_function setting before importing/using asr_wer so you only
compute WER when the dataset is ASR-type, otherwise skip creating
eval_kit_metrics.json or call a configurable evaluator; reference the
METRICS_TYPE_OVERRIDE constant and the EvalKitMetrics consumer to ensure the
written file matches the selected evaluator, and add a clear guard in
_evaluate_results (or load evaluator via a registry/lookup) to avoid the broad
Exception path producing meaningless metrics.
In `@nemo_skills/pipeline/eval.py`:
- Around line 40-67: The code mixes direct access and .get() for
cluster_config["containers"] in _apply_task_overrides which is inconsistent;
update the membership check to use direct access so failures are explicit.
Replace cluster_config.get("containers", {}) with cluster_config["containers"]
in the for-loop that selects container (i.e., change if key and key in
cluster_config.get("containers", {}): to if key and key in
cluster_config["containers"]:) while keeping the initial container =
cluster_config["containers"]["nemo-skills"] and the references to CONTAINER_KEY
and task_classes unchanged.
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 32-42: The _resolve_generation_task_class function currently
swallows all exceptions; change its error handling to only catch ImportError and
ModuleNotFoundError (so syntax/runtime errors in the module propagate) and, when
catching these import-related errors, log a warning including the exception
details and the module_name; keep the rest of the logic (import_from_path vs
importlib.import_module and returning getattr(..., "GENERATION_TASK_CLASS",
None)) unchanged and use the module logger (e.g., logging.getLogger(__name__))
or an existing logger in the file for the warning message.
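The narrowed resolver the prompt asks for can be sketched like this; the warning text and logger name are illustrative, and the function mirrors but is not the repository's actual helper:

```python
import importlib
import logging

LOG = logging.getLogger(__name__)


def resolve_generation_task_class(module_name: str):
    """Import a generation module and return its GENERATION_TASK_CLASS.

    Only import-related failures are tolerated (the module may be an optional
    extra); syntax or runtime errors inside the module still propagate."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as e:  # ModuleNotFoundError is a subclass
        LOG.warning("Could not import generation module '%s': %s", module_name, e)
        return None
    return getattr(module, "GENERATION_TASK_CLASS", None)
```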
🧹 Nitpick comments (13)
nemo_skills/evaluation/evaluator/audio.py (2)
499-516: Mutable `_TRANSLATION_TYPES` set is fragile if ever hoisted to module scope.
`_TRANSLATION_TYPES` is mutated on line 504 (`_TRANSLATION_TYPES.add(task_type)`). This works correctly today because it's re-created on every call, but the `_ALL_CAPS` naming convention strongly suggests a module-level constant. If a future refactor moves it to module scope (like `_FAILURE_RESPONSES` or `VALID_NORMALIZATION_MODES` above), the set would accumulate task types across calls — a subtle, hard-to-detect bug. Consider making the intent clearer by either:
- Keeping the sets local but using lowercase naming (`asr_types`, `translation_types`), or
- Making the module-level sets truly immutable (frozensets) and building the union inline.

Option 2: immutable module-level sets + inline union

Define at module level:

```python
_ASR_TYPES = frozenset({"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"})
_TRANSLATION_TYPES = frozenset({"AST", "Translation"})
```

Then inside `evaluate_sample`:

```diff
-    _ASR_TYPES = {"ASR", "ASR-ZH", "ASR-PC", "ASR_LEADERBOARD"}
-    _TRANSLATION_TYPES = {"AST", "Translation"}
-    # AudioBench speech translation types: ST-{src}-{tgt}
-    if task_type.startswith("ST-"):
-        _TRANSLATION_TYPES.add(task_type)
-
-    if task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation:
+    is_translation = task_type in _TRANSLATION_TYPES or task_type.startswith("ST-")
+    is_asr = task_type in _ASR_TYPES
+
+    if (is_asr or is_translation or task_type == "CER") and not generation:
```

And similarly replace `task_type in _TRANSLATION_TYPES` checks with `is_translation`.
557-562: MathQA exact-match may be too strict for numerical answers.
Pure string equality after lowercasing won't match equivalent representations like `"3.0"` vs `"3"`, `"1/2"` vs `"0.5"`, or whitespace/formatting differences in expressions. If the AudioBench MathQA dataset guarantees canonical answer forms, this is fine — but worth a brief comment in the code to document that assumption.

nemo_skills/inference/generate.py (1)
296-306: Consider extracting the shared `get_env_prefix`/`get_extra_package_dirs` logic into a mixin or the base class.
Both `mcore_skills.py` (lines 143–164) and `eval_kit.py` (lines 128–149) have identical implementations of `get_env_prefix()` and `get_extra_package_dirs()`. Since the base class already defines these hooks, the shared VLMEvalKit environment setup could live in a common mixin (e.g., `VLMEvalKitMixin`) or as a utility that both subclasses delegate to, avoiding the copy-paste.

nemo_skills/dataset/eval_kit/__init__.py (1)
33-42: Add type hints to the function signature.
As per coding guidelines, "Use type hints for simple types (dict, list, int, float, existing classes) in Python code".

Suggested fix

```diff
-def get_extra_generation_args(benchmark):
+def get_extra_generation_args(benchmark: str) -> str:
```

nemo_skills/evaluation/metrics/eval_kit_metrics.py (3)
42-44: `**kwargs` is silently swallowed — unsupported arguments won't raise errors.
`get_metrics()` in `map_metrics.py` forwards `**kwargs` from the user's `metrics_kwargs`. Silently discarding them here means typos or invalid metric options will be ignored rather than failing. Consider either validating that no unexpected kwargs are passed or removing `**kwargs` entirely if no extra arguments are needed. As per coding guidelines, "Avoid silently ignoring unused user-passed parameters".

Suggested fix

```diff
     def __init__(self, **kwargs):
+        if kwargs:
+            LOG.warning("EvalKitMetrics ignores extra kwargs: %s", list(kwargs.keys()))
         super().__init__(compute_no_answer=False)
```

Or, if no extra kwargs should ever be passed:

```diff
-    def __init__(self, **kwargs):
+    def __init__(self):
         super().__init__(compute_no_answer=False)
```
56-58: `update()` skips `super().update(predictions)` — intentional but worth documenting inline.
`BaseMetrics.update()` handles token counting, timing stats, and `max_k` tracking. Skipping it means those metrics will be absent. This is correct for pre-computed VLMEvalKit aggregates, but a brief inline comment would help future maintainers understand the deliberate deviation.

38-40: Class-level mutable state `_shared_metrics_file` persists across test runs.
`_shared_metrics_file` is a class variable that survives across multiple instantiations and test cases. If tests create `EvalKitMetrics` instances for different benchmarks, a stale path from a prior test could leak. Consider resetting it in `__init__` or `setup()` when an instance-level path is found, or documenting the expected lifecycle.

nemo_skills/pipeline/utils/eval.py (2)
365-379: Duplicate module import logic — `_resolve_generation_task_class` vs lines 461-474.
`_resolve_generation_task_class` (lines 32-42) performs the same import-and-get-`GENERATION_TASK_CLASS` as lines 461-474 later in `prepare_eval_commands`. The inline version raises on missing `GENERATION_TASK_CLASS`, while the helper silently returns `None`. Consider consolidating into one function with an option to raise or return `None`.

413-417: Forcing `num_jobs = total_evals` when self-contained tasks are present may over-parallelize mixed workloads.
If the benchmark list contains both self-contained and server-based benchmarks, this forces every benchmark into its own job. The comment at lines 413-414 explains the rationale for self-contained tasks, but it may be an unexpected side effect for the non-self-contained benchmarks in the same eval run. Consider documenting this behavior in the `--num_jobs` help text or logging which benchmarks are affected.

nemo_skills/inference/eval/eval_kit.py (2)
162-227: Dataset is built twice on rank 0 when `world_size > 1`.
Lines 218 and 221 both call `build_dataset(cfg.vlm_dataset, **dataset_kwargs)`. The first is rank-0 only (for download), the second is on all ranks. Rank 0 redundantly builds the dataset a second time. If dataset construction is expensive (beyond just downloading), consider caching the result from the first call.

Suggested optimization

```diff
         if world_size > 1:
             import torch.distributed as dist

             if rank == 0:
-                build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
             dist.barrier()
+            if rank != 0:
+                self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
+        else:
+            self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
-        self.dataset = build_dataset(cfg.vlm_dataset, **dataset_kwargs)
```

337-361: Hardcoded dataset name lists will silently become stale as VLMEvalKit evolves.
These lists mirror `VLMEvalKit/run.py` but must be manually kept in sync. If VLMEvalKit adds a new dataset that needs special kwargs (e.g., `nframe`), it won't get them here. Consider importing or referencing these lists from VLMEvalKit if possible, or adding a comment with the VLMEvalKit version/commit these were copied from for traceability.

nemo_skills/inference/mcore_skills.py (2)
126-165: `get_env_prefix()` and `get_extra_package_dirs()` are duplicated verbatim from `eval_kit.py`.
Both `EvalKitGenerationTask` (in `eval_kit.py`, lines 128-150) and `MegatronMCoreGenerationTask` share identical implementations for `get_env_prefix()` and `get_extra_package_dirs()`. Extract these into a shared mixin or utility to avoid drift.

406-428: Redirecting `sys.stdout` to `/dev/null` is fragile — prefer suppressing at the logging/tqdm level.
Replacing `sys.stdout` globally on non-primary DP ranks suppresses all stdout, including potential error messages or debug info that isn't from VLMEvalKit. This pattern can also confuse debuggers and profilers. Consider using `contextlib.redirect_stdout` for a cleaner scope, or configuring VLMEvalKit's verbosity directly if possible.
That said, the `finally` cleanup at lines 425-428 is correct and ensures `sys.stdout` is restored even on exceptions.
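The scoped alternative mentioned above can be sketched as follows; the wrapper name and rank-selection logic are illustrative, not the file's actual code:

```python
import contextlib
import os


def run_quietly_on_non_primary_rank(fn, dp_rank: int):
    """Scoped alternative to globally reassigning sys.stdout: only the primary
    data-parallel rank prints; other ranks discard stdout for the duration of
    fn, and stdout is restored automatically even if fn raises."""
    if dp_rank == 0:
        return fn()
    with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
        return fn()
```

Because `redirect_stdout` is a context manager, restoration happens on exit in all cases, replacing the manual save/restore in a `finally` block.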
VLMEvalKit's MultiModalMCore inference engine:
- eval_kit: uses VLMEvalKit's native dataset/evaluation pipeline with NeMo Skills output formatting
- mcore_skills: reads NeMo Skills JSONL data, runs inference through MultiModalMCore in-process with data parallelism, computes metrics using VLMEvalKit functions (e.g. asr_wer), and writes results in NeMo Skills format

Key changes:
- New inference modules: eval_kit.py, mcore_skills.py
- New dataset module: eval_kit/ for VLMEvalKit benchmark resolution
- New metrics class: EvalKitMetrics for pre-computed VLMEvalKit metrics
- Pipeline extensions: self-contained task support, METRICS_TYPE_OVERRIDE, eval_kit container/mounts handling
- Lazy imports for sacrebleu to avoid missing-dep crashes

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
- Remove redundant metric_type field from BenchmarkArgs; consolidate into the single metrics_type field that the summarize step reads. METRICS_TYPE_OVERRIDE from task classes now writes to metrics_type so the override actually takes effect for non-eval_kit benchmarks.
- Sort job_benchmarks before passing to job_batches for deterministic task names across runs.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
EvalKitConfig uses flat fields (server_url, model_name) while the pipeline's configure_client() injects nested ++server.* Hydra overrides. This mismatch crashes Hydra at startup for vllm mode. Fix by making eval_kit always self-contained. For vllm mode the user passes ++server_url and ++model_name directly as extra args. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Actionable comments posted: 5
♻️ Duplicate comments (3)
nemo_skills/inference/mcore_skills.py (1)
480-523: ⚠️ Potential issue | 🟠 Major
`_evaluate_results()` is ASR-specific and masks failures.
This path always computes WER via `asr_wer` and then swallows all other runtime errors. That can silently generate incorrect/missing `eval_kit_metrics.json` for non-ASR use cases.

```bash
#!/bin/bash
# Verify whether mcore_skills is used outside ASR contexts and confirm hardcoded WER path.
set -euo pipefail

echo "== mcore_skills wiring =="
rg -n "mcore_skills|METRICS_TYPE_OVERRIDE" nemo_skills -g '*.py'
echo
echo "== hardcoded evaluator in mcore_skills =="
rg -n "asr_wer|_evaluate_results|except Exception" nemo_skills/inference/mcore_skills.py -C 3
echo
echo "== dataset modules pointing to mcore/eval-kit generation =="
rg -n "GENERATION_MODULE|mcore_skills|eval_kit" nemo_skills/dataset -g '*.py'
```

Based on learnings: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 480 - 523, The _evaluate_results() block currently always calls asr_wer and swallows all runtime errors via a broad except Exception, which hides failures and misapplies WER for non-ASR tasks; change it to only run the asr_wer path when the job/dataset/metrics type explicitly indicates ASR (check METRICS_TYPE_OVERRIDE or task/dataset metadata before calling asr_wer), remove or narrow the broad except Exception (either let unexpected exceptions propagate or re-raise after logging), and ensure failures produce no silent missing eval_kit_metrics.json (i.e., only write eval_kit_metrics.json when asr_wer succeeded); update symbols: _evaluate_results, asr_wer, eval_kit_metrics.json, METRICS_TYPE_OVERRIDE accordingly.

nemo_skills/inference/eval/eval_kit.py (1)
277-282: ⚠️ Potential issue | 🟠 Major
Narrow pickle-read exception handling to transient cases only.
Catching `Exception` here can hide non-transient failures and silently mask broken runs. Based on learnings: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving".

Proposed fix

```diff
-        except Exception:
-            # pkl may be mid-write; skip this cycle
-            return
+        except (EOFError, pickle.UnpicklingError, BlockingIOError):
+            # pkl may be mid-write; skip this cycle
+            return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/eval_kit.py` around lines 277 - 282, The current blanket except in the pickle load block silently swallows all errors; narrow it to transient/expected read failures only by catching specific exceptions (e.g., EOFError, pickle.UnpicklingError, and OSError) around the with open(pkl_path, "rb") / pickle.load(f) call and keep the early return for those cases, but allow any other unexpected exceptions to propagate (i.e., remove the generic except Exception and only return on the specific transient exceptions).

nemo_skills/pipeline/utils/eval.py (1)
33-43: ⚠️ Potential issue | 🟠 Major
Don't swallow non-import failures when resolving generation task classes.
`except Exception` masks real module bugs (syntax/runtime errors), causing silent fallback to `None` and downstream misconfiguration.

Proposed fix

```diff
 def _resolve_generation_task_class(module_name: str):
`@@`
-    except Exception:
+    except (ImportError, ModuleNotFoundError) as e:
+        LOG.warning("Could not import generation module '%s': %s", module_name, e)
         return None
```

```bash
#!/bin/bash
set -euo pipefail
# Verify broad exception handling in generation task resolution.
rg -n -C3 --type=py 'def _resolve_generation_task_class|except Exception' nemo_skills/pipeline/utils/eval.py
```

Based on learnings: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 33 - 43, The _resolve_generation_task_class function currently swallows all exceptions; change the broad "except Exception" to catch only import-related failures (e.g., ImportError, ModuleNotFoundError, FileNotFoundError) so real module errors (syntax/runtime) propagate; keep the existing return None behavior inside that narrow except block and let any other exceptions raised by import_from_path or importlib.import_module bubble up. Reference: _resolve_generation_task_class, import_from_path, importlib.import_module, and GENERATION_TASK_CLASS.
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/audio.py (1)
515-532: Consider adding `wer_c` and `wer_pc` defaults for ASR-PC missing generation.
When `task_type` is `"ASR-PC"` and generation is missing, this returns only `wer: 1.0`. However, the normal ASR-PC path (lines 537-543) returns additional metrics: `wer_c`, `wer_pc`, and `per`. Looking at the metrics aggregation in `audio_metrics.py`, it filters with `if "wer_c" in pred and pred["wer_c"] is not None`, meaning missing-generation samples won't contribute to `wer_c`/`wer_pc` averages. This could cause those metrics to appear artificially better than `wer` since they exclude the worst samples. If parity is desired, you could return default worst-case values for all metrics:

```python
if task_type == "ASR-PC":
    return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/audio.py` around lines 515 - 532, When handling missing generation in the branch that checks task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation, add worst-case defaults for ASR-PC so it returns all ASR-PC metrics instead of only "wer": specifically detect if task_type == "ASR-PC" and return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}; keep the existing CER and translation branches unchanged and ensure you reference the existing _ASR_TYPES/_TRANSLATION_TYPES/task_type variables and the base dict when constructing the return value.

nemo_skills/pipeline/utils/eval.py (1)
397-403: Reuse the already-resolved `generation_task_class` to reduce duplication.
You store `ba.generation_task_class` here, but the loop later re-imports modules again per eval item. Reusing the cached class would simplify flow and avoid redundant imports. As per coding guidelines: "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks..."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 397 - 403, The loop that sets ba.generation_task_class using _resolve_generation_task_class already resolves and caches the task class; update the later per-eval-item logic to reuse ba.generation_task_class instead of re-importing or calling _resolve_generation_task_class again. Concretely, in the code that iterates eval items (where it currently recomputes generation task classes from generation_module or ba.generation_module), first check ba.generation_task_class and use it when present, falling back to _resolve_generation_task_class only if the cached attribute is None; this removes duplicate imports and extra conditional checks while preserving existing behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/dataset/eval_kit/__init__.py`:
- Around line 39-42: The code currently returns an empty string when the
incoming benchmark string lacks a dot, silently dropping the required
vlm_dataset; update the branch that handles the missing suffix (the snippet that
checks "if '.' in benchmark" and returns "" otherwise) to instead raise a clear
exception (e.g., ValueError) indicating the expected format
"eval_kit.<dataset_name>" and include the actual benchmark value in the message
so callers fail fast; keep the existing behavior of extracting sub =
benchmark.split(".", 1)[1] and returning f" ++vlm_dataset={sub} " when a dot is
present.
In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 46-55: In EvalKitMetrics.setup, if the candidate metrics file is
not found, clear any previous references by setting both
self.eval_kit_metrics_file and EvalKitMetrics._shared_metrics_file to None (or
appropriate empty value) so a stale path isn't reused; locate the setup method
in the EvalKitMetrics class and ensure the branch where candidate.exists() is
false explicitly resets these attributes (and still sets them when candidate
exists).
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 124-127: is_self_contained currently only checks extra_arguments
for "++model_type=mcore" and returns False when extra_arguments is empty,
misclassifying defaults; update is_self_contained(cls, extra_arguments: str =
"") to first look for a "++model_type=" token in extra_arguments and, if found,
return True when its value is "mcore", otherwise if no token is present consult
EvalKitConfig.model_type (the config default) and return True when that default
equals "mcore"; reference the is_self_contained method and
EvalKitConfig.model_type when making this change.
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 404-405: The per-rank output filename using
output_rank{dp_rank}.jsonl can collide across concurrent runs; update the code
that constructs rank_file (and the similar constructions used at the other
occurrences where output_rank{dp_rank}.jsonl is created) to include a run-unique
suffix (e.g., uuid, process id, or timestamp) or use a safe temporary-file API
so each run gets a unique per-rank path; specifically change the places that set
rank_file (and the two other spots noted) to append a unique_run_id (or obtain a
tempfile from tempfile.NamedTemporaryFile/tmpdir) so concurrent seeds/chunks
cannot overwrite each other.
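The run-unique suffix the prompt suggests can be sketched as a small path helper; the function name, the `run_id` parameter, and the filename pattern are illustrative assumptions, not the repository's actual code:

```python
import os
import uuid
from typing import Optional


def make_rank_output_path(output_dir: str, dp_rank: int, run_id: Optional[str] = None) -> str:
    """Build a per-rank output path that also embeds a run-unique id, so
    concurrent runs (different seeds/chunks) cannot overwrite each other."""
    run_id = run_id or uuid.uuid4().hex[:8]
    return os.path.join(output_dir, f"output_rank{dp_rank}_{run_id}.jsonl")
```

Passing an explicit `run_id` (e.g., derived from the seed or chunk index) keeps paths reproducible within a run while still separating concurrent runs.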
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 526-527: Replace the silent .get() calls that convert None to
empty strings with direct key access so missing keys fail loudly: change model =
server_parameters.get("model", "") and server_type =
server_parameters.get("server_type", "") to model = server_parameters["model"]
and server_type = server_parameters["server_type"]; this aligns with the rest of
the file’s direct bracket access to server_parameters and prevents producing
invalid client overrides like "++model_name=".
---
Duplicate comments:
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 277-282: The current blanket except in the pickle load block
silently swallows all errors; narrow it to transient/expected read failures only
by catching specific exceptions (e.g., EOFError, pickle.UnpicklingError, and
OSError) around the with open(pkl_path, "rb") / pickle.load(f) call and keep the
early return for those cases, but allow any other unexpected exceptions to
propagate (i.e., remove the generic except Exception and only return on the
specific transient exceptions).
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 480-523: The _evaluate_results() block currently always calls
asr_wer and swallows all runtime errors via a broad except Exception, which
hides failures and misapplies WER for non‑ASR tasks; change it to only run the
asr_wer path when the job/dataset/metrics type explicitly indicates ASR (check
METRICS_TYPE_OVERRIDE or task/dataset metadata before calling asr_wer), remove
or narrow the broad except Exception (either let unexpected exceptions propagate
or re-raise after logging), and ensure failures produce no silent missing
eval_kit_metrics.json (i.e., only write eval_kit_metrics.json when asr_wer
succeeded); update symbols: _evaluate_results, asr_wer, eval_kit_metrics.json,
METRICS_TYPE_OVERRIDE accordingly.
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 33-43: The _resolve_generation_task_class function currently
swallows all exceptions; change the broad "except Exception" to catch only
import-related failures (e.g., ImportError, ModuleNotFoundError,
FileNotFoundError) so real module errors (syntax/runtime) propagate; keep the
existing return None behavior inside that narrow except block and let any other
exceptions raised by import_from_path or importlib.import_module bubble up.
Reference: _resolve_generation_task_class, import_from_path,
importlib.import_module, and GENERATION_TASK_CLASS.
---
Nitpick comments:
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 515-532: When handling missing generation in the branch that
checks task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not
generation, add worst-case defaults for ASR-PC so it returns all ASR-PC metrics
instead of only "wer": specifically detect if task_type == "ASR-PC" and return
{**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}; keep the existing
CER and translation branches unchanged and ensure you reference the existing
_ASR_TYPES/_TRANSLATION_TYPES/task_type variables and the base dict when
constructing the return value.
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 397-403: The loop that sets ba.generation_task_class using
_resolve_generation_task_class already resolves and caches the task class;
update the later per-eval-item logic to reuse ba.generation_task_class instead
of re-importing or calling _resolve_generation_task_class again. Concretely, in
the code that iterates eval items (where it currently recomputes generation task
classes from generation_module or ba.generation_module), first check
ba.generation_task_class and use it when present, falling back to
_resolve_generation_task_class only if the cached attribute is None; this
removes duplicate imports and extra conditional checks while preserving existing
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 434bc35b-a3d5-475b-b898-e9b449c88afe
📒 Files selected for processing (13)
- nemo_skills/dataset/eval_kit/__init__.py
- nemo_skills/dataset/utils.py
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/evaluation/metrics/eval_kit_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/translation_metrics.py
- nemo_skills/inference/eval/eval_kit.py
- nemo_skills/inference/factory.py
- nemo_skills/inference/generate.py
- nemo_skills/inference/mcore_skills.py
- nemo_skills/pipeline/eval.py
- nemo_skills/pipeline/utils/eval.py
- nemo_skills/pipeline/utils/generation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- nemo_skills/pipeline/eval.py
Only catch ImportError/ModuleNotFoundError instead of bare Exception. Syntax errors, missing dependencies, and other real bugs in generation modules now propagate instead of being silently swallowed. Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 6
♻️ Duplicate comments (4)
nemo_skills/dataset/eval_kit/__init__.py (1)
39-45: ⚠️ Potential issue | 🟠 Major
Also fail when the dataset suffix is empty (`eval_kit.`).
Line 44 can produce `++vlm_dataset=` with an empty value. Please validate that the suffix exists.

Proposed fix

```diff
 def get_extra_generation_args(benchmark):
`@@`
-    if "." not in benchmark:
+    if not benchmark.startswith("eval_kit.") or "." not in benchmark:
         raise ValueError(
             f"eval_kit benchmark must be in 'eval_kit.<dataset_name>' format, got '{benchmark}'. "
             f"Example: eval_kit.MMBench_DEV_EN, eval_kit.LibriSpeech_test_clean"
         )
     sub = benchmark.split(".", 1)[1]
+    if not sub:
+        raise ValueError(
+            f"eval_kit benchmark must include a dataset name after 'eval_kit.', got '{benchmark}'."
+        )
     return f" ++vlm_dataset={sub} "
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/eval_kit/__init__.py` around lines 39 - 45, The function currently accepts a benchmark string and splits on the first dot to produce sub and return " ++vlm_dataset={sub} ", but it does not validate that the suffix exists (so "eval_kit." yields an empty value); update the validation for the input variable benchmark to ensure there is a non-empty suffix after the dot (i.e., after benchmark.split(".", 1)[1]) and raise a ValueError with a clear message if the suffix is empty; modify the logic around the existing benchmark check and the variable sub to perform this empty-string check before returning the formatted " ++vlm_dataset={sub} " value.

nemo_skills/evaluation/metrics/eval_kit_metrics.py (1)
45-56: ⚠️ Potential issue | 🟠 Major

Clear instance-level metrics path when setup does not find `eval_kit_metrics.json`.

Line 55 resets only `EvalKitMetrics._shared_metrics_file`; `self.eval_kit_metrics_file` can still point to a stale file and override the reset in `get_metrics()`.

Proposed fix

```diff
 def setup(self, input_files):
     """Find the eval_kit_metrics.json in the same directory as the input files."""
+    self.eval_kit_metrics_file = None
+    EvalKitMetrics._shared_metrics_file = None
     if input_files:
         # input_files are like ['/path/to/eval-results/eval_kit.MMBench_DEV_EN/output.jsonl']
         metrics_dir = Path(input_files[0]).parent
         candidate = metrics_dir / "eval_kit_metrics.json"
         if candidate.exists():
             self.eval_kit_metrics_file = candidate
             EvalKitMetrics._shared_metrics_file = candidate
-        else:
-            # Reset stale shared path so a previous run's file isn't reused.
-            EvalKitMetrics._shared_metrics_file = None
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py` around lines 45-56: the setup method currently clears only EvalKitMetrics._shared_metrics_file when eval_kit_metrics.json isn't found, leaving self.eval_kit_metrics_file pointing to a stale path; update the setup() branch where candidate.exists() is false to also set self.eval_kit_metrics_file = None so instance-level state is cleared and get_metrics() won't reuse a stale file (refer to setup, self.eval_kit_metrics_file, EvalKitMetrics._shared_metrics_file, and get_metrics()).

nemo_skills/inference/eval/eval_kit.py (2)
124-132: ⚠️ Potential issue | 🟠 Major

`is_self_contained()` misclassifies default config runs.

Line 131 returns `False` unless `++model_type=mcore` is explicitly passed, even though `EvalKitConfig.model_type` defaults to `"mcore"`.

Proposed fix

```diff
 @classmethod
 def is_self_contained(cls, extra_arguments: str = "") -> bool:
-    """Self-contained only when user explicitly requests mcore mode.
-
-    Note: EvalKitConfig.model_type defaults to "mcore" at runtime, but
-    at submission time we check explicit user intent. Without the flag
-    the pipeline assumes vllm (server-based) mode.
-    """
-    return "++model_type=mcore" in extra_arguments
+    """Self-contained in mcore mode."""
+    for token in extra_arguments.split():
+        if token.startswith("++model_type="):
+            return token.split("=", 1)[1] == "mcore"
+    return EvalKitConfig.model_type == "mcore"
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/eval_kit.py` around lines 124 - 132, The is_self_contained method currently only looks for the explicit "++model_type=mcore" token in extra_arguments and thus misclassifies runs where EvalKitConfig.model_type defaults to "mcore"; update is_self_contained to return True if either the explicit flag is present in extra_arguments OR EvalKitConfig.model_type == "mcore" (safely handling cases where the config may be None or unset). Locate the is_self_contained(cls, extra_arguments: str = "") definition and add a secondary check against EvalKitConfig.model_type (or the appropriate config accessor) so both explicit user intent and the default config are honored.
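The token-scan behavior the proposed fix describes can be sketched standalone; `DEFAULT_MODEL_TYPE` below stands in for `EvalKitConfig.model_type` and is an assumption for illustration:

```python
DEFAULT_MODEL_TYPE = "mcore"  # stand-in for EvalKitConfig.model_type


def is_self_contained(extra_arguments: str = "") -> bool:
    """True when the run is in mcore mode.

    Honors an explicit ++model_type=... override anywhere in the extra
    arguments; otherwise falls back to the configured default, so empty
    extra_arguments no longer misclassifies default mcore runs.
    """
    for token in extra_arguments.split():
        if token.startswith("++model_type="):
            return token.split("=", 1)[1] == "mcore"
    return DEFAULT_MODEL_TYPE == "mcore"
```

This keeps explicit user intent authoritative while making the no-flag case consistent with the runtime default.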
282-287: ⚠️ Potential issue | 🟠 Major

Narrow exception handling when reading pickle snapshots.

Line 285 catches all exceptions and silently skips, which can hide non-transient errors.

Proposed fix

```diff
-    except Exception:
+    except (EOFError, pickle.UnpicklingError, BlockingIOError, OSError):
         # pkl may be mid-write; skip this cycle
         return
```

As per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/eval_kit.py` around lines 282 - 287, The try/except around pickle.load(pkl_path) is too broad; narrow it to only handle transient, expected errors (e.g., EOFError, pickle.UnpicklingError, and OSError) so real bugs surface. Replace "except Exception:" with "except (EOFError, pickle.UnpicklingError, OSError) as e:" and keep the sleep/skip/return behavior (and optionally a debug log using pkl_path and e); let any other exceptions propagate. Ensure pickle.UnpicklingError is referenced/imported and keep the variable names pkl_path and data unchanged.
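The narrowed handler can be checked in isolation. A minimal sketch (the helper name `read_snapshot` is illustrative, not from the PR):

```python
import pickle
from pathlib import Path


def read_snapshot(pkl_path: Path):
    """Load a pickle snapshot that another process may be mid-writing.

    Transient read failures (truncated file, unpickling garbage,
    missing or locked file) return None so the caller can retry on the
    next polling cycle; any other exception propagates.
    """
    try:
        with open(pkl_path, "rb") as f:
            return pickle.load(f)
    except (EOFError, pickle.UnpicklingError, OSError):
        return None
```

A bug such as a `NameError` inside the loading path would now surface instead of being mistaken for a mid-write snapshot.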
🧹 Nitpick comments (2)
nemo_skills/evaluation/evaluator/audio.py (1)
515-589: Add end-to-end coverage for new task routing and fallback paths.

Please add/extend SLURM or integration benchmark tests for `ST-*` translation routing, `MathQA`, and missing-generation behavior to prevent silent metric regressions.

Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/audio.py` around lines 515 - 589, The PR lacks integration/SLURM test coverage for new routing and fallback logic (ST-* translation routing, MathQA handling, and missing-generation behavior); add end-to-end tests that feed sample records exercising task_type values starting with "ST-", "MathQA", and cases where generation is empty/None to assert correct metric outputs (e.g., that ST-* is treated as translation via _TRANSLATION_TYPES, MathQA sets "is_correct" and "predicted_answer", and missing generation returns the "missing_generation" base with BLEU/WER/CER as appropriate). Implement tests that call the evaluator path hitting the routing logic (the block using task_type checks and functions evaluate_translation, evaluate_asr/evaluate_asr_pc, evaluate_cer, evaluate_hallucination, evaluate_pc_rate) and validate returned metrics and fallback fields, including the ASR_LEADERBOARD reference_fields branch to ensure extra wer_* and is_correct_* metrics are produced when reference_fields are present.nemo_skills/inference/mcore_skills.py (1)
468-526: Add a slurm e2e test for this new inline-eval path.

This file now owns non-trivial generation + metric persistence behavior; a benchmark-level slurm test would catch regressions in `_evaluate_results` and `eval_kit_metrics.json` production.

Based on learnings: "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 468 - 526, Add a slurm-level end-to-end test that exercises the new inline evaluation path in _evaluate_results: create a small benchmark job that writes an output_file with entries (including some with <think> tags), runs the skill so _evaluate_results executes (triggering import of asr_wer), and assert that the cleaned output_file is rewritten, that eval_kit_metrics.json is created next to the output file with a "wer" key, and that LOG.info for "ASR WER" is emitted; use the same helper(s) used by other slurm tests to schedule a job, point the job at a fixture dataset and config that triggers _strip_thinking_tags and metric computation, and fail the test if eval_kit_metrics.json is missing or malformed or generations remain uncleaned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 515-533: The code calls generation.strip() before the
missing-generation guard, which raises when generation is None; update the usage
of generation in the function (the place that currently does generation.strip())
to defensively handle None by using (generation or "").strip() or by explicitly
checking generation is not None before calling .strip(), and ensure this check
runs before the missing-generation handling block that uses task_type and
generation so missing-generation returns (is_correct False / error
"missing_generation") work as intended for None/empty generation values.
- Around line 522-533: The missing-generation branch for ASR-related tasks
returns only "wer" for ASR-PC variants, which omits metrics required for ASR-PC
aggregation; update the conditional that checks task_type in (_ASR_TYPES |
_TRANSLATION_TYPES | {"CER"}) and not generation to include the additional
ASR-PC default fields when task_type corresponds to ASR-PC (e.g., add "wer_c",
"wer_pc", and "per" alongside "wer"); use the existing base dict ("is_correct":
False, "error": "missing_generation") and return {**base, "wer": 1.0, "wer_c":
1.0, "wer_pc": 1.0, "per": 1.0} for the ASR-PC case while leaving the existing
branches for _TRANSLATION_TYPES and "CER" unchanged.
In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 41-43: The constructor currently takes **kwargs and drops them;
update the __init__ method (the constructor that calls
super().__init__(compute_no_answer=False) and sets self.eval_kit_metrics_file)
to fail fast: if kwargs is not empty, raise a TypeError listing the unexpected
kwarg names (e.g., raise TypeError(f"Unexpected constructor arguments: {',
'.join(kwargs.keys())}")), otherwise proceed to call super and set
self.eval_kit_metrics_file as before.
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 463-467: The .done file is created before running inline metrics
which are then caught and swallowed, allowing failed evaluations to be marked
complete; modify the flow so Path(f"{self.cfg.output_file}.done").touch() is
executed only after self._evaluate_results() completes successfully, and remove
or rework the broad try/except that silences errors around _evaluate_results
(and the similar block handling inline metrics referenced near the 519-526 area)
so exceptions propagate instead of being swallowed; ensure any
metric-evaluation-specific exceptions are either handled explicitly with proper
logging and re-raise, or not caught at all, so failed runs are not marked done
and can be retried.
- Around line 503-507: The current code reopens output_file and truncates
output.jsonl before writing cleaned entries, risking data loss if writing fails;
modify the cleanup/write to first write all JSONL lines to a temporary file
(e.g., using tempfile.NamedTemporaryFile(delete=False) or creating a tmp path
like f"{output_file}.tmp"), flush and close it, then atomically replace the
original by calling os.replace(tmp_path, output_file); reference the existing
variables output_file and entries and ensure proper encoding ("utf-8") and error
handling around the replace so the original file remains intact on failures.
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 528-533: The current split(":") on job_server_address is brittle;
update the parsing before host, port assignment to robustly handle URLs and
IPv6: if job_server_address starts with a scheme (e.g., "http://" or "https://")
use URL parsing (e.g., urlparse) to extract hostname and port; otherwise handle
host:port and IPv6 literal forms by splitting on the last colon (rsplit(":", 1))
and stripping surrounding brackets from IPv6 hosts; fall back to defaults
("localhost", "5000") when parsing fails, then pass host and int(port) into
generation_task.configure_client_overrides to replace the fragile host/port =
(job_server_address or "localhost:5000").split(":") logic.
---
Duplicate comments:
In `@nemo_skills/dataset/eval_kit/__init__.py`:
- Around line 39-45: The function currently accepts a benchmark string and
splits on the first dot to produce sub and return " ++vlm_dataset={sub} ", but
it does not validate that the suffix exists (so "eval_kit." yields an empty
value); update the validation for the input variable benchmark to ensure there
is a non-empty suffix after the dot (i.e., after benchmark.split(".", 1)[1]) and
raise a ValueError with a clear message if the suffix is empty; modify the logic
around the existing benchmark check and the variable sub to perform this
empty-string check before returning the formatted " ++vlm_dataset={sub} " value.
In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 45-56: The setup method currently clears only
EvalKitMetrics._shared_metrics_file when eval_kit_metrics.json isn't found,
leaving self.eval_kit_metrics_file pointing to a stale path; update the setup()
branch where candidate.exists() is false to also set self.eval_kit_metrics_file
= None so instance-level state is cleared and get_metrics() won't reuse a stale
file (refer to setup, self.eval_kit_metrics_file,
EvalKitMetrics._shared_metrics_file, and get_metrics()).
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 124-132: The is_self_contained method currently only looks for the
explicit "++model_type=mcore" token in extra_arguments and thus misclassifies
runs where EvalKitConfig.model_type defaults to "mcore"; update
is_self_contained to return True if either the explicit flag is present in
extra_arguments OR EvalKitConfig.model_type == "mcore" (safely handling cases
where the config may be None or unset). Locate the is_self_contained(cls,
extra_arguments: str = "") definition and add a secondary check against
EvalKitConfig.model_type (or the appropriate config accessor) so both explicit
user intent and the default config are honored.
- Around line 282-287: The try/except around pickle.load(pkl_path) is too broad;
narrow it to only handle transient, expected errors (e.g., EOFError,
pickle.UnpicklingError, and OSError) so real bugs surface. Replace "except
Exception:" with "except (EOFError, pickle.UnpicklingError, OSError) as e:" and
keep the sleep/skip/return behavior (and optionally a debug log using pkl_path
and e); let any other exceptions propagate. Ensure pickle.UnpicklingError is
referenced/imported and keep the variable names pkl_path and data unchanged.
---
Nitpick comments:
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 515-589: The PR lacks integration/SLURM test coverage for new
routing and fallback logic (ST-* translation routing, MathQA handling, and
missing-generation behavior); add end-to-end tests that feed sample records
exercising task_type values starting with "ST-", "MathQA", and cases where
generation is empty/None to assert correct metric outputs (e.g., that ST-* is
treated as translation via _TRANSLATION_TYPES, MathQA sets "is_correct" and
"predicted_answer", and missing generation returns the "missing_generation" base
with BLEU/WER/CER as appropriate). Implement tests that call the evaluator path
hitting the routing logic (the block using task_type checks and functions
evaluate_translation, evaluate_asr/evaluate_asr_pc, evaluate_cer,
evaluate_hallucination, evaluate_pc_rate) and validate returned metrics and
fallback fields, including the ASR_LEADERBOARD reference_fields branch to ensure
extra wer_* and is_correct_* metrics are produced when reference_fields are
present.
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 468-526: Add a slurm-level end-to-end test that exercises the new
inline evaluation path in _evaluate_results: create a small benchmark job that
writes an output_file with entries (including some with <think> tags), runs the
skill so _evaluate_results executes (triggering import of asr_wer), and assert
that the cleaned output_file is rewritten, that eval_kit_metrics.json is created
next to the output file with a "wer" key, and that LOG.info for "ASR WER" is
emitted; use the same helper(s) used by other slurm tests to schedule a job,
point the job at a fixture dataset and config that triggers _strip_thinking_tags
and metric computation, and fail the test if eval_kit_metrics.json is missing or
malformed or generations remain uncleaned.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bb5eeba-7f94-4068-b6aa-dccf9bc3b02a
📒 Files selected for processing (14)

- nemo_skills/dataset/eval_kit/__init__.py
- nemo_skills/dataset/utils.py
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/evaluation/metrics/eval_kit_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/translation_metrics.py
- nemo_skills/inference/eval/eval_kit.py
- nemo_skills/inference/factory.py
- nemo_skills/inference/generate.py
- nemo_skills/inference/mcore_skills.py
- nemo_skills/pipeline/eval.py
- nemo_skills/pipeline/utils/eval.py
- nemo_skills/pipeline/utils/generation.py
- requirements/eval-kit.txt
🚧 Files skipped from review as they are similar to previous changes (5)
- requirements/eval-kit.txt
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/translation_metrics.py
- nemo_skills/inference/factory.py
- nemo_skills/dataset/utils.py
Actionable comments posted: 4
♻️ Duplicate comments (11)
nemo_skills/dataset/eval_kit/__init__.py (1)
39-45: ⚠️ Potential issue | 🟠 Major

Validate non-empty `eval_kit` dataset suffix.

Line 44 accepts `eval_kit.` and returns `++vlm_dataset=`; this should fail fast because the dataset name is required.

Proposed fix

```diff
-    if "." not in benchmark:
+    if not benchmark.startswith("eval_kit.") or "." not in benchmark:
         raise ValueError(
             f"eval_kit benchmark must be in 'eval_kit.<dataset_name>' format, got '{benchmark}'. "
             f"Example: eval_kit.MMBench_DEV_EN, eval_kit.LibriSpeech_test_clean"
         )
     sub = benchmark.split(".", 1)[1]
+    if not sub:
+        raise ValueError(
+            f"eval_kit benchmark must include a dataset name after 'eval_kit.', got '{benchmark}'."
+        )
     return f" ++vlm_dataset={sub} "
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/dataset/eval_kit/__init__.py` around lines 39-45: the code currently allows "eval_kit." and returns an empty vlm_dataset; update the validation in the function handling the benchmark string so that after splitting (where sub = benchmark.split(".", 1)[1]) you check that sub is non-empty and raise a ValueError with a clear message if it is empty; ensure the error mentions the required 'eval_kit.<dataset_name>' format and that the function (where benchmark and sub are used and that returns f" ++vlm_dataset={sub} ") fails fast when no dataset suffix is provided.

nemo_skills/evaluation/metrics/eval_kit_metrics.py (2)
45-57: ⚠️ Potential issue | 🟠 Major

Reset instance file path in `setup()` to avoid stale metrics reuse.

If `setup()` runs after a prior successful run, `self.eval_kit_metrics_file` can remain stale and still win at Line 70 even when the new candidate is missing.

Proposed fix

```diff
 def setup(self, input_files):
     """Find the eval_kit_metrics.json in the same directory as the input files."""
+    self.eval_kit_metrics_file = None
+    EvalKitMetrics._shared_metrics_file = None
     if input_files:
         metrics_dir = Path(input_files[0]).parent
         candidate = metrics_dir / "eval_kit_metrics.json"
         if candidate.exists():
             self.eval_kit_metrics_file = candidate
             EvalKitMetrics._shared_metrics_file = candidate
-        else:
-            # Reset stale shared path so a previous run's file isn't reused.
-            EvalKitMetrics._shared_metrics_file = None
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py` around lines 45 - 57, The setup() method can leave self.eval_kit_metrics_file pointing at a previous run when the new candidate doesn't exist; update EvalKitMetrics.setup so that when the candidate file is missing you explicitly clear the instance path (set self.eval_kit_metrics_file = None) in addition to resetting EvalKitMetrics._shared_metrics_file, checking the candidate (metrics_dir / "eval_kit_metrics.json") and only assigning both when it exists.
41-43: ⚠️ Potential issue | 🟠 Major

Fail fast on unsupported constructor kwargs.

Line 41 accepts `**kwargs` but silently discards them, which can hide invalid `metrics_kwargs` usage.

Proposed fix

```diff
 def __init__(self, **kwargs):
+    if kwargs:
+        unsupported = ", ".join(sorted(kwargs))
+        raise TypeError(f"Unsupported EvalKitMetrics kwargs: {unsupported}")
     super().__init__(compute_no_answer=False)
     self.eval_kit_metrics_file = None
```

Based on learnings: "Applies to **/*.py : Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically".
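The fail-fast check above can be tried in isolation; `EvalKitMetricsSketch` is an illustrative stand-in without the real base class:

```python
class EvalKitMetricsSketch:
    """Stand-in showing the fail-fast kwargs check (no real base class).

    Unknown constructor kwargs raise a TypeError naming every offending
    key instead of being silently dropped.
    """

    def __init__(self, **kwargs):
        if kwargs:
            unsupported = ", ".join(sorted(kwargs))
            raise TypeError(f"Unsupported EvalKitMetrics kwargs: {unsupported}")
        self.eval_kit_metrics_file = None
```

Sorting the keys makes the error message deterministic, which keeps test assertions and log diffs stable.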
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py` around lines 41-43: the constructor currently swallows **kwargs silently; update the __init__ in the class that defines __init__(self, **kwargs) so unexpected user arguments fail fast: either (a) replace **kwargs with explicit parameters (e.g., eval_kit_metrics_file=None) and pass known values to super().__init__(compute_no_answer=False), or (b) validate kwargs at the start of __init__ by extracting any supported keys (e.g., "eval_kit_metrics_file") and if any keys remain raise TypeError("Unexpected keyword arguments: ..."); ensure you still call super().__init__(compute_no_answer=False) and set self.eval_kit_metrics_file from the validated argument.

nemo_skills/pipeline/eval.py (1)
57-61: ⚠️ Potential issue | 🟡 Minor

Use direct `cluster_config["containers"]` access consistently.

Line 60 still uses `.get()` even though Line 57 already assumes `cluster_config["containers"]` must exist.

Proposed fix

```diff
-    if key and key in cluster_config.get("containers", {}):
+    if key and key in cluster_config["containers"]:
         container = cluster_config["containers"][key]
```

As per coding guidelines: "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data."
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 57-61: the loop inconsistently uses cluster_config.get("containers", {}) while earlier accessing cluster_config["containers"] directly; change the lookup inside the for-loop to use direct access (cluster_config["containers"]) when checking membership of key so that missing container data fails loudly; update the condition that currently uses cluster_config.get("containers", {}) to reference cluster_config["containers"] when evaluating key in the containers mapping (relating to variables/container assignment, task_classes loop, tc and its CONTAINER_KEY).

nemo_skills/evaluation/evaluator/audio.py (2)
522-533: ⚠️ Potential issue | 🟠 Major

ASR-PC missing-generation defaults are still incomplete.

Line 532 currently returns only `wer` for ASR-PC, but ASR-PC outputs should include `wer`, `wer_c`, `wer_pc`, and `per` for consistent aggregation.

Proposed fix

```diff
     if task_type in _TRANSLATION_TYPES:
         return {**base, "bleu": 0.0}
     if task_type == "CER":
         return {**base, "cer": 1.0}
+    if task_type == "ASR-PC":
+        return {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0}
     # ASR / ASR-PC / ASR-ZH
     return {**base, "wer": 1.0}
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/audio.py` around lines 522 - 533, The missing-generation branch handling when task_type is in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) currently returns only "wer" for ASR-PC; update the branch so that when task_type corresponds to ASR-PC (identify via the value used for ASR-PC in _ASR_TYPES or by name if present) you return the full set of default metrics {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc": 1.0, "per": 1.0} instead of just "wer" so aggregation sees consistent keys; modify the final ASR / ASR-PC / ASR-ZH return logic to branch on ASR-PC and include these additional fields while keeping existing behavior for other ASR variants.
522-533: ⚠️ Potential issue | 🔴 Critical

`missing_generation` handling is still bypassed for `None` generations.

Line 508 calls `.strip()` unconditionally, so `None` raises before Lines 522-533 can return fallback metrics.

Proposed fix

```diff
-    generation = sample["generation"].strip()
+    generation_raw = sample["generation"]
+    generation = generation_raw.strip() if isinstance(generation_raw, str) else ""
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/evaluation/evaluator/audio.py` around lines 522-533: the code currently calls generation.strip() earlier which raises when generation is None and prevents the fallback in the block checking task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation from returning the intended "missing_generation" metrics; fix by adding a None check before any .strip() use or by moving the missing-generation branch earlier: if generation is None (or falsy) and task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}), return the base missing_generation dict and then add the task-specific metric keys (bleu for _TRANSLATION_TYPES, cer for "CER", wer for ASR variants) just like the existing returns so .strip() is never invoked on None.

nemo_skills/pipeline/utils/eval.py (1)
528-533: ⚠️ Potential issue | 🟠 Major

Parse `job_server_address` robustly (URL/IPv6-safe).

Line 528 uses `.split(":")`, which breaks valid inputs like `http://host:8000` and IPv6 literals.

Proposed fix

```diff
+from urllib.parse import urlsplit
-    host, port = (job_server_address or "localhost:5000").split(":")
+    raw_address = job_server_address or "localhost:5000"
+    parsed = urlsplit(raw_address if "://" in raw_address else f"http://{raw_address}")
+    if parsed.hostname is None or parsed.port is None:
+        raise ValueError(f"Invalid server address: {raw_address}")
+    host, port = parsed.hostname, parsed.port
     model = server_parameters["model"]
     server_type = server_parameters["server_type"]
     task_overrides = generation_task.configure_client_overrides(
         host=host,
-        port=int(port),
+        port=port,
         model=model,
         server_type=server_type,
     )
```
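The urlsplit-based parsing can be verified standalone; `parse_server_address` is an illustrative helper name, not code from the PR:

```python
from urllib.parse import urlsplit


def parse_server_address(address, default="localhost:5000"):
    """Return (host, port) from 'host:port', a full URL, or an IPv6
    literal like '[::1]:8000'.

    Falls back to `default` when address is empty; raises ValueError
    for anything unparseable.
    """
    raw = address or default
    # urlsplit only fills netloc when the string starts with '//',
    # so prepend it for bare host:port inputs.
    parsed = urlsplit(raw if "://" in raw else f"//{raw}")
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"Invalid server address: {raw}")
    return parsed.hostname, parsed.port
```

`urlsplit` strips IPv6 brackets and converts the port to `int` itself, so no manual `rsplit(":")` handling is needed.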
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 528 - 533, The code that extracts host and port from job_server_address using .split(":") is brittle for URLs and IPv6; update the parsing in the block that assigns host, port and calls generation_task.configure_client_overrides to robustly parse job_server_address using urllib.parse.urlparse (or prepend '//' when no scheme) and then use parsed.netloc (handling IPv6 brackets) or fallback to rsplit(":", 1) to separate host and port, defaulting port to 5000 and casting port to int before passing to generation_task.configure_client_overrides.nemo_skills/inference/mcore_skills.py (2)
503-507: ⚠️ Potential issue | 🟠 Major

Avoid truncating `output.jsonl` before replacement; write atomically.

If writing fails mid-way, the current in-place rewrite loses the only output file.

Proposed fix

```diff
-    with open(output_file, "w", encoding="utf-8") as fout:
-        for entry in entries:
-            fout.write(json.dumps(entry) + "\n")
+    tmp_output = output_path.with_suffix(output_path.suffix + ".tmp")
+    with open(tmp_output, "w", encoding="utf-8") as fout:
+        for entry in entries:
+            fout.write(json.dumps(entry) + "\n")
+    os.replace(tmp_output, output_file)
```

Based on learnings: "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 503 - 507, The current logic rewrites output_file in-place and can lose data if the write fails; instead, write the JSONL content for entries to a temporary file (e.g., output_file + ".tmp") and only once the write completes successfully atomically replace the original using os.replace (and optionally fsync/flush the temp file before replace). Locate the block that opens output_file for writing (the loop writing entries via fout.write(json.dumps(entry) + "\n")) and modify it to write to a temp path, ensure the write completes and file is closed, then call os.replace(temp_path, output_file) to atomically swap in the new output.
463-466: ⚠️ Potential issue | 🟠 Major

Create `.done` only after successful evaluation, and don't swallow unexpected metric failures.

Line 463 marks completion before Line 466 evaluation, while Lines 524-525 suppress failures. This can mark failed runs as complete and skip reruns.

Proposed fix

```diff
-    Path(f"{self.cfg.output_file}.done").touch()
-
-    # Evaluate using VLMEvalKit (same as eval_kit.py does).
-    self._evaluate_results()
+    # Evaluate using VLMEvalKit (same as eval_kit.py does).
+    self._evaluate_results()
+    Path(f"{self.cfg.output_file}.done").touch()
@@
-    except Exception:
-        LOG.exception("Inline metrics computation failed")
+    except Exception:
+        LOG.exception("Inline metrics computation failed")
+        raise
```

As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."

Also applies to: 519-526
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 463 - 466, The code currently creates the completion marker Path(f"{self.cfg.output_file}.done").touch() before calling self._evaluate_results() and also suppresses unexpected failures around evaluation (see the try/except block near where metrics are handled), which can mark failed runs as complete; move the touch call so the .done file is created only after self._evaluate_results() returns successfully, and remove or narrow the broad exception swallowing (remove bare except/except Exception that merely passes) in the evaluation/metric handling block (the try/except around self._evaluate_results() / metric processing) so unexpected exceptions propagate (or re-raise them) instead of being ignored. Ensure any deliberate, expected metric errors are handled explicitly with targeted exception types and clear logging while still preventing creation of the .done file on failure.nemo_skills/inference/eval/eval_kit.py (2)
282-287: ⚠️ Potential issue | 🟠 Major

Narrow transient pickle-read errors; let unexpected errors surface.

Catching `Exception` here suppresses non-transient failures and can silently stall async output.

Proposed fix

```diff
-    except Exception:
-        # pkl may be mid-write; skip this cycle
-        return
+    except (EOFError, pickle.UnpicklingError, BlockingIOError):
+        # pkl may be mid-write; skip this cycle
+        return
```

As per coding guidelines, "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens so users notice it instead of silently misbehaving."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/eval_kit.py` around lines 282 - 287, The broad except around the pickle.load of pkl_path hides unexpected errors; change the handler to only catch transient/read-related exceptions (e.g., EOFError, pickle.UnpicklingError, OSError) when opening/reading pkl_path and return in those cases, while allowing all other exceptions to propagate (i.e., re-raise) so non-transient failures surface; keep the try around the with open(...) / pickle.load(...) block and reference pkl_path and pickle.load when implementing the narrower except.
124-132: ⚠️ Potential issue | 🟠 Major

`is_self_contained()` misclassifies default `mcore` runs.

With default `model_type="mcore"`, empty `extra_arguments` returns `False` and can incorrectly trigger the server-based flow.

Proposed fix

```diff
-    return "++model_type=mcore" in extra_arguments
+    # Default is mcore unless explicitly overridden.
+    return "++model_type=vllm" not in extra_arguments
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/eval/eval_kit.py` around lines 124 - 132, The is_self_contained(cls, extra_arguments: str = "") currently returns True only when the explicit flag "++model_type=mcore" is present, which misclassifies runs where the runtime default EvalKitConfig.model_type is "mcore" and extra_arguments is empty; update is_self_contained to also return True when extra_arguments is empty AND the configured/default model type equals "mcore" (e.g., check EvalKitConfig.model_type or a class-level default like cls.default_model_type), preserving the original explicit-flag check so either the explicit "++model_type=mcore" in extra_arguments or the runtime/default model_type == "mcore" yields True.
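The proposed fix boils down to "mcore is the default unless the user explicitly asks for vllm." A minimal free-function sketch of that logic (the real method lives on the eval_kit task class; this standalone version is only for illustration):

```python
def is_self_contained(extra_arguments: str = "") -> bool:
    """mcore is the default backend, so only an explicit vllm override
    switches to the server-based flow.

    Standalone sketch of the fix suggested above; the actual signature
    is a classmethod on the eval_kit generation task.
    """
    return "++model_type=vllm" not in extra_arguments
```

With this shape, an empty `extra_arguments` string and an explicit `++model_type=mcore` both select the self-contained path.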
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 97-100: The dataclass currently declares skip_filled and
eval_config as accepted but unused, which hides unsupported user args; either
remove these fields so Hydra/arg parsing fails on unknown parameters, or add
explicit validation in the class's initializer (e.g., __post_init__ of the class
that defines skip_filled/eval_config or VLMEvalKit) that raises a clear error if
skip_filled or eval_config are provided with non-default values; reference the
skip_filled and eval_config symbols and the class (the dataclass that contains
them / VLMEvalKit) when implementing the change so callers cannot silently pass
unsupported arguments.
- Around line 528-536: The current loop writes directly to self.cfg.output_file
with "w" which can leave a partial file on failure; instead, serialize all rows
into the target JSONL content first and write atomically by writing to a
temporary file in the same directory (e.g., using tempfile.NamedTemporaryFile or
creating a .tmp path), close it, then os.replace(temp_path,
self.cfg.output_file) to atomically move it into place; update the block that
iterates df (reference: df, self.cfg.output_file, LOG) to build or stream into
the temp file and call os.replace so the final JSONL is either complete or
untouched, and keep the LOG.info after the atomic replace.
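The temp-file-plus-`os.replace` strategy described above can be sketched as follows. Function and variable names here are hypothetical, not the actual `eval_kit.py` symbols; the point is that the target file is either the complete new JSONL or left untouched.

```python
import json
import os
import tempfile


def write_jsonl_atomic(rows, output_file):
    """Write rows as JSONL via a temp file in the same directory, then
    atomically swap it into place with os.replace.

    Illustrative sketch of the fix above; `rows`/`output_file` are
    assumed names. On any failure the original file is untouched.
    """
    dir_name = os.path.dirname(os.path.abspath(output_file))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fout:
            for row in rows:
                fout.write(json.dumps(row) + "\n")
        os.replace(tmp_path, output_file)  # atomic on POSIX filesystems
    finally:
        # If the replace succeeded, tmp_path no longer exists.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

Creating the temp file in the same directory matters: `os.replace` is only atomic within a single filesystem.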
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 106-112: These fields silently accept user overrides; instead add
explicit validation (e.g., in the class's __post_init__ or initializer) that
raises an error if any of eval_config is non-empty, or eval_type or
prompt_format is not None, or enable_audio is True, so user-specified
unsupported args fail fast. Locate the dataclass or class that declares
eval_config, eval_type, prompt_format, enable_audio in mcore_skills.py and
implement checks that raise a clear ValueError mentioning the offending symbol
(eval_config/eval_type/prompt_format/enable_audio) when they are set, preventing
silent acceptance of unsupported pipeline overrides.
- Around line 491-500: The evaluation path is hardcoding the "generation" key
while generate() uses self.cfg.generation_key, causing mismatches; update the
block in question to use self.cfg.generation_key everywhere (when reading,
stripping via _strip_thinking_tags, assigning back to entry, and when building
the results dict) and replace .get(...) usages with direct indexing
(entry[self.cfg.generation_key] and entry["expected_answer"] as appropriate) so
missing keys fail loudly and evaluation uses the configured generation field
consistently.
---
Duplicate comments:
In `@nemo_skills/dataset/eval_kit/__init__.py`:
- Around line 39-45: The code currently allows "eval_kit." and returns an empty
vlm_dataset; update the validation in the function handling the benchmark string
so that after splitting (where sub = benchmark.split(".", 1)[1]) you check that
sub is non-empty and raise a ValueError with a clear message if it is empty;
ensure the error mentions the required 'eval_kit.<dataset_name>' format and that
the function (where benchmark and sub are used and that returns f"
++vlm_dataset={sub} ") fails fast when no dataset suffix is provided.
In `@nemo_skills/evaluation/evaluator/audio.py`:
- Around line 522-533: The missing-generation branch handling when task_type is
in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) currently returns only "wer" for
ASR-PC; update the branch so that when task_type corresponds to ASR-PC (identify
via the value used for ASR-PC in _ASR_TYPES or by name if present) you return
the full set of default metrics {**base, "wer": 1.0, "wer_c": 1.0, "wer_pc":
1.0, "per": 1.0} instead of just "wer" so aggregation sees consistent keys;
modify the final ASR / ASR-PC / ASR-ZH return logic to branch on ASR-PC and
include these additional fields while keeping existing behavior for other ASR
variants.
- Around line 522-533: The code currently calls generation.strip() earlier which
raises when generation is None and prevents the fallback in the block checking
task_type in (_ASR_TYPES | _TRANSLATION_TYPES | {"CER"}) and not generation from
returning the intended "missing_generation" metrics; fix by adding a None check
before any .strip() use or by moving the missing-generation branch earlier: if
generation is None (or falsy) and task_type in (_ASR_TYPES | _TRANSLATION_TYPES
| {"CER"}), return the base missing_generation dict and then add the
task-specific metric keys (bleu for _TRANSLATION_TYPES, cer for "CER", wer for
ASR variants) just like the existing returns so .strip() is never invoked on
None.
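The two audio.py findings above (guarding `None` before `.strip()` and returning the full ASR-PC key set) can be combined into one fallback helper. This is a hypothetical standalone sketch: the type sets and metric defaults mirror the review text, not the actual `audio.py` implementation.

```python
# Illustrative type sets; the real ones live in audio.py.
ASR_TYPES = {"ASR", "ASR-PC", "ASR-ZH"}
TRANSLATION_TYPES = {"AST"}


def missing_generation_metrics(task_type, generation):
    """Return fallback metrics when generation is None/empty, before any
    .strip() call, so a missing generation never raises.

    Returns None when a generation is present (normal scoring proceeds).
    Hypothetical sketch of the guard described above.
    """
    if generation:  # covers both None and ""
        return None
    base = {"missing_generation": 1.0}
    if task_type in TRANSLATION_TYPES:
        return {**base, "bleu": 0.0}
    if task_type == "CER":
        return {**base, "cer": 1.0}
    if task_type in ASR_TYPES:
        metrics = {**base, "wer": 1.0}
        if task_type == "ASR-PC":
            # ASR-PC aggregation expects the full key set.
            metrics.update({"wer_c": 1.0, "wer_pc": 1.0, "per": 1.0})
        return metrics
    return base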
In `@nemo_skills/evaluation/metrics/eval_kit_metrics.py`:
- Around line 45-57: The setup() method can leave self.eval_kit_metrics_file
pointing at a previous run when the new candidate doesn't exist; update
EvalKitMetrics.setup so that when the candidate file is missing you explicitly
clear the instance path (set self.eval_kit_metrics_file = None) in addition to
resetting EvalKitMetrics._shared_metrics_file, checking the candidate
(metrics_dir / "eval_kit_metrics.json") and only assigning both when it exists.
- Around line 41-43: The constructor currently swallows **kwargs silently;
update the __init__ in the class that defines __init__(self, **kwargs) so
unexpected user arguments fail fast: either (a) replace **kwargs with explicit
parameters (e.g., eval_kit_metrics_file=None) and pass known values to
super().__init__(compute_no_answer=False), or (b) validate kwargs at the start
of __init__ by extracting any supported keys (e.g., "eval_kit_metrics_file") and
if any keys remain raise TypeError("Unexpected keyword arguments: ..."); ensure
you still call super().__init__(compute_no_answer=False) and set
self.eval_kit_metrics_file from the validated argument.
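Option (b) above, validating `**kwargs` so unknown arguments fail fast, can be sketched like this. The class name and supported-key set are assumptions for illustration, not the actual `EvalKitMetrics` API:

```python
class StrictMetrics:
    """Reject unknown keyword arguments instead of silently swallowing
    **kwargs.

    Minimal sketch of option (b) above; class and attribute names are
    illustrative, not the real EvalKitMetrics.
    """

    SUPPORTED = {"eval_kit_metrics_file"}

    def __init__(self, **kwargs):
        unknown = set(kwargs) - self.SUPPORTED
        if unknown:
            raise TypeError(f"Unexpected keyword arguments: {sorted(unknown)}")
        self.eval_kit_metrics_file = kwargs.get("eval_kit_metrics_file")
```

Option (a), listing explicit parameters instead of `**kwargs`, gets the same failure behavior for free from Python's call machinery and is usually preferable when the argument set is small.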
In `@nemo_skills/inference/eval/eval_kit.py`:
- Around line 282-287: The broad except around the pickle.load of pkl_path hides
unexpected errors; change the handler to only catch transient/read-related
exceptions (e.g., EOFError, pickle.UnpicklingError, OSError) when
opening/reading pkl_path and return in those cases, while allowing all other
exceptions to propagate (i.e., re-raise) so non-transient failures surface; keep
the try around the with open(...) / pickle.load(...) block and reference
pkl_path and pickle.load when implementing the narrower except.
- Around line 124-132: The is_self_contained(cls, extra_arguments: str = "")
currently returns True only when the explicit flag "++model_type=mcore" is
present, which misclassifies runs where the runtime default
EvalKitConfig.model_type is "mcore" and extra_arguments is empty; update
is_self_contained to also return True when extra_arguments is empty AND the
configured/default model type equals "mcore" (e.g., check
EvalKitConfig.model_type or a class-level default like cls.default_model_type),
preserving the original explicit-flag check so either the explicit
"++model_type=mcore" in extra_arguments or the runtime/default model_type ==
"mcore" yields True.
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 503-507: The current logic rewrites output_file in-place and can
lose data if the write fails; instead, write the JSONL content for entries to a
temporary file (e.g., output_file + ".tmp") and only once the write completes
successfully atomically replace the original using os.replace (and optionally
fsync/flush the temp file before replace). Locate the block that opens
output_file for writing (the loop writing entries via
fout.write(json.dumps(entry) + "\n")) and modify it to write to a temp path,
ensure the write completes and file is closed, then call os.replace(temp_path,
output_file) to atomically swap in the new output.
- Around line 463-466: The code currently creates the completion marker
Path(f"{self.cfg.output_file}.done").touch() before calling
self._evaluate_results() and also suppresses unexpected failures around
evaluation (see the try/except block near where metrics are handled), which can
mark failed runs as complete; move the touch call so the .done file is created
only after self._evaluate_results() returns successfully, and remove or narrow
the broad exception swallowing (remove bare except/except Exception that merely
passes) in the evaluation/metric handling block (the try/except around
self._evaluate_results() / metric processing) so unexpected exceptions propagate
(or re-raise them) instead of being ignored. Ensure any deliberate, expected
metric errors are handled explicitly with targeted exception types and clear
logging while still preventing creation of the .done file on failure.
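The ordering fix above, creating the `.done` marker only after evaluation succeeds, reduces to a few lines. `finalize_run` and `evaluate_fn` are hypothetical names standing in for the real method calls on the generation task:

```python
from pathlib import Path


def finalize_run(output_file, evaluate_fn):
    """Run evaluation first; only touch the .done marker on success.

    Any exception from evaluate_fn propagates, so a failed run is never
    marked complete. Hypothetical sketch of the fix described above.
    """
    evaluate_fn()  # may raise -- that is intentional
    Path(f"{output_file}.done").touch()
```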
In `@nemo_skills/pipeline/eval.py`:
- Around line 57-61: The loop inconsistently uses
cluster_config.get("containers", {}) while earlier accessing
cluster_config["containers"] directly; change the lookup inside the for-loop to
use direct access (cluster_config["containers"]) when checking membership of key
so that missing container data fails loudly; update the condition that currently
uses cluster_config.get("containers", {}) to reference
cluster_config["containers"] when evaluating key in the containers mapping
(relating to variables/container assignment, task_classes loop, tc and its
CONTAINER_KEY).
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 528-533: The code that extracts host and port from
job_server_address using .split(":") is brittle for URLs and IPv6; update the
parsing in the block that assigns host, port and calls
generation_task.configure_client_overrides to robustly parse job_server_address
using urllib.parse.urlparse (or prepend '//' when no scheme) and then use
parsed.netloc (handling IPv6 brackets) or fallback to rsplit(":", 1) to separate
host and port, defaulting port to 5000 and casting port to int before passing to
generation_task.configure_client_overrides.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 68b4b1ff-f123-4a46-9a68-756b464a14ad
📒 Files selected for processing (14)
- nemo_skills/dataset/eval_kit/__init__.py
- nemo_skills/dataset/utils.py
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/evaluation/metrics/eval_kit_metrics.py
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/evaluation/metrics/translation_metrics.py
- nemo_skills/inference/eval/eval_kit.py
- nemo_skills/inference/factory.py
- nemo_skills/inference/generate.py
- nemo_skills/inference/mcore_skills.py
- nemo_skills/pipeline/eval.py
- nemo_skills/pipeline/utils/eval.py
- nemo_skills/pipeline/utils/generation.py
- requirements/eval-kit.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- requirements/eval-kit.txt
- nemo_skills/evaluation/metrics/map_metrics.py
- nemo_skills/dataset/utils.py
- nemo_skills/evaluation/metrics/translation_metrics.py
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
@melllinia

@Jorjeous are we considering adding the container to the nemo-skills containers?
melllinia
left a comment
Looks good, can you please add some simple instruction about how to run it?
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
♻️ Duplicate comments (1)
nemo_skills/inference/mcore_skills.py (1)
521-527: ⚠️ Potential issue | 🟠 Major

Bare `Exception` catch swallows failures silently.

The `ImportError` catch at line 521 is appropriate, but the bare `except Exception` at line 526 logs the failure but allows the run to complete and be marked `.done`. This can produce incomplete/invalid `eval_kit_metrics.json` without users noticing.

Per coding guidelines: "Don't catch exceptions when they are not expected to be normally raised; let the code fail when something unexpected happens."

🔧 Proposed fix: Re-raise after logging

```diff
         except ImportError:
             LOG.warning(
                 "VLMEvalKit asr_wer not available — skipping eval-kit-style metrics. "
                 "The summarize_results job will compute metrics separately."
             )
         except Exception:
             LOG.exception("Inline metrics computation failed")
+            raise
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 521 - 527, In the exception handling block for inline metrics computation, the bare `except Exception` clause at line 526 logs the failure but allows execution to continue silently. After the LOG.exception call in this block, add a re-raise statement to propagate the exception up the call stack. This ensures that unexpected failures in inline metrics computation will cause the run to fail rather than completing with incomplete or invalid eval_kit_metrics.json, while still allowing the ImportError catch to handle the expected case of VLMEvalKit not being available.
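The log-then-reraise pattern suggested above can be shown in isolation. `compute_inline_metrics` is a hypothetical wrapper, not the actual method: expected unavailability (`ImportError`) is handled and logged, while everything else is logged with a traceback and re-raised so the run fails visibly.

```python
import logging

LOG = logging.getLogger(__name__)


def compute_inline_metrics(metrics_fn):
    """Call metrics_fn; tolerate a missing optional backend, but log and
    re-raise anything unexpected so the run is not marked complete.

    Hypothetical sketch of the pattern from the proposed fix above.
    """
    try:
        return metrics_fn()
    except ImportError:
        LOG.warning("optional metrics backend unavailable; skipping inline metrics")
        return None
    except Exception:
        LOG.exception("Inline metrics computation failed")
        raise
```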
🧹 Nitpick comments (4)
nemo_skills/inference/mcore_skills.py (2)
42-55: Fallback pattern for missing GenerationTask works but has an unused `cls` parameter.

The fallback `_get_server_command_fn` is decorated with `@classmethod` but defines a standalone function. The `cls` parameter is unused because the decorator is applied incorrectly for this context.

💡 Suggested fix

```diff
 if GenerationTask is not None:
     _get_server_command_fn = GenerationTask.get_server_command_fn
 else:
-    @classmethod
-    def _get_server_command_fn(cls):
+    def _get_server_command_fn():
         from nemo_skills.pipeline.utils import get_server_command
         return get_server_command
```

Note: This may require adjusting how it's assigned to the class attribute at line 128.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 42 - 55, The fallback _get_server_command_fn is declared with `@classmethod` but the function signature doesn't use cls; fix by either removing the `@classmethod` decorator and defining a plain function def _get_server_command_fn(): ... that returns get_server_command, or keep it as a proper classmethod def _get_server_command_fn(cls): ... and reference cls if needed; then ensure the attribute assignment/override for GenerationTask.get_server_command_fn (or the class that expects this method) uses the correctly-typed callable so the fallback is invoked without an unused cls parameter.
106-111: Silently accepting unused pipeline args may hide misconfiguration.

These fields accept user-passed overrides that are documented as unused. Per coding guidelines, code should fail if user specifies an unsupported argument.

Consider adding validation in `__init__` or `__post_init__` to warn or fail if these are set to non-default values:

💡 Suggested validation

```python
def __post_init__(self):
    if self.eval_config:
        LOG.warning("eval_config is ignored by mcore_skills generation")
    if self.eval_type is not None:
        LOG.warning("eval_type is ignored by mcore_skills generation")
    # etc.
```

As per coding guidelines: "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/inference/mcore_skills.py` around lines 106 - 111, Add a __post_init__ to the mcore_skills dataclass that validates the pipeline override fields (eval_config, eval_type, prompt_format, enable_audio): if any of these are set to non-default values (eval_config non-empty, eval_type or prompt_format not None, enable_audio True) raise a ValueError listing the offending field names (or alternatively LOG.warning then raise) so user-supplied unsupported arguments fail fast; implement this check inside __post_init__ and reference the exact field names (eval_config, eval_type, prompt_format, enable_audio) in the error message.nemo_skills/pipeline/utils/eval.py (1)
528-531: Improved URL parsing, but still fragile for edge cases.

Using `rsplit(":", 1)` is better than `split(":")` for URLs like `http://host:8000`, but it can still fail for:

- IPv6 addresses: `[::1]:8000` → would split incorrectly
- URLs with scheme: `http://host:8000` → `host="http://host"`, `port="8000"`

If these edge cases are expected usage, consider using `urllib.parse`:

💡 More robust URL parsing

```diff
+    from urllib.parse import urlsplit
     # rsplit to handle URLs like http://host:port (takes last colon)
-    host, port = (job_server_address or "localhost:5000").rsplit(":", 1)
+    raw_address = job_server_address or "localhost:5000"
+    if "://" in raw_address:
+        parsed = urlsplit(raw_address)
+        host, port = parsed.hostname, str(parsed.port)
+    else:
+        host, port = raw_address.rsplit(":", 1)
```
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/eval.py` around lines 528 - 531, The current rsplit-based parsing for job_server_address (which sets host, port used later with server_parameters["model"] and server_parameters["server_type"]) is fragile for URLs with schemes and IPv6; replace that rsplit logic with urllib.parse.urlparse: if job_server_address lacks a scheme, prepend "tcp://" or "http://" to ensure urlparse recognizes netloc, then extract parsed.hostname and parsed.port (which correctly handles IPv6 brackets and strips schemes); if parsed.port is None, default to 5000 and if parsed.hostname is None default to "localhost"; finally assign host and port from these parsed values before using them.

docs/evaluation/eval-kit.md (1)
1-282: Documentation looks comprehensive, but consider adding expected results.

The documentation provides clear instructions and example commands for running eval_kit benchmarks. However, as per coding guidelines, when adding new benchmarks, documentation should include "expected results for tested models."
Consider adding a section with baseline metrics (e.g., expected WER for LibriSpeech with a tested model) so users can validate their setup is working correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/evaluation/eval-kit.md` around lines 1 - 282, Add a new "Expected Results / Baselines" subsection to the eval_kit docs that lists baseline metrics for representative benchmarks (e.g., eval_kit.LibriSpeech_test_clean) and example output files (eval_kit_metrics.json, metrics.json, output.jsonl) so users can validate runs; include specific baseline numbers (e.g., WER for the tested model) and the metric format/schema to compare against, and place this under "Understanding Results" near the existing output directory example so it's discoverable when users open eval_kit.LibriSpeech_test_clean results.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 521-527: In the exception handling block for inline metrics
computation, the bare `except Exception` clause at line 526 logs the failure but
allows execution to continue silently. After the LOG.exception call in this
block, add a re-raise statement to propagate the exception up the call stack.
This ensures that unexpected failures in inline metrics computation will cause
the run to fail rather than completing with incomplete or invalid
eval_kit_metrics.json, while still allowing the ImportError catch to handle the
expected case of VLMEvalKit not being available.
---
Nitpick comments:
In `@docs/evaluation/eval-kit.md`:
- Around line 1-282: Add a new "Expected Results / Baselines" subsection to the
eval_kit docs that lists baseline metrics for representative benchmarks (e.g.,
eval_kit.LibriSpeech_test_clean) and example output files
(eval_kit_metrics.json, metrics.json, output.jsonl) so users can validate runs;
include specific baseline numbers (e.g., WER for the tested model) and the
metric format/schema to compare against, and place this under "Understanding
Results" near the existing output directory example so it's discoverable when
users open eval_kit.LibriSpeech_test_clean results.
In `@nemo_skills/inference/mcore_skills.py`:
- Around line 42-55: The fallback _get_server_command_fn is declared with
`@classmethod` but the function signature doesn't use cls; fix by either removing
the `@classmethod` decorator and defining a plain function def
_get_server_command_fn(): ... that returns get_server_command, or keep it as a
proper classmethod def _get_server_command_fn(cls): ... and reference cls if
needed; then ensure the attribute assignment/override for
GenerationTask.get_server_command_fn (or the class that expects this method)
uses the correctly-typed callable so the fallback is invoked without an unused
cls parameter.
- Around line 106-111: Add a __post_init__ to the mcore_skills dataclass that
validates the pipeline override fields (eval_config, eval_type, prompt_format,
enable_audio): if any of these are set to non-default values (eval_config
non-empty, eval_type or prompt_format not None, enable_audio True) raise a
ValueError listing the offending field names (or alternatively LOG.warning then
raise) so user-supplied unsupported arguments fail fast; implement this check
inside __post_init__ and reference the exact field names (eval_config,
eval_type, prompt_format, enable_audio) in the error message.
In `@nemo_skills/pipeline/utils/eval.py`:
- Around line 528-531: The current rsplit-based parsing for job_server_address
(which sets host, port used later with server_parameters["model"] and
server_parameters["server_type"]) is fragile for URLs with schemes and IPv6;
replace that rsplit logic with urllib.parse.urlparse: if job_server_address
lacks a scheme, prepend "tcp://" or "http://" to ensure urlparse recognizes
netloc, then extract parsed.hostname and parsed.port (which correctly handles
IPv6 brackets and strips schemes); if parsed.port is None, default to 5000 and
if parsed.hostname is None default to "localhost"; finally assign host and port
from these parsed values before using them.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f0287a47-0f73-49e5-b5d5-160e3b694409
📒 Files selected for processing (5)
- docs/evaluation/eval-kit.md
- docs/evaluation/index.md
- nemo_skills/evaluation/evaluator/audio.py
- nemo_skills/inference/mcore_skills.py
- nemo_skills/pipeline/utils/eval.py
✅ Files skipped from review due to trivial changes (1)
- docs/evaluation/index.md
This reverts commit b237e33. Signed-off-by: Igor Gitman <igitman@nvidia.com>
commit a5da597  Igor Gitman <igitman@nvidia.com>  Fri Mar 6 12:13:36 2026 -0800
    Revert "Eval kit support (#1239)" (#1294)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33  George <37293288+Jorjeous@users.noreply.github.com>  Fri Mar 6 20:25:37 2026 +0400
    Eval kit support (#1239)
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf  George Armstrong <georgea@nvidia.com>  Thu Mar 5 10:17:44 2026 -0800
    Python direct tool calling without MCP (#1286)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd  Sadegh Mahdavi <smahdavi4@gmail.com>  Wed Mar 4 13:06:21 2026 -0800
    Allow het servers for nemo-rl jobs (#1223)
    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68  Prasoon Varshney <prasoon1995@gmail.com>  Wed Mar 4 10:24:02 2026 -0800
    Support source_lang param for translation recipe (#1290)
    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19  Meriem B. <113170426+ka00ri@users.noreply.github.com>  Wed Mar 4 18:59:28 2026 +0100
    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)
    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609  Talor Abramovich <talor19@gmail.com>  Wed Mar 4 02:30:06 2026 +0200
    Add SPEED-Bench (within repo) (#1279)
    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5  George Armstrong <georgea@nvidia.com>  Tue Mar 3 12:18:15 2026 -0800
    Fix os.getlogin() crash in ns setup (#1289)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66  George Armstrong <georgea@nvidia.com>  Tue Mar 3 11:04:54 2026 -0800
    Fix streaming TypeError when delta.content is None (#1267) (#1288)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923  Matvei Novikov <mnovikov@nvidia.com>  Mon Mar 2 16:28:41 2026 -0800
    Add LibTrace recipe for generating domain-specific reasoning data (#1224)
    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7  Stephen Ge <stepheng@nvidia.com>  Mon Mar 2 18:28:49 2026 -0500
    fix: clean parse-failure retries in prover (#1284)
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3  George Armstrong <georgea@nvidia.com>  Mon Mar 2 15:10:08 2026 -0800
    tst: rollback inference-api to integrate (#1287)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9  Valentin Mendelev <vmendelev@nvidia.com>  Mon Mar 2 23:11:25 2026 +0100
    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)
    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0  George <37293288+Jorjeous@users.noreply.github.com>  Fri Feb 27 23:58:54 2026 +0400
    Update promt_config to working with openai format + inline setup (#1210)
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc  George Armstrong <georgea@nvidia.com>  Fri Feb 27 10:41:23 2026 -0800
    Update noc tutorial (#1282)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505  George Armstrong <georgea@nvidia.com>  Fri Feb 27 10:17:33 2026 -0800
    Add noc reasoning tutorial (#1278)
    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a  Jiacheng Xu <jcxu@utexas.edu>  Fri Feb 27 10:10:25 2026 -0800
    CritPt generation add prompt_format=None (#1280)
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d  Igor Gitman <igitman@nvidia.com>  Fri Feb 27 09:31:26 2026 -0800
    New slurm customization parameters (account, containers) (#1209)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce  George Armstrong <georgea@nvidia.com>  Wed Feb 25 17:59:52 2026 -0800
    Add nemo-skills-core subpackage for lightweight installs (#1229)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83  Dheeraj Peri <peri.dheeraj@gmail.com>  Wed Feb 25 12:56:35 2026 -0800
    feat: add custom judge type support for external repo integration (#1274)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13  Igor Gitman <igitman@nvidia.com>  Tue Feb 24 15:24:42 2026 -0800
    Exclude numb3rs form test_eval.py (#1275)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219  George <37293288+Jorjeous@users.noreply.github.com>  Mon Feb 23 18:37:46 2026 +0400
    Numb3rs ds addition (#1174)
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5  Suriya Gunasekar <sgunasekar@users.noreply.github.com>  Sun Feb 22 11:55:24 2026 -0800
    Add DSBench-DA evaluation (#1254)
    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3  Jiacheng Xu <jcxu@utexas.edu>  Fri Feb 20 16:42:01 2026 -0800
    Add CritPt benchmark (#1200)
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2  Suriya Gunasekar <sgunasekar@users.noreply.github.com>  Fri Feb 20 16:19:22 2026 -0800
    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7  Igor Gitman <igitman@nvidia.com>  Fri Feb 20 15:58:40 2026 -0800
    Fix incorrect prompt tokens count due to HF api update (#1264)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5  Igor Gitman <igitman@nvidia.com>  Fri Feb 20 09:05:33 2026 -0800
    Remove deprecated dataset group (#1263)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f  Yongqiang Wang <yongqiang.seagull@gmail.com>  Thu Feb 19 19:57:25 2026 -0500
    fix deps (#1258)

commit 60905a7  Minho Ryu <ryumin93@gmail.com>  Fri Feb 20 09:39:39 2026 +0900
    Add aime26 (#1256)
    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5  Igor Gitman <igitman@nvidia.com>  Thu Feb 19 16:18:25 2026 -0800
    Rename custom -> external benchmarks (#1262)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45  Igor Gitman <igitman@nvidia.com>  Thu Feb 19 16:10:33 2026 -0800
    Add reference to internal benchmarks repo (#1261)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6  Igor Gitman <igitman@nvidia.com>  Thu Feb 19 16:08:05 2026 -0800
    Remove incorrect presence-penalty setting (#1259)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b  Igor Gitman <igitman@nvidia.com>  Thu Feb 19 15:26:33 2026 -0800
    Adding an option to store benchmarks in external repo (#1240)
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39  George <37293288+Jorjeous@users.noreply.github.com>  Thu Feb 19 19:57:21 2026 +0400
    update vllm miltimodal for api calls convenience (#1213)
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219  Nick Ludwig <nliudvig@nvidia.com>  Wed Feb 18 03:28:23 2026 +0400
    Fix --server_container not being applied to dependent jobs (#1244)
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614  Wasi Ahmad <wasiahmad@ucla.edu>  Mon Feb 16 11:13:24 2026 -0800
    Support mini-swe-agent as agent harness (#1212)
    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc  Suriya Gunasekar <sgunasekar@users.noreply.github.com>  Fri Feb 13 22:32:15 2026 -0800
    Add --installation_command support to prepare_data (#1243)
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524  George Armstrong <georgea@nvidia.com>  Thu Feb 12 17:26:00 2026 -0800
    Fix CI disk space for Docker image builds (#1241)
    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c  Sadegh Mahdavi <smahdavi4@gmail.com>  Wed Feb 11 17:55:00 2026 -0800
    Fix answerbench prompt parsing (#1235)
    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628  George Armstrong <georgea@nvidia.com>  Wed Feb 11 14:56:43 2026 -0800
    feat: add lockfiles for
reproducible sandbox builds (#1233) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5a0a84d Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Wed Feb 11 13:30:03 2026 -0800 removing datasets version restriction for LCB eval (#1230) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit ef0a890 Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Date: Wed Feb 11 12:03:16 2026 +0400 Gnalbandyan/add physics (#1214) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> commit bd9d30c Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Tue Feb 10 15:13:27 2026 -0800 LCB generic prompting (#1215) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit 7d6c49a Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Sat Feb 7 08:45:46 2026 -0800 Add support for different variations of nemo-rl (#1220) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit b19ba96 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 6 21:40:56 2026 -0800 Add multi-node sandbox support for SLURM clusters (#1218) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 8950bb0 Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Sat Feb 7 01:38:00 2026 +0100 support structured outputs in hle judge for optional AA compatibility (#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b84f7a2 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 6 14:51:02 2026 -0800 A small update on running tests docs (#1219) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 8e838e1 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 5 18:01:35 2026 -0800 feat: add flag to disable sandbox replay (#1217) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5fd9085 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 5 
15:57:01 2026 -0800 Add an option to limit number of tool calls (#1216) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit d820200 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 3 10:43:55 2026 -0800 Add arena-hard v2 (#1205) Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> commit a30920e Author: Igor Gitman <igitman@nvidia.com> Date: Mon Feb 2 10:53:55 2026 -0800 Fix mkdocs warnings (#1204) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 19d7788 Author: Ivan <imoshkov@nvidia.com> Date: Mon Feb 2 23:25:13 2026 +0500 Fix infinite wait in sandbox.wait_for_sandbox (#1206) Signed-off-by: i-vainn <imoshkov@nvidia.com> commit 3e65fbf Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Fri Jan 30 19:38:38 2026 -0800 Improve tts (#1203) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 250c862 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Fri Jan 30 22:12:29 2026 +0400 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> commit 7ded756 Author: Ivan <imoshkov@nvidia.com> Date: Fri Jan 30 09:57:41 2026 +0500 Add proper token counting to code execution model (#1184) Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b986304 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Jan 29 17:57:07 2026 -0800 Upgrade containers (#1198) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 3b44f02 Author: Dan Lord <blahblahasdf@gmail.com> Date: Thu Jan 29 16:40:47 2026 -0800 Fix incorrect string format (#1199) Signed-off-by: dlord <dlord@nvidia.com> commit c4854b8 Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Thu Jan 29 13:43:36 2026 -0800 Update nemo-rl to latest (#1087) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor 
Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
This PR integrates VLMEvalKit benchmarks into NeMo-Skills, including support for VLMEvalKit's MultiModalMCore inference engine.

Key changes:
- Dataset-level generation config for VLMEvalKit benchmarks
- Two generation backends: a vLLM client and an in-process Megatron/MCore backend
- Async JSONL writer for streaming predictions to disk
- EvalKit metrics reader with conversion of results to NeMo-Skills JSONL
- Pipeline wiring for self-contained generation tasks
- Expanded ASR/translation evaluation handling
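The "async JSONL writer" mentioned in the walkthrough can be sketched as a background-thread writer: inference stays on the main thread while disk I/O drains from a queue. This is a minimal illustration only; the class and method names here are hypothetical, not the actual NeMo-Skills API.

```python
import json
import queue
import threading


class AsyncJsonlWriter:
    """Hypothetical sketch: append JSON records to a .jsonl file from a
    background thread so the inference loop never blocks on disk I/O."""

    def __init__(self, path: str):
        self._queue: queue.Queue = queue.Queue()
        self._file = open(path, "a", encoding="utf-8")
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        # Consume records in FIFO order until the None sentinel arrives.
        while True:
            record = self._queue.get()
            if record is None:
                break
            self._file.write(json.dumps(record, ensure_ascii=False) + "\n")

    def write(self, record: dict):
        # Non-blocking from the caller's perspective; the thread does the I/O.
        self._queue.put(record)

    def close(self):
        # Signal the drain loop to stop, then flush and close the file.
        self._queue.put(None)
        self._thread.join()
        self._file.close()
```

A generation task would call `write()` once per sample and `close()` after the last prediction, mirroring the start/enqueue/stop-and-flush steps in the sequence diagram.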
Summary by CodeRabbit

New Features
- VLMEvalKit benchmark integration with dataset-level generation config and two generation backends (vLLM client and in-process Megatron/MCore)

Improvements
- Async JSONL prediction writing, EvalKit metrics conversion to NeMo-Skills results, and expanded ASR/translation evaluation handling

Documentation
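The final conversion step in the sequence diagram (EvalKit metrics and ordered results into NeMo Skills JSONL) could look roughly like the sketch below. All field names are assumptions for illustration, not the actual schemas used by either library.

```python
def eval_kit_to_nemo_skills(rows: list[dict]) -> list[dict]:
    """Hypothetical sketch: map VLMEvalKit-style result rows (with an
    'index' ordering key) into NeMo-Skills-style JSONL records."""
    records = []
    # Restore dataset order before writing, since predictions may arrive
    # out of order from the async writer.
    for row in sorted(rows, key=lambda r: r["index"]):
        records.append(
            {
                "problem": row.get("question", ""),
                "generation": row.get("prediction", ""),
                "expected_answer": row.get("answer", ""),
            }
        )
    return records
```

Each record would then be dumped as one JSON object per line alongside an `eval_kit_metrics.json` summary file.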