
feat: add custom judge type support for external repo integration#1274

Merged
gwarmstrong merged 24 commits into main from peri044/external_judge
Feb 25, 2026
Conversation

@peri044
Collaborator

@peri044 peri044 commented Feb 24, 2026

Add generic 'custom' judge type that allows external repos to register their own judge task creators using the locate() pattern. This enables external benchmarks to define judge tasks without modifying nemo-skills source code.

Changes:

  • Add _create_custom_judge_tasks() function to dynamically import and call external judge creators
  • Add 'custom' judge type routing in prepare_eval_commands()
  • Uses existing locate() pattern (same as METRICS_TYPE, eval_type)
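The locate() pattern referenced above resolves a "module.path::attribute" string to a Python object at runtime. A minimal sketch of the idea, with an illustrative external creator (the resolver body and the creator signature here are assumptions for illustration, not the actual nemo-skills code):

```python
import importlib


def locate(path: str):
    """Resolve a "pkg.module::attr" locator string to the named attribute."""
    module_path, _, attr_name = path.partition("::")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# An external repo could expose a creator like this and pass its locator
# string (e.g. "my_benchmarks.judges::create_judge_tasks") to the eval CLI.
def create_judge_tasks(benchmark, judge_pipeline_args, output_dir):
    # Build and return the judge tasks for this benchmark (stubbed here).
    return [f"judge:{benchmark}:{output_dir}"]
```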

Summary by CodeRabbit

  • New Features

    • Evaluation pipeline now accepts a runtime judge creator path (use this instead of the old judge type) and dynamically creates per-benchmark judge tasks.
    • New COMET- and embedding-based judge integrations are available as pluggable judge task implementations.
  • Documentation

    • Multilingual evaluation docs updated to show the new judge configuration and invocation.
  • Tests

    • Test strings updated to reflect revised model and endpoint identifiers.
  • Chore

    • Copyright year updated.

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces per-benchmark judge_type with a dynamic judge_step_fn locator; adds comet_judge and nvembed_judge creator modules; updates eval CLI, docs, and dataset configs to reference factory-style judge creators; eval loads and invokes create_judge_tasks, passing judge_model and output_dir; LLM judge remains fallback.

Changes

Cohort / File(s) / Summary

  • Eval CLI / Core Flow (nemo_skills/pipeline/eval.py):
    Removed the judge_type option and per-benchmark dispatch; added the judge_step_fn option. Eval dynamically loads and invokes a judge creator when judge_step_fn is provided (passing judge_model and output_dir); the LLM judge remains the fallback. Removed the Comet/NVEmbed helper functions.
  • New Judge Implementations (nemo_skills/pipeline/judges/comet_judge.py, nemo_skills/pipeline/judges/nvembed_judge.py):
    Added create_judge_tasks(...) functions that compute seeds, check remaining jobs, build CLI commands (COMET / NVEmbed), and create cluster-aware judge tasks (GPU/node/partition, SBATCH options, deps). They return the list of created tasks, or an empty list when skipping.
  • Judges Package Init (nemo_skills/pipeline/judges/__init__.py):
    Added a package initializer with a license header and module docstring.
  • Docs / Config Updates (docs/evaluation/multilingual.md, nemo_skills/dataset/mmau-pro/closed_form/__init__.py):
    Replaced --judge_type=comet / {"judge_type": ...} usages with judge_step_fn locator strings (e.g., nemo_skills.pipeline.judges.comet_judge::create_judge_tasks).
  • Tests / Strings (tests/test_generation.py):
    Updated NVIDIA Nemotron model and inference endpoint string literals (host and model identifiers); no logic changes.
  • Minor (nemo_skills/pipeline/__init__.py):
    Copyright year bump; no functional change.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Eval as eval()
    participant Loader as locate()
    participant JudgeFactory as Judge Creator
    participant TaskMgmt as Task Management
    participant Cluster as Cluster Config

    Client->>Eval: eval(..., judge_step_fn, judge_pipeline_args)
    alt judge_step_fn provided
        Eval->>Loader: load(judge_step_fn)
        Loader-->>Eval: judge_creator_fn
        Eval->>JudgeFactory: judge_creator_fn(exp, benchmark, judge_pipeline_args, output_dir, ...)
        JudgeFactory->>TaskMgmt: get_remaining_jobs(...)
        JudgeFactory->>Cluster: read cluster_config / GPUs / partition
        JudgeFactory->>TaskMgmt: add_task(cmd, deps, cluster settings)
        TaskMgmt-->>JudgeFactory: task(s) created
        JudgeFactory-->>Eval: list of judge tasks
    else fallback (LLM)
        Eval->>TaskMgmt: _create_llm_judge_tasks(...)
        TaskMgmt-->>Eval: llm judge tasks
    end
    Eval-->>Client: evaluation scheduled/completed
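The alt branch above can be sketched as a simplified control flow. This is illustrative only; the function names and signatures are assumptions, not the actual eval.py code:

```python
import importlib


def locate(path):
    # Same "module::attr" locator convention used elsewhere in the codebase.
    mod, _, attr = path.partition("::")
    return getattr(importlib.import_module(mod), attr)


def _create_llm_judge_tasks(benchmark, output_dir):
    # Fallback path: schedule the default LLM-as-judge tasks (stubbed).
    return [f"llm-judge:{benchmark}"]


def schedule_judge_tasks(judge_step_fn, benchmark, judge_pipeline_args, output_dir):
    if judge_step_fn:
        # Dynamic path: load the external creator and let it build the tasks.
        judge_creator_fn = locate(judge_step_fn)
        return judge_creator_fn(benchmark, judge_pipeline_args, output_dir)
    return _create_llm_judge_tasks(benchmark, output_dir)
```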

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately reflects the main architectural change: introducing dynamic judge type support via judge_step_fn instead of hardcoded judge types, enabling external repositories to register custom judge creators.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

483-514: ⚠️ Potential issue | 🟡 Minor

has_tasks not set to True in the dynamic judge path.

The LLM judge branch (line 514) sets has_tasks = True, but the dynamic judge_creator_path branch does not. If the dynamic judge creator returns tasks, these tasks are added to all_tasks (line 547) but the experiment may not be run if no main eval jobs were scheduled (since has_tasks gates run_exp at line 647).

In practice this is unlikely to cause issues since main eval jobs typically set has_tasks, but for consistency and correctness:

Proposed fix
            if judge_creator_path:
+               has_tasks = True
                # Use locate() to dynamically load judge creator function
                from nemo_skills.dataset.utils import locate
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 483 - 514, The dynamic
judge_creator_path branch calls locate() to get judge_creator_fn and assigns its
result to judge_tasks but never sets has_tasks; update the branch in the
judge_creator_path flow so that after calling judge_creator_fn (the judge_tasks
variable) you set has_tasks = True if judge_tasks is truthy (or always set
has_tasks = True if you prefer), ensuring the existing all_tasks handling and
the run_exp gate that depends on has_tasks will run the experiment when dynamic
judge tasks are produced.
🧹 Nitpick comments (1)
nemo_skills/pipeline/eval.py (1)

475-487: Move import to the top of the file.

The from nemo_skills.dataset.utils import locate on line 485 is inside a loop body. While functionally fine (Python caches module imports), placing it at the top of the file with the other imports is more conventional and avoids repeated import lookups.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 475 - 487, The import of locate is
inside the branch that handles judge_creator_path (see judge_creator_path and
judge_creator_fn) which causes a local import in the loop; move "from
nemo_skills.dataset.utils import locate" up to the module imports at the top of
the file and remove the in-function import so judge_creator_fn =
locate(judge_creator_path) uses the top-level imported locate; update any import
grouping to include locate alongside the other imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 157: The help string for the typer.Option in the judge_step variable is
too long; update the judge_step definition so the help argument is split across
multiple lines (use implicit string concatenation or wrap the typer.Option(...)
call in parentheses and break the help string into multiple quoted segments) so
the line length is reduced and ruff-format will pass; keep the same text (e.g.,
split "Path to the judge creator function to use for the judge (locate()
convention). Eg: nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks")
across lines in the judge_step = typer.Option(...) call.
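The line-length fix suggested for the judge_step help string can use implicit concatenation of adjacent string literals; a minimal sketch in plain Python (the variable name is illustrative, and typer itself is not needed to show the pattern):

```python
# Implicit concatenation of adjacent string literals keeps each source
# line short while producing a single help string at compile time.
judge_step_fn_help = (
    "Path to the judge creator function to use for the judge (locate() convention). "
    "Eg: nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks"
)

# The same string could then be passed as typer.Option(..., help=judge_step_fn_help).
```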

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6da2219 and 40d7c38.

📒 Files selected for processing (6)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/judges/__init__.py
  • nemo_skills/pipeline/judges/comet_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py

Comment on lines +75 to +77
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
comet_model_path = judge_pipeline_args.get("judge_model")
Contributor
⚠️ Potential issue | 🟠 Major

Use direct dictionary access for keys that are expected to be present.

output_dir is always set by eval.py before calling the judge creator (line 473), and judge_model is critical for COMET evaluation — if it's missing, the command will interpolate None as a string into the CLI args (--comet-model-path None), producing a silent, hard-to-debug failure. Use judge_pipeline_args["key"] to fail fast with a clear KeyError.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
-    comet_model_path = judge_pipeline_args.get("judge_model")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional
+    comet_model_path = judge_pipeline_args["judge_model"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."
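To illustrate the failure mode this guideline guards against, here is a small self-contained sketch (the dictionary contents and the missing key name are made up for the example):

```python
judge_pipeline_args = {"output_dir": "/results", "judge_model": "wmt22-comet-da"}

# Permissive access: a missing key becomes None, which f-string
# interpolation silently turns into the literal text "None".
cmd_bad = f"--comet-model-path {judge_pipeline_args.get('comet_model_path')}"

# Direct access: a missing key fails fast with a clear KeyError
# at the call site instead of producing a broken command later.
try:
    cmd_good = f"--comet-model-path {judge_pipeline_args['comet_model_path']}"
except KeyError as err:
    failure = f"missing required judge arg: {err}"
```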

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 75 - 77, Replace the
permissive dict.get() calls with direct dictionary access so missing critical
keys fail fast: change uses of judge_pipeline_args.get("output_dir"),
.get("input_file"), and .get("judge_model") to
judge_pipeline_args["output_dir"], judge_pipeline_args["input_file"], and
judge_pipeline_args["judge_model"] in comet_judge.py (variables output_dir_path,
input_file, comet_model_path) so a KeyError is raised immediately if those
required arguments are absent.

Comment on lines +100 to +106

if input_file is None:
    input_dir = judge_pipeline_args.get("input_dir")
    script_args.append(f"--input-dir {input_dir}")
    script_args.append(f"--num-seeds {num_seeds}")
else:
    script_args.append(f"--input-file {input_file}")
Contributor
⚠️ Potential issue | 🟡 Minor

Use direct access for input_dir — it is always set when input_file is None.

In the input_file is None branch, eval.py (line 468) always populates input_dir in judge_pipeline_args. Using .get() here would silently produce --input-dir None in the command string instead of failing fast.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 100 - 106, In the
branch where input_file is None, replace the dictionary access
judge_pipeline_args.get("input_dir") with direct indexing
judge_pipeline_args["input_dir"] so that missing input_dir raises an error
instead of generating a "--input-dir None" argument; update the code around the
input_file/input_dir handling that builds script_args (references: input_file,
judge_pipeline_args, script_args, num_seeds) to use
judge_pipeline_args["input_dir"] and keep adding the "--input-dir" and
"--num-seeds" entries as before.

Comment on lines +75 to +76
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
Contributor
⚠️ Potential issue | 🟡 Minor

Use direct dictionary access for output_dir — it is always populated by the caller.

Same as the Comet judge: output_dir is guaranteed to be set by eval.py (line 473) before invoking the judge creator. Direct access provides clear error messages if assumptions change.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 75 - 76, The code
currently uses judge_pipeline_args.get("output_dir") which hides missing-key
errors; replace the .get usage for output_dir with direct dictionary access (use
judge_pipeline_args["output_dir"]) so the code fails fast with a clear KeyError
if the caller stops providing it; update the reference to output_dir_path
accordingly (leave input_file handling unchanged unless you also expect it to be
mandatory) and run tests to ensure no other call sites assume .get semantics.

Comment on lines +100 to +102
if input_file is None:
    input_dir = judge_pipeline_args.get("input_dir")
    script_args.append(f"--input-dir {input_dir}")
Contributor
⚠️ Potential issue | 🟡 Minor

Use direct access for input_dir.

When input_file is None, eval.py always sets input_dir in judge_pipeline_args (line 468). Using .get() would silently interpolate None into the command string.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 100 - 102, The
code path in nvembed_judge.py that handles when input_file is None currently
uses judge_pipeline_args.get("input_dir") which can silently return None and
interpolate it into script_args; update that branch to access the required key
directly (judge_pipeline_args["input_dir"]) so the code fails fast if input_dir
is missing and prevents None from being appended to the command string; locate
the block around the if input_file is None check and replace the .get() access
with direct dictionary indexing.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

475-511: ⚠️ Potential issue | 🟠 Major

Handle conflicting judge_path sources and mark dynamic judge tasks as work.
Two issues here:

  1. If a benchmark defines judge_path, CLI --judge_path is silently ignored. Prefer CLI override or raise on conflict.
  2. When dynamic judge tasks are created, has_tasks stays false, so runs with only judge tasks (and auto_summarize_results=False) won’t execute.
✅ Proposed fix
-            judge_creator_path = judge_pipeline_args.pop("judge_path", judge_path)
+            benchmark_judge_path = judge_pipeline_args.pop("judge_path", None)
+            if judge_path and benchmark_judge_path and judge_path != benchmark_judge_path:
+                raise ValueError(
+                    f"Conflicting judge_path: CLI={judge_path} vs benchmark={benchmark_judge_path}"
+                )
+            judge_creator_path = judge_path or benchmark_judge_path
@@
-            if judge_tasks:
+            if judge_tasks:
+                has_tasks = True
                 benchmark_to_judge_tasks[benchmark] = judge_tasks
                 all_tasks.extend(judge_tasks)

As per coding guidelines, "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided. Use dataclasses or **kwargs to automatically handle parameter validation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 475 - 511, The code currently lets
a benchmark-provided judge_path in judge_pipeline_args silently override a CLI
judge_path and creates dynamic judge tasks without marking the run as having
work; change the logic so CLI judge_path takes precedence: retrieve any
benchmark value with judge_pipeline_args.pop("judge_path", None), then set
judge_creator_path = judge_path if judge_path is truthy else the popped value
(and remove the key from judge_pipeline_args to avoid silent conflicts), and
optionally log a warning if both were provided; after calling
judge_creator_fn(...) and assigning judge_tasks, ensure the run is marked as
having work by setting has_tasks = True (or marking the created judge_tasks as
work) when judge_tasks is non-empty so runs with only dynamic judge tasks will
execute even if auto_summarize_results=False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 351: The current eval_requires_judge computation ignores a user-specified
--judge_path when no other CLI judge args are truthy; update the logic that
computes eval_requires_judge (currently using cli_judge_pipeline_args and the
variable eval_requires_judge) so that the presence of
cli_judge_pipeline_args['judge_path'] (non-empty string) forces
eval_requires_judge to True and/or raises a validation error if judge_path is
provided without required companion args; locate the code that builds
cli_judge_pipeline_args and the eval_requires_judge assignment and either
include an explicit check for judge_path or add parameter validation (e.g., in
the same function that constructs cli_judge_pipeline_args) to fail fast for
unsupported or incomplete judge-related args.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40d7c38 and 7367594.

📒 Files selected for processing (3)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/pipeline/eval.py

@peri044 peri044 requested a review from Kipok February 24, 2026 19:53
@peri044
Collaborator Author

peri044 commented Feb 24, 2026

@Kipok I've figured out the write access problem (an authorization issue with the NVIDIA-Nemo org). I've closed the fork PR #1265 and addressed your recent feedback in this PR. Does this align with your comments? Are the gpu-tests-qwen CI jobs supposed to be skipped, or is that something I need to fix on my end?

Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)

354-354: ⚠️ Potential issue | 🟠 Major

Ensure --judge_path forces judge scheduling.
eval_requires_judge ignores a user-specified judge_path, so judge jobs can be skipped for benchmarks that don’t require a judge by default. This silently drops a user-provided parameter.

✅ Proposed fix
-    eval_requires_judge = any(param_value for param_value in cli_judge_pipeline_args.values())
+    eval_requires_judge = bool(judge_path) or any(param_value for param_value in cli_judge_pipeline_args.values())

As per coding guidelines: "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided. Use dataclasses or **kwargs to automatically handle parameter validation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` at line 354, The current eval_requires_judge
computation ignores a user-specified judge_path in cli_judge_pipeline_args;
update the logic so that presence/non-empty
cli_judge_pipeline_args.get('judge_path') forces scheduling of judge jobs (e.g.,
include a check for truthiness of 'judge_path' in the eval_requires_judge
expression or set eval_requires_judge = True when
cli_judge_pipeline_args.get('judge_path') is provided), and additionally
validate unexpected/unsupported CLI judge args by raising an error or warning if
unknown keys are present in cli_judge_pipeline_args to avoid silently ignoring
user parameters.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7367594 and 8d562b4.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/eval.py

Kipok and others added 21 commits February 24, 2026 13:21
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Add generic 'custom' judge type that allows external repos to register
their own judge task creators using the locate() pattern. This enables
external benchmarks to define judge tasks without modifying nemo-skills
source code.

Changes:
- Add _create_custom_judge_tasks() function to dynamically import and
  call external judge creators
- Add 'custom' judge type routing in prepare_eval_commands()
- Uses existing locate() pattern (same as METRICS_TYPE, eval_type)

This change is minimal (~50 lines) and reusable for any external repo.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Squash merge of changes during code-review.
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok force-pushed the peri044/external_judge branch from b5f419f to d27e339 Compare February 24, 2026 21:21
@Kipok
Collaborator

Kipok commented Feb 24, 2026

@peri044 I renamed the parameter to judge_step_fn, I think that's a bit clearer as judge_path looks like a path to judge model. Also fixed DCO and a bug that caused one of the tests to fail. Let me know if naming looks good to you and we can merge after that

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
nemo_skills/pipeline/judges/comet_judge.py (2)

100-106: Use direct access for input_dir — it is always set when input_file is None.

eval.py (line 472) always populates input_dir when benchmark_seeds != 0. Using .get() silently interpolates None into the command string.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 100 - 106, When
input_file is None, replace judge_pipeline_args.get("input_dir") with direct
access judge_pipeline_args["input_dir"] so the constructed command uses the
guaranteed input_dir value (do this in the block that appends to script_args
alongside num_seeds); ensure you reference the variables input_file,
judge_pipeline_args, input_dir, script_args, and num_seeds in the change.

75-77: Use direct dictionary access for keys expected to be present.

output_dir is always set by eval.py (line 477) before invoking this function, and judge_model is set via setdefault at line 496. Using .get() will silently produce None values that get interpolated as the string "None" in the command, leading to hard-to-debug failures (e.g., --comet-model-path None). input_file is legitimately optional since None drives the branching logic.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
-    comet_model_path = judge_pipeline_args.get("judge_model")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional
+    comet_model_path = judge_pipeline_args["judge_model"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 75 - 77, The code
uses judge_pipeline_args.get(...) for keys that must exist, causing silent None
values; change to direct dictionary access for required keys: replace
judge_pipeline_args.get("output_dir") and judge_pipeline_args.get("judge_model")
with judge_pipeline_args["output_dir"] and judge_pipeline_args["judge_model"]
respectively (leave input_file as judge_pipeline_args.get("input_file") since it
is optional), so failures raise KeyError and fail fast; update any variable
names output_dir_path and comet_model_path accordingly where they are assigned
from judge_pipeline_args.
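A small illustration of the fail-fast rationale behind this guideline (the dict contents and flag name here are made up, not the actual comet_judge.py code):

```python
judge_pipeline_args = {"output_dir": "/results"}  # "judge_model" accidentally missing

# .get() silently yields None, which is interpolated as the literal string "None"
cmd = f"--comet-model-path {judge_pipeline_args.get('judge_model')}"
assert cmd == "--comet-model-path None"  # only surfaces much later, as a cluster job failure

# Direct access fails fast at the call site with a clear KeyError
try:
    cmd = f"--comet-model-path {judge_pipeline_args['judge_model']}"
except KeyError as err:
    assert err.args[0] == "judge_model"
```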
nemo_skills/pipeline/judges/nvembed_judge.py (2)

100-102: Use direct access for input_dir.

When input_file is None, eval.py always sets input_dir. Using .get() silently interpolates None into the command string.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 100 - 102, The
branch handling when input_file is None uses
judge_pipeline_args.get("input_dir") which can return None and be interpolated
into the command; change it to direct access judge_pipeline_args["input_dir"]
and use that value when building script_args (the conditional block around
input_file and the script_args.append call) so the code fails loudly if
input_dir is missing instead of inserting "None" into the command.

75-76: Use direct dictionary access for output_dir — it is always populated by the caller.

Same issue as in comet_judge.py: output_dir is guaranteed set by eval.py (line 477). Direct access provides a clear KeyError if assumptions change.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 75 - 76, The code
uses judge_pipeline_args.get("output_dir") even though output_dir is always
provided by the caller; change this to direct dictionary access
judge_pipeline_args["output_dir"] to surface a KeyError if the assumption is
violated. Update the assignment that sets output_dir_path (and leave input_file
as-is) so it uses judge_pipeline_args["output_dir"]; ensure any downstream logic
that relies on output_dir_path continues to work unchanged.
🧹 Nitpick comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)

26-134: Consider extracting shared logic between comet_judge.py and nvembed_judge.py.

Both judge modules share nearly identical structure: seed determination, remaining-job checks, script-arg assembly, and add_task invocation. The main differences are the command string and the --skip-existing flag. A shared base or helper function could reduce duplication. Not urgent given this is the initial extraction, but worth noting for follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 26 - 134,
create_judge_tasks duplicates logic found in comet_judge.py (seed determination,
get_remaining_jobs check, script args assembly, and add_task call); refactor by
extracting a shared helper (e.g., build_judge_task or create_common_judge_task)
that accepts parameters for the script path, unique script_args (like
"--input-file" vs "--input-dir" and the presence of "--skip-existing"),
container key, and GPU/node defaults, then have create_judge_tasks and the comet
counterpart call that helper; reuse existing symbols get_remaining_jobs and
add_task inside the helper and preserve behavior for run_cmd formation,
task_name suffixes (e.g., "-nvembed-judge"), and task parameters (num_gpus,
num_nodes, partition, run_after, reuse_code_exp, reuse_code, task_dependencies,
installation_command, skip_hf_home_check, sbatch_kwargs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 486-488: The flag has_tasks is being set unconditionally when a
judge_step_fn exists, which can be incorrect if judge_step_fn() returns an empty
list; change the logic in the block handling judge_step_fn so that you only set
has_tasks = True after calling judge_step_fn and confirming it returned a
non-empty list (e.g., inside the if judge_tasks: guard that processes the
returned tasks), mirroring the existing pattern used when _generate returns
None; ensure run_exp is only triggered when has_tasks is true after this check.
- Around line 479-492: The bug is that judge_step_fn is mutated inside the
benchmark loop by using judge_pipeline_args.pop("judge_step_fn", judge_step_fn),
causing the previous iteration's value to be used as the fallback; to fix,
capture the original CLI/default value (e.g., orig_judge_step_fn =
judge_step_fn) before entering the loop and inside the loop use a local variable
or call pop with that original as the default
(judge_pipeline_args.pop("judge_step_fn", orig_judge_step_fn)), and ensure you
don't assign back to the outer-scope judge_step_fn so the CLI default isn't
overwritten across iterations; keep the locate() dynamic-loading logic (from
nemo_skills.dataset.utils import locate) but apply it to the loop-local variable
only.

In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 94-96: The early-return in the Comet judge currently returns an
empty list which leaves the caller's pre-set has_tasks=True unchanged; change
the early-return to return None (or another clearly falsy sentinel) from the
judge function so callers can detect "no work" by truthiness, and update the
caller in eval.py (the code around line 487 / the logic that sets has_tasks
before calling judge_step_fn) to set has_tasks based on the truthiness of
judge_step_fn's result (e.g., assign has_tasks = bool(judge_tasks) after calling
judge_step_fn) so the experiment isn't launched when there are no actual tasks.
Ensure references to the Comet judge function (judge_step_fn) and the caller in
eval.py are updated consistently.

In `@tests/test_generation.py`:
- Around line 31-32: The test uses an incorrect NVIDIA NIM API model identifier
and endpoint: replace the model string
"--model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B" with the documented lowercase
single-prefix identifier "--model=nvidia/nemotron-3-nano-30b-a3b" and revert the
server address "--server_address=https://inference-api.nvidia.com/v1/" to the
official base URL "--server_address=https://integrate.api.nvidia.com"; update
these exact f-string literals in tests/test_generation.py so the test uses the
documented NIM API formats.

---

Duplicate comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 100-106: When input_file is None, replace
judge_pipeline_args.get("input_dir") with direct access
judge_pipeline_args["input_dir"] so the constructed command uses the guaranteed
input_dir value (do this in the block that appends to script_args alongside
num_seeds); ensure you reference the variables input_file, judge_pipeline_args,
input_dir, script_args, and num_seeds in the change.
- Around line 75-77: The code uses judge_pipeline_args.get(...) for keys that
must exist, causing silent None values; change to direct dictionary access for
required keys: replace judge_pipeline_args.get("output_dir") and
judge_pipeline_args.get("judge_model") with judge_pipeline_args["output_dir"]
and judge_pipeline_args["judge_model"] respectively (leave input_file as
judge_pipeline_args.get("input_file") since it is optional), so failures raise
KeyError and fail fast; update any variable names output_dir_path and
comet_model_path accordingly where they are assigned from judge_pipeline_args.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 100-102: The branch handling when input_file is None uses
judge_pipeline_args.get("input_dir") which can return None and be interpolated
into the command; change it to direct access judge_pipeline_args["input_dir"]
and use that value when building script_args (the conditional block around
input_file and the script_args.append call) so the code fails loudly if
input_dir is missing instead of inserting "None" into the command.
- Around line 75-76: The code uses judge_pipeline_args.get("output_dir") even
though output_dir is always provided by the caller; change this to direct
dictionary access judge_pipeline_args["output_dir"] to surface a KeyError if the
assumption is violated. Update the assignment that sets output_dir_path (and
leave input_file as-is) so it uses judge_pipeline_args["output_dir"]; ensure any
downstream logic that relies on output_dir_path continues to work unchanged.

---

Nitpick comments:
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 26-134: create_judge_tasks duplicates logic found in
comet_judge.py (seed determination, get_remaining_jobs check, script args
assembly, and add_task call); refactor by extracting a shared helper (e.g.,
build_judge_task or create_common_judge_task) that accepts parameters for the
script path, unique script_args (like "--input-file" vs "--input-dir" and the
presence of "--skip-existing"), container key, and GPU/node defaults, then have
create_judge_tasks and the comet counterpart call that helper; reuse existing
symbols get_remaining_jobs and add_task inside the helper and preserve behavior
for run_cmd formation, task_name suffixes (e.g., "-nvembed-judge"), and task
parameters (num_gpus, num_nodes, partition, run_after, reuse_code_exp,
reuse_code, task_dependencies, installation_command, skip_hf_home_check,
sbatch_kwargs).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d562b4 and 31f6ff7.

📒 Files selected for processing (7)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/judges/__init__.py
  • nemo_skills/pipeline/judges/comet_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py
  • tests/test_generation.py
✅ Files skipped from review due to trivial changes (1)
  • nemo_skills/pipeline/judges/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/evaluation/multilingual.md

Comment on lines +479 to +492
            # judge_step_fn is a :: path to the judge creator function (locate() convention).
            # Could be set directly in JUDGE_PIPELINE_ARGS; falls back to None for LLM judge.
            judge_step_fn = judge_pipeline_args.pop("judge_step_fn", judge_step_fn)

            # Create judge tasks based on judge type
            if benchmark_judge_type == "nvembed":
                judge_tasks = _create_nvembed_judge_tasks(
                    exp=exp,
                    expname=expname,
                    benchmark=benchmark,
                    judge_pipeline_args=judge_pipeline_args,
                    rerun_done=rerun_done,
                    log_dir=log_dir,
                    server_parameters=server_parameters,
                    cluster_config=cluster_config,
                    judge_server_gpus=judge_server_gpus,
                    judge_server_nodes=judge_server_nodes,
                    partition=partition,
                    run_after=run_after,
                    reuse_code_exp=reuse_code_exp,
                    reuse_code=reuse_code,
                    dependent_tasks=dependent_tasks,
                    all_tasks=all_tasks,
                    _task_dependencies=_task_dependencies,
                    installation_command=installation_command,
                    skip_hf_home_check=skip_hf_home_check,
                    sbatch_kwargs=sbatch_kwargs,
                )
            elif benchmark_judge_type == "comet":
                judge_pipeline_args["judge_model"] = judge_model
                judge_tasks = _create_comet_judge_tasks(
            # TODO: we should rework the interface here to have consistent parameters between main llm and custom
            # judge creation steps. E.g. things like judge_model assignment below shouldn't be necessary

            if judge_step_fn:
                has_tasks = True
                if not callable(judge_step_fn):
                    # Use locate() to dynamically load judge creator function
                    from nemo_skills.dataset.utils import locate

                    judge_step_fn = locate(judge_step_fn)

⚠️ Potential issue | 🔴 Critical

Bug: judge_step_fn leaks across benchmark iterations.

On line 481, judge_step_fn is reassigned from judge_pipeline_args.pop("judge_step_fn", judge_step_fn). Since this is inside the for benchmark ... loop (line 457), the fallback value on subsequent iterations is whatever was set by the previous benchmark — not the original CLI value.

Example: If benchmark A's JUDGE_PIPELINE_ARGS defines judge_step_fn = "...nvembed_judge::create_judge_tasks" but benchmark B does not, benchmark B will incorrectly inherit benchmark A's judge_step_fn instead of falling back to the CLI default (likely None → LLM judge).

Save the original CLI value before the loop and use it as the fallback:

Proposed fix
     all_tasks = []
     if _task_dependencies is None:
         _task_dependencies = []
+    cli_judge_step_fn = judge_step_fn
     with pipeline_utils.get_exp(expname, cluster_config, _reuse_exp) as exp:
         # scheduling main eval jobs
         ...
         # scheduling judge jobs if needed
         for idx, (benchmark, benchmark_args) in enumerate(benchmarks_dict.items()):
             ...
-            judge_step_fn = judge_pipeline_args.pop("judge_step_fn", judge_step_fn)
+            benchmark_judge_step_fn = judge_pipeline_args.pop("judge_step_fn", cli_judge_step_fn)
             ...
-            if judge_step_fn:
+            if benchmark_judge_step_fn:
                 has_tasks = True
-                if not callable(judge_step_fn):
+                if not callable(benchmark_judge_step_fn):
                     from nemo_skills.dataset.utils import locate
-                    judge_step_fn = locate(judge_step_fn)
+                    benchmark_judge_step_fn = locate(benchmark_judge_step_fn)
                 ...
-                judge_tasks = judge_step_fn(
+                judge_tasks = benchmark_judge_step_fn(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 479 - 492, The bug is that
judge_step_fn is mutated inside the benchmark loop by using
judge_pipeline_args.pop("judge_step_fn", judge_step_fn), causing the previous
iteration's value to be used as the fallback; to fix, capture the original
CLI/default value (e.g., orig_judge_step_fn = judge_step_fn) before entering the
loop and inside the loop use a local variable or call pop with that original as
the default (judge_pipeline_args.pop("judge_step_fn", orig_judge_step_fn)), and
ensure you don't assign back to the outer-scope judge_step_fn so the CLI default
isn't overwritten across iterations; keep the locate() dynamic-loading logic
(from nemo_skills.dataset.utils import locate) but apply it to the loop-local
variable only.
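The fallback-leak described above can be reproduced in isolation (the benchmark args and the creator path string are made up for illustration):

```python
cli_default = None  # the CLI-level judge_step_fn (None -> LLM judge)
per_benchmark_args = [
    {"judge_step_fn": "pkg.mod::create_judge_tasks"},  # benchmark A overrides
    {},                                                # benchmark B expects the CLI default
]

# Buggy pattern: the outer variable doubles as the pop() default and is
# reassigned each iteration, so benchmark B inherits benchmark A's value.
judge_step_fn = cli_default
leaked = []
for args in per_benchmark_args:
    judge_step_fn = args.pop("judge_step_fn", judge_step_fn)
    leaked.append(judge_step_fn)
assert leaked == ["pkg.mod::create_judge_tasks", "pkg.mod::create_judge_tasks"]

# Fixed pattern: keep the CLI default untouched and use a loop-local variable.
per_benchmark_args = [{"judge_step_fn": "pkg.mod::create_judge_tasks"}, {}]
fixed = []
for args in per_benchmark_args:
    benchmark_fn = args.pop("judge_step_fn", cli_default)
    fixed.append(benchmark_fn)
assert fixed == ["pkg.mod::create_judge_tasks", None]
```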

Comment on lines +486 to +488
if judge_step_fn:
    has_tasks = True
    if not callable(judge_step_fn):

⚠️ Potential issue | 🟡 Minor

has_tasks = True is set before confirming the judge creator returns tasks.

If judge_step_fn() returns [] (e.g., all outputs already exist), has_tasks is still True. In an edge case where this is the only scheduled work, run_exp would be called on an experiment with no tasks. Consider moving has_tasks = True inside the if judge_tasks: guard at line 554, consistent with how _generate returning None is handled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 486 - 488, The flag has_tasks is
being set unconditionally when a judge_step_fn exists, which can be incorrect if
judge_step_fn() returns an empty list; change the logic in the block handling
judge_step_fn so that you only set has_tasks = True after calling judge_step_fn
and confirming it returned a non-empty list (e.g., inside the if judge_tasks:
guard that processes the returned tasks), mirroring the existing pattern used
when _generate returns None; ensure run_exp is only triggered when has_tasks is
true after this check.
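A minimal sketch of the suggested truthiness-based flag (the function names here are illustrative, not the actual eval.py code):

```python
def schedule(judge_step_fn):
    """Only flag real work: an empty task list or no creator should not trigger run_exp."""
    has_tasks = False
    judge_tasks = judge_step_fn() if judge_step_fn else None
    if judge_tasks:  # [] and None both skip; a non-empty list sets the flag
        has_tasks = True
    return has_tasks

assert schedule(lambda: []) is False       # all outputs already exist
assert schedule(None) is False             # no judge configured
assert schedule(lambda: ["task"]) is True  # actual work scheduled
```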

Comment on lines +94 to +96
if not remaining_jobs or all(not chunks for chunks in remaining_jobs.values()):
    LOG.info(f"Skipping Comet judge for {benchmark} - all output files and .done markers exist")
    return []

⚠️ Potential issue | 🟡 Minor

Early-return skips has_tasks = True in the caller — verify the caller handles empty returns correctly.

When remaining_jobs is empty, this returns []. In eval.py (line 487), has_tasks is set to True before calling judge_step_fn, so if this is the only task scheduled and it returns empty, the experiment will still attempt to run with no actual tasks. This is a minor inconsistency — the if judge_tasks: guard at line 554 prevents the worst outcome (NoneType errors), but the experiment still launches unnecessarily.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 94 - 96, The
early-return in the Comet judge currently returns an empty list which leaves the
caller's pre-set has_tasks=True unchanged; change the early-return to return
None (or another clearly falsy sentinel) from the judge function so callers can
detect "no work" by truthiness, and update the caller in eval.py (the code
around line 487 / the logic that sets has_tasks before calling judge_step_fn) to
set has_tasks based on the truthiness of judge_step_fn's result (e.g., assign
has_tasks = bool(judge_tasks) after calling judge_step_fn) so the experiment
isn't launched when there are no actual tasks. Ensure references to the Comet
judge function (judge_step_fn) and the caller in eval.py are updated
consistently.

Comment on lines +31 to +32
f" --model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B "
f" --server_address=https://inference-api.nvidia.com/v1/ "

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

NVIDIA NIM inference API endpoint documentation 2025

💡 Result:

NVIDIA NIM inference API endpoint docs (2025-era)

1) NVIDIA-hosted NIM (API Catalog) – LLM inference

2) Self-hosted / Downloadable NIM for LLMs – OpenAPI + endpoints

The NIM for LLMs API reference lists these key endpoints (OpenAI-compatible):

  • GET /v1/health/ready
  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/completions (docs.nvidia.com)
    It also notes experimental endpoints like /v1/responses and Llama Stack experimental routes. (docs.nvidia.com)

3) Retrieval (embeddings / rerank) NIM APIs

NVIDIA’s “Retrieval APIs” reference covers embedding and reranking endpoints for NeMo Retriever NIMs. (docs.api.nvidia.com)

Quick example (self-hosted NIM)

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"model_name","prompt":"Once upon a time","max_tokens":64}'

Example shown in NVIDIA’s NIM microservices page. (nvidia.com)



🌐 Web query:

Nemotron-3-Nano-30B-A3B model NVIDIA API

💡 Result:

Nemotron-3-Nano-30B-A3B is available in the NVIDIA NIM API as nvidia/nemotron-3-nano-30b-a3b. The chat endpoint is OpenAI-compatible and uses Bearer auth. [1], [2]

HTTP endpoint (Chat Completions):
POST https://integrate.api.nvidia.com/v1/chat/completions [2]

Request (example):

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -d '{
    "model": "nvidia/nemotron-3-nano-30b-a3b",
    "messages": [{"role":"user","content":"Write a haiku about GPUs"}]
  }'

Model/API reference and model details are in NVIDIA’s NIM docs. [1]

Sources:
[1] NVIDIA NIM API Reference – nvidia / nemotron-3-nano-30b-a3b: https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-nano-30b-a3b
[2] NVIDIA NIM API Reference – Chat Completions endpoint for this model: https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-nano-30b-a3b-infer


🌐 Web query:

NVIDIA inference-api.nvidia.com versus integrate.api.nvidia.com endpoint

💡 Result:

  • https://integrate.api.nvidia.com is NVIDIA’s documented, public base URL for the NVIDIA API Catalog (NIM hosted endpoints). For LLMs, it exposes an OpenAI-compatible endpoint like POST /v1/chat/completions. [1]

    • Note: NVIDIA documents at least one OpenAI-style endpoint difference vs “generic OpenAI” expectations—e.g., /v1/completions is not supported on integrate.api.nvidia.com/v1 (use chat-completions-style APIs instead, or the specific service docs). [2]
  • https://inference-api.nvidia.com is not the base URL shown in NVIDIA’s NIM API Catalog reference docs; hitting the root returns "LiteLLM: RUNNING", which indicates it’s a LiteLLM-powered gateway/proxy rather than the primary documented public entrypoint. [3]

    • Practical implication: treat it as internal/legacy/undocumented unless NVIDIA specifically instructed you to use it for your account/tenant—prefer integrate.api.nvidia.com for supported integrations. [1][3]

Sources:
[1] NVIDIA NIM LLM APIs reference (base URL + /v1/chat/completions) (docs.api.nvidia.com)
[2] NeMo Guardrails doc noting integrate.api.nvidia.com/v1 doesn’t support /completions (docs.nvidia.com)
[3] inference-api.nvidia.com root response (inference-api.nvidia.com)



Revert endpoint and model identifier to documented NVIDIA NIM API standards.

The endpoint change to inference-api.nvidia.com is incorrect. NVIDIA's official documented base URL for the NIM API Catalog is https://integrate.api.nvidia.com. The inference-api.nvidia.com endpoint is an undocumented internal LiteLLM gateway and should not be used for production integrations.

The model identifier format is also incorrect. The correct format per NVIDIA's NIM API documentation is nvidia/nemotron-3-nano-30b-a3b (lowercase with single nvidia/ prefix), not the mixed-case Nemotron-3-Nano-30B-A3B. The tests will fail with these changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_generation.py` around lines 31 - 32, The test uses an incorrect
NVIDIA NIM API model identifier and endpoint: replace the model string
"--model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B" with the documented lowercase
single-prefix identifier "--model=nvidia/nemotron-3-nano-30b-a3b" and revert the
server address "--server_address=https://inference-api.nvidia.com/v1/" to the
official base URL "--server_address=https://integrate.api.nvidia.com"; update
these exact f-string literals in tests/test_generation.py so the test uses the
documented NIM API formats.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@peri044
Collaborator Author

peri044 commented Feb 24, 2026

@peri044 I renamed the parameter to judge_step_fn, I think that's a bit clearer as judge_path looks like a path to judge model. Also fixed DCO and a bug that caused one of the tests to fail. Let me know if naming looks good to you and we can merge after that

Thank you @Kipok for the fixes. The naming looks good to me.

@gwarmstrong gwarmstrong enabled auto-merge (squash) February 25, 2026 20:42
@gwarmstrong gwarmstrong merged commit 9fa8e83 into main Feb 25, 2026
6 of 10 checks passed
@gwarmstrong gwarmstrong deleted the peri044/external_judge branch February 25, 2026 20:56
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs from test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    Update vllm multimodal for API call convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

    Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>