
feat: add custom judge type support for external repo integration#1274

Merged
gwarmstrong merged 24 commits into main from peri044/external_judge
Feb 25, 2026
Conversation

@peri044
Collaborator

@peri044 peri044 commented Feb 24, 2026

Add generic 'custom' judge type that allows external repos to register their own judge task creators using the locate() pattern. This enables external benchmarks to define judge tasks without modifying nemo-skills source code.

Changes:

  • Add _create_custom_judge_tasks() function to dynamically import and call external judge creators
  • Add 'custom' judge type routing in prepare_eval_commands()
  • Uses existing locate() pattern (same as METRICS_TYPE, eval_type)
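The locate() pattern referenced above resolves a "module.path::attribute" string to a Python object at runtime. A minimal sketch of the idea, with an illustrative external creator (the resolver body and the creator signature here are assumptions for illustration, not the actual nemo-skills code):

```python
import importlib


def locate(path: str):
    """Resolve a "pkg.module::attr" locator string to the named attribute."""
    module_path, _, attr_name = path.partition("::")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# An external repo could expose a creator like this and pass its locator
# string (e.g. "my_benchmarks.judges::create_judge_tasks") to the eval CLI.
def create_judge_tasks(benchmark, judge_pipeline_args, output_dir):
    # Build and return the judge tasks for this benchmark (stubbed here).
    return [f"judge:{benchmark}:{output_dir}"]
```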

Summary by CodeRabbit

  • New Features

    • Evaluation pipeline now accepts a runtime judge creator path (use this instead of the old judge type) and dynamically creates per-benchmark judge tasks.
    • New COMET- and embedding-based judge integrations are available as pluggable judge task implementations.
  • Documentation

    • Multilingual evaluation docs updated to show the new judge configuration and invocation.
  • Tests

    • Test strings updated to reflect revised model and endpoint identifiers.
  • Chore

    • Copyright year updated.

@coderabbitai
Contributor

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces per-benchmark judge_type with a dynamic judge_step_fn locator; adds comet_judge and nvembed_judge creator modules; updates eval CLI, docs, and dataset configs to reference factory-style judge creators; eval loads and invokes create_judge_tasks, passing judge_model and output_dir; LLM judge remains fallback.

Changes

Cohort / File(s) / Summary

  • Eval CLI / Core Flow (nemo_skills/pipeline/eval.py):
    Removed the judge_type option and per-benchmark dispatch; added the judge_step_fn option. Eval dynamically loads and invokes a judge creator when judge_step_fn is provided (passing judge_model and output_dir); the LLM judge remains the fallback. Removed the Comet/NVEmbed helper functions.
  • New Judge Implementations (nemo_skills/pipeline/judges/comet_judge.py, nemo_skills/pipeline/judges/nvembed_judge.py):
    Added create_judge_tasks(...) functions that compute seeds, check remaining jobs, build CLI commands (COMET / NVEmbed), and create cluster-aware judge tasks (GPU/node/partition, SBATCH options, deps). They return the list of created tasks, or an empty list when skipping.
  • Judges Package Init (nemo_skills/pipeline/judges/__init__.py):
    Added a package initializer with a license header and module docstring.
  • Docs / Config Updates (docs/evaluation/multilingual.md, nemo_skills/dataset/mmau-pro/closed_form/__init__.py):
    Replaced --judge_type=comet / {"judge_type": ...} usages with judge_step_fn locator strings (e.g., nemo_skills.pipeline.judges.comet_judge::create_judge_tasks).
  • Tests / Strings (tests/test_generation.py):
    Updated NVIDIA Nemotron model and inference endpoint string literals (host and model identifiers); no logic changes.
  • Minor (nemo_skills/pipeline/__init__.py):
    Copyright year bump; no functional change.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Eval as eval()
    participant Loader as locate()
    participant JudgeFactory as Judge Creator
    participant TaskMgmt as Task Management
    participant Cluster as Cluster Config

    Client->>Eval: eval(..., judge_step_fn, judge_pipeline_args)
    alt judge_step_fn provided
        Eval->>Loader: load(judge_step_fn)
        Loader-->>Eval: judge_creator_fn
        Eval->>JudgeFactory: judge_creator_fn(exp, benchmark, judge_pipeline_args, output_dir, ...)
        JudgeFactory->>TaskMgmt: get_remaining_jobs(...)
        JudgeFactory->>Cluster: read cluster_config / GPUs / partition
        JudgeFactory->>TaskMgmt: add_task(cmd, deps, cluster settings)
        TaskMgmt-->>JudgeFactory: task(s) created
        JudgeFactory-->>Eval: list of judge tasks
    else fallback (LLM)
        Eval->>TaskMgmt: _create_llm_judge_tasks(...)
        TaskMgmt-->>Eval: llm judge tasks
    end
    Eval-->>Client: evaluation scheduled/completed
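The alt branch above can be sketched as a simplified control flow. This is illustrative only; the function names and signatures are assumptions, not the actual eval.py code:

```python
import importlib


def locate(path):
    # Same "module::attr" locator convention used elsewhere in the codebase.
    mod, _, attr = path.partition("::")
    return getattr(importlib.import_module(mod), attr)


def _create_llm_judge_tasks(benchmark, output_dir):
    # Fallback path: schedule the default LLM-as-judge tasks (stubbed).
    return [f"llm-judge:{benchmark}"]


def schedule_judge_tasks(judge_step_fn, benchmark, judge_pipeline_args, output_dir):
    if judge_step_fn:
        # Dynamic path: load the external creator and let it build the tasks.
        judge_creator_fn = locate(judge_step_fn)
        return judge_creator_fn(benchmark, judge_pipeline_args, output_dir)
    return _create_llm_judge_tasks(benchmark, output_dir)
```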

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately reflects the main architectural change: introducing dynamic judge type support via judge_step_fn instead of hardcoded judge types, enabling external repositories to register custom judge creators.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

483-514: ⚠️ Potential issue | 🟡 Minor

has_tasks not set to True in the dynamic judge path.

The LLM judge branch (line 514) sets has_tasks = True, but the dynamic judge_creator_path branch does not. If the dynamic judge creator returns tasks, these tasks are added to all_tasks (line 547) but the experiment may not be run if no main eval jobs were scheduled (since has_tasks gates run_exp at line 647).

In practice this is unlikely to cause issues since main eval jobs typically set has_tasks, but for consistency and correctness:

Proposed fix
            if judge_creator_path:
+               has_tasks = True
                # Use locate() to dynamically load judge creator function
                from nemo_skills.dataset.utils import locate
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 483 - 514, The dynamic
judge_creator_path branch calls locate() to get judge_creator_fn and assigns its
result to judge_tasks but never sets has_tasks; update the branch in the
judge_creator_path flow so that after calling judge_creator_fn (the judge_tasks
variable) you set has_tasks = True if judge_tasks is truthy (or always set
has_tasks = True if you prefer), ensuring the existing all_tasks handling and
the run_exp gate that depends on has_tasks will run the experiment when dynamic
judge tasks are produced.
🧹 Nitpick comments (1)
nemo_skills/pipeline/eval.py (1)

475-487: Move import to the top of the file.

The from nemo_skills.dataset.utils import locate on line 485 is inside a loop body. While functionally fine (Python caches module imports), placing it at the top of the file with the other imports is more conventional and avoids repeated import lookups.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 475 - 487, The import of locate is
inside the branch that handles judge_creator_path (see judge_creator_path and
judge_creator_fn) which causes a local import in the loop; move "from
nemo_skills.dataset.utils import locate" up to the module imports at the top of
the file and remove the in-function import so judge_creator_fn =
locate(judge_creator_path) uses the top-level imported locate; update any import
grouping to include locate alongside the other imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 157: The help string for the typer.Option in the judge_step variable is
too long; update the judge_step definition so the help argument is split across
multiple lines (use implicit string concatenation or wrap the typer.Option(...)
call in parentheses and break the help string into multiple quoted segments) so
the line length is reduced and ruff-format will pass; keep the same text (e.g.,
split "Path to the judge creator function to use for the judge (locate()
convention). Eg: nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks")
across lines in the judge_step = typer.Option(...) call.
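The line-length fix suggested for the judge_step help string can use implicit concatenation of adjacent string literals; a minimal sketch in plain Python (the variable name is illustrative, and typer itself is not needed to show the pattern):

```python
# Implicit concatenation of adjacent string literals keeps each source
# line short while producing a single help string at compile time.
judge_step_fn_help = (
    "Path to the judge creator function to use for the judge (locate() convention). "
    "Eg: nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks"
)

# The same string could then be passed as typer.Option(..., help=judge_step_fn_help).
```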

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6da2219 and 40d7c38.

📒 Files selected for processing (6)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/judges/__init__.py
  • nemo_skills/pipeline/judges/comet_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py

Comment on lines +75 to +77
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
comet_model_path = judge_pipeline_args.get("judge_model")
Contributor
⚠️ Potential issue | 🟠 Major

Use direct dictionary access for keys that are expected to be present.

output_dir is always set by eval.py before calling the judge creator (line 473), and judge_model is critical for COMET evaluation — if it's missing, the command will interpolate None as a string into the CLI args (--comet-model-path None), producing a silent, hard-to-debug failure. Use judge_pipeline_args["key"] to fail fast with a clear KeyError.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
-    comet_model_path = judge_pipeline_args.get("judge_model")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional
+    comet_model_path = judge_pipeline_args["judge_model"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."
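To illustrate the failure mode this guideline guards against, here is a small self-contained sketch (the dictionary contents and the missing key name are made up for the example):

```python
judge_pipeline_args = {"output_dir": "/results", "judge_model": "wmt22-comet-da"}

# Permissive access: a missing key becomes None, which f-string
# interpolation silently turns into the literal text "None".
cmd_bad = f"--comet-model-path {judge_pipeline_args.get('comet_model_path')}"

# Direct access: a missing key fails fast with a clear KeyError
# at the call site instead of producing a broken command later.
try:
    cmd_good = f"--comet-model-path {judge_pipeline_args['comet_model_path']}"
except KeyError as err:
    failure = f"missing required judge arg: {err}"
```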

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 75 - 77, Replace the
permissive dict.get() calls with direct dictionary access so missing critical
keys fail fast: change uses of judge_pipeline_args.get("output_dir"),
.get("input_file"), and .get("judge_model") to
judge_pipeline_args["output_dir"], judge_pipeline_args["input_file"], and
judge_pipeline_args["judge_model"] in comet_judge.py (variables output_dir_path,
input_file, comet_model_path) so a KeyError is raised immediately if those
required arguments are absent.

Comment on lines +100 to +106

if input_file is None:
    input_dir = judge_pipeline_args.get("input_dir")
    script_args.append(f"--input-dir {input_dir}")
    script_args.append(f"--num-seeds {num_seeds}")
else:
    script_args.append(f"--input-file {input_file}")
Contributor
⚠️ Potential issue | 🟡 Minor

Use direct access for input_dir — it is always set when input_file is None.

In the input_file is None branch, eval.py (line 468) always populates input_dir in judge_pipeline_args. Using .get() here would silently produce --input-dir None in the command string instead of failing fast.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 100 - 106, In the
branch where input_file is None, replace the dictionary access
judge_pipeline_args.get("input_dir") with direct indexing
judge_pipeline_args["input_dir"] so that missing input_dir raises an error
instead of generating a "--input-dir None" argument; update the code around the
input_file/input_dir handling that builds script_args (references: input_file,
judge_pipeline_args, script_args, num_seeds) to use
judge_pipeline_args["input_dir"] and keep adding the "--input-dir" and
"--num-seeds" entries as before.

Comment on lines +75 to +76
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
Contributor
⚠️ Potential issue | 🟡 Minor

Use direct dictionary access for output_dir — it is always populated by the caller.

Same as the Comet judge: output_dir is guaranteed to be set by eval.py (line 473) before invoking the judge creator. Direct access provides clear error messages if assumptions change.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 75 - 76, The code
currently uses judge_pipeline_args.get("output_dir") which hides missing-key
errors; replace the .get usage for output_dir with direct dictionary access (use
judge_pipeline_args["output_dir"]) so the code fails fast with a clear KeyError
if the caller stops providing it; update the reference to output_dir_path
accordingly (leave input_file handling unchanged unless you also expect it to be
mandatory) and run tests to ensure no other call sites assume .get semantics.

Comment on lines +100 to +102
if input_file is None:
    input_dir = judge_pipeline_args.get("input_dir")
    script_args.append(f"--input-dir {input_dir}")
Contributor
⚠️ Potential issue | 🟡 Minor

Use direct access for input_dir.

When input_file is None, eval.py always sets input_dir in judge_pipeline_args (line 468). Using .get() would silently interpolate None into the command string.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 100 - 102, The
code path in nvembed_judge.py that handles when input_file is None currently
uses judge_pipeline_args.get("input_dir") which can silently return None and
interpolate it into script_args; update that branch to access the required key
directly (judge_pipeline_args["input_dir"]) so the code fails fast if input_dir
is missing and prevents None from being appended to the command string; locate
the block around the if input_file is None check and replace the .get() access
with direct dictionary indexing.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

475-511: ⚠️ Potential issue | 🟠 Major

Handle conflicting judge_path sources and mark dynamic judge tasks as work.
Two issues here:

  1. If a benchmark defines judge_path, CLI --judge_path is silently ignored. Prefer CLI override or raise on conflict.
  2. When dynamic judge tasks are created, has_tasks stays false, so runs with only judge tasks (and auto_summarize_results=False) won’t execute.
✅ Proposed fix
-            judge_creator_path = judge_pipeline_args.pop("judge_path", judge_path)
+            benchmark_judge_path = judge_pipeline_args.pop("judge_path", None)
+            if judge_path and benchmark_judge_path and judge_path != benchmark_judge_path:
+                raise ValueError(
+                    f"Conflicting judge_path: CLI={judge_path} vs benchmark={benchmark_judge_path}"
+                )
+            judge_creator_path = judge_path or benchmark_judge_path
@@
-            if judge_tasks:
+            if judge_tasks:
+                has_tasks = True
                 benchmark_to_judge_tasks[benchmark] = judge_tasks
                 all_tasks.extend(judge_tasks)

As per coding guidelines, "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided. Use dataclasses or **kwargs to automatically handle parameter validation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 475 - 511, The code currently lets
a benchmark-provided judge_path in judge_pipeline_args silently override a CLI
judge_path and creates dynamic judge tasks without marking the run as having
work; change the logic so CLI judge_path takes precedence: retrieve any
benchmark value with judge_pipeline_args.pop("judge_path", None), then set
judge_creator_path = judge_path if judge_path is truthy else the popped value
(and remove the key from judge_pipeline_args to avoid silent conflicts), and
optionally log a warning if both were provided; after calling
judge_creator_fn(...) and assigning judge_tasks, ensure the run is marked as
having work by setting has_tasks = True (or marking the created judge_tasks as
work) when judge_tasks is non-empty so runs with only dynamic judge tasks will
execute even if auto_summarize_results=False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 351: The current eval_requires_judge computation ignores a user-specified
--judge_path when no other CLI judge args are truthy; update the logic that
computes eval_requires_judge (currently using cli_judge_pipeline_args and the
variable eval_requires_judge) so that the presence of
cli_judge_pipeline_args['judge_path'] (non-empty string) forces
eval_requires_judge to True and/or raises a validation error if judge_path is
provided without required companion args; locate the code that builds
cli_judge_pipeline_args and the eval_requires_judge assignment and either
include an explicit check for judge_path or add parameter validation (e.g., in
the same function that constructs cli_judge_pipeline_args) to fail fast for
unsupported or incomplete judge-related args.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 40d7c38 and 7367594.

📒 Files selected for processing (3)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/pipeline/eval.py

@peri044 peri044 requested a review from Kipok February 24, 2026 19:53
@peri044
Collaborator Author

peri044 commented Feb 24, 2026

@Kipok I've figured out the write access problem (an authorization issue with the NVIDIA-Nemo org). I've closed the fork PR #1265 and addressed your recent feedback in this PR. Does this align with your comments? Are the gpu-tests-qwen CI jobs supposed to be skipped, or is that something I need to fix on my end?

Contributor

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)

354-354: ⚠️ Potential issue | 🟠 Major

Ensure --judge_path forces judge scheduling.
eval_requires_judge ignores a user-specified judge_path, so judge jobs can be skipped for benchmarks that don’t require a judge by default. This silently drops a user-provided parameter.

✅ Proposed fix
-    eval_requires_judge = any(param_value for param_value in cli_judge_pipeline_args.values())
+    eval_requires_judge = bool(judge_path) or any(param_value for param_value in cli_judge_pipeline_args.values())

As per coding guidelines: "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided. Use dataclasses or **kwargs to automatically handle parameter validation."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` at line 354, The current eval_requires_judge
computation ignores a user-specified judge_path in cli_judge_pipeline_args;
update the logic so that presence/non-empty
cli_judge_pipeline_args.get('judge_path') forces scheduling of judge jobs (e.g.,
include a check for truthiness of 'judge_path' in the eval_requires_judge
expression or set eval_requires_judge = True when
cli_judge_pipeline_args.get('judge_path') is provided), and additionally
validate unexpected/unsupported CLI judge args by raising an error or warning if
unknown keys are present in cli_judge_pipeline_args to avoid silently ignoring
user parameters.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7367594 and 8d562b4.

📒 Files selected for processing (1)
  • nemo_skills/pipeline/eval.py

Kipok and others added 21 commits February 24, 2026 13:21
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Add generic 'custom' judge type that allows external repos to register
their own judge task creators using the locate() pattern. This enables
external benchmarks to define judge tasks without modifying nemo-skills
source code.

Changes:
- Add _create_custom_judge_tasks() function to dynamically import and
  call external judge creators
- Add 'custom' judge type routing in prepare_eval_commands()
- Uses existing locate() pattern (same as METRICS_TYPE, eval_type)

This change is minimal (~50 lines) and reusable for any external repo.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Squash merge of changes during code-review.
Signed-off-by: suriya <sgunasekar@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
@Kipok Kipok force-pushed the peri044/external_judge branch from b5f419f to d27e339 Compare February 24, 2026 21:21
@Kipok
Collaborator

Kipok commented Feb 24, 2026

@peri044 I renamed the parameter to judge_step_fn, I think that's a bit clearer as judge_path looks like a path to judge model. Also fixed DCO and a bug that caused one of the tests to fail. Let me know if naming looks good to you and we can merge after that

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (4)
nemo_skills/pipeline/judges/comet_judge.py (2)

100-106: Use direct access for input_dir — it is always set when input_file is None.

eval.py (line 472) always populates input_dir when benchmark_seeds != 0. Using .get() silently interpolates None into the command string.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 100 - 106, When
input_file is None, replace judge_pipeline_args.get("input_dir") with direct
access judge_pipeline_args["input_dir"] so the constructed command uses the
guaranteed input_dir value (do this in the block that appends to script_args
alongside num_seeds); ensure you reference the variables input_file,
judge_pipeline_args, input_dir, script_args, and num_seeds in the change.

75-77: Use direct dictionary access for keys expected to be present.

output_dir is always set by eval.py (line 477) before invoking this function, and judge_model is set via setdefault at line 496. Using .get() will silently produce None values that get interpolated as the string "None" in the command, leading to hard-to-debug failures (e.g., --comet-model-path None). input_file is legitimately optional since None drives the branching logic.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
-    comet_model_path = judge_pipeline_args.get("judge_model")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional
+    comet_model_path = judge_pipeline_args["judge_model"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 75 - 77, The code
uses judge_pipeline_args.get(...) for keys that must exist, causing silent None
values; change to direct dictionary access for required keys: replace
judge_pipeline_args.get("output_dir") and judge_pipeline_args.get("judge_model")
with judge_pipeline_args["output_dir"] and judge_pipeline_args["judge_model"]
respectively (leave input_file as judge_pipeline_args.get("input_file") since it
is optional), so failures raise KeyError and fail fast; update any variable
names output_dir_path and comet_model_path accordingly where they are assigned
from judge_pipeline_args.
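A small illustration of the fail-fast rationale behind this guideline (the dict contents and flag name here are made up, not the actual comet_judge.py code):

```python
judge_pipeline_args = {"output_dir": "/results"}  # "judge_model" accidentally missing

# .get() silently yields None, which is interpolated as the literal string "None"
cmd = f"--comet-model-path {judge_pipeline_args.get('judge_model')}"
assert cmd == "--comet-model-path None"  # only surfaces much later, as a cluster job failure

# Direct access fails fast at the call site with a clear KeyError
try:
    cmd = f"--comet-model-path {judge_pipeline_args['judge_model']}"
except KeyError as err:
    assert err.args[0] == "judge_model"
```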
nemo_skills/pipeline/judges/nvembed_judge.py (2)

100-102: Use direct access for input_dir.

When input_file is None, eval.py always sets input_dir. Using .get() silently interpolates None into the command string.

Proposed fix
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 100 - 102, The
branch handling when input_file is None uses
judge_pipeline_args.get("input_dir") which can return None and be interpolated
into the command; change it to direct access judge_pipeline_args["input_dir"]
and use that value when building script_args (the conditional block around
input_file and the script_args.append call) so the code fails loudly if
input_dir is missing instead of inserting "None" into the command.

75-76: Use direct dictionary access for output_dir — it is always populated by the caller.

Same issue as in comet_judge.py: output_dir is guaranteed set by eval.py (line 477). Direct access provides a clear KeyError if assumptions change.

Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # legitimately optional

As per coding guidelines, **/*.py: "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 75 - 76, The code
uses judge_pipeline_args.get("output_dir") even though output_dir is always
provided by the caller; change this to direct dictionary access
judge_pipeline_args["output_dir"] to surface a KeyError if the assumption is
violated. Update the assignment that sets output_dir_path (and leave input_file
as-is) so it uses judge_pipeline_args["output_dir"]; ensure any downstream logic
that relies on output_dir_path continues to work unchanged.
🧹 Nitpick comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)

26-134: Consider extracting shared logic between comet_judge.py and nvembed_judge.py.

Both judge modules share nearly identical structure: seed determination, remaining-job checks, script-arg assembly, and add_task invocation. The main differences are the command string and the --skip-existing flag. A shared base or helper function could reduce duplication. Not urgent given this is the initial extraction, but worth noting for follow-up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 26 - 134,
create_judge_tasks duplicates logic found in comet_judge.py (seed determination,
get_remaining_jobs check, script args assembly, and add_task call); refactor by
extracting a shared helper (e.g., build_judge_task or create_common_judge_task)
that accepts parameters for the script path, unique script_args (like
"--input-file" vs "--input-dir" and the presence of "--skip-existing"),
container key, and GPU/node defaults, then have create_judge_tasks and the comet
counterpart call that helper; reuse existing symbols get_remaining_jobs and
add_task inside the helper and preserve behavior for run_cmd formation,
task_name suffixes (e.g., "-nvembed-judge"), and task parameters (num_gpus,
num_nodes, partition, run_after, reuse_code_exp, reuse_code, task_dependencies,
installation_command, skip_hf_home_check, sbatch_kwargs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 486-488: The flag has_tasks is being set unconditionally when a
judge_step_fn exists, which can be incorrect if judge_step_fn() returns an empty
list; change the logic in the block handling judge_step_fn so that you only set
has_tasks = True after calling judge_step_fn and confirming it returned a
non-empty list (e.g., inside the if judge_tasks: guard that processes the
returned tasks), mirroring the existing pattern used when _generate returns
None; ensure run_exp is only triggered when has_tasks is true after this check.
- Around line 479-492: The bug is that judge_step_fn is mutated inside the
benchmark loop by using judge_pipeline_args.pop("judge_step_fn", judge_step_fn),
causing the previous iteration's value to be used as the fallback; to fix,
capture the original CLI/default value (e.g., orig_judge_step_fn =
judge_step_fn) before entering the loop and inside the loop use a local variable
or call pop with that original as the default
(judge_pipeline_args.pop("judge_step_fn", orig_judge_step_fn)), and ensure you
don't assign back to the outer-scope judge_step_fn so the CLI default isn't
overwritten across iterations; keep the locate() dynamic-loading logic (from
nemo_skills.dataset.utils import locate) but apply it to the loop-local variable
only.

In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 94-96: The early-return in the Comet judge currently returns an
empty list which leaves the caller's pre-set has_tasks=True unchanged; change
the early-return to return None (or another clearly falsy sentinel) from the
judge function so callers can detect "no work" by truthiness, and update the
caller in eval.py (the code around line 487 / the logic that sets has_tasks
before calling judge_step_fn) to set has_tasks based on the truthiness of
judge_step_fn's result (e.g., assign has_tasks = bool(judge_tasks) after calling
judge_step_fn) so the experiment isn't launched when there are no actual tasks.
Ensure references to the Comet judge function (judge_step_fn) and the caller in
eval.py are updated consistently.

In `@tests/test_generation.py`:
- Around line 31-32: The test uses an incorrect NVIDIA NIM API model identifier
and endpoint: replace the model string
"--model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B" with the documented lowercase
single-prefix identifier "--model=nvidia/nemotron-3-nano-30b-a3b" and revert the
server address "--server_address=https://inference-api.nvidia.com/v1/" to the
official base URL "--server_address=https://integrate.api.nvidia.com"; update
these exact f-string literals in tests/test_generation.py so the test uses the
documented NIM API formats.

---

Duplicate comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 100-106: When input_file is None, replace
judge_pipeline_args.get("input_dir") with direct access
judge_pipeline_args["input_dir"] so the constructed command uses the guaranteed
input_dir value (do this in the block that appends to script_args alongside
num_seeds); ensure you reference the variables input_file, judge_pipeline_args,
input_dir, script_args, and num_seeds in the change.
- Around line 75-77: The code uses judge_pipeline_args.get(...) for keys that
must exist, causing silent None values; change to direct dictionary access for
required keys: replace judge_pipeline_args.get("output_dir") and
judge_pipeline_args.get("judge_model") with judge_pipeline_args["output_dir"]
and judge_pipeline_args["judge_model"] respectively (leave input_file as
judge_pipeline_args.get("input_file") since it is optional), so failures raise
KeyError and fail fast; update any variable names output_dir_path and
comet_model_path accordingly where they are assigned from judge_pipeline_args.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 100-102: The branch handling when input_file is None uses
judge_pipeline_args.get("input_dir") which can return None and be interpolated
into the command; change it to direct access judge_pipeline_args["input_dir"]
and use that value when building script_args (the conditional block around
input_file and the script_args.append call) so the code fails loudly if
input_dir is missing instead of inserting "None" into the command.
- Around line 75-76: The code uses judge_pipeline_args.get("output_dir") even
though output_dir is always provided by the caller; change this to direct
dictionary access judge_pipeline_args["output_dir"] to surface a KeyError if the
assumption is violated. Update the assignment that sets output_dir_path (and
leave input_file as-is) so it uses judge_pipeline_args["output_dir"]; ensure any
downstream logic that relies on output_dir_path continues to work unchanged.

---

Nitpick comments:
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 26-134: create_judge_tasks duplicates logic found in
comet_judge.py (seed determination, get_remaining_jobs check, script args
assembly, and add_task call); refactor by extracting a shared helper (e.g.,
build_judge_task or create_common_judge_task) that accepts parameters for the
script path, unique script_args (like "--input-file" vs "--input-dir" and the
presence of "--skip-existing"), container key, and GPU/node defaults, then have
create_judge_tasks and the comet counterpart call that helper; reuse existing
symbols get_remaining_jobs and add_task inside the helper and preserve behavior
for run_cmd formation, task_name suffixes (e.g., "-nvembed-judge"), and task
parameters (num_gpus, num_nodes, partition, run_after, reuse_code_exp,
reuse_code, task_dependencies, installation_command, skip_hf_home_check,
sbatch_kwargs).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d562b4 and 31f6ff7.

📒 Files selected for processing (7)
  • docs/evaluation/multilingual.md
  • nemo_skills/dataset/mmau-pro/closed_form/__init__.py
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/judges/__init__.py
  • nemo_skills/pipeline/judges/comet_judge.py
  • nemo_skills/pipeline/judges/nvembed_judge.py
  • tests/test_generation.py
✅ Files skipped from review due to trivial changes (1)
  • nemo_skills/pipeline/judges/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/evaluation/multilingual.md

Comment on lines +479 to +492
            # judge_step_fn is a :: path to the judge creator function (locate() convention).
            # Could be set directly in JUDGE_PIPELINE_ARGS; falls back to None for LLM judge.
            judge_step_fn = judge_pipeline_args.pop("judge_step_fn", judge_step_fn)

            # Create judge tasks based on judge type
            if benchmark_judge_type == "nvembed":
                judge_tasks = _create_nvembed_judge_tasks(
                    exp=exp,
                    expname=expname,
                    benchmark=benchmark,
                    judge_pipeline_args=judge_pipeline_args,
                    rerun_done=rerun_done,
                    log_dir=log_dir,
                    server_parameters=server_parameters,
                    cluster_config=cluster_config,
                    judge_server_gpus=judge_server_gpus,
                    judge_server_nodes=judge_server_nodes,
                    partition=partition,
                    run_after=run_after,
                    reuse_code_exp=reuse_code_exp,
                    reuse_code=reuse_code,
                    dependent_tasks=dependent_tasks,
                    all_tasks=all_tasks,
                    _task_dependencies=_task_dependencies,
                    installation_command=installation_command,
                    skip_hf_home_check=skip_hf_home_check,
                    sbatch_kwargs=sbatch_kwargs,
                )
            elif benchmark_judge_type == "comet":
                judge_pipeline_args["judge_model"] = judge_model
                judge_tasks = _create_comet_judge_tasks(
            # TODO: we should rework the interface here to have consistent parameters between main llm and custom
            # judge creation steps. E.g. things like judge_model assignment below shouldn't be necessary

            if judge_step_fn:
                has_tasks = True
                if not callable(judge_step_fn):
                    # Use locate() to dynamically load judge creator function
                    from nemo_skills.dataset.utils import locate

                    judge_step_fn = locate(judge_step_fn)

⚠️ Potential issue | 🔴 Critical

Bug: judge_step_fn leaks across benchmark iterations.

On line 481, judge_step_fn is reassigned from judge_pipeline_args.pop("judge_step_fn", judge_step_fn). Since this is inside the for benchmark ... loop (line 457), the fallback value on subsequent iterations is whatever was set by the previous benchmark — not the original CLI value.

Example: If benchmark A's JUDGE_PIPELINE_ARGS defines judge_step_fn = "...nvembed_judge::create_judge_tasks" but benchmark B does not, benchmark B will incorrectly inherit benchmark A's judge_step_fn instead of falling back to the CLI default (likely None → LLM judge).

Save the original CLI value before the loop and use it as the fallback:

Proposed fix
     all_tasks = []
     if _task_dependencies is None:
         _task_dependencies = []
+    cli_judge_step_fn = judge_step_fn
     with pipeline_utils.get_exp(expname, cluster_config, _reuse_exp) as exp:
         # scheduling main eval jobs
         ...
         # scheduling judge jobs if needed
         for idx, (benchmark, benchmark_args) in enumerate(benchmarks_dict.items()):
             ...
-            judge_step_fn = judge_pipeline_args.pop("judge_step_fn", judge_step_fn)
+            benchmark_judge_step_fn = judge_pipeline_args.pop("judge_step_fn", cli_judge_step_fn)
             ...
-            if judge_step_fn:
+            if benchmark_judge_step_fn:
                 has_tasks = True
-                if not callable(judge_step_fn):
+                if not callable(benchmark_judge_step_fn):
                     from nemo_skills.dataset.utils import locate
-                    judge_step_fn = locate(judge_step_fn)
+                    benchmark_judge_step_fn = locate(benchmark_judge_step_fn)
                 ...
-                judge_tasks = judge_step_fn(
+                judge_tasks = benchmark_judge_step_fn(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 479 - 492, The bug is that
judge_step_fn is mutated inside the benchmark loop by using
judge_pipeline_args.pop("judge_step_fn", judge_step_fn), causing the previous
iteration's value to be used as the fallback; to fix, capture the original
CLI/default value (e.g., orig_judge_step_fn = judge_step_fn) before entering the
loop and inside the loop use a local variable or call pop with that original as
the default (judge_pipeline_args.pop("judge_step_fn", orig_judge_step_fn)), and
ensure you don't assign back to the outer-scope judge_step_fn so the CLI default
isn't overwritten across iterations; keep the locate() dynamic-loading logic
(from nemo_skills.dataset.utils import locate) but apply it to the loop-local
variable only.
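The fallback-leak described above can be reproduced in isolation (the benchmark args and the creator path string are made up for illustration):

```python
cli_default = None  # the CLI-level judge_step_fn (None -> LLM judge)
per_benchmark_args = [
    {"judge_step_fn": "pkg.mod::create_judge_tasks"},  # benchmark A overrides
    {},                                                # benchmark B expects the CLI default
]

# Buggy pattern: the outer variable doubles as the pop() default and is
# reassigned each iteration, so benchmark B inherits benchmark A's value.
judge_step_fn = cli_default
leaked = []
for args in per_benchmark_args:
    judge_step_fn = args.pop("judge_step_fn", judge_step_fn)
    leaked.append(judge_step_fn)
assert leaked == ["pkg.mod::create_judge_tasks", "pkg.mod::create_judge_tasks"]

# Fixed pattern: keep the CLI default untouched and use a loop-local variable.
per_benchmark_args = [{"judge_step_fn": "pkg.mod::create_judge_tasks"}, {}]
fixed = []
for args in per_benchmark_args:
    benchmark_fn = args.pop("judge_step_fn", cli_default)
    fixed.append(benchmark_fn)
assert fixed == ["pkg.mod::create_judge_tasks", None]
```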

Comment on lines +486 to +488
if judge_step_fn:
    has_tasks = True
    if not callable(judge_step_fn):

⚠️ Potential issue | 🟡 Minor

has_tasks = True is set before confirming the judge creator returns tasks.

If judge_step_fn() returns [] (e.g., all outputs already exist), has_tasks is still True. In an edge case where this is the only scheduled work, run_exp would be called on an experiment with no tasks. Consider moving has_tasks = True inside the if judge_tasks: guard at line 554, consistent with how _generate returning None is handled.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 486 - 488, The flag has_tasks is
being set unconditionally when a judge_step_fn exists, which can be incorrect if
judge_step_fn() returns an empty list; change the logic in the block handling
judge_step_fn so that you only set has_tasks = True after calling judge_step_fn
and confirming it returned a non-empty list (e.g., inside the if judge_tasks:
guard that processes the returned tasks), mirroring the existing pattern used
when _generate returns None; ensure run_exp is only triggered when has_tasks is
true after this check.
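A minimal sketch of the suggested truthiness-based flag (the function names here are illustrative, not the actual eval.py code):

```python
def schedule(judge_step_fn):
    """Only flag real work: an empty task list or no creator should not trigger run_exp."""
    has_tasks = False
    judge_tasks = judge_step_fn() if judge_step_fn else None
    if judge_tasks:  # [] and None both skip; a non-empty list sets the flag
        has_tasks = True
    return has_tasks

assert schedule(lambda: []) is False       # all outputs already exist
assert schedule(None) is False             # no judge configured
assert schedule(lambda: ["task"]) is True  # actual work scheduled
```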

Comment on lines +94 to +96
if not remaining_jobs or all(not chunks for chunks in remaining_jobs.values()):
    LOG.info(f"Skipping Comet judge for {benchmark} - all output files and .done markers exist")
    return []

⚠️ Potential issue | 🟡 Minor

Early-return skips has_tasks = True in the caller — verify the caller handles empty returns correctly.

When remaining_jobs is empty, this returns []. In eval.py (line 487), has_tasks is set to True before calling judge_step_fn, so if this is the only task scheduled and it returns empty, the experiment will still attempt to run with no actual tasks. This is a minor inconsistency — the if judge_tasks: guard at line 554 prevents the worst outcome (NoneType errors), but the experiment still launches unnecessarily.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 94 - 96, The
early-return in the Comet judge currently returns an empty list which leaves the
caller's pre-set has_tasks=True unchanged; change the early-return to return
None (or another clearly falsy sentinel) from the judge function so callers can
detect "no work" by truthiness, and update the caller in eval.py (the code
around line 487 / the logic that sets has_tasks before calling judge_step_fn) to
set has_tasks based on the truthiness of judge_step_fn's result (e.g., assign
has_tasks = bool(judge_tasks) after calling judge_step_fn) so the experiment
isn't launched when there are no actual tasks. Ensure references to the Comet
judge function (judge_step_fn) and the caller in eval.py are updated
consistently.

Comment on lines +31 to +32
f" --model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B "
f" --server_address=https://inference-api.nvidia.com/v1/ "

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

NVIDIA NIM inference API endpoint documentation 2025

💡 Result:

NVIDIA NIM inference API endpoint docs (2025-era)

1) NVIDIA-hosted NIM (API Catalog) – LLM inference

2) Self-hosted / Downloadable NIM for LLMs – OpenAPI + endpoints

The NIM for LLMs API reference lists these key endpoints (OpenAI-compatible):

  • GET /v1/health/ready
  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/completions (docs.nvidia.com)
    It also notes experimental endpoints like /v1/responses and Llama Stack experimental routes. (docs.nvidia.com)

3) Retrieval (embeddings / rerank) NIM APIs

NVIDIA’s “Retrieval APIs” reference covers embedding and reranking endpoints for NeMo Retriever NIMs. (docs.api.nvidia.com)

Quick example (self-hosted NIM)

curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"model_name","prompt":"Once upon a time","max_tokens":64}'

Example shown in NVIDIA’s NIM microservices page. (nvidia.com)



🌐 Web query:

Nemotron-3-Nano-30B-A3B model NVIDIA API

💡 Result:

Nemotron-3-Nano-30B-A3B is available in the NVIDIA NIM API as nvidia/nemotron-3-nano-30b-a3b. The chat endpoint is OpenAI-compatible and uses Bearer auth. [1], [2]

HTTP endpoint (Chat Completions):
POST https://integrate.api.nvidia.com/v1/chat/completions [2]

Request (example):

curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -d '{
    "model": "nvidia/nemotron-3-nano-30b-a3b",
    "messages": [{"role":"user","content":"Write a haiku about GPUs"}]
  }'

Model/API reference and model details are in NVIDIA’s NIM docs. [1]

Sources:
[1] NVIDIA NIM API Reference – nvidia / nemotron-3-nano-30b-a3b: https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-nano-30b-a3b
[2] NVIDIA NIM API Reference – Chat Completions endpoint for this model: https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-nano-30b-a3b-infer


🌐 Web query:

NVIDIA inference-api.nvidia.com versus integrate.api.nvidia.com endpoint

💡 Result:

  • https://integrate.api.nvidia.com is NVIDIA’s documented, public base URL for the NVIDIA API Catalog (NIM hosted endpoints). For LLMs, it exposes an OpenAI-compatible endpoint like POST /v1/chat/completions. [1]

    • Note: NVIDIA documents at least one OpenAI-style endpoint difference vs “generic OpenAI” expectations—e.g., /v1/completions is not supported on integrate.api.nvidia.com/v1 (use chat-completions-style APIs instead, or the specific service docs). [2]
  • https://inference-api.nvidia.com is not the base URL shown in NVIDIA’s NIM API Catalog reference docs; hitting the root returns "LiteLLM: RUNNING", which indicates it’s a LiteLLM-powered gateway/proxy rather than the primary documented public entrypoint. [3]

    • Practical implication: treat it as internal/legacy/undocumented unless NVIDIA specifically instructed you to use it for your account/tenant—prefer integrate.api.nvidia.com for supported integrations. [1][3]

Sources:
[1] NVIDIA NIM LLM APIs reference (base URL + /v1/chat/completions) (docs.api.nvidia.com)
[2] NeMo Guardrails doc noting integrate.api.nvidia.com/v1 doesn’t support /completions (docs.nvidia.com)
[3] inference-api.nvidia.com root response (inference-api.nvidia.com)



Revert endpoint and model identifier to documented NVIDIA NIM API standards.

The endpoint change to inference-api.nvidia.com is incorrect. NVIDIA's official documented base URL for the NIM API Catalog is https://integrate.api.nvidia.com. The inference-api.nvidia.com endpoint is an undocumented internal LiteLLM gateway and should not be used for production integrations.

The model identifier format is also incorrect. The correct format per NVIDIA's NIM API documentation is nvidia/nemotron-3-nano-30b-a3b (lowercase with single nvidia/ prefix), not the mixed-case Nemotron-3-Nano-30B-A3B. The tests will fail with these changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_generation.py` around lines 31 - 32, The test uses an incorrect
NVIDIA NIM API model identifier and endpoint: replace the model string
"--model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B" with the documented lowercase
single-prefix identifier "--model=nvidia/nemotron-3-nano-30b-a3b" and revert the
server address "--server_address=https://inference-api.nvidia.com/v1/" to the
official base URL "--server_address=https://integrate.api.nvidia.com"; update
these exact f-string literals in tests/test_generation.py so the test uses the
documented NIM API formats.

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@peri044
Collaborator Author

peri044 commented Feb 24, 2026

@peri044 I renamed the parameter to judge_step_fn, I think that's a bit clearer as judge_path looks like a path to judge model. Also fixed DCO and a bug that caused one of the tests to fail. Let me know if naming looks good to you and we can merge after that

Thank you @Kipok for the fixes. The naming looks good to me.

@gwarmstrong gwarmstrong enabled auto-merge (squash) February 25, 2026 20:42
@gwarmstrong gwarmstrong merged commit 9fa8e83 into main Feb 25, 2026
6 of 10 checks passed
@gwarmstrong gwarmstrong deleted the peri044/external_judge branch February 25, 2026 20:56
sgunasekar added a commit that referenced this pull request Mar 11, 2026
commit a5da597
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Mar 6 12:13:36 2026 -0800

    Revert "Eval kit support  (#1239)" (#1294)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit b237e33
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Mar 6 20:25:37 2026 +0400

    Eval kit support  (#1239)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

commit dc28bbf
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Mar 5 10:17:44 2026 -0800

    Python direct tool calling without MCP (#1286)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 12454dd
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Mar 4 13:06:21 2026 -0800

    Allow het servers for nemo-rl jobs (#1223)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 8884a68
Author: Prasoon Varshney <prasoon1995@gmail.com>
Date:   Wed Mar 4 10:24:02 2026 -0800

    Support source_lang param for translation recipe (#1290)

    Signed-off-by: Prasoon Varshney <prasoonv@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 4618b19
Author: Meriem B. <113170426+ka00ri@users.noreply.github.com>
Date:   Wed Mar 4 18:59:28 2026 +0100

    Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)

    Signed-off-by: Meriem Boubdir <mboubdir@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 5ac8609
Author: Talor Abramovich <talor19@gmail.com>
Date:   Wed Mar 4 02:30:06 2026 +0200

    Add SPEED-Bench (within repo) (#1279)

    Signed-off-by: Talor Abramovich <talora@nvidia.com>
    Signed-off-by: talora <talora@nvidia.com>
    Signed-off-by: Talor Abramovich <talor19@gmail.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Igor Gitman <igor.a.gitman@gmail.com>

commit c31eec5
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 12:18:15 2026 -0800

    Fix os.getlogin() crash in ns setup (#1289)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit c228e66
Author: George Armstrong <georgea@nvidia.com>
Date:   Tue Mar 3 11:04:54 2026 -0800

    Fix streaming TypeError when delta.content is None (#1267) (#1288)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit aa47923
Author: Matvei Novikov <mnovikov@nvidia.com>
Date:   Mon Mar 2 16:28:41 2026 -0800

    Add LibTrace recipe for generating domain-specific reasoning data (#1224)

    Signed-off-by: jubick1337 <mnovikov@nvidia.com>
    Signed-off-by: mnovikov <mnovikov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 313cad7
Author: Stephen Ge <stepheng@nvidia.com>
Date:   Mon Mar 2 18:28:49 2026 -0500

    fix: clean parse-failure retries in prover (#1284)

    Signed-off-by: Stephen Ge <stepheng@nvidia.com>

commit 813cfa3
Author: George Armstrong <georgea@nvidia.com>
Date:   Mon Mar 2 15:10:08 2026 -0800

    tst: rollback inference-api to integrate (#1287)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 31735f9
Author: Valentin Mendelev <vmendelev@nvidia.com>
Date:   Mon Mar 2 23:11:25 2026 +0100

    Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)

    Signed-off-by: Valentin Mendelev <vmendelev@nvidia.com>

commit d4ef8c0
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Fri Feb 27 23:58:54 2026 +0400

    Update promt_config to working with openai format + inline setup (#1210)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit e879cbc
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:41:23 2026 -0800

    Update noc tutorial (#1282)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit f6e3505
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 27 10:17:33 2026 -0800

    Add noc reasoning tutorial (#1278)

    Signed-off-by: Amparo Canaveras <acanaveras@nvidia.com>
    Signed-off-by: rajeshwarid179 <rdevaramani@nvidia.com>
    Signed-off-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Amparo Canaveras <acanaveras@nvidia.com>
    Co-authored-by: Cursor <cursoragent@cursor.com>
    Co-authored-by: acanaveras <142839082+acanaveras@users.noreply.github.com>
    Co-authored-by: rajeshwarid179 <rdevaramani@nvidia.com>

commit fc2072a
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 27 10:10:25 2026 -0800

    CritPt generation add prompt_format=None (#1280)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit c8abe5d
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 27 09:31:26 2026 -0800

    New slurm customization parameters (account, containers) (#1209)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 2b38cce
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 25 17:59:52 2026 -0800

    Add nemo-skills-core subpackage for lightweight installs (#1229)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 9fa8e83
Author: Dheeraj Peri <peri.dheeraj@gmail.com>
Date:   Wed Feb 25 12:56:35 2026 -0800

    feat: add custom judge type support for external repo integration (#1274)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com>
    Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>

commit 8a32b13
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 24 15:24:42 2026 -0800

    Exclude numb3rs from test_eval.py (#1275)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6da2219
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Mon Feb 23 18:37:46 2026 +0400

    Numb3rs ds addition (#1174)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

commit ad034b5
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Sun Feb 22 11:55:24 2026 -0800

    Add DSBench-DA evaluation (#1254)

    Squash merge of changes during code-review.
    Signed-off-by: suriya <sgunasekar@nvidia.com>

commit 7593ab3
Author: Jiacheng Xu <jcxu@utexas.edu>
Date:   Fri Feb 20 16:42:01 2026 -0800

    Add CritPt benchmark (#1200)

    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 58c31b2
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 20 16:19:22 2026 -0800

    Fix no_answer metric overcounting in _compute_pass_at_k (#1245)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 1f1a2e7
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 15:58:40 2026 -0800

    Fix incorrect prompt tokens count due to HF api update (#1264)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8ebc6f5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 20 09:05:33 2026 -0800

    Remove deprecated dataset group (#1263)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit ea4177f
Author: Yongqiang Wang <yongqiang.seagull@gmail.com>
Date:   Thu Feb 19 19:57:25 2026 -0500

    fix deps (#1258)

commit 60905a7
Author: Minho Ryu <ryumin93@gmail.com>
Date:   Fri Feb 20 09:39:39 2026 +0900

    Add aime26 (#1256)

    Signed-off-by: bzantium <ryumin93@gmail.com>

commit b28afc5
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:18:25 2026 -0800

    Rename custom -> external benchmarks (#1262)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 6cc9c45
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:10:33 2026 -0800

    Add reference to internal benchmarks repo (#1261)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 5202af6
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 16:08:05 2026 -0800

    Remove incorrect presence-penalty setting (#1259)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 144c70b
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 19 15:26:33 2026 -0800

    Adding an option to store benchmarks in external repo (#1240)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>

commit 10e6e39
Author: George <37293288+Jorjeous@users.noreply.github.com>
Date:   Thu Feb 19 19:57:21 2026 +0400

    Update vllm multimodal for API call convenience (#1213)

    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Co-authored-by: mmkrtchyan <mmkrtchyan@nvidia.com>

commit 1ba4219
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Wed Feb 18 03:28:23 2026 +0400

    Fix --server_container not being applied to dependent jobs (#1244)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit 9517614
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Mon Feb 16 11:13:24 2026 -0800

    Support mini-swe-agent as agent harness (#1212)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>
    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Signed-off-by: George Armstrong <georgea@nvidia.com>
    Signed-off-by: Charlie Truong <chtruong@nvidia.com>
    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Stephen Ge <stepheng@nvidia.com>
    Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com>
    Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
    Signed-off-by: Mateusz Winiarek <mwiniarek@nvidia.com>
    Signed-off-by: mmkrtchyan <mmkrtchyan@nvidia.com>
    Signed-off-by: Wei Du <wedu@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Signed-off-by: George <37293288+Jorjeous@users.noreply.github.com>
    Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
    Co-authored-by: Ivan <imoshkov@nvidia.com>
    Co-authored-by: George Armstrong <georgea@nvidia.com>
    Co-authored-by: Charlie Truong <chtruong@nvidia.com>
    Co-authored-by: Nick Ludwig <nliudvig@nvidia.com>
    Co-authored-by: Wojciech Prazuch <wojciechprazuch3@gmail.com>
    Co-authored-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
    Co-authored-by: Minho Ryu <ryumin93@gmail.com>
    Co-authored-by: Stephen Ge <stepheng@nvidia.com>
    Co-authored-by: Jiacheng Xu <jcxu@utexas.edu>
    Co-authored-by: Jiacheng Xu <jiachengx@nvidia.com>
    Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com>
    Co-authored-by: Sanyam Kapoor <sanyamk@nvidia.com>
    Co-authored-by: Mateusz Winiarek <72758259+Froxyy-dev@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
    Co-authored-by: Meline Mkrtchyan <72409758+melllinia@users.noreply.github.com>
    Co-authored-by: Wei Du <wedu@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sean Naren <snarenthiran@nvidia.com>
    Co-authored-by: Mehrzad Samadi <mehrzadsamadi@gmail.com>
    Co-authored-by: anowaczynski-nvidia <anowaczynski@nvidia.com>
    Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>

commit a3d44dc
Author: Suriya Gunasekar <sgunasekar@users.noreply.github.com>
Date:   Fri Feb 13 22:32:15 2026 -0800

    Add --installation_command support to prepare_data (#1243)

    Signed-off-by: suriya <sgunasekar@nvidia.com>
    Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

commit e80d524
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 12 17:26:00 2026 -0800

    Fix CI disk space for Docker image builds (#1241)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit d22236c
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Wed Feb 11 17:55:00 2026 -0800

    Fix answerbench prompt parsing (#1235)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 2401628
Author: George Armstrong <georgea@nvidia.com>
Date:   Wed Feb 11 14:56:43 2026 -0800

    feat: add lockfiles for reproducible sandbox builds (#1233)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5a0a84d
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Wed Feb 11 13:30:03 2026 -0800

    removing datasets version restriction for LCB eval (#1230)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit ef0a890
Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>
Date:   Wed Feb 11 12:03:16 2026 +0400

    Gnalbandyan/add physics (#1214)

    Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com>
    Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com>

commit bd9d30c
Author: Wasi Ahmad <wasiahmad@ucla.edu>
Date:   Tue Feb 10 15:13:27 2026 -0800

    LCB generic prompting (#1215)

    Signed-off-by: wasiahmad <wasiahmad@ucla.edu>

commit 7d6c49a
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Sat Feb 7 08:45:46 2026 -0800

    Add support for different variations of nemo-rl (#1220)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit b19ba96
Author: George Armstrong <georgea@nvidia.com>
Date:   Fri Feb 6 21:40:56 2026 -0800

    Add multi-node sandbox support for SLURM clusters (#1218)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 8950bb0
Author: anowaczynski-nvidia <anowaczynski@nvidia.com>
Date:   Sat Feb 7 01:38:00 2026 +0100

    support structured outputs in hle judge for optional AA compatibility (#1186)

    Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b84f7a2
Author: Igor Gitman <igitman@nvidia.com>
Date:   Fri Feb 6 14:51:02 2026 -0800

    A small update on running tests docs (#1219)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 8e838e1
Author: George Armstrong <georgea@nvidia.com>
Date:   Thu Feb 5 18:01:35 2026 -0800

    feat: add flag to disable sandbox replay (#1217)

    Signed-off-by: George Armstrong <georgea@nvidia.com>

commit 5fd9085
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Feb 5 15:57:01 2026 -0800

    Add an option to limit number of tool calls (#1216)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit d820200
Author: Igor Gitman <igitman@nvidia.com>
Date:   Tue Feb 3 10:43:55 2026 -0800

    Add arena-hard v2 (#1205)

    Signed-off-by: bzantium <ryumin93@gmail.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: bzantium <ryumin93@gmail.com>

commit a30920e
Author: Igor Gitman <igitman@nvidia.com>
Date:   Mon Feb 2 10:53:55 2026 -0800

    Fix mkdocs warnings (#1204)

    Signed-off-by: Igor Gitman <igitman@nvidia.com>

commit 19d7788
Author: Ivan <imoshkov@nvidia.com>
Date:   Mon Feb 2 23:25:13 2026 +0500

    Fix infinite wait in sandbox.wait_for_sandbox (#1206)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>

commit 3e65fbf
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Fri Jan 30 19:38:38 2026 -0800

    Improve tts (#1203)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 250c862
Author: Nick Ludwig <nliudvig@nvidia.com>
Date:   Fri Jan 30 22:12:29 2026 +0400

    SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)

    Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>

commit 7ded756
Author: Ivan <imoshkov@nvidia.com>
Date:   Fri Jan 30 09:57:41 2026 +0500

    Add proper token counting to code execution model (#1184)

    Signed-off-by: i-vainn <imoshkov@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>

commit b986304
Author: Igor Gitman <igitman@nvidia.com>
Date:   Thu Jan 29 17:57:07 2026 -0800

    Upgrade containers (#1198)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com>

commit 3b44f02
Author: Dan Lord <blahblahasdf@gmail.com>
Date:   Thu Jan 29 16:40:47 2026 -0800

    Fix incorrect string format (#1199)

    Signed-off-by: dlord <dlord@nvidia.com>

commit c4854b8
Author: Sadegh Mahdavi <smahdavi4@gmail.com>
Date:   Thu Jan 29 13:43:36 2026 -0800

    Update nemo-rl to latest (#1087)

    Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
    Signed-off-by: Igor Gitman <igitman@nvidia.com>
    Co-authored-by: Igor Gitman <igitman@nvidia.com>