feat: add custom judge type support for external repo integration #1274
gwarmstrong merged 24 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough
Replaces per-benchmark …
Changes
Sequence Diagram(s)
sequenceDiagram
participant Client
participant Eval as eval()
participant Loader as locate()
participant JudgeFactory as Judge Creator
participant TaskMgmt as Task Management
participant Cluster as Cluster Config
Client->>Eval: eval(..., judge_step_fn, judge_pipeline_args)
alt judge_step_fn provided
Eval->>Loader: load(judge_step_fn)
Loader-->>Eval: judge_creator_fn
Eval->>JudgeFactory: judge_creator_fn(exp, benchmark, judge_pipeline_args, output_dir, ...)
JudgeFactory->>TaskMgmt: get_remaining_jobs(...)
JudgeFactory->>Cluster: read cluster_config / GPUs / partition
JudgeFactory->>TaskMgmt: add_task(cmd, deps, cluster settings)
TaskMgmt-->>JudgeFactory: task(s) created
JudgeFactory-->>Eval: list of judge tasks
else fallback (LLM)
Eval->>TaskMgmt: _create_llm_judge_tasks(...)
TaskMgmt-->>Eval: llm judge tasks
end
Eval-->>Client: evaluation scheduled/completed
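The `locate()` step in the diagram follows the repo's dotted-path convention for loading a judge creator by string. Below is a rough, hypothetical sketch of that pattern; the real helper lives in `nemo_skills.dataset.utils` and may differ in details:

```python
# Hypothetical sketch of a locate()-style resolver: turns a string such as
# "pkg.module::attr" (or dotted "pkg.module.attr") into the named attribute.
import importlib

def locate(path: str):
    """Resolve 'pkg.module::attr' or 'pkg.module.attr' to the attribute object."""
    if "::" in path:
        module_name, attr = path.split("::", 1)
    else:
        module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# e.g. judge_creator_fn = locate("nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks")
```

This lets external repos register judge creators by name without modifying nemo-skills source, which is the point of the PR.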
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
483-514: ⚠️ Potential issue | 🟡 Minor
`has_tasks` not set to `True` in the dynamic judge path. The LLM judge branch (line 514) sets `has_tasks = True`, but the dynamic `judge_creator_path` branch does not. If the dynamic judge creator returns tasks, these tasks are added to `all_tasks` (line 547) but the experiment may not be run if no main eval jobs were scheduled (since `has_tasks` gates `run_exp` at line 647). In practice this is unlikely to cause issues since main eval jobs typically set `has_tasks`, but for consistency and correctness:
Proposed fix
  if judge_creator_path:
+     has_tasks = True
      # Use locate() to dynamically load judge creator function
      from nemo_skills.dataset.utils import locate
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 483 - 514, The dynamic judge_creator_path branch calls locate() to get judge_creator_fn and assigns its result to judge_tasks but never sets has_tasks; update the branch in the judge_creator_path flow so that after calling judge_creator_fn (the judge_tasks variable) you set has_tasks = True if judge_tasks is truthy (or always set has_tasks = True if you prefer), ensuring the existing all_tasks handling and the run_exp gate that depends on has_tasks will run the experiment when dynamic judge tasks are produced.
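The gating pattern this fix targets can be shown in isolation. The function below is an illustrative toy, not the actual `eval.py` code; names follow the review text:

```python
# Toy illustration of the has_tasks gate: mark work only when the dynamic
# judge creator actually returned tasks, so judge-only runs still execute.
def schedule_judges(judge_creator_fn, all_tasks):
    has_tasks = False
    judge_tasks = judge_creator_fn()
    if judge_tasks:
        has_tasks = True  # without this, a judge-only run would be skipped
        all_tasks.extend(judge_tasks)
    return has_tasks
```

Setting the flag inside the `if judge_tasks:` guard (rather than unconditionally) also handles the case where the creator returns an empty list.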
🧹 Nitpick comments (1)
nemo_skills/pipeline/eval.py (1)
475-487: Move import to the top of the file.
The `from nemo_skills.dataset.utils import locate` on line 485 is inside a loop body. While functionally fine (Python caches module imports), placing it at the top of the file with the other imports is more conventional and avoids repeated import lookups.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 475 - 487, The import of locate is inside the branch that handles judge_creator_path (see judge_creator_path and judge_creator_fn) which causes a local import in the loop; move "from nemo_skills.dataset.utils import locate" up to the module imports at the top of the file and remove the in-function import so judge_creator_fn = locate(judge_creator_path) uses the top-level imported locate; update any import grouping to include locate alongside the other imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 157: The help string for the typer.Option in the judge_step variable is
too long; update the judge_step definition so the help argument is split across
multiple lines (use implicit string concatenation or wrap the typer.Option(...)
call in parentheses and break the help string into multiple quoted segments) so
the line length is reduced and ruff-format will pass; keep the same text (e.g.,
split "Path to the judge creator function to use for the judge (locate()
convention). Eg: nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks")
across lines in the judge_step = typer.Option(...) call.
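The wrapping technique this prompt suggests relies on Python's implicit concatenation of adjacent string literals. A standalone sketch (the surrounding `typer.Option(...)` call is omitted so this runs without typer installed):

```python
# Adjacent string literals inside parentheses are joined at compile time,
# so the help text stays one string while each source line stays short
# enough to satisfy ruff-format's line-length limit.
help_text = (
    "Path to the judge creator function to use for the judge (locate() convention). "
    "Eg: nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks"
)
assert "\n" not in help_text  # still a single line of help output
```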
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 100-106: In the branch where input_file is None, replace the
dictionary access judge_pipeline_args.get("input_dir") with direct indexing
judge_pipeline_args["input_dir"] so that missing input_dir raises an error
instead of generating a "--input-dir None" argument; update the code around the
input_file/input_dir handling that builds script_args (references: input_file,
judge_pipeline_args, script_args, num_seeds) to use
judge_pipeline_args["input_dir"] and keep adding the "--input-dir" and
"--num-seeds" entries as before.
- Around line 75-77: Replace the permissive dict.get() calls with direct
dictionary access so missing critical keys fail fast: change uses of
judge_pipeline_args.get("output_dir"), .get("input_file"), and
.get("judge_model") to judge_pipeline_args["output_dir"],
judge_pipeline_args["input_file"], and judge_pipeline_args["judge_model"] in
comet_judge.py (variables output_dir_path, input_file, comet_model_path) so a
KeyError is raised immediately if those required arguments are absent.
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 75-76: The code currently uses
judge_pipeline_args.get("output_dir") which hides missing-key errors; replace
the .get usage for output_dir with direct dictionary access (use
judge_pipeline_args["output_dir"]) so the code fails fast with a clear KeyError
if the caller stops providing it; update the reference to output_dir_path
accordingly (leave input_file handling unchanged unless you also expect it to be
mandatory) and run tests to ensure no other call sites assume .get semantics.
- Around line 100-102: The code path in nvembed_judge.py that handles when
input_file is None currently uses judge_pipeline_args.get("input_dir") which can
silently return None and interpolate it into script_args; update that branch to
access the required key directly (judge_pipeline_args["input_dir"]) so the code
fails fast if input_dir is missing and prevents None from being appended to the
command string; locate the block around the if input_file is None check and
replace the .get() access with direct dictionary indexing.
---
Outside diff comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 483-514: The dynamic judge_creator_path branch calls locate() to
get judge_creator_fn and assigns its result to judge_tasks but never sets
has_tasks; update the branch in the judge_creator_path flow so that after
calling judge_creator_fn (the judge_tasks variable) you set has_tasks = True if
judge_tasks is truthy (or always set has_tasks = True if you prefer), ensuring
the existing all_tasks handling and the run_exp gate that depends on has_tasks
will run the experiment when dynamic judge tasks are produced.
---
Nitpick comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 475-487: The import of locate is inside the branch that handles
judge_creator_path (see judge_creator_path and judge_creator_fn) which causes a
local import in the loop; move "from nemo_skills.dataset.utils import locate" up
to the module imports at the top of the file and remove the in-function import
so judge_creator_fn = locate(judge_creator_path) uses the top-level imported
locate; update any import grouping to include locate alongside the other
imports.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/evaluation/multilingual.md
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/pipeline/eval.py
- nemo_skills/pipeline/judges/__init__.py
- nemo_skills/pipeline/judges/comet_judge.py
- nemo_skills/pipeline/judges/nvembed_judge.py
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
comet_model_path = judge_pipeline_args.get("judge_model")
Use direct dictionary access for keys that are expected to be present.
output_dir is always set by eval.py before calling the judge creator (line 473), and judge_model is critical for COMET evaluation — if it's missing, the command will interpolate None as a string into the CLI args (--comet-model-path None), producing a silent, hard-to-debug failure. Use judge_pipeline_args["key"] to fail fast with a clear KeyError.
Proposed fix
- output_dir_path = judge_pipeline_args.get("output_dir")
- input_file = judge_pipeline_args.get("input_file")
- comet_model_path = judge_pipeline_args.get("judge_model")
+ output_dir_path = judge_pipeline_args["output_dir"]
+ input_file = judge_pipeline_args.get("input_file")  # legitimately optional
+ comet_model_path = judge_pipeline_args["judge_model"]

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
output_dir_path = judge_pipeline_args["output_dir"]
input_file = judge_pipeline_args.get("input_file")  # legitimately optional
comet_model_path = judge_pipeline_args["judge_model"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 75 - 77, Replace the
permissive dict.get() calls with direct dictionary access so missing critical
keys fail fast: change uses of judge_pipeline_args.get("output_dir"),
.get("input_file"), and .get("judge_model") to
judge_pipeline_args["output_dir"], judge_pipeline_args["input_file"], and
judge_pipeline_args["judge_model"] in comet_judge.py (variables output_dir_path,
input_file, comet_model_path) so a KeyError is raised immediately if those
required arguments are absent.
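The fail-fast rationale behind this guideline is easy to reproduce with a toy dict (not the real `judge_pipeline_args`):

```python
# Toy demonstration: .get() lets a missing key leak into the command string
# as the literal "None", while direct indexing raises at the call site.
args = {"input_file": "preds.jsonl"}  # "judge_model" deliberately missing

cmd = f"--comet-model-path {args.get('judge_model')}"
assert cmd == "--comet-model-path None"  # silent, hard-to-debug failure

try:
    cmd = f"--comet-model-path {args['judge_model']}"
except KeyError as exc:
    print(f"missing required arg: {exc}")  # fails fast with a clear error
```

The first form produces a command that only breaks much later, inside the judge job; the second breaks immediately where the argument was forgotten.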
if input_file is None:
    input_dir = judge_pipeline_args.get("input_dir")
    script_args.append(f"--input-dir {input_dir}")
    script_args.append(f"--num-seeds {num_seeds}")
else:
    script_args.append(f"--input-file {input_file}")
Use direct access for input_dir — it is always set when input_file is None.
In the input_file is None branch, eval.py (line 468) always populates input_dir in judge_pipeline_args. Using .get() here would silently produce --input-dir None in the command string instead of failing fast.
Proposed fix
- input_dir = judge_pipeline_args.get("input_dir")
+ input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 100 - 106, In the
branch where input_file is None, replace the dictionary access
judge_pipeline_args.get("input_dir") with direct indexing
judge_pipeline_args["input_dir"] so that missing input_dir raises an error
instead of generating a "--input-dir None" argument; update the code around the
input_file/input_dir handling that builds script_args (references: input_file,
judge_pipeline_args, script_args, num_seeds) to use
judge_pipeline_args["input_dir"] and keep adding the "--input-dir" and
"--num-seeds" entries as before.
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
Use direct dictionary access for output_dir — it is always populated by the caller.
Same as the Comet judge: output_dir is guaranteed to be set by eval.py (line 473) before invoking the judge creator. Direct access provides clear error messages if assumptions change.
Proposed fix
- output_dir_path = judge_pipeline_args.get("output_dir")
- input_file = judge_pipeline_args.get("input_file")
+ output_dir_path = judge_pipeline_args["output_dir"]
+ input_file = judge_pipeline_args.get("input_file")  # legitimately optional

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 75 - 76, The code
currently uses judge_pipeline_args.get("output_dir") which hides missing-key
errors; replace the .get usage for output_dir with direct dictionary access (use
judge_pipeline_args["output_dir"]) so the code fails fast with a clear KeyError
if the caller stops providing it; update the reference to output_dir_path
accordingly (leave input_file handling unchanged unless you also expect it to be
mandatory) and run tests to ensure no other call sites assume .get semantics.
if input_file is None:
    input_dir = judge_pipeline_args.get("input_dir")
    script_args.append(f"--input-dir {input_dir}")
Use direct access for input_dir.
When input_file is None, eval.py always sets input_dir in judge_pipeline_args (line 468). Using .get() would silently interpolate None into the command string.
Proposed fix
- input_dir = judge_pipeline_args.get("input_dir")
+ input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if input_file is None:
    input_dir = judge_pipeline_args["input_dir"]
    script_args.append(f"--input-dir {input_dir}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 100 - 102, The
code path in nvembed_judge.py that handles when input_file is None currently
uses judge_pipeline_args.get("input_dir") which can silently return None and
interpolate it into script_args; update that branch to access the required key
directly (judge_pipeline_args["input_dir"]) so the code fails fast if input_dir
is missing and prevents None from being appended to the command string; locate
the block around the if input_file is None check and replace the .get() access
with direct dictionary indexing.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
475-511: ⚠️ Potential issue | 🟠 Major
Handle conflicting judge_path sources and mark dynamic judge tasks as work.
Two issues here:
- If a benchmark defines `judge_path`, CLI `--judge_path` is silently ignored. Prefer CLI override or raise on conflict.
- When dynamic judge tasks are created, `has_tasks` stays false, so runs with only judge tasks (and `auto_summarize_results=False`) won't execute.
✅ Proposed fix
- judge_creator_path = judge_pipeline_args.pop("judge_path", judge_path)
+ benchmark_judge_path = judge_pipeline_args.pop("judge_path", None)
+ if judge_path and benchmark_judge_path and judge_path != benchmark_judge_path:
+     raise ValueError(
+         f"Conflicting judge_path: CLI={judge_path} vs benchmark={benchmark_judge_path}"
+     )
+ judge_creator_path = judge_path or benchmark_judge_path
@@
- if judge_tasks:
+ if judge_tasks:
+     has_tasks = True
      benchmark_to_judge_tasks[benchmark] = judge_tasks
      all_tasks.extend(judge_tasks)

As per coding guidelines, "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided. Use dataclasses or **kwargs to automatically handle parameter validation."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` around lines 475 - 511, The code currently lets a benchmark-provided judge_path in judge_pipeline_args silently override a CLI judge_path and creates dynamic judge tasks without marking the run as having work; change the logic so CLI judge_path takes precedence: retrieve any benchmark value with judge_pipeline_args.pop("judge_path", None), then set judge_creator_path = judge_path if judge_path is truthy else the popped value (and remove the key from judge_pipeline_args to avoid silent conflicts), and optionally log a warning if both were provided; after calling judge_creator_fn(...) and assigning judge_tasks, ensure the run is marked as having work by setting has_tasks = True (or marking the created judge_tasks as work) when judge_tasks is non-empty so runs with only dynamic judge tasks will execute even if auto_summarize_results=False.
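The precedence logic described in this prompt can be sketched as a small helper. The name `resolve_judge_creator_path` is illustrative, not the merged implementation:

```python
# Sketch of the suggested precedence: CLI value wins, the benchmark value is
# a fallback, and a genuine conflict raises instead of being silently dropped.
def resolve_judge_creator_path(cli_judge_path, judge_pipeline_args):
    benchmark_judge_path = judge_pipeline_args.pop("judge_path", None)
    if cli_judge_path and benchmark_judge_path and cli_judge_path != benchmark_judge_path:
        raise ValueError(
            f"Conflicting judge_path: CLI={cli_judge_path} vs benchmark={benchmark_judge_path}"
        )
    return cli_judge_path or benchmark_judge_path
```

Popping the key also removes it from `judge_pipeline_args`, so it cannot later be re-read and shadow the resolved value.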
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 351: The current eval_requires_judge computation ignores a user-specified
--judge_path when no other CLI judge args are truthy; update the logic that
computes eval_requires_judge (currently using cli_judge_pipeline_args and the
variable eval_requires_judge) so that the presence of
cli_judge_pipeline_args['judge_path'] (non-empty string) forces
eval_requires_judge to True and/or raises a validation error if judge_path is
provided without required companion args; locate the code that builds
cli_judge_pipeline_args and the eval_requires_judge assignment and either
include an explicit check for judge_path or add parameter validation (e.g., in
the same function that constructs cli_judge_pipeline_args) to fail fast for
unsupported or incomplete judge-related args.
---
Outside diff comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 475-511: The code currently lets a benchmark-provided judge_path
in judge_pipeline_args silently override a CLI judge_path and creates dynamic
judge tasks without marking the run as having work; change the logic so CLI
judge_path takes precedence: retrieve any benchmark value with
judge_pipeline_args.pop("judge_path", None), then set judge_creator_path =
judge_path if judge_path is truthy else the popped value (and remove the key
from judge_pipeline_args to avoid silent conflicts), and optionally log a
warning if both were provided; after calling judge_creator_fn(...) and assigning
judge_tasks, ensure the run is marked as having work by setting has_tasks = True
(or marking the created judge_tasks as work) when judge_tasks is non-empty so
runs with only dynamic judge tasks will execute even if
auto_summarize_results=False.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- docs/evaluation/multilingual.md
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/pipeline/eval.py
♻️ Duplicate comments (1)
nemo_skills/pipeline/eval.py (1)
354-354: ⚠️ Potential issue | 🟠 Major
Ensure `--judge_path` forces judge scheduling.
`eval_requires_judge` ignores a user-specified `judge_path`, so judge jobs can be skipped for benchmarks that don't require a judge by default. This silently drops a user-provided parameter.
✅ Proposed fix
- eval_requires_judge = any(param_value for param_value in cli_judge_pipeline_args.values())
+ eval_requires_judge = bool(judge_path) or any(param_value for param_value in cli_judge_pipeline_args.values())

As per coding guidelines: "Avoid silently ignoring unused user-passed parameters; the code should fail if a required argument is not specified or if unsupported arguments are provided. Use dataclasses or **kwargs to automatically handle parameter validation."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/eval.py` at line 354, The current eval_requires_judge computation ignores a user-specified judge_path in cli_judge_pipeline_args; update the logic so that presence/non-empty cli_judge_pipeline_args.get('judge_path') forces scheduling of judge jobs (e.g., include a check for truthiness of 'judge_path' in the eval_requires_judge expression or set eval_requires_judge = True when cli_judge_pipeline_args.get('judge_path') is provided), and additionally validate unexpected/unsupported CLI judge args by raising an error or warning if unknown keys are present in cli_judge_pipeline_args to avoid silently ignoring user parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nemo_skills/pipeline/eval.py`:
- Line 354: The current eval_requires_judge computation ignores a user-specified
judge_path in cli_judge_pipeline_args; update the logic so that
presence/non-empty cli_judge_pipeline_args.get('judge_path') forces scheduling
of judge jobs (e.g., include a check for truthiness of 'judge_path' in the
eval_requires_judge expression or set eval_requires_judge = True when
cli_judge_pipeline_args.get('judge_path') is provided), and additionally
validate unexpected/unsupported CLI judge args by raising an error or warning if
unknown keys are present in cli_judge_pipeline_args to avoid silently ignoring
user parameters.
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Add generic 'custom' judge type that allows external repos to register their own judge task creators using the locate() pattern. This enables external benchmarks to define judge tasks without modifying nemo-skills source code. Changes: - Add _create_custom_judge_tasks() function to dynamically import and call external judge creators - Add 'custom' judge type routing in prepare_eval_commands() - Uses existing locate() pattern (same as METRICS_TYPE, eval_type) This change is minimal (~50 lines) and reusable for any external repo. Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: suriya <sgunasekar@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Squash merge of changes during code-review. Signed-off-by: suriya <sgunasekar@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
b5f419f to d27e339 (Compare)
@peri044 I renamed the parameter to judge_step_fn, I think that's a bit clearer, as judge_path looks like a path to a judge model. Also fixed DCO and a bug that caused one of the tests to fail. Let me know if the naming looks good to you and we can merge after that.
Actionable comments posted: 4
♻️ Duplicate comments (4)
nemo_skills/pipeline/judges/comet_judge.py (2)
100-106: Use direct access for `input_dir` — it is always set when `input_file is None`.
`eval.py` (line 472) always populates `input_dir` when `benchmark_seeds != 0`. Using `.get()` silently interpolates `None` into the command string.
Proposed fix
- input_dir = judge_pipeline_args.get("input_dir")
+ input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 100 - 106, When input_file is None, replace judge_pipeline_args.get("input_dir") with direct access judge_pipeline_args["input_dir"] so the constructed command uses the guaranteed input_dir value (do this in the block that appends to script_args alongside num_seeds); ensure you reference the variables input_file, judge_pipeline_args, input_dir, script_args, and num_seeds in the change.
75-77: Use direct dictionary access for keys expected to be present.
`output_dir` is always set by `eval.py` (line 477) before invoking this function, and `judge_model` is set via `setdefault` at line 496. Using `.get()` will silently produce `None` values that get interpolated as the string `"None"` in the command, leading to hard-to-debug failures (e.g., `--comet-model-path None`). `input_file` is legitimately optional since `None` drives the branching logic.
Proposed fix
- output_dir_path = judge_pipeline_args.get("output_dir")
- input_file = judge_pipeline_args.get("input_file")
- comet_model_path = judge_pipeline_args.get("judge_model")
+ output_dir_path = judge_pipeline_args["output_dir"]
+ input_file = judge_pipeline_args.get("input_file")  # legitimately optional
+ comet_model_path = judge_pipeline_args["judge_model"]

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead to allow proper error handling and fail fast with clear errors."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 75 - 77, The code uses judge_pipeline_args.get(...) for keys that must exist, causing silent None values; change to direct dictionary access for required keys: replace judge_pipeline_args.get("output_dir") and judge_pipeline_args.get("judge_model") with judge_pipeline_args["output_dir"] and judge_pipeline_args["judge_model"] respectively (leave input_file as judge_pipeline_args.get("input_file") since it is optional), so failures raise KeyError and fail fast; update any variable names output_dir_path and comet_model_path accordingly where they are assigned from judge_pipeline_args.
nemo_skills/pipeline/judges/nvembed_judge.py (2)
100-102: Use direct access for `input_dir`.
When `input_file is None`, `eval.py` always sets `input_dir`. Using `.get()` silently interpolates `None` into the command string.
Proposed fix
- input_dir = judge_pipeline_args.get("input_dir")
+ input_dir = judge_pipeline_args["input_dir"]

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 100 - 102, The branch handling when input_file is None uses judge_pipeline_args.get("input_dir") which can return None and be interpolated into the command; change it to direct access judge_pipeline_args["input_dir"] and use that value when building script_args (the conditional block around input_file and the script_args.append call) so the code fails loudly if input_dir is missing instead of inserting "None" into the command.
75-76: Use direct dictionary access for `output_dir` — it is always populated by the caller.
Same issue as in `comet_judge.py`: `output_dir` is guaranteed set by `eval.py` (line 477). Direct access provides a clear `KeyError` if assumptions change.
Proposed fix
- output_dir_path = judge_pipeline_args.get("output_dir")
- input_file = judge_pipeline_args.get("input_file")
+ output_dir_path = judge_pipeline_args["output_dir"]
+ input_file = judge_pipeline_args.get("input_file")  # legitimately optional

As per coding guidelines, `**/*.py`: "Do not use `.get()` for accessing dictionary keys if the code expects them to be present; use direct dictionary access `dict[key]` instead."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 75 - 76, The code uses judge_pipeline_args.get("output_dir") even though output_dir is always provided by the caller; change this to direct dictionary access judge_pipeline_args["output_dir"] to surface a KeyError if the assumption is violated. Update the assignment that sets output_dir_path (and leave input_file as-is) so it uses judge_pipeline_args["output_dir"]; ensure any downstream logic that relies on output_dir_path continues to work unchanged.
🧹 Nitpick comments (1)
nemo_skills/pipeline/judges/nvembed_judge.py (1)
26-134: Consider extracting shared logic between `comet_judge.py` and `nvembed_judge.py`.
Both judge modules share nearly identical structure: seed determination, remaining-job checks, script-arg assembly, and `add_task` invocation. The main differences are the command string and the `--skip-existing` flag. A shared base or helper function could reduce duplication. Not urgent given this is the initial extraction, but worth noting for follow-up.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/judges/nvembed_judge.py` around lines 26 - 134, create_judge_tasks duplicates logic found in comet_judge.py (seed determination, get_remaining_jobs check, script args assembly, and add_task call); refactor by extracting a shared helper (e.g., build_judge_task or create_common_judge_task) that accepts parameters for the script path, unique script_args (like "--input-file" vs "--input-dir" and the presence of "--skip-existing"), container key, and GPU/node defaults, then have create_judge_tasks and the comet counterpart call that helper; reuse existing symbols get_remaining_jobs and add_task inside the helper and preserve behavior for run_cmd formation, task_name suffixes (e.g., "-nvembed-judge"), and task parameters (num_gpus, num_nodes, partition, run_after, reuse_code_exp, reuse_code, task_dependencies, installation_command, skip_hf_home_check, sbatch_kwargs).
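One possible shape for the suggested extraction, with illustrative names (`JudgeTaskSpec`, `build_script_args`) that are not the repo's actual API:

```python
# Speculative refactoring sketch for the duplication noted above: the two
# judges would share argument assembly and differ only via a small spec.
from dataclasses import dataclass, field

@dataclass
class JudgeTaskSpec:
    extra_args: list = field(default_factory=list)
    skip_existing: bool = False  # the nvembed/comet behavioral difference

def build_script_args(spec, judge_pipeline_args, num_seeds):
    """Argument assembly shared by the comet and nvembed judge creators."""
    args = list(spec.extra_args)
    input_file = judge_pipeline_args.get("input_file")  # legitimately optional
    if input_file is None:
        args.append(f"--input-dir {judge_pipeline_args['input_dir']}")
        args.append(f"--num-seeds {num_seeds}")
    else:
        args.append(f"--input-file {input_file}")
    if spec.skip_existing:
        args.append("--skip-existing")
    return args
```

Each judge module would then only supply its command string and spec, while seed handling and `add_task` wiring live in one place.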
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 486-488: The flag has_tasks is being set unconditionally when a
judge_step_fn exists, which can be incorrect if judge_step_fn() returns an empty
list; change the logic in the block handling judge_step_fn so that you only set
has_tasks = True after calling judge_step_fn and confirming it returned a
non-empty list (e.g., inside the if judge_tasks: guard that processes the
returned tasks), mirroring the existing pattern used when _generate returns
None; ensure run_exp is only triggered when has_tasks is true after this check.
- Around line 479-492: The bug is that judge_step_fn is mutated inside the
benchmark loop by using judge_pipeline_args.pop("judge_step_fn", judge_step_fn),
causing the previous iteration's value to be used as the fallback; to fix,
capture the original CLI/default value (e.g., orig_judge_step_fn =
judge_step_fn) before entering the loop and inside the loop use a local variable
or call pop with that original as the default
(judge_pipeline_args.pop("judge_step_fn", orig_judge_step_fn)), and ensure you
don't assign back to the outer-scope judge_step_fn so the CLI default isn't
overwritten across iterations; keep the locate() dynamic-loading logic (from
nemo_skills.dataset.utils import locate) but apply it to the loop-local variable
only.
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 94-96: The early-return in the Comet judge currently returns an
empty list which leaves the caller's pre-set has_tasks=True unchanged; change
the early-return to return None (or another clearly falsy sentinel) from the
judge function so callers can detect "no work" by truthiness, and update the
caller in eval.py (the code around line 487 / the logic that sets has_tasks
before calling judge_step_fn) to set has_tasks based on the truthiness of
judge_step_fn's result (e.g., assign has_tasks = bool(judge_tasks) after calling
judge_step_fn) so the experiment isn't launched when there are no actual tasks.
Ensure references to the Comet judge function (judge_step_fn) and the caller in
eval.py are updated consistently.
In `@tests/test_generation.py`:
- Around line 31-32: The test uses an incorrect NVIDIA NIM API model identifier
and endpoint: replace the model string
"--model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B" with the documented lowercase
single-prefix identifier "--model=nvidia/nemotron-3-nano-30b-a3b" and revert the
server address "--server_address=https://inference-api.nvidia.com/v1/" to the
official base URL "--server_address=https://integrate.api.nvidia.com"; update
these exact f-string literals in tests/test_generation.py so the test uses the
documented NIM API formats.
---
Duplicate comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 100-106: When input_file is None, replace
judge_pipeline_args.get("input_dir") with direct access
judge_pipeline_args["input_dir"] so the constructed command uses the guaranteed
input_dir value (do this in the block that appends to script_args alongside
num_seeds); ensure you reference the variables input_file, judge_pipeline_args,
input_dir, script_args, and num_seeds in the change.
- Around line 75-77: The code uses judge_pipeline_args.get(...) for keys that
must exist, causing silent None values; change to direct dictionary access for
required keys: replace judge_pipeline_args.get("output_dir") and
judge_pipeline_args.get("judge_model") with judge_pipeline_args["output_dir"]
and judge_pipeline_args["judge_model"] respectively (leave input_file as
judge_pipeline_args.get("input_file") since it is optional), so failures raise
KeyError and fail fast; update any variable names output_dir_path and
comet_model_path accordingly where they are assigned from judge_pipeline_args.
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 100-102: The branch handling when input_file is None uses
judge_pipeline_args.get("input_dir") which can return None and be interpolated
into the command; change it to direct access judge_pipeline_args["input_dir"]
and use that value when building script_args (the conditional block around
input_file and the script_args.append call) so the code fails loudly if
input_dir is missing instead of inserting "None" into the command.
- Around line 75-76: The code uses judge_pipeline_args.get("output_dir") even
though output_dir is always provided by the caller; change this to direct
dictionary access judge_pipeline_args["output_dir"] to surface a KeyError if the
assumption is violated. Update the assignment that sets output_dir_path (and
leave input_file as-is) so it uses judge_pipeline_args["output_dir"]; ensure any
downstream logic that relies on output_dir_path continues to work unchanged.
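The `.get()` vs `[]` distinction these comments keep flagging is easy to demonstrate in isolation; the `judge_pipeline_args` contents below are made up for illustration:

```python
# .get() silently yields None, which later ends up interpolated as the
# literal string "None" inside a shell command; [] fails fast with KeyError.
judge_pipeline_args = {"input_file": "/data/out.jsonl"}  # output_dir missing

output_dir = judge_pipeline_args.get("output_dir")
cmd = f"python judge.py --output-dir={output_dir}"
assert cmd == "python judge.py --output-dir=None"  # broken command, no error raised

try:
    output_dir = judge_pipeline_args["output_dir"]
except KeyError as err:
    print(f"missing required judge arg: {err}")  # fails loudly at schedule time
```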
---
Nitpick comments:
In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 26-134: create_judge_tasks duplicates logic found in
comet_judge.py (seed determination, get_remaining_jobs check, script args
assembly, and add_task call); refactor by extracting a shared helper (e.g.,
build_judge_task or create_common_judge_task) that accepts parameters for the
script path, unique script_args (like "--input-file" vs "--input-dir" and the
presence of "--skip-existing"), container key, and GPU/node defaults, then have
create_judge_tasks and the comet counterpart call that helper; reuse existing
symbols get_remaining_jobs and add_task inside the helper and preserve behavior
for run_cmd formation, task_name suffixes (e.g., "-nvembed-judge"), and task
parameters (num_gpus, num_nodes, partition, run_after, reuse_code_exp,
reuse_code, task_dependencies, installation_command, skip_hf_home_check,
sbatch_kwargs).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- docs/evaluation/multilingual.md
- nemo_skills/dataset/mmau-pro/closed_form/__init__.py
- nemo_skills/pipeline/eval.py
- nemo_skills/pipeline/judges/__init__.py
- nemo_skills/pipeline/judges/comet_judge.py
- nemo_skills/pipeline/judges/nvembed_judge.py
- tests/test_generation.py
✅ Files skipped from review due to trivial changes (1)
- nemo_skills/pipeline/judges/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/evaluation/multilingual.md
```python
# judge_step_fn is a :: path to the judge creator function (locate() convention).
# Could be set directly in JUDGE_PIPELINE_ARGS; falls back to None for LLM judge.
judge_step_fn = judge_pipeline_args.pop("judge_step_fn", judge_step_fn)

# Create judge tasks based on judge type
if benchmark_judge_type == "nvembed":
    judge_tasks = _create_nvembed_judge_tasks(
        exp=exp,
        expname=expname,
        benchmark=benchmark,
        judge_pipeline_args=judge_pipeline_args,
        rerun_done=rerun_done,
        log_dir=log_dir,
        server_parameters=server_parameters,
        cluster_config=cluster_config,
        judge_server_gpus=judge_server_gpus,
        judge_server_nodes=judge_server_nodes,
        partition=partition,
        run_after=run_after,
        reuse_code_exp=reuse_code_exp,
        reuse_code=reuse_code,
        dependent_tasks=dependent_tasks,
        all_tasks=all_tasks,
        _task_dependencies=_task_dependencies,
        installation_command=installation_command,
        skip_hf_home_check=skip_hf_home_check,
        sbatch_kwargs=sbatch_kwargs,
    )
elif benchmark_judge_type == "comet":
    judge_pipeline_args["judge_model"] = judge_model
    judge_tasks = _create_comet_judge_tasks(
```

```python
# TODO: we should rework the interface here to have consistent parameters between main llm and custom
# judge creation steps. E.g. things like judge_model assignment below shouldn't be necessary

if judge_step_fn:
    has_tasks = True
    if not callable(judge_step_fn):
        # Use locate() to dynamically load judge creator function
        from nemo_skills.dataset.utils import locate

        judge_step_fn = locate(judge_step_fn)
```
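For context, the `module::attribute` convention referenced above can be sketched with a minimal stand-in for `locate()`; this illustrates the convention only and is not the actual implementation in `nemo_skills.dataset.utils`:

```python
# Minimal stand-in for locate(): resolves a "pkg.module::attr" string to
# the object it names. Illustrative only; the real helper may differ.
import importlib


def locate(path: str):
    """Resolve a 'module.path::attribute' string to the named object."""
    module_name, _, attr = path.partition("::")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


dumps = locate("json::dumps")
assert dumps({"a": 1}) == '{"a": 1}'  # resolved the real json.dumps
```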
Bug: judge_step_fn leaks across benchmark iterations.
On line 481, judge_step_fn is reassigned from judge_pipeline_args.pop("judge_step_fn", judge_step_fn). Since this is inside the for benchmark ... loop (line 457), the fallback value on subsequent iterations is whatever was set by the previous benchmark — not the original CLI value.
Example: If benchmark A's JUDGE_PIPELINE_ARGS defines judge_step_fn = "...nvembed_judge::create_judge_tasks" but benchmark B does not, benchmark B will incorrectly inherit benchmark A's judge_step_fn instead of falling back to the CLI default (likely None → LLM judge).
Save the original CLI value before the loop and use it as the fallback:
Proposed fix
```diff
     all_tasks = []
     if _task_dependencies is None:
         _task_dependencies = []
+    cli_judge_step_fn = judge_step_fn
     with pipeline_utils.get_exp(expname, cluster_config, _reuse_exp) as exp:
         # scheduling main eval jobs
         ...
         # scheduling judge jobs if needed
         for idx, (benchmark, benchmark_args) in enumerate(benchmarks_dict.items()):
             ...
-            judge_step_fn = judge_pipeline_args.pop("judge_step_fn", judge_step_fn)
+            benchmark_judge_step_fn = judge_pipeline_args.pop("judge_step_fn", cli_judge_step_fn)
             ...
-            if judge_step_fn:
+            if benchmark_judge_step_fn:
                 has_tasks = True
-                if not callable(judge_step_fn):
+                if not callable(benchmark_judge_step_fn):
                     from nemo_skills.dataset.utils import locate
-                    judge_step_fn = locate(judge_step_fn)
+                    benchmark_judge_step_fn = locate(benchmark_judge_step_fn)
                 ...
-            judge_tasks = judge_step_fn(
+            judge_tasks = benchmark_judge_step_fn(
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/eval.py` around lines 479 - 492, The bug is that
judge_step_fn is mutated inside the benchmark loop by using
judge_pipeline_args.pop("judge_step_fn", judge_step_fn), causing the previous
iteration's value to be used as the fallback; to fix, capture the original
CLI/default value (e.g., orig_judge_step_fn = judge_step_fn) before entering the
loop and inside the loop use a local variable or call pop with that original as
the default (judge_pipeline_args.pop("judge_step_fn", orig_judge_step_fn)), and
ensure you don't assign back to the outer-scope judge_step_fn so the CLI default
isn't overwritten across iterations; keep the locate() dynamic-loading logic
(from nemo_skills.dataset.utils import locate) but apply it to the loop-local
variable only.
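The leak is reproducible outside the pipeline; the benchmark names and `::` paths below are made up to show the pattern:

```python
# Minimal reproduction of the leak: using the loop-mutated variable as
# pop()'s fallback makes benchmark B inherit benchmark A's value.
benchmarks = {
    "A": {"judge_step_fn": "judges.nvembed::create_judge_tasks"},
    "B": {},  # no override; should fall back to the CLI default (None)
}

# Buggy: the fallback drifts as the loop reassigns judge_step_fn.
judge_step_fn = None  # CLI default
seen = {}
for name, args in benchmarks.items():
    judge_step_fn = dict(args).pop("judge_step_fn", judge_step_fn)
    seen[name] = judge_step_fn
assert seen["B"] == "judges.nvembed::create_judge_tasks"  # leaked from A!

# Fixed: keep the original default and use a loop-local variable.
cli_judge_step_fn = None
seen = {}
for name, args in benchmarks.items():
    benchmark_fn = dict(args).pop("judge_step_fn", cli_judge_step_fn)
    seen[name] = benchmark_fn
assert seen["B"] is None  # B correctly falls back to the CLI default
```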
```python
if judge_step_fn:
    has_tasks = True
    if not callable(judge_step_fn):
```
has_tasks = True is set before confirming the judge creator returns tasks.
If judge_step_fn() returns [] (e.g., all outputs already exist), has_tasks is still True. In an edge case where this is the only scheduled work, run_exp would be called on an experiment with no tasks. Consider moving has_tasks = True inside the if judge_tasks: guard at line 554, consistent with how _generate returning None is handled.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/eval.py` around lines 486 - 488, The flag has_tasks is
being set unconditionally when a judge_step_fn exists, which can be incorrect if
judge_step_fn() returns an empty list; change the logic in the block handling
judge_step_fn so that you only set has_tasks = True after calling judge_step_fn
and confirming it returned a non-empty list (e.g., inside the if judge_tasks:
guard that processes the returned tasks), mirroring the existing pattern used
when _generate returns None; ensure run_exp is only triggered when has_tasks is
true after this check.
```python
if not remaining_jobs or all(not chunks for chunks in remaining_jobs.values()):
    LOG.info(f"Skipping Comet judge for {benchmark} - all output files and .done markers exist")
    return []
```
Early-return skips has_tasks = True in the caller — verify the caller handles empty returns correctly.
When remaining_jobs is empty, this returns []. In eval.py (line 487), has_tasks is set to True before calling judge_step_fn, so if this is the only task scheduled and it returns empty, the experiment will still attempt to run with no actual tasks. This is a minor inconsistency — the if judge_tasks: guard at line 554 prevents the worst outcome (NoneType errors), but the experiment still launches unnecessarily.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/judges/comet_judge.py` around lines 94 - 96, The
early-return in the Comet judge currently returns an empty list which leaves the
caller's pre-set has_tasks=True unchanged; change the early-return to return
None (or another clearly falsy sentinel) from the judge function so callers can
detect "no work" by truthiness, and update the caller in eval.py (the code
around line 487 / the logic that sets has_tasks before calling judge_step_fn) to
set has_tasks based on the truthiness of judge_step_fn's result (e.g., assign
has_tasks = bool(judge_tasks) after calling judge_step_fn) so the experiment
isn't launched when there are no actual tasks. Ensure references to the Comet
judge function (judge_step_fn) and the caller in eval.py are updated
consistently.
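A minimal sketch of the caller-side fix described here, with stand-in judge creators (the real functions live in `comet_judge.py` and `eval.py`):

```python
# Derive has_tasks from what the judge creator actually returned, instead
# of setting it before the call. Stand-in functions for illustration.
def create_judge_tasks_all_done():
    return []  # nothing left to schedule


def create_judge_tasks_pending():
    return ["judge-task-1"]


def schedule(judge_step_fn):
    has_tasks = False
    judge_tasks = judge_step_fn()
    if judge_tasks:  # [] and None are both falsy, so "no work" is detected
        has_tasks = True
        # ... wire up task dependencies here ...
    return has_tasks


assert schedule(create_judge_tasks_all_done) is False  # no empty experiment launch
assert schedule(create_judge_tasks_pending) is True
```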
```python
f" --model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B "
f" --server_address=https://inference-api.nvidia.com/v1/ "
```
🧩 Analysis chain
🌐 Web query:
NVIDIA NIM inference API endpoint documentation 2025
💡 Result:
NVIDIA NIM inference API endpoint docs (2025-era)
1) NVIDIA-hosted NIM (API Catalog) – LLM inference
- Base URL: `https://integrate.api.nvidia.com`
- Chat endpoint: `POST /v1/chat/completions` (docs.api.nvidia.com)
- Reference page (LLM APIs): (see citation) (docs.api.nvidia.com)
2) Self-hosted / Downloadable NIM for LLMs – OpenAPI + endpoints
The NIM for LLMs API reference lists these key endpoints (OpenAI-compatible):
- `GET /v1/health/ready`
- `GET /v1/models`
- `POST /v1/chat/completions`
- `POST /v1/completions` (docs.nvidia.com)

It also notes experimental endpoints like `/v1/responses` and Llama Stack experimental routes. (docs.nvidia.com)
3) Retrieval (embeddings / rerank) NIM APIs
NVIDIA’s “Retrieval APIs” reference covers embedding and reranking endpoints for NeMo Retriever NIMs. (docs.api.nvidia.com)
Quick example (self-hosted NIM)
```shell
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"model_name","prompt":"Once upon a time","max_tokens":64}'
```

Example shown in NVIDIA's NIM microservices page. (nvidia.com)
Citations:
- 1: https://docs.api.nvidia.com/nim/reference/llm-apis?utm_source=openai
- 2: https://docs.api.nvidia.com/nim/reference/llm-apis?utm_source=openai
- 3: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html?utm_source=openai
- 4: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html?utm_source=openai
- 5: https://docs.api.nvidia.com/nim/reference/retrieval-apis?utm_source=openai
- 6: https://www.nvidia.com/en-us/ai-data-science/products/nim-microservices/?utm_source=openai
🌐 Web query:
Nemotron-3-Nano-30B-A3B model NVIDIA API
💡 Result:
Nemotron-3-Nano-30B-A3B is available in the NVIDIA NIM API as `nvidia/nemotron-3-nano-30b-a3b`. The chat endpoint is OpenAI-compatible and uses Bearer auth. [1], [2]
HTTP endpoint (Chat Completions):
POST https://integrate.api.nvidia.com/v1/chat/completions [2]
Request (example):
```shell
curl https://integrate.api.nvidia.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -d '{
    "model": "nvidia/nemotron-3-nano-30b-a3b",
    "messages": [{"role":"user","content":"Write a haiku about GPUs"}]
  }'
```

Model/API reference and model details are in NVIDIA's NIM docs. [1]
Sources:
[1] NVIDIA NIM API Reference – nvidia / nemotron-3-nano-30b-a3b: https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-nano-30b-a3b
[2] NVIDIA NIM API Reference – Chat Completions endpoint for this model: https://docs.api.nvidia.com/nim/reference/nvidia-nemotron-3-nano-30b-a3b-infer
🌐 Web query:
NVIDIA inference-api.nvidia.com versus integrate.api.nvidia.com endpoint
💡 Result:
- `https://integrate.api.nvidia.com` is NVIDIA's documented, public base URL for the NVIDIA API Catalog (NIM hosted endpoints). For LLMs, it exposes an OpenAI-compatible endpoint like `POST /v1/chat/completions`. [1]
  - Note: NVIDIA documents at least one OpenAI-style endpoint difference vs "generic OpenAI" expectations, e.g. `/v1/completions` is not supported on `integrate.api.nvidia.com/v1` (use chat-completions-style APIs instead, or the specific service docs). [2]
- `https://inference-api.nvidia.com` is not the base URL shown in NVIDIA's NIM API Catalog reference docs; hitting the root returns `"LiteLLM: RUNNING"`, which indicates it's a LiteLLM-powered gateway/proxy rather than the primary documented public entrypoint. [3]
  - Practical implication: treat it as internal/legacy/undocumented unless NVIDIA specifically instructed you to use it for your account/tenant; prefer `integrate.api.nvidia.com` for supported integrations. [1][3]
Sources:
[1] NVIDIA NIM LLM APIs reference (base URL + /v1/chat/completions) (docs.api.nvidia.com)
[2] NeMo Guardrails doc noting integrate.api.nvidia.com/v1 doesn’t support /completions (docs.nvidia.com)
[3] inference-api.nvidia.com root response (inference-api.nvidia.com)
Citations:
- 1: https://docs.api.nvidia.com/nim/reference/llm-apis?utm_source=openai
- 2: https://docs.nvidia.com/nemo/microservices/latest/guardrails/running-inference.html?utm_source=openai
- 3: https://inference-api.nvidia.com/
Revert endpoint and model identifier to documented NVIDIA NIM API standards.
The endpoint change to inference-api.nvidia.com is incorrect. NVIDIA's official documented base URL for the NIM API Catalog is https://integrate.api.nvidia.com. The inference-api.nvidia.com endpoint is an undocumented internal LiteLLM gateway and should not be used for production integrations.
The model identifier format is also incorrect. The correct format per NVIDIA's NIM API documentation is nvidia/nemotron-3-nano-30b-a3b (lowercase with single nvidia/ prefix), not the mixed-case Nemotron-3-Nano-30B-A3B. The tests will fail with these changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test_generation.py` around lines 31 - 32, The test uses an incorrect
NVIDIA NIM API model identifier and endpoint: replace the model string
"--model=nvidia/nvidia/Nemotron-3-Nano-30B-A3B" with the documented lowercase
single-prefix identifier "--model=nvidia/nemotron-3-nano-30b-a3b" and revert the
server address "--server_address=https://inference-api.nvidia.com/v1/" to the
official base URL "--server_address=https://integrate.api.nvidia.com"; update
these exact f-string literals in tests/test_generation.py so the test uses the
documented NIM API formats.
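To sanity-check the corrected values without hitting the network, the documented request shape can be assembled locally; actually sending it would additionally require a valid `NVIDIA_API_KEY`:

```python
# Builds (but does not send) a chat-completions request against the
# documented NIM base URL, using the corrected model identifier.
import json

BASE_URL = "https://integrate.api.nvidia.com"
MODEL = "nvidia/nemotron-3-nano-30b-a3b"  # lowercase, single nvidia/ prefix

url = f"{BASE_URL}/v1/chat/completions"
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Write a haiku about GPUs"}],
}
body = json.dumps(payload)  # ready to POST with an Authorization header

assert url == "https://integrate.api.nvidia.com/v1/chat/completions"
assert "nvidia/nvidia/" not in MODEL  # no doubled prefix
```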
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Thank you @Kipok for the fixes. The naming looks good to me.
```
a5da597  Mar  6 2026  Igor Gitman            Revert "Eval kit support (#1239)" (#1294)
b237e33  Mar  6 2026  George                 Eval kit support (#1239)
dc28bbf  Mar  5 2026  George Armstrong       Python direct tool calling without MCP (#1286)
12454dd  Mar  4 2026  Sadegh Mahdavi         Allow het servers for nemo-rl jobs (#1223)
8884a68  Mar  4 2026  Prasoon Varshney       Support source_lang param for translation recipe (#1290)
4618b19  Mar  4 2026  Meriem B.              Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)
5ac8609  Mar  4 2026  Talor Abramovich       Add SPEED-Bench (within repo) (#1279)
c31eec5  Mar  3 2026  George Armstrong       Fix os.getlogin() crash in ns setup (#1289)
c228e66  Mar  3 2026  George Armstrong       Fix streaming TypeError when delta.content is None (#1267) (#1288)
aa47923  Mar  2 2026  Matvei Novikov         Add LibTrace recipe for generating domain-specific reasoning data (#1224)
313cad7  Mar  2 2026  Stephen Ge             fix: clean parse-failure retries in prover (#1284)
813cfa3  Mar  2 2026  George Armstrong       tst: rollback inference-api to integrate (#1287)
31735f9  Mar  2 2026  Valentin Mendelev      Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)
d4ef8c0  Feb 27 2026  George                 Update promt_config to working with openai format + inline setup (#1210)
e879cbc  Feb 27 2026  George Armstrong       Update noc tutorial (#1282)
f6e3505  Feb 27 2026  George Armstrong       Add noc reasoning tutorial (#1278)
fc2072a  Feb 27 2026  Jiacheng Xu            CritPt generation add prompt_format=None (#1280)
c8abe5d  Feb 27 2026  Igor Gitman            New slurm customization parameters (account, containers) (#1209)
2b38cce  Feb 25 2026  George Armstrong       Add nemo-skills-core subpackage for lightweight installs (#1229)
9fa8e83  Feb 25 2026  Dheeraj Peri           feat: add custom judge type support for external repo integration (#1274)
8a32b13  Feb 24 2026  Igor Gitman            Exclude numb3rs form test_eval.py (#1275)
6da2219  Feb 23 2026  George                 Numb3rs ds addition (#1174)
ad034b5  Feb 22 2026  Suriya Gunasekar       Add DSBench-DA evaluation (#1254)
7593ab3  Feb 20 2026  Jiacheng Xu            Add CritPt benchmark (#1200)
58c31b2  Feb 20 2026  Suriya Gunasekar       Fix no_answer metric overcounting in _compute_pass_at_k (#1245)
1f1a2e7  Feb 20 2026  Igor Gitman            Fix incorrect prompt tokens count due to HF api update (#1264)
8ebc6f5  Feb 20 2026  Igor Gitman            Remove deprecated dataset group (#1263)
ea4177f  Feb 19 2026  Yongqiang Wang         fix deps (#1258)
60905a7  Feb 20 2026  Minho Ryu              Add aime26 (#1256)
b28afc5  Feb 19 2026  Igor Gitman            Rename custom -> external benchmarks (#1262)
6cc9c45  Feb 19 2026  Igor Gitman            Add reference to internal benchmarks repo (#1261)
5202af6  Feb 19 2026  Igor Gitman            Remove incorrect presence-penalty setting (#1259)
144c70b  Feb 19 2026  Igor Gitman            Adding an option to store benchmarks in external repo (#1240)
10e6e39  Feb 19 2026  George                 update vllm miltimodal for api calls convenience (#1213)
1ba4219  Feb 18 2026  Nick Ludwig            Fix --server_container not being applied to dependent jobs (#1244)
9517614  Feb 16 2026  Wasi Ahmad             Support mini-swe-agent as agent harness (#1212)
a3d44dc  Feb 13 2026  Suriya Gunasekar       Add --installation_command support to prepare_data (#1243)
e80d524  Feb 12 2026  George Armstrong       Fix CI disk space for Docker image builds (#1241)
d22236c  Feb 11 2026  Sadegh Mahdavi         Fix answerbench prompt parsing (#1235)
2401628  Feb 11 2026  George Armstrong       feat: add lockfiles for reproducible sandbox builds (#1233)
5a0a84d  Feb 11 2026  Wasi Ahmad             removing datasets version restriction for LCB eval (#1230)
ef0a890  Feb 11 2026  gnalbandyan            Gnalbandyan/add physics (#1214)
bd9d30c  Feb 10 2026  Wasi Ahmad             LCB generic prompting (#1215)
7d6c49a  Feb  7 2026  Sadegh Mahdavi         Add support for different variations of nemo-rl (#1220)
b19ba96  Feb  6 2026  George Armstrong       Add multi-node sandbox support for SLURM clusters (#1218)
8950bb0  Feb  7 2026  anowaczynski-nvidia    support structured outputs in hle judge for optional AA compatibility (#1186)
b84f7a2  Feb  6 2026  Igor Gitman            A small update on running tests docs (#1219)
8e838e1  Feb  5 2026  George Armstrong       feat: add flag to disable sandbox replay (#1217)
5fd9085  Feb  5 2026  Igor Gitman            Add an option to limit number of tool calls (#1216)
d820200  Feb  3 2026  Igor Gitman            Add arena-hard v2 (#1205)
a30920e  Feb  2 2026  Igor Gitman            Fix mkdocs warnings (#1204)
19d7788  Feb  2 2026  Ivan                   Fix infinite wait in sandbox.wait_for_sandbox (#1206)
3e65fbf  Jan 30 2026  Sadegh Mahdavi         Improve tts (#1203)
250c862  Jan 30 2026  Nick Ludwig            SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202)
7ded756  Jan 30 2026  Ivan                   Add proper token counting to code execution model (#1184)
b986304  Jan 29 2026  Igor Gitman            Upgrade containers (#1198)
3b44f02  Jan 29 2026  Dan Lord               Fix incorrect string format (#1199)
c4854b8  Jan 29 2026  Sadegh Mahdavi         Update nemo-rl to latest (#1087)
```
Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
) Signed-off-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Dheeraj Peri <dperi@nvidia.com> Signed-off-by: suriya <sgunasekar@nvidia.com> Signed-off-by: Jiacheng Xu <jiachengx@nvidia.com> Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Minho Ryu <ryumin93@gmail.com> Co-authored-by: Yongqiang Wang <yongqiang.seagull@gmail.com> Co-authored-by: Suriya Gunasekar <sgunasekar@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Jiacheng Xu <jcxu@utexas.edu> Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Add a generic 'custom' judge type that allows external repos to register their own judge task creators using the locate() pattern. This lets external benchmarks define judge tasks without modifying the nemo-skills source code.
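The locate() pattern mentioned above resolves a dotted import path string to the Python object it names, so an external repo only has to ship a function and advertise its path. The following is a minimal sketch, not the actual nemo-skills implementation: the helper body and the example `create_judge_tasks` signature are assumptions (the parameter names mirror the walkthrough, but the real signature may differ).

```python
from importlib import import_module


def locate(path: str):
    """Resolve a dotted path such as 'my_pkg.judges.create_judge_tasks'
    to the object it names (a minimal sketch of the locate() pattern)."""
    module_path, _, attr = path.rpartition(".")
    return getattr(import_module(module_path), attr)


# A hypothetical judge task creator an external repo could expose and pass
# to eval() by dotted path, e.g. judge_step_fn="my_pkg.judges.create_judge_tasks".
def create_judge_tasks(exp, benchmark, judge_pipeline_args, output_dir, **kwargs):
    """Build and return the list of judge tasks for one benchmark."""
    tasks = []
    # ... inspect judge_pipeline_args, look up remaining jobs, add tasks ...
    return tasks
```

With a helper like this, `eval(..., judge_step_fn="my_pkg.judges.create_judge_tasks")` can import and invoke the external creator at runtime without nemo-skills knowing about it at import time.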
Changes: