
feat: add custom judge type support for external repo integration#1265

Closed
peri044 wants to merge 9 commits into NVIDIA-NeMo:main from peri044:peri044/external_judge

Conversation


@peri044 peri044 commented Feb 20, 2026

Add generic 'custom' judge type that allows external repos to register their own judge task creators using the locate() pattern. This enables external benchmarks to define judge tasks without modifying nemo-skills source code.

Changes:

  • Add _create_custom_judge_tasks() function to dynamically import and call external judge creators
  • Add 'custom' judge type routing in prepare_eval_commands()
  • Uses existing locate() pattern (same as METRICS_TYPE, eval_type)
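The external-repo side of this contract could be sketched as follows. This is a minimal illustration, not the actual nemo-skills API: the module path, parameter names, and returned task shape are all assumptions for the example.

```python
# my_benchmark/judges.py -- hypothetical external-repo module.
# A benchmark would reference it as "my_benchmark.judges::create_judge_tasks"
# so that eval.py can resolve it with locate() at runtime.


def create_judge_tasks(expname, benchmark, judge_pipeline_args, output_dir, **kwargs):
    """Build judge task descriptions without modifying nemo-skills source.

    All names here are illustrative; the real contract is defined by the
    call site in prepare_eval_commands().
    """
    cmd = (
        f"python -m my_benchmark.judge "
        f"--input-dir {judge_pipeline_args['input_dir']} "
        f"--output-dir {output_dir}"
    )
    # Return a list of task specs for the scheduler; the concrete task
    # object type depends on the nemo-skills pipeline internals.
    return [{"name": f"{expname}-{benchmark}-judge", "cmd": cmd}]
```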

Summary by CodeRabbit

  • New Features

    • Support for dynamic judge loading by path, including user-provided custom judge creators.
    • Added built-in Comet and NVEmbed judge integrations for translation and embedding-based evaluations.
  • Refactor

    • Consolidated judge creation to the dynamic loader; removed legacy specialized judge creation paths.
    • Preserved existing LLM-judge fallback and ensured judge model and pipeline args propagate to dynamic judges.

This change is minimal (~50 lines) and reusable for any external repo.
@peri044 peri044 requested a review from Kipok February 20, 2026 21:42

coderabbitai bot commented Feb 20, 2026

📝 Walkthrough

Replaces hardcoded per-benchmark judge creators with a dynamic judge_step loader (via locate), removes internal comet/nvembed creators from eval.py, adds a nemo_skills.pipeline.judges package with separate comet and nvembed create_judge_tasks implementations, and propagates judge_model into judge pipeline args.
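The `module::function` convention resolved by locate() can be illustrated with a minimal stand-in. The real helper lives in nemo_skills.dataset.utils and may differ in error handling; this sketch only shows the resolution mechanics.

```python
import importlib


def locate(path: str):
    """Resolve a 'package.module::attribute' string to the attribute.

    Minimal stand-in for the nemo-skills helper; the actual
    implementation may validate and report errors differently.
    """
    module_name, _, attr_name = path.partition("::")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# A judge_step path such as
# "nemo_skills.pipeline.judges.comet_judge::create_judge_tasks"
# would be resolved the same way as this stdlib example:
join_fn = locate("os.path::join")
```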

Changes

Cohort / File(s) Summary
Eval pipeline
nemo_skills/pipeline/eval.py
Removed _create_comet_judge_tasks and _create_nvembed_judge_tasks. Introduces dynamic judge loading using judge_step/judge_creator_path with locate(), calls loaded create_judge_tasks(...) with standardized params (benchmarks as list, output_dir, judge_pipeline_args) and falls back to existing LLM judge flow when no dynamic creator is provided. Removed judge_type parameter and related logic; judge_model is forwarded into judge_pipeline_args.
Judges package init
nemo_skills/pipeline/judges/__init__.py
Adds package initializer (license header and module docstring) to expose nemo_skills.pipeline.judges.
Comet judge module
nemo_skills/pipeline/judges/comet_judge.py
New create_judge_tasks(...) implementing Comet-based evaluation task creation: seed selection, remaining-jobs check, build/install/run commands (unbabel-comet), GPU/cluster task creation, dependencies, and returns created task(s).
NVEmbed judge module
nemo_skills/pipeline/judges/nvembed_judge.py
New create_judge_tasks(...) implementing NVEmbed embedding-based judge task creation: input/output/seed handling, remaining-jobs check, build/run command for nvembed_judge, GPU/cluster task creation, dependencies, and returns created task(s).
Dataset metadata update
nemo_skills/dataset/mmau-pro/closed_form/__init__.py
Replaced JUDGE_PIPELINE_ARGS key judge_type with judge_step pointing to nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks (switches config to dynamic creator path).
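The config switch described above amounts to replacing a symbolic judge_type with an importable path. A hedged sketch of the resulting dataset config (any surrounding keys are omitted; only the changed key is shown):

```python
# nemo_skills/dataset/mmau-pro/closed_form/__init__.py (sketch)
# Before: JUDGE_PIPELINE_ARGS carried a symbolic key, e.g. "judge_type".
# After: point directly at the creator function via the :: convention,
# so the eval pipeline can resolve it dynamically with locate().
JUDGE_PIPELINE_ARGS = {
    "judge_step": "nemo_skills.pipeline.judges.nvembed_judge::create_judge_tasks",
}
```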

Sequence Diagram(s)

sequenceDiagram
    participant Eval as Eval Pipeline
    participant Loc as locate()
    participant Judge as Judge Module\n(create_judge_tasks)
    participant Scheduler as Task Scheduler / Cluster

    Eval->>Loc: resolve `judge_step` / creator path
    Loc-->>Eval: returns create_judge_tasks function
    Eval->>Judge: call create_judge_tasks(exp, expname, benchmarks, judge_pipeline_args, output_dir, ...)
    Judge-->>Eval: returns list of judge tasks
    Eval->>Scheduler: schedule returned tasks (sbatch / executor)
    Scheduler-->>Eval: task execution results / done markers

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

run GPU tests

Suggested reviewers

  • Kipok
  • gwarmstrong
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check (Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (Passed): the title accurately describes the main change of introducing custom judge type support for external judge task creators.
  • Docstring Coverage (Passed): 100.00% coverage meets the required 80.00% threshold.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

388-388: ⚠️ Potential issue | 🟡 Minor

judge_type help string doesn't mention the new "custom" value (or the existing "comet").

📝 Proposed fix
-    judge_type: str = typer.Option("llm", help="Type of judge to use: 'llm' (default) or 'nvembed'"),
+    judge_type: str = typer.Option("llm", help="Type of judge to use: 'llm' (default), 'nvembed', 'comet', or 'custom'"),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` at line 388, The help text for the Typer option
judge_type (in the function signature where judge_type: str = typer.Option(...))
is missing the "custom" (and currently "comet") allowed values; update the help
string for judge_type to list all supported values (e.g., "'llm' (default),
'nvembed', 'comet', or 'custom'") or otherwise describe the accepted options so
users know "custom" and "comet" are valid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/eval.py`:
- Around line 215-235: The _create_custom_judge_tasks function is missing the
rerun_done parameter so external creators cannot implement skip-if-done logic;
add rerun_done to the _create_custom_judge_tasks signature and ensure it is
forwarded into any internal calls that call get_remaining_jobs (matching how
_create_comet_judge_tasks and _create_nvembed_judge_tasks accept rerun_done),
and update the call site that invokes _create_custom_judge_tasks to pass through
the rerun_done argument so completed outputs can be skipped.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemo_skills/pipeline/judges/comet_judge.py`:
- Around line 77-79: Replace silent .get() calls for required
judge_pipeline_args keys with direct dict access so missing required values
raise immediately: change output_dir_path =
judge_pipeline_args.get("output_dir") and comet_model_path =
judge_pipeline_args.get("judge_model") to output_dir_path =
judge_pipeline_args["output_dir"] and comet_model_path =
judge_pipeline_args["judge_model"], and likewise when you reference "input_dir"
in the branch that runs when input_file is None use
judge_pipeline_args["input_dir"] instead of .get(); leave input_file using
.get() since None is a valid sentinel, and ensure the values passed into
get_remaining_jobs(...) and the constructed command string come from these
direct accesses to fail fast with a clear KeyError.

In `@nemo_skills/pipeline/judges/nvembed_judge.py`:
- Around line 77-78: Replace the silent .get() accesses for required keys with
direct dictionary indexing so missing required args raise immediately: change
uses of judge_pipeline_args.get("output_dir") and
judge_pipeline_args.get("input_file") to judge_pipeline_args["output_dir"] and
judge_pipeline_args["input_file"], and likewise inside the branch use
judge_pipeline_args["input_dir"] instead of .get(); this will cause a KeyError
to fail fast (consistent with comet_judge.py) and prevent passing None into the
run command or get_remaining_jobs.

Comment on lines +77 to +79
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
comet_model_path = judge_pipeline_args.get("judge_model")

🛠️ Refactor suggestion | 🟠 Major

Use direct dict access for required judge_pipeline_args keys.

output_dir, judge_model, and input_dir (inside the if input_file is None branch) are all expected to be present. Using .get() on them silently yields None, which then propagates into both get_remaining_jobs(output_dir=None, …) and the constructed command string (e.g., --output-dir None --comet-model-path None), producing confusing downstream failures instead of an immediate, clear KeyError. Per coding guidelines, use direct dict access dict[key] for required keys.

input_file correctly uses .get() since None is its legitimate "not-provided" sentinel.

♻️ Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
-    comet_model_path = judge_pipeline_args.get("judge_model")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")   # intentional sentinel
+    comet_model_path = judge_pipeline_args["judge_model"]
     if input_file is None:
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]
         script_args.append(f"--input-dir {input_dir}")

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

Also applies to: 101-101, 104-105
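The failure mode behind this suggestion is easy to reproduce in isolation: `.get()` on a missing required key silently yields None, which gets stringified into the command, whereas direct indexing fails at the real source of the bug. This is a toy dict, not the actual pipeline args:

```python
args = {"input_file": "preds.jsonl"}  # "output_dir" accidentally omitted

# Silent failure: None is interpolated into the command string and only
# surfaces later as a confusing downstream error.
cmd = f"judge --output-dir {args.get('output_dir')} --input {args['input_file']}"
assert "--output-dir None" in cmd

# Fail fast: direct access raises KeyError right at the access site.
try:
    cmd = f"judge --output-dir {args['output_dir']}"
except KeyError as err:
    print(f"missing required judge arg: {err}")
```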


Comment on lines +77 to +78
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")

🛠️ Refactor suggestion | 🟠 Major

Same .get() on required keys as in comet_judge.py.

output_dir and input_dir (inside the if input_file is None branch) are required fields. Using .get() silently produces None, which embeds --output-dir None / --input-dir None in the run command and passes None to get_remaining_jobs, rather than failing immediately with a clear KeyError.

♻️ Proposed fix
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")   # intentional sentinel
     if input_file is None:
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]
         script_args.append(f"--input-dir {input_dir}")

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."

Also applies to: 100-100, 103-104


Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

490-518: ⚠️ Potential issue | 🟡 Minor

has_tasks not set in the dynamic judge path — experiment may silently not run.

The LLM path (line 521) explicitly sets has_tasks = True before creating judge tasks. The dynamic path omits this. If job_batches is empty (all eval outputs already exist, so the main loop never sets has_tasks = True) and auto_summarize_results=False, pipeline_utils.run_exp at line 653 is never called, and the experiment silently no-ops despite dynamic judge tasks having been created and appended to all_tasks.

🐛 Proposed fix
             if judge_creator_path:
-                # Use locate() to dynamically load judge creator function
-                from nemo_skills.dataset.utils import locate
-
+                has_tasks = True
                 judge_creator_fn = locate(judge_creator_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 490 - 518, The dynamic
judge-creator branch (when judge_creator_path is used and judge_creator_fn
returns judge_tasks) never sets the has_tasks flag, so downstream
pipeline_utils.run_exp may be skipped; after calling judge_creator_fn and
confirming judge_tasks is non-empty (the judge_tasks variable created in this
block), set has_tasks = True and ensure those tasks are appended to all_tasks so
run_exp (pipeline_utils.run_exp) will execute just like the LLM path does.
🧹 Nitpick comments (1)
nemo_skills/pipeline/eval.py (1)

490-492: Hoist locate import out of the benchmark loop.

from nemo_skills.dataset.utils import locate is re-executed on every benchmark iteration that has a judge. Python's import cache makes it a no-op after the first call, but placing imports inside a loop is non-idiomatic and obscures the module's true dependencies.

♻️ Proposed fix
 from nemo_skills.pipeline.utils.eval import combine_cmds, prepare_eval_commands
+from nemo_skills.dataset.utils import locate
 from nemo_skills.utils import (

Then remove the inline import:

-            if judge_creator_path:
-                # Use locate() to dynamically load judge creator function
-                from nemo_skills.dataset.utils import locate
-
-                judge_creator_fn = locate(judge_creator_path)
+            if judge_creator_path:
+                judge_creator_fn = locate(judge_creator_path)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 490 - 492, The inline import "from
nemo_skills.dataset.utils import locate" inside the block that checks
judge_creator_path should be hoisted to module-level (or at least outside the
benchmark loop) so the import is declared once and dependencies are clear;
remove the inline import from the loop and add a single top-level import for
locate, leaving the existing logic that uses judge_creator_path and locate
unchanged.

Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)

511-546: ⚠️ Potential issue | 🟡 Minor

has_tasks is not set for the dynamic judge path.

Line 513 sets has_tasks = True only in the LLM judge branch. The dynamic judge branch (Lines 482–510) skips it. While main eval tasks will have already set has_tasks = True in practice, this is inconsistent and fragile if the flow is ever refactored.

Move the flag into the shared null-check block so both paths are covered:

Proposed fix
-            else:
-                # Use default LLM judge pipeline
-                has_tasks = True
-                judge_tasks = _create_llm_judge_tasks(
+            else:
+                # Use default LLM judge pipeline
+                judge_tasks = _create_llm_judge_tasks(
                     ...
                 )
             # _generate can return None when there are no jobs to run (e.g., outputs already exist)
             # Only record and extend when tasks are present to avoid NoneType errors
             if judge_tasks:
+                has_tasks = True
                 benchmark_to_judge_tasks[benchmark] = judge_tasks
                 all_tasks.extend(judge_tasks)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 511 - 546, The has_tasks flag is
only set inside the LLM judge branch; move setting has_tasks = True into the
shared post-generation null-check so it's set for both the LLM and dynamic judge
paths: after obtaining judge_tasks (from _create_llm_judge_tasks or the dynamic
judge generator) and before you assign benchmark_to_judge_tasks or extend
all_tasks, set has_tasks = True when judge_tasks is truthy; this ensures the
flag reflects any generated judge tasks regardless of which creation function
produced them.
🧹 Nitpick comments (1)
nemo_skills/pipeline/eval.py (1)

482-510: The judge module interface is well-documented but consider adding a discoverable reference at the call site.

Both nvembed_judge.py and comet_judge.py define create_judge_tasks() functions with comprehensive docstrings documenting all parameters and return types, matching exactly the 20 kwargs passed in eval.py lines 489–510. The interface contract is properly specified in the judge modules themselves.

However, external authors looking at eval.py won't see these docstrings without examining the judge modules. Consider adding a brief reference comment near line 481–484 (before the locate() call) pointing to the expected signature in the judge modules, or documenting the standard create_judge_tasks() interface in judges/__init__.py for improved discoverability.
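A discoverable interface stub along these lines could live in judges/__init__.py. The parameter list below is reconstructed from this review thread; treat it as a sketch of the contract, not the verified call-site signature in eval.py:

```python
# Sketch of an interface stub for nemo_skills/pipeline/judges/__init__.py.
# Parameter names follow the review discussion and may not match the
# actual call site exactly.
def create_judge_tasks(
    exp,
    expname,
    benchmark,
    judge_pipeline_args,
    output_dir,
    rerun_done=False,
    **pipeline_kwargs,
):
    """Contract every judge module is expected to implement.

    Returns the list of created tasks, or None when all outputs already
    exist and rerun_done is False.
    """
    raise NotImplementedError("implemented by comet_judge, nvembed_judge, ...")
```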

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 482 - 510, Add a discoverable
reference to the expected judge creator interface by inserting a short comment
before the locate() call in eval.py that names the standardized function and its
parameters (e.g., "expects create_judge_tasks(exp, expname, benchmarks,
judge_pipeline_args, rerun_done, log_dir, output_dir, cluster_config,
judge_server_gpus, judge_server_nodes, partition, run_after, reuse_code_exp,
reuse_code, dependent_tasks, all_tasks, _task_dependencies,
installation_command, skip_hf_home_check, sbatch_kwargs)") and/or add a concise
module-level docstring in judges/__init__.py that documents the
create_judge_tasks signature so external authors can discover the contract
without opening nvembed_judge.py or comet_judge.py; update references to
judge_creator_path, locate, and judge_creator_fn accordingly.

@Kipok Kipok left a comment

Could you please update a reference to judge_type in the docs in here /home/igitman/workspace/NeMo-Skills/docs/evaluation/multilingual.md?

Also can you please re-create from a branch, so that our tests can run?

benchmark_judge_type = judge_pipeline_args.pop("judge_type", judge_type)
# judge_step is a :: path to the judge creator function (locate() convention).
# Benchmarks set this directly in JUDGE_PIPELINE_ARGS; falls back to None for LLM judge.
judge_creator_path = judge_pipeline_args.pop("judge_step", None)
Collaborator

can we have this as an explicit pipeline parameter? Similar to how judge_type was? That way it will be simpler to update from command line

Collaborator Author

@peri044 peri044 commented Feb 24, 2026

Renamed the judge_step to judge_path and added it as a pipeline parameter.

exp=exp,
expname=expname,
benchmark=benchmark,
benchmarks=[benchmark],
Collaborator

what's the reason to have multiple benchmarks here?

Collaborator Author

No specific reason. Reverted this.
