feat: add custom judge type support for external repo integration #1265
peri044 wants to merge 9 commits into NVIDIA-NeMo:main
Conversation
Add a generic 'custom' judge type that allows external repos to register their own judge task creators using the locate() pattern. This enables external benchmarks to define judge tasks without modifying nemo-skills source code.

Changes:
- Add _create_custom_judge_tasks() function to dynamically import and call external judge creators
- Add 'custom' judge type routing in prepare_eval_commands()
- Use the existing locate() pattern (same as METRICS_TYPE, eval_type)

This change is minimal (~50 lines) and reusable for any external repo.
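The locate() pattern referenced above resolves a string path to a Python object at import time. A minimal sketch of how such a resolver typically works; this is illustrative only, since the real helper lives in nemo_skills.dataset.utils and may differ in details:

```python
import importlib

def locate(path: str):
    # Resolve "pkg.module::attr" or dotted "pkg.module.attr" to the named object.
    if "::" in path:
        module_name, attr = path.split("::", 1)
    else:
        module_name, _, attr = path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# A benchmark could then point at e.g. "my_repo.judges::create_judge_tasks";
# here we resolve a stdlib function just to show the mechanics.
dumps = locate("json::dumps")
print(dumps({"judge_type": "custom"}))
```

This is the same mechanism the PR description says is already used for METRICS_TYPE and eval_type, which is why the change stays small.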
📝 Walkthrough

Replaces hardcoded per-benchmark judge creators with a dynamic judge_step loader (via locate), removes the internal comet/nvembed creators from eval.py, adds a nemo_skills.pipeline.judges package with separate comet and nvembed create_judge_tasks implementations, and propagates judge_model into the judge pipeline args.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Eval as Eval Pipeline
    participant Loc as locate()
    participant Judge as Judge Module (create_judge_tasks)
    participant Scheduler as Task Scheduler / Cluster
    Eval->>Loc: resolve judge_step / creator path
    Loc-->>Eval: returns create_judge_tasks function
    Eval->>Judge: call create_judge_tasks(exp, expname, benchmarks, judge_pipeline_args, output_dir, ...)
    Judge-->>Eval: returns list of judge tasks
    Eval->>Scheduler: schedule returned tasks (sbatch / executor)
    Scheduler-->>Eval: task execution results / done markers
```
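Based on the diagram, an external repo would expose a create_judge_tasks function for the loader to resolve. A hypothetical sketch of such a creator follows; only exp, expname, benchmarks, judge_pipeline_args, and output_dir appear in the diagram, and the task dicts and the my_repo.judge module are invented for illustration, not the real nemo-skills task objects:

```python
def create_judge_tasks(exp, expname, benchmarks, judge_pipeline_args, output_dir, **kwargs):
    # Hypothetical external judge creator: build one task per benchmark.
    # Real creators would construct scheduler tasks, not plain dicts.
    tasks = []
    for benchmark in benchmarks:
        tasks.append({
            "name": f"{expname}-judge-{benchmark}",
            "cmd": (
                f"python -m my_repo.judge "
                f"--input-dir {output_dir}/eval-results/{benchmark} "
                f"--model {judge_pipeline_args.get('judge_model', 'default')}"
            ),
        })
    return tasks
```

The eval pipeline then only needs the resolved function's name and keyword contract, which is what makes the mechanism reusable for any external repo.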
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
Line 388: ⚠️ Potential issue | 🟡 Minor

The judge_type help string doesn't mention the new "custom" value (or the existing "comet").

📝 Proposed fix

```diff
- judge_type: str = typer.Option("llm", help="Type of judge to use: 'llm' (default) or 'nvembed'"),
+ judge_type: str = typer.Option("llm", help="Type of judge to use: 'llm' (default), 'nvembed', 'comet', or 'custom'"),
```
Inline comment (nemo_skills/pipeline/eval.py, around lines 215-235):

The _create_custom_judge_tasks function is missing the rerun_done parameter, so external creators cannot implement skip-if-done logic. Add rerun_done to the _create_custom_judge_tasks signature, forward it into any internal calls to get_remaining_jobs (matching how _create_comet_judge_tasks and _create_nvembed_judge_tasks accept rerun_done), and update the call site that invokes _create_custom_judge_tasks to pass through the rerun_done argument so completed outputs can be skipped.
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
…to peri044/external_judge
Actionable comments posted: 2
In nemo_skills/pipeline/judges/comet_judge.py:

```python
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
comet_model_path = judge_pipeline_args.get("judge_model")
```
🛠️ Refactor suggestion | 🟠 Major
Use direct dict access for required judge_pipeline_args keys.
output_dir, judge_model, and input_dir (inside the if input_file is None branch) are all expected to be present. Using .get() on them silently yields None, which then propagates into both get_remaining_jobs(output_dir=None, …) and the constructed command string (e.g., --output-dir None --comet-model-path None), producing confusing downstream failures instead of an immediate, clear KeyError. Per coding guidelines, use direct dict access dict[key] for required keys.
input_file correctly uses .get() since None is its legitimate "not-provided" sentinel.
♻️ Proposed fix

```diff
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
-    comet_model_path = judge_pipeline_args.get("judge_model")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # intentional sentinel
+    comet_model_path = judge_pipeline_args["judge_model"]

     if input_file is None:
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]
         script_args.append(f"--input-dir {input_dir}")
```

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."
Also applies to: 101-101, 104-105
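The distinction the review draws can be shown in isolation. A small sketch with toy values (not the real comet_judge.py code) of how direct indexing fails fast while .get() silently yields None:

```python
judge_pipeline_args = {"output_dir": "/results", "judge_model": "wmt22-comet-da"}

# Required keys: direct indexing raises a clear KeyError the moment one is missing.
output_dir_path = judge_pipeline_args["output_dir"]
comet_model_path = judge_pipeline_args["judge_model"]

# Optional key: .get() is appropriate because None is a legitimate "not provided" sentinel.
input_file = judge_pipeline_args.get("input_file")

# A missing required key surfaces immediately here, instead of becoming
# "--input-dir None" deep inside a generated command string.
try:
    input_dir = judge_pipeline_args["input_dir"]
except KeyError as err:
    print(f"missing required judge arg: {err}")
```

With .get(), the None would only be noticed much later, inside get_remaining_jobs or the scheduled command, which is exactly the confusing downstream failure the comment describes.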
In nemo_skills/pipeline/judges/nvembed_judge.py:

```python
output_dir_path = judge_pipeline_args.get("output_dir")
input_file = judge_pipeline_args.get("input_file")
```
🛠️ Refactor suggestion | 🟠 Major
Same .get() on required keys as in comet_judge.py.
output_dir and input_dir (inside the if input_file is None branch) are required fields. Using .get() silently produces None, which embeds --output-dir None / --input-dir None in the run command and passes None to get_remaining_jobs, rather than failing immediately with a clear KeyError.
♻️ Proposed fix

```diff
-    output_dir_path = judge_pipeline_args.get("output_dir")
-    input_file = judge_pipeline_args.get("input_file")
+    output_dir_path = judge_pipeline_args["output_dir"]
+    input_file = judge_pipeline_args.get("input_file")  # intentional sentinel

     if input_file is None:
-        input_dir = judge_pipeline_args.get("input_dir")
+        input_dir = judge_pipeline_args["input_dir"]
         script_args.append(f"--input-dir {input_dir}")
```

As per coding guidelines, "Do not use .get() for accessing dictionary keys if the code expects them to be present; use direct dictionary access dict[key] instead to allow proper error handling and fail fast with clear errors."
Also applies to: 100-100, 103-104
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
Lines 490-518: ⚠️ Potential issue | 🟡 Minor

has_tasks not set in the dynamic judge path — experiment may silently not run. The LLM path (line 521) explicitly sets has_tasks = True before creating judge tasks. The dynamic path omits this. If job_batches is empty (all eval outputs already exist, so the main loop never sets has_tasks = True) and auto_summarize_results=False, pipeline_utils.run_exp at line 653 is never called, and the experiment silently no-ops despite dynamic judge tasks having been created and appended to all_tasks.

🐛 Proposed fix

```diff
 if judge_creator_path:
-    # Use locate() to dynamically load judge creator function
-    from nemo_skills.dataset.utils import locate
-
+    has_tasks = True
     judge_creator_fn = locate(judge_creator_path)
```
🧹 Nitpick comments (1)

nemo_skills/pipeline/eval.py (1)

Lines 490-492: Hoist the locate import out of the benchmark loop. `from nemo_skills.dataset.utils import locate` is re-executed on every benchmark iteration that has a judge. Python's import cache makes it a no-op after the first call, but placing imports inside a loop is non-idiomatic and obscures the module's true dependencies.

♻️ Proposed fix

```diff
 from nemo_skills.pipeline.utils.eval import combine_cmds, prepare_eval_commands
+from nemo_skills.dataset.utils import locate
 from nemo_skills.utils import (
```

Then remove the inline import:

```diff
 if judge_creator_path:
-    # Use locate() to dynamically load judge creator function
-    from nemo_skills.dataset.utils import locate
-
-    judge_creator_fn = locate(judge_creator_path)
+    judge_creator_fn = locate(judge_creator_path)
```
Signed-off-by: Dheeraj Peri <dperi@nvidia.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
nemo_skills/pipeline/eval.py (1)
Lines 511-546: ⚠️ Potential issue | 🟡 Minor

has_tasks is not set for the dynamic judge path. Line 513 sets has_tasks = True only in the LLM judge branch. The dynamic judge branch (Lines 482–510) skips it. While main eval tasks will have already set has_tasks = True in practice, this is inconsistent and fragile if the flow is ever refactored.

Move the flag into the shared null-check block so both paths are covered:

Proposed fix

```diff
 else:
     # Use default LLM judge pipeline
-    has_tasks = True
     judge_tasks = _create_llm_judge_tasks(
         ...
     )

 # _generate can return None when there are no jobs to run (e.g., outputs already exist)
 # Only record and extend when tasks are present to avoid NoneType errors
 if judge_tasks:
+    has_tasks = True
     benchmark_to_judge_tasks[benchmark] = judge_tasks
     all_tasks.extend(judge_tasks)
```
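The control flow suggested above can be reduced to a toy function. Names mirror the review comment, but this is not the actual eval.py code:

```python
def schedule_judge_tasks(benchmark_to_creator_output):
    # Set has_tasks in the shared null-check so every creator path
    # (LLM or dynamic) is covered by the same line.
    has_tasks = False
    all_tasks = []
    for benchmark, judge_tasks in benchmark_to_creator_output.items():
        # creators can return None or [] when outputs already exist
        if judge_tasks:
            has_tasks = True
            all_tasks.extend(judge_tasks)
    return has_tasks, all_tasks
```

With the flag set here, an experiment whose only work is judge tasks still triggers run_exp, which is the failure mode the comment is guarding against.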
🧹 Nitpick comments (1)

nemo_skills/pipeline/eval.py (1)

Lines 482-510: The judge module interface is well-documented, but consider adding a discoverable reference at the call site. Both nvembed_judge.py and comet_judge.py define create_judge_tasks() functions with comprehensive docstrings documenting all parameters and return types, matching exactly the 20 kwargs passed in eval.py lines 489–510. The interface contract is properly specified in the judge modules themselves.

However, external authors looking at eval.py won't see these docstrings without examining the judge modules. Consider adding a brief reference comment near lines 481–484 (before the locate() call) pointing to the expected signature in the judge modules, or documenting the standard create_judge_tasks() interface in judges/__init__.py for improved discoverability.
Kipok left a comment

Could you please update the reference to judge_type in the docs here: /home/igitman/workspace/NeMo-Skills/docs/evaluation/multilingual.md?

Also, can you please re-create from a branch, so that our tests can run?
```python
benchmark_judge_type = judge_pipeline_args.pop("judge_type", judge_type)
# judge_step is a :: path to the judge creator function (locate() convention).
# Benchmarks set this directly in JUDGE_PIPELINE_ARGS; falls back to None for LLM judge.
judge_creator_path = judge_pipeline_args.pop("judge_step", None)
```
Can we have this as an explicit pipeline parameter, similar to how judge_type was? That way it will be simpler to update from the command line.
Renamed judge_step to judge_path and added it as a pipeline parameter.
```diff
 exp=exp,
 expname=expname,
-benchmark=benchmark,
+benchmarks=[benchmark],
```
What's the reason to have multiple benchmarks here?
No specific reason. Reverted this.