support structured outputs in hle judge for optional AA compatibility #1186
Conversation
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Greptile Summary

This PR adds structured output support to enable optional AA compatibility in the HLE benchmark implementation. The changes introduce a new `STRUCTURED_OUTPUTS` registry and a `structured_output` config option.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant GenerationTask
participant STRUCTURED_OUTPUTS
participant Model
participant LiteLLM
participant LLM API
User->>GenerationTask: configure structured_output="HLE_JUDGE_AA"
GenerationTask->>GenerationTask: process_single_datapoint()
GenerationTask->>STRUCTURED_OUTPUTS: lookup HLE_JUDGE_AA
STRUCTURED_OUTPUTS-->>GenerationTask: return HLEJudgeAAResponseFormat (Pydantic class)
GenerationTask->>Model: generate_async(response_format=HLEJudgeAAResponseFormat)
Model->>Model: _build_chat_request_params(response_format=...)
Model->>LiteLLM: acompletion(response_format=HLEJudgeAAResponseFormat)
LiteLLM->>LLM API: POST with structured output schema
LLM API-->>LiteLLM: JSON response matching schema
LiteLLM-->>Model: response
Model-->>GenerationTask: generation result
GenerationTask->>GenerationTask: postprocess_single_output()
GenerationTask->>GenerationTask: parse JSON and extract "correct" field
alt JSON parsing succeeds
GenerationTask->>GenerationTask: format as "Judgement: {correct}"
else JSON parsing fails
GenerationTask->>GenerationTask: fallback to "Judgement: FAILED_TO_PARSE"
end
GenerationTask-->>User: final output
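The JSON-parsing branch at the end of this flow can be sketched stand-alone (a minimal sketch mirroring the diagram; the function name is hypothetical, not the repo's actual helper):

```python
import json

def postprocess_hle_judge_aa(generation: str) -> str:
    """Extract the "correct" field from a structured judge response.

    Falls back to FAILED_TO_PARSE when the output is not valid JSON or
    lacks the expected key, mirroring the alt/else branch in the diagram.
    """
    try:
        return "Judgement: {}".format(json.loads(generation)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"

print(postprocess_hle_judge_aa('{"correct": "yes", "confidence": 90}'))  # Judgement: yes
print(postprocess_hle_judge_aa("not json"))  # Judgement: FAILED_TO_PARSE
```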
📝 Walkthrough

Adds structured output support (HLE_JUDGE_AA) throughout the inference pipeline by introducing a `STRUCTURED_OUTPUTS` registry and a `structured_output` config option.
Sequence Diagram

sequenceDiagram
participant Config as GenerationTaskConfig
participant Generate as generate.py
participant BaseModel as BaseModel
participant ModelImpl as Model Implementation
participant LiteLLM as LiteLLM/API
Config->>Generate: structured_output="HLE_JUDGE_AA"
Generate->>Generate: process_single_datapoint()
Generate->>BaseModel: generate_async(..., response_format=HLEJudgeAAResponseFormat)
BaseModel->>ModelImpl: _build_chat_request_params(..., response_format)
alt Supports structured (OpenAI/SGLang/VLLM chat)
ModelImpl->>ModelImpl: Include response_format in request dict
else Rejects structured (Gemini/Megatron)
ModelImpl->>ModelImpl: assert response_format is None
end
ModelImpl->>LiteLLM: Send request with response_format
LiteLLM-->>ModelImpl: JSON response {correct: "yes", ...}
ModelImpl-->>BaseModel: Response
BaseModel-->>Generate: Raw generation
Generate->>Generate: postprocess_single_output()
Generate->>Generate: Parse JSON, extract "correct" field
Generate-->>Config: Judgement: yes/no
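The `STRUCTURED_OUTPUTS` lookup shown above can be illustrated with a plain dict keyed by the config string (a sketch; in the PR the registry maps names to Pydantic classes, a placeholder class stands in here):

```python
class HLEJudgeAAResponseFormat:
    """Placeholder for the real Pydantic response-format class."""

# Registry mapping config strings to response-format classes.
STRUCTURED_OUTPUTS = {"HLE_JUDGE_AA": HLEJudgeAAResponseFormat}

def build_generation_params(structured_output):
    """Inject response_format only when a structured output is configured."""
    params = {}
    if structured_output is not None:
        # A plain indexing lookup raises KeyError on unknown names,
        # surfacing typos loudly (a concern raised in the review).
        params["response_format"] = STRUCTURED_OUTPUTS[structured_output]
    return params
```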
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
nemo_skills/inference/model/megatron.py (2)
39-53: Replace `assert` statements with explicit exceptions for parameter validation.

Both `_build_chat_request_params` and `_build_completion_request_params` use `assert` to validate the `tools` and `response_format` parameters. Since `assert` statements can be disabled with Python's `-O` flag, unsupported parameters would silently pass through, violating the guideline to "fail loudly" on invalid inputs. Other parameters in the same validation block correctly use explicit `if` + `raise NotImplementedError`, so match that pattern consistently.

Proposed fix:

```diff
- assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
- assert response_format is None, "Megatron server does not support response_format parameter."
+ if kwargs.get("tools") is not None:
+     raise NotImplementedError("Megatron server does not support tools parameter.")
+ if response_format is not None:
+     raise NotImplementedError("Megatron server does not support response_format parameter.")
```

Apply this fix to both methods.

86-100: Replace `assert` statements with explicit exceptions for parameter validation.

Both the `tools` and `response_format` parameters use `assert` statements, which can be disabled with Python's `-O` flag, causing silent failures. This violates the coding guideline to "Let the code fail with clear errors instead of silently misbehaving" and "Avoid silently ignoring unused user-passed parameters."

This same pattern appears in two methods (around lines 51-52 and 98-99). Replace all four `assert` statements with explicit `if` + `raise NotImplementedError()` to match the approach used for other unsupported parameters (`stream`, `min_p`, `repetition_penalty`, `top_k`).

Example fix for lines 98-99:

```diff
- assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
- assert response_format is None, "Megatron server does not support response_format parameter."
+ if kwargs.get("tools") is not None:
+     raise NotImplementedError("Megatron server does not support tools parameter.")
+ if response_format is not None:
+     raise NotImplementedError("Megatron server does not support response_format parameter.")
```
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/generate.py`:
- Around line 695-696: The code silently ignores unknown structured_output
values; add a validation in GenerationTaskConfig.__post_init__ (or call a helper
_post_init_validate_params from __post_init__) that checks if
self.structured_output is not None and not in STRUCTURED_OUTPUTS and raise a
ValueError listing the invalid value and valid keys (referencing
STRUCTURED_OUTPUTS and the attribute structured_output); this ensures
process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.
In `@nemo_skills/inference/structured_outputs.py`:
- Around line 1-2: Add the standard NVIDIA copyright header at the very top of
the module (above the imports) in nemo_skills/inference/structured_outputs.py so
the file begins with the required multi-line copyright notice; do not alter the
existing imports (from typing import Literal, from pydantic import
BaseModel)—just prepend the header block exactly as the project's canonical
NVIDIA header.
- Around line 5-10: The HLEJudgeAAResponseFormat model wrongly includes a
non-response field strict: Literal[True]; remove the strict attribute from the
class so the model only defines extracted_final_answer, reasoning, correct, and
confidence, and then remove any now-unused imports (e.g., Literal[True] or
Literal if no longer needed); ensure any strict:true configuration is applied at
the OpenAI request/schema configuration level rather than as a field on
HLEJudgeAAResponseFormat.
🧹 Nitpick comments (2)
nemo_skills/inference/model/base.py (1)
239-239: Consider adding a type annotation for consistency.

The `response_format` parameter lacks a type annotation while other parameters in this method have them. Consider adding a type hint for consistency.

Proposed fix:

```diff
- response_format = None,
+ response_format: dict | None = None,
```

nemo_skills/inference/generate.py (1)
636-642: Remove unused exception variable and consider logging the failure.

The exception variable `e` is assigned but never used (also flagged by static analysis). Additionally, silently setting `FAILED_TO_PARSE` without logging could make debugging difficult when generation fails to parse.

Proposed fix:

```diff
  if self.cfg.structured_output == "HLE_JUDGE_AA":
      try:
          output[self.cfg.generation_key] = "Judgement: {}".format(
              json.loads(output[self.cfg.generation_key])["correct"]
          )
-     except json.JSONDecodeError as e:
+     except json.JSONDecodeError:
+         LOG.warning(
+             "Failed to parse structured output as JSON: %s",
+             output[self.cfg.generation_key][:200] if output[self.cfg.generation_key] else "<empty>",
+         )
          output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
```
nemo_skills/inference/generate.py (Outdated)

```python
if self.cfg.structured_output in STRUCTURED_OUTPUTS:
    generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
```
Consider validating structured_output against registry early.
If a user specifies a structured_output value that's not in STRUCTURED_OUTPUTS, the code silently ignores it without injecting response_format. This could lead to unexpected behavior. Per coding guidelines, the code should fail if a user specifies an unsupported argument.
Proposed fix in `__post_init__` or `process_single_datapoint`
Add validation in GenerationTaskConfig.__post_init__:
def _post_init_validate_params(self):
# ... existing validations ...
if self.structured_output is not None and self.structured_output not in STRUCTURED_OUTPUTS:
raise ValueError(
f"Unknown structured_output '{self.structured_output}'. "
f"Valid options: {list(STRUCTURED_OUTPUTS.keys())}"
        )
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
nemo_skills/inference/generate.py (Outdated)

```python
if self.cfg.structured_output == "HLE_JUDGE_AA":
    try:
        output[self.cfg.generation_key] = "Judgement: {}".format(
            json.loads(output[self.cfg.generation_key])["correct"]
        )
    except (json.JSONDecodeError, KeyError):
        output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
```
Hardcoded check for "HLE_JUDGE_AA" creates an inconsistency with line 695, which uses `in STRUCTURED_OUTPUTS`. If new structured output formats are added to `STRUCTURED_OUTPUTS`, they'll set `response_format` but won't have corresponding postprocessing logic. Consider using `self.cfg.structured_output in STRUCTURED_OUTPUTS` here or creating a registry of postprocessing handlers.
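The registry-of-handlers alternative suggested here could look like this (purely illustrative; none of these handler names exist in the repo):

```python
import json

def _postprocess_hle_judge_aa(text):
    try:
        return "Judgement: {}".format(json.loads(text)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"

# One entry per structured output format that needs postprocessing.
POSTPROCESSORS = {"HLE_JUDGE_AA": _postprocess_hle_judge_aa}

def postprocess_generation(structured_output, text):
    """Dispatch on the configured format; unknown/None formats pass through."""
    handler = POSTPROCESSORS.get(structured_output)
    return handler(text) if handler else text
```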
jiacheng-xu
left a comment
I would request @gwarmstrong to review and leave some comments here since it's changing a core function / logic in generation flow.
I could be wrong, but some thoughts I have after reviewing the changes:
- The naming of "response_format" is vague. It could be image or text, or it could be text or JSON. It may need renaming and more docs.
- The use of response_format might change the behavior of EndpointType, and vice versa. There need to be more test cases.
- Test cases needed for at least one example. MathReasoning from https://platform.openai.com/docs/guides/structured-outputs?example=structured-data is good.
- It is a broad feature and not only for HLE_JUDGE_AA.
nemo_skills/inference/generate.py (Outdated)

```python
# all of the original data to the output file alongside the new generations
output[self.cfg.generation_key] = output.pop("generation")

if self.cfg.structured_output == "HLE_JUDGE_AA":
```
It is not a good idea to hard-code HLE_JUDGE_AA in generate.py.
Can we build a function to handle that, like:
Skills/nemo_skills/inference/generate.py, line 658 in 54c8bc0
@anowaczynski-nvidia can we move this logic into metrics? Why does it need to be in the generation?
Reasons I added the `if` with postprocessing here:
- to enable the AA-compatible HLE judge, `++structured_output=HLE_JUDGE_AA` needs to be added only in one place (the judge generations pipeline command)
- with the current version, the `summarize_results` command and the pipeline logic for aggregating hle judge outputs into metrics don't require any modifications (the same command + code handles both default and AA-compatible judges)
I am aware this code is fundamental to the entire package, all generations pass through it.
Regarding moving this to metrics: I see the possibility to create hleaa_metrics.py in evaluation/metrics, inherit from MathMetrics, and override only _get_score_dict, such that postprocessing of judgement (parsing into json etc) is applied before is_correct_judgement. Do you approve this plan?
Yes, either that or we can just have this as option for main math metrics, so that any dataset, not just HLE can be evaluated in this setup. The one problem is I am not fully sure if metrics are currently customizable, but I guess if not, then we should enable customization in a similar way to how it's done for eval / generation parameters. Let me know if you need help with the design on that, happy to discuss in more details
@Kipok I tried the hard way first, but nothing I created was correct and convincing, so I pushed one commit with the class HLEAAMetrics(MathMetrics) solution as it was conceptually much simpler. The main downside is that I had to add metric_type to the eval command. It doesn't look right here. It doesn't compose with the eval-on-multiple-benchmarks idea. Can you take a look? If we're doing the Metrics Config idea, I need a sync on how to approach it.
I think this is the right approach. When doing eval on multiple benchmarks you can't really customize anything except maybe inference parameters. E.g. doing prompt change or eval arguments will also break things, so I think adding metric_type is a good change. An alternative would be to add this as an argument to MathMetrics and then you can reuse existing metric_kwargs parameter to customize it. But adding metric_type is a good change anyway given that we support metric_kwargs already.
If the current implementation fully works for you, I think it LGTM as well and we can merge it. But do let me know if you have any concerns or think we should do things differently
It's probably a good idea to add a new test for this in test_generation.py, but only if models on build.nvidia.com support this response_format argument
added test_judge_generations_with_structured_output but it takes 10 minutes to complete even with max_samples=2, obviously this can't be merged, but where do we go from here?
thanks @anowaczynski-nvidia - pushed a change to limit max tokens (since we aren't checking generation correctness anyway), seems to finish very fast now!
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
```python
extracted_final_answer: str
reasoning: str
correct: Literal["yes", "no"]
confidence: int
```
`confidence` field has no validation constraints. Should be `confidence: int = Field(ge=0, le=100)` or similar to ensure valid confidence values.

```diff
- confidence: int
+ confidence: int = Field(ge=0, le=100, description="Confidence score from 0 to 100")
```
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
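For illustration, the effect of the suggested `Field(ge=0, le=100)` constraint can be mimicked with a stdlib dataclass (a sketch; in the PR the fields live on a Pydantic `BaseModel`, where `Field` expresses the bounds declaratively):

```python
from dataclasses import dataclass

@dataclass
class JudgeResponse:
    extracted_final_answer: str
    reasoning: str
    correct: str       # "yes" or "no" (a Literal in the real model)
    confidence: int    # 0-100, enforced below

    def __post_init__(self):
        if self.correct not in ("yes", "no"):
            raise ValueError(f"correct must be 'yes' or 'no', got {self.correct!r}")
        if not 0 <= self.confidence <= 100:
            raise ValueError(f"confidence must be in [0, 100], got {self.confidence}")
```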
Kipok
left a comment
lgtm as long as the tests pass
Signed-off-by: Igor Gitman <igitman@nvidia.com>
```python
def test_judge_generations_with_structured_output(tmp_path):
    cmd = (
        f"ns eval "
        f" --server_type=openai "
        f" --model=nvidia/nemotron-3-nano-30b-a3b "
        f" --server_address=https://integrate.api.nvidia.com/v1 "
        f" --benchmarks=hle "
        f" --output_dir={tmp_path} "
        f" --judge_model=nvidia/nemotron-3-nano-30b-a3b "
        f" --judge_server_address=https://integrate.api.nvidia.com/v1 "
        f" --judge_server_type=openai "
        f" --metric_type=hle-aa "
        f' --extra_judge_args="++structured_output=HLE_JUDGE_AA" '
        f" ++max_samples=2 "
        f" ++inference.tokens_to_generate=1024 "  # to make test go fast
    )
    subprocess.run(cmd, shell=True, check=True)
```
Networked integration test
test_judge_generations_with_structured_output shells out to ns eval with real external endpoints (https://integrate.api.nvidia.com/v1) and a specific hosted model. This will fail in CI/offline test environments (no credentials / no network), so the PR will become unmergeable due to nondeterministic test failures. This should be rewritten as a unit/integration test that mocks the model call or uses the existing local test server fixtures, or it should be gated/marked as an opt-in test (e.g., skipped unless an env var is set).
```python
if self.cfg.structured_output is not None:
    generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
```
Unhandled invalid key
When structured_output is set to any non-None value that is not present in STRUCTURED_OUTPUTS, process_single_datapoint will throw a KeyError at STRUCTURED_OUTPUTS[self.cfg.structured_output]. Since this is a user-provided config value (Hydra/CLI via ++structured_output=...), this becomes an unhelpful crash path. Consider validating structured_output in GenerationTaskConfig.__post_init__ (or using .get() with an explicit ValueError listing allowed keys) so users get a clear error message.
…#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
…#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Why?
Enable optional AA compatibility in the HLE benchmark implementation.
What?
Support structured output in generations. See https://platform.openai.com/docs/guides/structured-outputs
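As a rough illustration of what a predefined response format looks like, the sketch below defines a Pydantic model like the `HLEJudgeAAResponseFormat` class this PR adds and validates a judge generation against it. The field names here are assumptions for illustration; the actual schema lives in `nemo_skills/inference/structured_outputs.py`.

```python
# Hypothetical sketch of a structured-output response format for the HLE judge.
# Field names are illustrative assumptions, not the repo's actual schema.
from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: str  # "yes" / "no"
    confidence: int


# A provider that supports structured outputs is asked to emit JSON matching
# this schema, e.g. by passing the class as `response_format` to the request.
raw = '{"extracted_final_answer": "42", "reasoning": "matches", "correct": "yes", "confidence": 95}'
judgement = HLEJudgeAAResponseFormat.model_validate_json(raw)
print(judgement.correct)  # -> yes
```

Passing the Pydantic class itself (rather than a raw JSON schema dict) lets the client library derive the schema and validate the response in one place.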
How?
- `nemo_skills/inference/structured_outputs.py` with predefined response formats for structured outputs
- `structured_output`: `str`, default `None`
- `process_single_datapoint`: before generation, add `response_format` to `generation_params` based on `self.cfg.structured_output` if not `None`
- `postprocess_single_output`: parse generation and extract `correct` field in the expected judgement format
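The postprocessing step above can be sketched as follows. This is a minimal stand-in, not the repo's actual code: it parses the structured JSON generation, extracts the `correct` field, and falls back to the `FAILED_TO_PARSE` marker when parsing fails, matching the fallback behavior described in the review summary.

```python
# Hypothetical sketch of postprocess_single_output for the structured HLE judge.
import json


def postprocess_single_output(generation: str) -> str:
    try:
        # structured outputs should yield valid JSON with a "correct" field
        correct = json.loads(generation)["correct"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # mirror the fallback: mark generations that don't parse as expected
        correct = "FAILED_TO_PARSE"
    return f"Judgement: {correct}"


print(postprocess_single_output('{"correct": "yes"}'))  # Judgement: yes
print(postprocess_single_output("not json at all"))     # Judgement: FAILED_TO_PARSE
```

Converting back to the plain `Judgement: ...` string keeps the structured-output path compatible with downstream metric code that expects the existing judgement format.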