support structured outputs in hle judge for optional AA compatibility #1186
Conversation
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Greptile Summary

This PR adds structured output support to enable optional AA compatibility in the HLE benchmark implementation. The changes introduce a new `STRUCTURED_OUTPUTS` registry and a `structured_output` config option.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant User
participant GenerationTask
participant STRUCTURED_OUTPUTS
participant Model
participant LiteLLM
participant LLM API
User->>GenerationTask: configure structured_output="HLE_JUDGE_AA"
GenerationTask->>GenerationTask: process_single_datapoint()
GenerationTask->>STRUCTURED_OUTPUTS: lookup HLE_JUDGE_AA
STRUCTURED_OUTPUTS-->>GenerationTask: return HLEJudgeAAResponseFormat (Pydantic class)
GenerationTask->>Model: generate_async(response_format=HLEJudgeAAResponseFormat)
Model->>Model: _build_chat_request_params(response_format=...)
Model->>LiteLLM: acompletion(response_format=HLEJudgeAAResponseFormat)
LiteLLM->>LLM API: POST with structured output schema
LLM API-->>LiteLLM: JSON response matching schema
LiteLLM-->>Model: response
Model-->>GenerationTask: generation result
GenerationTask->>GenerationTask: postprocess_single_output()
GenerationTask->>GenerationTask: parse JSON and extract "correct" field
alt JSON parsing succeeds
GenerationTask->>GenerationTask: format as "Judgement: {correct}"
else JSON parsing fails
GenerationTask->>GenerationTask: fallback to "Judgement: FAILED_TO_PARSE"
end
GenerationTask-->>User: final output
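The JSON-parsing branch at the end of this flow can be sketched stand-alone (a minimal sketch mirroring the diagram; the function name is hypothetical, not the repo's actual helper):

```python
import json

def postprocess_hle_judge_aa(generation: str) -> str:
    """Extract the "correct" field from a structured judge response.

    Falls back to FAILED_TO_PARSE when the output is not valid JSON or
    lacks the expected key, mirroring the alt/else branch in the diagram.
    """
    try:
        return "Judgement: {}".format(json.loads(generation)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"

print(postprocess_hle_judge_aa('{"correct": "yes", "confidence": 90}'))  # Judgement: yes
print(postprocess_hle_judge_aa("not json"))  # Judgement: FAILED_TO_PARSE
```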
📝 Walkthrough

Adds structured output support (HLE_JUDGE_AA) throughout the inference pipeline by introducing a `STRUCTURED_OUTPUTS` registry and a `structured_output` config option.
Sequence Diagram

sequenceDiagram
participant Config as GenerationTaskConfig
participant Generate as generate.py
participant BaseModel as BaseModel
participant ModelImpl as Model Implementation
participant LiteLLM as LiteLLM/API
Config->>Generate: structured_output="HLE_JUDGE_AA"
Generate->>Generate: process_single_datapoint()
Generate->>BaseModel: generate_async(..., response_format=HLEJudgeAAResponseFormat)
BaseModel->>ModelImpl: _build_chat_request_params(..., response_format)
alt Supports structured (OpenAI/SGLang/VLLM chat)
ModelImpl->>ModelImpl: Include response_format in request dict
else Rejects structured (Gemini/Megatron)
ModelImpl->>ModelImpl: assert response_format is None
end
ModelImpl->>LiteLLM: Send request with response_format
LiteLLM-->>ModelImpl: JSON response {correct: "yes", ...}
ModelImpl-->>BaseModel: Response
BaseModel-->>Generate: Raw generation
Generate->>Generate: postprocess_single_output()
Generate->>Generate: Parse JSON, extract "correct" field
Generate-->>Config: Judgement: yes/no
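The `STRUCTURED_OUTPUTS` lookup shown above can be illustrated with a plain dict keyed by the config string (a sketch; in the PR the registry maps names to Pydantic classes, a placeholder class stands in here):

```python
class HLEJudgeAAResponseFormat:
    """Placeholder for the real Pydantic response-format class."""

# Registry mapping config strings to response-format classes.
STRUCTURED_OUTPUTS = {"HLE_JUDGE_AA": HLEJudgeAAResponseFormat}

def build_generation_params(structured_output):
    """Inject response_format only when a structured output is configured."""
    params = {}
    if structured_output is not None:
        # A plain indexing lookup raises KeyError on unknown names,
        # surfacing typos loudly (a concern raised in the review).
        params["response_format"] = STRUCTURED_OUTPUTS[structured_output]
    return params
```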
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
nemo_skills/inference/model/megatron.py (2)
39-53: Replace `assert` statements with explicit exceptions for parameter validation.

Both `_build_chat_request_params` and `_build_completion_request_params` use `assert` to validate the `tools` and `response_format` parameters. Since `assert` statements can be disabled with Python's `-O` flag, unsupported parameters would silently pass through, violating the guideline to "fail loudly" on invalid inputs. Other parameters in the same validation block correctly use explicit `if` + `raise NotImplementedError`, so match that pattern consistently.

Proposed fix:

```diff
- assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
- assert response_format is None, "Megatron server does not support response_format parameter."
+ if kwargs.get("tools") is not None:
+     raise NotImplementedError("Megatron server does not support tools parameter.")
+ if response_format is not None:
+     raise NotImplementedError("Megatron server does not support response_format parameter.")
```

Apply this fix to both methods.

86-100: Replace `assert` statements with explicit exceptions for parameter validation.

Both the `tools` and `response_format` parameters use `assert` statements, which can be disabled with Python's `-O` flag, causing silent failures. This violates the coding guideline to "Let the code fail with clear errors instead of silently misbehaving" and "Avoid silently ignoring unused user-passed parameters."

This same pattern appears in two methods (around lines 51-52 and 98-99). Replace all four `assert` statements with explicit `if` + `raise NotImplementedError()` to match the approach used for other unsupported parameters (`stream`, `min_p`, `repetition_penalty`, `top_k`).

Example fix for lines 98-99:

```diff
- assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
- assert response_format is None, "Megatron server does not support response_format parameter."
+ if kwargs.get("tools") is not None:
+     raise NotImplementedError("Megatron server does not support tools parameter.")
+ if response_format is not None:
+     raise NotImplementedError("Megatron server does not support response_format parameter.")
```
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/generate.py`:
- Around line 695-696: The code silently ignores unknown structured_output
values; add a validation in GenerationTaskConfig.__post_init__ (or call a helper
_post_init_validate_params from __post_init__) that checks if
self.structured_output is not None and not in STRUCTURED_OUTPUTS and raise a
ValueError listing the invalid value and valid keys (referencing
STRUCTURED_OUTPUTS and the attribute structured_output); this ensures
process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.
In `@nemo_skills/inference/structured_outputs.py`:
- Around line 1-2: Add the standard NVIDIA copyright header at the very top of
the module (above the imports) in nemo_skills/inference/structured_outputs.py so
the file begins with the required multi-line copyright notice; do not alter the
existing imports (from typing import Literal, from pydantic import
BaseModel)—just prepend the header block exactly as the project's canonical
NVIDIA header.
- Around line 5-10: The HLEJudgeAAResponseFormat model wrongly includes a
non-response field strict: Literal[True]; remove the strict attribute from the
class so the model only defines extracted_final_answer, reasoning, correct, and
confidence, and then remove any now-unused imports (e.g., Literal[True] or
Literal if no longer needed); ensure any strict:true configuration is applied at
the OpenAI request/schema configuration level rather than as a field on
HLEJudgeAAResponseFormat.
🧹 Nitpick comments (2)
nemo_skills/inference/model/base.py (1)
239-239: Consider adding a type annotation for consistency.

The `response_format` parameter lacks a type annotation while other parameters in this method have them. Consider adding a type hint for consistency.

Proposed fix:

```diff
- response_format = None,
+ response_format: dict | None = None,
```

nemo_skills/inference/generate.py (1)
636-642: Remove unused exception variable and consider logging the failure.

The exception variable `e` is assigned but never used (also flagged by static analysis). Additionally, silently setting `FAILED_TO_PARSE` without logging could make debugging difficult when generation fails to parse.

Proposed fix:

```diff
  if self.cfg.structured_output == "HLE_JUDGE_AA":
      try:
          output[self.cfg.generation_key] = "Judgement: {}".format(
              json.loads(output[self.cfg.generation_key])["correct"]
          )
-     except json.JSONDecodeError as e:
+     except json.JSONDecodeError:
+         LOG.warning(
+             "Failed to parse structured output as JSON: %s",
+             output[self.cfg.generation_key][:200] if output[self.cfg.generation_key] else "<empty>",
+         )
          output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
```
nemo_skills/inference/generate.py (Outdated)

```python
if self.cfg.structured_output in STRUCTURED_OUTPUTS:
    generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
```
Consider validating structured_output against registry early.
If a user specifies a structured_output value that's not in STRUCTURED_OUTPUTS, the code silently ignores it without injecting response_format. This could lead to unexpected behavior. Per coding guidelines, the code should fail if a user specifies an unsupported argument.
Proposed fix in `__post_init__` or `process_single_datapoint`
Add validation in GenerationTaskConfig.__post_init__:
def _post_init_validate_params(self):
# ... existing validations ...
if self.structured_output is not None and self.structured_output not in STRUCTURED_OUTPUTS:
raise ValueError(
f"Unknown structured_output '{self.structured_output}'. "
f"Valid options: {list(STRUCTURED_OUTPUTS.keys())}"
        )
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
nemo_skills/inference/generate.py (Outdated)

```python
if self.cfg.structured_output == "HLE_JUDGE_AA":
    try:
        output[self.cfg.generation_key] = "Judgement: {}".format(
            json.loads(output[self.cfg.generation_key])["correct"]
        )
    except (json.JSONDecodeError, KeyError):
        output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"
```
Hardcoded check for "HLE_JUDGE_AA" creates an inconsistency with line 695, which uses `in STRUCTURED_OUTPUTS`. If new structured output formats are added to `STRUCTURED_OUTPUTS`, they'll set `response_format` but won't have corresponding postprocessing logic. Consider using `self.cfg.structured_output in STRUCTURED_OUTPUTS` here or creating a registry of postprocessing handlers.
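The registry-of-handlers alternative suggested here could look like this (purely illustrative; none of these handler names exist in the repo):

```python
import json

def _postprocess_hle_judge_aa(text):
    try:
        return "Judgement: {}".format(json.loads(text)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"

# One entry per structured output format that needs postprocessing.
POSTPROCESSORS = {"HLE_JUDGE_AA": _postprocess_hle_judge_aa}

def postprocess_generation(structured_output, text):
    """Dispatch on the configured format; unknown/None formats pass through."""
    handler = POSTPROCESSORS.get(structured_output)
    return handler(text) if handler else text
```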
jiacheng-xu
left a comment
I would request @gwarmstrong to review and leave some comments here since it's changing a core function / logic in generation flow.
I could be wrong, but some thoughts I have after reviewing the changes:
- The naming of "response_format" is vague. It could be image or text, or it could be text or JSON. It may need renaming and more docs.
- The use of response_format might change the behavior of EndpointType, and vice versa. There need to be more test cases.
- Test cases needed for at least one example. MathReasoning from https://platform.openai.com/docs/guides/structured-outputs?example=structured-data is good.
- It is a broad feature and not only for HLE_JUDGE_AA.
nemo_skills/inference/generate.py (Outdated)

```python
# all of the original data to the output file alongside the new generations
output[self.cfg.generation_key] = output.pop("generation")

if self.cfg.structured_output == "HLE_JUDGE_AA":
```
It is not a good idea to hard-code HLE_JUDGE_AA in generate.py.
Can we build a function to handle that, like:
Skills/nemo_skills/inference/generate.py, line 658 in 54c8bc0
@anowaczynski-nvidia can we move this logic into metrics? Why does it need to be in the generation?
Reasons I added the `if` with postprocessing here:
- to enable the AA-compatible HLE judge, `++structured_output=HLE_JUDGE_AA` needs to be added only in one place (the judge generations pipeline command)
- with the current version, the `summarize_results` command and the pipeline logic for aggregating hle judge outputs into metrics don't require any modifications (the same command + code handles both default and AA-compatible judges)
I am aware this code is fundamental to the entire package, all generations pass through it.
Regarding moving this to metrics: I see the possibility to create hleaa_metrics.py in evaluation/metrics, inherit from MathMetrics, and override only _get_score_dict, such that postprocessing of judgement (parsing into json etc) is applied before is_correct_judgement. Do you approve this plan?
Yes, either that or we can just have this as option for main math metrics, so that any dataset, not just HLE can be evaluated in this setup. The one problem is I am not fully sure if metrics are currently customizable, but I guess if not, then we should enable customization in a similar way to how it's done for eval / generation parameters. Let me know if you need help with the design on that, happy to discuss in more details
@Kipok I tried the hard way first, but nothing I created was correct and convincing, so I pushed one commit with the class HLEAAMetrics(MathMetrics) solution as it was conceptually much simpler. The main downside is that I had to add metric_type to the eval command. It doesn't look right here. It doesn't compose with the eval-on-multiple-benchmarks idea. Can you take a look? If we're doing the Metrics Config idea, I need a sync on how to approach it.
I think this is the right approach. When doing eval on multiple benchmarks you can't really customize anything except maybe inference parameters. E.g. doing prompt change or eval arguments will also break things, so I think adding metric_type is a good change. An alternative would be to add this as an argument to MathMetrics and then you can reuse existing metric_kwargs parameter to customize it. But adding metric_type is a good change anyway given that we support metric_kwargs already.
If the current implementation fully works for you, I think it LGTM as well and we can merge it. But do let me know if you have any concerns or think we should do things differently
It's probably a good idea to add a new test for this in test_generation.py, but only if models on build.nvidia.com support this response_format argument
added test_judge_generations_with_structured_output but it takes 10 minutes to complete even with max_samples=2, obviously this can't be merged, but where do we go from here?
thanks @anowaczynski-nvidia - pushed a change to limit max tokens (since we aren't checking generation correctness anyway), seems to finish very fast now!
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
```python
extracted_final_answer: str
reasoning: str
correct: Literal["yes", "no"]
confidence: int
```
`confidence` field has no validation constraints. Should be `confidence: int = Field(ge=0, le=100)` or similar to ensure valid confidence values.

```diff
- confidence: int
+ confidence: int = Field(ge=0, le=100, description="Confidence score from 0 to 100")
```
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
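For illustration, the effect of the suggested `Field(ge=0, le=100)` constraint can be mimicked with a stdlib dataclass (a sketch; in the PR the fields live on a Pydantic `BaseModel`, where `Field` expresses the bounds declaratively):

```python
from dataclasses import dataclass

@dataclass
class JudgeResponse:
    extracted_final_answer: str
    reasoning: str
    correct: str       # "yes" or "no" (a Literal in the real model)
    confidence: int    # 0-100, enforced below

    def __post_init__(self):
        if self.correct not in ("yes", "no"):
            raise ValueError(f"correct must be 'yes' or 'no', got {self.correct!r}")
        if not 0 <= self.confidence <= 100:
            raise ValueError(f"confidence must be in [0, 100], got {self.confidence}")
```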
Kipok
left a comment
lgtm as long as the tests pass
Signed-off-by: Igor Gitman <igitman@nvidia.com>
```python
def test_judge_generations_with_structured_output(tmp_path):
    cmd = (
        f"ns eval "
        f" --server_type=openai "
        f" --model=nvidia/nemotron-3-nano-30b-a3b "
        f" --server_address=https://integrate.api.nvidia.com/v1 "
        f" --benchmarks=hle "
        f" --output_dir={tmp_path} "
        f" --judge_model=nvidia/nemotron-3-nano-30b-a3b "
        f" --judge_server_address=https://integrate.api.nvidia.com/v1 "
        f" --judge_server_type=openai "
        f" --metric_type=hle-aa "
        f' --extra_judge_args="++structured_output=HLE_JUDGE_AA" '
        f" ++max_samples=2 "
        f" ++inference.tokens_to_generate=1024 "  # to make test go fast
    )
    subprocess.run(cmd, shell=True, check=True)
```
Networked integration test
test_judge_generations_with_structured_output shells out to ns eval with real external endpoints (https://integrate.api.nvidia.com/v1) and a specific hosted model. This will fail in CI/offline test environments (no credentials / no network), so the PR will become unmergeable due to nondeterministic test failures. This should be rewritten as a unit/integration test that mocks the model call or uses the existing local test server fixtures, or it should be gated/marked as an opt-in test (e.g., skipped unless an env var is set).
```python
if self.cfg.structured_output is not None:
    generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]
```
Unhandled invalid key
When structured_output is set to any non-None value that is not present in STRUCTURED_OUTPUTS, process_single_datapoint will throw a KeyError at STRUCTURED_OUTPUTS[self.cfg.structured_output]. Since this is a user-provided config value (Hydra/CLI via ++structured_output=...), this becomes an unhelpful crash path. Consider validating structured_output in GenerationTaskConfig.__post_init__ (or using .get() with an explicit ValueError listing allowed keys) so users get a clear error message.
…#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: George Armstrong <georgea@nvidia.com>
…#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> Signed-off-by: dgitman <dgitman@nvidia.com>
Why?
Enable optional AA compatibility in the HLE benchmark implementation.
What?
Support structured output in generations. See https://platform.openai.com/docs/guides/structured-outputs
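As a rough illustration of what a predefined response format looks like, the sketch below defines a Pydantic model like the `HLEJudgeAAResponseFormat` class this PR adds and validates a judge generation against it. The field names here are assumptions for illustration; the actual schema lives in `nemo_skills/inference/structured_outputs.py`.

```python
# Hypothetical sketch of a structured-output response format for the HLE judge.
# Field names are illustrative assumptions, not the repo's actual schema.
from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: str  # "yes" / "no"
    confidence: int


# A provider that supports structured outputs is asked to emit JSON matching
# this schema, e.g. by passing the class as `response_format` to the request.
raw = '{"extracted_final_answer": "42", "reasoning": "matches", "correct": "yes", "confidence": 95}'
judgement = HLEJudgeAAResponseFormat.model_validate_json(raw)
print(judgement.correct)  # -> yes
```

Passing the Pydantic class itself (rather than a raw JSON schema dict) lets the client library derive the schema and validate the response in one place.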
How?
- `nemo_skills/inference/structured_outputs.py` with predefined response formats for structured outputs
- `structured_output`: `str`, default `None`
- `process_single_datapoint`: before generation, add `response_format` to `generation_params` based on `self.cfg.structured_output` if not `None`
- `postprocess_single_output`: parse generation and extract `correct` field in the expected judgement format
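The postprocessing step above can be sketched as follows. This is a minimal stand-in, not the repo's actual code: it parses the structured JSON generation, extracts the `correct` field, and falls back to the `FAILED_TO_PARSE` marker when parsing fails, matching the fallback behavior described in the review summary.

```python
# Hypothetical sketch of postprocess_single_output for the structured HLE judge.
import json


def postprocess_single_output(generation: str) -> str:
    try:
        # structured outputs should yield valid JSON with a "correct" field
        correct = json.loads(generation)["correct"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # mirror the fallback: mark generations that don't parse as expected
        correct = "FAILED_TO_PARSE"
    return f"Judgement: {correct}"


print(postprocess_single_output('{"correct": "yes"}'))  # Judgement: yes
print(postprocess_single_output("not json at all"))     # Judgement: FAILED_TO_PARSE
```

Converting back to the plain `Judgement: ...` string keeps the structured-output path compatible with downstream metric code that expects the existing judgement format.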