
support structured outputs in hle judge for optional AA compatibility#1186

Merged
Kipok merged 14 commits into main from anowaczynski/hle-judge-aa-structured-outputs
Feb 7, 2026

Conversation

@anowaczynski-nvidia (Collaborator) commented Jan 26, 2026

Why?
Enable optional AA-compatibility in HLE benchmark implementation.

What?
Support structured output in generations. See https://platform.openai.com/docs/guides/structured-outputs

How?

  • new file nemo_skills/inference/structured_outputs.py with predefined response formats for structured outputs
  • new parameter in GenerationTaskConfig: structured_output (str, default None)
  • in process_single_datapoint, before generation: add response_format to generation_params based on self.cfg.structured_output if not None
  • in postprocess_single_output: parse the generation and extract the correct field in the expected judgement format
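The first two bullets can be sketched as a hedged example. The class and registry names (HLEJudgeAAResponseFormat, STRUCTURED_OUTPUTS) match the review summaries in this thread, but the exact field set is illustrative:

```python
# Sketch of nemo_skills/inference/structured_outputs.py as described in this PR.
# Field names follow the snippet quoted later in the review; treat details as
# illustrative rather than the repo's exact code.
from typing import Literal

from pydantic import BaseModel


class HLEJudgeAAResponseFormat(BaseModel):
    """Schema the judge model must follow when structured output is enabled."""

    extracted_final_answer: str
    reasoning: str
    correct: Literal["yes", "no"]
    confidence: int


# Registry consulted via cfg.structured_output before generation.
STRUCTURED_OUTPUTS = {"HLE_JUDGE_AA": HLEJudgeAAResponseFormat}
```

With this registry in place, enabling the AA-compatible judge is a single config flag (++structured_output=HLE_JUDGE_AA).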

Summary by CodeRabbit

  • New Features
    • Added structured output support to inference generation, enabling formatted JSON responses from compatible models
    • Automatic response parsing and field extraction for improved result handling
    • Response format configuration with consistent API across supported model providers


Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@anowaczynski-nvidia anowaczynski-nvidia self-assigned this Jan 26, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

This PR adds structured output support to enable optional AA-compatibility in HLE benchmark implementation. The changes introduce a new structured_output configuration parameter that allows specifying predefined response formats (currently supporting HLE_JUDGE_AA).

Key Changes:

  • New file structured_outputs.py defines Pydantic models for structured response formats
  • GenerationTaskConfig adds structured_output parameter (str, default None)
  • Before generation: response_format is added to generation params based on the configured structured output
  • After generation: JSON response is parsed to extract the correct field and format as "Judgement: {correct}"
  • All model implementations updated to support or reject response_format parameter appropriately

Architecture:
The implementation follows a clean separation of concerns: structured output schemas are defined centrally in structured_outputs.py, the generation task coordinates the flow, and individual model adapters handle provider-specific support. Error handling includes both JSONDecodeError and KeyError catching with a fallback to "Judgement: FAILED_TO_PARSE".

Provider Support:

  • OpenAI, VLLM, SGLang: Full support added
  • Gemini: Raises NotImplementedError
  • Megatron: Assertions reject the parameter
  • All providers correctly exclude response_format from text completion endpoints

Confidence Score: 4/5

  • This PR is relatively safe to merge with minor API compatibility concerns already flagged
  • The implementation is well-structured with comprehensive error handling and proper provider-specific adaptations. Error handling for JSON parsing is thorough (catches both JSONDecodeError and KeyError). However, there's an unverified assumption about how LiteLLM handles raw Pydantic classes in response_format - the OpenAI API typically expects a specific dict format with json_schema, not raw classes. LiteLLM may handle this conversion automatically, but this needs verification through testing. The changes are isolated to the structured outputs feature and won't affect existing functionality.
  • nemo_skills/inference/structured_outputs.py - verify that LiteLLM correctly converts the raw Pydantic class to the expected API format

Important Files Changed

  • nemo_skills/inference/structured_outputs.py: Added Pydantic model for HLE judge format, but the raw class may not work with all providers' APIs
  • nemo_skills/inference/generate.py: Added structured_output config and parsing logic with proper error handling (KeyError and JSONDecodeError)
  • nemo_skills/inference/model/openai.py: Added response_format support for chat requests; correctly excluded from completion requests
  • nemo_skills/inference/model/vllm.py: Added response_format parameter to chat requests; correctly excluded from completion requests

Sequence Diagram

sequenceDiagram
    participant User
    participant GenerationTask
    participant STRUCTURED_OUTPUTS
    participant Model
    participant LiteLLM
    participant LLM API

    User->>GenerationTask: configure structured_output="HLE_JUDGE_AA"
    GenerationTask->>GenerationTask: process_single_datapoint()
    GenerationTask->>STRUCTURED_OUTPUTS: lookup HLE_JUDGE_AA
    STRUCTURED_OUTPUTS-->>GenerationTask: return HLEJudgeAAResponseFormat (Pydantic class)
    GenerationTask->>Model: generate_async(response_format=HLEJudgeAAResponseFormat)
    Model->>Model: _build_chat_request_params(response_format=...)
    Model->>LiteLLM: acompletion(response_format=HLEJudgeAAResponseFormat)
    LiteLLM->>LLM API: POST with structured output schema
    LLM API-->>LiteLLM: JSON response matching schema
    LiteLLM-->>Model: response
    Model-->>GenerationTask: generation result
    GenerationTask->>GenerationTask: postprocess_single_output()
    GenerationTask->>GenerationTask: parse JSON and extract "correct" field
    alt JSON parsing succeeds
        GenerationTask->>GenerationTask: format as "Judgement: {correct}"
    else JSON parsing fails
        GenerationTask->>GenerationTask: fallback to "Judgement: FAILED_TO_PARSE"
    end
    GenerationTask-->>User: final output


coderabbitai bot commented Jan 26, 2026

📝 Walkthrough

Walkthrough

Adds structured output support (HLE_JUDGE_AA) throughout the inference pipeline by introducing a response_format parameter, enabling structured JSON response parsing and post-processing of judge correctness assessments across multiple model implementations.

Changes

  • Structured Output Definition (nemo_skills/inference/structured_outputs.py): New module defining the HLEJudgeAAResponseFormat Pydantic model with fields for answer, reasoning, correctness (correct as "yes"/"no"), and confidence; introduces the STRUCTURED_OUTPUTS registry mapping the "HLE_JUDGE_AA" key.
  • Generation Orchestration (nemo_skills/inference/generate.py): Adds a structured_output: str | None config field to GenerationTaskConfig; injects response_format into generation params when the structured output is registered; post-processing parses the JSON response, extracts the "correct" field, and wraps it as "Judgement: {correct}" or "Judgement: FAILED_TO_PARSE" on error.
  • Base Model Abstraction (nemo_skills/inference/model/base.py): Adds an optional response_format parameter to generate_async; propagates the parameter into per-call kwargs forwarded to the underlying request builders and litellm calls.
  • Model-Specific Implementations, Restrictive (nemo_skills/inference/model/gemini.py, nemo_skills/inference/model/megatron.py): Both add a response_format parameter with assertions enforcing it must remain None, explicitly disallowing structured outputs due to API limitations.
  • Model-Specific Implementations, Chat-Only (nemo_skills/inference/model/openai.py, nemo_skills/inference/model/vllm.py): Add a response_format parameter; chat paths pass it through to the request payload; completion paths reject it with assertions. OpenAI preserves reasoning-model defaults when response_format is used.
  • Model-Specific Implementations, Pass-Through (nemo_skills/inference/model/sglang.py): Adds a response_format parameter and threads it to the parent class call; the parameter is included in the final request dictionary when provided.
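The pass-through vs. restrictive adapter behavior summarized above can be sketched roughly as follows; the function name and supports_structured flag are hypothetical simplifications of the per-model request builders:

```python
# Hedged sketch of how response_format threads through a model adapter.
# Names are illustrative, not the repo's exact signatures.
def build_chat_request_params(messages, response_format=None,
                              supports_structured=True, **kwargs):
    """Build the request dict, passing response_format only when supported."""
    if response_format is not None and not supports_structured:
        # Restrictive adapters (Gemini/Megatron in this PR) must fail loudly
        # rather than silently drop the parameter.
        raise NotImplementedError("This server does not support response_format.")
    params = {"messages": messages, **kwargs}
    if response_format is not None:
        params["response_format"] = response_format
    return params
```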

Sequence Diagram

sequenceDiagram
    participant Config as GenerationTaskConfig
    participant Generate as generate.py
    participant BaseModel as BaseModel
    participant ModelImpl as Model Implementation
    participant LiteLLM as LiteLLM/API

    Config->>Generate: structured_output="HLE_JUDGE_AA"
    Generate->>Generate: process_single_datapoint()
    Generate->>BaseModel: generate_async(..., response_format=HLEJudgeAAResponseFormat)
    BaseModel->>ModelImpl: _build_chat_request_params(..., response_format)
    alt Supports structured (OpenAI/SGLang/VLLM chat)
        ModelImpl->>ModelImpl: Include response_format in request dict
    else Rejects structured (Gemini/Megatron)
        ModelImpl->>ModelImpl: assert response_format is None
    end
    ModelImpl->>LiteLLM: Send request with response_format
    LiteLLM-->>ModelImpl: JSON response {correct: "yes", ...}
    ModelImpl-->>BaseModel: Response
    BaseModel-->>Generate: Raw generation
    Generate->>Generate: postprocess_single_output()
    Generate->>Generate: Parse JSON, extract "correct" field
    Generate-->>Config: Judgement: yes/no

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 8.33%, which is insufficient (required threshold: 80.00%). Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title 'support structured outputs in hle judge for optional AA compatibility' clearly and specifically summarizes the main change: adding structured output support to the HLE judge for AA compatibility.


@greptile-apps bot left a comment: 1 file reviewed, 1 comment

@coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/inference/model/megatron.py (2)

39-53: Replace assert statements with explicit exceptions for parameter validation.

Both _build_chat_request_params and _build_completion_request_params use assert to validate tools and response_format parameters. Since assert statements can be disabled with Python's -O flag, unsupported parameters would silently pass through, violating the guideline to "fail loudly" on invalid inputs. Other parameters in the same validation block correctly use explicit if + raise NotImplementedError, so match that pattern consistently.

Proposed fix
-        assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
-        assert response_format is None, "Megatron server does not support response_format parameter."
+        if kwargs.get("tools") is not None:
+            raise NotImplementedError("Megatron server does not support tools parameter.")
+        if response_format is not None:
+            raise NotImplementedError("Megatron server does not support response_format parameter.")

Apply this fix to both methods.


86-100: Replace assert statements with explicit exceptions for parameter validation.

Both tools and response_format parameters use assert statements, which can be disabled with Python's -O flag, causing silent failures. This violates the coding guideline to "Let the code fail with clear errors instead of silently misbehaving" and "Avoid silently ignoring unused user-passed parameters."

This same pattern appears in two methods (around lines 51-52 and 98-99). Replace all four assert statements with explicit if + raise NotImplementedError() to match the approach used for other unsupported parameters (stream, min_p, repetition_penalty, top_k).

Example fix for lines 98-99
-        assert kwargs.get("tools") is None, "Megatron server does not support tools parameter."
-        assert response_format is None, "Megatron server does not support response_format parameter."
+        if kwargs.get("tools") is not None:
+            raise NotImplementedError("Megatron server does not support tools parameter.")
+        if response_format is not None:
+            raise NotImplementedError("Megatron server does not support response_format parameter.")
🤖 Fix all issues with AI agents
In `@nemo_skills/inference/generate.py`:
- Around line 695-696: The code silently ignores unknown structured_output
values; add a validation in GenerationTaskConfig.__post_init__ (or call a helper
_post_init_validate_params from __post_init__) that checks if
self.structured_output is not None and not in STRUCTURED_OUTPUTS and raise a
ValueError listing the invalid value and valid keys (referencing
STRUCTURED_OUTPUTS and the attribute structured_output); this ensures
process_single_datapoint/generation_params population logic (where
generation_params["response_format"] is set) never silently drops an unsupported
structured_output.

In `@nemo_skills/inference/structured_outputs.py`:
- Around line 1-2: Add the standard NVIDIA copyright header at the very top of
the module (above the imports) in nemo_skills/inference/structured_outputs.py so
the file begins with the required multi-line copyright notice; do not alter the
existing imports (from typing import Literal, from pydantic import
BaseModel)—just prepend the header block exactly as the project's canonical
NVIDIA header.
- Around line 5-10: The HLEJudgeAAResponseFormat model wrongly includes a
non-response field strict: Literal[True]; remove the strict attribute from the
class so the model only defines extracted_final_answer, reasoning, correct, and
confidence, and then remove any now-unused imports (e.g., Literal[True] or
Literal if no longer needed); ensure any strict:true configuration is applied at
the OpenAI request/schema configuration level rather than as a field on
HLEJudgeAAResponseFormat.
🧹 Nitpick comments (2)
nemo_skills/inference/model/base.py (1)

239-239: Consider adding type annotation for consistency.

The response_format parameter lacks a type annotation while other parameters in this method have them. Consider adding a type hint for consistency.

Proposed fix
-        response_format = None,
+        response_format: dict | None = None,
nemo_skills/inference/generate.py (1)

636-642: Remove unused exception variable and consider logging the failure.

The exception variable e is assigned but never used (also flagged by static analysis). Additionally, silently setting FAILED_TO_PARSE without logging could make debugging difficult when generation fails to parse.

Proposed fix
         if self.cfg.structured_output == "HLE_JUDGE_AA":
             try:
                 output[self.cfg.generation_key] = "Judgement: {}".format(
                     json.loads(output[self.cfg.generation_key])["correct"]
                 )
-            except json.JSONDecodeError as e:
+            except json.JSONDecodeError:
+                LOG.warning(
+                    "Failed to parse structured output as JSON: %s",
+                    output[self.cfg.generation_key][:200] if output[self.cfg.generation_key] else "<empty>"
+                )
                 output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"

Comment on lines +695 to +696
if self.cfg.structured_output in STRUCTURED_OUTPUTS:
generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]

⚠️ Potential issue | 🟡 Minor

Consider validating structured_output against registry early.

If a user specifies a structured_output value that's not in STRUCTURED_OUTPUTS, the code silently ignores it without injecting response_format. This could lead to unexpected behavior. Per coding guidelines, the code should fail if a user specifies an unsupported argument.

Proposed fix in `__post_init__` or `process_single_datapoint`

Add validation in GenerationTaskConfig.__post_init__:

def _post_init_validate_params(self):
    # ... existing validations ...
    if self.structured_output is not None and self.structured_output not in STRUCTURED_OUTPUTS:
        raise ValueError(
            f"Unknown structured_output '{self.structured_output}'. "
            f"Valid options: {list(STRUCTURED_OUTPUTS.keys())}"
        )

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: 1 file reviewed, 1 comment

@anowaczynski-nvidia anowaczynski-nvidia removed their assignment Jan 27, 2026
Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: no files reviewed, no comments

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: 1 file reviewed, 1 comment

Comment on lines +636 to +642
if self.cfg.structured_output == "HLE_JUDGE_AA":
try:
output[self.cfg.generation_key] = "Judgement: {}".format(
json.loads(output[self.cfg.generation_key])["correct"]
)
except (json.JSONDecodeError, KeyError):
output[self.cfg.generation_key] = "Judgement: FAILED_TO_PARSE"

Hardcoded check for "HLE_JUDGE_AA" creates inconsistency with line 695 which uses in STRUCTURED_OUTPUTS. If new structured output formats are added to STRUCTURED_OUTPUTS, they'll set response_format but won't have corresponding postprocessing logic. Consider using self.cfg.structured_output in STRUCTURED_OUTPUTS here or creating a registry of postprocessing handlers.
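The "registry of postprocessing handlers" suggested here could look roughly like this; the function and registry names are hypothetical (the merged PR keeps the hardcoded check):

```python
# Hedged sketch of a postprocessing-handler registry keyed identically to
# STRUCTURED_OUTPUTS, so every format that sets a response_format also gets
# matching postprocessing. Names are hypothetical, not from the repo.
import json


def postprocess_hle_judge_aa(generation: str) -> str:
    """Extract the 'correct' field from the structured JSON judgement."""
    try:
        return "Judgement: {}".format(json.loads(generation)["correct"])
    except (json.JSONDecodeError, KeyError):
        return "Judgement: FAILED_TO_PARSE"


POSTPROCESSORS = {"HLE_JUDGE_AA": postprocess_hle_judge_aa}
```

The generation task would then dispatch on cfg.structured_output instead of comparing against a single literal.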

@ekmb ekmb requested a review from jiacheng-xu January 27, 2026 01:57
@jiacheng-xu (Collaborator) left a comment

I would request @gwarmstrong to review and leave some comments here since it's changing a core function / logic in generation flow.
I could be wrong, but some thoughts I have after reviewing the changes:

  1. The naming of "response_format" is vague: it could mean image vs. text, or text vs. JSON. It may need renaming and more documentation.
  2. The use of response_format might change the behavior of EndpointType, and vice versa. More test cases are needed.
  3. Test cases needed for at least one example. MathReasoning from https://platform.openai.com/docs/guides/structured-outputs?example=structured-data is good.
  4. It is a broad feature and not only for HLE_JUDGE_AA.

# all of the original data to the output file alongside the new generations
output[self.cfg.generation_key] = output.pop("generation")

if self.cfg.structured_output == "HLE_JUDGE_AA":

It is not a good idea to hard-code HLE_JUDGE_AA in generate.py.
Can we build a function to handle that like

if self.cfg.parse_reasoning:
?


@anowaczynski-nvidia can we move this logic into metrics? Why does it need to be in the generation?

@anowaczynski-nvidia (Collaborator, Author) commented Jan 28, 2026

Reasons I added if with postprocessing here:

  • to enable the AA-compatible HLE judge, ++structured_output=HLE_JUDGE_AA needs to be added in only one place (the judge generations pipeline command)
  • with the current version, the summarize_results command and the pipeline logic for aggregating HLE judge outputs into metrics don't require any modifications (the same command + code handles both default and AA-compatible judges)

I am aware this code is fundamental to the entire package, all generations pass through it.

Regarding moving this to metrics: I see the possibility to create hleaa_metrics.py in evaluation/metrics, inherit from MathMetrics, and override only _get_score_dict, such that postprocessing of judgement (parsing into json etc) is applied before is_correct_judgement. Do you approve this plan?
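A minimal sketch of that plan, assuming the names mentioned above (MathMetrics, _get_score_dict); the base class is reduced to a stand-in here, and only the JSON-unwrapping step is what the discussion actually proposes:

```python
# Hedged sketch of HLEAAMetrics inheriting from MathMetrics and overriding
# only _get_score_dict. MathMetrics below is a simplified stand-in for the
# real class in nemo_skills/evaluation/metrics.
import json


class MathMetrics:  # stand-in for the repo's MathMetrics
    def _get_score_dict(self, prediction: dict) -> dict:
        judgement = prediction.get("judgement", "")
        return {"symbolic_correct": judgement.startswith("Judgement: yes")}


class HLEAAMetrics(MathMetrics):
    """Parse the AA-style JSON judgement before the usual correctness check."""

    def _get_score_dict(self, prediction: dict) -> dict:
        try:
            parsed = json.loads(prediction["judgement"])
            judgement = "Judgement: {}".format(parsed["correct"])
        except (json.JSONDecodeError, KeyError):
            judgement = "Judgement: FAILED_TO_PARSE"
        return super()._get_score_dict({**prediction, "judgement": judgement})
```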


Yes, either that or we can just have this as option for main math metrics, so that any dataset, not just HLE can be evaluated in this setup. The one problem is I am not fully sure if metrics are currently customizable, but I guess if not, then we should enable customization in a similar way to how it's done for eval / generation parameters. Let me know if you need help with the design on that, happy to discuss in more details


@Kipok I tried the hard way first, but nothing I created was correct and convincing, so I pushed one commit with a class HLEAAMetrics(MathMetrics) solution as it was conceptually much simpler. The main downside is that I had to add metric_type to the eval command. It doesn't look right here, and it doesn't compose with the eval-on-multiple-benchmarks idea. Can you take a look? If we're doing the Metrics Config idea, I need a sync on how to approach it.


I think this is the right approach. When doing eval on multiple benchmarks you can't really customize anything except maybe inference parameters. E.g. doing prompt change or eval arguments will also break things, so I think adding metric_type is a good change. An alternative would be to add this as an argument to MathMetrics and then you can reuse existing metric_kwargs parameter to customize it. But adding metric_type is a good change anyway given that we support metric_kwargs already.

If the current implementation fully works for you, I think it LGTM as well and we can merge it. But do let me know if you have any concerns or think we should do things differently


It's probably a good idea to add a new test for this in test_generation.py, but only if models on build.nvidia.com support this response_format argument


added test_judge_generations_with_structured_output but it takes 10 minutes to complete even with max_samples=2, obviously this can't be merged, but where do we go from here?


thanks @anowaczynski-nvidia - pushed a change to limit max tokens (since we aren't checking generation correctness anyway), seems to finish very fast now!

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: no files reviewed, no comments

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: no files reviewed, no comments

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: no files reviewed, no comments

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
@greptile-apps bot left a comment: no files reviewed, no comments

@greptile-apps bot left a comment: 1 file reviewed, 1 comment

extracted_final_answer: str
reasoning: str
correct: Literal["yes", "no"]
confidence: int

confidence field has no validation constraints. Should be confidence: int = Field(ge=0, le=100) or similar to ensure valid confidence values.

Suggested change
confidence: int
confidence: int = Field(ge=0, le=100, description="Confidence score from 0 to 100")


@Kipok (Collaborator) left a comment

lgtm as long as the tests pass

Signed-off-by: Igor Gitman <igitman@nvidia.com>
@greptile-apps bot left a comment: 2 files reviewed, 2 comments

Comment on lines +197 to +213
def test_judge_generations_with_structured_output(tmp_path):
cmd = (
f"ns eval "
f" --server_type=openai "
f" --model=nvidia/nemotron-3-nano-30b-a3b "
f" --server_address=https://integrate.api.nvidia.com/v1 "
f" --benchmarks=hle "
f" --output_dir={tmp_path} "
f" --judge_model=nvidia/nemotron-3-nano-30b-a3b "
f" --judge_server_address=https://integrate.api.nvidia.com/v1 "
f" --judge_server_type=openai "
f" --metric_type=hle-aa "
f' --extra_judge_args="++structured_output=HLE_JUDGE_AA" '
f" ++max_samples=2 "
f" ++inference.tokens_to_generate=1024 " # to make test go fast
)
subprocess.run(cmd, shell=True, check=True)

Networked integration test

test_judge_generations_with_structured_output shells out to ns eval with real external endpoints (https://integrate.api.nvidia.com/v1) and a specific hosted model. This will fail in CI/offline test environments (no credentials / no network), so the PR will become unmergeable due to nondeterministic test failures. This should be rewritten as a unit/integration test that mocks the model call or uses the existing local test server fixtures, or it should be gated/marked as an opt-in test (e.g., skipped unless an env var is set).
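One way to gate the test as suggested, sketched with pytest.mark.skipif; the NVIDIA_API_KEY env var name is an assumption, not necessarily what the repo uses:

```python
# Hedged sketch of opt-in gating for the networked integration test.
# NVIDIA_API_KEY is an assumed credential variable; adjust to the repo's
# actual convention.
import os

import pytest

requires_nvidia_api = pytest.mark.skipif(
    not os.environ.get("NVIDIA_API_KEY"),
    reason="needs network access and NVIDIA API credentials",
)


@requires_nvidia_api
def test_judge_generations_with_structured_output(tmp_path):
    ...  # shell out to `ns eval` only when credentials are present
```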

Comment on lines +694 to +696
if self.cfg.structured_output is not None:
generation_params["response_format"] = STRUCTURED_OUTPUTS[self.cfg.structured_output]


Unhandled invalid key

When structured_output is set to any non-None value that is not present in STRUCTURED_OUTPUTS, process_single_datapoint will throw a KeyError at STRUCTURED_OUTPUTS[self.cfg.structured_output]. Since this is a user-provided config value (Hydra/CLI via ++structured_output=...), this becomes an unhelpful crash path. Consider validating structured_output in GenerationTaskConfig.__post_init__ (or using .get() with an explicit ValueError listing allowed keys) so users get a clear error message.

@Kipok Kipok merged commit 8950bb0 into main Feb 7, 2026
6 checks passed
@Kipok Kipok deleted the anowaczynski/hle-judge-aa-structured-outputs branch February 7, 2026 00:38
gwarmstrong pushed a commit that referenced this pull request Feb 7, 2026
…#1186)

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: George Armstrong <georgea@nvidia.com>
sgunasekar added a commit that referenced this pull request Mar 11, 2026
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
…#1186)

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
…#1186)

Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com>
Signed-off-by: Igor Gitman <igitman@nvidia.com>
Co-authored-by: Igor Gitman <igitman@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>