Add responses api type #889

Merged
Kipok merged 66 commits into main from smahdavi/responses-api on Oct 7, 2025
Conversation


@smahdavi4 smahdavi4 commented Oct 3, 2025

  • Adding a responses API type (currently for vLLM only)
  • Deprecating use_completions_api in favour of completion_type
  • Removing all the sync logic
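The three modes could be modeled as a small string-valued enum. This is only a sketch based on the EndpointType references in the reviews below (the real class lives in nemo_skills/inference/model/base.py, lines 37-40); the exact names and values here are assumptions:

```python
from enum import Enum


class EndpointType(str, Enum):
    # Values assumed from the review excerpts; the real enum lives in
    # nemo_skills/inference/model/base.py (lines 37-40).
    text = "text"            # legacy /completions-style endpoint
    chat = "chat"            # /chat/completions endpoint
    responses = "responses"  # Responses API endpoint


# A str-valued enum parses directly from CLI overrides such as
# ++inference.endpoint_type=responses.
print(EndpointType("responses") is EndpointType.responses)
```

Making the enum a `str` subclass keeps Hydra/CLI overrides trivial, since the raw string round-trips through the enum constructor.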

Running example:

ns eval \
  --benchmarks aime25:2 \
  --cluster slurm \
  --model /hf_models/gpt-oss-20b \
  --server_gpus 8 \
  --server_type vllm \
  --output_dir /workspace/aime-responses \
  ++tool_modules="[nemo_skills.mcp.servers.python_tool::PythonTool]" \
  ++inference.tokens_to_generate=100000 \
  ++inference.endpoint_type=responses \
  --with_sandbox

Limitations with vLLM + GPT-OSS:

  • vLLM fails to parse most of the tool calls, usually putting <channel> or other template-related keywords into the tool name.
  • The first tool call usually succeeds, but on the second call vLLM often fails even worse: it returns the tool call as text reasoning content.

Summary by CodeRabbit

  • New Features

    • Added explicit endpoint types (text, chat, responses), Responses API support, expanded async generation, and improved tool-calling with serialized outputs.
  • Deprecations / Breaking Changes

    • Legacy completions flag replaced by inference.endpoint_type; legacy flag deprecated and mapped to text with migration guidance.
  • Documentation

    • Examples updated to use async invocation and the new inference.endpoint_type syntax.
  • Chores

    • Pinned a dependency version and disabled noisy logging worker for stability.

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/inference/model/tool_call.py (1)

71-75: Handle non-string tool arguments and use LOG.exception.

This is a duplicate of a previous review comment that remains unaddressed. The code has two issues:

  1. json.loads(tool_args) raises TypeError (not JSONDecodeError) when tool_args is already a dict/list. Some providers return pre-parsed arguments, so this exception escapes the handler and breaks subsequent tool calls.
  2. Line 73 uses LOG.error instead of LOG.exception, making debugging harder.

Apply this diff to fix both issues:

-        try:
-            tool_args = json.loads(tool_args)
-        except json.decoder.JSONDecodeError as e:
-            LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
-            LOG.exception(e)
+        try:
+            if isinstance(tool_args, str):
+                tool_args = json.loads(tool_args)
+        except (json.JSONDecodeError, TypeError) as e:
+            LOG.exception(f"Tool arguments are not in JSON format: {tool_args}")
             return {"error": "Tool argument parsing failed."}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51f7d80 and 77fc5c9.

📒 Files selected for processing (1)
  • nemo_skills/inference/model/tool_call.py (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/model/tool_call.py (2)
nemo_skills/mcp/adapters.py (3)
  • format_tool_list_by_completion_type (51-75)
  • format_tool_response_by_completion_type (95-110)
  • get_tool_details_by_completion_type (113-123)
nemo_skills/inference/model/base.py (1)
  • CompletionType (37-40)
🪛 Ruff (0.13.3)
nemo_skills/inference/model/tool_call.py

73-73: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


95-95: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

🔇 Additional comments (2)
nemo_skills/inference/model/tool_call.py (2)

23-31: LGTM! Good use of adapter functions.

The imports align well with the PR's objective to add completion_type support. The new adapter functions properly encapsulate the completion_type-specific formatting logic.


98-158: LGTM! Clean completion_type integration.

The method correctly threads completion_type through the entire tool calling flow. The use of adapter functions for tool formatting and the serialized_output approach align well with the PR's objectives.

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/inference/generate.py (1)

227-229: LGTM with optional refinement.

The hard deprecation approach is correct, providing clear migration guidance to users. The error message appropriately directs users to the new ++inference.completion_type=text parameter.

Optional: The static analysis tool suggests moving the error message to a constant to follow best practices (TRY003). While not critical, this could improve consistency if similar patterns appear elsewhere:

+_DEPRECATED_USE_COMPLETIONS_API_MSG = "use_completions_api is deprecated, please use ++inference.completion_type=text instead."
+
 def _post_init_deprecated_params(self):
     if self.use_completions_api:
-        raise ValueError("use_completions_api is deprecated, please use ++inference.completion_type=text instead.")
+        raise ValueError(_DEPRECATED_USE_COMPLETIONS_API_MSG)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77fc5c9 and 3c15220.

📒 Files selected for processing (1)
  • nemo_skills/inference/generate.py (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/model/base.py (1)
  • CompletionType (37-40)
🪛 Ruff (0.13.3)
nemo_skills/inference/generate.py

229-229: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
nemo_skills/inference/generate.py (6)

43-43: LGTM!

The import of CompletionType is correctly placed and necessary for the new completion type logic.


60-66: LGTM!

The completion_type field is well-defined with clear documentation covering all three modes (chat, text, responses). The default of CompletionType.chat maintains backward compatibility.


87-89: LGTM!

The deprecation comment clearly directs users to the new completion_type parameter. The field is retained to provide a meaningful error message when used, which aligns with the hard deprecation strategy discussed in previous reviews.


190-190: LGTM!

The call to _post_init_deprecated_params() ensures that deprecated parameter usage is caught during initialization, providing early feedback to users.


211-211: LGTM!

Correctly sets completion_type to CompletionType.text for Megatron server, replacing the old use_completions_api behavior. This aligns with Megatron's lack of chat completion support.


277-277: LGTM!

All logic correctly migrates from the deprecated use_completions_api to the new completion_type parameter:

  • Line 277: Chat template kwargs handling excludes text completions
  • Line 289: Tokenizer setup includes text completions (plus soft_fail and count_prompt_tokens cases)
  • Line 301: Tokens_to_generate validation applies only to text completions
  • Line 358: Tokenizer passed to prompt setup only for text completions

The logic treats "responses" mode similarly to "chat" mode in this file, with responses-specific handling presumably implemented in the model classes.

Also applies to: 289-289, 301-302, 358-358

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
@gwarmstrong gwarmstrong mentioned this pull request Oct 7, 2025

@Kipok Kipok left a comment


Could you please test if the responses API is supported in https://build.nvidia.com/openai/gpt-oss-20b? If so, then we can add a simple generation test to ensure this functionality doesn't break: https://github.com/NVIDIA/NeMo-Skills/blob/main/tests/test_generation.py

ipython
iso639-lang
litellm[caching] < 1.75.0 # some bug with asyncio.run hanging forever
litellm[caching] == 1.77.5 # Requires patching the logging worker
Collaborator


Do we have to pin to an exact version? What happens if we use the latest?

Collaborator Author


Because the patch explicitly modifies one of litellm's classes; if they change that class in a future release, we would need to update our modification as well.

Collaborator


OK, can you please update the comment to add a little more detail, so that when we look at this line in the future it's easy to remember why we pinned this and what we should test to unpin?

And the behavior here is that this keeps slurm jobs from exiting and we waste GPUs? But it only happens in some cases, not always?

Collaborator Author


Yes, I left more comments about that.

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
@smahdavi4
Collaborator Author

Could you please test if responses api is supported in https://build.nvidia.com/openai/gpt-oss-20b? If so, then we can add it as a simple generation test in here to ensure this functionality doesn't break https://github.com/NVIDIA/NeMo-Skills/blob/main/tests/test_generation.py

I tested that endpoint, and it seems to be worse than vLLM: its function call responses are empty!


Kipok commented Oct 7, 2025

OK, but should we still have a responses test with that endpoint, just without function calls? Just to ensure there are no obvious mistakes in the code, like some parameter not being passed or some other issue like that?

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
nemo_skills/mcp/adapters.py (1)

119-119: Replace assert with explicit validation.

The assert statement for validating tool_call["type"] can be disabled with Python's -O optimization flag, which could allow invalid tool calls to proceed silently in production environments.

Apply this diff to use explicit validation:

     elif endpoint_type == EndpointType.responses:
-        assert tool_call["type"] == "function_call", "Tool call must be a function call"
+        if tool_call.get("type") != "function_call":
+            raise ValueError(f"Tool call must be a function call, got type: {tool_call.get('type')}")
         tool_name = tool_call["name"]
         tool_args = tool_call["arguments"]

Based on past review comments and security best practices.

nemo_skills/inference/model/tool_call.py (1)

93-96: Add strict=True to zip() for safety.

The zip() call should use strict=True to ensure tool_calls and tool_results have matching lengths. This prevents silent bugs if the lists don't match (e.g., if a tool call fails to return a result).

Apply this diff:

         return [
             format_tool_response_by_endpoint_type(tool_call, tool_result, endpoint_type)
-            for tool_call, tool_result in zip(tool_calls, tool_results)
+            for tool_call, tool_result in zip(tool_calls, tool_results, strict=True)
         ]

Based on past review comments and static analysis hints.

🧹 Nitpick comments (4)
docs/basics/prompt-format.md (1)

111-112: Reflect the new completion_type flag in this doc snippet

The CLI now steers completions vs. responses via the completion_type setting (per this PR’s goals). This example still references only ++inference.endpoint_type=text, which will leave readers without guidance on the new flag. Update the prose (and sample command, if applicable) to mention the completion_type setting alongside the endpoint type so the instructions stay accurate.

nemo_skills/inference/eval/bfcl.py (1)

120-135: Improve error logging in exception handler.

The exception handler logs the error but can be simplified by using logging.exception() which automatically includes the traceback and exception details.

Apply this diff to improve the logging:

         try:
             fmted_prompt = self.message_formatter(messages, tools=tools)
         except Exception as e:
-            # Sometimes the parsed tool-call is a string, which is not JSON serializable
-            # Putting a debugging here in case it happens in the future and we need to address it.
             LOG.info(f"Messages: {messages}, Tools: {tools}")
-            LOG.error(f"Error formatting prompt: {e}")
-            raise e
+            LOG.exception("Error formatting prompt")
+            raise

Based on static analysis hints.

nemo_skills/inference/model/base.py (1)

445-445: Consider removing unused kwargs parameter.

The kwargs parameter is not used in this method. If it's included for API consistency with other parse methods, that's acceptable, but consider adding a comment explaining this or removing it if not needed.

Based on static analysis hints.

nemo_skills/inference/model/tool_call.py (1)

71-75: Improve JSON parsing error handling.

The current error handling only catches JSONDecodeError but doesn't handle cases where tool_args is already a parsed dict/list (not a string), which would raise TypeError. Additionally, the logging can be simplified.

Apply this diff to handle both cases:

         try:
-            tool_args = json.loads(tool_args)
+            if isinstance(tool_args, str):
+                tool_args = json.loads(tool_args)
+            elif not isinstance(tool_args, dict):
+                raise ValueError(f"Tool arguments must be a string or dict, got {type(tool_args)}")
         except json.decoder.JSONDecodeError as e:
-            LOG.error(f"Tool arguments are not in JSON format: {tool_args}")
-            LOG.exception(e)
+            LOG.exception("Tool arguments are not in valid JSON format: %s", tool_args)
             return {"error": "Tool argument parsing failed."}

Based on past review comments and static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5046adb and 3a5b96b.

📒 Files selected for processing (15)
  • docs/basics/prompt-format.md (1 hunks)
  • docs/pipelines/generation.md (7 hunks)
  • docs/releases/openmathinstruct2/dataset.md (6 hunks)
  • docs/releases/openmathreasoning/evaluation.md (2 hunks)
  • docs/tutorials/posts/gpt-oss-python.md (2 hunks)
  • nemo_skills/dataset/ruler/prepare.py (1 hunks)
  • nemo_skills/inference/eval/bfcl.py (2 hunks)
  • nemo_skills/inference/generate.py (10 hunks)
  • nemo_skills/inference/model/base.py (8 hunks)
  • nemo_skills/inference/model/code_execution.py (8 hunks)
  • nemo_skills/inference/model/openai.py (1 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (2 hunks)
  • nemo_skills/inference/model/tool_call.py (6 hunks)
  • nemo_skills/mcp/adapters.py (4 hunks)
  • tests/slurm-tests/gpt_oss_python_aime25/run_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • docs/releases/openmathreasoning/evaluation.md
  • nemo_skills/dataset/ruler/prepare.py
  • tests/slurm-tests/gpt_oss_python_aime25/run_test.py
  • docs/releases/openmathinstruct2/dataset.md
  • docs/pipelines/generation.md
  • docs/tutorials/posts/gpt-oss-python.md
🧰 Additional context used
🧬 Code graph analysis (8)
nemo_skills/inference/model/code_execution.py (1)
nemo_skills/inference/model/base.py (2)
  • BaseModel (43-509)
  • EndpointType (37-40)
nemo_skills/inference/model/parallel_thinking.py (1)
nemo_skills/inference/model/base.py (2)
  • BaseModel (43-509)
  • EndpointType (37-40)
nemo_skills/mcp/adapters.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (37-40)
nemo_skills/inference/model/openai.py (2)
nemo_skills/inference/model/base.py (2)
  • _build_responses_request_params (205-206)
  • _build_chat_request_params (198-199)
nemo_skills/inference/model/vllm.py (2)
  • _build_responses_request_params (142-148)
  • _build_chat_request_params (102-140)
nemo_skills/inference/model/base.py (4)
nemo_skills/inference/patch_litellm_logging.py (1)
  • patch_litellm_logging_worker (59-71)
nemo_skills/inference/model/openai.py (1)
  • _build_responses_request_params (155-160)
nemo_skills/inference/model/vllm.py (1)
  • _build_responses_request_params (142-148)
nemo_skills/inference/model/tool_call.py (1)
  • generate_async (98-160)
nemo_skills/inference/model/tool_call.py (2)
nemo_skills/mcp/adapters.py (3)
  • format_tool_list_by_endpoint_type (51-76)
  • format_tool_response_by_endpoint_type (96-111)
  • get_tool_details_by_endpoint_type (114-124)
nemo_skills/inference/model/base.py (1)
  • EndpointType (37-40)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (37-40)
nemo_skills/inference/eval/bfcl.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (37-40)
🪛 Ruff (0.13.3)
nemo_skills/mcp/adapters.py

76-76: Avoid specifying long messages outside the exception class

(TRY003)


111-111: Avoid specifying long messages outside the exception class

(TRY003)


123-123: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/model/base.py

291-291: Avoid specifying long messages outside the exception class

(TRY003)


445-445: Unused method argument: kwargs

(ARG002)


502-502: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/model/tool_call.py

73-73: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


95-95: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

nemo_skills/inference/generate.py

229-229: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/eval/bfcl.py

126-126: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


127-127: Use raise without specifying exception name

Remove exception name

(TRY201)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (7)
nemo_skills/inference/model/base.py (1)

491-503: Robust serialization with clear error handling.

The _serialize_output method handles both responses and chat completion formats cleanly. The error message for unsupported types is clear and actionable.
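A minimal sketch of what such a serializer might look like, normalizing both plain-dict responses payloads and chat completion objects (names and type checks are assumptions; the actual _serialize_output in nemo_skills/inference/model/base.py may differ):

```python
def serialize_output(output):
    """Normalize a completion object to a plain dict for downstream use.

    Hypothetical sketch of the pattern described above; the real
    _serialize_output in nemo_skills/inference/model/base.py may differ.
    """
    if isinstance(output, dict):
        # Responses-style payloads may already arrive as plain dicts.
        return output
    if hasattr(output, "model_dump"):
        # openai-python pydantic models expose model_dump().
        return output.model_dump()
    raise ValueError(f"Cannot serialize output of type {type(output).__name__}")
```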

nemo_skills/inference/model/openai.py (1)

155-160: Clean adapter for responses API format.

The method correctly delegates to the existing chat request builder and remaps the keys to match the responses API surface (messages → input, max_completion_tokens → max_output_tokens). This avoids code duplication while providing the necessary format transformation.
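The delegation-plus-remap idea can be sketched as follows (a hypothetical standalone function; the real _build_responses_request_params in nemo_skills/inference/model/openai.py is a method and may handle more keys):

```python
def build_responses_request_params(chat_params: dict) -> dict:
    """Remap chat-completions request keys to the Responses API surface.

    Sketch of the delegation described above; key names follow the review
    comment, everything else is an assumption.
    """
    params = dict(chat_params)  # avoid mutating the caller's dict
    # The Responses API takes `input` where chat completions take `messages`...
    params["input"] = params.pop("messages")
    # ...and `max_output_tokens` instead of `max_completion_tokens`.
    params["max_output_tokens"] = params.pop("max_completion_tokens")
    return params


req = build_responses_request_params(
    {"messages": [{"role": "user", "content": "hi"}], "max_completion_tokens": 64}
)
print(sorted(req))  # → ['input', 'max_output_tokens']
```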

nemo_skills/inference/model/parallel_thinking.py (1)

55-55: Clean migration to endpoint_type enum.

The replacement of use_completions_api: bool = False with endpoint_type: EndpointType = EndpointType.chat is a clean refactor that maintains backward compatibility in behavior while providing better clarity and extensibility.

nemo_skills/inference/model/code_execution.py (1)

68-68: Proper endpoint_type propagation through code execution flow.

The endpoint_type parameter is correctly threaded through the entire code execution call chain (_generate_single, generate_async, _stream_single) and included in request dictionaries, ensuring consistent behavior with the new endpoint-type based routing.

Also applies to: 113-113, 263-263, 321-321, 336-336

nemo_skills/inference/generate.py (2)

60-66: Well-documented endpoint_type configuration.

The new endpoint_type field is clearly documented with examples for each endpoint type ("chat", "text", "responses") and includes guidance on tokenizer override behavior. This provides good clarity for users migrating from the deprecated use_completions_api flag.


227-229: Explicit deprecation prevents silent errors.

The hard deprecation of use_completions_api with a clear error message and migration guidance is the right approach. This addresses the concerns raised in past review comments about silent failures being worse than breaking changes, ensuring users are immediately aware they need to update their configuration.

nemo_skills/inference/model/tool_call.py (1)

112-113: Solid tool invocation flow with endpoint-type awareness.

The tool invocation logic correctly:

  • Formats tools based on endpoint type
  • Propagates endpoint_type through all generation calls
  • Serializes outputs appropriately
  • Handles tool call execution with proper error handling

The integration of endpoint_type throughout the flow maintains consistency with the broader refactor.

Also applies to: 138-147


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/inference/patch_litellm_logging.py (1)

28-59: Consider adding type hints for completeness.

While the current implementation is functional, adding complete type hints would improve code clarity and maintainability. Consider adding return type hints (e.g., -> None) to all methods.

Example:

-    def start(self):
+    def start(self) -> None:
         pass

-    async def stop(self):
+    async def stop(self) -> None:
         pass

Note: Some methods already have -> None, so this would ensure consistency across all methods.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3a5b96b and 3383408.

📒 Files selected for processing (2)
  • nemo_skills/inference/patch_litellm_logging.py (1 hunks)
  • requirements/main.txt (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements/main.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (3)
nemo_skills/inference/patch_litellm_logging.py (3)

15-23: Past review comment addressed.

The docstring now provides the requested details about why this patch is needed, including the specific symptoms (async_loop gets stuck), impact (slurm jobs don't exit, GPU waste), and environment differences (Docker vs. local).


43-49: Coroutine closing prevents resource leaks.

The enqueue and ensure_initialized_and_enqueue methods correctly close coroutines to prevent resource leaks and suppress "coroutine was never awaited" warnings. This is the appropriate approach for a no-op implementation meant to disable functionality.


61-73: LGTM!

The patch function correctly replaces both the LoggingWorker class and the GLOBAL_LOGGING_WORKER instance, and gracefully handles version compatibility by catching ModuleNotFoundError. The implementation is safe to call during module import.
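The pattern the review describes — a no-op worker that closes enqueued coroutines, plus a patch function that tolerates missing modules across litellm versions — can be sketched like this (class names, method names, and the module path are assumed from the excerpts; the real code is in nemo_skills/inference/patch_litellm_logging.py):

```python
import inspect


class NoOpLoggingWorker:
    """No-op stand-in for litellm's background logging worker (sketch)."""

    def start(self) -> None:
        pass

    async def stop(self) -> None:
        pass

    def enqueue(self, coro) -> None:
        # Close the coroutine instead of scheduling it, so Python does not
        # emit "coroutine was never awaited" warnings.
        coro.close()

    def ensure_initialized_and_enqueue(self, coro) -> None:
        self.enqueue(coro)


def patch_litellm_logging_worker() -> None:
    """Swap in the no-op worker class and global instance (assumed path)."""
    try:
        from litellm.litellm_core_utils import logging_worker  # assumed path
    except ModuleNotFoundError:
        return  # other litellm versions: nothing to patch
    logging_worker.LoggingWorker = NoOpLoggingWorker
    logging_worker.GLOBAL_LOGGING_WORKER = NoOpLoggingWorker()
```

Closing the coroutine (rather than simply dropping it) is what keeps the no-op safe: an un-awaited, un-closed coroutine would otherwise trigger runtime warnings and leak frames.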


@Kipok Kipok left a comment


lgtm, as long as tests pass

…sponses-api

Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
@Kipok Kipok merged commit 475b036 into main Oct 7, 2025
5 of 6 checks passed
@Kipok Kipok deleted the smahdavi/responses-api branch October 7, 2025 21:59

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
requirements/main.txt (1)

30-30: Document the pin rationale.

Please capture the concrete failure mode and what to retest before unpinning so future upgrades aren’t blind. Right now the comment just says the patch exists, but not why it matters.

-litellm[caching] == 1.77.5 # Requires patching the logging worker (See nemo_skills/inference/patch_litellm_logging.py)
+litellm[caching] == 1.77.5 # Pin because litellm's logging worker can block Slurm job teardown; we patch it (see nemo_skills/inference/patch_litellm_logging.py) and must re-run the multi-GPU teardown test before unpinning.
nemo_skills/inference/eval/bfcl.py (1)

133-140: Consider using logging.exception and bare raise.

The try/except block is useful for debugging, but consider these minor improvements:

  • Line 139: Use logging.exception(...) instead of logging.error(...) to automatically include the exception traceback.
  • Line 140: Use bare raise instead of raise e for cleaner exception re-raising.

Apply this diff:

         try:
             fmted_prompt = self.message_formatter(messages, tools=tools)
         except Exception as e:
             # Sometimes the parsed tool-call is a string, which is not JSON serializable
             # Putting a debugging here in case it happens in the future and we need to address it.
             LOG.info(f"Messages: {messages}, Tools: {tools}")
-            LOG.error(f"Error formatting prompt: {e}")
-            raise e
+            LOG.exception("Error formatting prompt")
+            raise
nemo_skills/inference/generate.py (1)

227-229: Hard deprecation approach aligns with past discussion.

The validation correctly raises an error when use_completions_api=True, forcing users to migrate to endpoint_type. This aligns with the team's decision to hard-deprecate rather than maintain backward compatibility. The error message clearly directs users to the new parameter.

Optional: Address static analysis hint.

Ruff suggests avoiding long error messages inline (TRY003). If preferred, you could extract the message to a constant:

+    _DEPRECATED_USE_COMPLETIONS_API_MSG = (
+        "use_completions_api is deprecated, please use ++inference.endpoint_type=text instead."
+    )
+
     def _post_init_deprecated_params(self):
         if self.use_completions_api:
-            raise ValueError("use_completions_api is deprecated, please use ++inference.endpoint_type=text instead.")
+            raise ValueError(self._DEPRECATED_USE_COMPLETIONS_API_MSG)

However, this is a minor style preference and not required.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3383408 and fc8aca5.

📒 Files selected for processing (3)
  • nemo_skills/inference/eval/bfcl.py (2 hunks)
  • nemo_skills/inference/generate.py (10 hunks)
  • requirements/main.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (37-40)
nemo_skills/inference/eval/bfcl.py (1)
nemo_skills/inference/model/base.py (1)
  • EndpointType (37-40)
🪛 Ruff (0.13.3)
nemo_skills/inference/generate.py

229-229: Avoid specifying long messages outside the exception class

(TRY003)

nemo_skills/inference/eval/bfcl.py

139-139: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


140-140: Use raise without specifying exception name

Remove exception name

(TRY201)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (5)
nemo_skills/inference/eval/bfcl.py (2)

41-41: LGTM!

The import is necessary for the endpoint_type setting added to construct_input_dict.


141-148: LGTM!

Explicitly setting endpoint_type = EndpointType.text is correct for BFCL's use case and aligns with the broader shift toward endpoint-type aware configurations across the codebase. Based on past review comments, this prevents unwanted assumptions about completion type based on input format.

nemo_skills/inference/generate.py (3)

60-66: LGTM! Clean migration to enum-based endpoint configuration.

The endpoint_type field with the EndpointType enum is a significant improvement over the boolean use_completions_api. The comprehensive documentation clearly explains the behavior for each endpoint type (chat, text, responses), and the enum provides better type safety and extensibility.


211-211: LGTM! Correct migration for Megatron server.

Setting endpoint_type = EndpointType.text for Megatron server correctly replaces the previous use_completions_api logic.


277-277: LGTM! Consistent logic updates across all usage points.

All conditional checks have been correctly updated to use endpoint_type:

  • Line 277: endpoint_type != EndpointType.text for chat_template_kwargs (includes chat and responses)
  • Line 289: endpoint_type == EndpointType.text for tokenizer setup
  • Line 301: endpoint_type == EndpointType.text for tokens_to_generate validation
  • Line 361: endpoint_type == EndpointType.text for prompt tokenizer passing

The logic consistently treats the responses API the same as the chat API in this file, with model-specific implementation details handled in the model classes.

Also applies to: 289-289, 301-302, 361-361

SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
wasiahmad pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
@coderabbitai coderabbitai bot mentioned this pull request Dec 25, 2025