
Parallel thinking fixes #887

Merged
shtoshni merged 8 commits into main from shtoshni/parallel_thinking_fixes
Oct 3, 2025

Conversation


shtoshni (Contributor) commented Oct 3, 2025

A few arguments were not being passed during GenSelect generation, some of which are necessary for tool usage during GenSelect. Testing with gpt-oss distilled models revealed these shortcomings; this PR fixes them.

Summary by CodeRabbit

  • New Features

    • Added a config option to set the assistant response key and enable tokenizer-awareness in parallel-thinking flows.
  • Bug Fixes

    • Empty-result branches now include a "generation" field to meet downstream expectations.
    • Generation-related options are consistently forwarded through all execution paths to avoid dropped settings.
  • Refactor

    • Streamlined generation-parameter handling by stripping conflicting keys before model calls and preserving generation fields across branches.

shtoshni requested a review from Kipok on October 3, 2025 19:04

coderabbitai bot commented Oct 3, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds tokenizer propagation into the parallel-thinking model path, a new config field start_assistant_response_key, forwards prompt/template kwargs into prompt filling, strips conflicting generation keys before model calls, and ensures empty-result branches include "generation": "" while propagating extra kwargs through generate_async and GenSelect.
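The "strips conflicting generation keys before model calls" step described in the walkthrough can be sketched as follows. This is an illustrative snippet, not the actual NeMo-Skills code; the helper name `prepare_model_kwargs` is hypothetical:

```python
# Hypothetical helper (not the real NeMo-Skills API): drop keys that would
# conflict with explicitly passed generation parameters before a model call.
def prepare_model_kwargs(kwargs):
    cleaned = dict(kwargs)  # copy so the caller's dict is untouched
    for duplicate_key in ("temperature", "tokens_to_generate", "prompt"):
        cleaned.pop(duplicate_key, None)  # .pop(..., None) never raises
    return cleaned
```

Because `temperature`, `tokens_to_generate`, and `prompt` are passed explicitly to the model, removing them from the forwarded kwargs avoids `TypeError: got multiple values for keyword argument` at call time.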

Changes

| Cohort / File(s) | Summary of changes |
| --- | --- |
| **Parallel Thinking Inference**<br>`nemo_skills/inference/model/parallel_thinking.py` | Add the `start_assistant_response_key` config field; propagate the tokenizer into prompt creation; forward prompt/template kwargs into prompt filling; strip conflicting generation keys before model calls; ensure empty-result branches include `"generation": ""`. |
| **Model factory**<br>`nemo_skills/inference/model/__init__.py` | Add an optional tokenizer parameter to `get_parallel_thinking_model(...)` and forward it to `ParallelThinkingTask(...)` initialization. |
| **Generation entrypoint**<br>`nemo_skills/inference/generate.py` | `GenerationTask.setup_llm` now calls `get_parallel_thinking_model(..., tokenizer=self.tokenizer)` to pass the tokenizer into the parallel-thinking wrapper. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant C as Caller
  participant G as GenerationTask
  participant PT as ParallelThinkingTask
  participant PB as PromptBuilder
  participant M as Model (generate_async)
  participant DS as Downstream (GenSelect/GenSynthesis)

  rect rgba(230,240,255,0.4)
    note over C,G: High-level generate flow with tokenizer propagation
    C->>G: request generate(input, **kwargs)
    G->>PT: get_parallel_thinking_model(tokenizer=self.tokenizer)
    G->>PT: generate_async(input, **kwargs)
    PT->>PB: get_prompt(..., tokenizer=PT.tokenizer)
    PB-->>PT: prompt
    PT->>PT: strip {temperature,tokens_to_generate,prompt} from kwargs
    PT->>M: generate_async(prompt, **remaining_kwargs)
    alt results produced
      M-->>PT: outputs
      PT->>DS: post-process / synthesis
      DS-->>PT: result (includes generation)
      PT-->>G: result
      G-->>C: result
    else empty result
      note right of PT: ensure "generation": ""
      PT-->>G: { ..., generation: "" }
      G-->>C: { ..., generation: "" }
    end
  end
```
```mermaid
sequenceDiagram
  autonumber
  participant Caller as Caller
  participant PT as ParallelThinkingTask
  participant GS as _run_genselect
  participant PB as PromptBuilder
  participant M as Model

  rect rgba(240,255,240,0.4)
    note over Caller,PT: GenSelect flow with kwargs propagation
    Caller->>PT: genselect(input, **kwargs)
    PT->>GS: _run_genselect(input, **kwargs)
    GS->>PB: get_prompt(..., tokenizer=PT.tokenizer)
    PB-->>GS: prompt
    GS->>GS: remove conflicting generation keys from kwargs
    GS->>M: generate_async(prompt, **remaining_kwargs)
    alt selection found
      M-->>GS: candidates
      GS-->>PT: selection (with generation)
      PT-->>Caller: selection
    else no candidates
      note right of GS: return with "generation": ""
      GS-->>PT: { ..., generation: "" }
      PT-->>Caller: { ..., generation: "" }
    end
  end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my whiskers at a new key's light,
Tokens hop through tunnels, tidy and bright.
Prompts now whisper where responses start,
Kwargs pass along their tiny cart.
If answers hide, I leave a gentle trace—"generation": ""—a cozy place. 🐇✨

## Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| :---: | :--- | :--- | :--- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| :---: | :--- | :--- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly indicates that this PR fixes the parallel thinking functionality in the inference model, which aligns with the changes to argument passing and configuration in that module. It is clear, concise, and free of extraneous detail. |
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eae82c9 and 5ab33a3.

📒 Files selected for processing (1)
  • nemo_skills/inference/model/parallel_thinking.py (5 hunks)


@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
nemo_skills/inference/model/parallel_thinking.py (2)

82-87: Verify tokenizer is set when use_completions_api or start_assistant_response_key is enabled.

The tokenizer parameter is now passed directly to __init__ and stored as self.tokenizer. However, if self.cfg.use_completions_api is True or self.cfg.start_assistant_response_key is set, but tokenizer is None, downstream calls to get_prompt() and prompt.fill() may fail with a ValueError (see nemo_skills/prompt/utils.py lines 252-256).

While the error message from prompt.fill() is clear, consider adding an explicit validation guard in __init__ to catch this misconfiguration early:

```python
if (self.cfg.use_completions_api or self.cfg.start_assistant_response_key) and tokenizer is None:
    raise ValueError(
        "`tokenizer` must be provided when `use_completions_api` is True "
        "or `start_assistant_response_key` is set."
    )
```

This addresses the concern raised in the previous review and improves the user experience by providing immediate feedback on misconfiguration.


211-220: LGTM with optional style improvement.

The call to prompt.fill() now correctly forwards start_assistant_response_key and chat_template_kwargs, addressing the PR objectives. The duplicate key removal (lines 217-219) prevents conflicts when kwargs overlap with explicit parameters in generate_async.

Optional: Consider using .pop() for a more idiomatic approach (as suggested by static analysis):

```diff
-        for duplicate_key in ["temperature", "tokens_to_generate", "prompt"]:
-            if duplicate_key in kwargs:
-                del kwargs[duplicate_key]
+        for duplicate_key in ["temperature", "tokens_to_generate", "prompt"]:
+            kwargs.pop(duplicate_key, None)
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 865c127 and eae82c9.

📒 Files selected for processing (3)
  • nemo_skills/inference/generate.py (1 hunks)
  • nemo_skills/inference/model/__init__.py (2 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/inference/model/__init__.py (2)
nemo_skills/inference/model/parallel_thinking.py (1)
  • ParallelThinkingTask (76-395)
nemo_skills/inference/chat_interface/core.py (1)
  • cfg (181-182)
nemo_skills/inference/model/parallel_thinking.py (2)
nemo_skills/inference/model/base.py (1)
  • BaseModel (32-513)
nemo_skills/prompt/utils.py (2)
  • get_prompt (370-403)
  • fill (241-303)
🪛 Ruff (0.13.2)
nemo_skills/inference/model/parallel_thinking.py

219-219: Use pop instead of key in dict followed by del dict[key]

Replace if statement with .pop(..., None)

(RUF051)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pre-commit
  • GitHub Check: unit-tests
🔇 Additional comments (8)
nemo_skills/inference/model/parallel_thinking.py (6)

58-58: LGTM! New config field for assistant response prefix.

The addition of start_assistant_response_key enables prepending a value to the assistant response during prompt generation, which is necessary for tool usage scenarios as stated in the PR objectives.


91-97: LGTM! Consistent tokenizer propagation.

The calls to get_prompt() now correctly use tokenizer=self.tokenizer, aligning with the updated __init__ signature and ensuring tokenizer context flows through prompt generation.


221-227: LGTM! Kwargs propagation ensures downstream parameters are forwarded.

The addition of **kwargs on line 226 ensures that generation-related parameters (e.g., tools, reasoning_effort) are correctly propagated to the model's generate_async call, addressing the PR objectives for tool usage during GenSelect.
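The forwarding pattern this comment describes can be sketched as follows. This is a hypothetical snippet, not the real NeMo-Skills `generate_async` signature; `model_call` and `fake_model` are illustrative stand-ins:

```python
import asyncio

# Hypothetical sketch: every extra keyword (e.g. tools, reasoning_effort)
# passed to the wrapper reaches the underlying model call unchanged.
async def generate_async(prompt, model_call, **kwargs):
    return await model_call(prompt=prompt, **kwargs)

# Stand-in for a real model endpoint; records which kwargs it received.
async def fake_model(prompt, **kwargs):
    return {"prompt": prompt, "received": sorted(kwargs)}

result = asyncio.run(generate_async("2+2=?", fake_model, tools=[], reasoning_effort="high"))
```

Dropping `**kwargs` from the inner call is exactly the kind of silent parameter loss this PR fixes: the wrapper would still run, but `tools` and `reasoning_effort` would never reach the model.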


352-362: LGTM! Ensures required generation key is present.

The addition of "generation": "" on line 355 ensures that the result dict always contains a generation key, which is required by downstream code in inference/generate.py (line 490: output[self.cfg.generation_key] = output.pop("generation")). This prevents potential KeyError exceptions in empty-result scenarios.


364-376: LGTM! Kwargs propagation in GenSelect path.

The addition of **kwargs on line 366 ensures that generation-related parameters are correctly propagated through the GenSelect path, aligning with the PR objectives to support tool usage during GenSelect generation.


387-394: LGTM! Ensures generation key is always present.

Lines 390-393 ensure that the generation key is always present in the result dict, even when solution_key is different. This is consistent with the empty-result branch (line 355) and prevents downstream errors in inference/generate.py.

nemo_skills/inference/generate.py (1)

370-383: LGTM! Tokenizer propagation to parallel thinking model.

Line 381 correctly passes tokenizer=self.tokenizer to get_parallel_thinking_model, ensuring that the parallel thinking path has access to the tokenizer context when needed (e.g., for completions API or tool usage). This aligns with the broader PR changes to support tokenizer-aware prompt generation.

nemo_skills/inference/model/__init__.py (1)

71-95: LGTM! Public API updated to accept tokenizer.

Lines 75 and 93-95 correctly update the get_parallel_thinking_model signature to accept an optional tokenizer parameter and forward it to the ParallelThinkingTask constructor. This aligns with the broader PR changes to support tokenizer-aware prompt generation in the parallel thinking path.
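The factory-level change can be sketched as an optional parameter threaded through to the wrapped task. Class and function names below are simplified stand-ins, not the real signatures:

```python
# Simplified stand-ins for ParallelThinkingTask / get_parallel_thinking_model:
# the factory accepts an optional tokenizer and forwards it to the task,
# so callers that have no tokenizer keep working unchanged.
class ParallelThinkingTaskSketch:
    def __init__(self, cfg, tokenizer=None):
        self.cfg = cfg
        self.tokenizer = tokenizer  # stored for later get_prompt() calls

def get_parallel_thinking_model_sketch(cfg, tokenizer=None):
    return ParallelThinkingTaskSketch(cfg, tokenizer=tokenizer)
```

Defaulting `tokenizer=None` keeps the public API backward compatible while letting `GenerationTask.setup_llm` pass its tokenizer through.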

Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
shtoshni merged commit f94fdbd into main on Oct 3, 2025
4 of 6 checks passed
shtoshni deleted the shtoshni/parallel_thinking_fixes branch on October 3, 2025 20:29
SeanNaren pushed a commit to SeanNaren/NeMo-Skills that referenced this pull request Oct 9, 2025
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
SeanNaren pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: SeanNaren <snarenthiran@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
