
Update for parallel thinking #929

Merged
shtoshni merged 6 commits into main from shtoshni/parallel_thinking_update
Oct 11, 2025
Conversation


@shtoshni shtoshni commented Oct 11, 2025

  • Endpoint fixes for parallel thinking
  • Count num tokens addition
  • Corner cases
  • Slight refactoring

Summary by CodeRabbit

  • New Features

    • Optional prompt token counting; input token totals included in results when enabled.
    • Multi-solution retrieval with optional filtering of incomplete solutions and aggregated token statistics.
  • Changes

    • Default endpoint type switched from “chat” to “text.”
    • Parallel thinking settings (mode, endpoint type, window size, solution key, filtering) now consistently applied during inference when the mode is set.

Shubham Toshniwal added 6 commits October 10, 2025 12:55
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>

coderabbitai bot commented Oct 11, 2025

Walkthrough

Propagates parallel_thinking config fields into inference overrides. In ParallelThinkingTask, adds prompt token counting, changes default endpoint_type to text, and introduces _get_multiple_solutions to gather and optionally filter solutions, computing token counts and totals. Ensures num_input_tokens is attached to outputs when enabled. No other control flow changes.

Changes

Cohort / File(s) Summary
Inference override propagation
nemo_skills/inference/generate.py
When parallel_thinking.mode is set, passes endpoint_type, mode, window_size, solution_key, and filter_incomplete_solutions into inference_override_config for get_parallel_thinking_model. No other flow changes.
ParallelThinking enhancements
nemo_skills/inference/model/parallel_thinking.py
Adds count_prompt_tokens flag and optional HF tokenizer to count input tokens; carries num_input_tokens through results. Changes default endpoint_type from chat to text. Adds async helper _get_multiple_solutions to fetch/generate, optionally filter incomplete solutions, and compute total_generated_tokens; integrates into generate_async pathway.
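The override propagation and new config fields summarized above can be pictured roughly as follows (field names are taken from this PR's summary; the dataclass and helper are simplified stand-ins for illustration, not the repository's actual code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParallelThinkingSettings:
    """Simplified stand-in for the parallel_thinking config section."""
    mode: Optional[str] = None
    endpoint_type: str = "text"  # default changed from "chat" in this PR
    window_size: int = 4
    solution_key: str = "solution"
    filter_incomplete_solutions: bool = True

def build_inference_overrides(pt: ParallelThinkingSettings) -> dict:
    """Propagate parallel-thinking fields into the inference override
    config only when a mode is actually set."""
    if pt.mode is None:
        return {}
    return {
        "mode": pt.mode,
        "endpoint_type": pt.endpoint_type,
        "window_size": pt.window_size,
        "solution_key": pt.solution_key,
        "filter_incomplete_solutions": pt.filter_incomplete_solutions,
    }
```

Guarding on `mode` keeps the overrides empty for runs that do not use parallel thinking, matching the "when the mode is set" condition described above.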

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant Inference as Inference.generate
  participant PTFactory as get_parallel_thinking_model
  participant PT as ParallelThinkingTask
  note over Inference,PTFactory: New: propagate endpoint_type, mode, window_size,<br/>solution_key, filter_incomplete_solutions when mode != None

  Caller->>Inference: generate(prompt, config)
  Inference->>PTFactory: get_parallel_thinking_model(override_config)
  PTFactory-->>Inference: PT instance
  Inference->>PT: generate_async(prompt, ...)
  alt count_prompt_tokens == True
    PT->>PT: init HF tokenizer
  end
  note over PT: New: assemble solutions
  PT->>PT: _get_multiple_solutions(prompt, rng, filter_incomplete)
  alt solutions from cache
    PT-->>PT: load pre-generated solutions
  else generate on-the-fly
    PT->>PT: call underlying LLM for solutions
  end
  opt count_prompt_tokens
    PT->>PT: compute num_input_tokens for prompt
  end
  PT-->>Inference: results (solutions, totals, num_input_tokens?)
  Inference-->>Caller: final output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

I nibble on prompts, count tokens with cheer,
Hop through solutions, both far and near.
Text lanes by default, my whiskers align,
Filtering the stray thoughts, keeping them fine.
With bundles of answers in clovery rows—
Thump! Another clean run, and off the rabbit goes. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check: ❓ Inconclusive. The title "Update for parallel thinking" is related to the changeset but is overly generic: it does not convey the main enhancements (token-counting support, the new multi-solution workflow, and configuration propagation). Consider a more specific title, for example "Add token counting and multi-solution support to parallel thinking workflow."
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
nemo_skills/inference/model/parallel_thinking.py (1)

106-109: LGTM! Tokenizer initialization is correct.

The tokenizer initialization logic correctly validates that the tokenizer can be loaded when prompt token counting is enabled. The error message is clear and helpful.

If you prefer to address the static analyzer hint (TRY003), you could move the message into a custom exception class, though the current approach is acceptable:

class TokenizerInitializationError(ValueError):
    """Raised when the tokenizer cannot be initialized for prompt token counting."""

    def __init__(self):
        super().__init__("Failed to load tokenizer; a tokenizer is required when prompt token counting is enabled.")

# Then use:
raise TokenizerInitializationError()

Based on learnings from static analysis hints.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 85fe307 and 9e1bf9c.

📒 Files selected for processing (2)
  • nemo_skills/inference/generate.py (1 hunks)
  • nemo_skills/inference/model/parallel_thinking.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/chat_interface/core.py (1)
  • cfg (181-182)
nemo_skills/inference/model/parallel_thinking.py (2)
nemo_skills/prompt/utils.py (1)
  • get_token_count (310-369)
nemo_skills/inference/model/base.py (2)
  • EndpointType (38-41)
  • generate_async (213-315)
🪛 Ruff (0.13.3)
nemo_skills/inference/model/parallel_thinking.py

109-109: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit-tests
  • GitHub Check: pre-commit
🔇 Additional comments (8)
nemo_skills/inference/generate.py (1)

400-405: LGTM! Configuration propagation is correct.

The additional fields propagated from parallel_thinking config to inference_override_config align well with the parallel thinking functionality requirements. The approach ensures that parallel thinking-specific settings are properly passed to the underlying model.

nemo_skills/inference/model/parallel_thinking.py (7)

27-29: LGTM! Required imports for token counting.

The additional imports support the new prompt token counting feature.


62-63: LGTM! Clean feature addition.

The count_prompt_tokens field provides a clean way to enable token counting with a sensible default.


201-239: LGTM! Well-structured multi-solution retrieval.

The _get_multiple_solutions method cleanly handles both offline (pre-loaded) and online (generate-on-the-fly) solution workflows. The filtering logic correctly identifies incomplete solutions by checking for unclosed thinking markers, and the token counting aggregation is accurate.
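As a rough illustration of the filtering and aggregation described here (the "</think>" closing marker and both helper names are assumptions for the sketch, not the PR's exact code):

```python
def filter_complete_solutions(solutions, end_marker="</think>"):
    """Drop solutions whose thinking block was never closed, i.e. the
    generation was likely cut off mid-thought."""
    return [s for s in solutions if end_marker in s["generation"]]

def total_generated_tokens(solutions):
    """Aggregate per-solution token counts into a single total."""
    return sum(s.get("num_generated_tokens", 0) for s in solutions)
```

The same two helpers work for both the offline (pre-loaded) and online (generate-on-the-fly) paths, since each produces a list of solution dicts.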


267-284: LGTM! Token counting integration is correct.

The token counting logic properly measures input tokens for the parallel thinking prompt and integrates cleanly with the existing generation flow. The addition of endpoint_type to the duplicate keys list correctly prevents parameter conflicts since the endpoint type should be determined by the model's configuration.
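A minimal sketch of the input-token measurement (the whitespace tokenizer below is only a stand-in for a real HF tokenizer, and the helper name is hypothetical):

```python
class WhitespaceTokenizer:
    """Stand-in for a HuggingFace tokenizer; only encode() is used here."""
    def encode(self, text, add_special_tokens=False):
        return text.split()

def count_prompt_tokens(tokenizer, prompt: str) -> int:
    """Number of input tokens the model will see for this prompt."""
    return len(tokenizer.encode(prompt, add_special_tokens=False))
```

With a real tokenizer loaded via transformers' AutoTokenizer, the same call pattern would measure the assembled parallel-thinking prompt before generation.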


370-384: LGTM! Appropriate edge case handling.

The empty solutions case is handled correctly with sensible defaults. Setting num_input_tokens to None is appropriate since no meaningful prompt was processed in this scenario.


408-409: LGTM! Correct token count propagation.

The num_input_tokens is correctly propagated from the parallel thinking result to the final output when prompt token counting is enabled.


57-57: Confirm endpoint_type default change
Default switched from EndpointType.chat to EndpointType.text in ParallelThinkingConfig; verify chat-based prompts still route correctly or revert if necessary.

@shtoshni shtoshni merged commit 55943a2 into main Oct 11, 2025
7 checks passed
@shtoshni shtoshni deleted the shtoshni/parallel_thinking_update branch October 11, 2025 04:23
dgtm777 pushed a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
dgtm777 pushed a commit that referenced this pull request Mar 18, 2026
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
