Conversation
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Walkthrough

Propagates parallel_thinking config fields into inference overrides. In ParallelThinkingTask, adds prompt token counting, changes the default endpoint_type to text, and introduces _get_multiple_solutions to gather and optionally filter solutions, computing token counts and totals. Ensures num_input_tokens is attached to outputs when enabled. No other control flow changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Caller
    participant Inference as Inference.generate
    participant PTFactory as get_parallel_thinking_model
    participant PT as ParallelThinkingTask
    note over Inference,PTFactory: New: propagate endpoint_type, mode, window_size,<br/>solution_key, filter_incomplete_solutions when mode != None
    Caller->>Inference: generate(prompt, config)
    Inference->>PTFactory: get_parallel_thinking_model(override_config)
    PTFactory-->>Inference: PT instance
    Inference->>PT: generate_async(prompt, ...)
    alt count_prompt_tokens == True
        PT->>PT: init HF tokenizer
    end
    note over PT: New: assemble solutions
    PT->>PT: _get_multiple_solutions(prompt, rng, filter_incomplete)
    alt solutions from cache
        PT-->>PT: load pre-generated solutions
    else generate on-the-fly
        PT->>PT: call underlying LLM for solutions
    end
    opt count_prompt_tokens
        PT->>PT: compute num_input_tokens for prompt
    end
    PT-->>Inference: results (solutions, totals, num_input_tokens?)
    Inference-->>Caller: final output
```
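The flow in the diagram can be sketched as a minimal async skeleton. The class and function names mirror the diagram, but every body here (the toy whitespace tokenizer, the fixed solution list, the plain config dict) is a placeholder assumption, not the repository's implementation:

```python
import asyncio
import random


class ParallelThinkingTask:
    """Toy stand-in for the task the diagram describes."""

    def __init__(self, config: dict):
        self.config = config

    async def _get_multiple_solutions(self, prompt: str, rng: random.Random) -> list:
        # Stand-in for loading cached solutions or calling the underlying LLM.
        return [f"{prompt} -> solution {i}" for i in range(self.config.get("window_size", 2))]

    async def generate_async(self, prompt: str) -> dict:
        rng = random.Random(0)
        solutions = await self._get_multiple_solutions(prompt, rng)
        result = {"solutions": solutions}
        if self.config.get("count_prompt_tokens"):
            # Whitespace split as a toy tokenizer; the real code loads an HF tokenizer.
            result["num_input_tokens"] = len(prompt.split())
        return result


def get_parallel_thinking_model(override_config: dict) -> ParallelThinkingTask:
    """Factory mirroring the diagram's get_parallel_thinking_model step."""
    return ParallelThinkingTask(override_config)


pt = get_parallel_thinking_model({"window_size": 2, "count_prompt_tokens": True})
out = asyncio.run(pt.generate_async("solve 2 + 2"))
```

The point of the sketch is the shape of the flow: the factory builds the task from override config, and token counting only happens when the flag is set.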
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
nemo_skills/inference/model/parallel_thinking.py (1)
106-109: LGTM! Tokenizer initialization is correct.

The tokenizer initialization logic correctly validates that the tokenizer can be loaded when prompt token counting is enabled. The error message is clear and helpful.

If you prefer to address the static analyzer hint (TRY003), you could define a custom exception class, though the current approach is acceptable:

```python
class TokenizerInitializationError(ValueError):
    """Raised when tokenizer cannot be initialized for prompt token counting."""


# Then use: raise TokenizerInitializationError()
```

Based on learnings from static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
nemo_skills/inference/generate.py (1 hunks)
nemo_skills/inference/model/parallel_thinking.py (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
nemo_skills/inference/generate.py (1)
nemo_skills/inference/chat_interface/core.py (1)
cfg(181-182)
nemo_skills/inference/model/parallel_thinking.py (2)
nemo_skills/prompt/utils.py (1)
get_token_count (310-369)
nemo_skills/inference/model/base.py (2)
EndpointType (38-41)
generate_async (213-315)
🪛 Ruff (0.13.3)
nemo_skills/inference/model/parallel_thinking.py
109-109: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: unit-tests
- GitHub Check: pre-commit
🔇 Additional comments (8)
nemo_skills/inference/generate.py (1)
400-405: LGTM! Configuration propagation is correct.

The additional fields propagated from the `parallel_thinking` config to `inference_override_config` align well with the parallel thinking functionality requirements. The approach ensures that parallel thinking-specific settings are properly passed to the underlying model.

nemo_skills/inference/model/parallel_thinking.py (7)
27-29: LGTM! Required imports for token counting.

The additional imports support the new prompt token counting feature.
62-63: LGTM! Clean feature addition.

The `count_prompt_tokens` field provides a clean way to enable token counting with a sensible default.
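As a hedged illustration of how such a flag-plus-tokenizer pairing typically works (the config class, helper name, and whitespace tokenizer below are assumptions; only the `count_prompt_tokens` field name comes from the review):

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TokenCountingConfig:
    """Illustrative config fragment, not the repository's actual config."""
    count_prompt_tokens: bool = False  # sensible default: feature off


def maybe_count_tokens(prompt: str,
                       cfg: TokenCountingConfig,
                       tokenize: Callable[[str], List[str]]) -> Optional[int]:
    """Return the prompt's token count only when the flag is enabled."""
    if not cfg.count_prompt_tokens:
        return None
    return len(tokenize(prompt))


# Whitespace split stands in for a real tokenizer's encode method.
n = maybe_count_tokens("a short prompt",
                       TokenCountingConfig(count_prompt_tokens=True),
                       str.split)
```

Keeping the flag off by default means existing callers pay no tokenizer-loading cost unless they opt in.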
201-239: LGTM! Well-structured multi-solution retrieval.

The `_get_multiple_solutions` method cleanly handles both offline (pre-loaded) and online (generate-on-the-fly) solution workflows. The filtering logic correctly identifies incomplete solutions by checking for unclosed thinking markers, and the token counting aggregation is accurate.
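A minimal sketch of that filter-and-aggregate shape, assuming each solution is a dict holding generated text and a token count (the key names and thinking markers below are assumptions, not the repository's schema):

```python
def filter_and_count(solutions, filter_incomplete=True,
                     open_marker="<think>", close_marker="</think>"):
    """Drop solutions whose thinking block was opened but never closed
    (a sign of truncated generation), then sum the remaining token counts."""
    kept = []
    for sol in solutions:
        text = sol["generation"]
        if filter_incomplete and open_marker in text and close_marker not in text:
            continue  # unclosed thinking marker -> incomplete solution
        kept.append(sol)
    total_generated = sum(s["num_generated_tokens"] for s in kept)
    return kept, total_generated


solutions = [
    {"generation": "<think>reason</think> answer: 4", "num_generated_tokens": 7},
    {"generation": "<think>cut off mid-thought", "num_generated_tokens": 5},
]
kept, total = filter_and_count(solutions)
```

The same loop works for both the cached and generate-on-the-fly paths, since both ultimately yield a list of solution dicts.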
267-284: LGTM! Token counting integration is correct.

The token counting logic properly measures input tokens for the parallel thinking prompt and integrates cleanly with the existing generation flow. The addition of `endpoint_type` to the duplicate keys list correctly prevents parameter conflicts since the endpoint type should be determined by the model's configuration.
370-384: LGTM! Appropriate edge case handling.

The empty solutions case is handled correctly with sensible defaults. Setting `num_input_tokens` to `None` is appropriate since no meaningful prompt was processed in this scenario.
408-409: LGTM! Correct token count propagation.

The `num_input_tokens` is correctly propagated from the parallel thinking result to the final output when prompt token counting is enabled.
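The propagation itself is a small conditional copy; a sketch under the assumption that both results are plain dicts keyed by `num_input_tokens` (the helper name is hypothetical):

```python
def attach_input_token_count(final_output: dict, pt_result: dict,
                             count_prompt_tokens: bool) -> dict:
    """Copy num_input_tokens from the parallel-thinking result into the
    final output only when prompt token counting is enabled."""
    if count_prompt_tokens and "num_input_tokens" in pt_result:
        final_output["num_input_tokens"] = pt_result["num_input_tokens"]
    return final_output


enabled = attach_input_token_count({"generation": "42"}, {"num_input_tokens": 17}, True)
disabled = attach_input_token_count({"generation": "42"}, {"num_input_tokens": 17}, False)
```

Gating on the flag keeps the output schema unchanged for callers that never enabled counting.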
57-57: Confirm endpoint_type default change

The default switched from `EndpointType.chat` to `EndpointType.text` in `ParallelThinkingConfig`; verify chat-based prompts still route correctly or revert if necessary.
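The behavioral risk can be seen in a stripped-down mock of the config. The enum values and the dataclass shape below are assumptions based only on the names the review quotes:

```python
from dataclasses import dataclass
from enum import Enum


class EndpointType(str, Enum):
    chat = "chat"
    text = "text"


@dataclass
class ParallelThinkingConfig:
    # The review notes this default moved from EndpointType.chat to
    # EndpointType.text; callers that relied on the implicit chat routing
    # must now opt in explicitly.
    endpoint_type: EndpointType = EndpointType.text


default_cfg = ParallelThinkingConfig()
chat_cfg = ParallelThinkingConfig(endpoint_type=EndpointType.chat)
```

Any call site that constructed the config without arguments silently changes behavior with a default flip like this, which is why the reviewer asks for routing verification.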
Signed-off-by: Shubham Toshniwal <stoshniwal@nvidia.com>
Signed-off-by: dgitman <dgitman@nvidia.com>
Summary by CodeRabbit
New Features
Changes